Data Analytics for Accounting (DAFA) Main Reference
Data Analytics for Accounting
Vernon J. Richardson
University of Arkansas,
Xi’an Jiaotong Liverpool University
Ryan A. Teeter
University of Pittsburgh
Katie L. Terrell
University of Arkansas
DATA ANALYTICS FOR ACCOUNTING
Published by McGraw-Hill Education, 2 Penn Plaza, New York, NY 10121. Copyright © 2019 by McGraw-
Hill Education. All rights reserved. Printed in the United States of America. No part of this publication may be
reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the
prior written consent of McGraw-Hill Education, including, but not limited to, in any network or other
electronic storage or transmission, or broadcast for distance learning.
Some ancillaries, including electronic and print components, may not be available to customers outside the
United States.
This book is printed on acid-free paper.
1 2 3 4 5 6 7 8 9 LWI/LWI 21 20 19 18
ISBN 978-1-260-37519-0
MHID 1-260-37519-6
Portfolio Manager: Steve Schuetz
Product Developer: Alexandra Kukla
Marketing Manager: Michelle Williams
Content Project Managers: Fran Simon/Angela Norris
Buyer: Sue Culbertson
Design: Egzon Shaquiri
Content Licensing Specialist: Shawntel Schmitt
Cover Image: © SUNSHADOW/Shutterstock
Compositor: SPi Global
All credits appearing on page or at the end of the book are considered to be an extension of the copyright page.
The Internet addresses listed in the text were accurate at the time of publication. The inclusion of a website does
not indicate an endorsement by the authors or McGraw-Hill Education, and McGraw-Hill Education does not
guarantee the accuracy of the information presented at these sites.
mheducation.com/highered
Dedications iii
Preface iv
About the Authors vi
Acknowledgments vii
Key Features viii
Main Text Features

Chapter Maps
These maps provide a guide of what we're going to cover in the chapter as well as a guide of what we've just learned and what's coming next.

Chapter-Opening Vignettes
Because companies are facing new and exciting opportunities with their use of Data Analytics to help with accounting and business decisions, we detail what they're doing and why in our chapter-opening vignettes.

Sample chapter opener (Chapter 2, Data Preparation and Cleaning):

A Look at This Chapter
This chapter provides an overview of the types of data that are used in the accounting cycle and common data that are stored in a relational database. The chapter addresses mastering the data, the second step of the IMPACT cycle. We will describe how data are requested and extracted to answer business questions and how to transform data for use via data preparation, validation, and cleaning. We conclude with an explanation of how to load data into the appropriate tool in preparation for analyzing data to make decisions.

A Look Back
Chapter 1 defined Data Analytics and explained that the value of Data Analytics is in the insights it provides. We described the Data Analytics Process using the IMPACT cycle model and explained how this process is used to address both business and accounting questions. We specifically emphasized the importance of identifying appropriate questions that data analytics might be able to address.

A Look Ahead
Chapter 3 describes how to go from defining business problems to analyzing data, answering questions, and addressing business problems. We make the case for three data approaches we argue are most relevant to accountants and provide examples of each.

We are lucky to live in a world in which data are abundant. However, even with rich sources of data, when it comes to being able to analyze data and turn them into useful information and insights, very rarely can an analyst hop right into a dataset and begin analyzing. Datasets almost always need to be cleaned and validated before they can be used. Not knowing how to clean and validate data can, at best, lead to frustration and poor insights and, at worst, lead to horrible security violations. While this text takes advantage of open source datasets, these datasets have all been scrubbed not only for accuracy, but also to protect the security and privacy of any individual or company whose details were in the original dataset.

In 2015, a pair of researchers named Emil Kirkegaard and Julius Daugbejerg Bjerrekaer scraped data from OkCupid, a free dating website, and provided the data to the Open Science Framework, a platform researchers use to obtain and share raw data. While the aim of the Open Science Framework is to increase transparency, the researchers in this instance took that a step too far—and a step into illegal territory. Kirkegaard and Bjerrekaer did not obtain permission from OkCupid or from the 70,000 OkCupid users whose identities, ages, genders, religions, personality traits, and other personal details maintained by the dating site were provided to the public without any work being done to anonymize or sanitize the data. If the researchers had taken the time not just to validate that the data were complete but also to sanitize them to protect the individuals' identities, this would not have been a threat or a news story. On May 13, 2015, the Open Science Framework removed the OkCupid data from the platform, but the damage of the privacy breach had already been done.
Problems
Challenge the student's ability to see relationships in the learning objectives by employing higher-level thinking and analytical skills. (The sample shown here is from the Chapter 2 problem set on the College Scorecard data: each question asks which attributes from the College Scorecard data dictionary included in Appendix A, with the raw data in CollegeScorecard_RawData.txt, would be used to compare cost of attendance, test scores, and diversity across types of institutions.)
Company summary
Sláinte is a fictional brewery that has recently gone through big changes. The brewery sells a number of different products and has only recently expanded its distribution from one state to nine states, and now its business has begun stabilizing after the expansion.

Connect for Data Analytics for Accounting
Labs: Select labs are assignable in Connect but will require students to work outside of
Connect to complete the lab. Once completed, students go back into Connect to answer
questions designed to ensure they completed the lab and understood the key skills and
outcomes from their lab work.
Comprehensive Cases: Select comprehensive labs/cases are assignable in Connect but will
require students to work outside of Connect to complete the lab using the Dillard’s real-world
Big Data set. Once students complete the comprehensive lab, they will go back into
Connect to answer questions designed to ensure they completed the lab and understood
the key skills and outcomes from their lab work.
McGraw-Hill Connect® is a highly reliable, easy-to-use homework and learning management solution that utilizes learning science and award-winning adaptive tools to improve student results.

Robust Analytics and Reporting
More students earn As and Bs when they use Connect.
Trusted Service and Support
▪ Connect integrates with your LMS to provide single sign-on and automatic syncing
of grades. Integration with Blackboard®, D2L®, and Canvas also provides automatic
syncing of the course calendar and assignment-level linking.
▪ Connect offers comprehensive service, support, and training throughout every
phase of your implementation.
▪ If you’re looking for some guidance on how to use Connect, or want to learn
tips and tricks from super users, you can find tutorials as you work. Our
Digital Faculty Consultants and Student Ambassadors offer insight into how to
achieve the results you want with Connect.
www.mheducation.com/connect
Brief Table of Contents
Preface iv
Chapter 1 Data Analytics in Accounting and Business 2
Chapter 2 Data Preparation and Cleaning 38
Chapter 3 Modeling and Evaluation: Going from Defining Business Problems and
Data Understanding to Analyzing Data and Answering Questions 92
Chapter 4 Visualization: Using Visualizations and Summaries to Share Results
with Stakeholders 138
Chapter 5 The Modern Audit and Continuous Auditing 190
Chapter 6 Audit Data Analytics 208
Chapter 7 Generating Key Performance Indicators 250
Chapter 8 Financial Statement Analytics 300
GLOSSARY 326
INDEX 330
Detailed TOC

Chapter 1
Data Analytics in Accounting and Business 2
A Look at This Chapter 2
A Look Ahead 2
Data Analytics 4
How Data Analytics Affects Business 4
How Data Analytics Affects Accounting 5
Auditing 5
Financial Reporting 6
Taxes 7
The Data Analytics Process Using the IMPACT Cycle 8
Step 1: Identify the Question (chapter 1) 8
Step 2: Master the Data (chapter 2) 8
Step 3: Perform Test Plan (chapter 3) 9
Step 4: Address and Refine Results (chapter 4) 11
Steps 5 and 6: Communicate Insights and Track Outcomes (chapter 4 and each chapter thereafter) 11
Back to Step 1 12
Data Analytic Skills Needed by Analytic-Minded Accountants 12
Hands-On Example of the IMPACT Model 13
Identify the Question 13
Master the Data 13
Perform Test Plan 15
Address and Refine Results 17
Communicate Insights 19
Track Outcomes 19
Summary 20
Key Words 20
Answers to Progress Checks 21
Multiple Choice Questions 23
Discussion Questions 24
Problems 24
Answers to Multiple Choice Questions 26
Lab 1-0 How to Complete Labs in This Text 27
Lab 1-1 Data Analytics in Financial Accounting 28
Lab 1-2 Data Analytics in Managerial Accounting 31
Lab 1-3 Data Analytics in Auditing 33
Lab 1-4 Comprehensive Case: Dillard's Store Data 34

Chapter 2
Data Preparation and Cleaning 38
A Look at This Chapter 38
A Look Back 38
A Look Ahead 38
How Data Are Used and Stored in the Accounting Cycle 40
Data and Relationships in a Relational Database 41
Columns in a Table: Primary Keys, Foreign Keys, and Descriptive Attributes 41
Data Dictionaries 43
Extraction, Transformation, and Loading (ETL) of Data 44
Extraction 44
Step 1: Determine the Purpose and Scope of the Data Request 45
Step 2: Obtain the Data 45
Transformation 48
Step 3: Validating the Data for Completeness and Integrity 48
Step 4: Cleaning the Data 49
Loading 50
Step 5: Loading the Data for Data Analysis 50
Summary 50
Key Words 51
Answers to Progress Checks 51
Multiple Choice Questions 52
Discussion Questions 53
Problems 54
Appendix A: College Scorecard Dataset 55
Answers to Multiple Choice Questions 55
Lab 2-1 Create a Request for Data Extraction 57
Lab 2-2 Use PivotTables to Denormalize and Analyze the Data 59
Lab 2-3 Resolve Common Data Problems in Excel and Access 67
Lab 2-4 Generate Summary Statistics in Excel 71
Lab 2-5 College Scorecard Extraction and Data Preparation 73
Lab 2-6 Comprehensive Case: Dillard's Store Data: How to Create an Entity-Relationship Diagram 74
Lab 2-7 Comprehensive Case: Dillard's Store Data: How to Preview Data from Tables in a Query 77
Lab 2-8 Comprehensive Case: Dillard's Store Data: Connecting Excel to a SQL Database 80
Lab 2-9 Comprehensive Case: Dillard's Store Data: Joining Tables 89

Chapter 3
Modeling and Evaluation: Going from Defining Business Problems and Data Understanding to Analyzing Data and Answering Questions 92
A Look at This Chapter 92
A Look Back 92
A Look Ahead 92
Performing the Test Plan: Defining Data Analytics Approaches 94
Profiling 98
Example of Profiling in Management Accounting 99
Example of Profiling in an Internal Audit 99
Example of Profiling in Auditing and Continuous Auditing 100
Data Reduction 101
Example of Data Reduction in Internal and External Auditing 101
Examples of Data Reduction in Other Accounting Areas 102
Regression 102
Examples of the Regression Approach in Managerial Accounting 103
Examples of the Regression Approach in Auditing 103
Other Examples of the Regression and Classification Approach in Accounting 104
Classification 104
Classification Terminology 104
Evaluating Classifiers 106
Clustering 107
Example of the Clustering Approach in Auditing 108
Summary 109
Key Words 110
Answers to Progress Checks 111
Multiple Choice Questions 111
Discussion Questions 113
Problems 113
Answers to Multiple Choice Questions 114
Appendix: Setting Up a Classification Analysis 114
Lab 3-1 Data Reduction 116
Lab 3-2 Regression in Excel 120
Lab 3-3 Classification 122
Lab 3-4 Comprehensive Case: Dillard's Store Data: Data Abstract (SQL) and Regression (Part I) 125
Lab 3-5 Comprehensive Case: Dillard's Store Data: Data Abstract (SQL) and Regression (Part II) 134

Chapter 4
Visualization: Using Visualizations and Summaries to Share Results with Stakeholders 138
A Look at This Chapter 138
A Look Back 138
A Look Ahead 138
Determine the Purpose of Your Data Visualization 140
Quadrants 1 and 3 versus Quadrants 2 and 4: Qualitative versus Quantitative 141
Quadrants 1 and 2 versus Quadrants 3 and 4: Declarative versus Exploratory 143
Choosing the Right Chart 144
Charts Appropriate for Qualitative Data 144
Charts Appropriate for Quantitative Data 146
Tools to Help When Picking a Visual 148
Learning to Create a Good Chart by (Bad) Example 150
Further Refining Your Chart to Communicate Better 155
Data Scale and Increments 156
Color 156
Communication: More Than Visuals—Using Words to Provide Insights 157
Content and Organization 157
Audience and Tone 158
Revising 158
Summary 159
Key Words 159
Answers to Progress Checks 160
Multiple Choice Questions 161
Discussion Questions 162
Problems 163
Answers to Multiple Choice Questions 163
Lab 4-1 Use PivotCharts to Visualize Declarative Data 164
Lab 4-2 Use Tableau to Perform Exploratory Analysis and Create Dashboards 166
Lab 4-3 Comprehensive Case: Dillard's Store Data: Create Geographic Data Visualizations in Tableau 175
Lab 4-4 Comprehensive Case: Dillard's Store Data: Visualizing Regression in Tableau 186

Lab 7-3 Comprehensive Case: Dillard's Store Data: Creating KPIs in Excel (Part I) 273
Lab 7-4 Comprehensive Case: Dillard's Store Data: Creating KPIs in Excel (Part II) 279
Lab 7-5 Comprehensive Case: Dillard's Store Data: Creating KPIs in Excel (Part III) 287
Lab 7-6 Comprehensive Case: Dillard's Store Data: Creating KPIs in Excel (Part IV—Putting It All Together) 295

Chapter 8
Financial Statement Analytics 300
A Look at This Chapter 300
A Look Back 300
XBRL 302
Extensible Reporting in XBRL and Standardized Metrics 303
XBRL, XBRL-GL, and Real-Time Financial Reporting 303
Ratio Analysis 305
Classes of Ratios 305
DuPont Ratio Analysis 306
The Use of Sparklines and Trendlines in Ratio Analysis 306
Text Mining and Sentiment Analysis 307
Summary 309
Key Words 309
Answers to Progress Checks 310
Multiple Choice Questions 310
Discussion Questions 312
Problems 312
Answers to Multiple Choice Questions 313
Lab 8-1 Use XBRLAnalyst to Access XBRL Data 314
Lab 8-2 Use XBRLAnalyst to Create Dynamic Common-Size Financial Statements 317
Lab 8-3 Use XBRL to Access and Analyze Financial Statement Ratios—The Use of DuPont Ratios 320
Lab 8-4 Use SQL to Query an XBRL Database 323

GLOSSARY 326
INDEX 330
Data Analytics for Accounting

Chapter 1
Data Analytics in Accounting and Business
A Look Ahead
Chapter 2 provides a description of how data are prepared and scrubbed to be ready for analysis to answer
business questions. We explain how to extract, transform, and load data and then how to validate and normalize
the data. In addition, we explain how data standards are used to facilitate the exchange of data between senders
and receivers.
The Chinese e-commerce company Alibaba is perhaps the biggest online commerce company in the world. Using its three main websites, Taobao, Tmall, and Alibaba.com, it hosts millions of businesses and hundreds of millions of users with $248 billion in sales last year (more than eBay and Amazon combined!). With so many transactions and so many users, Alibaba has worked to capture fraud signals directly from its extensive database of user behaviors and its network, and then analyzes them in real time using machine learning to accurately sort the bad users from the good ones. Alibaba has developed five stages of fraud detection for each user: (1) account check, (2) device check, (3) activity check, (4) risk strategy, and (5) manual review. These stages all combine to develop a risk score for each user. This fraud risk prevention score is so valuable to Alibaba and others that Alibaba shares and sells it to external customers, developing a risk score for each current and potential customer. What will Data Analytics do next?
Sources: J. Chen, Y. Tao, H. Wang, and T. Chen, “Big Data Based Fraud Risk Management at Alibaba,” The Journal of
Finance and Data Science 1, no. 1 (2015), pp. 1–10; and K. Pal, “How to Combat Financial Fraud by Using Big Data,”
2016, http://www.kdnuggets.com/2016/03/combat-financial-fraud-using-big-data.html.
OBJECTIVES
After reading this chapter, you should be able to:
LO 1-1 Define Data Analytics
LO 1-2 Understand why Data Analytics matters to business
LO 1-3 Explain why Data Analytics matters to accountants
LO 1-5 Describe the Data Analytics Process using the IMPACT cycle
PROGRESS CHECK
1. How does having more data around us translate into value for a company?
2. Banks know a lot about us, but they have traditionally used externally
generated credit scores to assess creditworthiness when deciding whether to
extend a loan. How would you suggest a bank use Data Analytics to get a
more complete view of its customers’ creditworthiness? Assume the bank has
access to a customer's loan history, credit card transactions, deposit history,
and direct deposit registration. How could it assess whether a loan might be
repaid?
LO 1-2 Understand why Data Analytics matters to business

HOW DATA ANALYTICS AFFECTS BUSINESS
There is little question that the impact of data analytics on business is overwhelming. In fact, in PwC's 18th Annual Global CEO Survey, 86 percent of chief executive officers (CEOs) say they find it important to champion digital technologies and emphasize a clear vision of using technology for a competitive advantage, while 85 percent say they put a high value on Data Analytics. According to the same survey, 80 percent of CEOs place data mining and analysis as the second-most important strategic technology. In fact, per PwC's 6th Annual Digital IQ survey of more than 1,400 leaders from digital businesses, the area of investment that tops CEOs' list of priorities is business analytics.4
1 http://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#2a3289006c1d (accessed November 10, 2016).
2 Roger S. Debreceny and Glen L. Gray, "IT Governance and Process Maturity: A Multinational Field Study," Journal of Information Systems 27, no. 1 (Spring 2013), pp. 157–188.
3 H. Chen, R. H. L. Chiang, and V. C. Storey, "Business Intelligence Research," MIS Quarterly 34, no. 1
A recent study from McKinsey Global Institute estimates that Data Analytics could generate up to $3 trillion in value per year in just a subset of the total possible industries affected.5 Data Analytics could very much transform the manner in which companies run their businesses in the near future. The real value of data comes from Data Analytics. With a wealth of data on their hands, companies use Data Analytics to discover the various buying patterns of their customers, investigate anomalies that were not predicted, forecast future possibilities, and so on. For example, with insight provided through Data Analytics, companies could do more directed marketing campaigns based on patterns observed in their data, giving them a competitive advantage over companies that do not use this information to improve their marketing strategies. Patterns discovered from past archives enable businesses to identify opportunities and risks and better plan for the future. In addition to producing more value externally, studies show that Data Analytics affects internal processes, improving productivity, utilization, and growth.6
PROGRESS CHECK
3. Let’s assume a brand manager at Samsung identifies that an older
demographic might be concerned with the use of an iPhone and the radiation
impact it might have on the brain. How might Samsung use Data Analytics to
assess if this is a problem?
4. How might Data Analytics assess the higher cost of paying employees to work overtime? Consider how Data Analytics might be helpful in reducing a company's overtime direct labor costs in a manufacturing setting.
LO 1-3 Explain why Data Analytics matters to accountants

HOW DATA ANALYTICS AFFECTS ACCOUNTING
Data Analytics is expected to have dramatic effects on auditing, financial reporting, and tax and managerial accounting. We detail how we think this might happen in each of the following sections.
Auditing
Data Analytics plays an increasingly critical role in the future of audit. In a recent Forbes
Insights/KPMG report, “Audit 2020: A Focus on Change,” the vast majority of survey
respondents believe both that:
1. Audit must better embrace technology.
2. Technology will enhance the quality, transparency, and accuracy of the audit.
Indeed, "As the business landscape for most organizations becomes increasingly complex and fast-paced, there is a movement toward leveraging advanced business analytic techniques to refine the focus on risk and derive deeper insights into an organization."7
4 "Data Driven: What Students Need to Succeed in a Rapidly Changing Business World," PwC, http://www.pwc.com/us/en/faculty-resource/assets/PwC-Data-driven-paper-Feb2015.pdf, February 2015 (accessed January 9, 2016).
5 "Open Data: Unlocking Innovation and Performance with Liquid Information," McKinsey Global Institute, http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information, October 2013 (accessed September 7, 2015).
Many auditors believe that auditor data analytics will, in fact, lead to deeper insights that will enhance audit quality. This sentiment of the impact of Data Analytics on the audit has been growing for several years now and has given many public accounting firms incentives to invest in technology and personnel to capture, organize, and analyze financial statement data to provide enhanced audits, expanded services, and added value to their clients. As a result, Data Analytics is expected to be the next innovation in the evolution of the audit and professional accounting industry.

Given the fact that operational data abound and are easier to collect and manage, combined with CEOs' desires to utilize these data, the accounting firms may now approach their engagements with a different mindset. No longer will they be simply checking for errors, material misstatements, fraud, and risk in financial statements or merely be reporting their findings at the end of the engagement. Now, audit professionals will be collecting and analyzing the company's data similar to the way a business analyst would to help management make better business decisions. This means that, in many cases, external auditors will stay engaged with clients beyond the audit. This is a significant paradigm shift. The audit process will be changed from a traditional process toward a more automated one, which will allow audit professionals to focus more on the logic and rationale behind data queries and less on the gathering of the actual data.8 As a result, audits will not only yield important findings from a financial perspective, but also information that can help companies refine processes, improve efficiency, and anticipate future problems.
“It’s a massive leap to go from traditional audit approaches to one that fully
integrates big data and analytics in a seamless manner.”9
Data Analytics also expands auditors' capabilities in services like testing for fraudulent transactions and automating compliance-monitoring activities (like filing financial reports to the SEC or to the IRS). This is possible because Data Analytics enables auditors to analyze the complete dataset, rather than the sampling of the financial data done in a traditional audit. Data Analytics enables auditors to improve their risk assessment in both substantive and detailed testing.
Financial Reporting
Data Analytics also potentially has an impact on financial reporting. With the use of so many estimates and valuations in financial accounting, some believe that employing Data Analytics may substantially improve the quality of the estimates and valuations. Data from within an enterprise system and external to the company and system might be used to address many of the questions that face financial reporting. Many financial statement accounts are just estimates, and so accountants often ask themselves questions like these to evaluate those estimates:
1. How much of the accounts receivable balance will ultimately be collected? What should the allowance for loan losses look like?
2. Is any of our inventory obsolete? Should our inventory be valued at market or cost (applying the lower-of-cost-or-market rule)? When will it be out of date? Do we need to offer a discount on it now to get it sold?
7 Deloitte, "Adding Insight to Audit: Transforming Internal Audit through Data Analytics," http://www2.deloitte.com/content/dam/Deloitte/ca/Documents/audit/ca-en-audit-adding-insight-to-audit.pdf (accessed January 10, 2016).
8 PwC, "Data Driven: What Students Need to Succeed in a Rapidly Changing Business World," http://www.pwc.com/us/en/faculty-resource/assets/PwC-Data-driven-paper-Feb2015.pdf, posted February 2015 (accessed January 9, 2016).
9 EY, "How Big Data and Analytics Are Transforming the Audit," https://eyo-iis-pd.ey.com/ARC/documents/EY-reporting-issue-9.pdf, posted April 2015 (accessed January 27, 2016).
3. Has our goodwill been impaired due to the reduction in profitability from a recent merger? Will it regain value in the near future?
4. How should we value contingent liabilities like warranty claims or litigation? Do we have the right amount?
Data Analytics may also allow an accountant or auditor to assess the probability of a goodwill write-down, warranty claims, or the collectability of bad debts based on what customers, investors, and other stakeholders are saying about the company in blogs and in social media (like Facebook and Twitter). This information might help the firm determine both its optimal response to the situation and the appropriate adjustment to its financial reporting.
It may be possible to use Data Analytics to scan the environment—that is, scanning Google searches and social media (such as Instagram and Facebook) to identify potential risks and opportunities to the firm. In a business intelligence sense, this may allow a firm to monitor its competitors and its customers to better understand opportunities and threats around it. For example, are its competitors, customers, or suppliers facing financial difficulty that might affect the company's interactions with them and open up opportunities that it otherwise wouldn't have considered?
Taxes
Traditionally, tax work dealt with compliance issues based on data from transactions that have already taken place. Now, however, tax executives are charged with sophisticated tax planning capabilities that assist the company to minimize its taxes and do it in such a way as to either avoid or prepare for a potential audit. Arguably, one of the things that Data Analytics does best is predictive analytics—predicting the future! This shift in focus makes tax data analytics valuable for its ability to help tax staffs to predict what will happen rather than reacting to what just did happen. An example of how tax data analytics might be used is the capability to predict the potential tax consequences of a potential international transaction, R&D investment, or proposed merger or acquisition.

One of the issues of performing predictive Data Analytics is the efficient organization and use of data stored across multiple systems on varying platforms that were not originally designed for the tax department. Organizing tax data into a data warehouse to be able to consistently model and query the data is an important step toward developing the capability to perform tax data analytics. This issue is exemplified by the 29 percent of tax departments that find the biggest challenge in executing an analytics strategy is integration with the IT department and the available technology tools.10
PROGRESS CHECK
5. How could the use of internal audit data analytics find the pattern that one
accountant enters the majority of the journal entries each quarter? Why might
this be an issue that would need addressing?
6. How specifically will Data Analytics change the way a tax staff does its taxes?
10 Deloitte, "The Power of Tax Data Analytics," http://www2.deloitte.com/us/en/pages/tax/articles/top-ten-things-about-tax-data-analytics.html (accessed October 12, 2016).
EXHIBIT 1-1 The IMPACT Cycle
Source: J. P. Isson and J. S. Harriott, Win with Advanced Business Analytics: Creating Business Value from Your Data (Hoboken, NJ: Wiley, 2013).
In addition, to give us some idea of the data questions, we may want to consider the following:
• Review data availability in a firm's internal systems (including those in the financial reporting system or ERP systems that might occur in its accounting cycles—financial, procure-to-pay, production, order-to-cash, human resources).
• Review data availability in a firm's external network, including those that might already be housed in an existing data warehouse.
• Data dictionaries and other contextual data—to provide details about the data.
• Extraction, transformation, and loading.
• Data validation and completeness—to provide a sense of the reliability of the data (a brief sketch follows this list).
• Data normalization—to reduce data redundancy and improve data integrity.
• Data preparation and scrubbing—Data Analytics professionals estimate that they spend between 50 and 90 percent of their time cleaning data so the data can be analyzed.11
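To make the data validation and completeness item concrete, here is a minimal sketch of what such a check might look like. It is not part of the text's Excel-based workflow; the file name, column names, and control totals below are hypothetical.

```python
# Minimal sketch of a completeness/validation check on an extracted file.
# The file name, column names, and control totals are hypothetical.
import pandas as pd

extract = pd.read_csv("sales_extract.csv")

expected_rows = 120_000          # record count reported by the source system
expected_total = 4_563_210.55    # control total for the Amount column

checks = {
    "row count matches": len(extract) == expected_rows,
    "amount total matches": abs(extract["Amount"].sum() - expected_total) < 0.01,
    "no missing transaction IDs": extract["TransactionID"].notna().all(),
    "no duplicate transaction IDs": extract["TransactionID"].is_unique,
}
print(checks)
```

The point is simply to compare the extract against what the data owner says should be there (record counts, control totals, key fields) before any analysis begins.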
11 "One-Third of BI Pros Spend Up to 90% of Time Cleaning Data," http://www.eweek.com/database/one-third-of-bi-pros-spend-up-to-90-of-time-cleaning-data.html, posted June 2015 (accessed March 15, 2016).
12 Foster Provost and Tom Fawcett, Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking (Sebastopol, CA: O'Reilly Media, Inc.), 2013.
EXHIBIT 1-2 Example of Co-occurrence Grouping on Amazon.com (©Amazon Inc.)
EXHIBIT 1-3 Example of Link Prediction on Facebook (©Facebook Inc.)
• Data reduction—A data approach that attempts to reduce the amount of information that needs to be considered to focus on the most critical items (i.e., highest cost, highest risk, largest impact, etc.). It does this by taking a large set of data (perhaps the population) and reducing it with a smaller set that has the vast majority of the critical information of the larger set. An example might include the potential to use these techniques in auditing. While auditing has employed various random and stratified sampling over the years, Data Analytics suggests new ways to highlight which transactions do not need the same level of vetting as other transactions.
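As a brief illustration of this idea (a sketch only, not an approach prescribed by the text), the following filters a hypothetical journal-entry population down to the items that meet one or more risk criteria; the file and column names are made up for illustration.

```python
# Minimal data reduction sketch: keep only the transactions that warrant
# detailed vetting instead of sampling the whole population.
# File and column names ("Amount", "VendorIsNew", "PostedOnWeekend") are hypothetical.
import pandas as pd

transactions = pd.read_csv("journal_entries.csv")

high_risk = transactions[
    (transactions["Amount"] >= 100_000)     # large items
    | transactions["VendorIsNew"]           # new, unvetted vendors
    | transactions["PostedOnWeekend"]       # unusual posting times
]

print(f"Population: {len(transactions):,} items; "
      f"reduced set for detailed testing: {len(high_risk):,} items")
```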
Back to Step 1
Of course, the IMPACT cycle is iterative, so once insights are gained and outcomes are
tracked, new questions emerge and the IMPACT cycle begins anew.
PROGRESS CHECK
7. Let’s say we are trying to predict how much money college students spend on
fast food each week. What would be the response, or dependent, variable?
What would be examples of independent variables?
8. How might a data reduction approach be used in auditing to spend time and
effort on the most important items?
LendingClub total loans issued by year, 2010 through mid-2016: $20,687,488,911 in loans issued as of 06/30/16.
Source: https://www.lendingclub.com/ (accessed September 29, 2016).
Borrowers borrow money for a variety of reasons, including refinancing other debt and
paying off credit cards, as well as borrowing for other purposes (Exhibit 1-5).
LendingClub actually provides datasets: data on the loans they approved and funded
as well as data for the loans that were declined. In this chapter, we will emphasize the
rejected loans and the reasons they were rejected.
The datasets and the data dictionary are available at https://www.lendingclub.com/info/download-data.action.
As we learn about the data, it is important to know what is available to us. To that end, there is a data dictionary that provides descriptions for all of the data attributes of the dataset. A cut-out of the data dictionary for the rejected stats file (i.e., the statistics about those loans rejected) is included in the data files as shown in Exhibit 1-6.
EXHIBIT 1-6 2007–2012 LendingClub Data Dictionary for Declined Loan Data
Source: Available at https://www.lendingclub.com/info/download-data.action (accessed October 13, 2016).
We could also take a look at the data files available for the funded loans. However, for our analysis in the rest of the chapter, we use the Excel file "RejectStatsA Ready," which has rejected loan statistics from 2007 to 2012. It is a cleaned-up file ready for analysis. We'll learn more about data scrubbing in chapter 2.
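Although the chapter's walkthrough uses Excel and PivotTables, the same loading and basic scrubbing can be sketched in Python. The column names and the percent-formatted debt-to-income field below are assumptions about the downloaded rejected-loan file.

```python
# Minimal sketch: load a LendingClub rejected-loan file and keep only records
# with usable values. Column names ("Risk_Score", "Debt-To-Income Ratio",
# "Employment Length") are assumed to match the downloaded file.
import pandas as pd

rejected = pd.read_csv("RejectStatsA.csv", low_memory=False)
rejected = rejected.dropna(subset=["Risk_Score", "Debt-To-Income Ratio", "Employment Length"])
rejected = rejected[rejected["Risk_Score"] > 0]

# The DTI field is assumed to be stored as text such as "18.5%"; convert it to a number.
rejected["DTI"] = rejected["Debt-To-Income Ratio"].astype(str).str.rstrip("%").astype(float)

print(rejected.shape)
```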
Exhibit 1-7 provides a cut-out of the 2007–2012 declined loan applications (RejectStatsA) dataset provided.
EXHIBIT 1-7 2007–2012 Declined Loan Applications (RejectStatsA) Dataset
Available at https://www.lendingclub.com/info/download-data.action (accessed 10/6/2016).
EXHIBIT 1-8 LendingClub Declined Loan Applications by DTI (Debt-to-Income)
The DTI bucket includes high (debt > 20 percent of income), medium ("mid") (debt between 10 and 20 percent of income), and low (debt < 10 percent of income).
Source: Microsoft Excel 2016
The second analysis was on the length of employment and its relationship with rejected loans (see Exhibit 1-9). Arguably, the longer the employment, the more stable the job and income stream you will have to ultimately repay the loan. LendingClub reports the number of years of employment for each of the rejected applications. The PivotTable analysis lists the number of loans by the length of employment. Almost 77 percent (495,109 out of 645,414) of the total rejected loans came from applicants who had worked at a job for less than 1 year, suggesting potentially an important reason for rejecting the requested loan. Perhaps some had worked a week, or just a month, and still wanted a big loan?
EXHIBIT 1-9 LendingClub Declined Loan Applications by Length of Employment
Source: Microsoft Excel 2016
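The count behind Exhibit 1-9 can also be reproduced outside Excel. A minimal sketch, continuing the pandas example shown earlier and assuming the raw file labels employment length with values such as "< 1 year":

```python
# Count rejected applications by employment length (mirrors Exhibit 1-9).
counts_by_emp = rejected["Employment Length"].value_counts()
print(counts_by_emp)

# Share of rejections from applicants employed less than one year.
share_under_1yr = counts_by_emp.get("< 1 year", 0) / counts_by_emp.sum()
print(f"{share_under_1yr:.1%} of rejected applications show < 1 year of employment")
```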
The third analysis we perform is to consider the credit or risk score of the applicant. As noted in Exhibit 1-10, risk scores are typically classified in this way, with those in the excellent and very good categories receiving the lowest possible interest rates and best terms with a credit score above 750. On the other end of the spectrum are those with very bad credit (with a credit score less than 600).
EXHIBIT 1-10 Breakdown of Customer Credit Scores (or Risk Scores)
Source: Cafecredit.com
Excellent: 800–850
Very Good: 750–799
Good: 700–749
Fair: 650–699
Poor: 600–649
Very Bad: below 600
Those with excellent and very good credit scores are likely to qualify for almost all loans and receive the lowest interest rates. Those with good and fair credit scores are likely to qualify for most loans and receive good interest rates.
Another predictor of loan repayment is the credit score that the borrower has. We classify the sample into excellent, very good, good, fair, poor, and very bad credit according to the credit score breakdown noted in Exhibit 1-10.
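This classification can also be scripted. A minimal sketch, continuing the earlier pandas example and using the score boundaries from Exhibit 1-10 (scores above 850, if any appear in the file, would need to be handled separately, as Problem 5 notes later):

```python
# Assign each risk score to the Exhibit 1-10 buckets.
import pandas as pd

bins = [0, 600, 650, 700, 750, 800, 851]   # bucket edges follow Exhibit 1-10
labels = ["Very Bad", "Poor", "Fair", "Good", "Very Good", "Excellent"]
rejected["Score Bucket"] = pd.cut(rejected["Risk_Score"], bins=bins, labels=labels, right=False)

print(rejected["Score Bucket"].value_counts().reindex(labels))
```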
EXHIBIT 1-11 The Count of LendingClub Rejected Loan Applications by Credit or Risk Score Classification Using PivotTable Analysis
Source: Microsoft Excel 2016
EXHIBIT 1-12 The Count of LendingClub Declined Loan Applications by Credit Score, Debt-to-Income, and Employment Length Using PivotTable Analysis (highlighting added)
Source: Microsoft Excel 2016
Perhaps those with excellent credit just asked for too big of a loan given their existing
debt and that is why they were rejected. Exhibit 1-13 shows the PivotTable analysis. The
analysis shows those with excellent credit asked for a larger loan (16.2 percent of
income) given the debt they already had as compared to any of the others, suggesting a
reason even those potential borrowers with excellent credit were rejected.
EXHIBIT 1-13 The Average Debt-to-Income Ratio (shown as a percentage) by Credit (Risk)
Score for LendingClub Declined Loan Applications Using PivotTable Analysis
Source: Microsoft Excel 2016
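A minimal sketch of the same comparison, continuing the earlier pandas example (the DTI column is the numeric debt-to-income ratio converted earlier):

```python
# Average debt-to-income ratio by credit score bucket (mirrors Exhibit 1-13).
avg_dti_by_bucket = rejected.groupby("Score Bucket")["DTI"].mean().round(1)
print(avg_dti_by_bucket)
```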
Communicate Insights
Certainly further and more sophisticated analysis could be performed, but at this point we
have a pretty good idea of what LendingClub uses to decide whether to extend a loan.
We can communicate these insights either by showing the PivotTables or by stating what the three determinants are.
Track Outcomes
There are a wide variety of outcomes that could be tracked. But in this case, it might be best to see if we could predict future outcomes. For example, the data we analyzed were from 2007–2012. We could make our predictions for subsequent years based on what we had found in the past and then test and see how accurate we are with those predictions. We could also change our prediction model when we learn new insights and additional data become available.

In this chapter, we discussed how businesses and accountants derive value from Data Analytics. We gave some specific examples of how Data Analytics is used in business, auditing, managerial accounting, financial accounting, and tax accounting.
We introduced the IMPACT model and explained how it is used. We then talked specifically about the importance of identifying the question. We walked through the first few steps of the IMPACT model and introduced eight data approaches. We also discussed the data analytic skills needed by analytic-minded accountants.

We followed this up by looking at the case of why LendingClub rejected loans for a set of its customers using the IMPACT model. We performed this analysis using various filtering and PivotTable tasks.
PROGRESS CHECK
9. Doing your own analysis, download the rejected loans dataset titled "RejectStatsA Ready" and perform an Excel PivotTable analysis by state and figure out the number of rejected applications for the state of California. That is, count the loans by state and see what percentage of the rejected loans came from California. How close is that to the relative proportion of the population of California as compared to that of the United States?
10. Doing your own analysis, download the rejected loans dataset titled "RejectStatsA Ready" and run an Excel PivotTable by risk (or credit) score classification and DTI bucket to determine the number of rejected loans requested by those rated as having an excellent credit score. (A brief code sketch of one way to approach these two checks follows.)
Summary
■ With data all around us, businesses and accountants are looking to Data Analytics to extract the value that the data might possess.
■ Data Analytics is changing the audit and the way that accountants look for risk. Now, auditors can consider 100 percent of the transactions in their audit testing. It is also helpful in finding the anomalous or unusual transactions. Data Analytics is also changing the way financial accounting, managerial accounting, and taxes are done at a company.
■ The IMPACT cycle is a means of doing Data Analytics that goes all the way from identifying the question, to mastering the data, to performing data analyses and communicating results. It is recursive in nature, suggesting that as questions are addressed, new important questions may emerge that can be addressed in a similar way.
■ Eight data approaches address different ways of testing the data: classification, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, and data reduction. These are explained in more detail in chapter 3.
■ Data analytic skills needed by analytic-minded accountants are specified and are consistent with the IMPACT cycle, including the following:
◦ Develop an analytics mindset.
◦ Data scrubbing and data preparation.
◦ Data quality.
◦ Descriptive data analysis.
◦ Data analysis through data manipulation.
◦ Define and address problems through statistical data analysis.
◦ Data visualization and data reporting.
Key Words
Big Data (4) Datasets that are too large and complex for businesses’ existing systems to handle
utilizing their traditional capabilities to capture, store, manage, and analyze these datasets.
classification (9) A data approach that attempts to assign each unit in a population into a few categories potentially to help with predictions.
clustering (10) A data approach that attempts to divide individuals (like customers) into groups (or
clusters) in a useful or meaningful way.
co-occurrence grouping (10) A data approach that attempts to discover associations between individuals based on transactions involving them.
Data Analytics (4) The process of evaluating data with the purpose of drawing conclusions to
address business questions. Indeed, effective Data Analytics provides a way to search through large
structured and unstructured data to identify unknown patterns or relationships.
data dictionary (14) Centralized repository of descriptions for all of the data attributes of the dataset.
data reduction (11) A data approach that attempts to reduce the amount of information that needs to
be considered to focus on the most critical items (i.e., highest cost, highest risk, largest impact, etc.).
link prediction (10) A data approach that attempts to predict a relationship between two data items.
profiling (10) A data approach that attempts to characterize the "typical" behavior of an individual, group, or population by generating summary statistics about the data (including mean, standard deviations, etc.).
predictor (or independent or explanatory) variable (9) A variable that predicts or explains
another variable, typically called a predictor or independent variable.
response (or dependent) variable (9) A variable that responds to, or is dependent on, another.
regression (9) A data approach that attempts to estimate or predict, for each unit, the numerical value
of some variable using some type of statistical model.
similarity matching (10) A data approach that attempts to identify similar individuals based on data
known about them.
6. The tax staff would become much more adept at efficiently organizing data stored
across
multiple systems across an organization and performing Data Analytics to help with
tax planning to structure transactions in a way that might minimize taxes.
7. The dependent variable could be the amount of money spent on fast food.
Independent variables could be proximity of the fast food, ability to cook own food,
discretionary income, socioeconomic status, etc.
8. The data reduction approach might help auditors spend more time and effort on the
riskiest transactions or on those that might be anomalous in nature. This will help
them more efficiently spend their time on items that may well be of highest
importance.
9. An analysis of the rejected loans suggests that 85,793 of the total 645,414 rejected
loans were from the state of California. That represents 13.29 percent of the total
rejected loans. This is greater than the relative population of California to the United
States as of the 2010 census, of 12.1 percent (37,253,956/308,745,538).
10. A PivotTable analysis of the rejected loans suggests that more than 30.5 percent (762/2,494) of those in the excellent risk/credit score range asked for a loan with a debt-to-income ratio of more than 20 percent.
9. The IMPACT cycle includes all except the following process:
a. Communicate insights.
b. Data preparation.
c. Address and refine results.
d. Perform test plan.
10. By the year 2020, about 1.7 megabytes of new information will be created every:
a. Week.
b. Second.
c. Minute.
d. Day.
Discussion Questions
1. Define Data Analytics and explain how a university might use its techniques to recruit
and attract potential students.
2. Give an example of how Data Analytics creates value for businesses.
3. Give an example of how Data Analytics creates value for accounting.
4. How might Data Analytics be used in financial reporting? And how might it be used in
doing tax planning?
5. Describe the IMPACT cycle. Why does its order of the processes and its recursive
nature make sense?
6. Why is identifying the question such a critical first step in the IMPACT process cycle?
7. What is included in mastering the data as part of the IMPACT cycle described in the
chapter?
8. In the chapter, we mentioned eight different data approaches. Which data approach
was used by Alibaba, as mentioned in the chapter-opening vignette?
9. What data approach mentioned in the chapter might be used by Facebook to find
friends?
10. Auditors will frequently use the data reduction approach when considering potentially
risky transactions. Provide an example of why focusing on a portion of the total
number of transactions might be important for auditors to assess risk.
11. Which data approach might be used to assess the appropriate level of the allowance
for doubtful accounts?
12. Why might the debt-to-income attribute included in the declined loans dataset considered in the chapter be a predictor of declined loans? How about the credit (risk) score?
13. To address the question “Will I receive a loan from LendingClub?” we had available
data to assess the relationship among (1) the debt-to-income ratios and number of
rejected loans, (2) the length of employment and number of rejected loans, and (3)
the credit (or risk) score and number of rejected loans. What additional data would
you recommend to further assess whether a loan would be offered? Why would it be
helpful?
Problems
1. Download and consider the data dictionary file “LCDataDictionary,” specifically the
LoanStats tab. This represents the data dictionary for the loans that were funded.
Seeing all of the data attributes listed there, which attributes do you think might
predict which loans will go delinquent and which will ultimately be fully repaid? How
could we test that?
2. Download and consider the rejected loans dataset of LendingClub data titled
“RejectStatsA Ready.” Given the analysis performed in the chapter, what three items
do you believe would be most useful in predicting loan acceptance or rejection? What
additional data do you think could be solicited either internally or externally that would
help you predict loan acceptance or rejection?
3. Download the rejected loans dataset of LendingClub data titled "RejectStatsA Ready" from the Connect website and do an Excel PivotTable by state; then figure out the number of rejected applications for the state of Arkansas. That is, count the loans by state and compute the percentage of the total rejected loans in the USA that came from Arkansas. How close is that to the relative proportion of the population of Arkansas as compared to the overall U.S. population (per 2010 census)?
4. Download the rejected loans dataset of LendingClub data titled "RejectStatsA Ready" from the Connect website and do an Excel PivotTable by state; then figure out the number of rejected applications for each state. Reorder these and make a graph ordering the states and the number of rejected loans from highest to lowest. Is there a lot of variability among states?
For Problems 5, 6, and 7, we will be cleaning a data file in preparation for subsequent analysis. The analysis performed on LendingClub data in the chapter was for the years 2007–2012. For this and subsequent problems, please download the declined loans table for 2013–2014 from the Connect website, https://www.lendingclub.com/info/download-data.action. (A brief code sketch illustrating one way to do these cleaning steps appears after Problem 7.)
5. Consider the 2013 declined loan data from LendingClub titled “RejectStatsB2013”
from the Connect website. Similar to the analysis done in the chapter, let’s scrub the
risk score data. First, because our analysis requires risk scores, debt-to-income data,
and employment length, we need to make sure each of them has valid data.
a. Open the file in Excel.
b. Sort the file based on risk score and remove those observations (the complete
row or record) that have a missing score or a score of zero.
c. Assign each risk score to a risk score bucket similar to the chapter. That is, classify the sample according to this breakdown into excellent, very good, good, fair, poor, and very bad credit according to their credit score noted in Exhibit 1-10. Classify those with a score greater than 850 as "Excellent". Consider using if-then statements to complete this, or sort the rows and manually input the classification.
d. Run a PivotTable analysis that shows the number of loans in each risk score
bucket. Which group had the most rejected loans (biggest count)? Which group had
the least rejected loans (smallest count)? This is the deliverable. Is it similar to
Exhibit 1-11 performed on years 2007–2012?
6. Consider the 2013 declined loan data from LendingClub titled “RejectStatsB2013.”
Similar to the analysis done in the chapter, let’s scrub the debt-to-income data.
Because our analysis requires risk scores, debt-to-income data, and employment
length, we need to make sure each of them has valid data.
a. Sort the file based on debt-to-income and remove those observations (the
complete row or record) that have a missing score, a score of zero, or a negative
score.
b. Assign each valid debt-to-income ratio into three buckets (labeled DTI bucket) by classifying each debt-to-income ratio into high (>20 percent), medium (10–20 percent), and low (<10 percent) buckets. Consider using if-then statements to complete this, or sort the rows and manually input the classification.
c. Run a PivotTable analysis that shows the number of loans in each DTI bucket.
Any interpretation of why these loans were declined based on debt-to-income
ratios?
7. Consider the 2013 declined loan data from LendingClub titled “RejectStatsB2013.”
Similar to the analysis done in the chapter, let’s scrub the employment length.
Because our analysis requires risk scores, debt-to-income data, and employment
length, we need to make sure each of them has valid data.
25
a. Sort the file based on employment length and remove those observations (the complete row or record) that have a missing score ("NA") or a score of zero.
b. Sort the file based on debt-to-income and remove those observations (the
complete row or record) that have a missing score, a score of zero, or a negative
score.
c. Sort the file based on risk score and remove those observations (the complete
row or record) that have a missing score or a score of zero.
d. There should now be 669,993 observations. Any thoughts on what biases are
imposed when we remove observations? Is there another way to do this?
e. Run a PivotTable analysis to show the number of Excellent Risk Scores but High
DTI Bucket loans in each Employment year bucket. Any interpretation of why
these loans were declined based on employment length?
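As an alternative to the Excel sort-and-delete steps above, the cleaning and PivotTable analysis for Problems 5–7 could be sketched in pandas. The column names and missing-value markers are assumptions about the downloaded "RejectStatsB2013" file.

```python
# Sketch of the Problems 5-7 scrubbing steps; column names and missing-value
# markers (e.g., "n/a" for employment length) are assumed.
import pandas as pd

loans = pd.read_csv("RejectStatsB2013.csv", low_memory=False)
loans["DTI"] = loans["Debt-To-Income Ratio"].astype(str).str.rstrip("%").astype(float)

# Remove records with missing/zero risk scores, missing/zero/negative DTI,
# and missing employment length.
loans = loans[loans["Risk_Score"].notna() & (loans["Risk_Score"] > 0)]
loans = loans[loans["DTI"].notna() & (loans["DTI"] > 0)]
loans = loans[loans["Employment Length"].notna() & (loans["Employment Length"] != "n/a")]

# Bucket risk scores (Exhibit 1-10, treating scores above 850 as Excellent) and DTI.
score_bins = [0, 600, 650, 700, 750, 800, float("inf")]
score_labels = ["Very Bad", "Poor", "Fair", "Good", "Very Good", "Excellent"]
loans["Score Bucket"] = pd.cut(loans["Risk_Score"], bins=score_bins,
                               labels=score_labels, right=False)
loans["DTI Bucket"] = pd.cut(loans["DTI"], bins=[0, 10, 20, float("inf")],
                             labels=["Low", "Mid", "High"])

# Problem 5(d): number of loans in each risk score bucket.
print(loans["Score Bucket"].value_counts().reindex(score_labels))

# Problem 7(e): Excellent scores with a High DTI bucket, counted by employment length.
subset = loans[(loans["Score Bucket"] == "Excellent") & (loans["DTI Bucket"] == "High")]
print(subset["Employment Length"].value_counts())
```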
Lab 1-0 How to Complete Labs in This Text
The labs in this book will provide valuable hands-on experience in generating and analyzing accounting problems. Each lab will provide a company summary with relevant facts,
techniques that you will use to complete your analysis, software that you’ll need, and an
overview of the lab steps.
When you’ve completed your lab, you will submit a lab report showing your thought
process with written responses and validating that you’ve completed specific checkpoints
by taking screenshots along the way. This lab will demonstrate how to use basic lab tools.
On a Mac
1. Press Cmd + Shift + 4 and draw a rectangle across your screen that includes
your entire window.
2. Your screenshot will be saved in your Desktop folder.
3. Drag the screenshot file into your Word document.
4. Keep your document open and go to the next part of the lab.
Part 3: Add Another Screenshot and Submit Your
Document
1. Open a new web browser window and go to mhhm.com.
2. Take a screenshot of your results (label it 1-0A) of the page and paste it into your
lab document.
3. Save your document and submit it to your instructor. If you’re using Word Online
on OneDrive, click File > Save As > Download a Copy.
End of Lab
Company summary
You were just hired as an analyst for a credit rating agency that evaluates publicly listed
companies in the United States. The company already has some Data Analytics tools that
it uses to evaluate financial statements and determine which companies have higher risk
and which companies are growing quickly. The company uses these analytics to provide
ratings that will allow lenders to set interest rates and determine whether to lend money in
the first place. As a new analyst, you’re determined to make a good first impression.
Technique
Software needed
• Word processor
• Web browser
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
Part 1: Identify appropriate questions, and develop a hypothesis for each question.
Part 2: Translate questions into target fields and value in a database.
Part 3: Perform a simple analysis.
For example, if you wanted to evaluate a company's profit margin from one year to the next, your question might be, "Has [Company X's] gross margin increased in the last three years?" Type your three questions in your document.
3. Next to each question, generate a hypothetical answer to the question to help you identify what your expected output would be. You may use some insight or intuition or search for industry averages to inform your hypothesis. For example: "Hypothesis: Apple Inc's gross margin has increased slightly in the past 3 years."
4. Save your document.
LAB EXHIBIT 1-1B Add Your Tags to Perform a Simple Analysis Using XBRL Data
Source: Google
Company summary
LendingClub is a U.S.-based, peer-to-peer lending company, headquartered in San
Francisco, California. LendingClub facilitates both borrowing and lending by providing a
platform for unsecured personal loans between $1,000 and $35,000. The loan period is for
either 3 or 5 years. You have been brought in to help managers improve their loan
application process.
Technique
Software needed
• Word processor
be better ways to evaluate this given that the number of defaulted loans has increased in
the
past 2 years. It would like you to propose a model that would help it potentially assign a
risk score to loan applicants.
1. Create a new word processing document and name the file “Lab 1-2 Data Analytics
in Managerial Accounting Lab – [Your name] [Your email address].”
2. Use what you know about loan risk (or search the web if you need a refresher) to
identify three different questions that might influence risk. For example, if you
suspect risky customers live in a certain location, your question might be “Where do
the customers live?" Type your three questions in your document.
3. Next to each question, generate a hypothetical answer to each question to help
you identify what your expected output would be. You may use some insight or
intuition or search the Internet for ideas on how to inform your hypothesis. For
example: “Hypothesis: Risky customers likely live in coastal towns.”
4. Finally, identify the data that you would need to answer each of your questions. For
example, to determine customer location, you might need the city, state, and zip
code. Additionally, if you hypothesize a specific region, you’d need to know which
cities, states, and/or zip codes belong to that region. Add your required data
sources to each question in your document.
5. Save your document.
6. Evaluate each question from Part 1. Do the data you identified in your questions
exist in the table provided? Write the applicable fields next to each question in
your document.
7. Are there data values you identified that don’t exist in the table? Write where else
you might look to collect the missing data or how you might suggest collecting
them.
8. Save your document and submit it to your instructor.
End of Lab
Company summary
ABC Company is a large retailer that collects its order-to-cash data in a large ERP system
that was recently updated to comply with the AICPA’s audit data standards. ABC
Company currently collects all relevant data in the ERP system and digitizes any
contracts, orders, or receipts that are completed on paper. The credit department reviews
customers who request credit. Sales orders are approved by managers before being sent to
the warehouse for preparation and shipment. Cash receipts are collected by a cashier
and applied to a customer’s outstanding balance by an accounts receivable clerk.
You have been assigned to the audit team that will perform the internal controls audit
of ABC Company.
Technique
Software needed
• Word processor
• Web browser
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
Part 1: Identify appropriate questions and develop a hypothesis for each question.
Part 2: Translate questions into target fields and values in a database and perform a
simple analysis.
1. Create a new word processing document and name the file “Lab 1-3 Data Analytics
in
Auditing Lab – [Your name] [Your email address].”
2. Use what you know about internal controls over the order-to-cash process (or
search the web if you need a refresher) to identify three different questions that
might indicate internal control weakness. For example, if you suspect that a
manager may be delaying approval of shipments sent to customers, your question
might be “Are any shipping managers approving shipments more than 2 days after
they are received?” Type your three questions in your document.
3. Next to each question generate a hypothetical answer to each question to help
you identify what your expected output would be. You may use some insight or
intuition or search the Internet for ideas on how to inform your hypothesis. For
example: “Hypothesis: Only 1 or 2 shipping managers are approving shipments
more than
2 days after they are received.”
4. Finally, identify the data that you would need to answer each of your questions. For
example, to determine the timing of approval and who is involved, you might need
the approver id, the order date, and the approval date. Add your required data
sources to each question in your document.
5. Save your document.
End of Lab
The purpose of this lab is to help you identify relevant questions for Dillard’s Inc. based
on its data.
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its headquarters
is in Little Rock, Arkansas. You can learn more about Dillard’s by looking at finance
.yahoo.com (ticker symbol = DDS) and the Wikipedia site for DDS. You’ll quickly note
that William T. Dillard II is an accounting grad of the University of Arkansas and the
Walton College of Business, which may be why he shared transaction data with us to make it
available for this lab and labs throughout this text.
Technique
The data for this lab and all other Dillard’s labs are available at http://walton.uark.edu/
enterprise/. Your instructor will be able to help you gain access when it is needed. From
the Walton College website, we note the following:
The Dillard’s Department Store Database contains retail sales information gathered
from store sales transactions. The sale process begins when a customer brings items
intended for purchase (clothing, jewelry, home décor, etc.) to any store register. A
Dillard’s sales associate scans the individual items to be purchased with a barcode
reader. This populates the transaction table (TRANSACT), which will later be used
to generate a sales receipt listing the item, department, and cost information (related
price, sale price, etc.) for the customer. When the customer provides payment for
the items, payment details are recorded in the transaction table, the receipt is
printed, and the transaction is complete. Other tables are used to store information
about stores, products, and departments.
This retail sales information, UA_DILLARDS, was provided to the Walton College of
Business by Dillard’s Stores Inc. The information consists of five tables with more than
128 million rows already populated and ready for use.
This is a gifted dataset that is based on real operational data. Like any real database,
integrity problems may be noted. This can provide a unique opportunity not only to
expose students to real data, but also to illustrate the effects of data integrity problems.
[Source: http://walton.uark.edu/enterprise/dillardshome.php (accessed September 25,
2017).]
Software needed
• Word processor
• Web browser
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
• Access to the dataset is available at http://walton.uark.edu/enterprise/dillardshome
.php. If you plan on doing additional labs on Dillard’s data, you must receive permis-
sion from the Walton College to access the data before use. Additional access instruc-
tions are available from your instructor or on the Connect website.
LAB EXHIBIT 1-4B Metadata
Source: http://walton.uark.edu/enterprise/dillardshome.php (accessed September 25, 2017).
Attribute Description Values
AMT Total amount of the transaction charge to the customer 26.25, 44.00, . . .
BRAND The brand name of the stock item TOMMY HI, MARK ECK, . . .
CITY City where the store is located ST. LOUIS, TAMPA, . . .
CLASSID Stock Item Classification 5305, 4505, 8306, . . .
COLOR The color of the stock item BLACK, KHAKI, . . .
COST The cost of the stock item 9.00, 15.00, . . .
DEPT Department where the stock item belongs 800, 801, 1100, . . .
DEPTDESC Description of the department CLINIQUE, LESLIE, . . .
INTERID Internal ID 265005802, 671901998, . . .
MIC Master Item Code 862, 689, . . .
ORGPRICE Original price of the stock item 75.00, 44.00, . . .
PACKSIZE The quantity of item per pack 1, 3, . . .
QUANTITY Item quantity of the transaction 1, 2, 3, . . .
REGISTER Register Number of the current transaction 580, 30, 460, . . .
RETAIL The retail price of the stock item 19.75, 34.00, . . .
SALEDATE Date of the sale transaction 2005-01-20, 2005-06-02, . . .
SEQ Sequence number 298100028, 213500030, . . .
SIZE The size of the stock item L, 070N, 22, . . .
SKU Stock Keeping Unit number of the stock item 4757355, 2128748, . . .
SPRICE Sale price of the stock item 26.25, 65.00, . . .
STATE State where the store is located FL, MO, AR, . . .
STORE Store Number 2, 3, 4, 100, . . .
STYLE The specific style of the stock item 51 MERU08, 9 126NAO, . . .
STYPE Type of the transaction (Return or Purchase) P, R
TRANNUM Transaction Code 09700, 01800, . . .
UPC Universal Product Code for the stock item 000400004087945, . . .
VENDOR The vendor number of the stock item 5511283, 2726341, . . .
ZIP ZIP Code 33710, 63126, . . .
8. If you’re interested in learning which product is sold most often at each store,
what tables and fields would you consider?
9. Save your document and submit it to your instructor.
End of Lab
Chapter 2
Data Preparation and Cleaning
A Look Back
Chapter 1 defined Data Analytics and explained that the value of Data Analytics is in the insights it provides.
We described the Data Analytic Process using the IMPACT cycle model and explained how this process is
used to address both business and accounting questions. We specifically emphasized the importance of identifying
appropriate questions that data analytics might be able to address.
A Look Ahead
Chapter 3 describes how to go from defining business problems to analyzing data, answering questions, and address-
ing business problems. We make the case for three data approaches we argue are most relevant to accountants
and provide examples of each.
We are lucky to live in a world in which data are abundant.
However, even with rich sources of data, when it comes to
being able to analyze data and turn them into useful
information and insights, very rarely can an analyst hop right
into a dataset and begin analyzing. Datasets almost always
need to be cleaned and validated before they can be used. Not
knowing how to clean and validate data can, at best, lead to
frustration and poor insights and, at worst, lead to horrible
security violations. While this text takes advantage of open
source datasets, these datasets have all been scrubbed not
only for accuracy, but also to protect the security and privacy
of any individual or company whose details were in the original
dataset.
In 2016, a pair of researchers named Emil Kirkegaard and
Julius Daugbejerg Bjerrekaer scraped data from OkCupid, a free dating website, and uploaded the data to the
“Open Science Framework,” a platform researchers use to obtain and share raw data. While the aim of the Open
Science Framework is to increase transparency, the researchers in this instance took that a step too far—and a
step into illegal territory. Kirkegaard and Bjerrekaer did not obtain permission from OkCupid or from the 70,000
OkCupid users whose identities, ages, genders, religions, personality traits, and other personal details maintained
by the dating site were provided to the public without any work being done to anonymize or sanitize the data. If
the researchers had taken the time to not just validate that the data were complete but also to sanitize them to
protect the individuals’ identities, this would not have been a threat or a news story. On May 13, 2016, the Open
Science Framework removed the OkCupid data from the platform, but the damage of the privacy breach had
already been done.1
OBJECTIVES
After reading this chapter, you should be able to:
LO 2-1 Understand how data are organized in an accounting information system
LO 2-2 Understand how data are stored in a relational database
LO 2-3 Explain and apply extraction, transformation, and loading (ETL) techniques
1
B. Resnick, “Researchers Just Released Profile Data on 70,000 OkCupid Users without Permission,”
2016, http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release (accessed
October 31, 2016).
EXHIBIT 2-1 Procure-to-Pay Database Schema (Simplified)
2
J. P. Isson and J. S. Harriott, Win with Advanced Business Analytics: Creating Business Value from Your
are stored as a unique record in the university’s data model! Other examples of unique
identifiers that you are familiar with would be check numbers and driver’s license
numbers. One of the biggest differences between a flat file and a relational database is
simply how many tables there are—when you request your data into a flat file, you’ll receive
one big table with a lot of redundancy. While this is ideal for analyzing data, when the
data are stored in the database, each group of information is stored in a separate table.
Then, the tables that are related to one another are identified (e.g., Supplier and
Purchase Order are related; it’s important to know which Supplier the Purchase Order is
from). The relationship is created by placing a foreign key in one of the two tables that
are related. The foreign key is another type of attribute, and its function is to create the
relationship between two tables.
Whenever two tables are related, one of those tables must contain a foreign key to create
the relationship.
The other columns in a table are descriptive attributes. For example, Supplier Name
is a critical piece of data when it comes to understanding the business process, but it is
not necessary to build the data model. Primary and foreign keys facilitate the structure of
a relational database, and the descriptive attributes provide actual business information.
Refer to Exhibit 2-1, the data schema for a typical procure-to-pay process. Each table
has an attribute with the letters “PK” next to it—these are the primary keys for each
table. The primary key for Materials is “Item No.,” the primary key for Purchase Order is
“PO No.,” and so on. Several of the tables also have attributes with the letters “FK” next
to them—these are the foreign keys that create the relationship between pairs of tables.
For example, look at the relationship between Supplier and Purchase Order. The primary
key in the supplier table is “Supplier ID.” The line between the two tables links the
primary key to a foreign key in the Purchase Order table, also named “Supplier ID.”
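To see how these keys might be declared in practice, the following is a minimal sketch in generic SQL data-definition language. The table and field names are adapted from Exhibit 2-1, and the data types, field sizes, and exact syntax are assumptions rather than the text’s own schema. The foreign key in the Purchase_Order table points back to the primary key of the Supplier table, which is what creates the relationship described above:

CREATE TABLE Supplier (
    Supplier_ID   INTEGER PRIMARY KEY,  -- PK: unique identifier for each supplier
    Supplier_Name VARCHAR(30)           -- descriptive attribute
);

CREATE TABLE Purchase_Order (
    PO_No       INTEGER PRIMARY KEY,    -- PK: unique identifier for each purchase order
    PO_Date     DATE,                   -- descriptive attribute
    Supplier_ID INTEGER,                -- FK: links each purchase order to its supplier
    FOREIGN KEY (Supplier_ID) REFERENCES Supplier (Supplier_ID)
);

With this structure in place, the database will reject a purchase order whose Supplier_ID does not already exist in the Supplier table—one way the primary key/foreign key structure helps enforce business rules.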
The line items table in Table 2-1 has so much detail in it that it requires two attributes
to combine as a primary key. This is a special case of a primary key often referred to as a
composite primary key, in which the two foreign keys from the tables that it is linking
combine to make up a unique identifier. The theory and details that support the necessity
of this linking table are beyond the scope of this text—if you can identify the primary and
foreign keys, you’ll be able to identify the data that you need to request. Table 2-2 shows
a subset of the data that are represented by the Purchase Order and Supplier tables. You
can see that each of the attributes listed in the class diagram appears as a column, and the
data for each purchase order and each supplier are accounted for in the rows.
PROGRESS CHECK
1. Referring to Exhibit 2-1, locate the relationship between the Employee and
Purchase Order tables. What is the unique identifier of each table? (The
unique identifier attribute is called the primary key—more on how it’s
determined in the next learning objective.) Which table contains the attribute
that creates the relationship? (This attribute is called the foreign key—more on
how it’s determined in the next learning objective.)
2. Referring to Exhibit 2-1, review the attributes in the Suppliers table. There is a
foreign key in this table that doesn’t relate to any of the tables in the diagram.
Which table do you think it is? What type of data would be stored in that table?
DATA DICTIONARIES
In the previous section, you learned about how data are stored by focusing on the
procure-to-pay database schema. Viewing schemas and processes in isolation clarifies
each individual process, but it can also distort reality—these schemas do not represent
their own separate databases. Rather, each process-specific database schema is a piece of
a greater whole, all combining to form one integrated database.
As you can imagine, once these processes come together to be supported in one data-
base, the amount of data can be massive. Understanding the processes and the basics of
how data are stored is critical, but even with a sound foundation, it would be nearly
impossible for an individual to remember where each piece of data is stored, or what
each piece of data represents.
Creating and using a data dictionary is paramount in helping database administrators
maintain databases and analysts identify the data they need to use. In Chapter 1, you were
introduced to the data dictionary for the LendingClub. The same cut-out of the
LendingClub data dictionary is provided in Exhibit 2-2 as a reminder.
Because the LendingClub data are provided in a flat file, the only two attributes
necessary to describe the data are the attribute name (e.g., Amount Requested) and a descrip-
tion of that attribute. The description ensures that the data in each attribute are used and
analyzed in the appropriate way—it’s always important to remember that technology will
do exactly what you tell it to, so you must be smarter than the computer! If you run
analysis on an attribute thinking it means one thing, when it actually means another, you
could make some big mistakes and bad decisions even when you’re working with great
data. It’s critical to get to know the data through database schemas and data dictionaries
thoroughly before attempting to do any data analysis.
When you are working with data stored in a relational database, you will have more
attributes to keep track of in the data dictionary. Table 2-3 provides an example of a data
dictionary for a generic Supplier table:
Primary or Foreign Key? | Required? | Attribute Name | Description | Data Type | Default Value | Field Size | Notes
PK | Y | Supplier ID | Unique Identifier for Each Supplier | Number | n/a | 10 |
   | N | Supplier Name | First and Last Name | Short Text | n/a | 30 |
FK | N | Supplier Type | Type Code for Different Supplier Categories | Number | Null | 10 | 1: Vendor; 2: Misc
PROGRESS CHECK
3. What is the purpose of the primary key? A foreign key? A non-key attribute?
4. How do data dictionaries help you understand the data from a database or flat
file?
EXTRACTION
Requesting data is often an iterative practice, but the more prepared you are when
requesting data in the first place, the more time you will save for yourself and the
database administrators in the long run. Determine exactly what data you need in order
to answer your business questions.
Requesting the data involves the first two steps of the ETL extraction process. Each
step
4
T. Singleton, “What Every IT Auditor Should Know about Data Analytics,” n.d., from http://www.isaca
.org/Journal/archives/2013/Volume-6/Pages/What-Every-IT-Auditor-Should-Know-About-Data-Analytics
.aspx#2.
5
For a description of the audit data standards, please see this website: https://www.aicpa.org/interestareas/
frc/assuranceadvisoryservices/pages/assuranceandadvisory.aspx.
Request Date:
Required Date:
Intended Audience:
Customer (if not requestor):
Once the data are received, you can move on to the transformation phase of the ETL
process. The next step is to ensure that the data that have been extracted are complete and
correct.
Obtaining the Data Yourself
If you have direct access to the database or information system that holds all the data you
need, you may not need to go through a formal data request process, and you can simply
extract the data yourself.
After identifying the goal of the data analysis project in the first step of the IMPACT
cycle, you can follow a similar process to how you would request the data if you are
going to extract it yourself:
1. Identify the tables that contain the information you need. You can do this by
looking through the data dictionary or the relationship model.
2. Identify which attributes, specifically, hold the information you need in each table.
3. Identify how those tables are related to each other.
6
R&M Data Request Form—Template, Gloucestershire, n.d. Retrieved October 31, 2016, from http://www
.gloucestershire.gov.uk/media/word/n/t/datarequestform.doc.
Once you have identified the data you need, you can start gathering the information.
There are a variety of methods that you could take to retrieve the data. Two will be
explained briefly here—SQL and Excel—and there will be a deep dive into these methods
in Labs 2-1 and 2-2 at the end of the chapter using Sláinte data.
1. SQL: “Structured Query Language” (SQL, pronounced sequel) can be used to create,
update, and delete records and tables in databases, but we will focus on using SQL
to extract data—that is, to select the precise attributes and records that fit the criteria
of our data analysis goal. Using SQL, we can combine data from one or more tables
and organize it in a way that is more intuitive than the way it is stored in the
relational database. A firm understanding of the data—the tables, how they are
related, and their respective primary and foreign keys—is integral to extracting the
data.
Typically, data should be stored in the database and analyzed in another tool such
as Excel or Tableau. However, you can choose to extract only the portion of the data
that you wish to analyze via SQL instead of extracting full tables and transforming the
data in Excel or Tableau.
One of the most useful ways to extract data from more than one table in SQL is by
using a type of join clause. Joins rely on the structure of normalized relational data-
bases that have tables related through primary keys and foreign keys. If you intend
to extract data from two tables that are related, you simply have to identify the two
fields that the tables have in common, and create the join based on that relationship.
Referencing Exhibit 2-1, the Sláinte Procure-to-Pay Database Schema, if you
intended to extract data from both the FGI_Product table and the Sales_Subset table,
you would create your join based on the common fields of Product_Code in each
table. The code would be written as follows:
FROM FGI_Product
INNER JOIN Sales_Subset
ON FGI_Product.Product_Code = Sales_Subset.Product_Code
By inserting the FROM, INNER JOIN, and ON clauses into a SQL query in the
pattern described above, you can add any fields that you wish to view from either of
the tables to the SELECT clause. The generic pattern of the FROM, INNER JOIN,
and ON clauses is as follows:
FROM table1
INNER JOIN table2
ON table1.primarykey = table2.foreignkey
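Putting the join together with a SELECT clause, a complete query might look like the following sketch. Product_Description and Sales_Order_Quantity_Sold are attribute names taken from the Sláinte tables used elsewhere in this chapter’s labs; any other fields from either table could be listed in the same way:

SELECT FGI_Product.Product_Description, Sales_Subset.Sales_Order_Quantity_Sold
FROM FGI_Product
INNER JOIN Sales_Subset
ON FGI_Product.Product_Code = Sales_Subset.Product_Code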
There is more description about writing queries and a chance to practice creating joins
in Lab 2-2.
2. Excel: The tables that contain the data you need can also be extracted in whole into
Excel and worked with directly in a spreadsheet. The advantage of this is that further
analysis will almost certainly be done in Excel, so it could be beneficial to have all the
data readily available for further questions to drill down into once the initial question is
answered. Understanding the primary key and foreign key relationships is also integral to
working with the data directly in Excel.
Sometimes, your data are stored directly in Excel instead of a database. In this case,
you can also use Excel functions and formulas to combine data from multiple Excel
tables into one table, similar to how you can join data with SQL in Access or another
relational database. One of Excel’s most useful tools for looking up data from two
separate tables
and matching them based on a matching primary key/foreign key relationship is the
VLookup function. There are a variety of ways that the VLookup function can be used,
but for extracting and transforming data it is best used to add a column to a table. Using
a similar example from the SQL Join explanation above, assume that your data for the
three tables (FGI_Product, Sales_Subset, and Customer_Table) exist in three different
spreadsheets of the same Excel workbook. If you wished to view the actual Product_
Description associated with each Product_Code in the Sales_Subset table, you could use
VLookup to match the foreign key of Product_Code in the Sales_Subset table to the pri-
mary key, Product_Code, in the FGI_Product table, and have the corresponding Product_
Description returned.
When you type the VLookup formula into Excel, the arguments are
=VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). In our
example, this function would be typed into the spreadsheet holding the Sales_Subset
information, and it would be used to look up and return the Product_Description
associated with each Product_Code in the Sales_Subset table.
• The lookup_value is the foreign key you wish to look up; in our example, you
would reference the Product_Code in the Sales_Subset table. This is a single-cell
reference.
• The table_array is the table that contains the corresponding primary key; in this
instance, it is the FGI_Product table. VLookup will always look in the first column
of this table_array to find a value that matches the foreign key selected in the
lookup_value argument. This works well when data are well-organized with a
primary key situated in the first column of a table. In this example, you would
select the entire set of data in the FGI_Product table.
• Col_Index_Num refers to the column in your selected table_array that contains the
data you wish to view. In other words, if you were to manually match Product_Codes
between the two tables, you would look at the foreign key in the Sales_Subset table,
navigate to the FGI_Product table to find its match, then run your eyes across the
same row to locate the corresponding Product_Description. VLookup will do the very
same thing—the first two arguments represent the match, and this argument indicates
that you would like the function to return the Product_Description.
Product_Description
is located in the second column of the FGI_Product table, so you would enter a
number 2.
• [range_lookup] has two options, either FALSE or TRUE. The default is TRUE, so
whenever you want to match data based on an exact match (not an approximate or
near match), you need to type in FALSE.
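Putting the four arguments together, the completed formula for this example might look like the following. The exact cell and range references are hypothetical—they depend on where your data actually sit—but the sketch assumes that cell B2 on the Sales_Subset sheet holds a Product_Code and that the FGI_Product sheet lists Product_Code, Product_Description, and Product_Sale_Price in columns A through C:

=VLOOKUP(B2, FGI_Product!$A$2:$C$50, 2, FALSE)

Copying this formula down the column would return the Product_Description for every Product_Code in the Sales_Subset table, using an exact match (FALSE) against the second column of the selected range.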
TRANSFORMATION
Step 3: Validating the Data for Completeness and
Integrity
Anytime data are moved from one location to another, it is possible that some of the data
could have been lost during the extraction. It is critical to ensure that the extracted data
are complete (that the data you wish to analyze were extracted fully) and that the integrity
of the data remains (that none of the data have been manipulated or tampered with during
the extraction). Being able to validate the data successfully requires you to not only have
the technical skill to perform the task, but also to know your data well. If you know what
to reasonably expect from the data in the extraction (How many records should have been
extracted? What are some checksums you can rely on to ensure the data are complete and
haven’t been tampered with?), then you have a higher likelihood of identifying any errors
or issues from the extraction. The following four steps should be completed to validate
the
data after extraction:
1. Compare the number of records that were extracted to the number of records in the
source database. This will give you a quick snapshot into whether any data were
skipped or didn’t extract properly due to an error or datatype mismatch. This is a
critical first step, but it will not provide information about the data themselves other
than ensuring that the record counts match.
2. Compare descriptive statistics for numeric fields: Calculating the minimums,
maximums, averages, and medians will help ensure that the numeric data were
extracted completely.
3. Validate Date/Time fields in the same way as numeric fields by converting the
datatype to numeric and running descriptive statistic comparisons.
4. Compare string limits for text fields: Text fields are unlikely to cause an issue if
you extracted your data into Excel because Excel allows a maximum character
number of 32,767 characters. However, if you extracted your data into a tool that
does limit the number of characters in a string, you will want to compare these
limits to the source database’s limits per field to ensure that you haven’t cut off any
characters.
If an error is found, depending on the size of the dataset, you may be able to easily
find the missing or erroneous data by scanning the information with your eyes. However,
if the dataset is large, or if the error is difficult to find, it may be easiest to go back to the
extraction and examine how the data were extracted, fix any errors in the SQL code, and
re-run the extraction.
LOADING
Step 5: Loading the Data for Data Analysis
If the extraction and transformation steps have been done well by the time you reach this
step, the loading part of the ETL process should be the simplest step. It is so simple, in
fact, that if your goal is to do your analysis in Excel and you have already transformed and
cleaned your data in Excel, you are finished. There should be no additional loading
necessary.
However, it is possible that Excel is not the last step for analysis. The data analysis
technique you plan to implement, the subject matter of the business questions you
intend to answer, and the way in which you wish to communicate results will all drive the
choice of which tool you use to perform your analysis.
Throughout the text, you will be introduced to a variety of different tools to use for
analyzing data beyond Access and Excel. These will include Tableau, Weka, and IDEA.
As these tools are introduced to you, you will learn how to load data into them.
PROGRESS CHECK
5. Describe two different methods for obtaining data for analysis.
6. What are four common issues with data that must be fixed before analysis can
take place?
Summary
■ The first step in the IMPACT cycle is to identify the questions that you intend to answer
through your data analysis project. Once a data analysis problem or question has been
identified, the next step in the IMPACT cycle is mastering the data, which can be broken
down to mean obtaining the data needed and preparing it for analysis.
■ In order to obtain the right data, it is important to have a firm grasp of what data are
available and where they are stored. Depending on your level of access, you may be able
to extract the data yourself, or you may need to request the data from a database
administrator or the information systems team. If the latter is the case, you will complete
a data request form, indicating exactly which data you need and why.
■ Once you have the data, they will need to be validated for completeness and integrity—
that is, you will need to ensure that all of the data you need were extracted and that all
data are correct. Sometimes when data are extracted, some formatting or sometimes even
entire records will get lost, resulting in inaccuracies. Correcting the errors and cleaning
the data is an integral step in mastering the data.
■ Finally, after the data have been cleaned, there may be one last step of mastering the
data, which is to load them into the tool that will be used for analysis. Often, the cleaning
and correcting of data occur in Excel and the analysis will also be done in Excel. In this
case, there is no need to load the data elsewhere. However, if you intend to do more
rigorous statistical analysis than Excel provides, or if you intend to do more robust data
visualization than can be done in Excel, it may be necessary to load the data into another
tool following the transformation process.
Key Words
composite primary key (42) A special case of a primary key that exists in linking tables. The
composite primary key is made up of the primary keys of the two tables that it is linking.
data dictionary (43) Centralized repository of descriptions for all of the data attributes of a dataset.
data request form (45) A method for obtaining data if you do not have access to obtain the data
directly yourself.
descriptive attributes (42) Attributes that exist in relational databases that are neither primary
nor foreign keys. These attributes provide business information, but are not required to build a
database. An example would be “Company Name” or “Employee Address.”
ETL (44) The extract, transform, and load process that is integral to mastering the data.
flat file (41) A means of storing data in one place, such as in an Excel spreadsheet, as opposed to stor-
ing the data in multiple tables, such as in a relational database.
foreign key (42) An attribute that exists in relational databases in order to carry out the relationship
between two tables. This does not serve as the “unique identifier” for each record in a table. These must
be identified when mastering the data from a relational database in order to extract the data correctly
from more than one table.
mastering the data (40) The second step in the IMPACT cycle; it involves identifying and obtaining
the data needed for solving the data analysis problem, as well as cleaning and preparing the data for
analysis.
primary key (41) An attribute that is required to exist in each table of a relational database and serves
as the “unique identifier” for each record in a table.
relational database (41) A means of storing data in order to ensure that the data are complete, not
redundant, and to help enforce business rules. Relational databases also aid in communication and
integration of business processes across an organization.
5. Depending on the level of security afforded to a business analyst, she can either
obtain
data directly from the database herself or she can request the data. When obtaining
data herself, the analyst must have access to the raw data in the database and a firm
knowledge of SQL and data extraction techniques. When requesting the data, the
analyst doesn’t need the same level of extraction skills, but she still needs to be
familiar enough with the data to identify which tables and attributes contain
the information she requires.
6. Four common issues that must be fixed are removing headings or subtotals, cleaning
leading zeroes or nonprintable characters, formatting negative numbers, and
correcting inconsistencies across the data.
Multiple Choice Questions
1. Mastering the data can also be described via the ETL process. The ETL process
stands for:
a. Extract, total, and load data.
b. Enter, transform, and load data.
c. Extract, transform, and load data.
d. Enter, total, and load data.
2. The goal of the ETL process is to:
a. Identify which approach to data analytics should be used.
b. Load the data into a relational database for storage.
c. Communicate the results and insights found through the analysis.
d. Identify and obtain the data needed for solving the problem.
3. The advantages of storing data in a relational database include which of the
following?
a. Help in enforcing business rules.
b. Increased information redundancy.
c. Integrating business processes.
d. All of the above are advantages of a relational database.
e. Only A and B.
f. Only B and C.
g. Only A and C.
4. The purpose of transforming data is:
a. To validate the data for completeness and integrity.
b. To load the data into the appropriate tool for analysis.
c. To obtain the data from the appropriate source.
d. To identify which data are necessary to complete the analysis.
5. Which attribute is required to exist in each table of a relational database and serves
as the “unique identifier” for each record in a table?
a. Foreign key
b. Unique identifier
c. Primary key
6. The metadata that describes each attribute in a database is which of the following?
a. Composite primary key
b. Data dictionary
c. Descriptive attributes
d. Flat file
7. As mentioned in the chapter, which of the following is not a common way that data
will need to be cleaned after extraction and validation?
a. Remove headings and subtotals.
b. Format negative numbers.
c. Clean up trailing zeroes.
d. Correct inconsistencies across data.
8. Why is Supplier ID considered to be a primary key for a Supplier table?
a. It contains a unique identifier for each supplier.
b. It is a 10-digit number.
c. It can either be for a vendor or miscellaneous provider.
d. It is used to identify different supplier categories.
9. What are attributes that exist in a relational database that are neither primary nor
foreign keys?
a. Nondescript attributes
b. Descriptive attributes
c. Composite key
d. Relational table attributes
10. Which of these is not included in the five steps of the ETL process?
a. Determine the purpose and scope of the data request.
b. Obtain the data.
c. Validate the data for completeness and integrity.
d. Scrub the data.
Discussion Questions
1. The advantages of a relational database include limiting the amount of redundant
data that are stored in a database. Why is this an important advantage? What can go
wrong when redundant data are stored?
2. The advantages of a relational database include integrating business processes. Why
is it preferable to integrate business processes in one information system, rather than
store different business process data in separate, isolated databases?
3. Even though it is preferable to store data in a relational database, storing data across
separate tables can make data analysis cumbersome. Describe three reasons why it
is worth the trouble to store data in a relational database.
4. Among the advantages of using a relational database is enforcing business rules.
Based on your understanding of how the structure of a relational database helps
prevent data redundancy and other advantages, how does the primary key/foreign
key relationship structure help enforce a business rule that indicates that a company
shouldn’t process any purchase orders from suppliers who don’t exist in the
database?
5. What is the purpose of a data dictionary? Identify four different attributes that could
be stored in a data dictionary, and describe the purpose of each.
6. In the ETL process, the first step is extracting the data. When you are obtaining the
data yourself, what are the steps to identifying the data that you need to extract?
7. In the ETL process, if the analyst does not have the security permissions to access
the
data directly, then he or she will need to fill out a data request form. While this doesn’t
necessarily require the analyst to know extraction techniques, why does the analyst
still need to understand the raw data very well in order to complete the data request?
8. In the ETL process, when an analyst is completing the data request form, there are a
number of fields that the analyst is required to complete. Why do you think it is impor-
tant for the analyst to indicate the frequency of the report? How do you think that
would affect what the database administrator does in the extraction?
9. Regarding the data request form, why do you think it is important to the database
administrator to know the purpose of the request? What would be the importance of
the “To be used in” and “intended audience” fields?
10. In the ETL process, one important step to process when transforming the data is to
work with NULL, N/A, and zero values in the dataset. If you have a field of
quantitative data (e.g., number of years each individual in the table has held a full-
time job), what would be the effect of the following?
a. Transforming NULL and N/A values into blanks
b. Transforming NULL and N/A values into zeroes
c. Deleting records that have NULL and N/A values from your dataset
(Hint: Think about the impact on different aggregate functions, such as COUNT and
AVERAGE.)
Problems
The following problems correspond to the College Scorecard data. You should be
able to answer each question by just looking at the data dictionary (CollegeScorecard_
DataDictionary.pdf) included in Appendix A, but if you would like to use the raw data, feel
free to do so (CollegeScorecard_RawData.txt).
1. Which attributes from the College Scorecard data would you need to compare cost of
attendance across types of institutions (public, private non-profit, or private for-
profit)?
2. Which attributes from the College Scorecard data would you need to compare SAT
scores across types of institutions (public, private non-profit, or private for-profit)?
3. Which attributes from the College Scorecard data would you need to compare levels
of diversity across types of institutions (public, private non-profit, or private for-profit)?
4. Which attributes from the College Scorecard data would you need to compare
completion rate across types of institutions (public, private non-profit, or private for-
profit)?
5. Which attributes from the College Scorecard data would you need to compare the
percentage of students who receive federal loans at universities above and below the
median cost of attendance across all institutions (public, private non-profit, or private
for-profit)?
6. Which attributes from the College Scorecard data would you need to determine if
different regions of the country have significantly different costs of attendance?
7. Use the College Scorecard data to determine if different regions of the country have
significantly different costs of attendance (same as Problem 6 above) and fill out a data
request form in order to extract the appropriate data. Use the template from the chapter
as a guide.
8. If you were analyzing the levels of diversity across public and private institutions
using the College Scorecard data, how would you define diversity in terms of the data
provided? Would it be beneficial to combine fields?
9. If you were conducting a data analysis in order to compare the percentage of stu-
Answers to Multiple Choice Questions
1. C
2. D
3. G
4. A
5. C
6. B
7. C
8. A
9. B
10. D
Appendix A
CollegeScorecard Dataset
Description of Variables/Attributes
UNITID—a unique identifier for the institution
INSTNM—institution name
CITY—city
STABBR—state postcode
CONTROL—1 = Public. 2 = Private nonprofit. 3 = Private for-profit
CCBASIC—Carnegie Classification, basic:
−2 Not applicable
0 (Not classified)
1 Associate’s Colleges: High Transfer-High Traditional
2 Associate’s Colleges: High Transfer-Mixed Traditional/Nontraditional
3 Associate’s Colleges: High Transfer-High Nontraditional
4 Associate’s Colleges: Mixed Transfer/Vocational & Technical-High Traditional
5 Associate’s Colleges: Mixed Transfer/Vocational & Technical-Mixed
Traditional/ Nontraditional
6 Associate’s Colleges: Mixed Transfer/Vocational & Technical-High Nontraditional
7 Associate’s Colleges: High Vocational & Technical-High Traditional
8 Associate’s Colleges: High Vocational & Technical-Mixed
Traditional/ Nontraditional
9 Associate’s Colleges: High Vocational & Technical-High Nontraditional
10 Special Focus Two-Year: Health Professions
11 Special Focus Two-Year: Technical Professions
12 Special Focus Two-Year: Arts & Design
13 Special Focus Two-Year: Other Fields
14 Baccalaureate/Associate’s Colleges: Associate’s Dominant
15 Doctoral Universities: Highest Research Activity
16 Doctoral Universities: Higher Research Activity
17 Doctoral Universities: Moderate Research Activity
18 Master’s Colleges & Universities: Larger Programs
19 Master’s Colleges & Universities: Medium Programs
20 Master’s Colleges & Universities: Small Programs
21 Baccalaureate Colleges: Arts & Sciences Focus
22 Baccalaureate Colleges: Diverse Fields
23 Baccalaureate/Associate’s Colleges: Mixed Baccalaureate/Associate’s
24 Special Focus Four-Year: Faith-Related Institutions
25 Special Focus Four-Year: Medical Schools & Centers
26 Special Focus Four-Year: Other Health Professions Schools
27 Special Focus Four-Year: Engineering Schools
28 Special Focus Four-Year: Other Technology-Related Schools
29 Special Focus Four-Year: Business & Management Schools
30 Special Focus Four-Year: Arts, Music & Design Schools
31 Special Focus Four-Year: Law Schools
32 Special Focus Four-Year: Other Special Focus Institutions
33 Tribal Colleges
ADM_RATE – admission rate
SAT_AVG – average equivalent SAT of students admitted
UGDS – enrollment of undergraduate certificate/degree-seeking students
UGDS_WHITE – total share of enrollment of undergraduates who are white
UGDS_BLACK – total share of enrollment of undergraduates who are black
UGDS_HISP – total share of enrollment of undergraduates who are Hispanic
UGDS_ASIAN – total share of enrollment of undergraduates who are Asian
UGDS_AIAN – total share of enrollment of undergraduates who are American Indian/
Alaska Native
UGDS_NHPI – total share of enrollment of undergraduates who are Native Hawaiian/
Pacific Islander
UGDS_2MOR – total share of enrollment of undergraduates who are two or more
races
UGDS_NRA – total share of enrollment of undergraduates who are non-resident aliens
UGDS_UNKN – total share of enrollment of undergraduates whose race is unknown
PPTUG_EF – share of undergraduate degree/certificate-seeking students who are part-time
NPT4_PUB – Average net price for Title IV institutions (public)
NPT4_PRIV – Average net price for Title IV institutions (private for-profit and
nonprofit)
COSTT4_A – Average cost of attendance
TUITFTE – Net tuition revenue per full-time equivalent student
INEXPFTE – Instructional expenditures per full-time equivalent student
PFTFAC – Proportion of faculty that is full-time
PCTPELL – Percentage of undergraduates who receive a Pell Grant
C150_4 – Completion rate for first-time, full-time students at four-year institutions (6 year)
PFTFTUG1_EF – Share of undergraduate students who are first-time, full-time, degree-seeking undergraduates
RET_FT4 – First-time, full-time student retention rate at four-year institutions
PCTFLOAN – Percent of all federal undergraduates receiving a federal student loan
Lab 2-1 Create a Request for Data Extraction
One of the biggest challenges you face with data analysis is getting the right data. You
may have the best questions in the world, but if there are no data available to support your
hypothesis, you will have difficulty providing value. Additionally, there are instances in
which the IT workers may be reluctant to share data with you. They may send incomplete
data, the wrong data, or completely ignore your request. Be persistent, and you may have
to look for creative ways to find insight with an incomplete picture.
Company summary
Sláinte is a fictional brewery that has recently gone through big changes. Sláinte sells six
different products. The brewery has only recently expanded its business to distributing
from one state to nine states, and now its business has begun stabilizing after the
expansion. With that stability comes a need for better analysis. You have been hired by
Sláinte to help management better understand the company’s sales data and provide input
for its strategic decisions.
Data
• Data request form
• Sláinte dataset
Technique
• Some experience with spreadsheets and PivotTables is useful for this lab.
Software needed
• Word processor
• Excel
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
Part 2: Generate a Request for Data
Now that you’ve identified the data you need for your analysis, complete a data request
form.
1. Open the Data Request Form.
2. Enter your contact information.
3. In the description field, identify the tables that you’d like to analyze, along with the
time periods (e.g., past month, past year, etc.). Note: It’s almost always better to ask
for complete tables rather than limit your request to specific attributes; there may be
useful information that you’ll want later.
4. Select a frequency. In this case, this is a “One-off request.”
5. Enter a request date (today) and a required date (one week from today).
6. Choose a format (spreadsheet).
7. Indicate what the information will be used for in the appropriate box
(internal analysis).
8. Take a screenshot (label it 2-1A) of your completed form.
FGI_Product Table
Attribute Description of Attribute
Product_Code (PK) Unique identifier for each product
Product_Description Product description (plain English) to indicate the name or other identifying charac-
teristics of the product
Product_Sale_Price Price per unit of the associated product
Customer Table
Attribute Description of Attribute
Customer_ID (PK) Unique identifier for each customer
Business_Name The name of the customer
Customer_Address The physical street address of the customer
Customer_City The physical city where the customer is located
Customer_St The physical state where the customer is located
Customer_Zip The zip code of the city where the customer is physically located
You may notice that while there are a few attributes that may be useful in your sales
analysis, the list may be incomplete and missing several values. This is normal with
data requests.
Q4. Take a moment and identify any attributes that you are missing from your
original request.
Q5. Evaluate your original questions and responses. How do the data alter the ques-
tions? Can you get a similar answer using the data provided by Rachel?
End of Lab
Efficient relational databases contain normalized data. That is, each table contains only
data that are relevant to the object, and tables’ relationships are defined with primary key/
foreign key pairs. For example, each record in a customer table is assigned a unique ID
(e.g., customer 152883), and the remaining attributes (e.g., customer address) describe
that customer. In a sales order table, the only customer data you find is a foreign key
pointing to the customer (e.g., customer 152883) we are selling merchandise to. The
foreign key value connects the sales order record to the customer record and allows any
or all of the linked attributes to appear on the sales order form or report.
With Data Analytics, efficient databases are not as helpful. Rather, we would like to
“denormalize” the data or combine all of the related data into one large file that can be
easily evaluated for summary statistics or be used to create meaningful PivotTables.
Excel calls this the Internal Data Model. In Access, we create a query. This lab will take
you through this process. This lab will help you recognize how to create relationships
between related spreadsheets in Excel using Excel’s Internal Data Model. The Internal
Data Model is available in Excel for PC versions from 2013 onward. This lab is in
preparation for using the Internal Data Model in future labs to transform data, as well as
to aid in understanding of primary and foreign key relationships.
Company summary
Sláinte is a fictional brewery that has recently gone through big changes. Sláinte sells six
different products. The brewery has only recently expanded its business to distributing from
one state to nine states, and now its business has begun stabilizing after the expansion.
With that stability comes a need for better analysis. One of Sláinte’s first priorities is to
identify its areas of success, as well as areas of potential improvement.
Data
• Sláinte dataset
Technique
• Some experience with relational databases, spreadsheets, and PivotTables is useful for
this lab.
Software needed
• Excel
• Access
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
In this lab, you will:
Depending on your desired analysis, there are a few alternative approaches that you
could use to prepare the data for analysis.
Alternative 1: Do nothing.
If you are simply trying to calculate statistics or make comparisons using attributes within
a single table, there is no need to transform the tables. Simply load the table, make sure
the data are clean, and proceed to analysis.
For example, to find the total number of each item sold, you would need only the
[Sales_Subset] table and its attributes [Product_Code] and [Sales_Order_Quantity_Sold].
Q2. When would it be a good idea to use a single table?
LAB EXHIBIT 2-2B Define the Primary Key/Foreign Key Relationships in Excel (Source: Microsoft Excel 2016)
10. Click Close in the Manage Relationships window to return to the spreadsheets.
While the spreadsheets do not appear to have changed with the new relationships,
we have created a powerful engine for analyzing our data. We will have access to
any of the records and related fields in any of the tables without additional work,
such as Find and Replace or VLookup.
11. Save your workbook as Slainte_Relationships.xlsx.
Q3. How comfortable are you with identifying primary key/foreign key relationships?
Alternative 3: Merging the data into a single table using Excel Query Editor.
While relationships are incredibly useful when dealing with multiple tables, there are
times when it is useful to have all of the data together in one table. Both queries and
PivotTables are much more straightforward when you don’t have to continually define
the relationships. The downside to working with a single table is that you must work with
a larger file size and there are a lot of redundant data.
1. Create a new blank spreadsheet in Excel.
2. Click the Data tab on the ribbon.
3. Click the Get Data menu in the Get & Transform Data section.
4. Choose From File > From Workbook.
5. Locate the Slainte_Subset.xlsx file on your computer, and click Open.
6. In the Navigator, check Select multiple items, then check the three tables to
import, shown in Exhibit 2-2D:
a. [Customer]
b. [FGI_Product]
c. [Sales_Subset]
7. Click Load. The three tables will appear as queries in the Queries & Connections
pane on the right side of the screen.
8. Double-click the [Sales_Subset] query to open the Query Wizard.
9. To merge the tables click the Home tab, then choose Merge Queries from the Combine
section. A new Merge window will appear.
LAB EXHIBIT 2-2E Select the Primary and Foreign Keys in the Merge Window to Create a Large Table in Excel’s Query Editor (Source: Microsoft Excel 2016)
17. Rename [Sheet1] to [Sales_Order_Merge].
Note: You can also directly load your merged table into a PivotTable if that is the
analysis you’re going to perform.
18. Save your workbook as Slainte_Merge.xlsx.
Q4. Have you used the Query Editor in Excel before? Double-click the [Sales_
Subset] query and click through the tabs on the ribbon. Which options do
you think will be useful in the future?
SQL can be used to create tables, delete records, or edit databases, but we will pri-
marily use SQL to query the database—that is, not to edit or manipulate the data,
but to create different views of the data to help us answer business questions.
There are four key phrases that are used in every query: SELECT fields, FROM
table, WHERE criteria, GROUP BY aggregate.
• SELECT indicates which attributes you wish to view. These can be columns
that already exist in a table (such as Product_Code), or they can be math-
ematical expressions that already exist, such as the sum of the quantity
sold. Use AS to give your expression a friendly name. For expressions, you
write equations similar to an Excel function:
SUM(Sales_Order_Quantity_Sold). When you select more than one
column, put a comma between them.
For example:
SELECT Product_Code, SUM(Sales_Order_Quantity_Sold)
SELECT Product_Code, Sales_Order_Quantity_Sold*Product_Sale_Price
AS Order_Total
• FROM indicates which table you are pulling the fields in from. If you need
to retrieve data from more than one table, you will use another SQL phrase:
table JOIN table ON (foreignkey = primarykey).
For example:
FROM Sales_Subset;
FROM Sales_Subset JOIN Customer ON (Sales_Subset.Customer_ID=Customer.Customer_ID)
• WHERE is used to filter your results on a specific value or range.
Commonly, you will compare a field to a specific number (e.g., WHERE
Customer_ID=2056) or text value in quotes (e.g., WHERE
Product_Description="Pale Ale"). You can also filter on ranges such as
dates (e.g., WHERE Sales_Order_Date BETWEEN #1/1/2019# AND
#12/31/2019#). Use AND or OR to combine multiple filters.
For example:
WHERE Customer_ID=2056 AND Sales_Order_Date BETWEEN #1/1/2019#
AND #12/31/2019#
• GROUP BY is used anytime you have an aggregate in your SELECT
column; this will indicate how you want to categorize, or group, your data. In
our example, we intend to group our aggregate by Product_Code. Without
the GROUP BY command, you will see duplicate records in the query
results.
For example:
GROUP BY Product_Code
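Putting the four key phrases together, a complete query for this running example might look like the following (the date range shown is only illustrative—adjust it to the period you actually want to examine):

SELECT Product_Code, SUM(Sales_Order_Quantity_Sold) AS Total_Quantity_Sold
FROM Sales_Subset
WHERE Sales_Order_Date BETWEEN #1/1/2019# AND #12/31/2019#
GROUP BY Product_Code;

This returns one row per product with the total quantity sold during the chosen period, rather than one row per individual sales order line.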
1. Open the Slainte_Subset.accdb file.
2. Open the SQL editor by navigating to the Create tab on the ribbon.
3. Click Query Design to open the SQL Designer.
4. Click Close on the Show Table window.
5. In the top left corner, click SQL to open the SQL Editor.
6. The 3 lines of code in the examples of the SQL Key Word Tutorial 1 are the
three lines that we will use to execute our report in SQL. In the SQL Editor,
type the following lines of code:
SELECT Sales_Subset.*, FGI_Product.*, Customer.*
FROM Customer RIGHT JOIN (FGI_Product RIGHT JOIN Sales_Subset
ON FGI_Product.Product_Code = Sales_Subset.Product_Code) ON
Customer.Customer_ID = Sales_Subset.Customer_ID;
7. Click Run from the Query Tools > Design tab on the Ribbon to view your
combined query output.
8. Take a screenshot (label it 2-2C).
9. Save your query as Slainte_Merge. From here you can either click External Data
> Excel in Access to export your query as an Excel file OR close your
database, open Excel and choose Data > Get Data > From Database >
From Microsoft Access Database, then navigate to your database and
import the query.
PivotTables allow you to quickly summarize large amounts of data. In Excel, click
Insert > PivotTable, choose your data source, then click the checkmark next to or
drag your fields to the appropriate boxes in the PivotTable Fields pane to identify
filters, columns, rows, or values. You can easily move attributes from one pane to
another to quickly “pivot” your data. Here is a brief description of each section:
• Rows: Show the main item of interest. You usually want master data here,
such as customers, products, or accounts.
• Columns: Slice the data into categories or buckets. Most commonly,
columns are used for time (e.g., years, quarters, months, dates).
• Values: This area represents the meat of your data. Any measure that you would
like to count, sum, average, or otherwise aggregate should be placed here.
The aggregated values will combine all records that match a given row and
column.
• Filters: Placing a field in the Filters area will allow you to filter the data
based on that field, but it will not show that field in the data. For example, if
you wanted to filter based on a date, but didn’t care to view a particular
date, you could use this area of the field list. With more recent versions of
Excel, there are improved methods for filtering, but this legacy feature is still
functional.
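If it helps to connect the PivotTable back to SQL, the report you will build below (products as rows, months as columns, total quantity sold as values) is conceptually the same as an aggregate query. The following is only a sketch; it assumes the Slainte_Merge query created earlier in this lab and Access's built-in Month() function:
SELECT Product_Description, Month(Sales_Order_Date) AS Sales_Month,
SUM(Sales_Order_Quantity_Sold) AS Total_Quantity
FROM Slainte_Merge
GROUP BY Product_Description, Month(Sales_Order_Date);
The PivotTable simply lets you rearrange the same aggregation by dragging fields rather than rewriting the query.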
1. From any of the files you created in Part 2, click the Insert tab on the ribbon.
2. Click PivotTable in the Tables section.
3. In the Create PivotTable window click Use this workbook’s Data Model. Note: If
you have only one table, choose Select a table or range and choose your sheet.
4. Click OK to create the PivotTable. A PivotTable Fields pane appears on the right. Note: If at
any point while working with your PivotTable, your PivotTable Fields list disappears, you
can make it reappear by ensuring that your active cell is within the PivotTable itself. If the
Field List still doesn’t reappear, navigate to the Analyze tab in the Ribbon, and select
Field List.
5. Click the arrow toggle next to each table to show the available fields. If you don’t
see your three tables, click the All option directly below the PivotTable Fields pane
title.
6. Take a screenshot (label it 2-2D).
7. Because you defined relationships or merged the tables in Part 2, you can drag any
of the attributes from your list of fields to their respective Filters, Columns, Rows,
or Values. Do that now:
a. Columns: [Sales_Order_Date] (Month) from [Sales_Subset]. Note: When you add a
date, Excel will automatically try to group the data by Year, Quarter, etc. For
now, remove the other options.
b. Rows: [Product_Description] from [FGI_Product].
c. Values: [Sales_Order_Quantity_Sold] from [Sales_Subset].
d. Filters: None.
8. Finally, to show only the four months from January to April, click the drop-
down arrow next to Column Labels and uncheck Nov and Dec.
9. Optional step: Clean up your PivotTable. Rename labels and the title of the report
to something more useful.
10. Take a screenshot (label it 2-2E).
11. Save a copy of your workbook as Slainte_Pivot.xlsx.
To perform a similar, but less flexible analysis in Access, do the following:
1. Open your Slainte_Subset.accdb file from Part 2.
2. Click Create > Query Design. Close the window that appears.
3. Click SQL View in the top-left corner.
4. Enter the following query:
SELECT Product_Description, Sum(Sales_Order_Quantity_Sold) AS
Total_Sales
FROM Slainte_Merge
WHERE Sales_Order_Date Between #1/1/2020# And #4/30/2020#
GROUP BY Product_Description;
5. Click Run to show the results.
6. Take a screenshot (label it 2-2F).
7. Save your query as Total_Sales_By_Product and close your database.
Part 4: Address and Refine Your Results
Now that you’ve completed a basic analysis to answer management’s question, take a
moment to think about how you could improve the report and anticipate questions your
manager might have.
Q5. If the owner of Sláinte wishes to identify which product sold the most,
how would you make this report more useful?
Q6. If you wanted to provide more detail, what other attributes would be useful
to add as additional rows or columns to your report, or what other reports
would you create?
End of Lab
There are several issues with this dataset that we’ll need to resolve before we can process
the data. This will require some cleaning, reformatting, and other techniques.
Company summary
LendingClub is a peer-to-peer marketplace where borrowers and investors are matched
together. The goal of LendingClub is to reduce the costs associated with these banking transactions and make borrowing less expensive and investment more engaging.
LendingClub provides data on loans that have been approved and rejected since 2007,
including the assigned interest rate and type of loan. This provides several opportunities
for data analysis.
Data
• LendingClub datasets: ApproveStats
Technique
• Some experience with Excel is useful for this lab.
Software needed
• Excel
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd +Shift + 4)
Part 1: Identify the Questions
You’ve already identified some analysis questions for LendingClub in chapter 1. Here,
you’ll focus on data quality. Think about some of the common issues with data you
receive from other people. For example, is the date field in the proper format? Do number
fields contain text or vice versa?
Q1. What do you expect will be major data quality issues with LendingClub’s data?
Q2. Given this list of attributes, what concerns do you have with the data’s ability
to predict answers to the questions you identified in chapter 1?
Take a moment and familiarize yourself with the data.
1. Open your web browser and go to https://www.lendingclub.com/info/download-data.action.
2. In the Download Loan Data section, choose “2014” from the drop-down list, then click
Download.
3. Locate your downloaded zip files on your computer, and extract the .csv files to a convenient location (e.g., desktop or Documents).
4. Open the LoanStats3c.csv file in Excel. There should be 235,629 records (ignoring the first two header rows). Note: Calculating summary statistics, such as the total number of records, is an important step in data validation. We'll cover that in Lab 2-4.
5. Take a moment and explore the data.
Q3. Is there anything in the data that you think will make analysis difficult? For
example, are there any special symbols, nonstandard data, or numbers that
look out of place?
Q4. What would you do to clean the data in this file?
Let's identify some issues with the data.
• There are many attributes that contain no data at all and may not be necessary.
• The [int_rate] values are written in ##.##%, but analysis will require #.####.
• The [term] values include the word “months,” which should be removed for
numerical analysis.
• The [emp_length] values include “n/a”, “<”, “+”, “year”, and “years”—all of
which should be removed for numerical analysis.
• Dates, including [issue_d], can be more useful if we expand them to show the day,
month, and year as separate attributes. Dates cause issues in general because
different systems use different date formats (e.g., 1/9/2009, Jan-2009, 9/1/2009 for
European dates, etc.), so typically some conversion is necessary.
First, remove the unwanted data:
6. Save your file as “Loans2007-2011.xlsx” to take advantage of some of Excel’s features.
7. Delete the first row that says “Notes offered by prospectus. . .”.
8. Delete the last four rows that include “Total amount funded. . .”.
9. Delete columns that have no values, including [id], [member_id], and [url].
10. Repeat for any other blank columns or unwanted attributes.
Next, fix your numbers:
11. Select the [int_rate] column.
12. In the Home tab, go to the Number section and change the number type from
Percentage to General using the drop-down menu.
13. Repeat for any other attributes with percentages.
14. Take a screenshot (label it 2-3A) of your partially cleaned data file.
Then, remove any words from numerical values:
15. Select the [term] column.
16. Use Find & Replace (Ctrl+H or Home > Editing > Find & Select > Find & Replace) to find the words "months" and "month" and replace them with a null/blank value "". Important: Be sure to include the space before the words and go from the longest variation of the word to the shortest. In this case, if you replaced "month" first, you would end up with a lot of values that still had the letter "s" from "months".
17. Now select the [emp_length] column and find and replace the following values:
Original Value New Value
7 years 7
8 years 8
9 years 9
10+ years 10
, (comma) (blank)
18. Take a screenshot (label it 2-3B) of your partially cleaned data file, showing the [term] column.
Note: Finding and replacing 13 values by hand may be tedious, but it is efficient for a one-off analysis and a small file. If you plan to re-perform this analysis multiple times, need to find and replace dozens of items, or have a file that is larger than Excel can handle, you're better off using a scripting language, such as Python. You can download Python free from python.org, and a quick search on Google will help you find tutorials covering the basics.
Here’s what the script would look like for the find and replace function where you
would list the original value as item and the replacement value as replacement:
import csv
s = ifile.read()
for item, replacement in zip(findlist, replacelist):
s = s.replace(item, replacement)
s = s.replace(item, replacement)
ofile.write(s)
ifile.close()
ofile.close()
22. Now convert the formulas to data values. Select the new [issue_month] column that
contains your formula.
23. Copy [Ctrl+C] and Paste Special [Ctrl+Alt+V]. Choose Values [V], then click OK.
24. Save your file.
25. Add another blank column and name it [issue_year].
26. Use the =YEAR([column address for issue_d]) formula to extract the year from the date in your new column and copy your formula to the bottom of the sheet. You should see a year number value in each cell. If it still has a date format, change the number format to General in the Home tab.
27. Select the new [issue_year] column that contains your formula, then Copy [Ctrl+C] and Paste Special [Ctrl+Alt+V]. Choose Values [V]. Click OK.
28. Save your file.
29. Take a screenshot (label it 2-3C) of your cleaned data file, showing the new date columns.
Q5. Why do you think it is useful to reformat and extract parts of the dates
before you conduct your analysis? What do you think would happen if you
didn’t?
Q6. Did you run into any major issues when you attempted to clean the data?
How would you resolve those?
End of Lab
Company summary
LendingClub is a peer-to-peer marketplace where borrowers and investors are matched
together. The goal of LendingClub is to reduce the costs associated with these banking transactions and make borrowing less expensive and investment more engaging.
LendingClub provides data on loans that have been approved and rejected since 2007,
including the assigned interest rate and type of loan. This provides several opportunities
for data analysis.
Data
• LendingClub dataset: ApproveStats2014
Technique
• Some experience with Excel is useful for this lab.
Software needed
• Excel
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
Calculate Summary Statistics in Excel
For basic validation, we'll use Excel. Remember, there is a limit on the number of records that Excel can handle, so this approach is best for small- to medium-sized files. Excel's status bar at the bottom of the window provides quick access to a summary of any selected values.
Note: If you’ve already downloaded LendingClub data from 2014, you can skip to step
5.
1. Open your web browser and go to https://www.lendingclub.com/info/download-data.action.
2. In the Download Loan Data section, choose “2014” from the drop-down list, then click
Download.
3. Locate your downloaded zip files on your computer, and extract the .csv files to a convenient location (e.g., desktop or Documents).
4. Open the LoanStats3c.csv file in Excel.
5. Select the [loan_amnt] column. At the bottom of the window, you will see the Average, Count, and Sum calculations, shown in LAB Exhibit 2-4A. Compare those to the validation given by LendingClub:
Funded loans: $3,503,840,175
Number of approved loans: 235,629
Data summary
The data used are a subset of the College Scorecard dataset that is provided by the U.S.
Department of Education. These data provide federal financial aid and earnings information, insights into the performance of schools eligible to receive federal financial aid, and
the outcomes of students at those schools. You can learn more about how the data are
used and view the raw data yourself at https://collegescorecard.ed.gov/data/. However,
for this lab, you should use the text file provided to you.
Data
• CollegeScorecard Datasets: CollegeScorecard_RawData
Technique
• Some experience with Excel is useful for this lab.
Software needed
• Text Editor (Windows: Notepad; Mac: TextEdit)
• Excel
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
8. Leaving Delimited checked (the default), click Next in the wizard, and select the appropriate delimiter. Make sure to uncheck the default option, Tab.
9. Click Finish in the wizard.
10. Take a screenshot (label it 2-5B).
11. To ensure that you captured all of the data through the extraction from the txt file,
we need to validate it. Validate the following checksums:
• You should have 7,704 records (rows).
• Compare the attribute names (column headers) to the attributes listed in the data
dictionary. Are you missing any, or do you have any extras?
• The average SAT score should be 1,059.07 (this is leaving NULL values as NULL).
Q2. In the checksums, you validated that the average SAT score for all of the
records is 1,059.07. When we work with the data more rigorously, several tests
will require us to transform NULL values. If you were to transform the NULL
SAT values into 0, what would happen to the average (would it stay the same,
decrease, or increase)? How would that change to the average affect the way
you would interpret the data? Do you think it’s a good idea to replace NULL
values with 0s in this case?
12. Now that the data have been validated, you can clean the data. How you clean the
data is determined by the question you intend to answer. In this case, we’re
preparing our data to run a regression test using the two attributes SAT_AVG and
C150_4. As you’ll learn in chapter 3, a regression test won’t run with non-numeric
values (i.e., we can’t leave the NULL values in, and we can’t transform them to
blanks). Earlier you discussed the cons of replacing NULL values with 0s.
To avoid the issues with NULL, blanks, and 0s, we will remove all of the
records that contain NULL values in either SAT_AVG or C150_4. Do so.
13. Perform a =COUNT() to verify the number of records that remain after removing
all records associated with NULL values in SAT_AVG or C150_4. 1,271 records
should remain.
14. Take a screenshot (label it 2-5C).
Your data is now ready for the test plan. This lab will continue in chapter 3.
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard’s data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
4. Leave the default for authentication to Windows Authentication, and click Connect.
5. Expand the Databases folder in the Object Explorer window.
10. Select the tables you would like to view. For this lab, select all of them.
11. Take a screenshot (label it 2-6A).
End of Lab
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard’s data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
7. Because this dataset is massive, it can take a very long time for the system to return the complete set of data for some of the bigger tables (such as TRANSACT). If you would like to view just the top few rows of a dataset to get a feel for what type of data is in the table, you can do so with a query.
In the SELECT line, you can type TOP # before the columns you would like to
see. Any type of filtering, aggregating, and ordering will still work through the rest
of the query, but selecting the top few will help the query run faster by returning a
subset of the result.
8. To view the top 10 rows in the TRANSACT table, type the following query into
the query window:
SELECT TOP 10 *
FROM TRANSACT
In SQL, SELECT indicates the columns you would like to view. * is a shortcut to indicate that you'd like to view all of the columns. The TOP command limits the number of rows that are returned. FROM indicates the table that contains the data you'd like to view. (A sketch that combines TOP with filtering, aggregation, and ordering appears after Lab Exhibit 2-7D.)
9. To see the result of the query, click Execute. F5 also works to run queries as a PC
shortcut.
LAB EXHIBIT 2-7D
Source: Microsoft SQL Server
Management Studio
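Once you are comfortable with the basic TOP pattern, filtering, aggregating, and ordering can be layered onto the same statement. The following is only a sketch (it uses the TRAN_DATE and TRAN_AMT columns that appear later in this chapter's labs) and returns total transaction amounts for the ten stores with the largest totals during the first half of September 2016:
SELECT TOP 10 STORE, SUM(TRAN_AMT) AS TOTAL_AMT
FROM TRANSACT
WHERE TRAN_DATE BETWEEN '20160901' AND '20160915'
GROUP BY STORE
ORDER BY TOTAL_AMT DESC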
Lab 2-8 Comprehensive Case: Dillard’s Store Data:
Connecting Excel to a SQL Database
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its
headquarters is in Little Rock, Arkansas. You can learn more about Dillard’s by looking at
finance.yahoo. com (Ticker symbol = DDS) and the Wikipedia site for DDS. You’ll
quickly note that William T. Dillard II is an accounting grad of the University of
Arkansas and the Walton College of Business, which may be why he shared transaction
data with us to make available for this lab and labs throughout this text.
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard’s data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
LAB EXHIBIT 2-8A
Source: Microsoft Excel 2016
5. In the Microsoft SQL database pop-up window, input the server name that you
were provided through the Walton.uark.edu/enterprise website. The database name
is UA_Dillards_2016.
6. Click OK.
7. On the next window, keep the default to use your current credentials, and then click
Connect.
8. Click OK in the Encryption Support pop-up window.
9. The tables in the UA_Dillards_2016 database are available for you to select in the
Navigator window. Click once on STORE to preview the data.
10. The data will preview on the right side of the Navigator window. Click Load to
load the data into a table in Excel.
As long as the dataset that you have loaded is under the Excel row limit of 1,048,576 rows, the entire table will be available for you to work with in Excel. You can analyze the data using Excel's formulas, functions, and statistical tools, as well as create PivotTables and charts.
11. Create a PivotTable for this set of data by selecting all of the data from the Store
table and then clicking PivotTable on the Insert tab of the Excel ribbon.
12. We can quickly view a count of how many stores are in each state. Drag and drop
STATE into the ROWS section of the PivotTable Fields window and STORE into
the
VALUES section.
LAB EXHIBIT 2-8F
Source: Microsoft Excel 2016
13. It is likely that the PivotTable assumed you wanted to SUM the Store ID, which
provides nonsense data. We need to change that aggregate to a COUNT instead.
Click the drop-down next to Sum of STORE in the VALUES section of the
PivotTable Fields window and select Value Field Settings.
LAB EXHIBIT 2-8G
Source: Microsoft Excel 2016
14. Select Count to change the way the data for number of stores per state are
summarized.
3. In the Microsoft SQL database pop-up window, input the server information
that you received when accessing the UA_Dillards_2016 data. The Database
name is UA_Dillards_2016.
Important Note: If you just worked through the first part of this lab (connecting to
data), this step is where the process begins to be different. Instead of clicking OK, you
will click SQL statement (optional).
4. For this query, we will pull in enough data to answer a variety of questions about transaction line items in each state. We'll select all of the columns from the TRANSACT table and the STATE column from the STORE table. In order to do that, we'll join the two tables together in our query.
Q5. Joins are made based on their primary key/foreign key relationship. Looking
at the ERD or the dataset, which two columns form the relationship between
the TRANSACT and STORE tables?
5. Type this query into the SQL statement box:
SELECT TRANSACT.*, STATE
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160915'
6. Click OK to continue.
7. Click Connect using your current credentials in the next window.
8. Click OK on the Encryption Support window.
9. Excel will provide you a preview of your data before loading it. If the query loads successfully (i.e., if you see the preview instead of an error), click Close & Load to load the data into an Excel table.
LAB EXHIBIT 2-8M
Source: Microsoft SQL Server
Management Studio
10. It may take a few minutes to load. Even though the query we ran was only for 15
days of transactions, there are still more than 1 million transactions (or rows) to
return.
14. For the Input Range, select the three columns associated with the three attributes
that we are measuring. Leave the default to columns, and place a check-mark in
Labels in First Row.
15. Place a check mark next to Summary Statistics, then press OK.
It may take a while for the statistics to run because you're working with so many rows.
Q7. What are the means for each of the attributes?
Q8. The mean for TRAN_AMT is lower than the means for both ORIG_PRICE and SALE_PRICE. Why do you think that is? (Hint: It is not an error.)
Part 5: Address and Refine Results
Q9. How does doing a query within Excel allow quicker and more efficient
access and analysis of the data?
Q10. Is 15 days of data sufficient to capture the statistical relationship among and between different variables? What will Excel do if you have more than 1 million rows? There are statistical programs, such as SAS and SPSS, that allow for transformation and statistical analysis of bigger datasets.
End of Lab
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its
headquarters is in Little Rock, Arkansas. You can learn more about Dillard’s by looking at
finance.yahoo. com (Ticker symbol = DDS) and the Wikipedia site for DDS. You’ll
quickly note that William T. Dillard II is an accounting grad of the University of
Arkansas and the Walton College of Business, which may be why he shared transaction
data with us to make available for this lab and labs throughout this text.
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard’s data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Technique
• This lab is most easily performed if Labs 2-6 and 2-7 have already been completed.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
LAB EXHIBIT 2-9A
Source: Microsoft SQL Server
Management Studio
8. Given the description in the text and in Labs 2-6 and 2-7, you have the tools you need to join two tables, TRANSACT and CUSTOMER, and run a query on customer state (note this is where the customer lives, not where the store is located). Input a query that will show how many customers have shopped at Dillard's, grouped by their respective states. Run the query for the entire dataset; do not filter based on a limited set of days.
9. This query may take a few minutes to run. Once the results have returned, you can check your results by looking at how many customers have shopped at Dillard's from Arkansas (AR): 2,673,089.
Q3. How many different states are listed?
Q4. Why are there so many more states listed than 50?
Q5. What do you assume the Other, XX, blank, and Null states represent? If you
were to analyze these data to learn more about the number of customers
from different places have shopped at Dillard’s, what would you do with
these data: group them, leave them out, leave them alone? Why?
End of Lab
Chapter 3
Modeling and Evaluation:
Going from Defining Business
Problems and Data Understanding
to Analyzing Data and Answering
Questions
A Look Back
Chapter 2 provided a description of how data are prepared and scrubbed to be ready to use to answer business
questions. We explained how to extract, transform, and load data and then how to validate and normalize the
data. In addition, we explained how data standards are used to facilitate the exchange of data between both
senders and receivers.
A Look Ahead
Chapter 4 will demonstrate various techniques that can be used to effectively communicate the results of
your analyses. Additionally, we discuss how to refine your results and translate your findings into useful
information for decision makers.
Liang Zhao Zhang, a San Francisco–based janitor, made more
than $275,000 in 2015. The average janitor in the area earns
just $26,180 a year. Zhang, a Bay Area Rapid Transit (BART)
janitor, has a base pay of $57,945 and $162,050 in overtime
pay. With benefits, the total was $276,121. While some call his
compensation "outrageous and irresponsible," Zhang signed up for every overtime slot that became available. To
be sure, Zhang worked more than 4,000 hours last year and
received overtime pay. Can BART predict who might take advantage of overtime pay? Should it set a policy restricting overtime pay? Would it be better for BART to hire more regular, full-time employees instead of offering so much overtime?
Can Data Analytics help with these questions?
Using a profiling data analytics approach detailed in this chapter, BART could generate summary statistics of its workers and their overtime pay to see the extent to which overtime is required and taken advantage of.
Using regression and classification approaches to Data Analytics would help to classify which employees are
most likely to exceed normal bounds and why. BART, for example, has a policy of offering overtime by seniority. So
do the most senior employees sign up first and leave little overtime to others? Will a senior employee get paid
more for overtime than more junior-level employees? If so, is that the best policy for the company and its
employees?
Source: http://www.cnbc.com/2016/11/04/how-one-bay-area-janitor-made-276000-last-year.html.
OBJECTIVES
After reading this chapter, you should be able to:
LO 3-1 Define Data Analytics approaches
LO 3-2 Explain the profiling approach to Data Analytics
You could also use co-occurrence grouping to match vendors by geographic region;
data reduction to simplify vendors into obvious categories, such as wholesale or retail or
based on overall volume of orders; or profiling to evaluate vendors with similar on-time
delivery behavior, shown in Exhibit 3-2. In any of these cases, the data drive the decision, and you evaluate the output to see whether it matches your intuition. These exploratory exercises may help to define better questions, but they are generally less useful for making decisions.
On the other hand, we may ask questions with specific outcomes, such as: “Will a new
vendor ship a large order on time?” When you are performing analysis that uses historical
data to predict a future outcome, you will use a supervised approach. We use historical data
[Figure for Exhibit 3-2: vendors plotted along a Days to ship (Z-score) axis running from −3 to +3]
EXHIBIT 3-2 Profiling Profiling is an unsupervised method that is used to discover patterns
of behavior. In this case, the higher the Z-score (farther away from the mean), the more likely a
vendor will have a delayed shipment (green circle). We use profiling to explore the attributes of
that vendor that we may want to avoid in the future.
to create the new model. Using a classification model, you can predict whether a new
vendor belongs to one class or another based on the behavior of the others, shown in
Exhibit 3-3. You might also use regression to predict a specific value to answer a question
such as, “How many days do we predict it will take a new vendor to ship an order?”
Again, the prediction is based on the activity we have observed from other vendors,
shown in Exhibit 3-4. Causal modeling, similarity matching, and link prediction are additional supervised approaches.
[Figure for Exhibit 3-3: observations plotted on Volume versus Distance axes]
EXHIBIT 3-3 Classification Classification is a supervised method that can be used to predict
the class of a new observation. In this case, blue circles represent “on-time” vendors. Green
squares represent “delayed” vendors. The gold star represents a new vendor with no history.
[Figure for Exhibit 3-4: observations plotted on Volume versus Days to ship axes]
EXHIBIT 3-4 Regression Regression is a supervised method used to predict specific values. In this case, the number of days to ship is dependent on the number of items in the order. Therefore, we can use regression to predict the "days to ship" of the gold star based on the volume in the order.
[Exhibit 3-5 (flowchart): Identify your question, then follow the decision points ("Do you have a specific target in mind?"; "Are you looking for hidden links?") to an approach: classification (whether or not), causal modeling (one event influences another), link prediction (social networks), profiling (typical behavior), or co-occurrence grouping (events that happen together).]
With causal modeling, similarity matching, and link prediction you attempt, respectively, to identify causation (which can be expensive), to identify a series of characteristics that predict an outcome, or to identify other relationships between items.
Ultimately, the model you use comes down to the questions you are trying to answer. The flowchart in Exhibit 3-5 shows several decisions that will help you select an appropriate model, or data approach. By evaluating your data, the question that needs to be addressed, and the desired outcomes, you can determine an appropriate data approach. Once you've selected an approach, your analysis can begin.
We highlighted the data analytics approaches in chapter 1 and provide them again here
for reference:
• Classification: A data approach used to assign each unit (or individual) in a
population into a few categories. An example classification might be, of all of the
loans this bank has offered, which are most likely to default? Or which loan
applications are expected to be approved? In this case, classification would classify
loan requests as either approved or denied. A second example would be a credit card
company flagging a transaction as approved or as potentially fraudulent and denying payment.
• Regression: A data approach used to estimate or predict, for each unit, the
numerical value of some variable using some type of statistical model. An example
of regression analysis might be, given a balance of total accounts receivable held by
a firm, what is the appropriate level of allowance for doubtful accounts for bad
debts?
PROGRESS CHECK
1. Using the flowchart in Exhibit 3-5, identify the appropriate approach for the following questions:
a. Will a customer purchase item X if given incentive A?
b. Which item (X, Y, or none) will a customer likely purchase given incentive A?
c. How many items will the customer purchase?
2. What is the main difference between supervised and unsupervised methods?
3. Evaluate the model shown in Exhibit 3-3. Which class would you predict the
new vendor belongs to?
PROFILING
LO 3-2 Explain the profiling approach to Data Analytics
As you recall, profiling involves gaining an understanding of the typical behavior of an individual, group, or population (or sample). Profiling is done primarily using structured data—that is, data that are stored in a database or spreadsheet and are readily searchable. Using these data, analysts can use common summary statistics to describe the individual, group, or population, including knowing its mean, standard deviation, sum, etc. Profiling is generally performed on data that are readily available, so the data have already been gathered and are ready for further analysis.
Data profiling can be as simple as calculating summary statistics on transactional data, such as the average number of days to ship a product, the typical amount we pay for a product, or the number of hours an employee is expected to work. On the other hand, profiling can be used to develop complex models to predict potential fraud. For example, you might create a profile for each employee in a company that may include a combination of salary, hours worked, and travel and entertainment purchasing behavior. Sudden deviations from an employee's past behavior may represent risk and warrant follow-up by the internal auditors. Similar to evaluating behavior, data profiling is typically used to assess data quality and internal controls. For example, data profiling may identify customers with incomplete or erroneous master data or mistyped transactions.
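As a concrete illustration, the simple summary-statistic profiling described above can often be produced with a single aggregate query. The sketch below assumes the Dillard's TRANSACT table used in the Chapter 2 labs and profiles transaction amounts by store:
SELECT STORE, COUNT(*) AS Transaction_Count,
AVG(TRAN_AMT) AS Average_Amount, SUM(TRAN_AMT) AS Total_Amount
FROM TRANSACT
GROUP BY STORE
Comparing each store's results to the overall distribution is the starting point for the threshold setting described in the steps that follow.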
Data profiling typically involves the following steps:
1. Identify the objects or activity you want to profile. What data do you want to
evaluate? Sales transactions? Customer data? Credit limits? Imagine a manager wants
to track sales volume for each store in a retail chain. She might evaluate total sales
dollars, asset turnover, use of promotions and discounts, and/or employee incentives.
2. Determine the types of profiling you want to perform. What is your goal? Do you want
to set a benchmark for minimum activity, such as monthly sales? Have you set a
budget that you wish to follow? Are you trying to reduce fraud risk? In the retail store
scenario, the manager would likely want to compare each store to the others to
identify which ones are underperforming or overperforming.
3. Set boundaries or thresholds for the activity. This is a benchmark that may be
manually set, such as a budgeted value, or automatically set, such as a statistical
mean, quartile, or percentile. The retail chain manager may define underperforming
stores as those whose sales activity falls below the 20th percentile of the group and
overperforming stores as those whose sales activity is above the 80th percentile.
These thresholds are automatically calculated based on the total activity of the stores, so the benchmark is dynamic.
4. Interpret the results and monitor the activity and/or generate a list of exceptions.
Here is where dashboards come into play. Management can use dashboards to
quickly see multiple sets of profiled data and make decisions that would affect
behavior. As you evaluate the results, try to understand what a deviation from the
defined boundary represents. Is it a risk? Is it fraud? Is it just something to keep an
eye on? To evaluate her stores, the retail chain manager may review a summary of the sales indicators and quickly identify under- and overperforming stores. She is likely to be more concerned with underperforming stores as they represent major challenges for the chain. Overperforming stores may provide insight into marketing efforts or customer base.
5. Follow up on exceptions. Once a deviation has been identified, management should have a plan to take a course of action to validate, correct, or identify the causes of the abnormal behavior. When the retail chain manager notices a store that is underperforming compared to its peers, she may follow up with the individual store manager.
EXHIBIT 3-6 Example of Price and Volume Variance Profiling Note that there are multiple
benchmarks. The blue line is the standard behavior; the green area contains favorable variances; the
orange area shows unfavorable variances.
1 http://www.washingtonpost.com/wp-dyn/content/article/2005/07/14/AR2005071402055.html (accessed August 2, 2017).
[Exhibit 3-7 figure: a bar chart of first-digit frequencies (digits 1 through 9 on the horizontal axis, 0% to 30% on the vertical axis) comparing Real GDP 2016 values with Benford's predicted distribution]
EXHIBIT 3-7 Benford’s Law Benford’s law predicts the distribution of first digits. In
this example, the GDP is given (in U.S. dollars) for countries around the world from 2016;
note its departure from what we would expect given Benford’s law.
Source: data.worldbank.org
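Formally, Benford's law predicts that the probability of observing d (for d = 1, 2, . . . , 9) as the leading digit is P(d) = log10(1 + 1/d), so a leading 1 should occur about 30.1 percent of the time while a leading 9 should occur only about 4.6 percent of the time.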
PROGRESS CHECK
4. Profiling is also used in law enforcement, such as offender or criminal profiling. Offender profiling is a tool used by law enforcement to identify likely suspects and to analyze data patterns in order to predict future offenses by criminals and identify potential victims. Compare and contrast this type of profiling with the profiling data approach used in accounting (mentioned earlier in this section).
5. Identify a reason the sales amount of any single product may or may not follow
Benford’s law.
The data reduction approach allows us to focus more time and effort on those vendors
and
transactions that might require additional analysis to make sure they are legitimate.
An example of the data reduction approach might be to do gap detection, such as looking for a missing check number in a sequence of checks. Finding out why certain check numbers were skipped and not recorded requires additional analysis and consideration.
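A gap-detection test like this is straightforward to express in SQL. The following is only a sketch against a hypothetical Checks table with a numeric CheckNumber column; it lists each check number whose immediate successor is missing (the largest check number will also appear, because it has no successor):
SELECT c.CheckNumber + 1 AS First_Missing_Number
FROM Checks AS c
LEFT JOIN Checks AS n ON n.CheckNumber = c.CheckNumber + 1
WHERE n.CheckNumber IS NULL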
Another application of the data reduction approach is to filter the transactions down to those between known related parties. Focusing specifically on related party transactions allows the auditor to concentrate on those transactions that might potentially be sensitive and/or risky.
Another example might be the comparison between the address of vendors and the
address of employees to ensure that employees are not siphoning funds to themselves.
Such a filter might require the use of a computer-assisted technique, called fuzzy match,
to match addresses that do not perfectly match 100 percent. Use of fuzzy match looks for
correspondences between portions, or segments, of the text of each potential match. Once
potential matches between vendors and employees are found, additional analysis must be
conducted to figure out if funds have been, or potentially could be, siphoned.
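For reference, an exact-match comparison of vendor and employee addresses can be written as a simple join; the sketch below borrows the field names from the Lab 3-1 data at the end of this chapter. A fuzzy match relaxes the equality condition so that near misses (for example, "123 Main St." versus "123 Main Street") are also flagged for follow-up:
SELECT VendorID, VendorName, EmployeeID, EmployeeLastName
FROM Vendors INNER JOIN Employees
ON Vendors.VendorBillingAddress = Employees.EmployeeStreetAddress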
PROGRESS CHECK
6. Describe how the data reduction approach is used to consider T&E expenses.
7. Explain how XBRL might be used to focus on specific areas of interest by lenders.
REGRESSION
LO 3-4 Understand the regression approach to Data Analytics
Regressions allow the accountant to develop models to predict expected outcomes. These expected outcomes might be to predict the level of the allowance for doubtful accounts needed for a given accounts receivable balance.
Regression analysis involves the following process:
1. Identify the variables that might predict an outcome. The inputs are called independent variables; the outcome you are trying to predict is called the dependent variable.
2. Determine the functional form of the relationship. Is it a linear relationship where each
input plots to another? Are you trying to divide the records into different groups or
classes?
3. Identify the parameters of the model. What are the relative weights of each variable
or the thresholds of each branch in a classification?
The following discussion primarily identifies the structure of the model—that is, the relationship between the dependent variable and the plausible independent variables—in this way:
Dependent variable = f(independent variables)
The dependent variable might be the amount that should be considered in an allowance for doubtful accounts; the independent variables that might predict the level needed to reserve it may be current aged loans, loan type, customer loan history, and collections success. We develop this further later.
We provide a multitude of examples in this next section.
2 http://www.cpafma.org/articles/inside-public-accounting-releases-2015-national-benchmarking-report/ (accessed November 9, 2016).
3 A. S. Ahmed, C. Takeda, and S. Thomas, "Bank Loan Loss Provisions: A Reexamination of Capital Management, Earnings Management and Signaling Effects," Journal of Accounting and Economics 28, no. 1 (1999), pp. 1–25.
4 http://www.pwc.com/us/en/cfodirect/publications/in-brief/fasb-new-impairment-guidance-financial-instruments.html (accessed November 9, 2016).
Classification Terminology
First, a bit of terminology to prepare us for our discussion.
Training data are existing data that have been manually evaluated and assigned a
class. We know that some customer accounts have been written off, so those accounts are
assigned the class “Write-Off.” We will train our model to learn what it is that those
customers have in common so we can predict whether a new customer will default or not.
Test data are existing data used to evaluate the model. The classification algorithm will try to predict the class of the test data and then compare its prediction to the previously assigned class. This comparison is used to evaluate the accuracy of the model, or the probability that the model will assign the correct class.
Decision trees are used to divide data into smaller groups. Decision boundaries mark
the split between one class and another.
Exhibit 3-9 provides an illustration of both decision trees and decision boundaries. Decision trees split the data at each branch into two or more groups. In this example, the first branch divides the vendor data by geographic distance and inserts a decision boundary; the later branches further divide the data by vendor volume. Note that the decision boundaries are different for each grouping.
EXHIBIT 3-9 Example of Decision Trees and Decision Boundaries [Figure: a decision tree with three numbered splits and the corresponding numbered decision boundaries plotted on Volume versus Distance axes]
Pruning removes branches from a decision tree to avoid overfitting the model. Pre-pruning occurs during the model generation. The model stops creating new branches when the information usefulness of an additional branch is low. Post-pruning evaluates the complete model and discards branches after the fact. Exhibit 3-10 provides an illustration of how pruning might work in a decision tree.
EXHIBIT 3-10 Illustration of Pruning a Decision Tree [Figure: a decision tree with numbered branches, some of which are pruned away]
Linear classifiers are useful for ranking items rather than simply predicting class probability. These classifiers are used to identify a decision boundary. Exhibit 3-11 shows an illustration of linear classifiers segregating the two classes. Note the error observation that shows that this linear classifier is not perfect in segregating the two classes.
EXHIBIT 3-11 Illustration of Linear Classifiers [Figure: two classes plotted on Volume versus Distance axes and separated by numbered linear decision boundaries; one misclassified observation is labeled "Error"]
EXHIBIT 3-12 Support Vector Machines With support vector machines, first find the widest margin (biggest pipe); then find the middle line. [Figure: two panels, each plotted on a Volume axis]
EXHIBIT 3-13 Support Vector Machine Decision Boundaries SVMs have two decision boundaries at the edges of the pipes. [Figure: plotted on a Volume axis]
EXHIBIT 3-14 Illustration of Underfitting and Overfitting the Data with a Predictive Model [Figure; horizontal axis: complexity of model]
PROGRESS CHECK
8. If we are trying to predict the extent of employee turnover, do you believe the
health of the economy, as measured using GDP, will be positively or
negatively associated with employee turnover?
9. If we are trying to predict whether a loan will be rejected, would you expect
credit score to be positively or negatively associated with loan rejection by a
bank such as LendingClub?
CLUSTERING
LO 3-6 Understand the clustering approach to Data Analytics
The clustering data approach works to identify groups of similar data elements and the underlying drivers of those groups. More specifically, clustering techniques are used to group data/observations into a few segments so that data within any segment are similar, while data across segments are different.
As an example, Walmart may want to understand the types of customers who shop
at its stores. Because Walmart has good reason to believe there are different market segments of people, it may consider changing the design of the store or the types of products
to accommodate the different types of customers, emphasizing the ones that are most
profitable to Walmart. To learn about the different types of customers, managers may
ask whether customers agree with the following statements using a scale of 1–7 (on a
Likert scale):
Statement 1: I enjoy shopping.
Statement 2: I try to avoid shopping because it is bad for the budget.
Statement 3: I like to combine my shopping with eating out.
EXHIBIT 3-16 Example of Cluster Analysis of Group Insurance Claim Payments
Source: Thiprungsri, S., and M. A. Vasarhelyi, 2011, p. 79.
PROGRESS CHECK
10. Name three clusters of customers who might shop at Walmart.
11. Cluster 1 of the group insurance claims highlights claims with a long period from death date to payment date. Why would that cluster be of interest to internal auditors?
Summary
■
In this chapter, we addressed the third step of the IMPACT cycle model: the “P” for
“performing test plan.” That is, how are we going to test or analyze the data to address
a problem we are facing?
■
Based on our problem and the data available, we provided a flowchart that helps the
analyst to choose the most appropriate model, noting the differences when we use a
supervised versus an unsupervised approach.
■
Specifically, we addressed five data analytics approaches or techniques that are most common for addressing our accounting questions: profiling, data reduction, regression, classification, and clustering. We also provided examples of accounting and auditing problems addressed by these data approaches.
■
We introduced the concepts of Benford's law and fuzzy match, which we will use in subsequent chapters.
■
We presented some classification terminology—including test and training data, decision trees and boundaries, linear classifiers, and support vector machines—and talked about the perils of under- and overfitting the training data and their consequences in predictions using the test data.
Key Words
Benford’s law (100) An observation about the frequency of leading digits in many real-life sets of
numerical data. The law states that in many naturally occurring collections of numbers, the significant
lending digit is likely to be small.
causal modeling (95) A data approach similar to regression, but used when it is hypothesized that the independent variables cause, or are associated with, the dependent variable.
classification (95) A data approach used to assign each unit in a population into a few categories
potentially to help with predictions.
clustering (94) A data approach used to divide individuals (like customers) into groups (or clusters)
in a useful or meaningful way.
co-occurrence grouping (94) A data approach used to discover associations between individuals
based on transactions involving them.
data reduction (94) A data approach used to reduce the amount of information that needs to be considered to focus on the most critical items (i.e., highest cost, highest risk, largest impact, etc.).
decision boundaries (104) Technique used to mark the split between one class and another.
decision tree (104) Tool used to divide data into smaller groups.
fuzzy match (102) A computer-assisted technique of finding matches that are less than 100 percent perfect by finding correspondences between portions of the text of each potential match.
link prediction (95) A data approach used to predict a relationship between two data items.
profiling (94) A data approach used to characterize the “typical” behavior of an individual, group, or
population by generating summary statistics about the data (including mean, standard deviations, etc.).
regression (95) A data approach used to estimate or predict, for each unit, the numerical value of
some variable using some type of statistical model.
similarity matching (95) A data approach used to identify similar individuals based on data known
about them.
structured data (98) Data that are organized and reside in a fixed field with a record or a file. Such
data are generally contained in a relational database or spreadsheet and are readily searchable by search
algorithms.
supervised approach/method (94) Approach used to learn more about the basic relationships
between independent and dependent variables that are hypothesized to exist.
support vector machine (106) A discriminating classifier that is defined by a separating
hyperplane that works first to find the widest margin (or biggest pipe).
training data (104) Existing data that have been manually evaluated and assigned a class, which
assists in classifying the test data.
test data (104) A set of data used to assess the degree and strength of a predicted relationship estab-
lished by the analysis of training data.
unsupervised approach/method (94) Approach used for data exploration looking for potential patterns of interest.
XBRL (102) (eXtensible Business Reporting Language) A global standard for exchanging financial
reporting information that uses XML.
ANSWERS TO PROGRESS CHECKS
1. a. Link prediction
b. Classification
c. Regression
2. In supervised learning, there is some idea of the basic relationships, either because of theory or because we have learned from our training data. In unsupervised learning, we are primarily exploring the data for potential patterns that might exist, which may ultimately turn into supervised learning.
3. Exhibit 3-3 suggests that the new observation would likely belong to the "delayed" vendors instead of the "on-time" vendors, based on the volume shipped, the distance, and where it appears relative to the other observations.
4. In some sense, profiling techniques to find criminals and accounting anomalies are
very similar. Profiling to find criminals often looks to the physical characteristics (race,
sex, mental state, etc.) to predict whether the person has or is likely to commit a crime
(and is illegal to use in some jurisdictions). Accounting looks to other, nonphysical
characteristics such as the amounts, totals, and types of expenditures to identify
potential anomalies.
5. A dollar store might sell everything for exactly $1.00. In that case, the sales amounts for any single product, or even for every product, would not follow Benford's law!
6. Data reduction may be used to filter out ordinary travel and entertainment expenses
so an auditor can focus on those that are potentially erroneous or fraudulent.
7. The XBRL tagging allows an analyst or decision maker to focus on one or a category
of expenses of most interest to a lender. For example, lenders might be most
interested in monitoring the amount of long-term debt, interest payments, and
dividends paid to assess if the borrower will be able to repay the loan. Using the
capabilities of XBRL, lenders could focus on just those individual accounts for further analysis.
8. We certainly could let the data speak and address this question directly. In general, when the health of the economy is stronger, there are fewer layoffs and fewer people out looking for a job, which means less turnover.
9. Chapter 1 illustrated that LendingClub collects the credit score data and the initial
analysis there suggested the higher the credit score the less likely to be rejected. Given
this evidence, we would predict a negative relationship between credit score and loans
that are rejected.
10. Three clusters of customers who might shop at Walmart could include thrifty shoppers (looking for the lowest price), shoppers looking to buy all of their household needs (both grocery and non-grocery items) in one place, and those customers who live close to the store (good location).
11. The longer time between the death and payment dates begs one to ask why it has
taken so long for payment to occur and if the interest required to be paid is likely
large. Because of these issues, there might be a possibility that the claim is
fraudulent or at least deserves a more thorough review to explain why there was such
a long delay.
2. Data that are organized and reside in a fixed field with a record or a file. Such data
are
generally contained in a relational database or spreadsheet and are readily
searchable by search algorithms. The term matching this definition is:
a. Training data.
b. Unstructured data.
c. Structured data.
d. Test data.
3. The observation about the frequency of leading digits in many real-life sets of numerical data is called:
a. Leading digits hypothesis.
b. Moore’s law.
c. Benford’s law.
d. Clustering.
4. Which approach to data analytics attempts to predict a relationship between two data
items?
a. Similarity matching
b. Classification
c. Link prediction
d. Co-occurrence grouping
5. In general, the more complex the model, the greater the chance of:
a. Overfitting the data.
b. Underfitting the data.
c. Pruning the data.
d. The need to reduce the amount of data considered.
6. In general, the simpler the model, the greater the chance of:
a. Overfitting the data.
b. Underfitting the data.
c. Pruning the data.
d. The need to reduce the amount of data considered.
7. __________ is a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.
a. Linear classifier
b. Support vector machine
c. Decision tree
d. Multiple regression
8. __________ mark (marks) the split between one class and another.
a. Decision trees
b. identified questions
c. Decision boundaries
d. Linear classifiers
9. Models associated with regression and classification data approaches have all except
this important part:
a. Identifying which variables (we'll call these independent variables) might help predict an outcome (we'll call this the dependent variable).
b. The functional form of the relationship (linear, nonlinear, etc.).
c. The numeric parameters of the model (detailing the relative weights of each of the
variables associated with the prediction).
d. Test data.
10. Which approach to data analytics attempts to assign each unit in a population into a
small set of classes where the unit belongs?
a. Classification
b. Regression
c. Similarity matching
d. Co-occurrence grouping
Discussion Questions
1. What is the difference between a target and a class?
2. What is the difference between a supervised and an unsupervised approach?
3. What is the difference between training datasets and test (or testing) datasets?
4. Using Exhibit 3-5 as a guide, what are three data approaches associated with the supervised approach?
5. Using Exhibit 3-5 as a guide, what are three data approaches associated with the unsupervised approach?
6. How might the data reduction approach be used in auditing?
7. How might classification be used in approving or denying a potential fraudulent credit
card transaction?
8. How is similarity matching different from clustering?
9. How does fuzzy match work? Give an accounting situation where it might be most
useful.
10. Compare and contrast the profiling data approach and the development of standard
cost for a unit of production at a manufacturing company. Are they substantially the
same, or do they have differences?
11. Exhibits 3-1, 3-2, 3-3, and 3-4 suggest that volume and distance are the best predictors of "days to ship" for a wholesale company. What other variables might also be useful in predicting the number of "days to ship"?
Problems
1. How could the fuzzy match be used to find undisclosed related party transactions that
might need to be disclosed?
2. An auditor is trying to figure out if the inventory at an electronics store chain is obsolete. What characteristics might be used to help establish a model predicting inventory obsolescence?
3. An auditor is trying to figure out if the goodwill its client recognized when it purchased
a factory has become impaired. What characteristics might be used to help establish
a model predicting goodwill impairment?
4. How might clustering be used to explain customers that owe us money (accounts
receivable)?
5. Why would the use of data reduction be useful to highlight related party transactions (e.g., the CEO has her own separate company that the main company does business with)?
6. How could an investor use XBRL to do an analysis of the industry’s inventory
turnover?
7. Name three accounts to which it would be appropriate and interesting to apply Benford's law in an audit. Why would an auditor choose those three accounts? When would a departure from Benford's law encourage the auditor to investigate further?
Answers to Multiple Choice Questions
1. D 2. C 3. C 4. C 5. A 6. B 7. B 8. C 9. D 10. A
Distance Formula
(Use 3959 as the first number for miles or 6371 for kilometers)
3959 * ACOS(
    SIN(RADIANS([Lat])) * SIN(RADIANS([Lat2])) +
    COS(RADIANS([Lat])) * COS(RADIANS([Lat2])) * COS(RADIANS([Long2]) - RADIANS([Long]))
)
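The same calculation can be checked outside a spreadsheet. The short Python sketch below is only an illustration (it is not part of any lab's required steps): it applies the same spherical-law-of-cosines formula, with the latitudes and longitudes passed in as decimal degrees and 3959 used as the Earth's radius in miles.

import math

def great_circle_miles(lat, long, lat2, long2):
    # Spherical law of cosines, mirroring the Excel formula above.
    lat, long, lat2, long2 = map(math.radians, (lat, long, lat2, long2))
    return 3959 * math.acos(
        math.sin(lat) * math.sin(lat2)
        + math.cos(lat) * math.cos(lat2) * math.cos(long2 - long)
    )

# Example with made-up coordinates for two locations.
print(round(great_circle_miles(36.06, -94.16, 38.63, -90.20), 1))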
Assign Classes
Take a moment to define your classes. You are trying to predict whether a given order shipment will be either "On-time" or "Delayed" based on the number of days it takes from the order date to the shipping date. What does "on-time" mean? Let's define "on-time" as an order that ships in 5 days or less and a "delayed" order as one that ships later than 5 days. You'll use this rule to add the class as a new attribute to each of your historical orders:
On-time = (Days to ship ≤ 5)
Delayed = (Days to ship > 5)
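As a purely illustrative sketch of the same rule outside a spreadsheet, the pandas snippet below adds the class attribute; the column name days_to_ship is an assumption. In Excel, an equivalent would be an IF formula such as =IF([@DaysToShip]<=5,"On-time","Delayed"), where the table column name is also an assumption.

import pandas as pd

# Assumed column name; your historical order data may label it differently.
orders = pd.DataFrame({"order_id": [101, 102, 103], "days_to_ship": [3, 5, 8]})
orders["class"] = orders["days_to_ship"].apply(
    lambda d: "On-time" if d <= 5 else "Delayed"
)
print(orders)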
CHAPTER 3 LABS
Lab 3-1 Data Reduction
Auditors use data reduction to focus their efforts on testing internal controls and limiting
their scope. For example, they may want to look only at transactions for a given year. In
this lab, you will learn to use filters in Excel and perform some fuzzy matches on vendor
and employee records, a common auditor analysis.
Company summary
These data are for a generic manufacturing company. You have been asked to see if there
are any potentially fictitious vendors or employees who may have created fake companies
in an effort to commit fraud.
Data
• Fuzzy.xlsx—contains employee and vendor data
Technique
• Some Excel experience is handy here. You will use tables, filters, and the Fuzzy
Lookup add-in.
Software needed
• Excel
• Fuzzy Lookup add-in: https://www.microsoft.com/en-us/download/details.aspx?id=15011
Employees
EmployeeID
EmployeeFirstName
EmployeeLastName
EmployeeGender
EmployeeHireDate
EmployeeStreetAddress
EmployeeCity
EmployeeState
EmployeeZip
EmployeePhone
Vendors
VendorID
VendorName
VendorType
VendorSince
VendorContact
VendorBillingAddress
VendorBillingCity
VendorBillingState
VendorBillingZip
VendorBillingPhone
Your first step is to understand the data and prepare them in Excel to perform some matching.
1. Open Fuzzy.xlsx in Excel.
2. Quickly browse through the worksheets to ensure that they are complete.
3. Go to the Employees tab and click any data element.
4. Select the entire data table (Ctrl + A).
5. Go to the Home tab, Styles section, and click Format as Table. Any style will do.
6. In the Format As Table box that appears, make sure the My table has headers box
is checked, and click OK.
7. In the Table Tools > Design tab, under Properties, change the table name from
Table1 to Employees.
8. Now go to the Vendors tab and click any data element. Repeat steps 4–7 and name
the new table “Vendors”.
9. Take a screenshot of either table (label it 3-1A).
10. Save your file as Fuzzy-Tables.xlsx.
Tool: Filtering
Excel Filters allow you to quickly find data with common attributes and help to limit the
scope of your analysis. Assume that the auditors have analyzed all vendors prior to 2019
and have resolved any outstanding issues. By analyzing only the vendors from 2019, you
avoid unnecessary analysis and reduce the time it will take for the computer to run the
analysis.
11. Open Fuzzy-Tables.xlsx and click the Vendors worksheet.
12. Click the drop-down arrow next to VendorSince to show filtering options,
shown below.
13. To select only 2019 records, uncheck Select All and then check the box next to 2019.
14. Select the table and headers (Ctrl + A twice) and copy the values (Ctrl + C).
15. Create a new worksheet tab called “Vendors2019” and paste the filtered values there.
16. Select your new table and format it as a table called “Vendors2019”.
17. Take a screenshot (label it 3-1B).
18. Save your file as Fuzzy-Tables-2019.xlsx.
30. Format the output as a table named FuzzyMatch, then filter out any records with 0.0000 Similarity.
Q4. How many vendors have similar addresses to employees?
Q5. What do you notice about the vendor and employee street addresses?
Q6. Are there any false positives (fuzzy matches that aren't really matches)?
31. Take a screenshot (label it 3-1C).
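To see why a fuzzy match flags near-identical records, here is a rough Python sketch of the underlying idea. It is not the algorithm the Fuzzy Lookup add-in uses, and the sample addresses are made up; it simply scores string similarity so that close-but-not-exact vendor and employee addresses stand out for review.

from difflib import SequenceMatcher

employee_addresses = ["123 Main St", "45 Oak Avenue"]
vendor_addresses = ["123 Main Street", "900 Industrial Pkwy"]

for emp in employee_addresses:
    for ven in vendor_addresses:
        score = SequenceMatcher(None, emp.lower(), ven.lower()).ratio()
        if score > 0.7:  # flag near matches for manual review
            print(f"Possible match ({score:.2f}): {emp} <-> {ven}")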
End of Lab
Lab 3-2 Regression
Company summary
The data used are a subset of the College Scorecard dataset that is provided by the U.S. Department of Education. These data provide federal financial aid and earnings information, insights into the performance of schools eligible to receive federal financial aid, and the outcomes of students at those schools. You can learn more about how the data are used and view the raw data yourself at https://collegescorecard.ed.gov/data/. However, for this lab, you should use the text file provided to you.
Data
• CollegeScorecard Datasets: CollegeScorecard_CleanedData from Lab 2-5
Technique
• Some experience with Excel is useful for this lab.
Software needed
• Excel
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
4. Select the entire set of data that is associated with the response variable for the Y
range, then select the entire set of data that is associated with the explanatory
variable for the X range.
5. If you selected the labels in your ranges, place a checkmark in the box next to
Labels.
6. Click OK. This will run the regression test and place the output on a new
spreadsheet in your Excel workbook.
7. Take a screenshot of your regression output (label it 3-2A).
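For readers who want to reproduce the regression outside the Excel Analysis ToolPak, the sketch below uses statsmodels (an assumption, not part of the lab). The x and y arrays stand in for whichever explanatory and response columns you chose from the College Scorecard data; the numbers shown are placeholders.

import numpy as np
import statsmodels.api as sm

x = np.array([10.0, 12.0, 15.0, 20.0, 22.0])   # explanatory variable (X range)
y = np.array([31.0, 36.0, 44.0, 58.0, 63.0])   # response variable (Y range)

model = sm.OLS(y, sm.add_constant(x)).fit()    # add_constant supplies the intercept
print(model.summary())                         # slope, intercept, R-squared, t statistics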
End of Lab
Lab 3-3 Classification
Company summary
LendingClub is a peer-to-peer marketplace where borrowers and investors are matched together. The goal of LendingClub is to reduce the costs associated with these banking transactions and make borrowing less expensive and investment more engaging. LendingClub provides data on loans that have been approved and rejected since 2007, including the assigned interest rate and type of loan. This provides several opportunities for data analysis.
Data
• LendingClub datasets: LendingClub-Classification
Software needed
• Excel
• Weka—available at www.cs.waikato.ac.nz/ml/weka
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
LAB TABLE 3-3A LoanStatsXXXX.csv
Attribute Description
id Loan identification number
member_id Membership id
loan_amnt Requested loan amount
emp_length Employment length
issue_d Date of loan issue
loan_status Fully paid or charged off
pymnt_plan Payment plan: yes or no
purpose Loan purpose: e.g., wedding, medical, debt_consolidation, car
zip_code The first three digits of the applicant’s zip code
addr_state State
dti Debt-to-income ratio
delinq_2y Late payments within the past two years
earliest_cr_line Oldest credit account
inq_last_6mnths Credit inquiries in the past 6 months
open_acc Number of open credit accounts
revol_bal Total balance of all credit accounts
revol_util Percentage of available credit in use
total_acc Total number of credit accounts
application_type Individual or joint application
LAB TABLE 3-3B RejectStats.csv
Attribute Description
Amount Requested Requested loan amount
Application Date Date of loan application
Loan Title Brief description of loan purpose
Risk_Score LendingClub’s calculated value
Debt-To-Income Ratio Debt-to-income ratio
Zip Code The first three digits of the applicant’s zip code
State State
Employment Length Employment length
Policy Code Internal number
Q6. What does the lack of attributes in the RejectData files tell us about the
data that LendingClub retains on rejected loans?
Q7. How will that affect a classification analysis?
We will need to convert the data into a useful format before we can perform any analysis.
We need to generate two sets of data, one for classification and one for regression and
clustering.
Cleaning the Data for Classification
Goal: Combine approved and rejected data for a given year, assign a class to each record.
Issues
• Approved and rejected loans contain different data attributes.
• Date data values are recorded in different formats (1/9/2009 vs. Jan-2009).
• Years of employment contain text values and should be numbers.
In Excel
1. Select a year you would like to analyze between 2007 and 2012.
2. Create a new spreadsheet.
3. Type the common attributes from Table 3-3C into the first row.
4. Open the LoanStats and RejectStats for your chosen year.
5. Delete all columns that don’t match those listed in Table 3-3C.
6. Use the =MONTH formula to extract the month from the date.
7. Copy the Month column and Paste Special > Values into the Date column.
8. Add a new column for the class and add REJECT to the rejected loans and APPROVE to the approved loans.
9. Copy and paste the values for your chosen year from each .csv file into your
new spreadsheet.
10. Find and replace the employment values using Lab Table 3-3D.
LAB TABLE 3-3D Original Value New Value
na 0
< 1 year 0
1 year 1
2 years 2
3 years 3
4 years 4
5 years 5
6 years 6
7 years 7
8 years 8
9 years 9
10+ years 10
, (comma) (blank)
11. Save your file as LoanClassificationXXXX.csv, replacing XXXX with your year. Be sure
to choose .csv as the file type.
12. Take a screenshot (label it 3-3A).
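The same cleaning can also be scripted. The pandas sketch below is only a cross-check on the Excel steps above; the file names, the chosen year (2012), and the exact column names are assumptions, and the employment-length mapping mirrors Lab Table 3-3D.

import pandas as pd

# Mapping from Lab Table 3-3D (text employment lengths to numbers).
emp_map = {"na": 0, "< 1 year": 0, "10+ years": 10}
emp_map.update({("1 year" if n == 1 else f"{n} years"): n for n in range(1, 10)})

approved = pd.read_csv("LoanStats2012.csv")
rejected = pd.read_csv("RejectStats2012.csv")

# Dates arrive in different formats (1/9/2009 vs. Jan-2009); keep only the month.
approved["month"] = pd.to_datetime(approved["issue_d"]).dt.month
rejected["month"] = pd.to_datetime(rejected["Application Date"]).dt.month

# Text employment lengths become numbers.
approved["emp_years"] = approved["emp_length"].map(emp_map)
rejected["emp_years"] = rejected["Employment Length"].map(emp_map)

# Assign the class and stack the two files (for simplicity this keeps all columns;
# in the lab you keep only the common attributes from Table 3-3C).
approved["class"] = "APPROVE"
rejected["class"] = "REJECT"
combined = pd.concat([approved, rejected], ignore_index=True)
combined.to_csv("LoanClassification2012.csv", index=False)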
17. Click Classify.
18. Run each of the following classification models:
a. Weka > Trees > Random Forest.
b. Weka > Meta > AdaBoostM1.
c. Weka > Functions > Logistic.
d. Weka > Bayes > BayesNet.
Q8. Which model has the highest accuracy? How do you know?
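If you would like to see the same comparison outside of Weka, the scikit-learn sketch below mirrors the idea behind Q8: train several classifiers on the same attributes and compare their cross-validated accuracy. The file name, the chosen year, and the numeric attribute names are assumptions, and GaussianNB is only a loose stand-in for Weka's BayesNet.

import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv("LoanClassification2012.csv")
X = data[["loan_amnt", "emp_years", "month"]].fillna(0)   # assumed numeric attributes
y = data["class"]                                         # APPROVE / REJECT

models = {
    "Random Forest": RandomForestClassifier(),
    "AdaBoostM1-style": AdaBoostClassifier(),
    "Logistic": LogisticRegression(max_iter=1000),
    "BayesNet-style": GaussianNB(),
}
for name, model in models.items():
    accuracy = cross_val_score(model, X, y, cv=10).mean()  # 10-fold, like Weka's default
    print(f"{name}: {accuracy:.3f}")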
End of Lab
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/enterprise/. Your instructor will either give you specific instructions on how to access the data, or there will be information available on Connect. The 2016 Dillard's data cover all transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
LAB EXHIBIT 3-4A Common Attributes from the STORE and TRANSACT Tables of 2016
Dillard’s Data (http://walton.uark.edu/enterprise/dillardshome.php)
Source: http://walton.uark.edu/enterprise/dillardshome.php
Attribute Description Values
SKU Stock Keeping Unit number of the stock item 4757355, 2128748, . . .
Store Store Number 2, 3, 4, 100
Register Register Number of the Current Transaction 580, 30, 460, . . .
TranCode (or Trannum) Transaction Code 09700, 018000
Saledate Sale date of the Item Stock 2005-01-20, 2005-06-02, . . .
Seq Sequence Number 298100028, 213500030, . . .
Interid Internal ID 265005802, 671901998, . . .
Stype Type of Transaction (Return or Purchase) P, R
Quantity Item Quantity of the Transaction 1, 2, 3, 4, . . .
OrgPrice Original price of the item stock 75.00, 44.00, . . .
SPrice Sale price of the item stock 26.25, 65.00, . . .
Amt Total amount of the transaction charged to the customer 26.25, 44.00, . . .
Mic Master Item Code 862, 689, . . .
City City where the store is located St. Louis, Tampa, . . .
State State where the store is located FL, MO, AR, . . .
Zip Zip code 33710, 63126, . . .
1. Run the following SQL query on Microsoft SQL Server Management Studio
to address the question regarding which state had the highest transaction
balance. (Recall that transaction is defined for each individual item
purchased.)
SELECT STATE, AVG(TRAN_AMT) AS Average
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
GROUP BY STATE
The output should look like this:
AL 27.992390 MS 28.338400
AR 41.379066 MT 28.941823
AZ 27.845655 NC 25.576096
CA 28.315362 NE 26.904771
CO 27.297332 NM 28.826383
FL 28.760791 NV 30.021116
GA 27.270740 NY 21.757447
IA 24.879376 OH 26.432211
ID 29.408952 OK 29.088865
IL 24.787586 SC 28.241007
IN 26.066528 TN 29.178345
KS 27.771021 TX 29.477805
KY 28.206677 UT 25.254111
LA 30.282367 VA 26.500511
MO 25.546692 WY 26.429770
Part 3: Perform an Analysis of the Data
2. Take a screenshot of your results (label it 3-4A).
Noting that Arkansas (State =‘AR’) has the highest transaction balance, let’s
address our second question: “Do customers in the state with the highest transaction
balances have a significantly higher transaction balance from September 1, 2016, to
September 15, 2016, than all other states?”
3. To address Q2, run the following SQL query to extract the data needed for
additional analysis. You can do this analysis in SQL Server or you can do it in
Excel:
SELECT TRANSACT.*, STORE.STATE
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160915'
ORDER BY TRAN_DATE
4. If you choose to do the SQL query in Excel, here are the steps:
Create query through Excel 2016
4.1 Data tab > New Query > From Database > From SQL Server Database
4.2 Enter the Server (essql1.walton.uark.edu) and the Database
(UA_Dillards_2016) – not case-sensitive
Click Advanced options to input the query text:
SELECT TRANSACT.*, STORE.STATE
FROM TRANSACT
INNER JOIN STORE
ON TRANSACT.STORE = STORE.STORE
WHERE TRAN_DATE BETWEEN '20160901' AND '20160915'
ORDER BY TRAN_DATE
Source: Microsoft Excel 2016
(If you have an error, click Edit to return to your query and resolve the error; an error message of this kind typically indicates that there is a typo in the State column name.)
4.5 From the Data preview screen, you can click Load to immediately load the
dataset into your Excel workbook.
5. Once the data are in Excel, you'll need to transform the State data to perform regression analysis on the state of Arkansas to address Q2. To do so, make a new column just to the right of the existing dataset and label it Arkansas-dummy in the header row. Write the formula =IF([@STATE]="AR",1,0) in each row. It will assign a value of 1 to transactions at stores in Arkansas and a value of 0 to transactions at stores outside of Arkansas. Copy this formula all the way down to cover each row.
6.3 Reference the cells that contain the Tran_AMT in the Input Y Range and
Arkansas-dummy in the Input X Range and then click OK.
Source: Microsoft Excel 2016
6.4 Your output should look like the screenshot below. The t Stat greater than
2.0 suggests that the transaction amount (Tran_Amt) is statistically greater in
Arkansas than in all other states.
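The same test can be reproduced outside Excel. The statsmodels sketch below is only an illustration (the file name for the exported September extract is an assumption); it regresses the transaction amount on the Arkansas dummy and prints the coefficient and its t statistic.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dillards_sept_1_15_2016.csv")             # export of the SQL extract (name assumed)
df["arkansas_dummy"] = (df["STATE"] == "AR").astype(int)    # 1 for Arkansas stores, 0 otherwise

fit = smf.ols("TRAN_AMT ~ arkansas_dummy", data=df).fit()
print(fit.summary())   # a t stat above roughly 2.0 on arkansas_dummy supports the conclusion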
September 1, 2016, to September 15, 2016? Because we found that transactions in
Arkansas are statistically higher than all other states, we will include that finding
in our analysis as well, making this a multivariate regression.
8. To address this question we need to transform the Online variable into an Online-dummy variable. The Online variable carries values of "Y" for online and "N" for not online. To do our analysis, we will transform this into a dummy variable that allows statistical analysis. Dummy variables carry the value of "1" or "0". We transform it in the following way:
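The transformation is analogous to the Arkansas-dummy column: for example, a formula such as =IF([@ONLINE]="Y",1,0) in a new Online-dummy column would assign 1 to online transactions and 0 to the rest (the exact column name in your extract is an assumption).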
Once this is complete, copy the calculation for all cells in the column.
We’re now ready for regression analysis. Reference the cells that contain the
Tran_AMT in the Input Y Range and Arkansas-dummy and reference Online-dummy in
the Input X Range and click OK.
The results of the regression analysis, suggesting that both transactions in Arkansas and transactions done online are associated with greater transaction amounts, are below.
End of Lab
LAB EXHIBIT 3-5B
Source: Microsoft Excel 2016
Q3. Why did we also include Arkansas state sales and online sales as other explanatory variables (X- or independent variables) in this regression analysis? Are these results still significant after the inclusion of the use of the Dillard's credit card?
Q4. Are there any other data from the TRANSACT table that might help us
predict the transaction amount?
Q5. If we had any other data to predict transaction amount, what would you use?
Brainstorm freely to come up with what could explain these different levels
of transaction amounts!
End of Lab
Chapter 4
Visualization: Using Visualizations
and Summaries to Share Results
with Stakeholders
A Look Back
In chapter 3, we considered various models and techniques used for data analytics and discussed when to use them and how to interpret the results. We also provided specific accounting-related examples of when each of these specific data approaches and models is appropriate to address our particular question.
A Look Ahead
Because most of the focus of data analytics in accounting is on auditing, chapter 5 considers how both internal
and external auditors are using technology in general—and audit analytics specifically—to evaluate firm data and
generate support for management assertions. We emphasize audit working papers, audit planning, continuous
monitoring, and continuous data assurance.
Before the 2016 presidential election, almost all polls predicted a Hillary Clinton win. But many of those polls
assumed that Hillary Clinton would receive the same support and passion from Obama’s 2012 supporters, which
turned out not to be the case.
Exhibit 4-1 shows the 2016 election results relative to the polling predictions prior to the election. This one
graph pretty much encapsulates why Donald Trump won and Hillary Clinton lost. And that is the focus of the
chapter: how we capture and communicate information to better understand a good or a bad decision. As noted in
the chapter, data are important, and Data Analytics are effective, but they are only as important and effective as we can communicate and make the data understandable.
Source: http://fivethirtyeight.com/features/why-fivethirtyeight-gave-trump-a-better-chance-than-almost-anyone-else/
(accessed August 3, 2017).
EXHIBIT 4-1
OBJECTIVES
After reading this chapter, you should be able to:
LO 4-1 Determine the purpose of your data visualization
LO 4-2
Data are important, and Data Analytics are effective, but they are only as important and effective as we can communicate and make the data understandable. One of the authors often asks her students what they would do if they were interns and their boss asked them to supply information regarding in which states all of the customers her organization served were located. Would they simply point their boss to the Customers table in the sales database? Would they go a step further and isolate the attributes to the Company Name and the State? Perhaps they could go a step further and run a quick query or PivotTable to perform a count on the number of customers in each different state that the company serves. If they were to give their boss what she actually wanted, however, they should provide a short written summary of the answer to the research question, as well as an organized chart to visualize the results. Data visualization isn't just for people who are "visual" learners. When the results of data analysis are visualized appropriately, the results are made easier and quicker to interpret for everybody. Whether the data you are analyzing are "small" data or "big" data, they still merit synthesis and visualization to help your stakeholders interpret the results with ease and efficiency.
Think back to some of the first data visualizations and categorizations you were
exposed to (the food guide pyramid/food plate, the animal kingdom, the periodic table)
and, more modernly, how frequently infographics are applied to break down a series of
complicated information on social media. These charts and infographics make it easier
for people to understand difficult concepts by breaking them down into categories and
visual components.
LO 4-1 Determine the purpose of your data visualization
DETERMINE THE PURPOSE OF YOUR DATA VISUALIZATION
As with selecting and refining your analytical model, communicating results is more art
than science. Once you are familiar with the tools that are available, your goal should
always be to share critical information with stakeholders in a clear, concise manner. This
could involve a chart or graph, a callout box, or a few key statistics. Visualizations have
become very popular over the past three decades. Managers use dashboards to quickly
evaluate key performance indicators (KPIs) and quickly adjust operational tasks; analysts
use graphs to plot stock price and financial performance over time to select portfolios that
meet expected performance goals.
In any project that will result in a visual representation of data, the first charge is ensuring that the data are reliable and that the content necessitates a visual. In our case, however, ensuring that the data are reliable and useful has already been done through the first three steps of the IMPACT model.
At this stage in the IMPACT model, determining the method for communicating your
results requires the answers to two questions:
1. Are you explaining the results of previously done analysis, or are you exploring the
data through the visualization? (Is your purpose declarative or exploratory?)
2. What type of data is being visualized (conceptual, qualitative data or data-driven, quantitative data)?
Scott Berinato, senior editor at Harvard Business Review, summarizes the possible answers to these questions1 in a chart shown in Exhibit 4-2. The majority of the work that we will do with the results of data analysis projects will reside in quadrant 2 of Exhibit 4-2, the declarative, data-driven quadrant. We will also do a bit of work in Exhibit 4-2's quadrant 4, the data-driven, exploratory quadrant. There isn't as much qualitative work to be done,
although we will work with categorical qualitative data occasionally. When we do work
with
qualitative data, it will most frequently be visualized using the tools in quadrant 1, the
declarative, conceptual quadrant.
EXHIBIT 4-2 The Four Chart Types
[2×2 matrix: the vertical axis runs from Declarative (top) to Exploratory (bottom); the horizontal axis runs from Conceptual (Qualitative) to Data-driven (Quantitative); the quadrants are numbered 1 through 4.]
Source: S. Berinato, Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations (Boston: Harvard Business Review Press, 2016).
Once you know the answers to the two key questions and have determined which quadrant you're working in, you can determine the best tool for the job. Is a written report with a simple chart sufficient? If so, Word or Excel will suffice. Will an interactive dashboard and repeatable report be required? If so, Tableau may be a better tool. Later in the chapter, we will discuss these two tools in more depth, along with when they should be used.
number of items in a particular category, then dividing that number by the total number of observations. For example, if I had a dataset of 150 people and had each individual's corresponding hair color, with 25 people in my dataset having red hair, I could calculate the proportion of red-haired people in my dataset by dividing 25 (the number of people with red hair) by 150 (the total number of observations in my dataset). The proportion of red-haired people, then, would be 16.7 percent.
Qualitative data (both nominal and ordinal) can also be referred to as “conceptual”
data because such data are text-driven and represent concepts instead of numbers.
Quantitative data are more complex than qualitative data because not only can they be counted and grouped just like qualitative data, but the differences between each data point are meaningful—when you subtract 4 from 5, the difference is a numerical measure that can be compared to subtracting 3 from 5. Quantitative data are made up of observations that are numerical and can be counted and ranked, just like ordinal qualitative data, but that can also be averaged. A standard deviation can be calculated, and datasets can be easily compared when standardized (if applicable). Chapter 3 mentions the concept of the normal distribution in the context of profiling in continuous auditing. The normal distribution is a phenomenon that many naturally occurring datasets in our world follow, such as SAT scores and the heights and weights of newborn babies. For a distribution of data to be considered normal, the data should have equal median, mean, and mode, with half of the observations falling below the mean and the other half falling above the mean. If you are comparing two datasets that follow the normal distribution, even if the two datasets have very different means, you can still compare them by standardizing the distributions with Z-scores. By using a formula, you can transform every normal distribution into a special case of the normal distribution called the standard normal distribution, which has 0 for its mean (and thus, for its mode and median, as well) and 1 for its standard deviation. The benefit of standardizing your data is that you are no longer comparing wildly different numbers and trying to eyeball how one observation differs from the other—if you standardize both datasets, you can place both distributions on the same chart and more swiftly come to your insights.
Similar to qualitative data, quantitative data can be categorized into two different types: interval and ratio. However, there is some dispute among the analytics community on whether the difference between the two datasets is meaningful, and for the sake of the analytics and calculations you will be performing, the difference is not pertinent. Ratio data are considered the most sophisticated type of data, and the simplest way to express the difference between interval and ratio data is that ratio data have a meaningful 0 and interval data do not. In other words, for ratio data, when a dataset approaches 0, 0 means "the absence of." Consider money as ratio data—we can have 5 dollars, 72 dollars, or 8,967 dollars, but as soon as we reach 0, we have "the absence of" money.
The other scale for quantitative data is interval data, which are not as sophisticated
as ratio data. Interval data do not have a meaningful 0; in other words, in interval data,
0 does not mean “the absence of” but is simply another number. An example of interval
data is the Fahrenheit scale of temperature measurement, where 90 degrees is hotter than
70 degrees, which is hotter than 0 degrees, but 0 degrees does not represent “the absence
of” temperature—it’s just another number on the scale.
Quantitative data can be further categorized as either discrete or continuous data.
Discrete data are data that are represented by whole numbers. An example of discrete
data is points in a basketball game—you can earn 2 points, 3 points, or 157 points, but
you cannot earn
3.5 points. On the other hand, continuous data are data that can take on any value within
a range. An example of continuous data is height: you can be 4.7 feet, 5 feet, or 6.27345
feet. The difference between discrete and continuous data can be blurry sometimes
because you can express a discrete variable as continuous—for example, the number of
children a person
can have is discrete (a woman can’t have 2.7 children, but she could have 2 or 3), but if
you
are researching the average number of children that women aged 25–40 have in the
United
States, the average would be a continuous variable. Whether your data are discrete or
continuous can also help you determine the type of chart you create because continuous
data lend themselves more to a line chart than do discrete data.
[Diagram: static charts, interactive charts, and working sessions shown for both conceptual (qualitative) and data-driven (quantitative) data.]
PROGRESS CHECK
1. What are two ways that complicated concepts were explained to you via categorization and data visualization as you were growing up?
2. Using the Internet or other resources (other textbooks, a newspaper, or a magazine), identify an example of a data visualization for each possible quadrant.
3. Identify which type of data scale the following variables are measured on (qualitative nominal, qualitative ordinal, or quantitative):
a. Instructor evaluations in which students select excellent, good, average, or
poor.
b. Weekly closing price of gold throughout a year.
c. Names of companies listed on the Dow Jones Industrial Average.
d. Fahrenheit scale for measuring temperature.
EXHIBIT 4-4 Pie Charts and Column Chart Show Different Ways to Visualize Proportions
[Column chart of the percentage of sales for each product: Imperial Stout, IPA, Stout, Pale Ale, Wheat, and Imperial IPA.]
EXHIBIT 4-5
[Pie chart of the same product proportions: Imperial Stout, IPA, Stout, Pale Ale, Wheat, and Imperial IPA.]
The same set of data could also be represented in a stacked bar chart or a 100 percent
stacked bar chart (Exhibit 4-6). This chart is not a default option in Excel, but it does
work in another data visualization tool that we introduce later in this chapter, Tableau.
The first figure in Exhibit 4-6 is a stacked bar chart, which shows the proportion of each type of beer sold expressed in the number of beers sold for each product, while the latter shows the proportion expressed in terms of percentage of the whole in a 100 percent stacked bar chart.
While bar charts and pie charts are among the most common charts used for
qualitative data, there are several other charts that function well for showing proportions:
• Tree maps and heat maps: These are similar types of visualizations, and they both use size and color to show proportional size of values. While tree maps show proportions using physical space, heat maps use color to highlight the scale of the values. However, both are heavily visual, so they are imperfect for situations where precision of the numbers or proportions represented is necessary.
• Symbol maps: Symbol maps are geographic maps, so they should be used
when expressing qualitative data proportions across geographic areas such as
states or countries.
• Word clouds: If you are working with text data instead of categorical data, you can represent them in a word cloud. Word clouds are formed by counting the frequency of each word mentioned in a dataset; the higher the frequency (proportion) of a given word, the larger and bolder the font will be for that word in the word cloud. Consider analyzing the results of an open-ended response question on a survey; a word cloud would be a great way to quickly spot the most commonly used words to tell if there is a positive or negative feeling toward what's being surveyed. There are also settings that you can put into place when creating the word cloud to leave out the most commonly used English words—such as the, an, and a—in order to not skew the data (a small counting sketch follows this list). Exhibit 4-7 is an example of a word cloud for the text of chapter 2 from this textbook.
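The counting behind a word cloud can be sketched in a few lines of Python (an illustration only; the sentence and the stop-word list are made up): tally word frequencies and drop common English words so they do not dominate the picture.

from collections import Counter

text = "the data are loaded and the data are cleaned before the analysis of the data"
stop_words = {"the", "a", "an", "and", "are", "of", "before"}

counts = Counter(word for word in text.lower().split() if word not in stop_words)
print(counts.most_common(3))   # the most frequent words become the biggest, boldest words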
EXHIBIT 4-6
[Stacked bar chart and 100 percent stacked bar chart of the number of records sold by product description: Imperial Stout, IPA, Stout, Pale Ale, Wheat, and Imperial IPA.]
EXHIBIT 4-7 Word Cloud Example from chapter 2 Text
There are many different methods for visualizing quantitative data. With the exception of the word cloud, all of the methods mentioned in the previous section for qualitative data can work for depicting quantitative data, but the following charts can depict more complex data:
• Line charts: Show similar information to what a bar chart shows, but line charts are good for showing data changes or trend lines over time. Line charts are useful for continuous data, while bar charts are often used for discrete data. For that reason, line charts are not recommended for qualitative data, which by nature of being categorical, can never be continuous.
• Box and whisker plots: Useful for when quartiles, median, and outliers are required
for analysis and insights.
• Scatter plots: Useful for identifying the correlation between two variables or for identifying a trend line or line of best fit.
• Filled geographic maps: As opposed to symbol maps, a filled geographic map is
used to illustrate data ranges for quantitative data across different geographic areas
such as states or countries.
A summary of the chart types just described appears in Exhibit 4-8. Each chart option works equally well for exploratory and declarative data visualizations. The chart types are categorized based on when they will be best used (e.g., when comparing qualitative variables, a bar chart is an optimal choice), but this figure shouldn't be used to stifle creativity—bar charts can also be used to show comparisons among quantitative variables, just as many of the charts in the listed categories can work well with other datatypes and purposes than their primary categorization below.
EXHIBIT 4-8 Summary of Chart Types
Conceptual (Qualitative)
  Comparison: Bar chart, Pie chart, Stacked bar chart, Tree map, Heat map
  Geographic data: Symbol map
Data-Driven (Quantitative)
  Outlier detection: Box and whisker plot
  Relationship between two variables: Scatter plot
  Geographic data: Filled map
As with selecting and refining your analytical model, communicating results is more
art than science. Once you are familiar with the tools that are available, your goal should
always be to share critical information with stakeholders in a clear, concise manner.
While visualizations can be incredibly impactful, they can become a distraction if you’re
not careful. For example, bar charts can be manipulated to show a bias and, while novel,
EXHIBIT 4-9 Gartner Magic Quadrant for Business Intelligence and Analytics Platforms
Source: R. L. Sallam, C. Howson, C. J. Idoine, T. W. Oestreich, J. L. Richardson, and J. Tapadinhas, “Magic Quadrant for Business Intelligence and Analytics Platforms,” Gartner RAS Core Research Notes, Gartner, Stamford, CT (2017).
Based on Gartner’s quadrant, it is easy to see that Tableau and Microsoft are two of
the best and most popular options available, and these are the two tools that we will focus
on as well. The Microsoft tool that Gartner analyzed and compared with the other products is not just Excel; it includes the entire Microsoft BI suite, of which Excel is only a part. We will focus on Excel as the main driver of the Microsoft toolkit in this text.
Tableau is ranked slightly higher than Microsoft on its ability to execute, while Microsoft
is ranked slightly higher than Tableau in completeness of vision. This distinction makes
sense because Tableau is a newer product and has placed the majority of its focus on data
visualization, while Microsoft Excel has a much more robust platform for data analysis.
Excel’s biggest advantage over Tableau (and over any other data visualization software in
the market) is its ubiquity. Excel has been on the market longer than any of its
competitors, and it is rare to find a business or university that doesn’t have a version of
Excel on every computer. If your data analysis project is more declarative than exploratory, it
is more likely that you will perform your data visualization to communicate results in Excel,
simply because it is likely that you
performed steps 2 through 4 in Excel, and it is convenient to create your charts in the same tool in which you performed your analysis.
Tableau earns high praise for being intuitive and easy to use, which makes it ideal for exploratory data analysis. You may even find that you would prefer to immediately load your data from Excel or Access (or wherever your data are stored) into Tableau during the second step of the IMPACT model and work on your analysis inside the tool, instead of waiting for step 5 to just communicate your results through Tableau. If your question isn't fully defined or specific, exploring your dataset in Tableau and changing your visualization type to discover different insights is as much a part of performing data analysis as crafting your communication. One of the biggest disadvantages to Tableau is its cost, but fortunately, Tableau is a tremendous supporter of education, and as a student, you can download a free academic license to use Tableau on your PC or Mac. The link to download your free license of Tableau is: https://www.tableau.com/academic/students. Once you have downloaded your license, we recommend opening the Superstore sample workbook provided. You will find it at the bottom of the start screen under "Sample workbooks" (Exhibit 4-10).
Once you open the workbook, you will see a variety of tabs at the bottom of the workbook that you can page through and see different ways that the same dataset can be analyzed and visualized. When you perform exploratory analysis in Tableau, or even if you have already performed your analysis and you have uploaded the dataset into Tableau to communicate insights, we recommend trying several different types of charts to see which one makes your insights stand out the most effectively. In the top right corner of the Tableau workbook, you will see the Show Me window, which provides different options for visualizing your dataset (Exhibit 4-11).
EXHIBIT 4-10 ©Tableau Software, Inc. All rights reserved.
EXHIBIT 4-11 ©Tableau Software, Inc. All rights reserved.
In the Show Me tab, only the visualizations that will work for your particular dataset
will appear in full color.
In this chart, the Daily Mail, a UK-based newspaper, tries to emphasize an upgrade in the estimated growth of the British economy. The estimate from the Office for National Statistics indicated that Q4 growth would be 0.7 percent instead of 0.6 percent (a relatively small increase of about 15 percent). Yet the visualization makes it appear as if this is a 200 percent increase because of the scale the newspaper chose. The other obvious issue is that some time has passed between the estimates, and we don't see that disclosed here (Exhibit 4-12).
[EXHIBIT 4-12: Daily Mail chart plotting the first estimate (0.60%) and the upgraded estimate (0.70%) on a truncated vertical axis.]
If we reworked the data points to show the correct scale (starting at 0 instead of 0.55) and the change over time (plotting the data along the horizontal axis), we'd see something like Exhibit 4-13. If we wanted to emphasize growth, we might choose a chart like Exhibit 4-14. Notice that both new graphs show an increase that is less dramatic and less confusing.
[EXHIBIT 4-13: Column chart of the first and second estimates plotted on a vertical axis that starts at 0.]
See Exhibit 4-15. Is a pie chart really the best way to present these data?
[EXHIBIT 4-14: Chart plotting the change from the old estimate to the new estimate over time to emphasize growth.]
[EXHIBIT 4-15: Pie chart of attacks by individual user, with a legend grouping users as Assistant, Researcher, or Administrator.]
If you want to emphasize users, consider a rank-ordered bar chart like Exhibit 4-16.
To emphasize the category, a comparison like that in Exhibit 4-17 may be helpful. Or to
show proportion, maybe a stacked bar (Exhibit 4-18). In any case, there are much better
ways to clearly communicate.
EXHIBIT 4-16 This rank-ordered bar chart is more clear.
[Bar chart of attacks per user, sorted from most to fewest, with users colored by job function (Assistant, Researcher, Administrator).]
EXHIBIT 4-17 This bar chart emphasizes attacks by job function.
[Column chart of total attacks for Assistant, Researcher, and Administrator.]
EXHIBIT 4-18 This stacked bar chart emphasizes proportion of attacks by job function.
[Single stacked column showing the share of attacks attributable to Assistant, Researcher, and Administrator.]
PROGRESS CHECK
4. The following two charts represent the exact same data—the quantity of beer sold on each day in the Sláinte Sales Subset dataset. Which chart is more appropriate for working with dates, the column chart or the line chart? Which do you prefer? Why?
a.
b.
5. The same dataset was consolidated into quarters. This chart was made with the chart wizard feature in Excel, which made it easy to create, but something went wrong. Can you identify what went wrong with this chart?
6. The following four charts represent the exact same data—the quantity of each beer sold. Which do you prefer, the line chart or the column chart? Whichever you chose, line or column, which of the pair do you think is the easiest to digest?
a.
b.
c.
d.
Color
Similar to how Excel and Tableau have become stronger tools at picking appropriate data
scales and increments, both Excel and Tableau will have default color themes when you
begin creating your data visualizations. You may choose to customize the theme.
However, if you do, here are a few points to consider:
• When should you use multiple colors? Using multiple colors to differentiate types
of data is effective. Using a different color to highlight a focal point is also
effective.
However, don’t use multiple colors to represent the same type of data. Be careful to
not use color to make the chart look pretty—the point of the visualization is to
showcase insights from your data, not to make art.
• We are trained to understand the differences among red, yellow, and green, with red meaning something negative that we would want to "stop" and green being something positive that we would want to "continue," just like with traffic lights. For that reason, use red and green only to convey those meanings. Using red to show something positive or green to show something negative is counterintuitive and will make your chart harder to understand. You may also want to consider a color-blind audience. If you are concerned that someone reading your visuals may be color blind, avoid a red/green scale and consider using orange/blue. Tableau has begun defaulting to orange/blue color scales instead of red/green for this reason.
• Once your chart has been created, convert it to grayscale to ensure that the contrast still exists—this is both to ensure your color-blind audience can interpret your visuals and also to ensure that the contrast, in general, is stark enough with the color palette you have chosen.
PROGRESS CHECK
7. Often, external consultants will use a firm’s color scheme for a data
visualization or will use a firm’s logo for points on a scatter plot. While this
might be a great approach to support a corporate culture, it is often not the
most effective way to create a chart. Why would these methods harm a chart’s
effectiveness?
C: If you are including a data visualization with your write-up, you need to explain
how to use the visual. If there are certain aspects that you expect to stand out from the
analysis and the accompanying visual, you should describe what those components are—the visual should speak for itself, but the write-up can provide confirmation that the important pieces are gleaned.
T: Discuss what’s next in your analysis. Will the visual or the report result in a
weekly or quarterly report? What trends or outliers should be paid attention to over
time?
Revising
Just as you addressed and refined your results in the fourth step of the IMPACT model,
you should refine your writing. Until you get plenty of practice (and even once you
consider yourself an expert), you should ask other people to read through your writing to
make sure that you are communicating clearly. Justin Zobel suggests that revising your
writing requires you to “be egoless—ready to dislike anything you have previously
written. If someone
dislikes something you have written, remember that it is the readers you need to please,
not yourself.”3 Always placing your audience as the focus of your writing will help you
maintain an appropriate tone, provide the right content, and avoid too much detail.
PROGRESS CHECK
Progress Checks 5 and 6 display different charts depicting the quantity of beer sold
on each day in the Sláinte Sales Subset dataset. If you had created those visuals,
starting with the data request form and the ETL process all the way through data
analysis, how would you tailor the written report for the following two roles?
8. For the CEO of the brewery who is interested in how well the different products
are performing.
9. For the programmers who will be in charge of creating a report that contains
the same information that needs to be sent to the CEO on a monthly basis.
3 Ibid.
Summary
■ This chapter focused on the fifth step of the IMPACT model, or the "C," to discuss how to communicate the results of your data analysis projects. Communication can be done through a variety of data visualizations and written reports, depending on your audience and the data you are exhibiting.
■ In order to select the right chart, you must first determine the purpose of your data visualization. This can be done by answering two key questions:
◦ Are you explaining the results of a previously done analysis, or are you exploring the data through the visualization? (Is your purpose declarative or exploratory?)
◦ What type of data is being visualized (conceptual qualitative data or data-driven quantitative data)?
■ The differences between each type of data (declarative and exploratory, qualitative and quantitative) are explained, as well as how each datatype impacts both the tool you're likely to use (generally either Excel or Tableau) and the chart you should create.
■ After selecting the right chart based on your purpose and datatype, your chart will need to be further refined. Selecting the appropriate data scale, scale increments, and color for your visualization is explained through the answers to the following questions:
◦ How much data do you need to share in the visual to avoid being misleading, yet also avoid being distracting?
◦ If your data contain outliers, should they be displayed, or will they distort your scale to the extent that you can leave them out?
◦ Other than how much data you need to share, what scale should you place those data on?
◦ Do you need to provide context or reference points to make the scale meaningful?
◦ When should you use multiple colors?
■ Finally, this chapter discusses how to provide a written report to describe your data analysis project. Each step of the IMPACT model should be communicated in your write-up, and the report should be tailored to the specific audience to whom it is being delivered.
Key Words
continuous data (142) One way to categorize quantitative data, as opposed to discrete data.
Continuous data can take on any value within a range. An example of continuous data is height.
declarative visualizations (143) Made when the aim of your project is to “declare” or present your
findings to an audience. Charts that are declarative are typically made after the data analysis has been
completed and are meant to exhibit what was found in the analysis steps.
discrete data (142) One way to categorize quantitative data, as opposed to continuous data. Discrete
data are represented by whole numbers. An example of discrete data is points in a basketball game.
exploratory visualizations (143) Made when the lines between steps P (perform test plan), A (address and refine results), and C (communicate results) are not as clearly divided as they are in a declarative visualization project. Often when you are exploring the data with visualizations, you are performing the test plan directly in visualization software such as Tableau instead of creating the chart after the analysis has been done.
interval data (142) The third most sophisticated type of data on the scale of nominal, ordinal, interval, and ratio; a type of quantitative data. Interval data can be counted and grouped like qualitative data, and the differences between each data point are meaningful. However, interval data do not have a meaningful 0. In interval data, 0 does not mean "the absence of" but is simply another number. An example of interval data is the Fahrenheit scale of temperature measurement.
nominal data (141) The least sophisticated type of data on the scale of nominal, ordinal, interval,
and ratio; a type of qualitative data. The only thing you can do with nominal data is count, group, and
take a proportion. Examples of nominal data are hair color, gender, and ethnic groups.
normal distribution (142) A type of distribution in which the median, mean, and mode are all equal, so
half of all the observations fall below the mean and the other half fall above the mean. This phenomenon
is naturally occurring in many datasets in our world, such as SAT scores and heights and weights of
newborn babies. When datasets follow a normal distribution, they can be standardized and compared for
easier analysis.
ordinal data (141) The second most sophisticated type of data on the scale of nominal, ordinal, interval, and ratio; a type of qualitative data. Ordinal data can be counted and categorized like nominal data, and the categories can also be ranked. Examples of ordinal data include gold, silver, and bronze medals.
proportion (141) The primary statistic used with qualitative data. Proportion is calculated by counting the number of items in a particular category, then dividing that number by the total number of observations.
qualitative data (141) Categorical data. All you can do with these data are count and group, and in some cases, you can rank the data. Qualitative data can be further defined in two ways: nominal data and ordinal data. There are not as many options for charting qualitative data because they are not as sophisticated as quantitative data.
quantitative data (142) More complex than qualitative data. Quantitative data can be further defined
in two ways: interval and ratio. In all quantitative data, the intervals between data points are meaningful,
allowing the data to be not just counted, grouped, and ranked, but also to have more complex operations
performed on them such as mean, median, and standard deviation.
ratio data (142) The most sophisticated type of data on the scale of nominal, ordinal, interval, and
ratio; a type of quantitative data. They can be counted and grouped just like qualitative data, and the
differences between each data point are meaningful like with interval data. Additionally, ratio data have a
meaningful 0. In other words, once a dataset approaches 0, 0 means “the absence of.” An example of ratio
data is currency.
standard normal distribution (142) A special case of the normal distribution used for
standardizing data. The standard normal distribution has 0 for its mean (and thus, for its mode and
median, as well), and 1 for its standard deviation.
standardization (142) The method used for comparing two datasets that follow the normal distribution. By using a formula, every normal distribution can be transformed into the standard normal distribution. If you standardize both datasets, you can place both distributions on the same chart and more swiftly come to your insights.
ANSWERS TO PROGRESS CHECKS
1. Certainly, answers will vary given our own individual experiences. But we can note
that complex topics can be explained and understood by linking them to
categorizations or pictures.
2. Answers will vary.
3. a. Qualitative ordinal
b. Quantitative (ratio data)
c. Qualitative nominal
d. Quantitative (interval data)
4. While this question does ask for your preference, it is likely that you prefer image b
because time series data are continuous and can be well represented with a line
chart instead of bars.
5. Notice that the quarters are out of order (1, 2, then 4); this looks like quarter 3 has been skipped, but quarter 4 is actually the last quarter of 2019 instead of the last quarter of 2020, while quarters 1 and 2 are in 2020. Excel defaulted to simply ordering the quarters numerically instead of recognizing the order of the years in the underlying data.
6. Answers will vary. Possible answers include: Quantity of beer sold is a discrete value, so it is likely better modeled with a bar chart than a line chart. Between the two line charts, the second one is easier to interpret because it is in order of highest sales to lowest. Between the two bar charts, it depends on what is important to convey to your audience—are the numbers critical? If so, the second chart is better. Is it most important to simply show which beers are performing better than others? If so, the first chart is better. There is no reason to provide more data than necessary because they will just clutter up the visual.
7. Color in a chart should be used purposefully; it is possible that a firm’s color scheme
may be counterproductive to interpreting the chart. The icons as points in a scatter plot
might be distracting, which could make it take longer for a reader to gain insights from
the chart.
8. Answers will vary. Possible answers include: Explain to the CEO how to read the
visual, call out the important insights in the chart, tell the range of data that is
included (is it one quarter, one year, all time?).
9. Answers will vary. Possible answers include: Explain the ETL process, exactly what
data are extracted to create the visual, which tool the data were loaded into, and how
the data were analyzed. Explain the mechanics of the visual. The particular insights
of this visual are not pertinent to the programmer because the insights will potentially
change over time. The mechanics of creating the report are most important.
Discussion Questions
1. Explain Exhibit 4-2. Why are these four dimensions helpful in describing the information to be communicated? Exhibit 4-2 lists conceptual and data-driven as being on two ends of the continuum. Does that make sense, or can you think of a better way to organize and differentiate the different chart types?
2. According to Exhibit 4-8, which is the best chart for showing a distribution of a single
variable, like height? How about hair color? Major in college?
3. Box and whisker plots (or box plots) are particularly adept at showing extreme observations and outliers. In what situations would it be important to communicate these data to a reader? Any particular accounts on the balance sheet or income statement?
4. Based on the data from datavizcatalogue.com, is a line graph best at showing comparisons, relationships, compositions, or distributions? Name the best two.
5. Based on the data from datavizcatalogue.com, what are some major flaws of using
word clouds to communicate the frequency of words in a document?
6. Based on the data from datavizcatalogue.com, how does a box and whisker plot
show if the data are symmetrical?
7. What would be the best chart to use to illustrate earnings per share for one company
over the past five years?
8. The text mentions, “If your data analysis project is more declarative than
exploratory, it is more likely that you will perform your data visualization to
communicate results in Excel.” In your opinion, why is this true?
9. According to the text and your own experience, why is Tableau ideal for exploratory
data analysis?
Problems
1. Why was the graphic associated with the opening vignette regarding the 2016 presidential election an effective way to communicate the voter outcome for 50 states? What else could have been used to communicate this, and would it have been more or less effective in your opinion?
2. Evaluate the use of multiple colors in the graphic associated with the opening vignette regarding the 2016 presidential election. Would you consider its use effective or ineffective? Why? Can you think of a better way to communicate the extent to which pollsters incorrectly predicted the outcome in many of the states and in the country overall?
3. According to Exhibit 4-8, which is the best chart for comparisons of earnings per
share over many periods? How about for only a few periods?
4. According to Exhibit 4-8, which is the best chart for static composition of a data item
of the Accounts Receivable balance at the end of the year? Which is best for showing
a change in composition of Accounts Receivable over two or more periods?
5. The Big 4 accounting firms (Deloitte, EY, KPMG, and PwC) dominate the audit and
tax market in the United States. What chart would you use to show which accounting
firm dominates in each state in terms of audit revenues? Are there other interesting
ways you could use charts to find opportunities within the audit market?
6. Datavizcatalogue.com lists seven types of maps in its listing of charts. Which one
would you use to assess geographic customer concentration by number? How could
you show if some customers buy more than other customers on such a map? Would
you use the same chart or a different one?
7. In your opinion, do analysts use inappropriate scales for their charts primarily
because of naiveté (or ineffective training), or are inappropriate scales used so the
analyst can sway the audience one way or the other?
Company summary
Sláinte is a fictional brewery that has recently gone through a big change. Sláinte sells
six different products. The brewery has only recently expanded its business from
distributing in one state to distributing in nine states, and now the business has begun
stabilizing after the expansion. With that stability comes a need for better analysis.
One of Sláinte's first priorities is to identify its areas of success, as well as areas of
potential improvement.
Data
• Sláinte dataset
Technique
• Some experience with spreadsheets and PivotTables is useful for this lab.
Software needed
• Excel
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
Q1. Spend a few minutes filtering the data with the slicers. Name three
important insights that were easy to identify through this visualization.
Q2. What do the data visualization and the interactivity of the slicer provide
your audience that the original PivotTable does not?
10. From the Home tab, select the Conditional Formatting button, and a menu with
the different types of formatting available will appear.
11. Select Data Bars and pick the first option for blue gradient fill bars.
12. This conditional formatting is helpful because it allows us to compare grand totals of
each product. However, if we would like to see how each product’s month-over-
month sales compare to one another, we can display mini line charts next to each
row with a sparkline. To do so, select all of the “meat” of your PivotTable—that is,
don’t select any of the product labels (such as Imperial IPA), month labels, or grand
totals.
13. Navigate to the Insert tab on the ribbon, and select Line in the Sparklines category.
14. A window will appear specifying the data range you just selected and awaiting
input for the Location Range. We’d like to see the trend lines to the immediate
right of our PivotTable, so you can select the cells in the first empty column after
your Grand Totals.
LAB EXHIBIT 4-1B
Source: Microsoft Excel 2016
End of Lab
Company summary
Sláinte is a fictional brewery that has recently gone through a big change. Sláinte sells
six different products. The brewery has only recently expanded its business from
distributing in one state to distributing in nine states, and now the business has begun
stabilizing after the expansion. With that stability comes a need for better analysis.
One of Sláinte's first priorities is to identify its areas of success, as well as areas of
potential improvement.
Data
• Sláinte dataset
Software needed
• Tableau. Visit with your instructor for instructions or follow this link to download
Tableau, https://www.tableau.com/academic/students, and click Get Tableau for Free to
register for a free student license. Your student license will last one year.
• Screen capture tool (Windows: Snipping Tool; Mac: Cmd + Shift + 4)
Q1. Using the UML diagram, identify which table(s) and attributes you will need
to answer your initial question regarding amount of products sold.
3. Browse to the Slainte_Subset.accdb file and click Open. This will extract the data.
4. The Data Source tab will open, with three tables for you to select from. We can begin
by just exploring the Sales data. Double-click on the Sales_Subset table to load it into
Tableau.
10. Add labels to the bars: To the left of your data viz, there is the Marks window. It has
a variety of ways that you can enhance the way you’re viewing the data. Click Labels,
then place a check mark in the box next to Show mark labels.
11. Instead of showing Product Code, show the Product Description; this will require you
to join in another table. Click back into the Data Source tab in the bottom left.
12. Double-click on the FGI_Product table to load the product data into Tableau. You
will see the FGI_Product data populate, as well as a Venn diagram joining the two
datasets. Click on the Venn diagram to ensure the data are joined properly. You
want to ensure that the primary key of FGI_Product is matched with the
corresponding foreign key in the Sales_Subset data (the same way the two tables
are joined in the UML diagram).
16. Take a screenshot (label it 4-2B).
17. Sometimes when you’re performing exploratory data analysis, you’ll want to save
the visualization you just made, while also giving yourself the opportunity to drill
down into the data. We’ll name this sheet after the analysis you just did, then
duplicate the data to work with it further. Right-click Sheet 1 and select Rename
Sheet. Type Total Products Sold as the sheet’s name.
18. Right-click the sheet tab that you just renamed and select Duplicate Sheet.
25. Select Geographic Role, and then select State/Province.
26. Create a new sheet (do not duplicate any of the previous sheets) by clicking the
first icon to the right of the Total Products Sold by Year tab.
27. This time, we will create a report that shows total products sold by state.
Double-click the measure Customer St. Tableau automatically populates a map with a dot
in each state that’s listed in the Customer table.
28. Double-click on the measure Sales Order Quantity Sold. The dots have changed to
vary in size, which is proportional to the amount of sales in each state.
29. We can make the results easier to interpret by changing the visualization type. If
the Show Me window isn’t showing in the upper right corner, click Show Me, then
select the Filled Map.
30. Rename this sheet Total Products Sold by State.
31. Take a screenshot (label it 4-2D).
33. In the Dashboard view, instead of seeing the various dimensions and measures to
drag and drop, you see the three sheets that you have created. You can drag and drop
them into the area that says Drop Sheets Here, and you can arrange them any way
you wish. Replicate this arrangement:
34. You can also use each sheet as a filter. Click the Total Products Sold section of
your dashboard. There are three small icons in the top right of the sheet when the
sheet is active. Clicking the middle one (which looks like a funnel) will allow you
to use the bars as filters for the entire dashboard. Click to do so.
35. Follow the same process to make the states work as filters for the dashboard by
clicking Use as Filter in the Total Products Sold by State sheet.
Now, you can click any of the bars in the Total Products Sold chart or any of the states
in the Total Products Sold by State, and the data in each of the three sheets will shift to
focus on just those products and/or states.
36. Filter by either a state or a product, and take a screenshot (label it 4-2E).
Q5. After creating these sheets and the dashboard, what additional data would you
recommend that Sláinte analyze? What is another data visualization that
would be helpful for Sláinte’s decision making?
End of Lab
Data
The data for this lab and all other Dillard's labs are available at
http://walton.uark.edu/enterprise/. Your instructor will either give you specific
instructions on how to access the data, or there will be information available on Connect.
The 2016 Dillard's data cover all transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio and Microsoft Excel (available on
the Remote Desktop at the University of Arkansas)
• Tableau (available on the Remote Desktop at the University of Arkansas)
2. Input the Server and Database information that you received from the
walton.uark.edu/enterprise page for the Dillard's data, and then click Sign In.
3. Wait for the connection to process, and then you have two options: If you are
certain that you will only want to visualize one specific set of query results, you
can input a query from the Connections page. Alternatively, you can connect to
entire tables if you want the option to drill down into the data and answer more
than one question.
6. It may take a couple minutes for the results to populate. Once they do, we’ll
preview the data.
The data should load without a problem, but because Tableau is automatically
interpreting the data, it is a good idea to look through the data to ensure that we don't
need to transform them in any way. In Tableau, you should always check which datatype
has been assigned to each attribute. The datatype is denoted by a little icon that is an Abc
for a string of text, a number sign for numerical data, a calendar for dates, or a globe for
geographic data.
The two attributes of state are denoted with an Abc and a number sign:
7. Of particular concern is the way the state data were imported. The Abc above the
state column indicates that they were imported into Tableau as plain text instead of
as a geographic attribute.
9. Once Tableau has processed the change, click Sheet 1 on the bottom of the
Tableau window to begin working with the data.
10. Double-click on state in Dimensions.
You will see that Tableau immediately populates a map with a blue dot in each state
that has a Dillard’s store.
11. To make these data even more meaningful, we'll add average to this view.
Double-click Average in Measures.
12. Tableau might have defaulted to a symbol map. The difference in averages is easier
to interpret with a filled map. Click Show Me in the top right corner of Tableau if
your Show Me window isn’t already available, then click Filled Map.
17. Check that the attributes pulled in as the appropriate datatypes. For example, City
and Zip Code pulled in as geographic datatypes, but state did not. Click the Abc
above the State attribute to change the datatype.
19. Click Sheet 1 in the bottom left corner of the Tableau screen to begin working
with the data.
20. Double-click State from Dimensions.
Tableau immediately populates a map with a blue dot in each state that has a
Dillard’s store.
21. To make these data even more meaningful, we’ll add average transaction amount
to this view. Start by double-clicking on Tran Amt from the Measures.
22. It may take a couple minutes for Tableau to populate the data, but the size of the
blue dots will adjust to show how the amounts vary across states. The default value
for this measure is SUM, though, so we need to edit it to be average.
23. Hover over SUM(Tran Amt) in the Marks window to make available an arrow for
a drop-down window.
24. Click the drop-down, then click Measure (Sum) to change the measure to Average.
25. Tableau might have defaulted to a symbol map. The difference in averages is easier
to interpret with a filled map. Click Show Me in the top right corner of Tableau if
your Show Me window isn’t already available, then click Filled Map.
26. Take a screenshot of your results (label it 4-3A).
Part 3: Perform an Analysis of the Data
Visualizing data often makes it easier to see the answers to your questions, which then
leads to more questions. In this case, Arkansas clearly has a higher average transaction
amount than the other states. This may lead you to want to drill down into the data to see
if the performance is the same across all of the stores in Arkansas, or if there is a
standout store.
27. If you click Arkansas, Tableau will give you the option to filter out all of the
other states so that you can drill down into this data point. Click Keep Only.
Q3. Which city has the highest average transaction amount? (It can be easier to
answer this question if you sort the data. Clicking the “sort” button will re-
order the bars so that the city with the highest average transaction amount will
be the first bar listed.)
Q4. How would you think managers would like to see transaction balance by state?
Q5. What are further questions that would be meaningful to drill down into with
this same dataset, given what you have seen so far?
To dig deeper into the data, we can drill down into which types of items are being sold
the most in Maumelle. To do so, we need to join in two more tables. Joining in the SKU
table will provide a description of the items being sold, and joining in the DEPARTMENT
table will provide categorical information for each individual item.
30. Click Data Source in the bottom left corner of the Tableau application.
32. Return to your Tableau sheet with the horizontal bar chart, and click Keep Only for
Maumelle.
The DEPARTMENT and SKU data are hierarchical, with an item belonging to a
department, which groups into a deptdec (decade) through a deptcent (century).
33. Begin by viewing the Maumelle store data by the highest level of the hierarchy,
the department century. The description attribute will be the most useful to
interpret, so double-click on the Deptcent Desc attribute from the DEPARTMENT
dimensions.
34. To drill down further into the data, add the department decade data to the chart.
Double-click on Deptdec Desc to add another level of detail.
35. You can also add drill-down capabilities by creating the hierarchy in Tableau. Drag
and drop Deptdec Desc on top of Deptcent Desc in the Dimensions window:
185
36. Click OK on the window to create the hierarchy.
37. Notice that the Deptcent Desc pill in the Rows shelf changed to include a minus sign
— this indicates that the hierarchy has been expanded. Click the minus sign to
collapse the hierarchy.
Data
The data for this lab and all other Dillard's labs are available at
http://walton.uark.edu/enterprise/. Your instructor will either give you specific
instructions on how to access the data, or there will be information available on Connect.
The 2016 Dillard's data cover all transactions over the period 1/1/2014 to 10/17/2016.
Specifically, you will need the Excel file that you created and saved in Lab 3-2,
Lab 3-2Dummy.xlsx. Completing Labs 3-2 and 3-5 and saving the associated Excel file are
prerequisites for this lab.
Software needed
• Microsoft SQL Server Management Studio and Microsoft Excel (available on
the Remote Desktop at the University of Arkansas)
• Tableau
2. Browse to the Excel output with the dummy variables that you created in Lab 3-2
and click Open. This will extract the data into Tableau.
3. When running a regression in Tableau, you will want to place your explanatory
variables on the columns and your dependent variables on the rows. To do so, drag and
drop the Arkansas-dummy measure to the Columns shelf and the Tran Amt Measure
to the Rows shelf.
5. It may take some time for the data to disaggregate. Once they do, navigate back to
the Analysis tab, click Lines, and then select Show All Trend Lines.
Chapter 5
The Modern Audit
and Continuous Auditing
A Look Back
Chapter 4 completed our discussion of the IMPACT model by explaining how to communicate your results
through data visualization and through written reports. We discussed how to choose the best chart for your dataset
and your purpose. We also helped you learn how to refine your chart so that it communicates as efficiently and
effectively as possible. The chapter wrapped up by describing how to provide a written report tailored to specific
audiences who will be interested in the results of your data analysis project.
A Look Ahead
In chapter 6, you will learn how to use audit software to perform substantive audit tests, including when and how
to select samples and how to confirm account balances. Specifically, we discuss the use of different types of
descriptive, diagnostic, predictive, and prescriptive analytics as they are used to generate computer-assisted auditing
techniques.
The large public accounting firms offer a variety of analytical tools to their customers. Take PwC’s Halo, for
example, shown in Exhibit 5-1. This tool allows auditors to interrogate a client’s data and identify patterns and
relationships within the data in a user-friendly dashboard. By mapping the data, auditors and managers can
identify inefficiencies in business processes, discover areas of risk exposure, and correct data quality issues by
drilling down into the individual users, dates and times, and amounts of the entries. Tools like Halo allow auditors
to develop their audit plan by narrowing their focus and audit scope to unusual and infrequent issues that
represent high audit risk.
EXHIBIT 5-1
Source: http://halo.pwc.com
OBJECTIVES
After reading this chapter, you should be able to:
LO 5-1
LO 5-2 Understand modern auditing techniques
or another manager, these individuals build teams to develop and implement analytical
techniques to aid the following audits:
1. Process efficiency and effectiveness.
2. Governance, risk, and compliance, including internal controls effectiveness.
3. Information technology and information systems audits.
4. Forensic audits in the case of fraud.
5. Support for the financial statement audit.
Internal auditors are also more likely to have working knowledge of the various
enterprise resource planning systems that are in use at their companies. They are familiar with
how the general journals from a product like JD Edwards actually reconcile to the general
ledger in SAP. Because implementation of these systems varies across organizations (and
even within organizations), internal auditors can understand how analytics are not simply
a one-size-fits-all type of strategy.
PROGRESS CHECK
1. How do auditors use Data Analytics in their audit testing?
2. Make the case for why an internal audit is increasingly important in the modern
audit. Why is it also important for external auditors and the scope of their
work?
[Exhibit: diagram showing a systems translator, a data warehouse, and the audit program]
The current set of audit data standards defines the following standards:
• The Base Standard defines the format for files and fields as well as some master data
for users and business units.
• The General Ledger Standard adds the chart of accounts, source listings, trial
balance, and GL (journal entry) detail.
• The Order to Cash Subledger Standard focuses on sales orders, accounts receivable,
shipments, invoices, cash receipts, and adjustments to accounts, as shown in Exhibit 5-3.
EXHIBIT 5-3 Audit Data Standards
The audit data standards define common elements needed to audit the order-to-cash or sales process.
Source: https://www.aicpa.org/InterestAreas/FRC/AssuranceAdvisoryServices/DownloadableDocuments/AuditDataStandards/AuditDataStandards.O2C.July2015.pdf
• The Procure to Pay Subledger Standard identifies data needed for purchase orders,
goods received, invoices, payments, and adjustments to accounts.
• The Inventory Subledger Standard defines product master data, location data,
inventory on hand data, and inventory movement.
With standard data elements in place, not only will auditors streamline their access to
data, but they also will be able to build analytical tools that they can share with others
within their company or professional organizations. This can foster greater collaboration
among auditors and increased use of Data Analytics across organizations. These data
elements will be useful when performing substantive testing in chapter 6.
PROGRESS CHECK
3. What are the advantages of the use of homogeneous systems? Would a
merger target be more attractive if it used a similar financial reporting system
as the potential parent company?
4. How does the use of audit data standards facilitate data transfer between
auditors and companies? How does it save time for both parties?
• Audit procedures themselves typically identify data, locations, and attributes that the
auditors will evaluate. These are the variables that will provide the input for many of
the substantive analytical procedures discussed in chapter 6.
• The evaluation of audit data may be distilled into a risk score. This may be a
function of the volume of exceptional records or level of exposure for the functional
area. If the judgment and decision making is easily defined, a rule-based analytic
could automatically assign a score for the auditor to review. For more complex
judgment, the increasing prevalence of artificial intelligence and machine learning
discussed in chapter 3 may be of assistance. After all, if we have enough
observations of the scores auditors assign to specific cases and outcomes, we can
create models that will provide accurate enough classification for these tasks.
Typical internal audit organizations that have adopted Data Analytics to enhance their
audit have done so when an individual on the team has begun tinkering with Data
Analytics. They convince their managers that there is value in using the data to direct the
audit and get a champion in the process. Once they show the value proposition of Data
Analytics, they are given more resources to build the program and adapt the existing audit
program to include more data-centric evaluation where appropriate.
Because of the potential disruption to the organization, it is more likely that an auditor
will adapt an existing audit plan than develop a new system from scratch. Automating the
audit plan and incorporating data analytics involve the following steps, which are similar
to the IMPACT model:
1. Identify the questions or requirements in the existing audit plan.
2. Master the data by identifying attributes and elements that are automatable.
3. Perform the test plan, in this case by developing analytics (in the form of rules
or models) for those attributes identified in step 2.
4. Address and refine results. List expected exceptions to these analytics and
expected remedial action by the auditor, if any.
5. Communicate insight by testing the rules and comparing the output of the analytics
to manual audit procedures.
6. Track outcomes by following up on alarms and refining the models as needed.
Let’s assume that an internal auditor has been tasked with implementing data analytics
to automate the evaluation of a segregation of duties control within SAP. The auditor
evaluates the audit plan and identifies a procedure for testing this control. The audit plan
identifies which tables and fields contain relevant data, such as an authorization matrix,
and the specific roles or permissions that would be incompatible. The auditor would use
that information to build a model that would search for users with incompatible roles and
notify the auditors.
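The text does not prescribe a particular implementation, but as a rough sketch of what such a rule-based analytic could look like in Python (the table layout, user names, and role names below are hypothetical and are not SAP's actual structures), the check simply joins user-role assignments against a list of incompatible role pairs:

import pandas as pd

# Hypothetical extract of user-role assignments and a list of incompatible role pairs
user_roles = pd.DataFrame({
    "user": ["jdoe", "jdoe", "asmith"],
    "role": ["CREATE_VENDOR", "APPROVE_PAYMENT", "CREATE_VENDOR"],
})
incompatible_pairs = [("CREATE_VENDOR", "APPROVE_PAYMENT")]

# Flag any user who holds both roles of an incompatible pair
violations = []
for user, roles in user_roles.groupby("user")["role"]:
    held = set(roles)
    for role_a, role_b in incompatible_pairs:
        if role_a in held and role_b in held:
            violations.append((user, role_a, role_b))

print(violations)  # e.g., [('jdoe', 'CREATE_VENDOR', 'APPROVE_PAYMENT')]

In practice the authorization matrix would be extracted from the ERP system itself; the point of the sketch is only that the rule can run unattended and route its exceptions to the auditor.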
LO 5-5 Evaluate audit alarms as part of continuous auditing
CONTINUOUS AUDITING TECHNIQUES
Data Analytics and audit automation allow auditors to continuously monitor and audit the
systems and processes within their companies. Whereas a traditional audit may have the
internal auditors perform a routine audit plan once every 12 to 36 months or so, the
continuous audit evaluates data in a form that matches the pulse of the business.
For example, purchase orders can be monitored for unauthorized activity in real time,
while month-end adjusting entries would be evaluated once a month. When exceptions
occur—for example, a purchase order is created with a customer whose address matches
an employee’s—the auditors are alerted immediately and given the option to respond
right away to resolve the issue.
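As a loose illustration only (the column names and records are invented), a continuous monitoring rule for the purchase order example can be sketched in Python as a simple join between open purchase orders and the employee master file:

import pandas as pd

purchase_orders = pd.DataFrame({
    "po_number": [1001, 1002],
    "vendor_address": ["12 Main St", "77 Oak Ave"],
})
employees = pd.DataFrame({
    "employee": ["jdoe", "asmith"],
    "home_address": ["77 Oak Ave", "3 Elm Rd"],
})

# Raise an alarm for any purchase order whose counterparty address matches an employee address
alarms = purchase_orders.merge(
    employees, left_on="vendor_address", right_on="home_address", how="inner"
)
if not alarms.empty:
    print("Exceptions for immediate auditor review:")
    print(alarms[["po_number", "employee", "vendor_address"]])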
Continuous auditing is a process that provides real-time assurance over business
processes and systems. It involves the application of rules or analytics that perform a
continuous monitoring function that constantly evaluates internal controls and
transactions. It also generates continuous reporting on the status of the system so that an
auditor can know at any given time whether the system is operating within the parameters
set by management or not.
Implementing continuous auditing procedures is similar to automating an audit plan
with the additional step of scheduling the automated procedures to match the timing and
frequency of the data being evaluated and notifying the auditor when exceptions occur.
• Database maps (such as UML diagrams) and data dictionaries that define the location
and types of data auditors will analyze.
• Documentation about existing automated controls, including parameters and
variables used for analysis.
• Evidence, including data extracts, transformed data, and model output, that
provides support for the functioning controls and management assertions.
Policies and procedures that help provide consistent quality work are essential to
maintaining a complete and consistent audit. The audit firm or chief audit executive is
responsible for providing guidance and standardization so that different auditors and
audit teams produce clear results. These standardizations include consistent use of
symbols or tick marks and a uniform mechanism for cross-referencing output to source
documents or data.
PROGRESS CHECK
5. Continuous audit uses alarms to identify exceptions that might indicate an
audit issue and require additional investigation. If there are too many alarms
and exceptions based on the parameters of the continuous audit system, will
continuous auditing actually help or hurt the overall audit effectiveness?
6. PwC uses three systems to automate its audit process. Aura is used to direct
the audit by identifying which evidence to collect and analyze, Halo performs
Data Analytics on the collected evidence, and Connect provides the workflow
process that allows managers and partners to review and sign off on the
work. How does that line up with the steps of the IMPACT model we've
discussed throughout the text?
Summary
As auditing has evolved over the past few decades, Data Analytics has driven many of
the changes. The ability to increase coverage of the audit using data has made it less
likely that key elements are missed. Data Analytics has improved auditors’ ability to assess
risk, inform their opinions, and improve assurance over the processes and controls in their
organizations.
Key Words
audit data standards (ADSs) (193) The audit data standards define common tables and fields
that are needed by auditors to perform common audit tasks. The AICPA developed these standards.
data warehouse (193) A data warehouse is a repository of data accumulated from internal and
external data sources, including financial data, to help management decision making.
flat file (193) A flat file is a single table of data with user-defined attributes that is stored
separately from any application.
homogeneous systems approach (193) Homogeneous systems represent one single installation or
instance of a system. It would be considered the opposite of a heterogeneous system.
heterogeneous systems approach (193) Heterogeneous systems represent multiple installations or
instances of a system. It would be considered the opposite of a homogeneous system.
production or live systems (193) Production (or live systems) are those active systems that collect
and report and are directly affected by current transactions.
systems translator software (193) Systems translator software maps the various tables and fields
from varied ERP systems into a consistent format.
8. When there is no alarm in a continuous audit, but there is an abnormal event, we
would call that a:
a. False negative.
b. True negative.
c. True positive.
d. False positive.
9. If purchase orders are monitored for unauthorized activity in real time while month-
end adjusting entries are evaluated once a month, those transactions monitored in
real time would be an example of a:
a. Traditional audit.
b. Periodic test of internal controls.
c. Continuous audit.
d. Continuous monitoring.
10. Who is most likely to have a working knowledge of the various ERP systems that are
in use in the company?
a. Chief executive officer
b. External auditor
c. Internal auditor
d. IT staff
Discussion Questions
1. Why has most innovation in Data Analytics originated in internal audit rather than
external audit? Or if it hasn't, why not?
2. Is it possible for a firm to have general journals from a product like JD Edwards
actually reconcile to the general ledger in SAP? Why or why not?
3. Is it possible for multinational firms to have many different financial reporting systems
and ERP packages all in use at the same time?
4. How does the systems translator software work? How does it store the merged data
into a data warehouse?
5. Why is it better to extract data from a data warehouse than a production or live
system directly?
6. Would an auditor view heterogeneous systems as an audit risk? Why or why not?
7. Why would audit firms prefer to use proprietary workpapers rather than just storing
working papers on the cloud?
Problems
1. What are the advantages of the use of homogeneous systems? Would a merger
target be more attractive if it used a similar financial reporting system as the potential
parent company?
2. Consider Exhibit 5-3. Looking at the audit data standards order-to-cash process, what
function is there for the AR_Adjustments transaction table—that is, adjustments to the
Accounts Receivable? Why is this an audit data standard, and why is it important for
an auditor to see?
3. Who developed the audit data standards? In your opinion, why is it the right group to
develop and maintain them rather than, say, the Big 4 firms or a small practitioner?
4. Simple to complex Data Analytics can be applied to a client's data during the planning
stage of the audit to identify which areas the auditor should focus on. Which types of
techniques or tests might be used in this stage?
5. What approach should a company make if its continuous audit system has too many
alarms that are false positives? How would that approach change if there are too
many missed abnormal events (such as false negatives)?
6. Implementing continuous auditing procedures is similar to automating an audit plan
with the additional step of scheduling the automated procedures to match the timing
and frequency of the data being evaluated and the notification to the auditor when
exceptions occur. In your opinion, will the traditional audit be replaced by continuous
auditing?
Lab 5-1 Set Up a Cloud Folder
Auditors collect evidence in electronic workpapers that include a permanent file with
information about policies and procedures and a temporary file with evidence related to
the current audit. These files could be stored locally on a laptop, but the increased use of
remote communication makes collaboration through the cloud more necessary. There are
a number of commercial workpaper applications, but we can simulate some of those
features with consumer cloud platforms, like Microsoft OneDrive.
Company summary
You have rotated into the internal audit department at a mid-sized manufacturing
company. Your team is still using company e-mail to send evidence back and forth, usually
in the form of documents and spreadsheets. There is a lot of duplication of these files, and
no one is quite sure which version is the latest. You see an opportunity to streamline this
process using OneDrive.
Technique
• Gather documents, explore document history and revisions
Software needed
• A modern web browser
In this lab, you will:
Part 1: Create a shared folder.
Part 2: Upload files.
Part 3: Review revisions.
End of Lab
See Lab 5-1 for background information on this lab. The goal of a shared folder is that
other members of the audit team can contribute and edit the documents. Commercial
software provides an approval workflow and additional internal controls over the documents
to reduce manipulation of audit evidence, for example. For consumer cloud platforms,
one control appears in the versioning of documents. As revisions are made, old copies of
the documents are kept so that they can be reverted to, if needed.
In this lab, you will:
Part 1: Upload revised documents.
Part 2: Review document revision history.
Lab 5-3 Identify Audit Data Requirements
As the new member of the internal audit team, you have introduced your team to the
shared folder and are in the process of modernizing the internal audit at your firm. The chief
audit executive is interested in using Data Analytics to make the audit more efficient. Your
internal audit manager agrees and has tasked you with reviewing the audit plan. She has
provided three “audit action sheets” with procedures that they have been using for the past
three years to evaluate the procure-to-pay (purchasing) process and is interested in your
thoughts for modernizing them.
Technique
• Review the audit plan, look for procedures involving data, and identify the locations
of the data.
Software needed
• A modern web browser
Q1. Read the first audit action sheet. What other data elements that are not listed in
the procedures do you think would be useful in analyzing this account?
Technique
• Review the audit plan, identify procedures that must be completed manually, and
identify those that can be automated and scheduled.
• Also determine when the procedures should occur.
Software needed
• A modern web browser
3. For each element and rule, determine whether it requires manual review or can be
performed automatically to alert auditors when exceptions occur. Add either
"Auto" or "Manual" to that column.
4. Finally, determine how frequently the data should be evaluated. Indicate “Daily,”
“Weekly,” “Monthly,” “Annually,” or “During Audit.” Think about when the data
are being generated. For example, transactions occur every day, but new
employees are added every few months.
5. Take a screenshot (label it 5-4A).
6. Save and close your file.
End of Lab
Chapter 6
Audit Data Analytics
A Look Back
In chapter 5, we introduced Data Analytics in auditing by considering how both internal and external auditors are
using technology in general, and audit analytics specifically, to evaluate firm data and generate support for
management assertions. We emphasized audit planning, audit data standards, continuous auditing, and audit
working papers.
A Look Ahead
Chapter 7 explains how to apply Data Analytics to measure performance. By measuring past performance and
comparing it to targeted goals, we are able to assess how well a company is working toward a goal. Also, we can
determine required adjustments to how decisions are made or how business processes are run, if any.
Internal auditors at Hewlett-Packard Co. (HP) understand how data analytics can improve
processes and controls. Management identified abnormal behavior with manual journal
entries, and the internal audit department responded by working with various governance
and compliance teams to develop dashboards that would allow them to monitor accounting
activity. The dashboard made it easier for management and the auditors to follow trends,
identify spikes in activity, and drill down to identify the individuals posting entries.
Leveraging accounting data allows the internal audit function to focus on the risks facing
HP and act on data in real time by implementing better controls. Audit data analytics
provides an enhanced level of control that is missing from a traditional audit.
OBJECTIVES
After reading this chapter, you should be able to:
Data may also be found in unlikely places. An auditor may be tasked with determining
whether the steps of a process are being followed. Traditional evaluation would involve
the auditor observing or interviewing the employee performing the work. Now that most
processes are handled through online systems, an auditor can perform Data Analytics on
the time stamps of the tasks and determine the sequence of approvals in a workflow
along with
the amount of time spent on each task. This form of process mining enables insight into
areas where greater efficiency can be applied. Likewise, data stored in paper documents,
such as invoices received from vendors, can be scanned and converted to tabular data
using specialized software. These new pieces of data can be joined to other transactional
data to enable new, thoughtful analytics.
There is an increasing opportunity to work with unstructured Big Data to provide
additional insight into the economic events being evaluated by the auditors, such as
surveillance video or text from e-mail, but those are still outside the scope of current Data
Analytics that an auditor would develop.
Most auditors will perform descriptive and diagnostic analytics as part of their audit
plan. On rare occasions, they may experiment with predictive and prescriptive analytics
directly. More likely, they may identify opportunities for the latter analytics and work
with data scientists to build those for future use.
Some examples of CAATs and audit procedures related to the descriptive, diagnostic,
predictive, and prescriptive analytics can be found in Table 6-2.
While many of these analyses can be performed using Excel, most CAATs are built on
generalized audit software (GAS), such as IDEA, ACL, or TeamMate Analytics. The
GAS software has two main advantages over traditional spreadsheet software. First, it
enables analysis of very large datasets. Second, it automates several common analytical
routines, so an auditor can click a few buttons to get to the results rather than writing a
complex set of formulas. GAS is also scriptable and enables auditors to record or
program common analyses that may be reused on future engagements.
Communicate Insights
Many analytics can be adapted to create an audit dashboard, particularly if the firm has
adopted continuous auditing. The primary output of CAATs is evidence used to validate
assertions about the processes and data. This evidence should be included in the audit
workpapers.
Track Outcomes
The detection and resolution of audit exceptions may be a valuable measure of the
efficiency and effectiveness of the internal audit function itself. Additional analytics may
track the number of exceptions over time and the time taken to report and resolve the
issues. For the CAATs involved, a periodic validation process should occur to ensure that
they continue to function as expected.
PROGRESS CHECK
1. Using Table 6-2 as a guide, compare and contrast descriptive and diagnostic
analytics. How might these be used in an audit?
2. In a continuous audit, how would a dashboard help to communicate audit findings
and spur a response?
In this and the next few sections, we'll present some examples of procedures that
auditors commonly use to evaluate enterprise data. In these examples, we show the basic
process for Excel, including formulas, and IDEA. Note that in the Excel formulas, we
identify data elements in [brackets]. To use these formulas, replace the bracketed [data
element] with a value or range of values as appropriate. For example, [Aging date] would
be replaced with C3 if the data are in column C, row 3.
Age Analysis
Aging of accounts receivable and accounts payable helps determine the likelihood that a
balance will be paid. This substantive test of account balances evaluates the date of an
order and groups it into buckets based on how old it is, typically in 0–30, 31–60, 61–90,
and >90
days, or similar. See Table 6-3 for an example. Extremely old accounts that haven’t been
resolved or written off should be flagged for follow-up by the auditor. It could mean that
(1) the data are bad, (2) a process is broken, (3) there’s a reason someone is holding that
account open, or (4) it was simply never resolved.
There are many ways to calculate aging in Excel, including using pivot tables. If you
have a simple list of accounts and balances, you can calculate a simple age of accounts in
Excel using the following procedure.
Data
• Customer/vendor name
• Unpaid order number
• Order date
• Amount
In Excel
1. Open your worksheet.
2. Add a cell with the aging date.
3. Add a calculated column for the days outstanding: =[Aging date]-[Order date].
4. Add four new calculated columns for the buckets:
a. 0–30 days: =IF([Aging date]-[Order date]<=30,[Amount],0).
b. 31–60 days: =IF(AND([Aging date]-[Order date]<=60, [Aging date]-[Order date]>30),[Amount],0).
c. 61–90 days: =IF(AND([Aging date]-[Order date]<=90, [Aging date]-[Order date]>60),[Amount],0).
d. >90 days: =IF([Aging date]-[Order date]>90,[Amount],0).
5. Copy the formulas for all records.
6. Add a total to the bottom of each bucket: =SUM([bucket column]).
In IDEA
1. Open your worksheet.
2. Go to Analysis > Categorize > Aging.
3. Select aging date, field containing transaction date, and amount for the field to
total amount.
4. Click OK.
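Outside of Excel and IDEA, the same bucketing logic can be sketched in a few lines of Python with pandas; the customer names, dates, and amounts below are made up for illustration:

import pandas as pd

ar = pd.DataFrame({
    "customer": ["Alpha", "Beta", "Gamma"],
    "order_date": pd.to_datetime(["2020-11-15", "2020-09-02", "2020-05-20"]),
    "amount": [500.0, 1200.0, 300.0],
})
aging_date = pd.Timestamp("2020-12-31")

# Days outstanding and aging buckets (0-30, 31-60, 61-90, >90)
ar["days_outstanding"] = (aging_date - ar["order_date"]).dt.days
ar["bucket"] = pd.cut(ar["days_outstanding"],
                      bins=[-1, 30, 60, 90, float("inf")],
                      labels=["0-30", "31-60", "61-90", ">90"])

print(ar.groupby("bucket")["amount"].sum())  # total outstanding in each bucket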
Sorting
Sometimes, simply viewing the largest or smallest values can provide meaningful insight.
Sorting in ascending order shows the smallest number values first. Sorting in descending
order shows the largest values first.
Data
• Any numerical, date, or text data of interest
In Excel
1. Open your worksheet.
2. Select the data you wish to sort.
3. Go to Home > Format as Table.
4. Click the drop-down arrow next to the header or the column you want to sort.
5. Click Sort A to Z for ascending order or Sort Z to A for descending order.
In IDEA
1. Open your data table.
2. Go to Data > Order > Sort.
3. Choose your fields and direction, Ascending or Descending.
4. Click OK.
Summary Statistics
Summary statistics provide insight into the relative size of a number compared with the
population. The mean indicates the average value, while the median produces the middle
value if all the transactions were lined up in a row. The min shows the smallest value,
while the max shows the largest. Finally, a count tells how many records exist, while the
sum adds up the values to find a total. Once summary statistics are calculated, you have a
reference point for an individual record. Is the amount above or below average? What
percentage of the total does a group of transactions make up?
Data
• Any numerical data, such as a dollar amount or quantity
In Excel
1. Open your workbook.
2. Add the following calculated values:
• Mean: =AVERAGE([range]).
• Median: =MEDIAN([range]).
• Minimum: =MIN([range]).
• Maximum: =MAX([range]).
• Count: =COUNT([range]).
• Sum: =SUM([range]).
3. Alternatively, format your data as a table and show the total row at the bottom:
a. Select your data.
b. Go to Home > Styles > Format as Table.
c. Select a table style and click OK.
d. Go to Table Tools > Design > Table Style Options and click the Total Row box.
e. Click the drop-down arrow next to the column total value that appears, and choose
an appropriate statistic.
In IDEA
1. Open your worksheet.
2. In the Properties pane on the right, click Field Statistics.
3. Allow IDEA to calculate all uncalculated fields, if prompted.
4. In the output screen, you can click any blue number to locate those transactions.
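For comparison, the same six statistics can be produced in a single call in Python with pandas (the amounts below are invented):

import pandas as pd

amounts = pd.Series([125.50, 980.00, 45.25, 310.10, 12000.00], name="amount")

# Mean, median, min, max, count, and sum in one pass
print(amounts.agg(["mean", "median", "min", "max", "count", "sum"]))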
Sampling
Sampling is useful when you have manual audit procedures, such as testing transaction details
or evaluating source documents. The idea is that if the sample is an appropriate size, the
features of the sample can be confidently generalized to the population. So, if the sample
has no errors (misstatement), then the population is unlikely to have errors either. Of
course, sampling has its limitations. The confidence level is not a guarantee that you won't
miss something critical like fraud. But it does limit the scope of the work the auditor
must perform.
There are three determinants for sample size: confidence level, tolerable misstatement,
and estimated misstatement.
Data
• Any list of transactions or master data
In Excel
1. Enable Analysis ToolPak:
a. Go to File > Options > Add-ins > Excel Add-ins > Go.
b. Check the box next to Analysis ToolPak, and click OK.
2. Go to Data > Analysis > Data Analysis.
3. Click Sampling, then OK.
a. Select your input range, usually the transaction number.
b. Choose Random, and input the number of samples.
c. Click OK.
4. A new worksheet will appear with a list of your randomly selected transactions.
In IDEA
1. Open your worksheet.
2. Go to Analysis > Sample > Random.
a. Input number of records to select for your sample size.
b. Change other values as needed.
c. Click OK.
3. A new worksheet will be created with your random sample.
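A minimal Python sketch of a random selection (the transaction numbers and sample size are hypothetical) looks like this; fixing the random seed makes the selection reproducible for the workpapers:

import pandas as pd

transactions = pd.DataFrame({"txn_id": range(1, 501)})

# Randomly select 25 transactions for manual testing
sample = transactions.sample(n=25, random_state=42)
print(sorted(sample["txn_id"]))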
Monetary unit sampling (MUS) allows auditors to evaluate account balances. MUS
is more likely to pull accounts with large balances (higher risk and exposure) because it
focuses on dollars, not account numbers.
Data
• The book value of the financial accounts you’re evaluating
• The sample size
In Excel
1. Find the sampling interval. Divide the book value by sample size.
a. 1,000,000/132 = 7,575 <- Sampling interval
2. Sort the financial accounts in some type of sequence, and calculate a
cumulative balance.
a. Alphabetically by name.
b. Numerically by number.
c. By date.
3. Pick a random number between 1 and your sampling interval.
a. This will be the starting value. For example, 1,243.
4. Go down the list of cumulative balances until you pass your random number.
a. For example, test the first account that passes 1,243.
5. Continue down the list of cumulative balances until you pass the next sampling
interval.
a. For example, test the second account that passes 1,243 + 7,575 = 8,818.
6. Repeat step 5 until you run out of accounts.
a. 8,818 + 7,575 = 16,393; 16,393 + 7,575 = 23,968 . . .
In IDEA
1. Open your data table.
2. Go to Analysis > Sample > Monetary Unit > Plan.
a. Choose your monetary value field.
b. Set your confidence level, tolerable error, and expected error.
c. Click Estimate to calculate your sample size.
d. Adjust other values as needed, then click Accept.
e. Click OK.
3. A new worksheet will appear with your sample transactions.
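The interval-based selection described in the Excel steps can also be sketched in Python; the account names, balances, and sample size here are made up for illustration:

import random

accounts = [("A", 3000), ("B", 9500), ("C", 1200), ("D", 7800), ("E", 4500)]
book_value = sum(balance for _, balance in accounts)   # 26,000
sample_size = 3
interval = book_value // sample_size                    # sampling interval

random.seed(1)
start = random.randint(1, interval)                     # random starting point
selection_points = [start + i * interval for i in range(sample_size)]

# Walk the cumulative balances and select the account containing each selected monetary unit
selected, cumulative = [], 0
points = iter(selection_points)
target = next(points, None)
for name, balance in accounts:
    cumulative += balance
    while target is not None and cumulative >= target:
        selected.append(name)
        target = next(points, None)

print(selected)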
PROGRESS CHECK
3. What type of descriptive analytics would you use to find negative numbers that
were entered in error?
4. How does monetary unit sampling help you isolate the items of greatest potential
significance to an auditor in evaluating materiality?
Z-Score
A standard score or Z-score is a concept from statistics that assigns a value to a number
based on how many standard deviations it stands from the mean, shown in Exhibit 6-1.
By setting the mean to 0, you can see how far a point of interest is above or below it. For
example, a point with a Z-score of 2.5 is two-and-a-half standard deviations above the
mean. Because most values that come from a large population tend to be normally
distributed (though frequently skewed toward smaller values in the case of financial
transactions), nearly all (roughly 99.7 percent) of the values should be within plus-or-minus
three standard deviations. If a value has a Z-score of 3.9, it is very likely an outlier that
warrants scrutiny.
In Excel
1. Calculate the average: =AVERAGE([range]).
2. Calculate the standard deviation: =STDEVPA([range]).
3. Add a new column called “Z-score” next to your number range.
4. Calculate the Z-score: =STANDARDIZE([value],[mean],[standard deviation])
a. Alternatively: =([value]–[mean])/[standard deviation].
5. Sort your values by Z-score in descending order.
In IDEA
• Z-score calculation is not a default feature of IDEA.
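Because IDEA does not calculate Z-scores directly, and as an alternative to the Excel formulas above, a quick Python sketch (the amounts are invented) is:

import pandas as pd

amounts = pd.Series([120, 135, 128, 140, 2250], name="amount")

# Standardize: (value - mean) / population standard deviation
z_scores = (amounts - amounts.mean()) / amounts.std(ddof=0)
print(z_scores.sort_values(ascending=False))  # the 2,250 entry stands out as an outlier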
Benford’s Law
Benford’s law states that when you have a large set of naturally occurring numbers, the
leading significant digit will likely be small. The economic intuition behind it is that
people are more likely to make $10, $100, or $1,000 purchases than $90, $900, or $9,000
purchases. This law has been shown in many settings, such as the amount of electricity
bills, street addresses, and GDP figures from around the world (as shown in Exhibit 6-2).
[Exhibit 6-2 chart: frequency of leading digits 1 through 9 for purchases and 2016 GDP figures compared with Benford's predicted distribution]
In auditing, we can use Benford’s law to identify transactions or users with nontypical
activity based on the distribution of the first digit of the number. For example, assume
that purchases over $500 require manager approval. A cunning employee might try to
make large purchases that are just under the approval limit to avoid suspicion. She will
even be clever and make the numbers look random: $495, $463, $488, etc. What she
doesn't realize is that the frequency of the leading digit 4 is going to be much higher
than it should be, as shown in Exhibit 6-3. Benford's law can also detect random
computer-generated numbers because those will have equally distributed first digits.
We show an illustration of how to evaluate data and their frequency with respect to
Benford’s law in both Excel and IDEA.
EXHIBIT 6-3 Using Benford's Law
Structured purchases may look normal, but they alter the distribution under Benford's law.
[Chart: frequency of leading digits 1 through 9 for purchases compared with Benford's predicted distribution]
Data
• Large set of numerical data, such as monetary amounts or quantities
In Excel
1. Open your spreadsheet.
2. Add a new column and extract the leading digit: =LEFT([Amount],1).
3. Create a frequency distribution:
a. Create a list on your sheet using the values shown in Table 6-4 below.
In IDEA
1. Open your worksheet.
2. Go to Analysis > Explore > Benford’s Law.
a. Choose the numerical field to analyze.
b. Only check First digit. Uncheck everything else.
c. Click OK.
3. A graph will appear with the Benford’s expected amount and the actual frequency
of the dataset.
4. Click any digits that are significantly above the bounds and choose Extract Records.
Bonus: Use the average expected Benford’s law value to identify specific employees
with abnormally large transactions. In this case, a user with lots of transactions should
have an average expected Benford’s law percentage of 11.1 percent or above. Employees
whose average purchases are closer to 8 or 5 percent have a lot of 7, 8, and 9 values that
are skewing their average.
In Excel
1. Open your spreadsheet with financial data that contain an employee name and
transaction amount.
2. Add a new column and extract the leading digit:
=NUMBERVALUE(LEFT([Amount],1))
3. Add the expected Benford's law percentages to your sheet similar to Table 6-5 below:
TABLE 6-5 Expected Benford's Law Percentages

Digit   Benford Expected %
1       30.1%
2       17.6%
3       12.5%
4       9.6%
5       7.9%
6       6.7%
7       5.8%
8       5.1%
9       4.6%
4. Add a new column next to your data to look up the expected Benford’s law percentage
for your value: =INDEX([Benford Expected %], MATCH([Value],[Digit],0)).
5. Create a PivotTable to see the average % by user:
a. Select your data.
b. Go to Insert > Tables > PivotTable.
c. Click OK to add the PivotTable to a new sheet.
In IDEA
• This is not possible with IDEA's built-in tools.
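Although the per-employee comparison is not built into IDEA, the core calculation is straightforward in Python; the amounts below are invented, and the expected frequency of a leading digit d is log10(1 + 1/d):

import math
import pandas as pd

amounts = pd.Series([495, 463, 488, 1200, 310, 87, 432, 2150, 498, 76])

# Actual frequency of each leading digit
first_digit = amounts.astype(str).str[0].astype(int)
actual = first_digit.value_counts(normalize=True).sort_index()

# Benford's expected frequency for digits 1-9
benford = pd.Series({d: math.log10(1 + 1 / d) for d in range(1, 10)})

comparison = pd.DataFrame({"actual": actual, "benford": benford}).fillna(0)
print(comparison)  # the digit 4 is overrepresented relative to Benford's prediction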
Drill Down
Most modern Data Analytics software allows auditors to drill down into specific
values by simply double-clicking a value. This lets you see the underlying transactions
that gave you the summary amount. For example, you might click the total sales amount
in an income statement to see the sales general ledger summarizing the daily totals. Click
a daily amount to see the individual transactions from that day.
Fuzzy Matching
Data needed
• Two tables/sheets with a common attribute, such as a primary key/foreign key, name,
or address
In Excel
1. Search the Internet for Fuzzy Lookup Add-In for Excel, then download and install it
to your computer.
2. Open your spreadsheet with two sheets you’d like to join using a fuzzy match. For
example, employees and vendors.
3. Go to Fuzzy Lookup > Fuzzy Lookup (Go to File > Options > Add-ins > COM Add-ins
> Go. . . and check Fuzzy Lookup Add-in For Excel if you don’t see the bar).
a. Select the sheet you want for the Left Table and a sheet that has similar values
for the Right Table.
b. Choose the columns that you expect to find matching values in the Left and
Right Columns pane. Note: For addresses, choose Address AND Zip Code for
more likely matches.
c. Select your output columns, if needed.
d. Adjust the similarity threshold, if needed.
e. Open a new worksheet.
f. Click Go.
4. Evaluate the similarity.
In IDEA
1. Fuzzy matching isn’t available by default in IDEA.
Sequence Check
Another substantive procedure is the sequence check. This is used to validate data
integrity and test the completeness assertion, making sure that all relevant transactions are
accounted for. Simply put, sequence checks are useful for finding gaps, such as a missing
check in the cash disbursements journal, or duplicate transactions, such as duplicate
payments to vendors. This is a fairly simple procedure that can be deployed quickly and
easily with great success.
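To make the idea concrete, a minimal gap-and-duplicate check can be sketched in Python over a simple list of check numbers (the numbers below are invented):

# Hypothetical check register
check_numbers = [1001, 1002, 1003, 1005, 1006, 1006, 1008]

present = set(check_numbers)
gaps = [n for n in range(min(present), max(present) + 1) if n not in present]
duplicates = sorted({n for n in check_numbers if check_numbers.count(n) > 1})

print("Missing check numbers:", gaps)          # [1004, 1007]
print("Duplicate check numbers:", duplicates)  # [1006]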
In Excel
PROGRESS CHECK
5. A sequence check will help us to see if there is a duplicate payment to
vendors. Why is that important for the auditor to find?
6. Let’s say a company has nine divisions, and each division has a different
check number based on its division—so one starts with “1,” another with “2,”
etc. Would Benford’s law work in this situation?
Regression
Regression allows an auditor to predict a specific dependent value based on independent
variable inputs. In other words, what would we expect behavior to be given some inputs
and does that match reality? In auditing, we could evaluate overtime booked for workers
against productivity or the value of inventory shrinkage given environmental factors.
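A bare-bones sketch of that idea in Python (the production and overtime figures are invented) fits a simple line and then looks at the residuals, that is, where actual behavior departs from what the model expects:

import numpy as np

units_produced = np.array([100, 120, 150, 170, 200, 220])  # independent variable
overtime_hours = np.array([5, 7, 10, 12, 16, 30])          # dependent variable

slope, intercept = np.polyfit(units_produced, overtime_hours, 1)
expected = slope * units_produced + intercept
residuals = overtime_hours - expected

print(np.round(residuals, 1))  # the last observation books notably more overtime than expected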
Classification
Classification in auditing is going to be mainly focused on risk assessment. The predicted
classes may be low risk or high risk, where an individual transaction is classified in either
group. In the case of known fraud, auditors would classify those cases or transactions as
fraud/not fraud and develop a classification model that could predict whether similar
transactions might also be potentially fraudulent.
There is a longstanding classification method used to predict whether a company is
expected to go bankrupt or not. Altman's Z is a calculated score that helps predict
bankruptcy and might be useful for auditors to evaluate a company's ability to continue as a
going concern.
When using classification models, it is important to remember that large training sets
are needed to generate relatively accurate models. Initially, this requires significant manual
classification by the auditors or business process owner so that the model can be useful for
the audit.
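As a sketch of how such a score can be computed, Altman's original model for publicly traded manufacturers weights five financial ratios; the figures passed in below are invented, and the commonly cited cutoffs (roughly 1.8 and 3.0) are guidelines rather than hard rules:

def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    # Altman's original weights for public manufacturing companies
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5

# Hypothetical company: a score above about 3.0 suggests low bankruptcy risk,
# while a score below about 1.8 has historically signaled distress
z = altman_z(working_capital=50, retained_earnings=120, ebit=40,
             market_value_equity=300, sales=500,
             total_assets=400, total_liabilities=200)
print(round(z, 2))  # about 3.05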
Probability
When talking about classification, the strength of the class can be important to the
auditor, especially when trying to limit the scope (e.g., evaluate only the 10 riskiest
transactions). Classifiers that use a rank score can identify the strength of classification
by measuring the distance from the mean. That rank order focuses the auditor’s efforts on
the items of potentially greatest significance.
Sentiment Analysis
Sentiment analysis evaluates text (e.g., a 10-K or annual report) for positive or negative
sentiment to predict positive or negative outcomes or to look for potential bias on
management's part. There is more discussion of sentiment analysis in chapter 8.
Applied Statistics
Additional mixed distributions and nontraditional statistics may also provide insight to
the auditor. For example, an audit of inventory may reveal errors in the amount recorded
in the system. The difference between the error amounts and the actual amounts may
provide some valuable insight into how significant or material the problem may be.
Auditors can plot the frequency distribution of errors and use Z-scores to home in on the
cause of the most significant or outlier errors.
Artificial Intelligence
As the audit team generates more data and takes specific action, the action itself can
be modeled in a way that allows an algorithm to predict expected behavior. Artificial
intelligence is designed around the idea that computers can learn about action or behavior
from the past and predict the course of action for the future. Assume that an experienced
auditor questions management about the estimate of allowance for doubtful accounts. The
human auditor evaluates a number of inputs, such as the estimate calculation, market
factors, and the possibility of income smoothing by management. Given
these inputs, the auditor decides to challenge management’s estimate. If the auditor
consistently takes this action and it is recorded by the computer, the computer learns
from this action and makes a recommendation when a new inexperienced auditor faces
a similar situation.
Decision support systems that accountants have relied upon for years (e.g., TurboTax)
are based on a formal set of rules and then updated based on what the user decides given
several choices. Artificial intelligence can be used as a helpful assistant to auditors and
may potentially be called upon to make judgment decisions itself.
Additional Analyses
The list of Data Analytics techniques presented in this chapter is not exhaustive by any means. There
are many other approaches to identifying interesting patterns and anomalies in enterprise
data. Many ingenious auditors have developed automated scripts that can simplify several
of the audit tasks presented here. Excel add-ins like TeamMate Analytics provide many
different techniques that apply specifically to the audit of fixed assets, inventory, sales
and purchase transactions, etc. Auditors will combine these tools with other techniques,
such as periodically testing the effectiveness of automated tools by adding erroneous or
fraudulent transactions, to enhance their audit process.
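As one small illustration of such a script (and not the TeamMate tool itself), the following Python sketch flags payments that share an invoice reference; the field names and records are hypothetical.

from collections import Counter

# Hypothetical payment records extracted from the payments table.
payments = [
    {"check_no": 1001, "vendor": "V010", "invoice": "INV-551", "amount": 4200.00},
    {"check_no": 1002, "vendor": "V027", "invoice": "INV-552", "amount": 975.50},
    {"check_no": 1003, "vendor": "V010", "invoice": "INV-551", "amount": 4200.00},
]

invoice_counts = Counter(p["invoice"] for p in payments)
for p in payments:
    if invoice_counts[p["invoice"]] > 1:
        print(f"check {p['check_no']}: invoice {p['invoice']} appears to be paid more than once")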
PROGRESS CHECK
7. Why would a bankruptcy prediction be considered classification? And why
would it be useful to auditors?
8. If sentiment analysis is used on a product advertisement, would you guess the
overall sentiment would be positive or negative?
Summary
This chapter discusses a number of analytical techniques that auditors use to gather
insight about controls and transaction data. These include descriptive analytics that are
used to summarize and gain insight into the data, diagnostic analytics that identify
patterns in the data that may not be immediately obvious, predictive analytics that look
for common attributes of problematic data to help identify similar events in the future,
and prescriptive analytics that provide decision support to auditors as they work to
resolve issues with the processes and controls.
Key Words
computer-assisted audit techniques (CAATs) (212) Computer-assisted audit techniques
(CAATs) are automated scripts that can be used to validate data, test controls, and enable substantive
testing of transaction details or account balances and generate supporting evidence for the audit.
descriptive analytics (212) Descriptive analytics summarize activity or master data elements
based on certain attributes.
diagnostic analytics (212) Diagnostic analytics look for correlations or patterns of interest in the data.
fuzzy matching (213) Fuzzy matching finds matches that may be less than 100 percent matching by
finding correspondences between portions of the text or other entries.
monetary unit sampling (MUS) (218) Monetary unit sampling allows auditors to evaluate account
balances. MUS is more likely to pull accounts with large balances (higher risk and exposure) because it
focuses on dollars, not account numbers.
predictive analytics (212) Predictive analytics attempt to find hidden patterns or variables that are
linked to abnormal behavior.
prescriptive analytics (212) Prescriptive analytics use machine learning and artificial intelligence for
auditors as decision support to assist future auditors in finding potential issues in the audit.
8. Which testing approach would be used to predict whether certain cases should be
evaluated as having fraud or no fraud?
a. Classification
b. Probability
c. Sentiment analysis
d. Artificial intelligence
9. Which testing approach would be useful in assessing the value of inventory
shrinkage given multiple environmental factors?
a. Probability
b. Sentiment analysis
c. Regression
d. Applied statistics
10. What type of analysis would help auditors find missing checks?
a. Sequence check
b. Benford’s law analysis
c. Fuzzy matching
d. Decision support systems
Discussion Questions
1. How do nature, extent, and timing of audit procedures help us identify when to apply
Data Analytics to the audit process?
2. When do you believe that Data Analytics will add value to the audit process? How
can it most help?
3. Using Table 6-2 as a guide, compare and contrast predictive and prescriptive
analytics. How might these be used in an audit? Or a continuous audit?
4. An example of prescriptive analytics is when an action is recommended based on
previously observed actions. For example, an analysis might help determine
procedures to follow when new accounts are opened for inactive customers, such as
requiring supervisor approval. How might this help address a potential audit issue?
5. One type of descriptive analytics is simply sorting data. Why is seeing extreme values (minimums, maximums, counts, etc.) helpful in evaluating accuracy and completeness and in potentially finding errors, fraud, and the like?
Problems
1. One type of descriptive analytics is age analysis. Why are auditors particularly interested in the aging of accounts receivable and accounts payable? How does this analysis help evaluate management judgment on collectability of receivables and potential payment of payables? Would a dashboard item reflecting this aging be useful in a continuous audit?
2. One of the benefits of Data Analytics is the ability to see and test the full population.
In that case, why is sampling (even monetary sampling) still used, and how is it
useful?
3. What does a Z-score greater than three (or less than minus three) suggest? How is that useful in finding extreme values? What type of analysis should we do when we find extreme or outlier values?
4. What are some patterns that could be found using diagnostic analysis? Between
which types of variables?
5. In a certain company, one accountant records most of the adjusting journal entries at the end of the month. What type of analysis could be used to identify that this happens and the cumulative size of the transactions that the one accountant records? Is this a problem? If not, when would it be?
6. Which distributions would you recommend be tested using Benford’s law? What
would a Benford’s law evaluation of sales transaction amounts potentially show?
What would a test of vendor numbers or employee numbers show? Anything different
from a test of invoice or check numbers? Any cases where Benford’s law wouldn’t
work?
7. How could artificial intelligence be used to help with the evaluation of the estimate for
the allowance for doubtful accounts? Could past allowances be tested for their
predictive ability that might be able to help set allowances in the current period?
8. How do you think sentiment analysis of the 10-K might assess the level of bias
(positive or negative) of the annual reports? If management is too positive about the
results of the company, can that be viewed as being neutral or impartial?
Lab 6-1 Evaluate the Master Data for Interesting Addresses
You’re starting to make a name for yourself in the internal audit department. Your
manager liked your analysis of the audit plan and now would like you to see what other
ways data analytics could be applied beyond the existing audit action sheets.
As you’ve been reading about risk and fraud, you learned that one common risk is that
employees may be tempted to create fictitious suppliers that they use to embezzle money.
The premise is simple enough. An employee with access to create master data adds a supplier record for a spouse. She then submits an invoice for "cleaning services" that were never performed and is promptly paid, assuming there isn't good follow-up from the accounts payable department. The employee is smart enough to know that an exact
address would raise red flags, so she alters it slightly to avoid detection. Other suspicious
addresses may include PO Box addresses because they can obscure the identity of a
fictitious supplier.
You know that one way to detect this issue is to look for fuzzy matches, and you’re
eager to show your manager what you know. Refer to Lab 3-2 for another example. This
lab assumes you have completed Lab 5-1.
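Before working through the Excel and IDEA steps below, it may help to see the logic of a fuzzy address match in a short Python sketch; the addresses and the 0.85 similarity cutoff are hypothetical and illustrative only, and Python is not required for this lab.

from difflib import SequenceMatcher

# Hypothetical addresses pulled from the supplier and employee master data.
supplier_addresses = ["1402 Maple Street, Springdale", "PO Box 118, Rogers"]
employee_addresses = ["1402 Maple St., Springdale", "77 Birch Lane, Fayetteville"]

for s in supplier_addresses:
    for e in employee_addresses:
        score = SequenceMatcher(None, s.lower(), e.lower()).ratio()
        if score >= 0.85:   # illustrative threshold
            print(f"possible fictitious supplier: '{s}' ~ '{e}' ({score:.0%} similar)")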
Techniques
• Data preparation
• Filtering
• Fuzzy matching
Software needed
• Excel
In this lab, you will:
Part 1: Identify the questions.
Part 2: Master the employee and vendor data.
Part 3: Perform the analysis.
Part 4: Address the results.
In IDEA
1. Download the P2P IDEA Audit Data from Connect, as directed.
2. Unzip the file on your computer.
3. Open IDEA and go to Home > Projects > Select.
4. Click the External Projects tab, then navigate to your downloaded P2P IDEA Audit
Data project folder.
5. Click OK.
6. Take a screenshot (label it 6-1B).
In IDEA
1. Open your Supplier_Listing table.
2. Go to Data > Search > Search.
a. Text to find: box
b. Fields to look in: Supplier_Physical_Street_Address1
c. Click OK.
3. Take a screenshot (label it 6-1D).
In Excel
1. Click the drop-down arrow next to the Address field, and choose Clear Filter
From “Supplier_Physical_Street_Address1”.
2. Perform a fuzzy match on the Supplier_Physical_Street_Address1, and Supplier_Physical_ZipPostalCode from the Suppliers sheet and the User_Physical_Street_Address1 and User_Physical_Street_ZipPostalCode from the Users sheet. Refer to the
example in chapter 6 or Lab 3-2 for specific step-by-step instructions.
3. Take a screenshot (label it 6-1E).
In IDEA
IDEA doesn’t support fuzzy matching directly, but this works with a few steps by
merging the supplier and user tables, and then looking for fuzzy duplicate records. The
resulting table will show duplicate records that will match despite not being exact.
1. Open the Supplier_Listing table.
2. Click Data > Append.
a. Field name: Type
b. Field type: Virtual Character
c. Length: 20
d. Parameter: “Supplier”
e. Click OK.
3. Open the User_Listing table.
4. Click Data > Append.
a. Field name: Type
b. Field type: Virtual Character
c. Length: 20
d. Parameter: “Employee”
e. Click OK.
5. Go to Analysis > Relate > Append.
a. Double-click Supplier_Listing.
b. Click OK.
6. Go to Data > Append.
a. Field name: Combo_Address
b. Field type: Virtual Character
c. Length: 100
d. Parameter: = Supplier_Physical_Street_Address1 +
User_Physical_Street_Address1
e. Click OK.
7. In your new Append Databases table, click Analysis > Explore > Duplicate Key > Fuzzy.
a. Output: Fuzzy matches
b. Similarity degree (%): Adjust as needed
c. Key: Combo_Address
d. Click OK.
8. Take a screenshot (label it 6-1F).
Q5. How many fuzzy matches appeared?
Q6. Which of the matches are suspicious?
Q7. Which of the matches are normal?
End of Lab
4. In the PivotTable Fields window, click All to view both tables in the workbook.
5. Create a PivotTable that shows the Sales_Order_Total and the Receipt_Total for
each Sales_Order_ID.
6. The data will look odd at first, and you will be prompted to create relationships.
You can allow Excel to auto-detect the relationships, and it will identify the
relationship between the Primary and Foreign Keys that exist between the two
tables.
Q4. What are the primary and foreign keys that relate the two tables in this workbook?
7. After creating the relationships, the top few records of your PivotTable output
should look like the following:
8. Copy the data in the PivotTable to a new spreadsheet to convert the PivotTable data to
a range. Doing so will allow us to identify which of the invoices have not yet been paid in full. You can ensure that you're copying only the range by selecting and copying all of the data in the PivotTable, except for the last row containing the
Grand Total.
9. Add a column to your new range, and calculate the difference between the
Sales_Order_Total and the Receipt_Amount.
10. Add a filter to the Difference column, and filter out all values that appear as 0’s.
This will allow you to view all of the invoices that haven’t been paid in full yet.
11. This data can be made more interesting by identifying how late the payments
are. Return to the Cash_Received spreadsheet in your workbook.
12. Add a new column to the Cash_Received table called Sales_Order_Date. This will
allow you to easily compare the date of the original Sales Order to the date of the
payment.
13. Use a False VLookup formula to look up the date that corresponds with the
Sales_Order_ID that each cash receipt corresponds to.
14. Now that you have the Sales_Order_Date easily accessible, you can create another
column to calculate the difference between the dates. Create a new column labeled
Age, and subtract the Sales_Order_Date from the Cash_Receipt_Date.
15. Your next step is to create a True VLookup formula to assign each cash receipt to an
aging bucket. Create an aging table with the following two columns somewhere on your spreadsheet (the first column holds the threshold value that the approximate-match VLookup will use; the second holds the bucket label):
0 0–30
30 31–60
60 61–90
90 >90
16. Add another new column to the Cash_Received table labeled Bucket, and create
a True VLookup formula to identify the bucket for each invoice.
17. We can quickly create a summary of how many invoices fall into each bucket
using Excel’s COUNTIF function. In the column to the right of your aging table,
create a column labeled Count.
18. In the cell to the right of your 0–30 bucket, type the COUNTIF function. COUNTIF requires two arguments, range and criteria. The range in this case is the Buckets column, and the criteria is 0–30. COUNTIF will count every instance of 0–30 in the Buckets column. (A scripted version of this aging-and-count logic appears after the end of this lab.)
19. Repeat the steps for the remaining three buckets. The top two records in the
Count column should return the following data:
20. Return to your PivotTable, and refresh the data so that you can pull in your new
fields for further analysis. You can refresh your data by clicking the Refresh button
in the Analyze tab from the ribbon.
21. You should now be able to add the Buckets field to the PivotTable. Do so.
22. Collapse the fields so that you do not see the detail of each invoice within the
buckets, but only the totals. The top two records of the PivotTable will appear as the
following:
25. Repeat the same steps as you did above in the new dataset.
a. Create a PivotTable that shows the Sales_Order_Total and Receipt_Amount
for each Sales_Order_ID.
i. Remember to use the Internal Data Model and to build relationships so that the
data in your PivotTable is accurate.
b. Create a range from your PivotTable data and calculate the difference between
the Sales_Order_Total and the Receipt_Amount. Filter the Difference column to
show only the invoices that haven’t been paid in full yet.
c. Return to the Cash_Received table and create the additional columns so that
you can identify the aging bucket for each invoice.
d. Create a PivotTable to identify which invoices fall into each bucket.
26. Save your file as Lab6-2December.xlsx, ensuring that the PivotTable with buckets is included in your final spreadsheet.
End of Lab
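For readers who want the scripted version referenced in step 18, here is a minimal Python sketch of the lab's aging-bucket logic; the order and receipt dates are hypothetical, and the buckets mirror the lab's 0–30, 31–60, 61–90, and >90 ranges.

from datetime import date
from collections import Counter

# Hypothetical (sales order date, cash receipt date) pairs.
receipts = [
    (date(2020, 1, 5), date(2020, 1, 20)),
    (date(2020, 1, 10), date(2020, 3, 1)),
    (date(2020, 2, 2), date(2020, 6, 30)),
]

def bucket(age_days):
    if age_days <= 30:
        return "0-30"
    if age_days <= 60:
        return "31-60"
    if age_days <= 90:
        return "61-90"
    return ">90"

counts = Counter(bucket((received - ordered).days) for ordered, received in receipts)
for label in ["0-30", "31-60", "61-90", ">90"]:
    print(f"{label:>5}: {counts.get(label, 0)} invoices")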
Technique
• Search for duplicates
Software needed
• Excel or IDEA
5. Select all of the data, choose Home > Styles > Format as Table, and pick a light,
non-banded theme.
6. Click the drop-down next to Invoice_Reference, choose Filter by color. . . , and
select the highlight color used in step 4.
7. Take a screenshot (label it 6-3A).
8. Remove the filter on Invoice_Reference and repeat steps 4–6 on the Payment_Amount
column.
In IDEA
1. Open the P2P IDEA Audit Data project in IDEA.
2. Open the Payments_Made table.
3. Go to Analysis > Explore > Duplicate Key > Detection.
a. Click Output duplicate records.
b. Key: Invoice_Reference.
c. Click OK.
4. Take a screenshot (label it 6-3B).
5. Repeat steps 2–3 on the Payment_Amount column.
Q3. How many duplicate records did you locate?
Q4. What course of action would you recommend?
End of Lab
Part 1: Identify the Questions
January sales are associated with Christmas. Most retail establishments have fairly
generous return policies in case a gift received was the wrong size or just not the desired
item. Do retail companies have the same generous policies throughout the year, and do
customers apply them throughout the year?
Therefore, the specific question that we hope to test is whether there is a significant difference in the amount of returns in January compared to the rest of the year.
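Before building the test in Excel, it may help to see the shape of the analysis in a minimal Python sketch using scipy; the daily refund-to-purchase ratios below are hypothetical and are not the Dillard's data.

from scipy.stats import ttest_ind

# Hypothetical daily refund-to-purchase (R/P) ratios.
january_rp = [0.21, 0.18, 0.25, 0.22, 0.19, 0.24, 0.20]
rest_of_year_rp = [0.09, 0.11, 0.08, 0.12, 0.10, 0.07, 0.11]

t_stat, p_value = ttest_ind(january_rp, rest_of_year_rp, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests January returns differ from the rest of the year.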
3. Change the Values Column drop-down to Amount in the Pivot Column window,
then click OK.
4. Now that the data have been transformed, you can load them into Excel. Click
Close & Load from the Home tab. It will take a moment for all of the data (1,014
rows) to load into Excel.
5. Create a PivotTable by clicking PivotTable from the Insert tab on the Excel ribbon.
6. Even though you have loaded the data into Excel, you have not added it to Excel’s
Internal Data Model. To do so, place a check mark in the box next to Add this data
to the Data Model in the Create PivotTable window.
7. To create a measure for Refunds over Purchases, select Measures > New Measure. . .
from the PowerPivot tab in the Excel ribbon.
8. The new measure’s name defaults to Measure 1, which isn’t very descriptive.
Because we’ll be measuring average Transaction amount, we’ll change the name to
R/P. Type R/P over the default text.
9. The formula will auto-populate as you type; begin typing SUM, then fill in the
remainder of the measure to divide the refund transactions by the purchase transactions:
=sum(Query1[R])/SUM(Query1[P]).
10. At the bottom of the Measure window is an option to select a category. The
Category has no bearing on how the measure or the KPI will work. For this
measure, we’ll leave it on the default of General. Click OK to create the measure.
11. Now that the measure is created, it has been added to the PivotTable Fields
window. Create a PivotTable to view only the January dates (place
Tran_Date(Month) in the filter) and days along the rows. Use the new measure
you created, R/P, as the value.
Parsing out month and day will require placing Tran_Date in the Rows area first, then removing the Year and Quarter attributes that automatically populate. Drag and drop Tran_Date(Month) to the filter, and keep the Tran_Date attribute in the rows.
This PivotTable will provide the data we need for one part of our hypothesis test
— the values from all January dates in the database. Now we need to separate the
values from all non-January dates in the database. We’ll do this by copying the
PivotTable you just created, and modifying the filter.
12. Select the entire PivotTable (including the Filter cells), and copy the selection.
13. Place your cursor in cell D1, and paste the PivotTable there.
14. Now you can modify the filter. Place a check mark in the box next to Select
Multiple Items, then scroll to the top of the filter options to select All. Finally,
scroll down to take the check mark out of the box next to January. This will
provide the data for all transactions, except for the items that are from January.
15. Take a screenshot of your results (label it 6-4A).
16. To clarify the difference between the two PivotTables, you can rename the labels
that say Sum of R/P in each table. Place your cursor inside the cell with the Sum of R/P label, and type January in one table and Rest of the Year in the other:
19. Follow the same pattern for Variable 2 by selecting all of the data that
correspond with the second PivotTable’s values.
20. Place a check mark in the box next to Labels to ensure that the labels for the
data (January and Rest of the Year) show up in the resulting output, and click
OK.
Part 4: Address and Refine the Results
Q1. Using the p-values (or the t-statistic and critical values), are the returns as a percentage of sales in January greater than, less than, or the same as the returns as a percentage of sales for the rest of the year?
Q2. What can we conclude about returns?
Q3. Do you think most Christmas sales are returned in January, or do they
also occur in early January? How would you modify your tests to take
this into account?
End of Lab
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard's data cover
all transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
• Tableau (available on the Remote Desktop at the University of Arkansas)
Prerequisite
• Lab 6-4. This lab requires some of the skills covered in Lab 6-4 for steps 1–4. If
you haven’t completed Lab 6-4, then you can still read through the steps in that lab
to see the screenshots of the ETL process in Excel.
• Lab 4-2. Some Tableau skills from Lab 4-2 are also expected. If you haven’t
completed Lab 4-2, you can still read through the steps in that lab to learn the basics
of how to build a map and a dashboard in Tableau.
7. Name your new field R/P, and create the calculation SUM([R])/SUM([P]), then click OK.
End of Lab
Chapter 7
Generating Key Performance
Indicators
A Look Back
In chapter 6, we focused on substantive testing within the audit setting. We highlighted discussion of the audit plan and how account balances are checked. We also highlighted the use of statistical analysis to find errors or fraud in the audit setting, as well as the use of clustering to detect outliers and the use of Benford's analysis.
A Look Ahead
In chapter 8, we will focus on how to access and analyze financial statement data. The data are accessed via
XBRL in a quick and efficient manner. We also discuss how ratios are used to analyze financial performance,
and how sparklines help users visualize trends in the data. Finally, we discuss the use of text mining to analyze
the sentiment in financial reporting data.
For years, Kenya Red Cross had attempted to refine its strategy and align its daily activities with its overall
strategic goals. It had annual strategic planning meetings with external consultants that always resulted in the
consultants presenting a new strategy to the organization that the Red Cross didn't have particularly strong buy-in to, and the Red Cross never felt confident in what was developed or what it would mean for its future. When Kenya Red Cross went through a Data Analytics–backed Balanced Scorecard planning process for the first time, though, it immediately felt that its organization's mission and vision were involved in the strategic planning and that "strategy" was no longer so vague. The Balanced Scorecard approach helped the Kenya Red Cross translate its
goals into measurable metrics. The organization prided itself on being “first in and last out” but hadn’t actively
measured its success in that goal, nor had the organization fully analyzed how being the first in and last out of
disaster scenarios affected other goals and areas of its organization. Using Data Analytics to refine its strategy and
assign measurable performance metrics to its goals, Kenya Red Cross felt confident that its everyday activities
were linked to measurable goals that would help the organization reach those goals and maintain a strong positive reputation and impact through its service. Exhibit 7-1 gives an illustration of the Balanced Scorecard at the Kenya Red Cross.
OBJECTIVES
After reading this chapter, you should be able to:
LO 7-1 Evaluate management requirements and identify useful KPIs from a list
LO 7-2 Evaluate underlying data quality used for KPI
LO 7-3 Create a dashboard using KPIs
In the past six chapters, you learned how to apply the IMPACT model to data analysis
projects in general and, specifically, to internal and external auditing and financial statement analysis. The same accounting information used in internal and external auditing
and financial statement analysis can also be used to determine how closely an
organization is meeting its strategic objectives. In order to better determine the gaps in
actual company performance and targeted strategic objectives, data should be condensed
into easily digest- ible and useful digital dashboards, providing precisely the information
needed to help make operational decisions that support a company’s strategic direction.
This chapter brings us to how to apply Data Analytics to measuring performance.
More specifically, we measure past performance and compare it to targeted goals to
assess how well a company is working toward a goal. In addition, we can determine
required adjustments to how decisions are made or how business processes are run, if
any.
Because data are increasingly available and affordable for companies to access and
store, and because the growth in technology has created robust and affordable business
intelligence tools, data and information are becoming the key components for decision
making, replacing gut response. Specifically, various measures and metrics are defined,
compiled from the data, and used for decision making. Performance metrics are, rather
simply, any number used to measure performance at a company. The amount of
inventory on hand is a metric, and that metric gains meaning when compared to a
baseline (e.g., how much inventory was on hand yesterday?). A specific type of
performance metric is key performance indicators (KPIs). Just like any performance
metric, a KPI should help managers keep track of performance and strategic objectives,
but the KPIs are performance metrics that stand out as the most important—that is, “key”
metrics that influence decision making and strategy. Nearly every organization can use
data to create the same performance metrics (although, of course, with different results),
but which performance metrics an organization deems to be KPIs depends upon that organization's particular strategy.
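To make the distinction concrete, here is a minimal Python sketch that compares a raw metric (inventory on hand) against a baseline and a target; every figure is hypothetical.

# Hypothetical values for a single performance metric.
inventory_on_hand_today = 1180      # units
inventory_on_hand_yesterday = 1250  # baseline for comparison
inventory_target = 1000             # level management treats as the goal

change_vs_baseline = inventory_on_hand_today - inventory_on_hand_yesterday
pct_over_target = (inventory_on_hand_today - inventory_target) / inventory_target

print(f"Change since yesterday: {change_vs_baseline:+d} units")
print(f"Inventory is {pct_over_target:.0%} above the target level")

The raw count by itself says little; the comparisons are what turn it into a performance metric, and it becomes a KPI only if management decides it is central to strategy.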
As you will recall from chapter 4, the most effective way to communicate the results
of any data analysis project is through data visualization. A project in which you are
determining the right KPIs and communicating them to the appropriate stakeholders is no different. One of the most common ways to communicate a variety of KPIs is through a digital dashboard. A digital dashboard is an interactive report showing the most important metrics to help users understand how a company or an organization is
performing. There are many public digital dashboards available; for example, the Walton
College of Business at the University of Arkansas has an interactive dashboard to
showcase enrollment, where students are from, where students study abroad, student
retention and graduation rates, and where alumni work after graduation
(https://walton.uark.edu/osie/reports/data-dashboard.php). The public dashboard
detailing student diversity at the Walton College can be used by prospective students to
learn more about the university and by the university itself to assess how it is doing in
meeting goals. If the university has a goal of increasing gender balance in enrollment,
for example, then monitoring the “Diverse Walton” metrics, pictured in Exhibit 7-2, can
help the university understand how it is doing at reaching that goal.
Digital dashboards provide interesting information, but their value is maximized when
the metrics provided on the dashboard are used to affect decision making and action. One
iteration of a digital dashboard is the Balanced Scorecard. The Balanced Scorecard was
created by Robert S. Kaplan and David P. Norton in 1996 to help companies turn their strategic goals into action by identifying the most important metrics to measure, as well as identifying target goals to compare metrics against.
The Balanced Scorecard is comprised of four components: financial (or stewardship),
customer (or stakeholder), internal process, and organizational capacity (or learning and
growth). As depicted in Exhibit 7-3, the measures in each category affect other
categories, and all four should be directly related to the strategic objectives of an
organization.
EXHIBIT 7-2
Walton College Digital
Dashboard—Diverse
Walton
EXHIBIT 7-3
Components of
the Balanced
Scorecard
Reprinted with permission
from Balanced Scorecard
Institute, a Strategy
Management Group Company.
Copyright 2008-2017.
For each of the four components, objectives, measures, targets, and initiatives are
identified. Objectives should be aligned with strategic goals of the organization,
measures are the KPIs that show how well the organization is doing at meeting its
objective, and targets should be achievable goals toward which to move the metric.
Initiatives should be the actions that an organization can take to move its specified metrics
in the direction of their stated target goal. Exhibit 7-4 is an example of different objectives
that an organization might identify for each component. You can see how certain
objectives relate to other objectives—for example, if the organization increases process
efficiency (in the internal process component row), that should help with the objective of
lowering cost in the financial component row.
EXHIBIT 7-4
An Example of a
Balanced Scorecard
Reprinted with permission
from Balanced Scorecard
Institute, a Strategy
Management Group Company.
Copyright 2008-2017.
EXHIBIT 7-5
48. Return on Innovation Investment (ROI2)
49. Time to Market
50. First-Pass Yield (FPY)
51. Rework Level
52. Quality Index
53. Overall Equipment Effectiveness (OEE)
54. Process or Machine Downtime Level
55. First Contact Resolution (FCR)
Employee Performance KPIs
56. Human Capital Value Added (HCVA)
57. Revenue per Employee
58. Employee Satisfaction Index
59. Employee Engagement Level
60. Staff Advocacy Score
61. Employee Churn Rate
62. Average Employee Tenure
63. Absenteeism Bradford Factor
64. 360-Degree Feedback Score
65. Salary Competitiveness Ratio (SCR)
66. Time to Hire
67. Training Return on Investment
Environmental and Social Sustainability KPIs
68. Carbon Footprint
69. Water Footprint
70. Energy Consumption
71. Saving Levels Due to Conservation and Improvement Efforts
72. Supply Chain Miles
73. Waste Reduction Rate
74. Waste Recycling Rate
75. Product Recycling Rate
If a strategy is already developed, or after the strategy has been fully defined, it needs
to be broken down into goals that can be measured. Identifying the pieces of the strategy
that can be measured is critical. Without tracking performance and measuring results, the
strategy is only symbolic. The adage "what gets measured, gets done" shows the motivation behind aligning strategy statements with KPIs—people are more inclined to focus
their work and their projects on initiatives that are being paid attention to and measured.
Of course, simply measuring something doesn’t imply that anything will be done to
improve the measure—the attainable initiative attached to a metric indicating how it can be
improved is a key piece to ensuring that people will work to improve the measure.
PROGRESS CHECK
1. To illustrate what KPIs emphasize in “what gets measured, gets done,”
Walmart has a goal of a “zero waste future.”2 How does reporting Walmart’s
waste recy- cling rate help the organization figure out if it is getting closer to its
goal? Do you believe it helps the organization accomplish its goals?
2. How can management identify useful KPIs? How could Data Analytics help
with that?
2
http://corporate.walmart.com/2016grr/enhancing-sustainability/moving-toward-a-zero-waste-future
(accessed August 2017).
EXHIBIT 7-6
Balanced Scorecard
Strategy Map Template
with Measures,
Targets, and Initiatives
If not following the strategy map template, the most important KPIs should be placed
in the top left corner, as our eyes are most naturally drawn to that part of any page that
we are reading.
PROGRESS CHECK
3. How often would you need to see the KPI of Waste Recycling Rate to know if
you are making progress? Any different for the KPI of ROA?
4. Why do you think that the most important KPIs should be shown in the top left
corner of a digital dashboard?
PROGRESS CHECK
5. Why are digital dashboards for KPIs an effective way to address and refine
results, as well as communicate insights and track outcomes?
6. Consider the opening vignette of the Kenya Red Cross. How do KPIs help the
organization prepare and carry out its goal of being the “first in and last out”?
Summary
■
In order to better determine the gaps in actual company performance and targeted strategic objectives, data should be condensed into easily digestible and useful digital dashboards providing precisely the information needed to help make operational
decisions that support a company’s strategic direction.
■
Because data are increasingly available and affordable for companies to access and
store, and because the growth in technology has created robust and affordable business
intelligence tools, data and information are becoming the key components for decision making, replacing gut response.
■
Performance metrics are defined, compiled from the data, and used for decision
making. A specific type of performance metric, the key performance indicator (KPI)—a "key" metric that influences decision making and strategy—is the most important.
■
One of the most common ways to communicate a variety of KPIs is through a digital
dashboard. A digital dashboard is an interactive report showing the most important metrics
to help users understand how a company or an organization is performing. Their value is
maximized when the metrics provided on the dashboard are used to affect decision making
and action.
■
One iteration of a digital dashboard is the Balanced Scorecard, which is used to help
companies turn their strategic goals into action by identifying the most important
metrics to measure, as well as identifying target goals to compare metrics against. The
Balanced Scorecard is comprised of four components: financial (or stewardship),
customer (or stakeholder), internal process, and organizational capacity (or learning
and growth).
■
For each of the four components, objectives, measures, targets, and initiatives are
identified. Objectives should be aligned with strategic goals of the organization, measures
are the KPIs that show how well the organization is doing at meeting its objective, and
targets should be achievable goals toward which to move the metric. Initiatives should
be the actions that an organization can take to move its specified metrics in the direction
of its stated target goal.
■
Regardless of whether you are creating a Balanced Scorecard or another type of digital
dashboard to showcase performance metrics and KPIs, the IMPACT model should be
used to complete the project.
Key Words
Balanced Scorecard (252) A particular type of digital dashboard that is made up of strategic objectives, as well as KPIs, target measures, and initiatives, to help the organization reach its target measures
in line with strategic goals.
digital dashboard (252) An interactive report showing the most important metrics to help users
understand how a company or an organization is performing. Often created using Excel or Tableau.
key performance indicator (KPI) (252) A particular type of performance metric that an
organization deems the most important and influential on decision making.
performance metric (252) Any calculation measuring how an organization is performing,
particularly when that measure is compared to a baseline.
5. According to the text, which of these are not helpful in refining a dashboard?
a. Which metric are you using most frequently to help you make decisions?
b. Are you downloading the data to do any additional analysis after working with the
dashboard, and if so, can the dashboard be improved to save those extra steps?
c. Are there any metrics that you do not use? If so, why aren’t they helpful?
d. Which data are the easiest to access or least costly to collect?
6. On a Balanced Scorecard, which is not included as a component?
a. Financial Performance
b. Customer/Stakeholder
c. Internal Process
d. Employee Capacity
7. On a Balanced Scorecard, which is not included as a component?
a. Financial Performance
b. Customer/Stakeholder
c. Order Process
d. Organizational Capacity
8. What is defined as an interactive report showing the most important metrics to help
users understand how a company or an organization is performing?
a. KPI
b. Performance metric
c. Digital dashboard
d. Balanced Scorecard
9. What is defined as any calculation measuring how an organization is performing, par-
ticularly when that measure is compared to a baseline?
a. KPI
b. Performance metric
c. Digital dashboard
d. Balanced Scorecard
10. What would you consider to be Marketing KPIs?
a. Conversion Rate
b. Six Sigma Level
c. Employee Churn Rate
d. Customer Engagement
Discussion Questions
1. We know that a Balanced Scorecard is comprised of four components: financial (or
stewardship), customer (or stakeholder), internal process, and organizational
capacity (or learning and growth). What would you include in a dashboard for the
financial and customer components?
2. We know that a Balanced Scorecard is comprised of four components: financial (or
stewardship), customer (or stakeholder), internal process, and organizational
capacity (or learning and growth). What would you include in a dashboard for the
internal process and organizational capacity components? How do digital
dashboards make KPIs easier to track?
3. Amazon, in the author’s opinion, has cared less about profitability in the short run
but has cared about gaining market share. Arguably Amazon gains market share by
taking care of the customer. Given the “Suggested 75 KPIs That Every Manager
Needs to Know” from Exhibit 7-5, what would be a natural KPI for the customer
aspect for Amazon? How do digital dashboards make KPIs easier to track?
4. For an accounting firm like PwC, how would the Balanced Scorecard help balance
the desire to be profitable for its partners with keeping the focus on its customers?
5. For a company like Walmart, how would the Balanced Scorecard help balance the
desire to be profitable for its shareholders with continuing to develop organizational
capacity to compete with Amazon (and other online retailers)?
6. Why is Customer Retention Rate a great KPI for understanding your Tesla customers?
7. If the data underlying your digital dashboard are updated in real time, why would you
want to update your digital dashboard in real time? Are there situations when you
would not want to update your digital dashboard in real time? Why or why not?
8. In which of the four components of a Balanced Scorecard would you put the Walton
College’s diversity initiative? Why do you think this is important for a public institution
of higher learning?
Problems
1. From Exhibit 7-5, choose 5 Financial Performance KPIs to answer the following three
questions. This URL (https://www.linkedin.com/pulse/20130905053105-64875646-the-
75-kpis-every-manager-needs-to-know) provides links with background information
for each individual KPI that may be helpful in understanding the individual KPIs and
answering the questions.
a. Identify the equation/relationship/data needed to calculate the KPI. If you need
data, how frequently would the data need to be incorporated to be most useful?
b. Describe a simple visualization that would help a manager track the KPI.
c. Identify a benchmark for the KPI from the Internet. Choose an industry and find
the average, if possible. This is for context only.
2. From Exhibit 7-5, choose 10 Employee Performance KPIs to answer the following
three questions. This URL (https://www.linkedin.com/pulse/20130905053105-
64875646-the-75-kpis-every-manager-needs-to-know) provides links with
background information for each individual KPI that may be helpful in understanding
the individual KPIs and answering the questions.
a. Identify the equation/relationship/data needed to calculate the KPI. How frequently
would it need to be incorporated to be most useful?
b. Describe a simple visualization that would help a manager track the KPI.
c. Identify a benchmark for the KPI from the Internet. Choose an industry and find
the average, if possible. This is for context only.
3. From Exhibit 7-5, choose 10 Marketing KPIs to answer the following three questions.
This URL (https://www.linkedin.com/pulse/20130905053105-64875646-the-75-kpis-
every-manager-needs-to-know) provides links with background information for each individual KPI that may be helpful in understanding the individual KPIs and
answering the questions.
a. Identify the equation/relationship/data needed to calculate the KPI. How frequently
would it need to be incorporated to be most useful?
b. Describe a simple visualization that would help a manager track the KPI.
c. Identify a benchmark for the KPI from the Internet. Choose an industry and find
the average, if possible. This is for context only.
4. How does Data Analytics help facilitate the use of the Balanced Scorecard and tracking KPIs? Does it make the data more timely? Are you able to access more information more easily or faster, or what capabilities does it give?
5. If ROA is considered a key KPI for a company, what would be an appropriate benchmark? The industry's ROA? The average ROA for the company for the past five
years? The competitors’ ROA?
a. How will you know if the company is making progress?
b. How might Data Analytics help with this?
c. How often would you need a measure of ROA? Monthly? Quarterly? Annually?
6. If Time to Market is considered a key KPI for a company, what would be an
appropriate benchmark? The industry’s Time to Market? The average Time to Market
for the company for the past five years? The competitors' Time to Market?
a. How will you know if the company is making progress?
b. How might Data Analytics help with this?
c. How often would you need a measure of Time to Market? Monthly? Quarterly?
Annually?
7. Why is Order Fulfillment Cycle Time an appropriate KPI for a company like Wayfair
(which sells furniture online)? How long does Wayfair think customers will be ready
to wait if Amazon Prime promises items delivered to its customers in two business
days? Might this be an important basis for competition?
Lab 7-1 Evaluate Management Requirements and Identify Useful KPIs from a List
Key performance indicators help managers keep track of performance and strategic
objectives.
Bernard Marr came up with a list of 75 KPIs that he believes every manager needs to
know.3
Q1. Imagine you work for Tesla. Choose 20 KPIs that you believe are most
impor- tant to Tesla’s management (include 5 from each category).
3
https://www.linkedin.com/pulse/20130905053105-64875646-the-75-kpis-every-manager-needs-to-
know/ (accessed 10/13/2017).
To gauge your market and marketing efforts:
27. Market Growth Rate
28. Market Share
29. Brand Equity
30. Cost per Lead
31. Conversion Rate
32. Search Engine Rankings (by keyword) and Click-Through Rate
33. Page Views and Bounce Rate
34. Customer Online Engagement Level
35. Online Share of Voice (OSOV)
36. Social Networking Footprint
37. Klout Score
To measure your operational performance:
38. Six Sigma Level
39. Capacity Utilisation Rate (CUR)
40. Process Waste Level
41. Order Fulfilment Cycle Time
42. Delivery in Full, on Time (DIFOT) Rate
43. Inventory Shrinkage Rate (ISR)
44. Project Schedule Variance (PSV)
45. Project Cost Variance (PCV)
46. Earned Value (EV) Metric
47. Innovation Pipeline Strength (IPS)
48. Return on Innovation Investment (ROI2)
49. Time to Market
50. First-Pass Yield (FPY)
51. Rework Level
52. Quality Index
53. Overall Equipment Effectiveness (OEE)
54. Process or Machine Downtime Level
55. First Contact Resolution (FCR)
To understand your employees and their performance:
56. Human Capital Value Added (HCVA)
57. Revenue per Employee
58. Employee Satisfaction Index
59. Employee Engagement Level
60. Staff Advocacy Score
61. Employee Churn Rate
62. Average Employee Tenure
63. Absenteeism Bradford Factor
64. 360-Degree Feedback Score
65. Salary Competitiveness Ratio (SCR)
66. Time to Hire
67. Training Return on Investment
To measure your environmental and social sustainability performance:
68. Carbon Footprint
69. Water Footprint
70. Energy Consumption
71. Saving Levels Due to Conservation and Improvement Efforts
72. Supply Chain Miles
73. Waste Reduction Rate
74. Waste Recycling Rate
75. Product Recycling Rate
Part 1: Identify the Questions
For each of these 20 KPIs:
Q2. Identify the specific equation/relationship/data needed to calculate the KPI.
If you need frequent data, how frequent?
Q3. Describe a simple visualization or dashboard that would help a manager track
the KPI. Is it red, yellow, and green indicators, or do you have something else
in mind that would be better?
End of Lab
Company summary
Superstore is a large seller of retail and wholesale office supplies, furniture, and
technology. It operates in the United States and has divided its sales regions into North,
South, East, and West. Each region has a regional sales representative who interacts with
the customers to take orders and deal with returns.
Data
Sales order data are available from 2013 to 2016, including demographic data about the
customers, as well as main categories and subcategories of products.
Technique
• In this lab, you will use Tableau to generate a dashboard to evaluate four
key performance indicators.
Software needed
• Tableau
Assuming you’ll have access to sales order and returns data, as well as the sales
represen- tatives involved, think about different ways you could measure performance.
Q1. What KPIs would you consider using to evaluate sales financial
performance? Q2. What KPIs would you consider using to evaluate customer
relationships?
Q3. What KPIs would you consider using to evaluate process
efficiency? Q4. What KPIs would you consider using to evaluate
employee growth?
Q5. For each KPI, identify a benchmark value or KPI goal that you think
manage- ment might use.
Q6. Using the available fields, identify some calculations or relationships that
would support your KPIs from Q1 to Q4.
Q7. Are there any KPIs you selected that don’t have supporting data fields?
Now it’s your turn to build a Balanced Scorecard dashboard in Tableau for each of
these metrics. First, you’ll create four individual worksheets; then you’ll combine them
into a dashboard for quick review.
Note: To compare actual performance to management’s goals, you’ll need to set some
parameters and create some additional calculated fields.
d. Name: Top Salespeople
i. Datatype: Integer
ii. Current value: 1 <- This shows the number of top employees management wants
to recognize.
iii. Display format: Number (standard)
iv. Allowable values: Range
v. Minimum: 0
vi. Maximum: 3
vii. Step size: 1
7. Create the four worksheets. For simplicity, full instructions are provided for the first
sheet. For subsequent sheets, drag the attributes to the appropriate places.
a. Create a new worksheet called Finance.
i. Create calculated fields—click the down-arrow next to Dimensions in the left
pane and choose Create Calculated Field. Enter the name of the new field,
then type the expression in the box below.
1. Profit Ratio: SUM([Profit])/SUM([Sales]).
2. Actual vs Target – Return on Sales: [Profit Ratio] > [KPI Target –
Return on Sales].
ii. Drag the following attribute to the Columns pane: Profit Ratio ->
becomes AGG(Profit Ratio).
iii. Drag the following attributes to the Rows pane: Category, Sub-Category.
iv. Drag the following attribute to the Filters pane: Product Name. Double-click the
value and select Custom Value List in the window that appears. Then click OK.
v. Drag the following attribute to the Marks pane: Actual vs Target – Return
on Sales becomes AGG (Actual vs Target – Return on Sales). Click the icon
next to it and select Color from the list.
vi. Click the Analytics tab in the left pane. In the Custom section, drag
Reference Line onto the Finance table. In the window that appears, choose
the following options:
1. Entire Table
2. Value: KPI Target – Return on Sales
vii. Click OK and save your project.
viii. Take a screenshot (label it 7-2A).
b. Create a new worksheet called Process.
i. Create calculated fields:
1. Delivery Time Days: ROUND(FLOAT(DATEDIFF('day', [Order Date], [Ship Date])),2)
2. Actual vs Target – Delivery: AVG([Delivery Time Days]) < [KPI Target –
Delivery Days]
ii. Columns: Longitude (generated)
iii. Rows: Latitude (generated)
iv. Type: Filled Map
v. Marks:
1. Delivery Time Days > Average > Color
2. Country > Detail
3. State > Detail
vi. Double-click AVG(Delivery Time Days) color scale:
1. Red-Green Diverging
2. Reversed
3. Advanced: Center: 4
vii. Take a screenshot (label it 7-2B).
8. Finally, create a new dashboard called Balanced Scorecard.
a. Drag Finance, Customer, Process, and Growth to main body of your dashboard.
b. To enable management to adjust its goals (and corresponding reference
lines), add the parameters to the dashboard along the top. Click Show/Hide Cards > Parameters, and add the parameters to the dashboard, then drag them along the top.
End of Lab
Lab 7-3 Comprehensive Case: Dillard’s Store Data: Creating
KPIs in Excel (Part I)
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its
headquarters is in Little Rock, Arkansas. You can learn more about Dillard’s by looking
at finance.yahoo
.com (Ticker symbol = DDS) and the Wikipedia site for DDS. You’ll quickly note that
William T. Dillard II is an accounting grad of the University of Arkansas and the Walton
College of Business, which may be why he shared transaction data with us to make
available for this lab and labs throughout this text.
Data
If you completed comprehensive Labs 3-4 and 3-5, you can use the same Excel file that you created and saved in Lab 3-5.
If you did not complete those labs, you will need to extract those data and load them
into Excel using the following query. If you need to review how to extract data from SQL
Server and load them into Excel, see Part 3 of Comprehensive Lab 3-4. Steps 4.1–4.5
are the same steps necessary to load the data into Excel.
Select Transact.*, Store.STATE
From Transact
Inner Join Store
On Transact.Store = Store.STORE
Where TRAN_DATE BETWEEN '20160901' and '20160915'
Order By Tran_Date
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
• Power Pivot Excel add-in. To create a date table, we’ll extract and load the data
through Power Pivot instead of through the Get & Transform tab. If you don’t see
Power Pivot as a tab in the Excel ribbon, you will need to activate the add-in.
a. From the File tab on the ribbon, open Options.
b. Select Add-ins from the left side of the Excel Options window.
c. From the drop-down window at the bottom of the Add-ins screen, select COM add-ins, then click Go. . .
d. Place a check mark in the box next to Microsoft Power Pivot for Excel,
then click OK.
1. From the Insert tab on the ribbon, click PivotTable.
2. In the Create PivotTable window, make sure to place a check mark in the box next to
Add this data to the Data Model. Then click OK.
3. Once the PivotTable has been created (this may take a few moments as the data
are loaded into the data model), you can create a measure and a KPI. Navigate to
the Power Pivot tab in the ribbon.
Click Measures, then select New Measure. . .
4. The new measure’s name defaults to Measure 1, which isn’t very descriptive.
Because we’ll be measuring average Transaction amount, we’ll change the name to
AVG(Tran_Amt). Type AVG(Tran_Amt) over the default text.
5. The formula will auto-populate as you type. Begin typing average, and then begin
typing the field Tran_Amt to fill in the formula.
6. The category has no bearing on how the measure or the KPI will work. For this
mea- sure, we’ll leave it on the default of General. Click OK to create the
measure.
7. If you scroll down on the PivotTable Fields window, you will see that the explicit
measure has been added to the bottom of the field list.
8. Now we will create the KPI. In the Power Pivot tab of the ribbon, click KPIs
and select New KPI. . .
9. Because you have only one measure added to this spreadsheet for now, the base
field defaults to your newly created measure. If you had more than one measure,
you would use the drop-down to select the measure you wanted to use for your base
field. The target value can be defined by another measure or by an absolute value.
For this first KPI, we’ll define it by an Absolute Value. Let’s assume that Dillard’s
has a goal of averaging at least $28 per Transaction.
Input 28 as the Absolute value for the target value. Leave the default for the status thresholds. (A scripted illustration of this average-transaction-versus-target check appears after the end of this lab.)
Q2. Why might you want to edit the status thresholds? Does 22.4 seem low for
the upper limit?
10. Now that you have your KPI created, you can see each of them in the
PivotTable Fields list.
Occasionally, if the KPI status was automatically added to your PivotTable, the
stoplight signals show as −1, 0, and 1. If you remove the status field from the field list
and put it back in, this will correct the issue and the stoplight icons will show.
If you expand the KPI fields, you see three options:
◦ The Value will show the actual average transaction amount (or sliced by month or day, depending on the other values you drill into in the PivotTable).
◦ The Goal will show the absolute target of 28, the value that you are comparing the average transaction amount against.
◦ The Status will show stoplight icons indicating red, yellow, or green circles based on
the thresholds you selected when setting the KPI.
11. Create a PivotTable that shows the KPI status for average Transaction by each of
the 15 days in your data range.
12. Take a screenshot (label 7-3A).
Q3. How did Dillard’s perform in September 2016 compared to September
2015? Do you think the target is set too high or too low? Which day(s)
performed the worst, compared to the same date(s) in the previous period?
Why do you think that is?
End of Lab
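Here is the scripted illustration referenced in step 9: a minimal Python sketch that computes the average transaction amount per day and assigns a stoplight status against the $28 absolute target. The transactions are hypothetical, and the status bands are illustrative stand-ins for Power Pivot's thresholds.

# Hypothetical transaction amounts by day.
daily_transactions = {
    "2016-09-01": [31.40, 18.25, 42.10, 27.95],
    "2016-09-02": [12.50, 22.75, 19.99],
}
TARGET = 28.00

for day, amounts in daily_transactions.items():
    avg = sum(amounts) / len(amounts)
    # Illustrative bands: green at or above target, yellow within 20 percent, red below.
    status = "green" if avg >= TARGET else "yellow" if avg >= 0.8 * TARGET else "red"
    print(f"{day}: average transaction ${avg:.2f} vs ${TARGET:.2f} target -> {status}")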
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its
headquarters is in Little Rock, Arkansas. You can learn more about Dillard’s by looking
at finance.yahoo
.com (Ticker symbol = DDS) and the Wikipedia site for DDS. You’ll quickly note that
William T. Dillard II is an accounting grad of the University of Arkansas and the Walton
College of Business, which may be why he shared transaction data with us to make
available for this lab and labs throughout this text.
Data
The data for this lab and all other Dillard's labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard's data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
In this lab, you will:
• Compare total sales across all Dillard’s stores year over year, month over month,
and day over day and develop it as a KPI.
2. Enter the Server name and the Database name as provided to you through the
walton.uark.edu/enterprise site, and then click Advanced options to input the
query text:
Select year(Tran_Date) as year, month(Tran_Date) as month, day(Tran_Date)
as day, sum(Tran_Amt) as amount
From TRANSACT
Where TRAN_TYPE = 'P'
Group By year(Tran_Date), month(Tran_Date), day(Tran_Date)
Order By year(Tran_Date), month(Tran_Date), day(Tran_Date)
3. Click OK.
4. A preview of your data will load. Instead of immediately loading these data into
Excel, you need to transform them in the Query Editor. Click Edit.
The data have been fully extracted from SQL Server into Excel’s Internal Data Model,
but they need to be transformed so that we can more easily compare daily sales
amounts year over year. Instead of seeing a separate record for each day, beginning with
January 1, 2014, and ending with October 17, 2016, we would prefer to see only 365 records—one
record for each day in a calendar year, but with separate columns for each year (2014, 2015,
and 2016), each with the transaction amount associated with that year’s month and day.
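If you prefer to reshape the data in SQL rather than in the Query Editor, a conditional-aggregation version of the query would produce the same one-row-per-calendar-day layout. This is a sketch only; the steps below continue with the Query Editor’s Pivot Column transform.

Select month(Tran_Date) as month, day(Tran_Date) as day,
    sum(case when year(Tran_Date) = 2014 then Tran_Amt else 0 end) as [2014],
    sum(case when year(Tran_Date) = 2015 then Tran_Amt else 0 end) as [2015],
    sum(case when year(Tran_Date) = 2016 then Tran_Amt else 0 end) as [2016]
From TRANSACT
Where TRAN_TYPE = 'P'
Group By month(Tran_Date), day(Tran_Date)
Order By month(Tran_Date), day(Tran_Date)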
5. Select the year column.
6. Select Pivot Column from the Transform tab on the Query Editor ribbon.
7. Select Amount from the drop-down for the Values column and click OK.
8. Now that the data have been transformed, we’re ready to load them into Excel.
From the Home button on the Query Editor’s ribbon, click Close and Load.
9. Excel has a way to super-charge its conditional formatting by creating KPIs in
Power Pivot. In the Create PivotTable window, make sure to place a check mark
in the box next to Add this data to the Data Model. Then click OK.
11. The new measure’s name defaults to Measure 1, which isn’t very descriptive.
Because this measure will total 2014 sales, we’ll change the first
measure’s name to 2014 Sales. Type 2014 Sales over the default text.
12. The formula will auto-populate as you type. Begin typing SUM, then fill in the
parentheses with the column name 2014 (a sketch of the completed measure follows step 13).
13. At the bottom of the Measure window is an option to select a category. The
Category has no bearing on how the measure or the KPI will work. For this
measure, we’ll leave it on the default of General. Click OK to create the measure.
Source: Microsoft Excel 2016.
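For reference, the measure created in steps 11–13 can also be read (or typed) as a DAX formula in the measure window. A minimal sketch, assuming the pivoted query was loaded into the Data Model under the table name Query (your table name may differ), is:

2014 Sales:=SUM(Query[2014])

The 2015 and 2016 measures in the next step follow the same pattern, with the column name changed to 2015 or 2016.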
14. Repeat the same steps used to create the measure for 2014 sales to create measures
for 2015 sales and 2016 sales.
15. Now we will create the KPIs to compare 2015 sales to 2014, and 2016 sales to 2015.
In the Power Pivot tab of the ribbon, click KPIs and select New KPI. . .
16. The first KPI we will create is comparing 2016 sales to the previous year’s sales.
Use the drop-down to select 2016 Sales for your base field. The target value can be
defined by another measure or by an absolute value. We have already defined the
measure to compare 2016 sales to, so select 2015 Sales for the target value Measure.
We will define excellent performance as a 2 percent improvement over last year’s
sales, so move the upper range of the target slider to 102%. Poor performance will be
defined as a 2 percent decline from last year’s sales. Move the lower range of the
target slider to 98%.
Q1. Do you think +/− 2 percent is the right benchmark to set? Would you propose
a different percentage change to track here?
Once all of your settings are correct, click OK to create the KPI.
Source: Microsoft Excel 2016.
17. Create the KPI comparing 2015 sales to 2014 sales using the same thresholds for
measuring performance.
18. Now that you have your two KPIs created, you can see each of them in the
PivotTable Fields list.
Occasionally, if the KPI status is automatically added to your PivotTable, the stoplight
signals show as −1, 0, and 1. If you remove the status field from the fields list and put
it back in, this will correct the issue and the stoplight icons will show.
If you expand the KPI fields, you see three options:
• The Value (2016 Sales) will show the actual sale totals associated with the year 2016
(or sliced by month or day, depending on the other values you drill into in the
PivotTable).
• The Goal will show 2015 sales totals—this is the measure that you are using to
compare 2016 sales against. The Goal is for the sales to be at least 2 percent higher
than the previous year’s sales.
• The Status will show stoplight icons indicating red, yellow, or green circles based on
the thresholds you selected when setting the KPI.
19. Create a PivotTable that shows the KPI status of 2015 and 2016 sales by month.
To do so, drag and drop Months into the Rows and Status for both KPIs into the Values.
If you just place a check mark in the box next to the month field, you will notice that
the PivotTable defaults to reading Month values as numerical data instead of calendar data,
so it places it as a value and sums the month numbers. You just need to drag and drop it
outside of Values and into Rows.
20. Take a Screenshot (label it 7-4A).
21. To provide some drill-down capabilities, add the Day field to the Rows (beneath
Month).
Q2. Do you notice a pattern with how frequently the “bad” (red icon) days appear
in 2016 in relation to 2015?
Q3. What do you think is the potential problem with comparing days (e.g.,
comparing September 1, 2016 to September 1, 2015)? How could this be
improved?
End of Lab
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its
headquarters is in Little Rock, Arkansas. You can learn more about Dillard’s by looking
at finance.yahoo.com (Ticker symbol = DDS) and the Wikipedia site for DDS. You’ll
quickly note that William T. Dillard II is an accounting grad of the University of
Arkansas and the Walton College of Business, which may be why he shared transaction
data with us to make available for this lab and labs throughout this text.
Data
The data for this lab and all other Dillard’s labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard’s data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
• Power Pivot Excel add-in. To create a date table, we’ll extract and load the data
through Power Pivot instead of through the Get & Transform tab. If you don’t see
Power Pivot as a tab in the Excel ribbon, you will need to activate the add-in.
a. From the File tab on the ribbon, open Options.
b. Select Add-ins from the left side of the Excel Options window.
c. From the drop-down window at the bottom of the Add-ins screen, select
COM add-ins, then click Go. . .
d. Place a check mark in the box next to Microsoft Power Pivot for Excel, then click OK.
2. In the Power Pivot for Excel window, click Get External Data from the Home tab,
then navigate through From Database and From SQL Server.
3. The Table Import Wizard window will open. Input the SQL Server name and
the Database name that you received from Walton.uark.edu/enterprise, then click
Next.
4. We will import the data with a query, so select the radio button next to Write a
query that will specify the data to import.
5. We need to bring in only two attributes. In Lab 7-2, we had to parse out the different
date parts in order to group our data by month and year, instead of just by day. In
this lab, we will use Excel’s Power Pivot tool to create a Date table. The tool will be
able to parse out the date parts for us, instead of us having to do so with our query.
This will also allow us to view more interesting date parts, such as the day of the
week (not just the date).
Input the following query into the Table Import Wizard window to extract the total
amount of Transactions for each day in the database:
Select Tran_Date, SUM(Tran_Amt) AS Sales
From Transact
Group By Tran_Date
After entering the SQL text, click Validate to ensure the query will run, and then
click Finish.
6. Once the data are loaded, you can close the Table Import Wizard window. Click Close.
7. After closing the Table Import Wizard, you will see your data loaded into Power
Pivot. This does not mean the data have been loaded into Excel yet, so you can
transform the data within the Power Pivot tool first. Creating the date table takes
three steps: Select the Tran_Date column, click Date Table from the Design tab on
the ribbon, then click New.
You have created a Date table. Now it’s time to load the transformed data into Excel.
8. Return to the Home tab on the Power Pivot ribbon, and select PivotTable.
The PivotTable Fields list contains two tables, Calendar and Query. The Calendar
table contains the Date Hierarchy for drilling down, but it also contains attributes
beneath the More Fields title. These contain the same attributes in the hierarchy, as
well as different ways of viewing the data, such as Day of Week. The Query table
contains the data that you extracted with your SQL query. The valuable field from the
query table is Sales, which you will use as a value (or an implicit measure).
Part 3: Perform an Analysis of the Data
10. Create a PivotTable to compare sales performance on different weekdays of each
month, year over year. To do so, drag and drop year (from the Calendar > More
fields drop-down) into Columns, Month and DayofWeek into Rows, and Sales into
Values. The Sales data will be transformed into a measure, Sum of Sales,
automatically.
11. Take a screenshot (label it 7-5A).
Q1. Something should seem a bit off with your numbers. There are some big
disparities month over month for some weekdays. Look back over our query
and the ER Diagram (and if you completed Lab 7-2, compare the query you
executed in this lab to the query from that lab). What did we leave out of this
query? How could it cause us to make poor decisions?
14. Add in a WHERE clause to the query, validate the new query, and save it.
Select Tran_Date, SUM(Tran_Amt) AS Sales
From Transact
Where Tran_Type = 'P'
Group By Tran_Date
15. The data will be automatically refreshed in the Power Pivot tool and in the Excel
worksheet with the PivotTable. Close the Power Pivot tool.
End of Lab
Company summary
Dillard’s is a department store with approximately 330 stores in 29 states. Its
headquarters is in Little Rock, Arkansas. You can learn more about Dillard’s by looking
at finance.yahoo.com (Ticker symbol = DDS) and the Wikipedia site for DDS. You’ll quickly note that
William T. Dillard II is an accounting grad of the University of Arkansas and the Walton
College of Business, which may be why he shared transaction data with us to make
available for this lab and labs throughout this text.
Data
The data for this lab and all other Dillard’s labs are available at http://walton.uark.edu/
enterprise/. Your instructor will either give you specific instructions on how to access the
data, or there will be information available on Connect. The 2016 Dillard’s data cover all
transactions over the period 1/1/2014 to 10/17/2016.
Software needed
• Microsoft SQL Server Management Studio (available on the Remote Desktop at
the University of Arkansas)
• Excel 2016 (available on the Remote Desktop at the University of Arkansas)
• Power Pivot Excel add-in. To create a date table, we’ll extract and load the data
through Power Pivot instead of through the Get & Transform tab. If you don’t see
Power Pivot as a tab in the Excel ribbon, you will need to activate the add-in.
Prerequisite
• Labs 7-4 and 7-5. If you haven’t completed these labs, then you can still read
through the steps in Labs 7-4 and 7-5 to see the screenshots of the ETL process in
Excel (Lab 7-5) and the KPI creation process (Lab 7-4) to be ready for this lab.
Part 1: Identify the Questions
In Lab 7-4, you created KPIs for comparing 2015 sales to 2014 sales, but the date was
parsed out from the original Tran_Date attribute. In Lab 7-5, you created a date table so
that the date fields were more descriptive in the Excel report, but you didn’t create any
KPIs. In this lab, we will combine those two skills to create a descriptive report with
KPIs. We will also expand the reports capabilities by extracting and loading state and
store data in addition to date and transaction data.
KPI’s base. Each base measure can only have one KPI assigned to it. Create a new
measure called Current Month to calculate sales (this will be the exact same as how
you created Current Year in step 4, but with a different Measure Name).
9. Create a new measure to use as the monthly target measure. The DAX expression
for calculating last month’s sales is:
=CALCULATE([Sum of Amount],PREVIOUSMONTH('Calendar'[Date]))
You can name this measure Previous Month.
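The same pattern works for other target periods. For example, a year-over-year target measure built from the same [Sum of Amount] measure and Calendar date table would be a sketch along these lines:

=CALCULATE([Sum of Amount],PREVIOUSYEAR('Calendar'[Date]))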
10. Create a new KPI comparing current sales (your base measure) to previous month
as your target measure. Create the same status thresholds as the KPI comparing
years (<98%, 98%–102%, >102%).
11. Add this KPI status to your PivotTable.
12. Take a screenshot (label it 7-6A).
14. Place a check mark in the boxes next to State and Store to create the slicers.
15. Notice what happens as you select different states: Not only do the data change to
reflect the KPI status for the state that you selected, but the stores that are
associated with that state shift to the top of the store slicer, making it easier to drill
down.
16. Take a screenshot (label it 7-6B).
We can ease drill-down capabilities even more by creating a hierarchy between state
and store.
17. Open the Power Pivot tool by clicking Manage from the Power Pivot tab in the Excel ribbon.
18. From the Power Pivot Home tab, switch to Diagram View.
19. Select both the State and the Store attributes from the Query table, then right-click
one of the attributes to create a hierarchy.
20. You can change the name of the Hierarchy to Store and State Hierarchy.
21. Close the Power Pivot tool. The PivotTable will have refreshed automatically.
22. You will see that the hierarchy has been added to your PivotTable Fields list. Drag
and drop the hierarchy to the Rows (above the Date hierarchy).
23. Take a screenshot (label it 7-6C).
Now you can drill down from State to Store directly in the PivotTable, or you can
filter it via the slicer.
Q1. How does the ability to drill down into the state and store data give management
critical information and help them to identify issues that are occurring
or opportunities that might be available?
Q2. What would you need in order to see sales changes of certain products (SKUs)
or product categories from one month to the next? How would having this type
of information help you plan future promotions or future purchases?
End of Lab
Chapter 8
Financial Statement Analytics
A Look Back
Chapter 7 focused on generating and evaluating key performance metrics that are used primarily in managerial
accounting. By measuring past performance and comparing it to targeted goals, we are able to assess how well a
company is working toward a goal. Also, we can determine required adjustments to how decisions are made or
how business processes are run, if any.
Sometimes the future is now. The StockSnips app uses sentiment analysis,
machine learning, and artificial intelligence to aggregate and analyze news
related to publicly traded companies on Nasdaq and the New York Stock
Exchange to “gain stock insights and track a company’s financial and
business operations.” The use of Data Analytics helps classify the news to
help predict revenue, earnings, and cash flows, and it uses those data to help
predict the stock price performance that is most relevant to assessing company
performance. What will Data Analytics do next?
©S Narayan/Dinodia Photo/agefotostock
EXHIBIT 8-1
OBJECTIVES
After reading this chapter, you should be able to:
LO 8-1 Describe how XBRL tags financial reporting data
LO 8-2 Understand how different types of ratio analysis can be facilitated by XBRL
LO 8-3 Explain how to create and read visualizations of financial statement data
LO 8-4 Describe the value of text mining and sentiment analysis of financial reporting
LO 8-1 Describe how XBRL tags financial reporting data

XBRL
XBRL is a global standard for Internet communication among businesses. XBRL stands
for eXtensible Business Reporting Language and is a type of XML (extensible markup
language) used for organizing and defining financial elements. By each company
providing tags for each piece of its financial data, XBRL data can be computer readable
and immediately available for each type of financial statement user, be they financial
analysts, investors, or lenders, for their own specific use.
As of June 2011, the Securities and Exchange Commission requires all public
company filers, including smaller reporting companies that prepare financial statements in
accordance with U.S. GAAP (generally accepted accounting principles) and foreign private
issuers that prepare their financial statements in accordance with IFRS, to file their
financial statements in XBRL format. This includes tagging the five basic financial statements:
• Balance sheet
• Income statement
• Statement of comprehensive income
• Statement of cash flows
• Statement of stockholders’ equity
In addition, detailed tagging of the numbers included in the footnotes by use of XBRL
tags is also required. This means that numbers in the footnotes (e.g., facts, figures, years,
and percentages) are also tagged, as well as the text disclosure of the major footnotes.
XBRL uses a taxonomy to help define and describe each key data element (like cash
or accounts payable). The XBRL taxonomy also defines the relationships between each
element—such as cash being a component of current assets and current assets being a
component of total assets, or accounts payable being a component of current liabilities
and current liabilities, in turn, being a component of total liabilities.
The 2017 U.S. GAAP Financial Reporting Taxonomy is found at this website:
https://xbrl.us/xbrl-taxonomy/2017-us-gaap/. It defines more than 19,000 elements. Using such a
taxonomy, each unique financial data item is tagged to an element within the taxonomy.
For example, the XBRL tag for cash is labeled “Cash” and is defined as follows:
The XBRL tag for cash and cash equivalents footnote disclosure is labeled as
“CashAndCashEquivalentsDisclosureTextBlock” and is defined as follows:
The entire disclosure for cash and cash equivalent footnotes, which may include the types
of deposits and money market instruments, applicable carrying amounts, restricted amounts
and compensating balance arrangements. Cash and equivalents include: (1) currency on
hand
(2) demand deposits with banks or financial institutions (3) other kinds of accounts that
have the general characteristics of demand deposits (4) short-term, highly liquid
investments that are both readily convertible to known amounts of cash and so near their
maturity that they present insignificant risk of changes in value because of changes in
interest rates. Generally, only investments maturing within three months from the date of
acquisition qualify.2
1 https://xbrl.us/xbrl-taxonomy/2017-us-gaap/
2 https://xbrl.us/xbrl-taxonomy/2017-us-gaap/
The use of tags allows data to be quickly transmitted and received, and the tags serve
as
an input for financial analysts valuing a company, an auditor finding areas where an error
might occur, or regulators seeing if firms are in compliance with various regulations and
laws (like the SEC or IRS).
IBM labels revenue as “Total revenue” and uses the tag “Revenues”, whereas
Apple labels its revenue as “Net sales” and uses the tag “SalesRevenueNet”.
This is a relatively simple case, because both companies used tags from the FASB
taxonomy.
Users are typically not interested in the subtle differences of how companies tag or
label information. In the previous example, most users would want Apple’s and
IBM’s revenue, regardless of how it was tagged. To that end, data vendors create standardized
metrics.3
Data vendors such as XBRLAnalyst and Calcbench provide a trace
function that allows you to trace the standardized metric back to the original source to
see which XBRL tags are referenced or used to make up the standardized metric.4
Exhibit 8-2 shows what a report using standardized metrics looks like for Boeing’s
balance sheet. Note that the standardized tags used for Boeing could be used for any of the
SEC filers to gather their balance sheet and other financial statements.
3
https://knowledge.calcbench.com/hc/en-us/articles/230017408-What-is-a-standardized-metric
(accessed August 2017).
4
https://knowledge.calcbench.com/hc/en-us/articles/230017408-What-is-a-standardized-metric.
Exhibit 8-2 Balance Sheet from XBRL Data
PROGRESS CHECK
1. How does XBRL facilitate Data Analytics by analysts?
2. How might standardized XBRL metrics be useful in comparing the financial
statements of General Motors, Alphabet, and Alibaba?
3. Assuming XBRL-GL is able to disseminate real-time financial reports, which
real-time financial elements (account names) might be most useful to decision
makers? And which information might not be useful?
Classes of Ratios
There are basically four types of ratios: liquidity, activity, solvency (or financing), and
profitability.
Liquidity is the ability to satisfy the company’s short-term obligations using assets that
can be most readily converted into cash. Liquidity ratios help measure the liquidity of a
company. Liquidity ratios include the current ratio and the acid-test ratio.
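For reference, these two liquidity ratios are commonly computed as:
Current ratio = Current assets/Current liabilities
Acid-test ratio = (Cash + Short-term investments + Net receivables)/Current liabilities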
Activity ratios are a computation of a firm’s operating efficiency. Company activity is
often measured by use of turnover ratios, which reflect the number of times assets flow into and
out of the company during the period and serve as a gauge of the efficiency of putting
assets to work. Receivables, inventory, and total asset turnover are all examples of
activity ratios.
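For reference, common formulations of these turnover ratios are:
Receivables turnover = Sales/Average accounts receivable
Inventory turnover = Cost of goods sold/Average inventory
Total asset turnover = Sales/Average total assets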
We use solvency (sometimes called financing) ratios to help assess a company’s
ability to pay its debts and stay in business. In other words, we assess the company’s
financial risk—that is, the risk resulting from a company’s choice of financing the business
using debt or equity. Debt-to-equity, long-term debt-to-equity, and times interest earned
ratios are also useful in assessing the level of solvency.
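For reference, these solvency ratios are commonly computed as:
Debt-to-equity = Total liabilities/Total stockholders’ equity
Long-term debt-to-equity = Long-term debt/Total stockholders’ equity
Times interest earned = Income before interest and taxes/Interest expense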
Profitability ratios are a common calculation when assessing a company. They are
used to provide information on the profitability of a company and its prospects for the
future.
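For reference, common profitability ratios include:
Profit margin = Net income/Sales
Return on assets = Net income/Average total assets
Return on equity = Net income/Average stockholders’ equity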
5
AICPA, AU section 329, http://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/
Return on equity (ROE) = Profit margin × Operating leverage (or Asset turnover) ×
Financial leverage
= (Net profit/Sales) × (Sales/Average total assets)
× (Average total assets/Average equity)
It decomposes return on equity into three different types of ratios: profitability (profit
margin), activity (operating leverage or asset turnover), and solvency (financial leverage)
ratios. We illustrate it in Exhibit 8-3 by considering a calculation from some standard
XBRL data.
Exhibit 8-3 DuPont Analysis Using XBRL Data
Source: https://www.calcbench.com/xbrl_to_excel.
You’ll note for the Quarter 2 analysis in 2009, for DuPont (Ticker Symbol = DD), if
you take its profit margin, 0.294, multiplied by asset turnover of 20.1 percent multiplied
by the financial leverage of 471.7 percent, you get a return on equity of 27.8 percent.
LO 8-3 Explain how to create and read visualizations of financial statement data

The Use of Sparklines and Trendlines in Ratio Analysis
By using sparklines and trendlines, financial statement users can easily see the data
visually and give meaning to the underlying financial data. We define sparklines as a
small trendline or graphic that efficiently summarizes numbers or statistics in a graph
without axes. Because they generally can fit in a single cell within a spreadsheet, they can
easily add to the data without detracting from the tabular results.
For what types of reports or spreadsheets should sparklines be used? It usually
depends on the type of reporting that is selected. For example, if used in a digital
dashboard that already has many charts and dials, additional sparklines might clutter up
the overall appearance. However, if used to show trends where it replaces or
complements lots of numbers, it might be used as a very effective visualization. The nice
thing about sparklines is they are generally small and just show simple trends rather than
all the details regarding the horizontal and vertical axes that you would expect on a
normal graph.
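For example, in Google Sheets (used in Lab 8-1) a sparkline summarizing three years of a ratio stored in cells B7:D7 can be placed in a single cell with a formula such as:

=SPARKLINE(B7:D7)

In Excel, the equivalent feature is available from the Insert tab (Sparklines group), which places a line, column, or win/loss sparkline in a selected cell.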
Exhibit 8-4 provides an example of the use of sparklines in a DuPont analysis for Walmart.
Exhibit 8-4 Illustration of the Use of Sparklines to Show Trends in the DuPont Ratio Analysis for Walmart
PROGRESS CHECK
4. How might standardized XBRL metrics be useful in comparing the financial
statements of General Motors, Alphabet, and Alibaba?
5. How might sparklines be used to enhance the analysis of Exhibit 8-3 regarding
the DuPont analysis? Would you show the sparklines for each component
of the DuPont ROE disaggregation, or would you propose it be shown only for
the total?
6. Using Exhibit 8-3 as the source of data and using the raw accounts, show the
components of profit margin, operating leverage, and financial leverage and
how they are combined to equal ROE for Q2 2009 for DuPont (Ticker = DD).
To provide an illustration of the use and predictive ability of text mining and sentiment
analysis, Loughran and McDonald6 use text mining and sentiment analysis to predict the
stock market reaction to the issuance of a 10-K form by examining the proportion of
negative words used in a 10-K report. Exhibit 8-5 comes from their research suggesting
that the stock market reaction is related to the proportion of negative words (or inversely,
the proportion of positive words). They call this method overlap. Thus, using this
method to define the tone of the article, they indeed find a direct association, or
relationship, between the proportion of negative words and the stock market reaction to
the disclosure of 10-K reports.
Exhibit 8-5 The Stock Market Reaction (Excess Return) of Companies Sorted by the Proportion
of Negative Words.
The lines represent the words from a financial dictionary (Fin-Neg) and a standard English
dictionary (H4N-INF).
Source: Tim Loughran and Bill McDonald, “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks,”
Journal of Finance 66, no. 1 (2011), pp. 35–65.
6
Tim Loughran and Bill McDonald, “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and
10-Ks,” Journal of Finance 66, no. 1 (2011), pp. 35–65.
PROGRESS CHECK
7. Which would you predict would have more positive sentiment in a 10-K, the
footnotes to the financial statements or the MD&A (management discussion
and analysis) of the financial statements?
8. Why would you guess the results between the proportion of negative words
and the stock market reaction to the 10-K issuance diverge between the Fin-Neg and
the H4N-Inf dictionaries?
Summary
■ Data Analytics extends to the financial accounting and financial reporting space.
■ By tagging financial elements in a computer readable manner, XBRL facilitates the accurate and timely transmission of financial reporting to all interested stakeholders.
■ The XBRL taxonomy provides tags for 19,000 financial elements and allows for the use of company-defined tags when the normal XBRL tags are not suitable.
■ XBRL and Data Analytics allow timely analysis of the financial statements and the computation of financial ratios. We illustrated its usage by showing the DuPont ratio framework.
■ We introduced and discussed the use of sparklines and trendlines as ways to efficiently and effectively visualize firm performance.
■ We concluded the chapter by explaining how sentiment analysis could be used with financial statements, other financial reports, and other financially related information.
Key Words
DuPont ratio analysis (306) Developed by the DuPont Corporation to decompose performance
(particularly return on equity [ROE]) into its component parts.
financial statement analysis (305) Used by investors, analysts, auditors, and other interested
stakeholders to review and evaluate a company’s financial statements and financial performance.
ratio analysis (305) A tool used to evaluate relationships among different financial statement items
to help understand a company’s financial and operating performance.
sparkline (306) A small trendline or graphic that efficiently summarizes numbers or statistics in a
graph without axes.
standardized metrics (303) Metrics used by data vendors to allow easier comparison of company
reported XBRL data.
XBRL (302) A global standard for exchanging financial reporting information that uses XML; a global
standard for Internet communication among businesses.
XBRL-GL (303) Stands for XBRL-General Ledger; relates to the ability of an enterprise system to tag
financial elements within the firm’s financial reporting system.
XBRL taxonomy (302) Defines and describes each key data element (like cash or accounts
payable). The taxonomy also defines the relationships between each element (like inventory is a
component of current assets and current assets is a component of total assets).
ANSWERS TO PROGRESS CHECKS
1. By each company providing tags for each piece of its financial data as computer
readable, XBRL allows immediate access to each type of financial statement user,
be they financial analysts, investors, or lenders, for their own specific use.
2. Standardized metrics allow for comparison of different companies by using similar
titles for similar financial elements. While these standardized metrics are determined by
a data vendor such as Calcbench or XBRLAnalyst (among others), they greatly
facilitate the use and value of financial reporting provided by XBRL.
3. When journal entries and transactions are made in an XBRL-GL system, there is the
possibility of real-time financial reporting. In the author’s opinion, income statement
information (including sales, cost of goods sold, and SG&A expenditures) would be
useful to financial users on a real-time basis. Any information that does not change
frequently would not be as useful; examples include goodwill, long-term debt, and
property, plant, and equipment.
4. Standardized metrics are useful for comparing companies because they allow for
similar accounts to have the same title regardless of the account names used by the
various companies. They allow for ease of comparison across multiple companies.
5. Answers may vary on how to visualize the data. It might depend on the type of
reporting that is selected. For example, is it solely a digital dashboard, or is it a report
with many facts and figures where more sparklines might clutter up the overall
appearance? The nice thing about sparklines is they are generally small and just show
simple trends rather than details about the horizontal and vertical axes.
6. Profit margin = (Revenues – Cost of revenue)/Revenues = ($7.088B – $5.007B)/
$7.088B = 29.4%
Operating leverage = Sales/Assets = ($7.088B / $35.258B) = 20.1%
Financial leverage = Assets/Equity = $35.258B / $7.474B = 471.7%
ROE = Profit margin × Operating leverage (or Asset turnover) × Financial leverage =
0.294 × 0.201 × 4.717 = 0.278
7. The MD&A section of the 10-K has management reporting on what happened in the
most recent period and what they expect will happen in the coming year. They are
usually upbeat and generally optimistic about the future. The footnotes are generally
backward looking and would be much more fact-based, careful, and
conservative. We would expect the MD&A section to be much more optimistic than
the footnotes.
8. Accounting has its own lingo. Words that might seem negative for the English
language are not necessarily negative for financial reports. For this reason, the
results diverge based on whether the standard English usage dictionary (H4N-Inf) or
the financial dictionary (Fin-Neg) is used. The relationship between the excess stock
market return and the financial dictionary is what we would expect.
2. XBRL stands for:
a. Extensible Business Reporting Language.
b. Extensive Business Reporting Language.
c. XML Business Reporting Language.
d. Excel Business Reporting Language.
3. Which term defines and describes each XBRL financial element?
a. Data dictionary
b. Descriptive statistics
c. XBRL-GL
d. Taxonomy
4. Which stage of the IMPACT model (introduced in Chapter 1) would the use of
sparklines fit?
a. Track outcomes
b. Communicate insights
c. Address and refine results
d. Perform test plan
5. What is the name of the output from data vendors to help compare companies using
different XBRL tags for revenue?
a. XBRL taxonomy
b. Data assimilation
c. Consonant tagging
d. Standardized metrics
6. What is the term used to describe the process of assigning XBRL tags internally within
a financial reporting/enterprise system?
a. XBRL tagging
b. XBRL taxonomy
c. XBRL-GL
d. XBRL dictionary
7. What computerized technique would be used to perform sentiment analysis on an
annual accounting report?
a. Text mining
b. Sentiment mining
c. Textual analysis
d. Decision trees
8. What type of ratios measure a firm’s operating efficiency?
a. DuPont ratios
b. Liquidity ratios
c. Activity ratios
d. Solvency ratios
9. What type of ratios measure a firm’s ability to pay its debts and stay in business?
a. DuPont ratios
b. Liquidity ratios
c. Activity ratios
d. Solvency ratios
10. What is considered an essential component of planning an audit and carrying out
substantive testing that involves ratio analysis?
a. Environmental analysis
b. Competitive analysis
c. Management integrity analysis
d. Analytical procedures
Discussion Questions
1. Which would you predict would have more positive sentiment in a 10-K, the financial
statements or the MD&A (management discussion and analysis) of the financial
statements? More positive sentiment in the footnotes or MD&A? Why?
2. Would you recommend the Securities and Exchange Commission require the use of
sparklines on the face of the financial statements? Why or why not?
3. Why do audit firms perform analytical procedures to identify risk? Which type of ratios
(liquidity, solvency, activity, and profitability ratios) would you use to evaluate the
company’s ability to continue as a going concern?
4. Go to https://xbrl.us/data-rule/dqc_0015-lepr/ and find the XBRL element name for
Interest Expense and Sales, General, and Administrative expense.
5. Go to https://xbrl.us/data-rule/dqc_0015-lepr/ and find the XBRL element name for
Other NonOperating Income and indicate whether XBRL says that should normally be
a debit or credit entry.
6. Go to finance.yahoo.com and type in the ticker symbol for Apple (AAPL) and click on
the statistics tab. Which of those variables would be useful in assessing profitability?
7. Can you think of any other settings, besides financial reports, where tagged data
might be useful for fast, accurate analysis generally completed by computers? How
could it be used in a hospital setting? Or at your university?
8. Can you think of how sentiment analysis might be used in a marketing setting? How
could it be used in a hospital setting? Or at your university? When would it be
especially good to measure the sentiment?
Problems
1. Can you think of situations where sentiment analysis might be helpful to analyze
press releases or earnings announcements? What additional information might it
provide that is not directly in the overall announcement? Would it be useful to have
sentiment analysis automated to just get a basic sentiment measure versus the
base level of sentiment expected in a press announcement or earnings
announcement?
2. We noted in the text that negative words in the financial dictionary include words like
loss, claims, impairment, adverse, restructuring, and litigation. What other
negative words might you add to that list? What are your thoughts on positive words
that would be included in the financial dictionary, particularly those that might be
different than standard English dictionary usage?
3. You’re asked to figure out how the stock market responded to Amazon’s announcement
on June 16, 2017, that it would purchase Whole Foods—arguably a
transformational change for Amazon, Walmart, and the whole retail industry.
Required:
a. Go to finance.yahoo.com, type in the ticker symbol for Amazon (AMZN), click on
historical data, and input the dates around June 16, 2017. Specifically, see how
much the stock price changed on June 16.
b. Do the same analysis for Walmart (WMT), which was arguably most directly
affected, over the same dates and see what happened to its stock price.
4. The preceding question asked you to figure out how the stock market responded to
Amazon’s announcement that it would purchase Whole Foods. The question now is
if the stock market for Amazon had higher trade volume on that day than the average
of the month before.
Required:
a. Go to finance.yahoo.com, type in the ticker symbol for Amazon (AMZN), click on
historical data, and input the dates from May 15, 2017, to June 16, 2017.
Download the data, calculate the average volume for the month prior to June 16,
and compare it to the trading volume on June 16. Any effect on trading volume of
the Whole Foods announcement by Amazon?
b. Do the same analysis for Walmart (WMT) over the same dates and see what happened
to its trading volume. Any effect on trading volume of the Whole Foods
announcement by Amazon?
5. Go to Loughran and McDonald’s sentiment word lists at https://www3.nd.edu/~mcdonald/
Word_Lists.html and download the Master Dictionary. These are what they’ve used to
assess sentiment in financial statements and related financial reports. Give five words
that are considered to be “negative” and five words that are considered to be
“constraining.” How would you use this in your analysis of sentiment of an accounting
report?
6. Go to Loughran and McDonald’s sentiment word lists at https://www3.nd.edu/~mcdonald/
Word_Lists.html and download the Master Dictionary. These are what they’ve used to
assess sentiment in financial statements and related financial reports. Give five words
that are considered to be “litigious” and five words that are considered to be
“positive.”
Lab 8-1 Use XBRLAnalyst to Access XBRL Data
Company summary
This lab will pull in XBRL data from Fortune 100 companies listed with the SEC. You
have the option to analyze a pair of companies of your choice based on your own interest
level. This lab will have you compare other companies as well.
Data
The data used in this analysis are XBRL-tagged data from Fortune 100 companies. The
data are pulled from FinDynamics, which in turn pulls the data from the SEC.
Technique
• You will use a combination of spreadsheet formulas and live XBRL data to generate
a spreadsheet that is adaptable and dynamic. In other words, you will create a
template that can be used to answer several financial statement analysis questions.
Software needed
• Google Sheets (sheets.google.com)
• iXBRLAnalyst script (https://findynamics.com/gsheets/ixbrlanalyst.gs)
5. Click Save and name the project XBRL.
6. Close the Script Editor window and return to your Google Sheet.
7. Reload/refresh the page. If you see a new iXBRLAnalyst menu appear, you are
now connected to the XBRL data.
8. Test your connection by typing in the following formula anywhere on your sheet:
=XBRLFact("AAPL","AssetsCurrent","2017"). If your connection is good, it should
return the value 128645000000 for Apple Inc.’s 2017 balance in current assets.
9. Delete the formula and continue to the next step.
Note: Once you’ve added the iXBRLAnalyst script to a Google Sheet, you can simply
open that sheet, then go to File > Make a copy . . . , and the script will automatically be
copied to the new sheet.
The basic formulas available with the iXBRLAnalyst script are:
=FinValue(company, tag, year, period, member, scale)
=XBRLFact(company, tag, year, period, member, scale, true)
=SharePriceStats(company, date, duration, request)
where:
company = ticker symbol (e.g., “AAPL” for Apple Inc.)
tag = XBRL tag or normalized tag (e.g., “NetIncomeLoss” or “[Net Income]”)
year = reporting year (e.g., “2017”)
period = fiscal period (e.g., “Q1” for 1st Quarter or “Y” for year)
scale = rounding (e.g., “k,” “thousands,” or “1000” for thousands) [Note: There is an
error with rounding, so it is suggested to simply divide the formula by the scale instead,
e.g.
=XBRLFact(c,t,y,p)/scale.]
Because companies frequently use different tags to represent similar concepts (such
as the tags ProfitLoss or NetIncomeLoss to identify Net Income), it is important to make
sure you’re using the correct values. FinDynamics attempts to coordinate the diversity of
tags by using normalized tags that use formulas and relationships instead of direct tags.
Normalized tags must be contained within brackets []. Some examples are given in Lab
Table 8-1A.
If you’re looking for specific XBRL tags, you can explore the current XBRL
taxonomy at xbrlview.fasb.org.
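For example, a single fact can be pulled with either a direct tag or a normalized tag. A sketch (the exact figure returned depends on the filing data) that returns Apple’s fiscal 2016 net income in millions, using the normalized [Net Income] tag, the Y (annual) period, and the divide-by-scale workaround described above, is:

=XBRLFact("AAPL","[Net Income]","2016","Y")/1000000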
LAB TABLE 8-1A Normalized Tags
Balance Sheet: [Cash, Cash Equivalents and Short-Term Investments], [Short-Term Investments], [Accounts Receivable, Current], [Inventory], [Other Current Assets], [Current Assets], [Net of Property, Plant & Equipment], [Long-Term Investments], [Intangible Assets, Net], [Goodwill], [Other Noncurrent Assets], [Noncurrent Assets], [Assets], [Accounts Payable and Accrued Liabilities, Current], [Short-Term Borrowing], [Long-Term Debt, Current], [Other Current Liabilities], [Current Liabilities], [Other Noncurrent Liabilities], [Noncurrent Liabilities], [Liabilities], [Preferred Stock], [Common Stock], [Additional Paid-in Capital], [Retained Earnings (Accumulated Deficit)], [Equity Attributable to Parent], [Equity Attributable to Noncontrolling Interest], [Stockholders’ Equity], [Liabilities & Equity]
Income Statement: [Revenue], [Cost of Revenue], [Gross Profit], [Selling, General & Administrative Expense], [Research & Development Expense], [Depreciation (& Amortization), IS], [Non-Interest Expense], [Other Operating Expenses], [Operating Expenses], [Operating Income], [Other Operating Income], [Non-Operating Income (Expense)], [Interest Expense], [Costs and Expenses], [Earnings Before Taxes], [Income Taxes], [Income from Continuing Operations], [Income from Discontinued Operations, Net of Taxes], [Extraordinary Items, Gain (Loss)], [Net Income], [Net Income Attributable to Parent], [Net Income Attributable to Noncontrolling Interest], [Preferred Stock Dividends and Other Adjustments], [Comprehensive Income (Loss)], [Other Comprehensive Income (Loss)], [Comprehensive Income (Loss) Attributable to Parent], [Comprehensive Income (Loss) Attributable to Noncontrolling Interest]
Statement of Cash Flows: [Cash From Operations (CFO)], [Changes in Working Capital], [Changes in Accounts Receivables], [Changes in Liabilities], [Changes in Inventories], [Adjustments of Non-Cash Items, CF], [Provision For Doubtful Accounts], [Depreciation (& Amortization), CF], [Stock-Based Compensation], [Pension and Other Retirement Benefits], [Interest Paid], [Other CFO], [Cash from Investing (CFI)], [Capital Expenditures], [Payments to Acquire Investments], [Proceeds from Investments], [Other CFI], [Cash From Financing (CFF)], [Payment of Dividends], [Proceeds from Sale of Equity], [Repurchase of Equity], [Net Borrowing], [Other CFF], [Effect of Exchange Rate Changes], [Total Cash, Change], [Net Cash, Continuing Operations], [Net CFO, Continuing Operations], [Net CFI, Continuing Operations], [Net CFF, Continuing Operations], [Net Cash, DO], [Net CFO, DO], [Net CFI, DO], [Net CFF, DO]
10. In your Google Sheet, begin by entering the values for the tags, as shown:
LAB EXHIBIT 8-1A
       A          B
1      Company    AAPL
2      Year       2016
3      Period     Y
4      Scale      1000000
11. Then set up your financial statement using the following normalized tags and
periods. Note: Because we already identified the most current year in cell B2, we’ll use a
formula to find the three most recent years.
LAB EXHIBIT 8-1B
       A                                                  B       C       D
6                                                         =$B2    =B6-1   =C6-1
7 [Revenue]
8 [Cost of Revenue]
9 [Gross Profit]
10 [Selling, General & Administrative Expense]
11 [Research & Development Expense]
12 [Other Operating Expenses]
13 [Operating Expenses]
14 [Operating Income]
15 [Depreciation (& Amortization), CF]
16 [Interest Income]
17 [Earnings before Taxes]
18 [Income Taxes]
19 [Net Income]
12. Now enter the =XBRLFact() formula to pull in the correct values, using absolute and mixed
references (e.g., $B$1 or $A7) as necessary. For example, the formula in B7 should be
=XBRLFact($B$1,$A7,B$6,$B$3)/$B$4.
13. If you’ve used relative references correctly, you can either drag the formula down
and across columns B, C, and D, or copy and paste the cell (not the formula itself)
into the rest of the table.
14. Use the formatting tools to clean up your spreadsheet, then take a screenshot
(label it 8-1A).
Next, you can begin editing your dynamic data and expanding your analysis,
identifying trends and ratios.
15. In your Google Sheet, use a sparkline to show the change in income
statement accounts:
a. In cell E7, type: =SPARKLINE(B7:D7).
b. Note: Because the most recent year is in column B and the oldest year is in column D, time runs from right to left in the sparkline, so the line trends toward the left.
16. Now perform a vertical analysis in the columns to the right showing each value as
a percentage of revenue:
a. Copy cells B6:D6 into F6:H6.
b. In F7, type =B7/B$7.
c. Drag the formula to fill in F7:H19.
d. Format the numbers as a percentage.
e. Add a sparkline in Column I.
17. Take a screenshot (label it 8-1B).
End of Lab
In Lab Exhibit 8-2B, you’ll find the common-size ratios for each company listed in Lab
Exhibit 8-2A: income statement items as a percentage of revenue and balance sheet items
as a percentage of assets.
A B C D E F G H I J
As a Percentage of Sales 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
Revenue 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
Cost of Goods Sold 64.9% 37.1% 74.9% 22.4% 9.6% 34.9% 60.9% 4.5% 39.3% 85.4%
Gross Profit 35.1% 62.9% 25.1% 77.6% 90.4% 65.1% 39.1% 95.5% 60.7% 14.6%
Research & Development 0.0% 12.8% 0.0% 12.4% 0.0% 25.4% 0.0% 0.0% 0.0% 4.9%
Selling, General and Administrative Expenses 7.1% 23.2% 20.1% 36.4% 15.7% 24.5% 25.2% 53.8% 36.5% 3.8%
Other Operating Expenses 89.8% 0.5% 0.0% 3.0% –4.3% 1.6% 3.3% 0.0% 0.0% 0.0%
Total Operating Expenses 96.9% 37.2% 20.1% 51.8% 16.0% 51.6% 29.2% 54.7% 36.5% 8.7%
Operating Income/Loss 3.1% 25.7% 5.0% 25.9% 84.0% 13.5% 9.9% 47.7% 20.6% 6.2%
Total Other Income/Expenses Net –0.2% 0.5% 0.0% 0.0% 0.0% –1.8% –4.3% 0.6% –1.4% 0.0%
Interest Expense 0.4% 1.4% 0.5% 0.0% –0.5% 0.0% 0.0% 12.4% 1.8% 0.3%
Income before Tax 2.8% 26.2% 4.5% 40.7% 26.7% 11.7% 5.6% 31.4% 19.4% 5.9%
Income Tax Expense 1.0% 4.4% 1.4% –40.5% 9.1% 1.8% 0.5% 9.0% 3.8% 0.7%
Minority Interest 0.0% 0.0% 0.1% 0.0% 0.7% 0.1% 0.0% 0.0% 0.1% 0.0%
Net Income 1.7% 21.8% 3.1% 80.9% 17.6% 9.9% 6.4% 22.4% 15.6% 5.2%
As a Percentage of Assets 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
Current Assets 54.9% 64.7% 30.2% 37.2% 18.4% 32.1% 13.8% 0.0% 39.0% 69.4%
Cash 23.2% 6.3% 4.4% 7.6% 5.0% 6.8% 0.0% 6.8% 9.8% 9.8%
Investments 8.0% 47.8% 0.0% 22.4% 0.0% 8.2% 0.0% 8.2% 15.6% 1.4%
Receivables 10.0% 4.8% 2.8% 2.5% 9.8% 7.4% 4.2% 40.9% 4.4% 9.8%
Inventory 13.7% 1.0% 22.3% 0.0% 1.5% 5.1% 4.0% 0.0% 3.1% 48.0%
Other Current Assets 8.0% 52.6% 4.9% 27.1% 6.9% 12.8% 2.7% 0.0% 21.7% 11.5%
Total Current Assets 54.9% 64.7% 30.2% 37.2% 18.4% 32.1% 13.8% 0.0% 39.0% 69.4%
Long-Term Investments 0.0% 3.4% 0.0% 16.6% 4.7% 12.0% 9.1% 13.4% 18.6% 1.5%
Property, Plant and Equipment 34.9% 2.9% 55.2% 6.4% 29.7% 12.6% 13.4% 0.4% 12.2% 14.2%
Goodwill 4.5% 21.9% 8.4% 18.9% 30.2% 19.0% 32.9% 3.2% 12.2% 5.9%
Intangible Assets 0.0% 2.1% 0.0% 0.4% 7.6% 18.1% 29.4% 0.1% 11.2% 2.8%
Amortization 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Other Assets 5.7% 5.0% 6.3% 20.5% 9.4% 6.1% 1.4% 0.0% 6.8% 6.1%
Long-Term Assets 45.1% 35.3% 69.8% 62.8% 81.6% 67.9% 86.2% 85.0% 61.0% 30.6%
Total Assets 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
Liabilities 76.9% 47.7% 58.1% 55.8% 48.6% 57.7% 59.0% 87.8% 73.4% 99.0%
Current Liabilities 52.5% 20.5% 32.4% 16.1% 18.3% 18.0% 23.4% 0.0% 30.4% 55.7%
Accounts Payable 30.3% 3.7% 29.4% 1.6% 9.9% 6.7% 12.7% 0.0% 10.9% 28.8%
Current Portion of Long-Term Debt 0.0% 0.0% 1.7% 0.0% 4.0% 0.0% 2.4% 7.8% 4.0% 0.0%
Other Current Liabilities 22.2% 13.3% 0.0% 8.4% 4.4% 10.8% 4.3% 57.6% 15.5% 26.5%
Long-Term Debt 9.2% 20.1% 19.1% 31.5% 17.9% 25.5% 0.0% 0.0% 34.0% 0.0%
Other Liabilities 15.1% 6.4% 2.9% 8.2% 12.4% 14.2% 14.1% 0.0% 9.0% 32.7%
Minority Interest 0.0% 0.0% 1.5% 0.0% 4.4% 0.2% 0.1% 0.0% 0.2% 0.1%
Total Liabilities 76.9% 47.7% 58.1% 55.8% 48.6% 57.7% 59.0% 87.8% 73.4% 99.0%
Total Stockholders’ Equity 23.1% 52.3% 41.9% 44.2% 51.4% 42.3% 41.0% 12.2% 26.6% 1.0%
Total Liabilities and Stockholders’ Equity 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
1. Use a Google Sheet with the iXBRLAnalyst script as well as the normalized
accounts in Lab Exhibit 8-2B (or search for XBRL tags in the FASB taxonomy if
normalized accounts aren’t available) to recreate the ratios above.
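For example, one way to compute a single common-size cell, assuming the ticker, year, and period are stored in B1, B2, and B3 as in Lab 8-1 (your layout may differ), is:

=XBRLFact($B$1,"[Cost of Revenue]",$B$2,$B$3)/XBRLFact($B$1,"[Revenue]",$B$2,$B$3)

which expresses cost of revenue as a percentage of revenue for the selected company.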
2. Take a screenshot (label it 8-2A) of your completed worksheet.
Q1. Using the skills learned from your prior financial accounting classes, your
ability to extract information from XBRL, and your knowledge of common-size
financial statements, match the company names in Lab Exhibit 8-2A
with their corresponding ratios in each column of Lab Exhibit 8-2B.
• Column A = which company?
• Column B =
• Column C =
• Column D =
• Column E =
• Column F =
• Column G =
• Column H =
• Column I =
• Column J =
End of Lab
Data
• Financial Elements from XBRL from SEC Filings
Software needed
• Google Account
• Google Sheets
• Browser connected to Internet
Q1. How does XBRL fulfill the need for real-time, accurate financial data?
Q2. Why is it useful to compare multiple companies at once?
Part 3: Input Ticker Symbols
Refer to Lab Exhibit 8-3B for your industry’s ticker symbols.
7. In the Main Company Ticker field, input the ticker of the company you would like to focus your analysis
on and press Enter. In a moment, the value on the spreadsheet will change to
Loading. . . and then show your company’s financial figures.
8. In the Most Recent Year field, enter the most recent reporting year. It may be the
current year or the previous year.
9. In the Period field, enter either FY for a fiscal year or Q1 for 1st quarter, etc.
10. In the Round to field, choose the rounding amount. 1,000 will round to thousands
of dollars; 1,000,000 will round to millions of dollars.
11. In the Comparable 1 Ticker field, input the ticker of a second company you would
like to compare with your first company.
12. In the Comparable 2 Ticker field, input the ticker of a third company you would like
to compare with your first company.
13. Take a screenshot (label it 8-3A) of your figure with the financial statements of
your chosen companies.
End of Lab
Lab 8-4 Use SQL to Query an XBRL Database
Company summary
As the chapter mentioned, there are more than 19,000 tags in the XBRL taxonomy, which doesn’t
even include the custom tags that organizations have created for themselves. The
normalized tags XBRLAnalyst provides can be helpful, but sometimes you will need to find a
more specific tag. One way that you can do this is by using SQL to query an XBRL
database for all tags that are similar to the normalized tag you are working with.
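A minimal sketch of such a query is shown below; the table and field names (Element and Name) are placeholders, so substitute the actual names used in XBRL.accdb:

SELECT Name
FROM Element
WHERE Name LIKE "*cash*";

In Access’s default query mode the * wildcard matches any string of characters; if the database runs in ANSI-92 mode (or you run the query outside Access), use % instead.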
Data
We have provided a subset of the XBRL database in the Access database file
XBRL.accdb. We have used the Arelle open-source XBRL platform to build our subset,
which in turn pulls the data from the SEC.
Technique
• You will use SQL to query the database.
Software needed
• Microsoft Access
Part 5: Address and Refine Results
Based on the massive number of tags that contain the word cash, we may decide
to be more specific with the query.
7. This time, refine the query to show only the tags that begin with the word Cash.
8. Take a screenshot (label it 8-4C) that includes the first rows that are in your output,
as well as the bottom corner that indicates how many total rows were in the query
results.
Q4. How would you further drill down into the first question about the large filers?
Q5. Do you think the number of outputs you got for the different types of tags
with the word Cash is reasonable? What recommendation would you have
regarding the numerous elements in the taxonomy?
End of Lab
Glossary
A
Audit Data Standards (ADS) (193) The Audit Data Standards define common tables and fields that are needed by auditors to perform common audit tasks. The AICPA developed these standards.
B
Balanced Scorecard (252) A particular type of digital dashboard that is made up of strategic objectives, as well as KPIs, target measures, and initiatives, to help the organization reach its target measures in line with strategic goals.
Benford’s law (100) An observation about the frequency of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.
big data (4) Datasets that are too large and complex for businesses’ existing systems to handle utilizing their traditional capabilities to capture, store, manage, and analyze these datasets.
C
causal modeling (95) A data approach similar to regression, but used when the relationship between independent and dependent variables is hypothesized to be such that the independent variables cause or are associated with the dependent variable.
classification (9, 95) A data approach used to assign each unit in a population into a few categories potentially to help with predictions.
clustering (10, 94) A data approach used to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way.
co-occurrence grouping (10, 94) A data approach used to discover associations between individuals based on transactions involving them.
composite primary key (42) A special case of a primary key that exists in linking tables. The composite primary key is made up of the two primary keys in the table that it is linking.
computer-assisted audit techniques (CAATs) (212) Computer-assisted audit techniques (CAATs) are automated scripts that can be used to validate data, test controls, and enable substantive testing of transaction details or account balances and generate supporting evidence for the audit.
continuous data (142) One way to categorize quantitative data, as opposed to discrete data. Continuous data can take on any value within a range. An example of continuous data is height.
D
Data Analytics (4) The process of evaluating data with the purpose of drawing conclusions to address business questions. Indeed, effective Data Analytics provides a way to search through large structured and unstructured data to identify unknown patterns or relationships.
data dictionary (14, 43) Centralized repository of descriptions for all of the data attributes of the dataset.
data reduction (11, 94) A data approach used to reduce the amount of information that needs to be considered to focus on the most critical items (i.e., highest cost, highest risk, largest impact, etc.).
data request form (45) A method for obtaining data if you do not have access to obtain the data directly yourself.
data warehouse (193) A data warehouse is a repository of data accumulated from internal and external data sources, including financial data, to help management decision making.
decision boundaries (104) Technique used to mark the split between one class and another.
decision tree (104) Tool used to divide data into smaller groups.
declarative visualizations (143) Made when the aim of your project is to “declare” or present your findings to an audience. Charts that are declarative are typically made after the data analysis has been completed and are meant to exhibit what was found in the analysis steps.
descriptive analytics (212) Descriptive analytics summarize activity or master data elements based on certain attributes.
descriptive attributes (42) Attributes that exist in relational databases that are neither primary nor foreign keys. These attributes provide business information, but are not required to build a database. An example would be “Company Name” or “Employee Address.”
diagnostic analytics (212) Diagnostic analytics looks for correlations or patterns of interest in the data.
digital dashboard (252) An interactive report showing the most important metrics to help users understand how a company or an organization is performing. Often created using Excel or Tableau.
standardized metrics (303) Metrics used by data vendors to allow easier comparison of company reported XBRL data.

structured data (98) Data that are organized and reside in a fixed field within a record or a file. Such data are generally contained in a relational database or spreadsheet and are readily searchable by search algorithms.

supervised approach/method (94) Approach used to learn more about the basic relationships between independent and dependent variables that are hypothesized to exist.

support vector machines (106) A discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe).

systems translator software (193) Systems translator software maps the various tables and fields from varied ERP systems into a consistent format.

T

test data (104) A set of data used to assess the degree and strength of a predicted relationship established by the analysis of training data.

training data (104) Existing data that have been manually evaluated and assigned a class, which assists in classifying the test data.

U

unsupervised approach/method (94) Approach used for data exploration looking for potential patterns of interest.

X

XBRL (eXtensible Business Reporting Language) (102, 302) A global standard for exchanging financial reporting information that uses XML; a global standard for Internet communication among businesses.

XBRL taxonomy (302) Defines and describes each key data element (like cash or accounts payable). The taxonomy also defines the relationships between each element (like inventory is a component of current assets and current assets is a component of total assets).

XBRL-GL (303) Stands for XBRL general ledger; relates to the ability of an enterprise system to tag financial elements within the firm's financial reporting system.
Index
Note: Page numbers followed by n indicate source notes or footnotes.

A

Access. See Microsoft Access
Account balance tests
    Lab 6-2: Perform Substantive Tests of Account Balances, 234–240
    Lab 6-3: Finding Duplicate Payments, 240–241
Accounting. See also Auditing; Financial accounting; Managerial accounting
    Data Analytics skills needed in, 12
    impact of Data Analytics on, 5–7
    tax data analytics, 7
Accounting cycle
    data use and storage in, 40
    Unified Modeling Language (UML) diagram, 40
ACL, 214
Activity ratios, 305
Address and Refine Results stage. See also Data visualization
    in audit data analytics, 214
    in Balanced Scorecard, 258
    described, 11
    in LendingClub example, 17–19
    Lab 2-2: Use PivotTables to Denormalize and Analyze the Data, 67
    Lab 2-6: Comprehensive Case: Dillard's Store Data/How to Create an Entity-Relationship Diagram, 77
    Lab 2-7: Comprehensive Case: Dillard's Store Data/How to Preview Data from Tables in a Query, 79
    Lab 2-8: Comprehensive Case: Dillard's Store Data/Connecting Excel to a SQL Database, 89
    Lab 3-3: Classification, 125
    Lab 3-4: Comprehensive Case: Dillard's Store Data/Data Abstract (SQL) and Regression (Part 1), 134
    Lab 3-5: Comprehensive Case: Dillard's Store Data/Data Abstract (SQL) and Regression (Part II), 136–137
    Lab 4-3: Comprehensive Case: Dillard's Store Data/Create Geographic Data Visualizations in Tableau, 186
    Lab 4-4: Comprehensive Case: Dillard's Store Data/Visualizing Regression in Tableau, 189
    Lab 6-1: Evaluate the Master Data for Interesting Addresses, 234
    Lab 6-4: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part I), 247
    Lab 6-5: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part II—Data Visualization), 249
    Lab 7-2: Create a Balanced Scorecard Dashboard in Tableau, 272
    Lab 7-5: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part III), 294–295
    Lab 7-6: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part IV—Putting It All Together), 297–299
    Lab 8-1: Use XBRLAnalyst to Access XBRL Data, 317
    Lab 8-4: Use SQL to Query an XBRL Database, 325
Address evaluation
    data reduction and, 102
    Lab 6-1: Evaluate the Master Data for Interesting Addresses, 232–234
Advanced Environmental Recycling Technologies, profiling in managerial accounting, 99
Age analysis, in audit data analytics, 213, 215–216
Ahmed, A. S., 103n
Alarms, in automating the audit plan, 197
Alibaba, 3, 10, 24, 304, 307
Alphabet, 304, 307
Amazon, 3, 10, 261–262, 263, 312–313, 318
American Institute of Certified Public Accountants (AICPA), Audit Data Standards (ADSs), 45, 193–195, 199, 210, 211
Analytics mindset, 12
Analytics Tools
    Python, 70
    SQL queries, 64–65
Apple Inc., 29, 312, 315, 321–322
Applied statistics, 213, 226
Artificial intelligence, 213, 227
Assurance services, 192. See also Auditing
Audience, in Communicate Insights stage, 158
Audit data analytics, 208–249
    automating the audit plan, 195–196, 197
    in continuous auditing, 196–197
    descriptive analytics, 212, 213, 214–219
    diagnostic analytics, 212, 213, 219–225
    examples of, 213
    IMPACT model in, 210–214
    impact of auditing on business, 5–6
    in internal audits, 99, 101, 192–193, 196
    predictive analytics, 212, 213, 226, 228
    prescriptive analytics, 212, 213, 226, 227, 228
    when to use, 210–214
Clustering, 107–109
    defined, 10, 94, 97, 110, 213, 225
    example in auditing, 108–109
    in Perform the Analysis/Test Plan stage, 10, 94, 97, 107–109, 110
    by Walmart, 107–108
CMA (Certified Management Accountant), 305
Coca-Cola, 318
College Scorecard
    example dataset, 55–56
    Lab 2-5: College Scorecard Extraction and Data Preparation, 73–74
    Lab 3-2: Regression in Excel, 120–121
Color, in data visualization, 156–157
Column charts, 145
Committee of Sponsoring Organizations (COSO) Enterprise Risk Management Framework, 195
Common-size financial statements, Lab 8-2: Use XBRLAnalyst to Create Dynamic Common-Size Financial Statements, 317–320
Communicate Insights stage, 157–158. See also Data visualization
    audience and tone, 158
    in audit data analytics, 214
    in Balanced Scorecard, 258
    content and organization, 157–158
    described, 11
    in LendingClub example, 19
    revising, 158
    Lab 2-2: Use PivotTables to Denormalize and Analyze the Data, 67
    Lab 4-1: Using PivotCharts to Visualize Declarative Data, 164–166
    Lab 4-2: Use Tableau to Perform Exploratory Analysis and Create Dashboards, 174–175
    Lab 6-4: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part I), 247
    Lab 7-5: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part III), 295
Composite primary key, 42, 51
Comprehensive case. See Dillard's Department Store (comprehensive case)
Computer-assisted audit techniques (CAATs), 212, 228
Connect (PwC), 198
ConocoPhillips, 321–322
Content, in Communicate Insights stage, 157–158
Continuous auditing
    alarms and exceptions, 197
    defined, 197
    example of profiling in, 100
    techniques in, 196–197
Continuous data, 142–143, 159
Co-occurrence grouping
    defined, 10, 21, 97, 110
    example of, 10
    in Perform the Analysis/Test Plan stage, 10, 94, 97
Costco, 321–322
Coughlin, Tom, 99
CPA (Certified Public Accountant), 305
Credit or risk score
    in LendingClub example, 15, 17–19, 25–26, 31–32, 43, 104, 123
    Z-scores, 95, 100, 213, 219–220, 226
Customer key performance indicators (KPIs), 255

D

Daily Mail, 151
Data Analytics. See also Audit data analytics; Financial statement analysis; IMPACT model and specific steps in IMPACT model
    defined, 4, 21
    environmental scanning with, 7
    impact on accounting, 5–7
    impact on business, 4–5
    Labs. See Labs
    overview, 8–12
    skills needed by accountants, 12
    value and size of, 5
Database maps, 198
Data dictionaries
    auditing and, 198
    creating and using, 43–44
    defined, 14, 21, 43, 51
    in LendingClub example, 14, 43–44
Data extraction. See also Master the Data stage
    in ETL process, 44–48
    Lab 2-1: Create a Request for Data Extraction, 57–59
    Lab 2-5: College Scorecard Extraction and Data Preparation, 54–56 (dataset), 73–74
Data preparation and cleaning, 12, 38–91. See also Master the Data stage
    data dictionaries. See Data dictionaries
    ETL (extraction, transformation, and loading) process, 44–50
    relationships in relational database, 41–43
    Lab 2-1: Create a Request for Data Extraction, 57–59
    Lab 2-2: Use PivotTables to Denormalize and Analyze the Data, 59–67
    Lab 2-3: Resolve Common Data Problems in Excel and Access, 67–71
    Lab 2-4: Generate Summary Statistics in Excel, 71–72
    Lab 2-5: College Scorecard Extraction and Data Preparation, 54–56 (dataset), 73–74
    Lab 2-6: Comprehensive Case: Dillard's Store Data/How to Create an Entity-Relationship Diagram, 74–77
    Lab 2-7: Comprehensive Case: Dillard's Store Data/How to Preview Data from Tables in a Query, 77–79
    Lab 2-8: Comprehensive Case: Dillard's Store Data/Connecting Excel to a SQL Database, 80–89
DuPont ratio analysis
    defined, 306, 309
    sparklines in, 307
    Lab 8-3: Use XBRL to Access and Analyze Financial Statement Ratios—The Use of DuPont Ratios, 320–323

E

eBay, 3, 318
Employee performance key performance indicators (KPIs), 256
Employee turnover, regression and, 103
Entity-Relationship Diagram
    Lab 1-4—Dillard's Store Data, 36
    Lab 2-6: Dillard's Store Data/How to Create an Entity-Relationship Diagram, 74–77
Environmental and social sustainability key performance indicators (KPIs), 256
Environmental scanning, 7
Equifax, 21
ERP systems, 9, 33, 193–194, 257
ETL (extract, transform, and load), 44–50
    defined, 44, 51
    extraction phase, 44–48
    transformation phase, 48–49
    loading phase, 50
    overview, 40
Exact match, in audit data analytics, 213, 223–225
Excel. See entries beginning with "Microsoft Excel"
Exception report, 197
Exceptions, in automating the audit plan, 197
Expected credit losses (ECLs), regression and, 103
Experian, 21
Explanatory variable. See Predictor (independent) variable
Exploratory visualizations, 143, 159
External auditing. See Auditing
ExxonMobil, 321–322

F

Facebook, 7, 10, 11, 321–322
Fawcett, Tom, 9, 9n
Filled geographic maps, 147
Financial accounting. See also Financial statement analysis
    CFA (Chartered Financial Analyst), 305
    CPA (Certified Public Accountant), 305
    Data Analytics impact on financial reporting, 6–7
    Lab 1-1: Data Analytics in Financial Accounting, 28–31
Financial Accounting Standards Board (FASB)
    Accounting Standards Update 2016-13, 103
    XBRL taxonomy, 29–30, 302, 309
Financial performance key performance indicators (KPIs), 255
Financial statement analysis, 300–325
    data reduction and, 102
    defined, 305, 309
    ratio analysis, 305–307, 309, 320–323
    sentiment analysis, 213, 226, 307–308
    text mining, 307–308
    XBRL in, 302–304
    Lab 8-1: Use XBRLAnalyst to Access XBRL Data, 314–317
    Lab 8-2: Use XBRLAnalyst to Create Dynamic Common-Size Financial Statements, 317–320
    Lab 8-3: Use XBRL to Access and Analyze Financial Statement Ratios—The Use of DuPont Ratios, 320–323
    Lab 8-4: Use SQL to Query an XBRL Database, 323–325
Financing ratios, 305
FinDynamics, Lab 8-1: Use XBRLAnalyst to Access XBRL Data, 314–317
Flat files
    defined, 41, 51, 193, 199
    LendingClub example, 43–44
    relational database versus, 41–42, 193
Flood of alarms, 197
Forbes Insights/KPMG report, 5–6
Foreign key (FK), 42, 51
Fraudulent insurance claims, 108–109
Fuzzy match
    audit data analytics, 213, 223–225
    defined, 102, 110, 228
    in Perform the Analysis/Test Plan stage, 102
    Lab 3-1: Data Reduction, 118–120

G

Gap detection, data reduction and, 102
Gartner, 148
Generalized audit software (GAS), 214
General Ledger Standard (Audit Data Standards), 194
General Motors, 304, 307
Geographic data
    filled geographic maps, 147
    Lab 4-3: Comprehensive Case: Dillard's Store Data/Create Geographic Data Visualizations in Tableau, 175–186
Google, 7
Google Sheets
    Lab 8-1: Use XBRLAnalyst to Access XBRL Data, 314–317
    Lab 8-2: Use XBRLAnalyst to Create Dynamic Common-Size Financial Statements, 317–320
    Lab 8-3: Use XBRL to Access and Analyze Financial Statement Ratios—The Use of DuPont Ratios, 320–323
Gray, Glen L., 4n
H

Halo (PwC), 191, 198
Harriott, J. S., 8, 8n, 40n
Harvard Business Review, 140–141
Headings, removing in data cleaning, 49
Heat maps, 145
Heterogeneous systems approach, 193, 194, 199
Hewlett-Packard Co., 209
Homogeneous systems approach, 193, 194, 199
Howson, C., 148
Hypothesis testing
    Lab 6-4: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part I), 241–247
    Lab 6-5: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part II—Data Visualization), 247–249

I

IBM, 303
IDEA, 50, 214, 215
    in audit data analytics, 220, 221–222, 223
    age analysis, 215–216
    sampling, 218, 219
    sorting, 217
    summary statistics, 217
    Lab 6-1: Evaluate the Master Data for Interesting Addresses, 233–234
Identify the Questions/Problem stage
    in audit data analytics, 210
    in Balanced Scorecard, 254–256
    described, 8
    in ETL (extraction, transformation, loading) process, 45
    in LendingClub example, 13, 31–32
    Lab 1-1—Data Analytics in Financial Accounting, 28–29
    Lab 1-2—Data Analytics in Managerial Accounting, 31–32
    Lab 1-3—Data Analytics in Auditing, 33–34
    Lab 1-4—Dillard's Store Data (comprehensive case), 35–36
    Lab 2-1—Create a Request for Data Extraction, 57
    Lab 2-2—Use PivotTables to Denormalize and Analyze the Data, 60
    Lab 2-3: Resolve Common Data Problems in Excel and Access, 68
    Lab 2-5: College Scorecard Extraction and Data Preparation, 73
    Lab 2-6: Comprehensive Case: Dillard's Store Data/How to Create an Entity-Relationship Diagram, 75
    Lab 2-7: Comprehensive Case: Dillard's Store Data/How to Preview Data from Tables in a Query, 77
    Lab 2-8: Comprehensive Case: Dillard's Store Data/Connecting Excel to a SQL Database, 80
    Lab 2-9: Comprehensive Case: Dillard's Store Data/Joining Tables, 89
    Lab 3-1: Data Reduction, 116
    Lab 3-2: Regression in Excel, 120
    Lab 3-3: Classification, 122
    Lab 3-4: Comprehensive Case: Dillard's Store Data/Data Abstract (SQL) and Regression (Part 1), 125–126
    Lab 3-5: Comprehensive Case: Dillard's Store Data/Data Abstract (SQL) and Regression (Part II), 135
    Lab 4-2: Use Tableau to Perform Exploratory Analysis and Create Dashboards, 167
    Lab 4-3: Comprehensive Case: Dillard's Store Data/Create Geographic Data Visualizations in Tableau, 176
    Lab 4-4: Comprehensive Case: Dillard's Store Data/Visualizing Regression in Tableau, 187
    Lab 6-1: Evaluate the Master Data for Interesting Addresses, 232
    Lab 6-2: Perform Substantive Tests of Account Balances, 235
    Lab 6-3: Finding Duplicate Payments, 240
    Lab 6-4: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part I), 242
    Lab 6-5: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part II—Data Visualization), 248
    Lab 7-1: Evaluate Management Requirement and Identify Useful KPIs from a List, 266
    Lab 7-2: Create a Balanced Scorecard Dashboard in Tableau, 266–267
    Lab 7-3: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part I), 275
    Lab 7-4: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part II), 280
    Lab 7-5: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part III), 289
    Lab 7-6: Comprehensive Case: Dillard's Store Data/Creating KPIs in Excel (Part IV—Putting It All Together), 296
    Lab 8-1: Use XBRLAnalyst to Access XBRL Data, 314
    Lab 8-3: Use XBRL to Access and Analyze Financial Statement Ratios—The Use of DuPont Ratios, 320–321
    Lab 8-4: Use SQL to Query an XBRL Database, 323
Idoine, C. J., 148n
IMPACT model, 8–12. See also entries for specific steps in IMPACT model
    in audit data analytics, 210–214
    in automating the audit plan, 196
    with Balanced Scorecard, 252, 254–258
    iterative nature of, 12
    LendingClub example, 13–20, 24–26, 67–71
    Step 1: Identify the Questions/Problem, 8
    Step 2: Master the Data, 8–9
    Step 3: Perform Analysis/Test Plan, 9–11
    Step 4: Address and Refine Results, 11
    Step 5: Communicate Insights, 11
    Step 6: Track Outcomes, 11
Inconsistencies across data, correcting in data cleaning, 49
Independent (predictor) variable, defined, 9, 21
T

Tableau, 50
    color, 156–157
    creating good charts through bad examples, 150–155
    data scale and increments, 156
    inputting custom query into Connections Page, 177–180
    joining tables into Connections Page, 180–183
    visuals and, 149–155
    Lab 4-2: Use Tableau to Perform Exploratory Analysis and Create Dashboards, 166–175
    Lab 4-3: Comprehensive Case: Dillard's Store Data/Create Geographic Data Visualizations in Tableau, 175–186
    Lab 4-4: Comprehensive Case: Dillard's Store Data/Visualizing Regression in Tableau, 186–189
    Lab 7-2: Create a Balanced Scorecard Dashboard in Tableau, 266–272
Tables, joining
    into Tableau Connections page, 180–183
    Lab 2-9: Comprehensive Case: Dillard's Store Data/Joining Tables, 89–91
Takeda, C., 103n
Tao, Y., 3n
Target, 94, 321–322
Tax data analytics, 7
TeamMate Analytics, 198, 214, 227
Tesla, 262
    Lab 7-1: Evaluate Management Requirement and Identify Useful KPIs from a List, 264–266
Test data, 104, 110
Text mining, 307–308
Thiprungsri, S., 108, 108n, 109n
Thomas, S., 103n
Time validation, in data transformation, 49
Tone, in Communicate Insights stage, 158
Track Outcomes stage
    in audit data analytics, 214
    in Balanced Scorecard, 258
    described, 11
    in LendingClub example, 19
    Lab 6-4: Comprehensive Case: Dillard's Store Data/Hypothesis Testing (Part I), 247
Training data, 104, 110
Transformation
    in ETL process, 48–49
    steps in, 49
TransUnion, 21
Travel and entertainment (T&E)
    data reduction in internal auditing, 101
    profiling in internal auditing, 99
Tree maps, 145
Trendlines, in ratio analysis, 306, 307
Trump, Donald, 139

U

Underfitting classifiers, 106–107
Unified Modeling Language (UML), 40, 198
    Lab 5-3: Identify Audit Data Requirements, 205–206
U.S. GAAP Financial Reporting Taxonomy, 29–30, 302, 309
U.S. Securities and Exchange Commission (SEC), 312, 314
    EDGAR database, 29
    generally accepted accounting principles (GAAP), 29–30, 302, 309
    text mining and, 307–308
University of Arkansas, Walton College of Business, 252, 253, 262
Unsupervised approach/method, 94, 110, 111

V

Validating data, in ETL process, 48–49
Vasarhelyi, M. A., 108, 108n, 109n
Visualization. See Data visualization

W

Walmart, 99, 107–108, 111, 256, 259, 262, 306, 307, 312–313, 318, 321–322
Walt Disney Company, 318
Walton College of Business, University of Arkansas, 252, 253, 262
Wang, H., 3n
Wayfair, 263
Weka, 50
    Lab 3-3: Classification, 124–125
Wells-Fargo, 321–322
What-if analysis, 213
Whole Foods, 312–313
Witt, G. C., 41n
Word clouds, 145, 146
Workflow, in auditing, 197–198
Working papers
    in auditing, 197–198
    electronic, 198
    Lab 5-2: Review Changes to Working Papers (OneDrive), 204
Writing for Computer Science (Zobel), 157–158

X

XBRL (eXtensible Business Reporting Language), 302–304
    adding tags, 30–31, 302–303
    in data reduction, 102
    defined, 102, 110, 302, 309
    DuPont analysis using, 306, 307
    extensible reporting in, 303
    Financial Accounting Standards Board (FASB) XBRL taxonomy, 29–30, 302, 309
    real-time financial reporting and, 303–304
    standardized metrics and, 303, 309
    Lab 8-1: Use XBRLAnalyst to Access XBRL Data, 314–317
    Lab 8-2: Use XBRLAnalyst to Create Dynamic Common-Size Financial Statements, 317–320
    Lab 8-3: Use XBRL to Access and Analyze Financial Statement Ratios—The Use of DuPont Ratios, 320–323
    Lab 8-4: Use SQL to Query an XBRL Database, 323–325
XBRLAnalyst. See under Microsoft Excel
XBRL-GL (eXtensible Business Reporting Language—General Ledger), 303–304, 309
XBRL (eXtensible Business Reporting Language) taxonomy, 29–30, 302, 309
Xero, 198

Z

Zhang, Liang Zhao, 93
Zobel, Justin, 157–158
Z-scores, 95, 100, 213, 219–220, 226