Advancing into Analytics: From Excel to Python and R by George Mount, ISBN 9781492094340, 149209434X
Advancing into Analytics
From Excel to Python and R
George Mount
Advancing into Analytics
by George Mount
Copyright © 2021 Candid World Consulting, LLC. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].
Copyeditor: JM Olejarz
Learning Objective
By the end of this book, you should be able to conduct exploratory
data analysis and hypothesis testing using a programming
language. Exploring and testing relationships is core to analytics.
With the tools and frameworks you’ll pick up in this book, you will
be well positioned to continue learning more advanced data
analysis techniques.
We’ll be using Excel, R, and Python because these are powerful
tools, and because they make for a seamless learning journey. Few
books cover this combination, even though the progression from
spreadsheets into programming is common for analysts, myself
included.
Prerequisites
To meet these objectives, this book makes some technical and
technological assumptions.
Technical Requirements
I am writing this book on a Windows computer with the Office 365
version of Excel for desktop. As long as you have a paid version of
Excel 2010 or greater for either Windows or Mac installed on your
machine, you should be able to follow along with the majority of the
instruction in this book, with some variations, particularly with
PivotTables and data visualization.
NOTE
While Excel offers both free and paid versions online, a paid desktop
version is needed to access some of the features covered in this book.
R and Python are both free, open source tools available for all major
operating systems. I address how to install them later in the book.
Technological Requirements
This book assumes no prior knowledge of R or Python; that said, it
does rely on moderate knowledge of Excel to flatten that learning
curve.
The Excel topics you should be familiar with include the following:
Absolute, relative, and mixed cell references
Conditional logic and conditional aggregation (IF()
statements, SUMIF()/SUMIFS(), and so forth)
Combining data sources (VLOOKUP(), INDEX()/MATCH(), and
so forth)
Sorting, filtering, and aggregating data with PivotTables
Basic plotting (bar charts, line charts, and so forth)
If you would like more practice with these topics before moving on,
I suggest Excel 2019 Bible by Michael Alexander et al. (Wiley).
How I Got Here
Like many in our field, my route to analytics was circuitous. In
school, mathematics became a subject I actively avoided; too much
of it seemed entirely theoretical. I did have some coursework in
statistics and econometrics that caught my interest. It was a breath
of fresh air to apply mathematics to some concrete end.
This exposure to statistics was admittedly scant. I attended a liberal
arts college, where I picked up solid writing and thinking skills, but
few quantitative ones. When I got to my first full-time job, I was
floored by the depth and breadth of the data I was entrusted with
managing. Much of this data lived in spreadsheets and was hard to
get much value out of without intense cleaning and preparation.
Some of this “data wrangling” is to be expected; the New York
Times has reported that data scientists spend 50% to 80% of their
time preparing data for analysis. But I wondered if there were more
efficient ways to clean, manage, and store data. In particular, I
wanted to do this so I could spend more time analyzing the data.
After all, I always found statistical analysis somewhat palatable—
manual and error-prone spreadsheet data preparation, not so much.
Because I enjoyed writing (thank you, liberal arts degree), I started
blogging about tips I picked up in Excel. Through good grace and
hard work, the blog gained traction, and I attribute much of my
professional success to it. You are welcome to stop by at
stringfestanalytics.com; I still post regularly on Excel and analytics
more generally.
As I began to learn more about Excel, my interest spread to other
analytics tools and techniques. By this point, the open source
programming languages R and Python had gained significant
popularity in the data world. But while I made my way through
grasping these languages, I felt unnecessary friction in the learning
path.
“Excel Bad, Coding Good”
I noticed that for Excel users, most R or Python training sounded a
lot like this:
All along, you’ve been using Excel when you really should have
been programming. Look at all these problems Excel has caused!
Time to kick the habit entirely.
That’s the wrong attitude to take for a couple of reasons:
It’s not accurate
The choice between coding and spreadsheets is often framed
like a sort of struggle between good and evil. In reality, it’s
better to think of these as complementary tools rather than
substitutes. Spreadsheets have their place in analytics; so does
programming. Learning and using one does not negate the
other. Chapter 5 discusses this relationship.
NOTE
Both spreadsheets and programming languages are valuable analytics
tools; there’s no need to abandon Excel once you’ve picked up R and
Python.
NOTE
Excel provides the opportunity to learn the fundamentals of data
analytics without the need to learn a new programming language at the
same time. This greatly reduces cognitive overhead.
Book Overview
Now that you understand the spirit of the book and what I hope for
you to achieve, let’s review its structure.
Part I, Foundations of Analytics in Excel
Analytics stands on the shoulders of statistics. In this part, you
will learn how to explore and test relationships between
variables using Excel. You’ll also use Excel to build compelling
demonstrations of some of the most important concepts in
statistics and analytics. This grounding in statistical theory and
framework for conducting analysis will put you on solid footing
for data programming.
TIP
Learning is best done actively; without putting what you’ve read into
immediate practice, you’re likely to forget it.
Don’t Panic
As an author, I hope you find me easygoing and approachable. I do,
however, have one rule for this book: don’t panic! There is an
admittedly steep learning curve at play here, since you’ll be
exploring not just probability and statistics but two programming
languages. This book will introduce you to concepts from statistics,
computer science, and more. They may initially be jarring, but you’ll
begin to internalize them over time. Allow yourself to learn by trial
and error.
I thoroughly believe that with the knowledge you possess about
Excel, this is an achievable order for one book. There may be
moments of frustration and impostor syndrome; it happens to all of
us. Don’t let these moments overshadow the real progress you’ll
make here.
Are you ready? I’ll see you over in Chapter 1.
Constant width
Used for program listings, as well as within paragraphs to refer
to program elements such as code variable or function names,
databases, data types, environment variables, statements, and
keywords.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
NOTE
For more than 40 years, O’Reilly Media has provided technology and
business training, knowledge, and insight to help companies succeed.
How to Contact Us
Please address comments and questions concerning this book to the
publisher:
Sebastopol, CA 95472
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples,
and any additional information. You can access this page at
https://oreil.ly/advancing-into-analytics.
Email [email protected] to comment or ask technical
questions about this book.
For news and information about our books and courses, visit
http://oreilly.com.
Find us on Facebook: http://facebook.com/oreilly.
Follow us on Twitter: http://twitter.com/oreillymedia.
Watch us on YouTube: http://www.youtube.com/oreillymedia.
Acknowledgments
First, I want to thank God for giving me this opportunity to cultivate
and share my talents. At O’Reilly, Michelle Smith and Jon Hassell
have been so enjoyable to work with, and I will be forever grateful
for their offer to have me write a book. Corbin Collins kept me
rolling during the book’s development. Danny Elfanbaum and the
production team turned the raw manuscript into an actual book.
Aiden Johnson, Felix Zumstein, and Jordan Goldmeier provided
invaluable technical reviews.
Getting people to review a book isn’t easy, so I have to thank John
Dennis, Tobias Zwingmann, Joe Balog, Barry Lilly, Nicole LaGuerre,
and Alex Bodle for their comments. I also want to thank the
communities who have made this technology and knowledge
available, often without direct compensation. I’ve made some
fantastic friends through my analytics pursuits, who have been so
giving of their time and wisdom. My educators at Padua Franciscan
High School and Hillsdale College made me fall in love with learning
and with writing. I doubt I’d have written a book without their
influence.
I also thank my mother and father for providing me the love and
support that I’m so privileged to have. Finally, to my late Papou:
thank you for sharing with me the value of hard work and decency.
Part I. Foundations of
Analytics in Excel
Chapter 1. Foundations of
Exploratory Data Analysis
“You never know what is gonna come through that door,” Rick
Harrison says in the opening of the hit show Pawn Stars. It’s the
same in analytics: confronted with a new dataset, you never know
what you are going to find. This chapter is about exploring and
describing a dataset so that we know what questions to ask of it.
The process is referred to as exploratory data analysis, or EDA.
EDA gives us a lot to do. Let’s walk through the process using Excel
and a real-life dataset. The data is in the star.xlsx workbook, in the
star subfolder of the datasets folder in this book’s repository. This
dataset was collected for a
study to examine the impact of class size on test scores. For this
and other Excel-based demos, I suggest you complete the following
steps with the raw data:
Doing these first few analysis tasks will be good practice for other
datasets you want to work with in Excel. For the star dataset, your
completed table should look like Figure 1-2. I’ve named my table
star. This dataset is arranged in a rectangular shape of columns
and rows.
Observations
In this dataset we have 5,748 rows: each is a unique observation.
In this case, measurements are taken at the student level;
observations could be anything from individual citizens to entire
nations.
Variables
Each column offers a distinct piece of information about our
observations. For example, in the star dataset we can find each
student’s reading score (treadssk) and which class type the student
was in (classk). We’ll refer to these columns as variables. Table 1-1
describes what each column in star is measuring:
Table 1-1. Descriptions of the star dataset’s
variables
Column Description
sex Sex
race Race
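To make the row and column vocabulary concrete, here is a minimal Python sketch. The column names treadssk, classk, and sex come from the star dataset described above, but the three rows and every value in them are invented for illustration and are not taken from star.xlsx.

```python
# Each dictionary is one observation (a row); each key is a variable (a column).
# NOTE: these rows are made up for illustration, not real records from star.xlsx.
observations = [
    {"treadssk": 447, "classk": "small", "sex": "girl"},
    {"treadssk": 450, "classk": "regular", "sex": "boy"},
    {"treadssk": 439, "classk": "small", "sex": "boy"},
]

# The number of observations is the number of rows.
n_observations = len(observations)

# The variables are the column names shared by every row.
variables = list(observations[0].keys())

# A single variable is every observation's value for one column.
reading_scores = [row["treadssk"] for row in observations]

print(n_observations)   # number of rows
print(variables)        # column names
print(reading_scores)   # one column of values
```

Storing each row as a dictionary is only one possible layout; it is used here because it mirrors the table's shape without requiring any packages.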
There are further types of variables that could be covered here: for
example, we won’t consider the difference between interval and
ratio data. For a closer look at variable types, check out Sarah
Boslaugh’s Statistics in a Nutshell, 2nd edition (O’Reilly). Let’s work
our way down Figure 1-3, moving from left to right.
Categorical variables
Sometimes referred to as qualitative variables, these describe a
quality or characteristic of an observation. A typical question
answered by categorical variables is “Which kind?” Categorical
variables are often represented by nonnumeric values, although this
is not always the case.
An example of a categorical variable is country of origin. Like any
variable, it could take on different values (United States, Finland,
and so forth), but we aren’t able to make quantitative comparisons
between them (what is two times Indonesia, anyone?). Any unique
value that a categorical variable takes is known as a level of that
variable. Three levels of a country of origin could be US, Finland, or
Indonesia, for example.
Because categorical variables describe a quality of an observation
rather than a quantity, many quantitative operations on this data
aren’t applicable. For example, we can’t calculate the average
country of origin, but we could calculate the most common, or the
overall frequency count of each level.
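The operations that do apply to a categorical variable can be sketched in a few lines of Python. The list of countries below is a made-up sample, and the variable names are hypothetical; the point is only that counting levels and finding the most common one make sense where an average does not.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (country of origin).
countries = [
    "United States", "Finland", "Indonesia",
    "Finland", "United States", "Finland",
]

# Frequency count of each level.
level_counts = Counter(countries)

# The distinct levels that appear in the sample.
levels = sorted(level_counts)

# The most common level and how often it occurs.
most_common_level, most_common_count = level_counts.most_common(1)[0]

print(levels)
print(dict(level_counts))
print(most_common_level, most_common_count)
```

There is no meaningful way to average these strings, which is why the sketch stops at counts and the most common level.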
We can further distinguish categorical variables based on how many
levels they can take, and whether the rank-ordering of those levels
is meaningful.
Binary variables can only take two levels. Often, these variables are
stated as yes/no responses, although this is not always the case.
Some examples of binary variables:
In the case of wine type, we are implicitly assuming that our data of
interest only consists of red or white wine…but what happens if we
also want to analyze rosé? In that case, we can no longer include all
three levels and analyze the data as binary.
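The wine example can be checked with a short Python sketch. The helper names count_levels and is_binary are hypothetical, and the sample values are invented. Note that the check looks only at the levels observed in the data, which is an assumption; a variable could in principle allow levels that never appear in a given sample.

```python
def count_levels(values):
    """Return the number of distinct levels observed in a categorical variable."""
    return len(set(values))


def is_binary(values):
    """Treat a variable as binary when exactly two distinct levels are observed."""
    return count_levels(values) == 2


# Hypothetical sample containing only red and white wine.
wine_types = ["red", "white", "white", "red"]

# The same sample after a rosé observation is added.
wine_types_with_rose = wine_types + ["rosé"]

print(is_binary(wine_types))            # two levels observed
print(is_binary(wine_types_with_rose))  # three levels observed
```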
Any qualitative variable with more than two levels is a nominal
variable. Some examples include:
Quantitative variables
These variables describe a measurable quantity of an observation.
A typical question answered by quantitative variables is “How
much?” or “How many?” Quantitative variables are nearly always
represented by numbers. We can further distinguish between
quantitative variables based on the number of values they can take.
Observations of a continuous variable can in theory take an infinite
number of values between any two other values. This sounds
complicated, but continuous variables are quite common in the
natural world. Some examples:
Height (within a range of 59 to 75 inches, an observation
could be 59.1, 74.99, or any other value in between)
pH level
Surface area
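The idea that a continuous variable always has room for another value between any two values can be illustrated with a small Python sketch. The function name midpoints is hypothetical, and exact fractions are used instead of floating-point numbers as a deliberate simplification, since floats have finite precision. The 59 and 75 inch bounds come from the height example above.

```python
from fractions import Fraction


def midpoints(low, high, steps):
    """Repeatedly halve the gap above `low`, collecting each new in-between value."""
    lower = Fraction(low)
    upper = Fraction(high)
    values = []
    for _ in range(steps):
        # The midpoint is always strictly between the lower bound and the previous value.
        upper = (lower + upper) / 2
        values.append(upper)
    return values


# Five heights, in inches, each strictly between 59 and the one before it.
heights = midpoints(59, 75, 5)
print(heights)
```

However many steps are requested, each new value still lies strictly between the bounds, which is the sense in which the number of possible values is infinite.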