From Concepts To Code
The breadth of problems that can be solved with data science is astonishing, and
this book provides the required tools and skills to a broad audience. The reader
takes a journey into the forms, uses, and abuses of data and models, and learns
how to critically examine each step. Python coding and data analysis skills are
built from the ground up, with no prior coding experience assumed. The neces-
sary background in computer science, mathematics, and statistics is provided in
an approachable manner.
Each step of the machine learning lifecycle is discussed, from business objective
planning to monitoring a model in production. This end-to-end approach sup-
plies the broad view necessary to sidestep many of the pitfalls that can sink a data
science project. Detailed examples are provided from a wide range of applica-
tions and fields, from fraud detection in banking to breast cancer classification in
healthcare. The reader will learn the techniques to accomplish tasks that include
predicting outcomes, explaining observations, and detecting patterns. Improper
use of data and models can introduce unwanted effects and dangers to society.
A chapter on model risk provides a framework for comprehensively challenging
a model and mitigating weaknesses. When data is collected, stored, and used, it
may misrepresent reality and introduce bias. Strategies for addressing bias are
discussed. From Concepts to Code: Introduction to Data Science leverages content
developed by the author for a full-year data science course suitable for advanced
high school or early undergraduate students. This course is freely available and it
includes weekly lesson plans.
Adam P. Tashman has been working in data science for more than 20 years. He is
an associate professor of data science at the University of Virginia School of Data
Science. He is currently Director of the Capstone program, and he was formerly
Director of the Online Master’s of Data Science program. He was the School of
Data Science Capital One Fellow for the 2023–2024 academic year. Dr. Tashman
won multiple awards from Amazon Web Services, where he advised education
and government technology companies on best practices in machine learning
and artificial intelligence. He lives in Charlottesville, VA with his wonderful wife
Elle and daughter Callie.
From Concepts to Code
Introduction to Data Science
Adam P. Tashman
Designed cover image: © Adam P. Tashman
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003403982
Typeset in CMR10
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Acknowledgments xiii
Preface xv
Symbols xvii
1 Introduction 1
1.1 What Is Data Science? 1
1.2 Relationships Are of Primary Importance 1
1.3 Modeling and Uncertainty 2
1.4 Pipelines 2
1.4.1 The Data Pipeline 2
1.4.2 The Data Science Pipeline 3
1.5 Representation 4
1.6 For Everyone 4
1.7 Target Audience 4
1.8 How this Book Teaches Coding 5
1.9 Course and Code Package 5
1.10 Why Isn’t Data Science Typically Done with Excel? 6
1.11 Goals and Scope 6
1.12 Exercises 7
4 An Overview of Data 33
4.1 Data Types 34
4.2 Statistical Data Types 34
4.3 Datasets and States of Data 35
4.4 Data Sources and Data Veracity 36
4.5 Data Ingestion 37
4.5.1 Data Velocity and Volume 38
4.5.2 Batch versus Streaming 38
4.5.3 Web Scraping and APIs 39
4.6 Data Integration 42
4.7 Levels of Data Processing 42
4.7.1 Trusted Zone 43
4.7.2 Standardizing Data 44
4.7.3 Natural Language Processing 44
4.7.4 Protecting Identity 45
4.7.5 Refined Zone 45
4.8 The Structure of Data at Rest 45
4.8.1 Structured Data 46
4.8.2 Semi-structured Data 46
4.8.3 Unstructured Data 48
4.9 Metadata 48
4.10 Representativeness and Bias 49
4.11 Data Is Never Neutral 50
4.12 Chapter Summary 50
4.13 Exercises 51
6 Data Processing 81
6.1 California Wildfires 81
6.1.1 Running Python with the CLI 82
6.1.2 Setting the Relative Path 83
6.1.3 Variables 84
6.1.4 Strings 84
6.1.5 Importing Data 85
6.1.6 Text Processing 86
6.1.7 Getting Help 90
6.2 Counting Leopards 91
6.2.1 Extracting DataFrame Attributes 92
6.2.2 Subsetting 93
6.2.3 Creating and Appending New Columns 95
6.2.4 Sorting 96
6.2.5 Saving the DataFrame 97
6.3 Patient Blood Pressure 97
6.3.1 Data Validation 97
6.3.2 Imputation 99
6.3.3 Data Type Conversion 100
6.3.4 Extreme Observations 101
6.4 Chapter Summary 101
6.5 Exercises 102
Bibliography 357
Index 361
Acknowledgments
I thank my wife, Elle Tashman, for reviewing the early chapters and providing
support and encouragement. I am very grateful to Philip D Waggoner and J
Gavin Wu for their thoughtful feedback on the manuscript. Efrain Olivares
provided helpful, detailed ideas and feedback on reproducible data science.
The School of Data Science at the University of Virginia provided an ex-
tremely supportive and enriching environment for this work. In particular, I
thank Phil Bourne, Jeffrey Blume, Don Brown, Raf Alvarado, Brian Wright,
Jon Kropko, Pete Alonzi, Siri Russell, and Emma Candelier. Boris Deychman
and Judy Pann provided invaluable insight into model risk, and Greg van In-
wegen generously shared knowledge on modeling and quantitative finance. I
am very grateful to Lara Spieker for encouraging me to write this book and for
shepherding it through the process. I thank the supporting team at Chapman
and Hall/CRC. Finally, I wish to acknowledge the impact that Devin Chan-
dler, Lavel Davis Jr., and D’Sean Perry had on this book. Following their
heartbreaking tragedies, I felt compelled to write this book as a response. I
send love to their families.
Preface
The amount of data generated in the world and the number of decisions to
be made from this data has never been greater. Over 2.5 quintillion bytes of
data are produced by humans every day [1], and new units of measure are on
the way to quantify data of the future. From healthcare and finance to retail
and marketing, every field that isn’t in the technology sector is directly using
technology, or likely should be using it. Many of these fields are spawning new
fields with a suffix: finance has FinTech, insurance has InsurTech, and health-
care has HealthTech. There are even more specialized fields, like WealthTech,
MarTech (marketing), and AdTech (advertising).
This is exciting, but it is also exceedingly difficult to find people who
know the underlying field and the technology really well. In the early days
of data science, there were some rare unicorns with strong quantitative and
programming skills who could do wizardry in their industries. Perhaps the IT
department lent a hand, or they were applied mathematicians or statisticians.
One-off projects proved that understanding the organization’s data could work
wonders. Simple analytics, or perhaps some regression analysis, caught the
attention of management. As the desire to repeat and productize the data
assets grew, however, it became clear this was a full-time job, or perhaps it
required an entire department of people. Data became a strategic asset, and
the people needed to tap into this asset were in high demand.
As of 2024, demand for this kind of work and the people who bring the
magic has never been greater. The job functions of the AI-driven world are
better understood, and they have names like data scientist, data engineer, and
machine learning engineer. These lucky people are some of the most desired
knowledge workers in the world. Yet there is a massive shortage of them.
From the supply side, we still graduate generations of students doing math
without computers. However, there is good news: as early as kindergarten, chil-
dren are counting how many siblings they have, representing those counts with
little tiles of paper, aggregating the tiles, and organizing them into columns to
make a picture. What I’m saying is this: we have children all over the world
taking their first steps in data science. They just need more resources and
guidance.
As a passion project, I created a high school data science course with
Matt Dakolios, who was one of my students from the University of Virginia
(UVA) School of Data Science. As you might guess, it’s a tremendous amount
of work to create a high school course. Unfortunately for me, I can’t seem
to pick passion projects that are easy. The best I can hope for is that the
projects don’t flop. Eagerly following the students piloting the course, Matt
and I found that it sparked an interest that didn’t exist before. Some of them
wanted to pursue data science in college, while others had a piqued interest
in science or math.
Given this positive feedback, I have been working to get adoption of the
course at other schools. The great part about “selling” a product is that the
tough questions make you think and reflect (although it’s best to do this ahead
of time!). During one meeting with the director of a highly regarded, diverse
school, he asked this direct question:
“If you could engage a wide range of students, equip them with a skill set
to make a substantial dent in the world’s major problems, allow them to do
something personally meaningful, present a career path that earns a nice liv-
ing, all while applying and developing their skills in science and mathematics,
would you be interested? Because that’s data science.”
This book aims to bridge the gap between a workplace in need of more data
scientists, and a workforce that stands to greatly benefit in this field. However,
it’s bigger than this: even if you have no interest in ever working in data
science, this book will expand your data literacy, and your understanding of
how things work. I hope you enjoy this journey into the world of data.
Beyond these fields, as data is gathered from people and other living beings
and used to make decisions, important questions arise in ethics, law, privacy,
and security. Business questions arise, such as what data is valuable, which
algorithms to build, which datasets to use, and how data products can fairly
serve others. The practice of data science touches so many areas, which makes
it broad, complex, and exciting.
Data science is intertwined with computer science: computations are done
by computer, data is often stored in databases and similar structures, and
algorithms (recipes) are used to automate tasks.
when the distinction is necessary). The quantity being predicted will be called
the target variable. The variables used for prediction will be called the pre-
dictors. As data science has many groups of participants using specific ter-
minology, elsewhere the target might also be called the response variable or
dependent variable. A predictor might also be called an explanatory variable,
independent variable, factor, or feature.
Much of data science rests on systems and methods which are inherently un-
certain.
When predictive modeling is used, the model will make assumptions. It will
take variables as inputs. The inputs may be subject to uncertainty in mea-
surement and sampling error. For example, drawing again from the population
might produce data with different properties and relationships. Finally, where
we cannot explain a relationship with equations, we might turn to simulation.
The simulation will face uncertainty. This is not to say that we shouldn’t sim-
ulate, model, or do data science. This is to say we must do it carefully, we
must understand the assumptions, and we need to properly test and monitor
the products.
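As a small illustration of sampling uncertainty (a toy sketch, not an example from the text), drawing repeated samples from the same population produces different sample means:

# A toy sketch: repeated samples from one population give different
# sample means, illustrating sampling error.
import numpy as np

rng = np.random.default_rng(314)
population = rng.normal(loc=100, scale=15, size=100_000)

for draw in range(3):
    sample = rng.choice(population, size=50, replace=False)
    print(round(sample.mean(), 2))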
1.4 Pipelines
1.4.1 The Data Pipeline
For data to be useful in a system, it often needs to be ingested, processed,
and stored. Processing steps generally include:
• Data extraction, which parses useful parts of data from volumes of ingested
data
The structure that handles this end-to-end process is called a data pipeline.
To advance the idea beyond just a blueprint for handling the data requires
developing code . . . and usually quite a lot of it. Since it can be such a heavy lift
to build and run a reliable, efficient pipeline, specialized software applications
have been developed for this purpose. For example, Amazon Kinesis can ingest
and process streaming data in real time.
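As a minimal sketch of the idea (the file names and column names here are hypothetical), a simple batch pipeline in Python might ingest a raw file, extract the useful columns, apply light processing, and store the result:

# Minimal batch pipeline sketch: ingest, extract, light processing, store.
# The file names and column names are hypothetical.
import pandas as pd

def run_pipeline(raw_path, out_path):
    raw = pd.read_csv(raw_path)                            # ingest
    extracted = raw[['timestamp', 'user_id', 'amount']]    # extract useful parts
    cleaned = extracted.dropna()                           # remove incomplete rows
    cleaned.to_csv(out_path, index=False)                  # store
    return cleaned

# run_pipeline('raw_events.csv', 'processed_events.csv')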
1.5 Representation
An essential step in preparing data for modeling and analysis is to represent
the data in a useful form. A data representation needs to include enough
detail to solve the problem at hand. It also requires a data structure that is
compatible with the algorithm.
might cover chapters 1–6, 8–13, and 16–17. This would give an excellent
overview of data science while skipping more advanced topics including lo-
gistic regression and clustering.
Beyond these technical skills, the reader will learn to think holistically about
the field, paying attention to critical topics such as:
• Data literacy
As the field of machine learning is broad and rapidly evolving, and this book is
an introduction, it will not cover all of the popular models. It will stop short of
treating neural networks, for example, although they are briefly discussed in
the final chapter. References will be provided for going deeper. For a detailed
review of machine learning models, for example, see [4].
1.12 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
Once you move beyond doing data science for yourself as a hobby, you’ll need
to communicate your work with others. Data scientists have the potential
to be transformative and highly visible, and it is not uncommon for them
to present work to leadership, board members, venture capitalists, and cus-
tomers. Effective communication is so essential that if you don’t master it,
no paying customer or employer will enlist your technical services for long.
That is why this chapter appears before the technical topics. A side effect of
learning strong communication skills is better relationship building, which is
associated with greater happiness and fulfillment [5].
One of my interviews at a healthcare IT firm was with a seasoned clinician
who I’ll call Janice. She was apprehensive about my ability to come in and lead
a team of data scientists to transform their electronic health record system. To
my surprise, she had worked with many scientists before who had developed
neural networks to predict health outcomes. Digging deeper, it turned out
that Janice had been burned by data science before, and the pain radiated
through the organization. This is very common, unfortunately, as data science
outcomes can be highly uncertain, and the field has been overhyped. To win
over Janice, you’ll need to earn and maintain trust. You will need to show that
you are competent and responsible, and that you respect and value people.
great leaders can influence others. As a bonus, when an interviewer says “Tell
me about yourself,” your “why” will do the important work of telling your
unique, important story.
In addition to having a “why,” you will need a strong belief in yourself. You
may worry about the computer science requirements of the field, public speak-
ing, or some other aspect of data science. From the start, keep an open mind,
stay positive, keep a list of topics for exploration, and gradually chip away
at the challenges. Even for accomplished professionals, feelings of self-doubt
and not deserving a “seat at the table” are very common. The phenomenon
is so prevalent that it has a name: imposter syndrome. A meta-analysis in the
Journal of General Internal Medicine found imposter syndrome prevalence
rates ranging from 9% to 82% across 62 studies and 14,161 participants [7].
Having some quick wins in your strength areas can help build confidence.
For the areas that need improvement, form a plan for moving forward. You
may be able to get coaching from a manager or more senior colleague. It is per-
fectly fine to be honest about things you don’t know and ask for help. Finally,
a mentor can help greatly. People in the highest ranks of large organizations
use mentors to share ideas, identify weaknesses, and identify opportunities for
growth.
Let’s imagine now that you’ve been working in data science for a few years
and you feel confident doing tasks like coding, preparing data, and building
machine learning models. While these functions may not change, the under-
lying tools and techniques will constantly change. I mention this because I
want you to be mentally prepared for this certainty. Data science is a field
that constantly evolves, as it is driven forward by technology and changing
customer demands. If you can accept the challenge of learning new things, and
the idea that you’re never done, this will keep you relevant and engaged. For
support and resources, I encourage you to take part in activities that you en-
joy, such as joining a data science community, attending conferences or talks,
or reading online blogs.
Finally, think deeply about what you want to accomplish in your lifetime,
put in the work, and have courage. Here is one of my favorite quotes from
Jonas Salk, the developer of one of the first successful polio vaccines:
“There is hope in dreams, imagination, and in the courage of those who wish
to make those dreams a reality.”
critically. For example, he might realize that a column of numbers should sum
to one. Great managers and leaders can do this, so it’s best to be careful and
thoughtful. As you do the simple things properly, you will be trusted to do
harder things.
A great data scientist needs to be a thoughtful “trail guide.” Like any
technical field, data science uses abstraction, assumptions, complex ideas, and
jargon. Data and models have limitations and weaknesses that may render
your careful analysis useless. To earn trust, make everything crystal clear to
your audience. State assumptions and limitations up front. Explain why the
approach is useful in spite of the assumptions and limitations. Add a key
for acronyms, define the variables, and briefly explain the metrics. Are you
summarizing customer satisfaction with a score on a scale of 1–10? Explain
how this is measured. If possible, test the assumptions and measure their
impact, and try things in different ways.
When we review the work of others, we tend to look for logical fallacies.
Methodology issues early in a presentation can quickly eradicate the interest
of the audience. “Wait . . . this dataset is missing a key group of individuals.
In that case, nothing else matters in this analysis.” When sharing our work,
it is ideal to avoid making such errors, and to anticipate tough questions by
addressing them in the work. An effective strategy that Amazon uses when
developing a new product idea is to write a PR/FAQ, which is a combined
press release and question/answer section. The question/answer section thinks
around corners and gets ahead of reader questions and objections.
but we won’t get into these details with this group. First, this is likely not
the right team to answer the how, and second, it is generally more effective to
brainstorm and think through the how later.
After identifying the right project, I meet with a technical team that has
knowledge of the data and process. This might include an engineering manager
and a data science manager, as well as lead data scientists and data engineers.
At this level, we can talk about the how in great detail, such as nuances in
the data, the strengths and weaknesses of various models, and performance
metrics. Not every technical team has data scientists, and I will need to explain
things appropriately and educate the group as needed.
After the POC is completed, I report back to leadership, and the technical
leaders will join this discussion. At this point, I have an understanding of
the business problem we are trying to solve, the approach taken in the POC,
and the results of our efforts. It is possible to talk in great detail about the
approach and performance metrics, but this likely won’t be appropriate with
this group. If I feel the details may come up in discussion, then I will put
slides in an appendix.
When meeting with leadership, it is extremely useful to put an executive
summary up front. These meetings can sometimes spur lively discussion at
the beginning, and the presenter may not get far into slides. The summary
will make all of the essential points. When presenting results, it is best to keep
it brief, and provide clear definitions. Help the audience understand why the
results are valuable, in business terms when possible. If a case will be made
for replacing their current model with a newer model, it is best to quantify
the impact in dollar terms when possible. Bear in mind that leaders may have
direct experience using “new, improved” models that didn’t deliver what they
promised.
The guiding principle for meetings and presentations is to align on the
agenda. When possible, come to agreement on topics of interest in advance.
It also helps to learn about the audience members before the event: their
backgrounds, roles, and responsibilities. This will make for a more engaging,
valuable discussion.
Product teams make stories a central part of their process. A common template looks like this:

As a [persona], I want [functionality] so that [benefit].

This clarifies the persona involved and how the work will be helpful. A specific
example might be:
A seventh grade math teacher wants a way to easily search for relevant con-
tent so that she can more quickly plan her lessons.
Data science can often provide the functionality that a product requires. For
the content discovery story, a recommender system might be added to help the
teacher. Since the audience may be broad, we will mention the type of model,
but will refrain from getting into details. Let’s add to the story, providing
layers about the data and the model.
We can leverage our database of information about the teachers and the
content. The content includes things like articles, lessons, and videos. The
content is tagged with things like the appropriate grade level, the learning
objectives, and the subject area. We will know which grade levels and sub-
jects are taught by the teachers. We also have all of the interaction data:
each time a teacher selects a piece of content, we have the teacher identifier,
content identifier, and the timestamp. We know who selected what content
when.
At this point, I should mention that this would be the ideal data for this kind
of system. As you might imagine, many schools may not yet have this kind of
information. Let’s continue with the model layer.
We trained a recommender system on all of the data from the past school
year. The model was rigorously tested and we measured latency (the time
required to get back recommendations), and the quality of the recommen-
dations. The model is stored securely in production. Teachers can click a
button in a web browser to see a list of useful content.
This layer touches on important considerations like data security, speed, and
relevance of output. The audience will probably be curious if the results look
sensible, and the story should include examples like this:
Jamie teaches seventh grade math. Her students learned about representing
numbers in decimal form. She is planning her next lesson, which will require
students to learn about representing numbers as percentages. She logs into
the learning management system with her credentials and clicks the button
Get Next Lesson.
1. Pterodactyl Percentages
2. 100%! Converting Decimals to Percentages
3. Blue Horseshoe Loves Percentages
Note that this was a small example, and I chose a recommender system to illustrate it.
The story can be crafted to meet the needs of the audience, with additional
layers as appropriate. Let’s put all of the layers together to see a compelling
story:
User Story
A seventh grade math teacher wants a way to easily search for relevant
content so that she can more quickly plan her lessons.
Data
We can leverage our database of information about the teachers and the
content. The content includes things like articles, lessons, and videos. The
content is tagged with things like the appropriate grade level, the learning
objectives, and the subject area. We will know which grade levels and
subjects are taught by the teachers. We also have all of the interaction
data: each time a teacher selects a piece of content, we have the teacher
identifier, content identifier, and the timestamp. We know who selected
what content when.
Model
We trained a recommender system on all of the data from the past school
year. The model was rigorously tested and we measured latency (the time
required to get back recommendations), and the quality of the recommen-
dations. The model is stored securely in production. Teachers can click a
button in a web browser to see a list of useful content.
1. Pterodactyl Percentages
2. 100%! Converting Decimals to Percentages
3. Blue Horseshoe Loves Percentages
Now we have an end-to-end story that can be more readily consumed than
a table of numbers or a graphic. To be sure, relevant numbers and graphics
should be included to support the narrative, but they cannot replace the
narrative. The story clarifies the persona (the who), the functionality (the
what), the benefit (the why), and the plan (the how). It also explains the
purpose of the model, the problem that it solves, and how it goes about
solving the problem. The story is self-contained, and for those who want to
know more, additional information can be provided.
as this can be gratifying and it shows independence. However, if you feel that
you’re struggling or you have hit a wall, then do not be ashamed to ask for
help. This will help people to trust you, actually, because they will get the
message that if you are stuck, you will work to get unstuck. This is preferable
to people counting on you to get work done, waiting for weeks as you struggle
in silence, and ultimately having nothing delivered.
In some of my early roles, I faced steep learning curves and my managers
were very busy people. They usually couldn’t provide help at a moment’s
notice, so I asked them to suggest how I could get help. Answers varied, of
course, but one thing that helped was collecting all of the blockers, or places
where I was stuck, and thinking through them. I found that for some, I could
break them down into subproblems and then tackle the overall problem. For
other problems, there was information that I needed to move forward. This
habit shortened my list of blockers and it made me more independent. As
people freed up to help (typically after the financial markets closed), we had
fewer things to work through. The second big thing that helped was developing
an internal network of people in similar roles. There was a good chance that
someone near my level could find a little time to offer help.
The current environment is great for crowdsourcing your questions. In-
stant message boards like Slack or Teams are conducive to posing questions
to groups of people and getting answers in a timely fashion. If you have a
coding question, there is a great chance that web resources like Stack Over-
flow will have an answer. In the other direction, I encourage you to share your
knowledge and help others when you can.
Beyond getting help for basic setup and staying productive on projects,
you should also think about what you need to grow and to love your work.
This might include challenging projects, headcount, a promotion, or a raise.
Whatever the request may be, just be sure that you really want it, that you
are ready for it, and that you can document why you deserve it. Some of
these things come with more responsibility, or different responsibilities. For
example, changing from individual contributor (IC) to manager will require
a different set of skills. ICs need to produce work, while managers need to
motivate and coach others to produce the work. Not every IC will enjoy being
a manager! Some data scientists remain ICs for the duration of their careers.
many other things. There are many excellent books on leadership strategies
and ownership, and I highly recommend [10] for going deeper.
2.10 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
8. S You’ve just started a new data science role. What is the best
way to get unstuck at work? Select the best answer.
a) Wait until someone comes to offer help.
b) Keep trying until you eventually find the solution. It’s best not
to waste the time of your colleagues.
c) See if you can quickly solve the problem yourself, and then ask
for help if needed.
d) Ask your manager or nearest colleague each time you have a
question. There is no shame in asking for help when it’s needed.
9. S Which of these selections are examples of blockers? Select all
that apply.
a) The necessary dataset is not yet available for the project
b) A different engineering team needs to build a feature before the
next team can begin their part of the project
c) The development team meets each morning to discuss the project
status and plan for the day
d) A manager needs to sign off on the project and he is not respond-
ing to requests
This chapter discusses the earliest steps in the data science pipeline, which
consist of planning activities. The first step is gaining clarity on the problem
that needs to be solved, the requirements, and the constraints. The problem
is often stated as a business objective, such as increasing the profitability of
a product.
Next, the data science team needs to state the problem in an analytical
framework. For example, a specific machine learning technique might work
well, combined with a loss function and metrics. The executives won’t need
this level of detail, but the team doing the work will use this blueprint to move
forward.
Finally, a plan for data collection needs to be in place. The data might
already be available, or it might need to be collected or purchased. If there
is no way to get the required data, then the project won’t be able to move
forward, so it is important this is discovered early.
what they cared about! The customer was adamant that when the model de-
tected the object, it needed to be right. They were not concerned about cases
where the model did not detect the object. This meant that the metric was
actually precision and not accuracy (we will learn more about metrics later).
Ultimately, my team was not speaking the same language as the customer,
and thus we were not aligned. After six months of work and additional changes
to requirements, the project was halted.
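To see why the distinction matters, here is a small worked example with made-up counts (not the numbers from the project described): a model can score high on accuracy while its precision, the fraction of detections that are correct, is much lower.

# Hypothetical counts for 1,000 cases; the object appears in 50 of them.
tp = 20    # object detected and actually present
fp = 30    # object detected but not present
fn = 30    # object present but missed
tn = 920   # correctly not detected

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
print(accuracy, precision)

OUT:
0.94 0.4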
I shared this story because it was a painful experience and a common one.
It also could have been prevented. From that experience and other lessons
learned, I developed a planning questionnaire for data science projects. I asked
that the questionnaire be filled out as a team before the project would start.
This required some skin in the game from the person requesting the work,
which seemed fair as the projects would often consume a lot of resources from
my team. It was well worth the effort, it turned out. The upfront planning and
development of a common language helped improve the odds of data science
projects succeeding. Next, let’s look at some of the things in the questionnaire.
Q1: What is the problem that needs to be solved, and why is it valuable?
This should make the problem very clear, and it should help the requestor
understand if it’s really something that should be done. Example objectives
might be to increase revenue for the flagship product, or to lower the cost of
customer acquisitions.
Q3: Who are the right stakeholders and what will be their roles?
For a data science project to succeed, it needs support across many teams.
There needs to be clarity on the people, their roles, and how much they can
support the project. For example, if Jeff is the sole subject matter expert and
he’s tied up for the next six months, then this project likely can’t start right
away.
Q4: How good does predictive performance need to be, and what are the
important metrics?
If the work is for a customer, they might already have a model and a perfor-
mance benchmark in mind. They might say that their current model has an
area under the ROC curve of 80%, and the new model needs to at least exceed
this threshold. Conversely, the customer might not have metrics in mind, and
you might need to educate them. It can be difficult to get firm numbers, but
I encourage you to push for them and discuss different metrics to get clarity.
Q5: Is there a system in place for running and monitoring models in produc-
tion?
A member of the engineering team should be able to help answer this question.
Assuming the model meets the performance objectives, the next step will be
to deploy it into production. For this to happen, there needs to be a reliable
system available to support data ingestion and processing, model scoring, and
passing results to users. Model scoring, or inference, is the process of running
new data through the model to get predictions. Passing results to users is
typically done with an application programming interface, or API.
Details to consider include how quickly results need to be returned (re-
sponse time), how available the system needs to be (uptime), and where the
data and results will be stored. A system required for real-time scoring will
need high uptime and low latency, while a system returning daily batches of
results won’t have these requirements.
Tip: Response time is the elapsed time from when the request is made to
when the client receives the first byte of the response.
Uptime is the percentage of time that the system is available and working
properly. It is usually expressed in 9’s, as in “six nines” for 99.9999%. This
translates to less than 32 seconds of downtime per year.
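These downtime figures follow from a quick calculation; here is one way to compute the allowed downtime per year for a given number of nines:

# Allowed downtime per year for a given number of nines of uptime.
seconds_per_year = 365 * 24 * 60 * 60      # 31,536,000 seconds

for nines in [3, 4, 5, 6]:
    downtime_fraction = 10 ** (-nines)     # e.g., six nines -> 0.000001
    print(nines, round(seconds_per_year * downtime_fraction, 1))

OUT:
3 31536.0
4 3153.6
5 315.4
6 31.5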
If a system isn’t readily available, then the discussion can include how such a
system might be provided. There are several cloud providers offering low-cost,
readily available services including AWS, Google Cloud, and Databricks.
If there is sensitive information in the data, then additional steps and extra
care will likely be required. For example, the data might need to be stored
and prepared on a specific machine for conducting the work. Parts of the data
might need to be masked before colleagues can view it. The owner of the data
should be able to help answer this question.
For some use cases, the model can be treated as a black box. Perhaps as long
as the prediction errors are small, nobody is concerned with how the model
works or why it arrived at the output. This is increasingly rare, however,
because at some point, all models degrade and then stakeholders will want to
see justification for the output.
More commonly, predictions will need to be explained to people. For a
model that recommends a treatment protocol to patients, a doctor will want
to know why 20 mg of medication A is the best course of action. An insurance
regulator will want to see the predictors in the workers’ compensation claim
prediction model.
It is important to know in advance if the model needs to be interpretable,
because some model types are more interpretable than others. When we
learn about modeling, we will cover regression models, which are highly inter-
pretable.
Some industries are highly regulated, such as finance and insurance. To protect
consumers, there are laws specifying how models should be developed and
tested, and how data should be collected and treated. The stakeholders should
have awareness of relevant laws to remain compliant.
Models that fail to remain compliant can lead to financial loss, reputational
damage, and job loss. In the financial industry, for example, lending models
are regularly reviewed to ensure they don’t systematically discriminate against
protected groups of people. We will study this topic later in the book.
product purchased by each customer over the past five years. Given this in-
formation, it would be possible to find similar users based on their purchases.
Within the groups of similar users, some users will have purchased certain
products, and others might be interested in those products. This suggests
relevant product recommendations. The approach we are outlining is a rec-
ommendation algorithm called collaborative filtering, and it falls in the domain
of machine learning. Given this knowledge, the project becomes very action-
able because it can be more easily discussed and researched (among people
with this specialist knowledge), and there is code available to implement the
algorithm. Non-technical audiences will not need to know the details of the
algorithm, but it will help to educate stakeholders about how it works, the
assumptions, and the limitations.
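As a minimal sketch of the idea (toy data, not an implementation from the book), user similarity can be computed from purchase histories, and products bought by the most similar user can be suggested:

# Toy purchase matrix: rows are users, columns are products (1 = purchased).
import numpy as np

purchases = np.array([
    [1, 1, 0, 0],   # user A
    [1, 1, 1, 0],   # user B
    [0, 0, 1, 1],   # user C
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare user A with users B and C, pick the most similar user, and
# suggest products that user bought but A has not.
sims = [cosine(purchases[0], purchases[i]) for i in (1, 2)]
most_similar = (1, 2)[int(np.argmax(sims))]
suggestions = np.where((purchases[most_similar] == 1) & (purchases[0] == 0))[0]
print(most_similar, suggestions)

OUT:
1 [2]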
This short example illustrated that the data science team worked to place
the business problem into the framework of a known problem with a readily
available solution. This is the preferable outcome, as it saves time when the
problem has been solved and there is available code. Of course, this won’t
always be the case. It may be that the problem to solve has not been solved
before. There might not be an existing algorithm, and there might not be
available code. In either case, the data science team should sync with stake-
holders to discuss which tools are available, and to estimate the level of effort
for the project. Leadership might be okay with a large research effort to solve a
new problem if the return on investment is expected to be high. Alternatively,
there might be an easier, slightly different problem to solve. The data science
team should think of different approaches, and try to estimate the effort and
feasibility for each approach.
After a feasible analytical framework is determined, the data science team
can plan data collection and use. We review this step next.
For the earlier case of recommending products to users, we need three kinds of data (a small sketch of how these fit together follows the list):
• User data: At a minimum, user identifiers (userids) are required to tell users
apart. Metadata such as age and interests may also be helpful.
• Product data: At a minimum, product identifiers (productids) are required.
Metadata such as category and synopsis may also be helpful.
• User-product interaction data: This captures each user purchase. We need
the userids, productids, and timestamp (date and time) of each purchase.
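Here is a small sketch of what hypothetical interaction records might look like, and how they can be reshaped into a user-product matrix like the one sketched earlier:

# Hypothetical interaction records: who purchased what, and when.
import pandas as pd

interactions = pd.DataFrame({
    'userid':    ['u1', 'u1', 'u2', 'u3'],
    'productid': ['p1', 'p2', 'p2', 'p3'],
    'timestamp': pd.to_datetime(['2024-01-05', '2024-01-07',
                                 '2024-01-07', '2024-02-01']),
})

# Reshape into a user-product matrix: 1 where the user purchased the product.
matrix = (interactions
          .assign(purchased=1)
          .pivot_table(index='userid', columns='productid',
                       values='purchased', fill_value=0))
print(matrix)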
• There are no available predictors that can explain the target variable
If there is no relationship between any predictor and the target, then increasing the amount of data will not solve the problem.
If the labels cannot be trusted, this will lead to inaccurate results. Collecting
more bad data will not improve the model.
This presents the same issue as bad labels, as bad data leads to bad models.
Next, let’s take for granted that the data is accurate and there are variables
that can predict the target. Data coverage of the important dimensions is crit-
ical. For example, problems might involve time, location, or subpopulations.
Some problems involving time are subject to cyclicality and seasonal-
ity. Financial markets alternate between bull markets and bear markets. E-
commerce platforms are busier around holiday seasons. A stock price predictor
that is only trained on bull markets will suffer underperformance when the
market turns. The data used for these problems should cover several full cycles
so that models can learn these patterns.
Real estate pricing is partly explained by price per square foot, but this
value will vary over time and by location, such as proximity to the beach
(glorious!) or highway (noisy!). Prices may also be driven by season and month
of the year. The data should include several full cycles across the locations of
interest.
For any pre-existing data that will be used, its provenance should be un-
derstood. Users will want to know how the data was collected and what it rep-
resents. For example, one of my teams had built a model to predict bankruptcy
using a large database of variables. One of the variables had a perfect correla-
tion with bankruptcy. This was very suspicious, and we asked the owner of the
data to walk us through the variables. It turned out that the amazing predic-
tor was lagged bankruptcy. Since lagged bankruptcy is something that would
not be known in advance, it couldn’t be used in the model. Understanding
the backstory of the data prevented my team from making a huge mistake.
In the event that the required data is not available, options include collect-
ing the data, purchasing it, or finding some open-source data. For collection, it
might be possible to work with teams such as engineering and design to build a
collection mechanism. This approach might be possible if relevant data can be
collected quickly. If this won’t work, then the company might turn to external
data.
For certain use cases, there may be free, useful data in the public domain.
Another option is to pay for data. There are many third-party data providers
such as AtData and People Data Labs that collect, clean, enrich, and sell
various datasets. The data transfer is done through an API which allows for
programmatic delivery and processing. If there is funding available for data
and a provider has something relevant, this may work.
For any dataset, it is important to understand if the data is right for the
project. The data should be fit for purpose, and it should be checked for
quality, consistency, and predictive ability.
3.8 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
Our working definition of data will come from the Collins Dictionary:
Nearly all data is now digital, with books holding less than 10% of the world’s
data. Libraries are working feverishly to digitize and extract information from
their physical documents. Digitization has led to improved search and re-
trieval, and in the age of AI, it enables so much more including classification,
translation, and personalization. Data may be produced by people or agents
such as robots, sensors, devices, and servers. It may look like:
Now that we have this broad view of data, we will examine it from many
different perspectives. This will include understanding the attributes of differ-
ent kinds of data, the different forms that data can take, how data is treated
within systems and applications, and how data can represent (or misrepresent)
things.
To summarize, here are the Python primitive data types with examples:
Type Examples
float 3.14, 2.71, –1.111
integer –2, 0, 500
bool True, False
string ‘success’, ‘bag of words’
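These types can be checked directly with Python's built-in type() function (note that Python's own names for integer and string are int and str):

type(3.14), type(-2), type(True), type('success')

OUT:
(<class 'float'>, <class 'int'>, <class 'bool'>, <class 'str'>)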
Quantitative data can be divided into discrete data and continuous data.
Discrete data is data that takes a finite or countable number of values, while
continuous data takes an uncountable number of values. An example of dis-
crete data is the number of attendees at different nature preserves at a given
time. An example of continuous data is the speed of a fastball pitch, assuming
this is measured with perfect precision.
Consider a dataset with several variables, such as eye color, height, weight,
and zip code. The categorical variables (eye color and zip code) generally
require the same processing steps, while the continuous variables (height and
weight) generally require the same steps. For example, one-hot encoding is
commonly used to represent categorical data in machine learning models. It
is useful to programmatically group the variables by statistical data type.
This will make the processing more efficient, and it will keep the workflow
organized.
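A minimal pandas sketch of both steps (the column names are hypothetical): select the columns by data type, then one-hot encode the categorical ones.

import pandas as pd

# Hypothetical dataset with categorical and continuous variables.
df = pd.DataFrame({
    'eye_color': ['brown', 'blue', 'green'],
    'zip_code':  ['22902', '22903', '22902'],   # stored as text, not a number
    'height':    [64.0, 70.5, 68.2],
    'weight':    [120.0, 180.5, 150.0],
})

categorical_cols = df.select_dtypes(include='object').columns
continuous_cols = df.select_dtypes(include='number').columns

# One-hot encode the categorical columns only.
encoded = pd.get_dummies(df, columns=list(categorical_cols))
print(list(categorical_cols), list(continuous_cols))
print(encoded.columns.tolist())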
To summarize, here are the statistical data types with examples for each type:
At any given time, we can think of a dataset belonging to one of three states
[12]:
• At Rest: the data is in storage, such as a file, database, or data lake
• In Use: the data is being updated, processed, deleted, accessed, or read by
a system
• In Motion: the data is moving locations, such as within a computer system
or over a wireless connection
A system or application may comprise many dozens or more datasets. At any
given time, there may be datasets in each of these three states: new data is
ingested into the system, mature data is stored in a database, and a user is
updating a portion of data. This may be a very dynamic process. We will
learn more about each of these states later when discussing data storage and
processing.
An important consideration in each of these states is the risk of malicious
attack. Data at rest is at lowest risk of attack. Of course, data may be cor-
rupted or destroyed on any storage device, so precautions should always be
taken, such as backing up data.
Any time data is moved, hackers may intercept it, and this is particularly
true when it is moved over public networks like the internet. Encrypting the
data before it is transmitted (while at rest), encrypting the connection, or
both will reduce this risk. Security is beyond the scope of this book, but
the interested reader may consult [13]. When data is in use, it is directly
accessible and at highest risk. The risk may be reduced by using encryption,
user authentication, and permissioning.
Given the massive number of data sources on the internet, it can be chal-
lenging to find trusted data. Highly cited and used works will be preferable. To
name a few sites for one-time download of datasets, the UCI Machine Learn-
ing Repository maintains over 500 freely available datasets, while Kaggle holds
over 50,000 public datasets.
Here are some questions to ask when validating data sources, from [14]:
0 2 * * *
where
• The first value (0) indicates the minute
• The second value (2) indicates the hour
• The third value (*) indicates that the job runs every day of the month
• The fourth value (*) indicates that the job runs every month of the year
• The fifth value (*) indicates that the job runs every day of the week
• How often should the results be updated in storage? (Answer: This again
needs to be decided.)
There are many cloud-based services that streamline the process of establish-
ing, securing, and automating data ingestion, including Apache Kafka, Apache
Flume, and Amazon Kinesis. The important features include speed, security,
connectors to data sources, automation, scalability, and ease of use. Next, we
review approaches for ingesting data by leveraging code.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/Representational_state_transfer')
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('title')
title

OUT:
<title>Representational state transfer - Wikipedia</title>
The requests and Beautiful Soup modules are imported for the necessary
functionality. A GET request is sent to fetch the data from a Wikipedia page.
Next, a Beautiful Soup object is created and it takes the page content as
input. Lastly, the title of the page is extracted with the find() function. The
output is shown below the OUT: line. The tokens <title> and </title>
are HTML tags which indicate that the enclosed text is a title. HTML is
a standard markup language for web pages, and elements of the page use
various tags. These tags are used by Beautiful Soup and other scrapers to
extract information of interest. We will learn more about modules, variables,
and functions later in the book.
Next, we will have a brief overview of APIs. An API is a protocol for in-
teracting with another system using an agreed-upon set of instructions. To
help developers understand how to use a particular API, an API Reference
document is commonly provided. This will include details on the supported
methods, the expected parameters for making a request (e.g., passing an iden-
tifier for a desired model), the response delivery format and contents, and the
meaning of various status codes, among other elements. The status code will
alert the user to the outcome of the request. For example, a successful request
may return status code 200, while an unauthorized request may return sta-
tus code 401. A detailed understanding of the API will allow for efficient and
proper data collection.
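As a small illustration (the endpoint and parameter here are placeholders, not a specific API), a request can be made with the requests module and the status code checked before the response is used:

import requests

# Placeholder endpoint and parameter; a real API's reference document gives
# the exact URL, parameters, and authentication requirements.
response = requests.get('https://api.example.com/v1/items', params={'id': 42})

if response.status_code == 200:
    data = response.json()      # successful request
elif response.status_code == 401:
    print('Unauthorized: credentials are missing or invalid')
else:
    print('Request failed with status code', response.status_code)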
For security purposes, many APIs require users to provide credentials be-
fore requests, or calls, can be made. At the time of writing, a widely used
standard for API authentication is OAuth 2.0. This provides a secure and
convenient method for users to access resources without sharing passwords.
Most web APIs conform to the REST architectural style. Such APIs are
called REST APIs, where REST stands for representational state transfer.
REST follows guidelines to allow for scalable, secure, simple APIs:
• Client-server separation: the client makes a request, and the server responds
to the request.
• Uniform interface: all requests and responses use the HTTP communication
protocol. Server responses follow a consistent format.
# Wikipedia API
import wikipedia

# Assumed call: wikipedia.summary() on the state's page title; the exact
# arguments used in the original are not shown.
summary = wikipedia.summary('New York (state)')
summary

OUT:
'New York, often called New York State, is a state
in the Northeastern United States. With 20.2 million residents,
it is the fourth-most populous state in the United States...'
categorized by its level of processing. Raw data is data that has not undergone
any processing. The potential sources and types of raw data are very diverse
and may include video camera footage, ECG waves, student survey results,
and scanned documents. The benefit of capturing and storing raw data is that
all of the data will be available, and the user can revert to the original state
of the data as needed.
Consider the example of a photographer taking a picture of Times Square
in the evening under difficult lighting conditions. The image is stored and
later modified by changing the contrast and brightness, cropping the frame,
and changing it to monochrome. The modified image is saved as a new file.
The photographer later decides that she prefers the picture in color. Since the
original photo is stored, she can revert back to it.
The limitation of raw data is that it likely won’t be immediately useful for
tasks like analysis, predictive modeling, or reporting. One reason is that raw
data is commonly noisy. In the case of the ECG waves, the raw waves cannot
accurately detect a heart arrhythmia due to artifacts such as patient move-
ment and power line interference. Before the data can be trusted to answer
questions, it needs processing. The specific processing steps will depend on
how the data was collected, what defects might be present, how the data will
be used, and who can view the data. In a proper system, the data processing
will often be triggered automatically and implemented by computer code such
as Python or Bash.
What Is Bash?
Bash is a Unix shell and command-line interface (CLI). It is a powerful,
efficient tool for working with files. We will learn more about the shell and
the CLI in Chapter 5: Computer Preliminaries.
An analyst using the trusted data files is less likely to encounter bad char-
acters or other problematic data. However, errors are still possible, and one
reason is inconsistency. For example, the definition of an outlier might change
for a variable, and this might not be reflected properly in the data. In such
a case, the analyst might need to revisit the raw data to rebuild the trusted
data.
The trusted zone is typically not the last stop for the data. There may
be processing steps that need to run on the trusted data, such as data stan-
dardization, natural language processing, and detecting and masking sensitive
information. The steps should support the goals of the organization. A com-
plication is that different users might require different versions of the data,
as in the case of differing privacy requirements. Taking this into account, an
organization needs to be strategic when deciding how to process, secure, store,
and manage the data. If a certain step is required for all users, then it should
take place once as part of the main process; it should not be done by multiple
teams, as this can introduce inconsistencies and redundant work.
• COVID-19
• COVID19
• COVID-19 (Lab X)
• Covid-19
In the Lab X case, a lab provider has included its name in the test name. These
data variations happen in practice, and they make it difficult for systems and
users to discern identical objects from different objects. This entity resolu-
tion problem is challenging and ubiquitous, and techniques such as machine
learning can help.
This is just one example of standardizing data. It may be the case that
every piece of data needs a standardization step. This is a lot of work, which
is why automating these steps is so critical.
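As a minimal sketch of one such automated step (the rules here are illustrative only, not a general solution), the variants above can be mapped to a single canonical name:

import re

def standardize_test_name(name):
    # Illustrative rules only: drop a trailing parenthetical such as '(Lab X)',
    # unify case and spacing, and restore the hyphen before the digits.
    name = re.sub(r'\s*\(.*\)\s*$', '', name)
    name = name.upper().replace(' ', '')
    name = re.sub(r'([A-Z])(\d)', r'\1-\2', name)   # COVID19 -> COVID-19
    return name

variants = ['COVID-19', 'COVID19', 'COVID-19 (Lab X)', 'Covid-19']
for v in variants:
    print(v, '->', standardize_test_name(v))

OUT:
COVID-19 -> COVID-19
COVID19 -> COVID-19
COVID-19 (Lab X) -> COVID-19
Covid-19 -> COVID-19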
splitting text into sentences and words, extracting keywords, classifying the
document (e.g., is it a driver’s license? Is it about sports?), detecting hate
speech, and finding entities such as people, organizations, and locations. This
is a very deep field with a lot of activity. See [17] for an excellent, detailed
treatment.
The table below summarizes the data zones by their degree of processing and
reliability:
Consider, as an example of semi-structured data, a business listing that includes
an address and customer reviews. The address might appear as plain text:

200 E Main St
Charlottesville, VA 22902

The same information can be organized hierarchically as key-value pairs:
{'address' : {
    'street number': 200,
    'street name': 'E Main',
    'street type': 'St',
    'city': 'Charlottesville',
    'state': 'VA',
    'country': 'United States',
    'zipcode': 22902
    },
 'reviews' : {
    'num_5_star' : 10,
    'num_4_star' : 5,
    'num_3_star' : 3,
    'num_2_star' : 1,
    'num_1_star' : 1
    }
}
Notice there is a hierarchy, where address and reviews are separate objects at
the top level. Inside the address, there are components at the same level, such
as street number and state. Here, ‘street number’ is an example of a key, and
it is associated with its value of 200. This is called a key-value pair . In this
format, it would be straightforward for computer programs to retrieve data
elements.
Here is another example, which ranks the teams in two NFL divisions:

{
 'division': {
    'AFC West':
        {'rank': ['Chiefs', 'Chargers', 'Raiders', 'Broncos']},
    'AFC East':
        {'rank': ['Bills', 'Dolphins', 'Patriots', 'Jets']}
    }
}
For example, the top-ranked team in the AFC West is the Chiefs. Forming
these structures can be difficult by hand, but computers can facilitate the
work. If we wanted one structure to hold both the Yelp data and the NFL
data, we could add a unique key for each web page and merge these two
structures together.
This semi-structured format, called JavaScript Object Notation and short-
ened to JSON, is extremely popular due to its ease of use and flexibility. In
fact, it is the de facto standard for data exchange through APIs. In other
words, when applications send data to another system, or retrieve data from
a system, they commonly use this format. For each supported method, a JSON
object is used to make the request, and another JSON object is returned in the
response.
A convenient online tool for exploring and validating JSON documents is https://jsoneditoronline.org/.
As we will see later, Python has a data type called a dictionary which supports
key-value pairs in a fashion similar to JSON.
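For instance, a small dictionary mirroring part of the address structure above can be built and queried directly in Python (a minimal sketch; the keys are adapted slightly from the example above):

business = {
    'address': {'street_number': 200, 'city': 'Charlottesville', 'state': 'VA'},
    'reviews': {'num_5_star': 10, 'num_1_star': 1}
}

# retrieve a nested value by following the keys from the top level down
print(business['address']['city'])

OUT:
Charlottesville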
4.9 Metadata
Metadata is data about data. Metadata about a book would include the ti-
tle, publisher, author, ISBN, and table of contents. Movie metadata such as
runtime, producer, genre, and actors tell us about the movie, and perhaps
whether it would be of interest. Particularly, as we are awash in more data
than we can possibly review, metadata is useful for finding relevant items and
making sense of large collections.
Bias can have harmful consequences. Consider, for example, a survey whose
sample happens to exclude a very small minority group, such as Native Americans.
The result is sampling error, since Native Americans are not represented.
The challenge with bias is that it can easily happen unintentionally, and it
can result from a technical shortcoming. To help avoid bias, ask:
• Where did this data come from, and how was it created?
• How might this data be biased?
• What are the limitations of this data?
• What can we do with this data, and what can we not do?
• How can we improve this data?
• Is this analysis interpreting the data correctly?
Finally, all people working with data should ask these questions – not just
data scientists.
Challenge the data, the analysis, and the results. The application of criti-
cal thinking to data and the products of data will always be valuable and
important.
Data ingestion is the process of collecting data from one or more sources.
The data may be used immediately or stored for later use. Two important
approaches for data ingestion are use of an API and web scraping. The API
delivers data in an agreed-upon format. Web scraping can be an effective
method for collecting data, but it can encounter many challenges such as
changes to the structure of web pages.
Data integration consists of preparing and combining the ingested data.
The common pipeline consists of an extraction step, a transform step, and a
load step. This is called the ETL pipeline. In some cases, such as when the
important data elements are not known, the data is stored without processing.
Raw data is unprocessed data. As data is increasingly cleaned and pro-
cessed, it becomes more trusted for downstream applications. Metadata is
data about data, such as the genre and runtime of a movie. This information
can be helpful when searching for relevant objects.
Data can be at rest, in use, or in motion. The velocity of the data is the
speed at which it enters a system for processing. We will measure data volume
as the number of rows of data. High-velocity and high-volume data necessitate
specialized tools for storage and processing.
Some datasets are finite and they can be processed in a large chunk. This
is called batch processing. Data that is infinite, such as video from a sensor, is
called streaming data. This kind of data requires decisions such as what data
to store (since it can’t all be stored) and when to compute (since there is no
end to the data).
Data at rest may be structured, semi-structured, or unstructured. Struc-
tured data can be stored in tables, and transactions are an example. Semi-
structured data is generally hierarchical, and it is stored as key-value pairs.
Home address is an example of semi-structured data. Unstructured data comes
from a variety of sources, such as audio files and images. This format cannot
be easily stored into a tabular form.
To be useful, data needs to be credible and reliable, and it must faithfully
represent the population. In cases where the data is not representative, there
will be bias. Care needs to be taken to prevent and mitigate bias, as it can
have harmful consequences and lead to poor decisions.
In the next chapter, we will briefly study computing preliminaries that
every effective data scientist needs to know. This includes hardware, software,
and version control. We will also get started with essential tools including
GitHub.
4.13 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
5
Computing Preliminaries and Setup
We have covered a lot of information about data and data science project
planning. Doing data science requires computing, and this section will cover
the basics for getting started. Machine learning models are pushing compu-
tational limits, and this requires the latest hardware. We will discuss the
essential hardware components of a computer.
Many new data scientists struggle with operating system differences, file
paths, input/output, and the terminal. While most corporations run the Linux
operating system, many new data scientists have never used Linux. The ter-
minal can be intimidating to the uninitiated, as it is a blank screen. We will
dive in and get experience running commands in the terminal.
The term “cloud” has been briefly mentioned earlier. This chapter will
outline cloud computing, the major players in this space, and the services that
they offer. Companies seek data scientists with experience in the cloud, so this
is an important skill.
Data science is done in teams, and collaborating on code is a daily necessity.
We will discuss how to use Git for version control and GitHub for collaborating
on coding projects. As a side benefit, this will provide the background for
working with the code repo for this book.
Lastly, we will cover some fundamentals of computing with Python. This
will include different ways of running Python code, and using pre-existing
code. Following this chapter, there will be many opportunities to practice and
strengthen Python skills.
5.1 Hardware
Hardware consists of the physical components of the computer. We will focus
on the main components that help us do data science: processors, RAM, hard
drive, and the motherboard.
5.1.1 Processor
The central processing unit (CPU) is the brain of the computer. It is con-
structed from millions of transistors, and consists of one or more cores. Each
core receives instructions from a single computing task, processes the informa-
tion, and temporarily stores it in the Random Access Memory (RAM). The
results are then sent to the relevant components [18].
The CPU is critical in determining how fast programs can run. Major
factors that determine the processing speed are the number of processor cores,
the clock speed, and the word size. The clock speed is the number of cycles the
CPU executes per second, measured in gigahertz (GHz). The word size is the
amount of data that can be transferred to the CPU in one pass. A good clock
speed for today's machines is 4 GHz, which means 4 billion cycles per
second. A typical word size is 32 bits or 64 bits, where a bit is the smallest unit
of data stored by a computer. Each bit will take a value of 0 or 1. For both
the clock speed and the word size, a larger value will mean faster processing.
The graphical processing unit (GPU) is a processor made of specialized
cores. Like the CPU, it is a critical computing engine. However, its archi-
tecture is different, as its purpose is to accelerate 3D rendering tasks [19].
It does this by performing parallel computations, which happens to be very
useful for heavy machine learning workloads. Today, some GPUs are designed
specifically for machine learning, and some machine learning algorithms and
frameworks – specifically deep learning, which is a subarea of machine learning
– are designed for GPUs.
The technology of CPUs and GPUs has been rapidly improving, which has
supported major advances in data science. Today, GPUs may be integrated
into CPUs. Resource intensive tasks, such as computer vision, benefit from
the combination of CPUs and GPUs. It is common for a data scientist to
think about the task at hand, and select a machine with a desirable number
of CPUs and GPUs.
Cloud providers make it easy to select a machine with the desired amount of RAM and
number of CPUs and GPUs. We will discuss cloud computing in more detail later in this
chapter.
5.1.3 Storage
The computer hard drive, also called disk space or disk, is another option
for data storage. Disk is used to store files supporting the operating system,
files for running applications, and other miscellaneous files. Disk is much less
expensive than RAM, but it is also much slower for file access, which translates
to slower processing speeds.
The two main types of disk storage are the older hard disk drive (HDD)
technology and the newer solid-state drive (SSD). The HDD consists of a
spinning platter (the disk) and an arm containing several heads. The arm
moves across the platter to read and write data. SSDs are mechanically very
different, as they store the data in integrated circuits. As they don’t use a
spinning disk and arm, they can be smaller, and the time to read and write is
much faster. SSDs are generally more expensive than HDDs, but the cost has
been falling over time.
5.1.4 Motherboard
The motherboard is a board with integrated circuitry that connects all of
the principal hardware components. It distributes electricity from the power
supply to the essential components, including the CPU, GPU, RAM, and disk.
5.2 Software
Software is the set of instructions used to operate computers and execute
tasks. It can be stored on disk or in memory. The most important software on
the computer is the operating system, or OS. The OS manages the computer’s
memory and processes, including all of its hardware and software. Bringing
up the system information page will show the OS name, version, and other
specifications.
There will be some user experience differences depending on the running
OS. Visual differences will be most apparent, as the backgrounds, icons, and
window panes will look different. Functionality will also vary between the
operating systems, and next we will discuss how directories are structured.
Tip: The Mac operating system, like Linux, derives from Unix, and this explains
why some of their functionality is the same or very similar. This also explains
why instructions may be grouped for Mac and Linux users.
This book will use this convention as well, treating Mac as a catchall for the
Mac and Linux operating systems.
5.2.1 Modules
Programming languages provide the capability to save and distribute software
with modules (also called libraries or packages). For commonly used function-
ality, such as matrix computations, data processing, and statistical modeling,
this can help others save tremendous amounts of time and effort.
In particular, Python users can leverage Python modules, which organize
functions and structures into files. We briefly saw examples which used the
requests and Beautiful Soup modules. Anaconda has many essential mod-
ules pre-installed, and it is straightforward to install many others by running
the command pip install [package name], where [package name] is replaced
with the name of the desired package. This may be run in a terminal or a Jupyter notebook, for
example. We will review these applications later in this chapter.
To use an installed module in a program, it first needs to be imported:

import os
If a user wants to use a specific function from the os module, such as chdir(),
this can be done by running
os.chdir()
We will import many modules throughout this text. It will become very evi-
dent how we can build on the impressive work of others.
5.3 I/O
Input/output (I/O) refers to the communication between information process-
ing systems. In this book, it will be most common to read data from a file,
which is the input, and use Python code as a set of processing instructions
for the computer. When the computer is finished processing the data, it will
send the result as output. The output might be a new file containing data, a
beautiful graph, or some other form.
Files are stored on a computer at specific locations, or paths. New data scientists may be surprised to learn that paths
look different across operating systems. For example, a file path on Windows
might look like:

C:\Users\user_name\projectx\datasets\file.txt

while the same file on a Mac or Linux machine might be found at:

/home/user_name/projectx/datasets/file.txt
Pathing can be a major source of confusion, so this section will outline how it
works. There are two differences in the appearance of paths: the drive letter
and the slashes. Windows operating systems use a directory for each drive.
Here, file.txt is stored on C:, which is the hard drive. Mac operating
systems, by contrast, store everything in a single directory.
Secondly, Windows machines use backslashes “\” in paths, while Macs use
forward slashes “/”. When coding in Python (on any operating system), the
forward slash is required for paths.
There are two options when specifying file paths. The absolute path states
the entire path, while the relative path provides the location from the working
directory. The working directory is where the system would start searching
for files.
Our current example uses an absolute path. A quick indicator is that a
drive letter appears at the start of the path. There are advantages to using
relative paths. The first is that the absolute path can get long if a file is
deeply nested in the hierarchy. The second, which is extremely important,
is that relative paths make code portable. Absolute paths are generally not
portable, because different machines will generally use different paths.
For example, consider a code snippet with an absolute path to a home
directory like /home/apt4c/projectx/datasets/file.txt. If a different user
ran this code on a different machine, it would break because the file would not
be found at that path. Instead, a relative path can be specified in the code.
Suppose that the working directory is /home/apt4c/projectx/. This would
be where the system would begin a search for files. We provide the path from
that location, including the file name, which looks like this:
'./datasets/file.txt'
The dot denotes the current location and /datasets/ directs the search into
the datasets folder for the file. If a different user runs this code from the
working directory projectx, then the relative path would be successful. This
60 Computing Preliminaries and Setup
avoids the issue of machines having different paths earlier in the directory. To
summarize this important point: relative paths make code portable, while absolute
paths generally do not.
Later in the book, we will need to use relative paths that back up one directory.
For example, this would mean navigating from
/home/user_name/projectx/datasets/
to
/home/user_name/projectx/
In a path, .. refers to the parent directory, so it backs up one level. Similarly,
../.. will back up two directories, and this pattern can be extended.
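A quick sketch of these ideas in Python, using the os module (the printed directories will depend on whatever machine runs the code):

import os

print(os.getcwd())            # the current working directory
os.chdir('..')                # back up one directory
print(os.getcwd())            # now the parent directory is the working directory
print(os.path.abspath('..'))  # resolve a relative path without changing directory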
The formats we will use the most for data are CSV and TXT. The CSV format
separates columns with the comma delimiter. It can be very useful when each
row contains the same number of columns. Here is a small example of what
the first few rows of a CSV file might look like:
year,location_captured,females,males,cubs,unknown_sex,total
2014,China,14,11,2,0,27
2014,Russia,25,21,3,2,51
This dataset includes a header row which provides the names for each field.
The header is not always included, and it is best to check this before plunging
ahead into analysis.
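When a CSV file is read in Python, the header can be handled explicitly; a short sketch with pandas (the file names here are hypothetical) looks like this:

import pandas as pd

# by default, the first row of the file is treated as the header
df = pd.read_csv('leopards.csv')

# for a file without a header row, tell pandas not to expect one
df_no_header = pd.read_csv('raw_counts.csv', header=None)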
The TXT format is very useful when the data doesn’t follow a structure, such
as the contents of a web page or a document.
A format that is increasingly useful in big data is the Parquet format, which
stores a dataset by its columns, along with its metadata. It provides efficient
data compression for fast storage and retrieval.
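A sketch of writing and reading Parquet with pandas (the file name is hypothetical, and a Parquet engine such as pyarrow needs to be installed):

import pandas as pd

df = pd.DataFrame({'year': [2014, 2015], 'total': [64, 67]})
df.to_parquet('leopards.parquet')          # column-oriented, compressed storage
df2 = pd.read_parquet('leopards.parquet')  # loads the data and its metadata back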
Python provides functions for easily saving and loading file formats. It won’t
be necessary to spend time writing code to loop over rows of data and load
them in memory. This means there will be more time for doing data science
and less time spent parsing data.
When we call a Python function to load data, the dataset is saved in RAM. It
is important to verify that there is enough RAM to store the data in computer
memory. For example, a machine with 16 GB of memory will not be able to
load a 100 GB video file; such an attempt will fail with an out of memory
exception.
Most Linux and Mac operating systems use Bash as the default shell. Win-
dows machines come preloaded with two shells: cmd and PowerShell. There
are several other shell programs, but at a basic level, they all support user
interaction, and they differ in some of their functionality. Some power users
can get quite particular about their shells!
There are two ways that users can interact with the shell: visually through the
graphical user interface (GUI), and through the CLI. Most users are familiar
with clicking icons in a GUI. Use of the CLI is often through a terminal
environment.
On a Mac, users can bring up terminal with Spotlight (search for “terminal”).
On a Windows machine, clicking the windows button and typing “terminal”
will bring up PowerShell.
On a Mac, the terminal icon looks like the black square shown below.
Terminal icon
The > symbol is called the prompt, and it is where users enter commands.
The line where commands are entered is called (very sensibly) the command
line.
Command Prompt:
C:\Users\apt4c>
The prompt displays the working directory followed by a blinking cursor. This
is usually intimidating for the uninitiated, as there is not much happening. It
is waiting for a command. The contents of the working directory can be listed
with the ls command, followed by Enter.
The terms shell, terminal, and command line tend to be used interchangeably,
and this is because they are fundamentally intertwined. The shell is the pro-
gram offering interaction with the OS, the terminal is a specific environment
for the interaction, and the command line is the actual line where commands
are entered. The book will use statements such as “open a terminal” or “run
this at the command line,” and clarification will be given as needed.
The command line offers two major benefits over the GUI. Since all of the
commands are typed with the keyboard, this becomes extremely efficient as
the commands and keyboard shortcuts are memorized. This avoids repeated
movement between the mouse and the keyboard. Additionally, since all com-
mands are typed, they can be easily shared with others trying to replicate
the work. The commands entered in the terminal are stored, and they can
be pulled up with the history command. By comparison, replicating work in
the GUI requires screenshots or other media to clarify the steps; this creates
additional work for everyone.
There is a lot more that can be said about the command line, and we will cover
what we need during our journey. For an excellent, comprehensive treatment
of command line tools, see [20].
5.5.1 Git
The most widely used modern version control system in the world is Git. It is
mature, open source, optimized for performance, and secure. To track history,
it focuses on changes in the file content, or diffs. This means that Git won’t
be fooled by changes to a file’s name; only the file content matters for tracking
diffs.
Git is a very powerful tool that provides a set of commands that can be
entered in a terminal or in a GUI. For details, see [21]; the book is freely
available here: https://git-scm.com/book/en/v2.
The first step is to install Git, which can be downloaded from here:
https://git-scm.com/downloads
Tip: On a Mac, it may be easier to type git in the terminal and install the
command line developer tools. These tools include Git.
We will use the terminal in this book when working with Git. The following
illustration will use PowerShell. After typing git and pressing Enter, a set of
commands will be displayed, as shown below. If these commands aren’t listed,
Git did not install properly. Try reinstalling Git in this case.
Data scientists typically use Git as part of their daily workflow, and with practice comes
expertise. The optional appendix at the end of this chapter discusses more
things that can be done with Git. These tasks will not be necessary in the
scope of the book.
5.5.2 GitHub
Next, we will review GitHub, which is a popular internet hosting service for
software development and version control built on Git. GitHub also provides
tools for software feature requests, task management, continuous integration,
and project wikis. For more details on GitHub, see [22].
GitHub and Git are often confused, and clarification is in order. Git can
be used for version control independently of GitHub (or similar services like
GitLab). However, tools like GitHub are required to collaborate on version-
controlled software over the web with other users. Many users opt for GitHub
even for themselves, as it provides a nice user interface and many helpful
features.
If you don’t have a GitHub account, visit the GitHub site to create one:
https://github.com/
The signup process will ask for an email address, username, and password.
Open another tab and go to the GitHub course repository found here:
https://github.com/PredictioNN/intro_data_science_course/
The course repo landing page, shown in Figure 5.2, contains folders for each
semester, instructions, and a README file which serves as the syllabus, among
other things.
Then visit your personal GitHub page. The repo should appear in your ac-
count.
Tip: Users can make changes to the forked copy, such as adding helpful notes
and files. This will not affect the original course repo. Additionally, users can
fetch updates from the original course repo and submit changes for review
with a pull request.
In this section, you will clone the repo from your GitHub account to your
computer. To clone the repo, click the green Code button.
Next, with the HTTPS tab selected, click the copy icon, which will copy the
URL of the repo. Note that in place of PredictioNN in the box shown in
Figure 5.3, you will see your username.
Next, open a terminal and change into the directory where the repo should live, for
example:

C:\Users\apt4c\Documents\repos
Figure 5.5 shows the directory change and clone command. Notice how the
prompt updates to the changed directory.
Running git clone will prompt the user for a username and password. The
personal access token can be used for the password.
Once authenticated, the folder and complete set of files should copy to the
specified path. Figure 5.6 shows the input and output for this example. If this
succeeds, give yourself a big congratulations!
If you ran into trouble or found any of these steps confusing, take your time,
visit the Git and GitHub help pages on the web, and return when you are
ready. You can continue through the book in the meantime.
semester1/week_01_intro
Tip: You can click into this path to navigate backward as needed.
semester1/week_01_intro
If you have an application to view pdf files, like Acrobat Reader, you should
be able to open the file.
If that worked, congrats! You are now able to work directly in the course repo.
what_kinds_of_problems_can_ds_address.ipynb.
This file can be directly viewed in GitHub, and it contains the same informa-
tion as the pdf version. The .ipynb extension signifies that it is a Jupyter
notebook file. In the next section, you will install Anaconda on your machine.
This provides support for running the notebooks, among many other things.
It should detect your operating system and figure out the right software ver-
sion. Here is an example recommendation of the Windows 64-Bit version (Fig-
ure 5.7). Currently, Python is on version 3.9.
Next, step through the install process and follow the recommendations.
After the install is finished, open the Anaconda Navigator, which provides
many fully featured tools.
what_kinds_of_problems_can_ds_address.ipynb
The file opens, and we see that the content is arranged in cell blocks.
The cells can contain Markdown (rich text with a specific format) or Code. It
is possible to toggle between Markdown and Code using the drop-down shown
here:
There are menus that support working with files and changing settings, among
other things. Keyboard shortcuts are supported and allow actions like chang-
ing formats, adding cells, deleting cells, and running them. It is very helpful
to learn the shortcuts to move fast, but they are not critical.
JupyterLab automatically saves the notebook every two minutes; this is called
checkpointing. Users can prompt it to save by clicking the save button (the
floppy disk icon left of the Markdown drop-down).
The section at the top right displays a notebook kernel such as Python 3
(ipykernel). The notebook kernel is a computational engine that executes
the code in the notebook. For the purpose of this book, it won’t need to be
changed, but there are a few things worth knowing about the kernel at this
point:
(1) Sometimes the kernel dies (this is typically rare), or a computation stalls
or runs for a long time. The kernel can be restarted by selecting Kernel >
Restart Kernel from the main menu at the top.
(2) Different kernels can be prepared and loaded for different purposes. For
example, a user might have a kernel that runs a deep learning framework.
If you made it to this step and you can run a Jupyter Notebook in JupyterLab,
I commend you! You have completed all of the setup needed for the book.
5.7.1 IDEs
Interactive Development Environments, or IDEs, provide a rich environment
for developing, debugging, and testing code. They include several panes for
various functionality such as writing code, browsing files, and interactively
running code.
Many IDEs have advanced, time-saving features including code comple-
tion, text highlighting, and variable exploration. Anaconda comes with an IDE
called Spyder which supports Python. Figure 5.9 shows the default multi-pane
display of Spyder, with a notepad for file editing at left, an object browser
at top right, and an interactive Python console at bottom right. Users can
run code at a command line in this console. It is also possible to access the
command history from a separate tab.
While the IDE is very useful for software development, we will use JupyterLab
in this book. JupyterLab is very useful for interactive scripting and demos.
Cloud service offerings are commonly grouped into categories which often
interact:
Cloud providers own and maintain massive warehouses of hardware, and cus-
tomers can access these resources virtually through an internet connection.
Security is essential in this setup, whereby an organization’s data is transmit-
ted and placed on external servers which are shared by others. Customers pay
only for what they use, which is called consumption-based pricing.2 Providers
offer a variety of hardware and software for different use cases and patterns.
2 Note that forgetting to shut down a server, an issue called "leaving the lights on," will
still incur charges.
Most of the Fortune 500 companies use cloud-based services. This makes cloud
skills essential for data science and related roles. Here is a brief sketch of how a
data scientist named Carla might work in the cloud: Carla arrives at the office
ready to test a new predictor in a recommender system that her team built.
She logs into the AWS Management Console and launches SageMaker Studio.
She opens a notebook and selects the desired EC2 compute instances. Since
she will be training a large machine learning model, she selects an instance
from the P3 class. The datasets are saved in an S3 bucket (object file storage),
and she loads them into the notebook. She pulls the latest project code from
GitHub, creates a new branch for her experiment, and starts to modify code.
She runs experiments and saves her finalized model (to S3) and her code (to
GitHub). When she’s finished, Carla shuts down the SageMaker instance.
From the forked copy in your GitHub account, find the Sync fork button
as shown in Figure 5.11. Clicking the button Update branch will trigger the
update and merge changes.
Following the sync, the forked copy will be current (Panel B).
Lastly, pull updates to the clone by returning to the terminal, changing directory
to the repo, and running the command:

git pull
This will pull the changes from the forked copy to your computer. This com-
pletes the sync (Panel C).
The mkdir command creates a new directory (folder) for the project, called
new_git_project.
Finally, the git init command initializes the project as a Git repo (Figure
5.12).
Git mentions that to track this file, the command git add can be run, and
so we do this.
The file has “gone green,” meaning that Git is now tracking it. Git manages
tracking of files in a staging area. One or more files can be added to this
staging area. For example, to add all files with extension txt, a file name
pattern with the wildcard operator * can be used like this:

git add *.txt
File changes in the staging area can be finalized with the commit command
as shown in Figure 5.14.
The command includes a flag which appends a message to the commit, where
the flag is -m and the message is
'added test_file'
A useful message will describe what the commit did. Lastly, running git
status again will show that the Git staging area is empty and the commit is
complete, as in Figure 5.15.
New users to Git frequently ask why separate add and commit commands
need to be run. The add command is used to mark files to be committed,
which is called staging. It may happen that files are not all staged at once, or
that some files are removed from staging. Once the user is ready, the commit
command will save and track the changes. Using the analogy of a flight, the
add command gets people onto the plane, and the commit command is akin
to takeoff.
5.11 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
6
Data Processing

After data is ingested and integrated, it needs to be processed from its raw
form to make it more useful. There are many ways that processing might be
done. Some tasks are done as part of a research and development function,
such as interactive processing as part of a data science pipeline workflow. For
example, a data scientist might load data from a file or database and write
and run code to process the data for machine learning. Since the work is
experimental and not required in real time, the speed of processing might not
be critical. This kind of work is considered part of the dev environment.
On the other hand, some processing is time sensitive, as the data might
feed a model that provides recommendations to customers in real time. If
the customer needs to wait more than a few seconds for the response, she
might abandon the application. This workload is running in the production
environment, or prod . Data processing in prod needs to be automated and
optimized for speed. It is unsafe for workers to test features in prod, as they
might interrupt the product delivered to customers.
The methods and tools for working with data will depend on how the data
is structured, which environment is in use, and other factors. In this chapter,
we will build coding skills to process data in unstructured and structured
formats. The examples will begin with smaller datasets and code samples,
which will focus attention on learning the concepts and techniques.
The first dataset in our work will be an unstructured document in TXT
format. The second dataset is structured and saved in CSV format. The third
dataset is a larger CSV file in need of cleaning. It contains missing values
and extreme observations. Since real-world datasets can be large, the cleaning
and exploration should be done programmatically. Statistical and graphical
summaries can be used to identify issues.
In practice, there may be tens of thousands of such text files that need to be
processed and analyzed in a batch.
Datasets will be found in the course repo intro_data_science_course in one
of two folders:
semester1/datasets
semester2/datasets
The full path includes the path above with the file name, for example
semester1/datasets/california_wildfire.txt. The notebook for this section is
located at:

semester1/week_03_04_data_types/
california_wildfire_text_processing.ipynb

Note that this single path extends to a second line due to margin constraints.
import os
os.getcwd()
This imports the os module and uses the getcwd() function to get the current
working directory. We will talk more about Python functions later, but for
now a quick definition will suffice:
A Python function takes zero or more inputs, runs a set of instructions, and
returns zero or more outputs. At their best, they do useful work, and they are
easily reusable.
OUT:
'C:\\Users\\apt4c\\Documents\\PredictioNN_LLC\\repos
\\intro_data_science_course\\semester1\\week_03_04_data_types'
Since the notebook is saved in week_03_04_data_types, this working directory
makes sense. Paths will be enclosed in quotes, and either single- or double-
quotes will work.
To find the data file, we move like this from week_03_04_data_types:
1. back up one directory
2. search in the datasets folder for the file california_wildfire.txt
The command .. will back up one directory. Folders and files need to be
separated with forward slash, and so we assemble the relative path like this:
PATH_TO_DATA = '../datasets/california_wildfire.txt'
Before loading in the data, there are two more concepts we need to cover:
variables and strings.
6.1.3 Variables
We assigned the path to a variable named PATH_TO_DATA. Every Python object
lives at a location in memory, and the id() function returns an identifier tied to
that location:
id(PATH_TO_DATA)
OUT:
2340478058544
4badvar = 5
OUT:
Input In [11]
4badvar = 5
^
SyntaxError: invalid syntax
Python throws a syntax error, due to an invalid variable name. In this case,
it doesn’t like that the variable begins with a number.
6.1.4 Strings
When we assigned the path variable, we issued:
PATH_TO_DATA = '../datasets/california_wildfire.txt'
This data type, enclosed in quotes, is a string. The string is one of the four
primitive data types that we encountered earlier. It is a sequence of characters,
and the data type commonly represents text. Python will allow single quotes,
double quotes, or triple quotes around a string.
The code below sets the full path to the data file, and it uses functions to open
the file and read from it. First, a quick observation: there are lines containing
this character: #
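A minimal version of that code, consistent with the explanation that follows, looks like this; the lines beginning with # are comments, which Python ignores:

# set the relative path to the data file
PATH_TO_DATA = '../datasets/california_wildfire.txt'

# open the file in read mode and read its entire contents into a string
f = open(PATH_TO_DATA, 'r')
data = f.read()
f.close()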
Now, if things went as intended, data contains the text in the file. Let’s call
the print() function on the data to print its contents. For brevity, we show
the first few lines of output.
print(data)
OUT:
The 2021 California wildfire season was a series of wildfires
that burned across the U.S. state of California. By the end of
2021 a total of 8,835 fires were recorded, burning 2,568,948
acres (1,039,616 ha) across the state. Approximately 3,629
structures were damaged or destroyed by the wildfires, and at
least seven firefighters and two civilians were injured.
This is exciting, as we were able to use Python to read data from a file into
a variable and print it to the console. Let’s dig a little deeper on the open()
and read() functions. Different file formats can be opened and processed with
different functions. For this file in TXT format, the built-in open() function
takes a file path and a mode, and it returns a file object. This code uses the
read mode 'r'. To open a pre-existing file and write to the end of it, one could
use the append mode 'a'. Next, read() returns all file contents as a string.
For more details about open() and other built-in functions, please see the
Python documentation:
https://docs.python.org/3/library/functions.html
It is always a good idea to verify the data type of objects. Oftentimes when
errors arise, it is because our expectations are not aligned with reality: the
data is a different type, the shape of a matrix is different than we thought,
etc.
type(data)
OUT:
str
sent = data.split('.')
print(sent)
OUT:
[‘The 2021 California wildfire season was a series of wildfires
that burned across the U’, ‘S’, ‘ state of California’, ‘ By
the end of 2021 a total of 8,835 fires were recorded, burning
2,568,948 acres (1,039,616 ha) across the state’, ‘
Approximately 3,629 structures were damaged or destroyed by the
wildfires, and at least seven firefighters and two civilians
Splitting on the periods removes them, and we see a few things that are new:
First, the fragments of text are wrapped in quotes, which indicates they are
strings.
Second, there is an '\n\n' string. The '\n' character indicates a new line.
Looking back at the original text, there is a gap between lines 6 and 8. The
first '\n' ended line 6, and the second '\n' skipped line 7.
Third, the output begins with [. If we looked at all of the split text, we would
see that it ends with ]. Let’s check the data type of sent to understand what
is happening.
type(sent)
OUT:
list
It turns out that sent is a list of strings. We have just discovered a new data
type. Unlike strings, the contents of a list can be changed. It is also possible
to subset, or index into, a list to extract elements from it.
sent[0]
OUT:
‘The 2021 California wildfire season was a series of wildfires
that burned across the U’
Notice that the first string was stored in position zero. Python uses this con-
vention for all data types storing a collection of elements. sent[1] will contain
the second element, and if there are 10 elements, then the last one will be
stored in sent[9].
len(sent)
OUT:
21
Running type(21) will return int, which indicates that the value is an integer.
tokens = sent[0].split(' ')
print(tokens)
OUT:
[‘The’,‘2021’,‘California’,‘wildfire’,‘season’,‘was’,‘a’,
‘series’,‘of’,‘wildfires’,‘that’,‘burned’,‘across’,‘the’,‘U’]
Notice what this did: it produced a list of strings, where each string is a word.
Python datatypes can be recognized by their punctuation; strings are enclosed
in quotes, and lists are enclosed in square brackets.
Next, we would like to associate each word with its position in the sentence.
This is useful if we want to incorporate text data in a model, for example. We
can't directly feed text into a model, so we need to represent it numerically.
A Python object called a dictionary can hold this mapping; it stores key:value
pairs. The keys in this case are strings, and the values are integers indicating
positions, starting from zero. The code snippet below provides a solution.
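A snippet along these lines, consistent with the walkthrough below, does the job:

word_index = {}                        # start with an empty dictionary
for ix, token in enumerate(tokens):    # ix counts positions, token holds each word
    word_index[token] = ix             # store the word as key and its position as value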
print(word_index)
OUT:
Dictionaries use curly brackets to wrap their data, and the first command
creates an empty dictionary.
Next, a for loop passes over each word in the list. The enumerate() function
will produce two things for each step of the loop:
The ix variable acts as a counter. For the first step, it will hold value 0. On
the next step, it will be 1, then 2, and so on. The name ix is completely
arbitrary.
The token variable will hold the data at each step. For the first step, it will
hold ‘The’, on the second step it will hold ‘2021’, and so on. The name token
is also arbitrary.
Let’s take the first step in the for loop. At this step, ix = 0 and token =
‘The’
The next thing that happens is that we store the value into the dictionary at
a precise location.
word_index[token] = ix
gives the instruction to store the value 0 into the dictionary where the key is
“The”.
If we wanted to check the value in the dictionary when the key is “The” we
could run:
word_index['The']
The loop then does the identical work for each of the other words, storing the
words as the keys and the counters as the values.
After the for loop processes each word in the list, it stops. Before continuing,
please convince yourself that this works, and run the code. One thing that
often helps is to print intermediate results. Here is a snippet for testing, which
will print the counter and value on each iteration:
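A version along these lines will do:

for ix, token in enumerate(tokens):
    print(ix, token)   # show the counter and the word at each step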
The last task in this exercise is to extract a string with duplicate words, and
run a command to determine the unique words. A Python data type called a
set will do this for us. A set holds a unique collection of objects. If the objects
are not unique, it deduplicates them to make them unique. It will be highly
useful in data science to work with unique objects, such as unique users and
unique products.
sent[9].split(' ')
OUT:
[‘\n\nThe’,‘long’,‘term’,‘trend’,‘is’,‘that’,‘wildfires’,
‘in’,‘the’, ‘state’,‘are’,‘increasing’,‘due’,‘to’,‘climate’,
‘change’,‘in’,‘California’]
The solution is quick: Wrap set() around the statement like this:
set(sent[9].split(' '))
OUT:
{‘\n\nThe’,‘California’,‘are’,‘change’,‘climate’,
‘due’,‘in’,‘increasing’,‘is’,‘long’,‘state’,‘term’,
‘that’,‘the’,‘to’,‘trend’,‘wildfires’}
This object is a set, and the words are unique. Notice that the order is shuffled;
sets cannot be used to maintain order.
help(len)
OUT:
len(obj, /)
Return the number of items in a container.
In addition, Python has excellent options for help on the web, including its own
documentation (https://docs.python.org/) and Stack Overflow. The truth
is that data scientists and software developers do a lot of web searching when
writing code. They get more efficient in their searching as they understand
the fundamentals and relevant keywords.
import pandas as pd
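The CSV file is then loaded into a dataframe named df with a line along these lines (the exact file name in the course repo may differ):

# read the camera trap counts into a pandas dataframe
df = pd.read_csv('../datasets/amur_leopards.csv')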
The data represents the counts and sex of Amur leopards captured by camera
trap surveys in China and Russia. In some cases, the China and Russia totals
exceed China & Russia, and this is because some leopards were observed in
both countries. The importance of understanding what the data means cannot
be overstated.
Next, we’re going to run through a series of tasks with the data to build our
skillset.
df.values
OUT:
array([[2014, ‘China’, 14, 11, 2, 0, 27],
[2014, ‘Russia’, 25, 21, 3, 2, 51],
[2014, ‘China & Russia’, 33, 24, 5, 2, 64],
[2015, ‘China’, 10, 12, 0, 0, 22],
[2015, ‘Russia’, 24, 20, 8, 3, 55],
[2015, ‘China & Russia’, 31, 25, 8, 3, 67]],
dtype=object)
df.columns
OUT:
Index([‘year’, ‘location_captured’, ‘females’, ‘males’, ‘cubs’,
‘unknown_sex’,‘total’],dtype=’object’)
list(df.columns)
OUT:
[‘year’,‘location_captured’,‘females’,‘males’,‘cubs’,
‘unknown_sex’,‘total’]
In the first column of the dataframe is a set of values called indexes. These are
generated by pandas. The index column can be used to reference the rows,
and they can be extracted by running:
df.index
OUT:
RangeIndex(start=0, stop=6, step=1)
We can subset on the first row like this:
df.loc[0]
OUT:
year 2014
location_captured China
females 14
males 11
cubs 2
unknown_sex 0
total 27
Name: 0, dtype: object
When one row or column is returned, its data type is actually not a pandas
dataframe, but rather a pandas Series.
6.2.2 Subsetting
Subsetting on multiple rows will produce a dataframe. Here, we select rows 0
through 2:
df.loc[0:2]
OUT:
year location captured females males cubs unknown sex total
2014 China 14 11 2 0 27
2014 Russia 25 21 3 2 51
2014 China & Russia 33 24 5 2 64
df['year']
OUT:
0 2014
1 2014
2 2014
3 2015
4 2015
5 2015
Name: year, dtype: int64
Select multiple columns using a list of strings:
df[['year','location_captured']]
OUT:
index year location captured
0 2014 China
1 2014 Russia
2 2014 China & Russia
3 2015 China
4 2015 Russia
5 2015 China & Russia
Next, let’s get a little fancier. We will apply some filtering to the data.
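The filter applied here is along these lines (a sketch; the exact expression in the notebook may differ slightly):

mask = df.females < 15   # a Series of Booleans: True where fewer than 15 females
df[mask]                 # keep only the rows where the mask is True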
OUT:
year location captured females males cubs unknown sex total
2014 China 14 11 2 0 27
2015 China 10 12 0 0 22
These Booleans are used to filter the rows of the dataframe df. This means
that only rows where the value is True will be returned, which explains why
rows 0 and 3 were output.
How can we determine which locations captured less than 15 females in 2015?
We can filter on multiple conditions to answer this question, since we are
applying criteria on females and year.
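A sketch of the combined filter:

# parentheses around each condition are required
df[(df.females < 15) & (df.year == 2015)]   # returns only the 2015 China row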
Notice that & is the AND operator (note that | is the OR operator).
This gets the correct row, but it returns unnecessary columns. Let's adjust
the statement to select only the location_captured column.
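One way to write this (again a sketch):

df[(df.females < 15) & (df.year == 2015)]['location_captured']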
This gets the desired answer. The syntax we used to subset on rows and
columns was:
dataframe[row_condition][column_condition]
The column condition was simple: it was the column name. If we want several
columns, we can put them in a list.
We have been subsetting data by using dataframe methods. For more complex
subsetting, it often makes sense to use SQL, which is shorthand for Structured
Query Language. We will dive into SQL later in the book.
Alternatively, we can use a dot notation on the right, which is shorter and
often preferred:
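The assignment looks roughly like this:

df['f_and_m'] = df.females + df.males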
OUT:
year location captured fems* males cubs unk sex* total f and m
2014 China 14 11 2 0 27 25
2014 Russia 25 21 3 2 51 46
2014 China & Russia 33 24 5 2 64 57
2015 China 10 12 0 0 22 22
2015 Russia 24 20 8 3 55 44
2015 China & Russia 31 25 8 3 67 56
* females and unknown sex are abbreviated to fit. index is suppressed.
Pandas performed this operation on each row, or row-wise, which was very
convenient. For more complex row-wise operations, we can define a function
and use the apply() function. We will discuss this later.
We cannot use the dot notation when creating a new variable; this is not valid
syntax. Watch what happens:
OUT:
UserWarning: Pandas doesn’t allow columns to be created via a
new attribute name - see https://pandas.pydata.org/pandas-
docs/stable/indexing.html#attribute-access
df.f_and_m = df.females + df.males
We will do two final things with this dataset: sort the data by the new column,
and save it to disk.
6.2.4 Sorting
We can sort by one or more columns by passing a list of the column names
to the sort_values() function. Sorting will be temporary unless we assign the result
to a variable or include the parameter inplace=True. Let’s sort by our new
column from high to low, or descending. To sort descending, set the parameter
ascending=False.
df.sort_values(['f_and_m'], ascending=False,
inplace=True)
OUT:
year location captured fems males cubs unk sex total f and m
2014 China & Russia 33 24 5 2 64 57
2015 China & Russia 31 25 8 3 67 56
2014 Russia 25 21 3 2 51 46
2015 Russia 24 20 8 3 55 44
2014 China 14 11 2 0 27 25
2015 China 10 12 0 0 22 22
PATH_TO_OUTFILE = './amur_leopards_final.csv'
df.to_csv(PATH_TO_OUTFILE)
Notice there are some irregularities with data types. The systolic blood pres-
sure column seems to contain integers, while the diastolic blood pressure col-
umn looks like floating point values, or floats. For consistency and to avoid
issues later, we should investigate the data further, think about which type
makes sense, and make the appropriate conversions.
We can check the data types of the dataframe by retrieving its dtypes at-
tribute like this:
df.dtypes
OUT:
patientid int64
date object
bp_systolic int64
bp_diastolic float64
This confirms our suspicion. One thing to note is that a missing value in
a column will result in the entire column having the float data type. Let’s
summarize the data next by calling describe() which is a fast way to compute
column counts, percentiles, means, and standard deviations (std).
df.describe()
OUT:
Notice there are 99 systolic blood pressure values, but only 98 diastolic val-
ues. Let’s check for missing values by calling the isnull() function on the
bp_diastolic column:
df[df.bp_diastolic.isnull() == True]
This places a condition on the rows: for any row where bp_diastolic is miss-
ing, it will result in True. The dataframe is then filtered by the resulting
Booleans, and all rows with the True condition are returned.
OUT:
This isolates the row with the missing value (coded as NaN) and explains why
the column contains floats. Next, we should decide how to impute, or fill in,
this value.
6.3.2 Imputation
Pandas offers several different methods for easily imputing values, such as
replacement with the median, the mean, a value of choice, or the last non-
missing value. Like all things with data science, it is important to consider
what makes sense conceptually. For any variable that is imputed, the analyst
needs to ask what is appropriate. It may be the case that each variable needs
its own imputation method.
For this dataset, there are several patients, and each patient has several
observations.1 It might make more sense to use each patient’s data for impu-
tation, rather than computing a statistic across all patients. The choice made
here is to impute with the last non-missing value, or last value carried forward
(LVCF). This might not be the true measurement at that time point, but it
may be reasonable. The pandas ffill() function will accomplish this, and
passing the parameter inplace=True will update the value in the dataframe.
df.ffill(inplace=True)
Checking for missing again returns no records, meaning that imputation was
successful.
df[df.bp_diastolic.isnull() == True]
OUT:
Looking at records around the previously missing row confirms the LVCF:
1 The number of unique patients can be computed with len(df.patientid.unique())
df.loc[5:7]
OUT:
df = df.astype({'bp_diastolic': np.int32})
We used the astype() function, passing a dictionary with the column name
as key and the required data type as value. We could have passed additional
key:value pairs to convert multiple columns if we wished.
Now we check the data types, and we see that bp_diastolic has the desired
integer type.
df.dtypes
OUT:
patientid int64
date object
bp_systolic int64
bp_diastolic int32
The data type conversion produces a new dataframe. We can see this by calling
id() before and after the assignment and noticing that the dataframe memory
addresses differ.
id(df)
2641496136432
df = df.astype({'bp_diastolic': np.int32})
id(df)
2641537679264
Next, let’s turn to the patient with the systolic measurement of 200. Unlike
the prior case, we cannot simply reject it as incorrect. It is best to check for
contextual information before taking action. If this data was from an electronic
health record system, we might check for notes. In this case, the correspond-
ingly high diastolic measurement of 120 offers supporting evidence. Acting
conservatively, we leave the outlier measurement in place.
semester1/week_05_08_data_clean_prepare/pandas_dataframes.ipynb
semester1/week_05_08_data_clean_prepare/pandas_dataframes2.ipynb
Another helpful resource is the “10 minutes to pandas” guide, which can be
found here:
https://pandas.pydata.org/docs/user_guide/10min.html
We have made a lot of progress in building data processing skills. In the next
chapter, we will learn about storing and retrieving data. The data might have
been processed and saved by a data scientist, or it might have been done
automatically as part of a system. Databases are essential to data storage,
and we will learn what they do and how to work with them.
6.5 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
‘C:\Users\abc\Documents\some_file.csv’
8. Write a function that takes word_index as input and returns a new dic-
tionary where the keys and values are reversed. That is, the keys
should hold the positions and the values should hold the strings.
9. Describe two ways that rows can be selected from a dataframe.
10. For the leopard dataset, there was a point made about understand-
ing what the data represents. Explain this point.
11. S Suppose that you’ve extracted the values from a pandas
dataframe df with the df.values attribute. What is the data type
of these values?
12. For the blood pressure dataset, write code to count the number of
records for each patient. Avoid using a loop.
13. S For the patientid variable in the blood pressure dataset, are
the percentile statistics useful? Explain your answer.
14. S A data scientist uses the ffill() function to impute some miss-
ing data in a dataframe column. Provide an example where this
practice would not work well and explain your reasoning.
15. A data scientist notices patient blood pressure measurements which
are negative. She reasons that such values cannot be negative and
updates them to a value of zero. Is this a good practice? Explain
your reasoning.
7
Data Storage and Retrieval
In this chapter, we will learn about databases and how to use them to store and
retrieve data. We can think of a database as a tool that supports the creation
(C), retrieval (R), updating (U), and deletion (D) of data. These operations
are called the CRUD operations, and they provide minimal requirements for a
database. For example, our brains qualify as a database under this definition.
Unfortunately, brains tend to forget things that aren’t sufficiently used. For a
database to be trusted and useful, it needs to offer:
Several other kinds of databases, often grouped under the NoSQL umbrella, are in wide use:

• Key-value stores: a dictionary structure that allows for flexible, fast storage
• Graph databases: data is stored as nodes (which model objects) and edges
(which model relationships)
• Document databases
The NoSQL databases are very interesting and useful, but since they are spe-
cialized, they are out of scope for this book. References for further exploration
can be found at the end of this chapter.
For the remainder of this chapter, we will focus on studying relational
databases. SQL will be introduced, and we will get hands-on experience work-
ing with a database through Python. This will provide an introduction and
some practice. Databases can be a component of much larger storage systems,
such as a data warehouse or a data lakehouse. We will briefly learn about these
objects.
For example, Callie had 30 sessions with the Sharpen app on her tablet, and
52 sessions across all devices. Based on her activity, she was categorized as a
Moderate User. Suppose that over the next week, she has 15 more sessions
on her tablet. Let us further suppose that a user with 60 or more sessions in
Sharpen across all devices is categorized as a Power User. In this case, Callie’s
status should be updated to Power User. Imagine that the update is made to
the second record of the table, but not the first record. This introduces incon-
sistent data, rendering the database less useful. At the heart of the problem
is the duplication of data. If we maintain user status in a separate table with
a single record for each user, this problem can be avoided. Here is what a
database using two tables can look like:
Student Status
name status
Callie Moderate User
James Power User
Avery New User
Given this new structure, there is no duplication of user status. This reduces
the chance of data inconsistency. In fact, the first table can be broken down
into smaller tables. We won’t do this here, but the important point is that each
concept should be stored in a separate table. The user’s status is a concept,
and this data should be stored in one table. Another concept is device, and the
unique devices should be stored in another table. In this way, the structure is
cleaner and less prone to error.
There is complexity that is introduced when breaking the data into one
table per concept: when the required data is spread across multiple tables,
the tables need to be joined. If the required data is in three tables A, B,
and C, then we need a join between A and B, and a second join between B
and C. This idea is fundamental to relational databases – it is what makes
them “relational.” When joins cannot be done, or they aren’t needed, then a
relational database is not the right tool for the job.
Figure 7.1 shows a database containing two tables. The Customer table
stores all of the customer data, and the Invoice table stores all of the invoices.
This follows the “one table per concept” principle of relational databases. If
we want to combine the information, this can be done by joining the records
on a common field. In this case, ssn appears in both tables, and it can be
used for the join.
Let’s look at two examples. In Example 1, the first two records have matching
first names and last names. Based on names, the records are not unique.
However, the social security number can uniquely identify the records, and it
can be used as the primary key.
In Example 2, the table contains sample records of daily stock prices. Each
record contains a stock ticker, date, and adjusted closing price. It is common
to create tall tables like this for storing time series data. It will not be possible
to use ticker as the primary key, but ticker and date together can uniquely
define a record. We can form a composite key from these two fields. The PK
abbreviation in the table next to ticker and date indicates that these fields
form the composite key.
7.2 SQL
SQL was developed in the 1970s at IBM, and it is the universally accepted
way to work with relational databases. Users submit a query to accomplish
tasks such as creating and deleting tables, inserting data, and retrieving data.
Selecting all of the data from a table, for example, can be done with the query SELECT * FROM my_table;.
There is a small set of very common commands like SELECT which are written
in uppercase by convention. The queries generally follow a style to make them
more readable, such as:
SELECT
product_id,
product_name
FROM my_table
ORDER BY product_name;
This query retrieves data from the columns product_id and product_name
across all rows of the table, sorting by product_name. In the sections that
follow, we will review illustrations of SQL commands in action, along with
detailed explanations. Note that different relational database systems may
use different extensions, and the syntax may differ slightly. Additionally, when
using an API in Python, the API will have its own conventions. We will see
this in the next section.
Practical Matters
All SQL should be stored and version controlled, as it is code. For example,
the commands to build the tables and insert the data might be saved in files
on GitHub. This allows stakeholders to understand what was done, and to
replicate the work if needed.
Enough background info, now let’s get to work! In the next section, we will
use SQLite. We will walk through creating a database and interacting with
it using Python. Specifically, we will write and execute SQL in Python to do
these things:
• Create a table
• Insert records into the table
• Query the data
First, we will install SQLite and work with it from a terminal. Lastly, we will work with SQLite
from Python. In the next section, we will extend this to working with multiple
tables. The course repo contains notebook demonstrations using different data
for variety. The notebooks are located here:
semester2/week_09_11_relational_databases_and_sql/
After downloading and extracting SQLite, if the folder name is long, it can be shortened to sqlite, for example.
It is important to make note of where the database is saved. This location is
the database path.
The next step after installation is to launch a terminal. Windows users will
press the START key and type CMD.
From the terminal, it will be necessary to change the directory to the database
path:
> cd C:\Users\apt4c\Documents\database\sqlite
where the path should be adjusted to your own database location. Launching SQLite
with a database name creates the database file if it does not already exist and opens
an interactive session:

> sqlite3 musicians.db
sqlite>

The .databases command lists the databases attached to the session:

sqlite> .databases
OUT:
main: C:\Users\apt4c\Documents\database\sqlite\musicians.db r/w
It is possible to interact with the database in terminal, but here we will switch
to Python.
Python has APIs for working with databases. Here we use the SQLite API.
The API uses specific code that wraps around the SQL query for execution.
To distinguish the SQL code from the wrapper, the query will be defined in
each code snippet. We begin by importing the module and setting the relative
path to the database. This assumes that the working directory is
C:\Users\apt4c\Documents\database
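The import and connection code is not shown in this excerpt; a minimal sketch, assuming the musicians.db file shown above sits at sqlite/musicians.db relative to this working directory:

import sqlite3

# relative path to the database file
db_path = 'sqlite/musicians.db'

# open a connection to the database
conn = sqlite3.connect(db_path)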
# create cursor
cur = conn.cursor()
Next, we define some artist data. For this small example, we create a list of
tuples containing artist name and genres. For example, Taylor Swift’s music
can be categorized as country and pop.
artists = [
    ('Taylor Swift', "['country', 'pop']"),
    ('Chris Stapleton', "['country','soul','rock','bluegrass']"),
    ('Bono Hewson', "['rock','pop']")]
To create the table in the database, we will need to provide a schema. For
each field, the schema provides the field name and its data type.
Each database transaction ends with a commit command. Let's create the
table, passing the schema:

query = 'CREATE TABLE artist (artist_name string, genre string);'
cur.execute(query)
conn.commit()
Next, insert multiple records of data with executemany(). For each record,
the placeholder (?,?) will be populated with the two columns of data in
artists. The INSERT INTO statement is how records are written to database
tables with SQL.
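The insertion code itself is not reproduced here; a sketch of what it likely looks like, using the placeholder syntax just described:

query = 'INSERT INTO artist VALUES (?,?);'
cur.executemany(query, artists)
conn.commit()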
Let’s verify that the data was stored properly. We will use a SELECT statement
to select all rows and columns. The ‘*’ token will fetch all columns.
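The query is not shown in this excerpt; a sketch consistent with the output below:

query = 'SELECT * FROM artist;'
cur.execute(query)
for row in cur.fetchall():
    print(row)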
OUT:
('Taylor Swift', "['country','pop']")
('Chris Stapleton', "['country','soul','rock','bluegrass']")
('Bono Hewson', "['rock','pop']")
Note that the loop for fetching rows is required by the API. If we wanted to
fetch all of the rows in SQLite at the command line, we would simply run the
query SELECT * FROM artist; at the sqlite> prompt:
OUT:
Taylor Swift|['country','pop']
Chris Stapleton|['country','soul','rock','bluegrass']
Bono Hewson|['rock','pop']
sqlite>
The result set has a different syntax, but it contains the same data.
Next, let’s select only the musicians who sing country music. Filtering is done
with the WHERE command. The text match can be implemented with the LIKE
command like this:
query = "SELECT * \
FROM artist \
WHERE genre LIKE ‘%country%’;"
# NOTES:
# The character ‘\’ is a line break for readability
# The character ‘%’ is a wildcard for string matching
# This will match on genres containing ‘country’
OUT:
(‘Taylor Swift’, "[‘country’,‘pop’]")
(‘Chris Stapleton’, "[‘country’,‘soul’,‘rock’,‘bluegrass’]")
This gives the correct output, but we really only want the artist names. We
were selecting all columns with '*', but we can specify the columns that we
want. Let's run the same query but SELECT only the artist_name column:
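The modified query is not reproduced here; a sketch:

query = "SELECT artist_name \
         FROM artist \
         WHERE genre LIKE '%country%';"
cur.execute(query)
for row in cur.fetchall():
    print(row)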
OUT:
('Taylor Swift',)
('Chris Stapleton',)
This has the right output. If we would like the artist names in a list, we can
store the first element of each tuple like this:
data = []
query = "SELECT artist_name \
         FROM artist \
         WHERE genre LIKE '%country%';"
cur.execute(query)
for row in cur.fetchall():
    data.append(row[0])   # keep only the first element of each tuple
print(data)
OUT:
['Taylor Swift', 'Chris Stapleton']
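The record count shown next can be obtained with the count(*) aggregate; the query is not shown here, but a sketch:

query = 'SELECT count(*) FROM artist;'
cur.execute(query)
print(cur.fetchone()[0])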
OUT:
3
The count(*) function counts the number of records. The fetchone() func-
tion retrieves the result set.
Data aggregations are useful for partitioning a dataset and computing statis-
tics on each piece of the partition. An example is the calculation of a town’s
daily average rainfall by month. These calculations are commonly and effi-
ciently performed with a Split-Apply-Combine pattern. The first step in the
strategy is to define one or more grouping variables (in this case, the grouping
variable would be month). Next, the records are split into discrete groups.
For the records in each group, a function is applied to the values. Here, we
calculate the mean rainfall over the days in January, February, and so on. This
yields the daily average rainfall for each month. In the final step, the results
from each group (the months) are combined into a data structure, such as a
table.
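We next set up a small table of concert venues. The schema query is not shown in this excerpt; a sketch, assuming the table is named venue with fields venue_name and venue_state (the exact names are assumptions):

query = 'CREATE TABLE venue (venue_name string, venue_state string);'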
cur.execute(query)
conn.commit()
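The venue records can then be inserted and verified following the same pattern used for the artist table; a sketch consistent with the output below:

venues = [
    ('Red Rocks', 'CO'),
    ('Santa Barbara Bowl', 'CA'),
    ('Greek Theater', 'CA'),
    ('Madison Square Garden', 'NY')]

cur.executemany('INSERT INTO venue VALUES (?,?);', venues)
conn.commit()

cur.execute('SELECT * FROM venue;')
for row in cur.fetchall():
    print(row)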
OUT:
('Red Rocks', 'CO')
('Santa Barbara Bowl', 'CA')
('Greek Theater', 'CA')
('Madison Square Garden', 'NY')
Now we would like to calculate the number of concert venues in each state.
This can be done by:
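The aggregation query is not reproduced here; a sketch, again assuming the table is named venue:

query = "SELECT venue_state, count(*) \
         FROM venue \
         GROUP BY venue_state;"
cur.execute(query)
for row in cur.fetchall():
    print(row)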
OUT:
('CA', 2)
('CO', 1)
('NY', 1)
The line GROUP BY venue_state splits the records into one group per state, and the
count is then applied to each group. From the result set, there are tuples containing each state and its associated
number of venues. The GROUP BY command is very powerful and can aggregate
across multiple grouping variables.
In the next section, we will get to the main purpose of relational databases:
joining datasets. This will allow for queries which can extract information
from across the database.
Joins bring a bit of complexity, so we will first study some join examples.
Then we will continue to build out the music artist example.
Joins
Common joins are INNER, OUTER, LEFT, and RIGHT. These options provide
flexibility for joining tables with missing data. When joining two tables L and
R, we call L the left table and R the right table. Let’s look at some sample
data to understand how the different joins work.
Table: L
id first name favorite pet
0 Taylor rabbit
1 Chris cat
2 Bono dog
Table: R
id first name favorite number
1 Chris 314
2 Bono 5
3 Cher 12
Suppose we join tables L and R on the id field, selecting all of the unique
columns. Both tables have records with id 1 and 2; we say these records
match. However, id=0 is only in table L, while id=3 is only in table R. The
type of join we should use will depend on which records we wish to keep.
A LEFT join will return all of the records from the left table, and only the
matching records from the right table. The result looks like this:
LEFT JOIN
id first name favorite pet favorite number
0 Taylor rabbit None
1 Chris cat 314
2 Bono dog 5
Since we used a LEFT join, only the records from table L were included, and so
Cher was excluded. Since Taylor is not in table R, we don’t have her favorite
number.
A RIGHT join will return all of the records from the right table, and only the
matching records from the left table. The result looks like this:
RIGHT JOIN
id first name favorite pet favorite number
1 Chris cat 314
2 Bono dog 5
3 Cher None 12
Since we used a RIGHT join, only the records from table R were included, and
so Taylor was excluded. Since Cher is not in table L, we don’t have her favorite
pet.
An INNER join will return only the matching records from both tables. This
means that any non-matching records will be excluded in the join. The result
looks like this:
INNER JOIN
id first name favorite pet favorite number
1 Chris cat 314
2 Bono dog 5
Inner joins produce simpler datasets as there are no missing values to handle.
However, this means that records may be dropped. Sorry Taylor and Cher!
An OUTER join will return all records from the tables. The result looks like
this:
OUTER JOIN
id first name favorite pet favorite number
0 Taylor rabbit None
1 Chris cat 314
2 Bono dog 5
3 Cher None 12
The outer join may produce records with missing data, as in this case.
Next, we return to our music artist data. We will create a table in the database
called hometown and pass the schema. This will hold the hometowns of the
artists.
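The schema query is not shown here; a sketch, assuming the second field is simply named hometown:

query = 'CREATE TABLE hometown (artist_name string, hometown string);'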
cur.execute(query)
conn.commit()
data_hometown = [
    ('Taylor Swift', 'West Reading, Pennsylvania'),
    ('Chris Stapleton', 'Lexington, Kentucky'),
    ('Bono Hewson', 'Dublin, Ireland'),
    ('Rihanna', 'Saint Michael, Barbados')
]
Notice that Rihanna is in this dataset, but not in the artist table. This will
have implications in the joins that we demo later. The following code will
insert the hometown records into the table:
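A sketch of the insertion, following the same pattern as before:

cur.executemany('INSERT INTO hometown VALUES (?,?);', data_hometown)
conn.commit()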
Next, we query the table to verify the data has been loaded:
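A sketch of the verification query:

cur.execute('SELECT * FROM hometown;')
for row in cur.fetchall():
    print(row)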
OUT:
('Taylor Swift', 'West Reading, Pennsylvania')
('Chris Stapleton', 'Lexington, Kentucky')
('Bono Hewson', 'Dublin, Ireland')
('Rihanna', 'Saint Michael, Barbados')
This is correct. Next, let’s try out some joins on the hometown and artist
tables. We can use the common field artist name for the join. First, we run
an INNER JOIN:
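The join query itself is not reproduced in this excerpt; a sketch consistent with the output below (the hometown field name is an assumption):

query = "SELECT hometown.artist_name, hometown.hometown, artist.genre \
         FROM hometown \
         INNER JOIN artist \
         ON hometown.artist_name = artist.artist_name;"
cur.execute(query)
for row in cur.fetchall():
    print(row)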
OUT:
('Taylor Swift',
 'West Reading, Pennsylvania',
 "['country','pop']")
('Chris Stapleton',
 'Lexington, Kentucky',
 "['country','soul','rock','bluegrass']")
('Bono Hewson',
 'Dublin, Ireland',
 "['rock','pop']")
Three records came back from the query, and they are formatted for readabil-
ity. Since we did an INNER JOIN and Rihanna was not in the artist table,
she was dropped from the join. The other three artists were in both tables
and were retained.
There is some new syntax: a query with multiple tables needs to request
fields with their table names, as in hometown.artist_name. This makes things
clearer, and avoids a collision when the same field is in both tables. The line

ON hometown.artist_name = artist.artist_name

specifies the join condition: records are matched when the artist name agrees in both tables.
Next, let’s run a LEFT JOIN. This can be accomplished with a small code
change: replace INNER JOIN with LEFT JOIN.
OUT:
('Taylor Swift',
 'West Reading, Pennsylvania',
 "['country','pop']")
('Chris Stapleton',
 'Lexington, Kentucky',
 "['country','soul','rock','bluegrass']")
('Bono Hewson',
 'Dublin, Ireland',
 "['rock','pop']")
('Rihanna',
 'Saint Michael, Barbados',
 None)
Since we are treating hometown as the left table, we get back each record
from this table. Since Rihanna was not in the artist table, we don’t have the
associated data for genre, and it appears as None.
An organization's data storage system needs to satisfy several requirements:
• Variety: The data may come from a variety of sources and it may have
different levels of structure. There may be a need to collect transactional
data, which is very structured, as well as documents, images, and videos,
which are unstructured.
• Volume: There may be a massive amount of data to store and retrieve. It is
important to support these requirements at the necessary scale.
• Access: Different teams and users will have different permissions. A team
that performs analytics on home equity loans might not need access to stu-
dent loan data. A business intelligence team might need to compute analytics
across all of the lines of business. An ideal system would make it easy for
an administrator to grant and monitor access.
• Cataloging: It is valuable for permissioned users to browse datasets at a
high level, and request access to what they need. Metadata saved with the
datasets can help users find what they need.
There may be many other organizational requirements as well. Even with this
short list, however, a single database likely won’t be sufficient. In response,
several objects and concrete services have emerged. A very brief overview
follows next, but bear in mind that each of these topics could fill a book of its
own. Let's start with the data warehouse.
We reviewed some SQL commands for creating tables, inserting data, re-
trieving data, counting and aggregating, and joining tables. We used SQLite
inside Python with the help of an API. The API includes functions for work-
ing with SQL, and so the code is a little different from writing straight SQL.
We discussed the importance of data scientists knowing SQL well, as it will
be used regularly.
For a complete storage solution at an organization, one database generally
won’t be sufficient. A robust storage strategy will scale, handle a variety of
data, and allow easy, transparent data cataloging and access. A data ware-
house is a central place for integrating data from multiple sources. It only
stores structured data, which limits its effectiveness. A data lake can be a
central data store for semi-structured and unstructured data. It can scale, but
its quality can be compromised since there is no structure imposed. A data
lakehouse is a hybrid of a data lake and a data warehouse. It offers the ability
to scale and store structured data. It is frequently touted as the way forward
in data storage solutions.
In the next chapter, we will spend some time building our mathemati-
cal toolbox. From there, we will build our statistical knowledge up from our
probability foundation. This will make our analytical capabilities more robust.
Going Further
Many data storage tools were introduced in this section. Here are some starting
points for further investigation:
An outline and comparison of the database, data warehouse, and data lake:
https://aws.amazon.com/data-warehouse/
7.7 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
8
Mathematics Preliminaries
Subset
For sets A and B, we say A is a subset of B (written A ⊆ B) if every element
of A is also an element of B. This allows for the possibility that A and B
contain exactly the same elements, in which case A = B. In the case where
every element of A is an element of B and they are not equal (for example, B
contains additional elements), then A is a proper subset of B, written A ⊂ B.
Empty Set
The empty set, denoted {} or ∅, contains no elements and is a subset of every
set. This set plays a prominent role, for example when testing if a filter returns
any elements from a dataset.
Cardinality
The size or cardinality of a set A denoted |A| is the number of elements in the
set. Here is a small code example of creating an empty set and computing its
cardinality:
empty = set()   # note: {} would create an empty dictionary, not an empty set
print(empty)
print(len(empty))

OUT:
set()
0
Set Operations
For a universal set of all elements U with subsets A and B, it will be useful
to find:
• The complement of A, which is a set containing elements in U but not in A,
denoted Ac
• The set of elements in A and B; this is their intersection, A ∩ B
• The set of elements in A, B, or both; this is their union, A ∪ B
• The set of elements in A but not in B; this is their set difference, A − B
For example, consider the universal set U = {a, b, c, d, e, f, g} with subsets
X = {a, c, e}, Y = {a, c}, and Z = {b, c, d}. Then:

Xᶜ = {b, d, f, g}
X ∩ Z = {c}
X ∪ Z = {a, b, c, d, e}
X − Y = {e}
The same operations can be carried out in Python using set objects:

U = {'a','b','c','d','e','f','g'}
X = {'a','c','e'}
Y = {'a','c'}
Z = {'b','c','d'}

print('X_complement:', U - X)
print('X intersection Z:', X.intersection(Z))
print('X union Z:', X.union(Z))
print('X minus Y:', X - Y)

OUT:
X_complement: {'g', 'b', 'f', 'd'}
X intersection Z: {'c'}
X union Z: {'b', 'd', 'e', 'c', 'a'}
X minus Y: {'e'}
Membership can be checked with the in and not in operators:

print('g' in Z)
print('g' not in Z)
print('b' in Z)
print('b' not in Z)
OUT:
False
True
True
False
The results follow from the logic of g not belonging to Z, and b belonging to
Z.
8.2 Functions
For sets A and B, a function is a rule or mapping that takes an element of
A as input and assigns an element of B as output. This is commonly written
f : A → B for a function f . The space of inputs A is called the domain, and
the space of outputs B is called the range. The value of the function at a
point x can be written f (x). Note that for each input value, there can only
be one output value. The vertical line test is a graphical method for checking
if a curve is a function. If a vertical line intersects the curve in more than one
place, then it is not a function. Note that we might define a function using the
y notation as in y = 2x or using the f (x) notation as in f (x) = 2x to clarify
the functional dependence on x.
Function Composition
For functions f and g, the composition (f ∘ g)(x) = f(g(x)) applies g first and then applies f to the result.
Function Inverse
A function g is the inverse of f, written f⁻¹, if g(f(x)) = x for every x in the domain of f.
Monotonicity
Some functions have the interesting property that they never increase or never
decrease as x increases. A function that increases or remains constant as x
increases is called a monotonically increasing function, as in Figure 8.2. A
function that decreases or remains constant as x increases is called a mono-
tonically decreasing function.
A function is strictly increasing if it always increases as x increases, and it
is strictly decreasing if it always decreases as x increases. Functions meeting
either of these conditions are strictly monotone.
Convexity
y = a₀ + a₁x + · · · + aₙxⁿ
Notice the right triangle formed in the figure. The horizontal side formed from
the x-axis is called the adjacent side, the vertical side dropped from the circle
to the x-axis is called the opposite side, and the side forming the radius is
called the hypotenuse. The trigonometric functions sine, cosine, and tangent
are defined as ratios of these triangle sides:
sin(θ) = opposite/hypotenuse
cos(θ) = adjacent/hypotenuse
tan(θ) = sin(θ)/ cos(θ) = opposite/adjacent
Since the hypotenuse has length 1 for the unit circle, it follows in this example
that the adjacent side is equal to cos(θ) and the opposite side is equal to sin(θ).
Figure 8.6 shows the periodic nature of the sine and cosine functions, which is
useful for modeling cyclical data (e.g., sales cycles, seasonal variation). Their
patterns are identical but shifted by a fixed quantity.
The exponential function with base a (for any a > 0) and rational number
x can be written y = a^x. The inverse of the exponential function is the logarithm
with base a, which can be written x = log_a(y). The logarithm can be
determined by answering the question “the base to which power is y?” For
example, in solving log₂(8), we ask: 2 to which power is 8? The solution is 3.
A very special base is the number e, which can be understood like this:
compute the sum

Sₙ = 1 + 1/1! + 1/2! + ... + 1/n!

for some n. For example,

S₂ = 1 + 1/1! + 1/2! = 2.5
S₃ = 1 + 1/1! + 1/2! + 1/3! ≈ 2.67
S₄ = 1 + 1/1! + 1/2! + 1/3! + 1/4! ≈ 2.71

More formally, e can be defined with a statement using a limit, where the
limit will be formally defined in the next section on differential calculus:

e = lim_{n→∞} Sₙ

The exponential function y = e^x is very common and it appears in probability
functions that we will soon encounter. The inverse of this function is a
logarithmic function; it is given the special notation ln(x), which denotes the
natural logarithm. The base is generally not written, as it is understood to be
e. For example, ln(e³) = 3, as e to the power of 3 is e³. A second example is
ln(1) = 0, as e to the power of 0 is 1.
Figure 8.7 shows the functions e^x and ln(x). As x increases, e^x increases
to +∞. As x decreases, e^x tends to zero (it approaches an asymptote at zero).
Its domain is therefore (−∞, +∞), while its range is (0, +∞). The function
ln(x) increases as x increases, but the rise is much slower than e^x. As x
decreases to zero, ln(x) approaches −∞. The domain is (0, +∞), while the
range is (−∞, +∞). As e^x and ln(x) are inverses, their domains and ranges
are reversed.
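A quick numerical check of the inverse relationship, using numpy:

import numpy as np

print(np.log(np.exp(3)))   # the natural log undoes e**x: 3.0
print(np.log2(8))          # 2 to which power is 8? 3.0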
Vectorized Functions
Some modules can apply a function to every element of an array at once; this is
known as vectorization. For example, numpy's square root function can be applied
directly to a list:

import numpy as np

x = [1,4,9,16]
np.sqrt(x)
OUT:
array([1., 2., 3., 4.])
In this case, the function was applied to each element in the list:
(√1, √4, √9, √16). Note that not every module supports vectorization. If we
attempt this with the math package, it emits an error:
import math

x = [1,4,9,16]
math.sqrt(x)
OUT:
TypeError: must be real number, not list
The error indicates that sqrt() from the math package should be applied to
a single value.
Limit
Consider the sequence aₙ = 1/n, whose terms are 1, 1/2, 1/3, 1/4, . . .
As we look at terms deeper into the sequence (with larger n), they get closer
to zero. None of the terms will be equal to zero, but we can get arbitrarily
close to zero. In fact, there is a term a_N such that all subsequent terms are
sufficiently close to zero. For example, if we wish to be within 1/10th of zero,
then terms beyond a₁₀ = 1/10 will suffice. We can express this notion by saying
that as n increases, the terms aₙ tend to zero, or that they have a limit of 0.
We can also say that the sequence converges to zero, denoted:

lim_{n→∞} aₙ = 0
A sequence diverges if it does not converge to any number.
Next we turn from the limit of a sequence to the limit of a function. The
central idea is the same: getting arbitrarily close to some value (the limit) by
changing the input value. For the limit of a sequence, we can get arbitrarily
close by sufficiently increasing n. For the limit L of a function f (x), we can
get arbitrarily close by moving x closer to some point ξ. We say that the value
of f (x) tends to limit L as x tends to ξ, which can be written symbolically:
lim_{x→ξ} f(x) = L

For example, lim_{x→−∞} e^x = 0 and lim_{x→∞} e^x = ∞.
Continuity
Derivative
What is the derivative of f(x) = x² at some point x? The slope of the tangent
to this curve will change as x changes, so this is more complicated than the
earlier cases. We will return to this example shortly.
When a function has a derivative at a point x, we say f (x) is differentiable
at x. A function which is differentiable at each point in its domain is called
a differentiable function. Since continuity is required for differentiation, this
means that a differentiable function over some domain is also a continuous
function over this domain. The opposite is not true, however: if a function is
continuous over some domain, it does not imply that the function is differen-
tiable over the domain. An example is the absolute value function f (x) = |x|,
defined as −x when x < 0 and x when x ≥ 0. This function is continuous over
its domain (−∞, ∞), but it is not differentiable at the vertex x = 0 where the
slope changes sign abruptly. In particular, to the left and right of zero, the
slope (and derivative) is –1 and 1, respectively.
The discussion of the derivative thus far has been qualitative. Next, we consider
the formal definition, which is entwined with the limit. The derivative
of y = f(x) at point x is denoted y′ = f′(x) using Lagrange notation, and
alternatively as dy/dx, df(x)/dx, or (d/dx)f(x) using Leibniz notation. For a constant
h > 0, the derivative is defined as

f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
Breaking down this definition, the fraction on the right side of the equation
is the ratio of change in function value, or output, to change in input where h
denotes the change. This change h is taken to zero in the limit, and the ratio
approaches the tangent at the point x. For this derivative to exist, it needs to
be possible to get arbitrarily close to f (x) by moving h arbitrarily close to 0.
This means that f (x) must be continuous at x.
Let's return to the earlier example and apply the definition to f(x) = x²:

f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
      = lim_{h→0} [(x + h)² − x²] / h
      = lim_{h→0} [x² + 2xh + h² − x²] / h
      = lim_{h→0} [2xh + h²] / h
      = lim_{h→0} h(2x + h) / h
      = lim_{h→0} (2x + h)
      = 2x
The final step in the calculation applies the limit, sending h to zero. This
implies that the derivative f′(x) = 2x is dependent on the point x. For example,
f′(−10) = −20, f′(−5) = −10, f′(0) = 0, f′(5) = 10, and f′(10) = 20.
Recalling the graph of the parabola and considering tangents to the points
x = −10, x = −5, x = 0, x = 5, and x = 10, the slope is most negative at
x = −10, it increases to zero at the parabola vertex x = 0 (the tangent is a
horizontal line at this point), and it is most positive at x = 10.
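The limit definition can also be checked numerically; a small sketch comparing the difference quotient of f(x) = x² (with a small h) against 2x at the points mentioned above:

def diff_quotient(f, x, h=1e-6):
    # approximate the derivative using the definition with a small, fixed h
    return (f(x + h) - f(x)) / h

f = lambda x: x**2
for x in [-10, -5, 0, 5, 10]:
    print(x, round(diff_quotient(f, x), 3), 2*x)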
The derivative of a function can help us find extrema, or maxima and min-
ima. In viewing a graph from left to right, suppose the function decreases,
flattens, and then increases as with f(x) = x². The derivative for the corre-
sponding intervals would be negative, zero, and then positive. At the value
where the derivative is zero, say x = c, f (c) would attain a minimum. Con-
versely, for a graph that increases, flattens, and then decreases, the derivative
would be positive, zero, and then negative. Where the derivative is zero, the
function would attain a maximum. This leads to a definition and a useful
application of the derivative to find extrema.
A point x = c is a critical point of the function f(x) if f(c) exists and
either f′(c) = 0 or f′(c) does not exist. Function maxima and minima occur
at critical points or at the endpoints of intervals. As an example, the function
f(x) = x² has derivative f′(x) = 2x and it is equal to zero when x = 0. Hence,
x = 0 is a critical point. The derivative is negative when x < 0 and positive
when x > 0, indicating that the function reaches a minimum at x = 0.
The strategy for finding critical values to locate extrema is very useful,
and we will use it later when we select optimal parameter values for regression
models. An error function will depend on the parameter values, and we will
compute the derivative of the function and look for a critical value. At the
critical value, we will find the parameters that minimize the error.
It is often useful to compute derivatives of the derivatives. If f is a differentiable
function and f′ is its derivative, then the derivative of f′, when it exists,
is written f″ or f⁽²⁾ and called the second derivative of f. The interpretation
of f″ is that it represents the curvature of a function. Similarly, higher-order
derivatives can be defined by taking further derivatives as they exist: f‴ is
the third derivative and f⁽ⁿ⁾ is the nth derivative. For example, we determined
the derivative of f(x) = x² to be f′(x) = 2x. We could apply the definition
of the derivative to compute f″(x), or we can take a shortcut, realizing that
this function is a line with slope 2. Thus, f″(x) = 2; the second derivative
is therefore constant and positive. We can go further, computing the third
derivative as f‴(x) = 0, since the second derivative is a horizontal line with
tangent 0. Next, we will present some rules for computing derivatives. It is
faster to use these rules than to apply the definition.
Differentiation Rules
We assume the functions f (x) and g(x) are differentiable on the domain of
interest.
Multiplication by a Constant
For constant c and function f(x),

(d/dx)[c f(x)] = c f′(x)

In this rule, the constant comes out of the differentiation.
Example:

f(x) = 3x
f′(x) = 3 (d/dx)(x)
f′(x) = 3(1) = 3

Since the derivative of a sum is the sum of the derivatives, the same idea lets us
differentiate h(x) = 3x + 4 term by term:

h(x) = 3x + 4
h′(x) = 3 (d/dx)(x) + (d/dx)(4)
h′(x) = 3 + 0 = 3
Power Rule
For integer n and function f(x) = xⁿ,

f′(x) = nxⁿ⁻¹

Example 1:

f(x) = x⁴
f′(x) = 4x³

Example 2:

f(x) = 3x⁵
f′(x) = 3 (d/dx)(x⁵)
f′(x) = 15x⁴
Product Rule
For h(x) = f(x)g(x), h(x) is differentiable and

h′(x) = f(x)g′(x) + g(x)f′(x)

Example:

h(x) = x²(3x + 1)
h′(x) = x²(3) + (3x + 1)(2x)
h′(x) = 3x² + 6x² + 2x
h′(x) = 9x² + 2x
Quotient Rule
For h(x) = f(x)/g(x) where g(x) ≠ 0, h(x) is differentiable and

h′(x) = [g(x)f′(x) − f(x)g′(x)] / [g(x)]²

Example:

h(x) = x / (x² + 1)
h′(x) = [(x² + 1)(1) − x(2x)] / [x² + 1]²
h′(x) = (1 − x²) / [x² + 1]²
Chain Rule
For a composition of functions h(x) = f(g(x)), h(x) is differentiable and

h′(x) = f′(g(x)) g′(x)

Example 1:

h(x) = (2 − x)³
h′(x) = 3(2 − x)²(−1)   (apply power rule to outer)
h′(x) = −3(2 − x)²

For this problem, it is important to realize that (d/dx)(2 − x) = −1.

Example 2:

h(x) = e^(2x)
h′(x) = e^(2x) (d/dx)(2x)
h′(x) = 2e^(2x)
There are several additional derivative rules for various functions, such as the
trigonometric functions. They are out of scope for the purposes of this book.
Partial Derivative
The gradient is a vector , which is a quantity with both direction and mag-
nitude. Earlier, we learned that the derivative can be used to find extrema,
which can occur where the derivative is zero or undefined (at critical points).
For a function of several variables, the gradient can be used in an analogous
way to find extrema, as we will see later.
For going deeper on calculus, a masterful pair of books is [26] and [27].
8.4 Probability
Our data will include variables subject to uncertainty or randomness, and this
can be treated effectively with tools from probability. A random experiment
has an output that cannot be predicted with certainty. When the experiment
is repeated a large number of times, however, the average output exhibits
predictability. The roll of an unbiased, six-sided die will have an outcome that
cannot be predicted, but after a large number of rolls, each number should
appear about 17% of the time.
• The sample space is the set of all possible outcomes of the experiment. It
is generally denoted Ω. For an experiment where a fair coin is tossed twice,
Ω = {hh, ht, th, tt} where h and t are heads and tails, respectively.
• The events are outcomes of the experiment. They are subsets of Ω. We can
denote the set of all events A. A particular event from this example is {hh}.
• The probability is a number associated with each event A, written P (A)
and falling in [0, 1]. The probability of an event increases to 1 as it is more
likely. The frequentist idea provides intuition: for a large number of repeated
trials, the probability of an event will be the number of times the event occurs
divided by the number of trials.
• Let X be the number of heads tossed after two flips of a fair coin
• Let Z denote the number of points scored by the winning team in a Super
Bowl
• Let Y be the sum of the values showing after two rolls of a fair, six-sided die
For this last example, the probability that the two rolls sum to 5 is

P(Y = 5) = #(ways of achieving 5) / #(possible outcomes)
         = |{(4, 1), (1, 4), (2, 3), (3, 2)}| / 36
         = 4/36 = 1/9
Conditional Probability
The conditional probability of Y given X, written P(Y|X), is defined as

P(Y|X) = P(X ∩ Y) / P(X)
where P (X ∩ Y ) denotes the probability of both X and Y occurring. For
our example, the numerator represents the probability that the two rolls sum
to 5, with the restriction that at least one die is 1. We saw there are two
such outcomes out of 36 possible outcomes, for a probability of 2/36. The
denominator works out to 12 possible rolls involving one or more 1s out of
36 possible outcomes, for a probability of 12/36. The calculation is done like
this:
P(Y = 5 | X = 1) = (2/36) / (12/36)
                 = 2/12 = 1/6
When the outcome of X does not influence the outcome of Y , we say that
X and Y are independent. This can be written P (Y |X) = P (Y ). In data
science, the goal is to find random variables X1 , X2 , . . . , Xp which predict Y .
Predictors which are independent of Y would not be helpful. We will see later
that prediction relies heavily on conditional probability, as important data is
included to produce refined results.
We are sometimes in the fortunate position of being able to state the prob-
ability of the random variable taking each possible value. This increases the
certainty of the outcome. The probability mass function (pmf) for a discrete
random variable is a function that takes each possible value as input and re-
turns its probability as an output. For a random variable Y equal to the value
showing on a fair, six-sided die, the pmf looks like this:
f(y) = P(Y = y) =  1/6   if y = 1
                   1/6   if y = 2
                   1/6   if y = 3
                   1/6   if y = 4
                   1/6   if y = 5
                   1/6   if y = 6
Y denotes the random variable, while y denotes the specific value taken. Given
equal probabilities for each of the outcomes, this distribution is an example
of a discrete uniform distribution.
The cumulative distribution function (cdf) for a discrete random variable
is a function that takes the possible values as input and returns the probability
of realizing that value or lower as output. The cdf for the die looks like this:
F(y) = P(Y ≤ y) =  1/6   if y = 1
                   2/6   if y = 2
                   3/6   if y = 3
                   4/6   if y = 4
                   5/6   if y = 5
                   1     if y = 6
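These functions are available in scipy for common distributions; a quick sketch for the fair die, modeled as a discrete uniform on the integers 1 through 6:

from scipy import stats

die = stats.randint(1, 7)   # discrete uniform on 1, 2, ..., 6
print(die.pmf(3))           # 1/6 ≈ 0.1667
print(die.cdf(3))           # 3/6 = 0.5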
For a continuous random variable, there is zero probability that it will achieve
any particular value. For some intuition, consider the case where there are n
outcomes which have equal probability 1/n. When n = 6, each outcome has
probability 1/6. After taking the limit n → ∞, the probability of each outcome
is 1/∞ = 0. Rather than measuring the probability of particular values, we
measure the probability that a continuous random variable assumes a value
in a range, say P (0 < Y < 1).
Since a continuous random variable takes infinitely many values, a proba-
bility mass function is not appropriate. Instead, a probability density function
(pdf) is used. The fundamental difference between the pmf and the pdf is
that the pdf value at a single point does not have meaning. For computing
the cumulative probability F(y) = P(Y ≤ y), the sum will not make sense;
instead, the area under the pdf curve is computed by using integration. We
won’t discuss integrals here, but rather will review some continuous random
variables where integrals are not explicitly needed. First, we start with some
examples of discrete random variables.
Bernoulli Distribution
A Bernoulli random variable models a single trial with two outcomes: it takes the value 1 (success) with probability p and the value 0 (failure) with probability 1 − p.
Binomial Distribution
A binomial random variable Y counts the number of successes in n independent Bernoulli trials, each with success probability p. The probability of y successes is

P(Y = y) = (n choose y) p^y (1 − p)^(n−y)

where (n choose y) = n!/(y!(n − y)!) is the number of ways that y successes can happen.
For example, if n = 10 and y = 1, then a success can occur in any trial and (10 choose 1) = 10.
Uniform Distribution
A Uniform(a, b) random variable is equally likely to fall anywhere in the interval (a, b).
We can envision the pdf as a rectangle having length b − a, height 1/(b − a),
and area 1. The cdf F(y) is found by computing the area under the rectangle
starting at a and ending at y:

F(y) =  0                  if y ≤ a
        (y − a)/(b − a)    if a < y < b
        1                  if y ≥ b
Normal Distribution
A normal random variable with mean µ and standard deviation σ has pdf

f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)),   −∞ < x < ∞

For the standard normal distribution, µ = 0 and σ = 1, and the pdf simplifies to

f(z) = (1/√(2π)) e^(−z²/2),   −∞ < z < ∞
The cdf is computed numerically. The following code snippet shows some
probability calculations for a standard normal. It uses the scipy module which
includes functionality for statistics, optimization, linear algebra, and more.
A Note on numpy
The numpy module also supports work with random variables; it is an essential
package for numerical computing in Python. Numpy defines a highly efficient
object for arrays called the NumPy array. We will work with NumPy arrays
throughout the book.
from scipy import stats

# compute as 1 - F(0)
print('P(Z > 0):', 1-stats.norm.cdf(0))

OUTPUT:
P(Z > 0): 0.5
The output indicates some interesting facts about the standard normal distri-
bution. First, half of the probability is below Z=0 and half is above, by sym-
metry. Roughly 68% of the probability is between –1 and 1, or one standard
deviation of the mean (which is zero). There is about 95% of the probability
within two standard deviations of the mean, and 99.7% (nearly all) of the
probability within three standard deviations of the mean. These numbers are
so common in probability and statistics that they are worth remembering.
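These figures can be verified directly from the cumulative distribution function; a quick sketch:

from scipy import stats

print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # ~0.6827
print(stats.norm.cdf(2) - stats.norm.cdf(-2))   # ~0.9545
print(stats.norm.cdf(3) - stats.norm.cdf(-3))   # ~0.9973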
For simulation and other applications, it is valuable to generate draws of
random variables from various distributions. This is easily accomplished with
software, which uses a seed paired with a pseudorandom number generator. For
a given seed, it will generate a fixed sequence of numbers. Here is an example
of generating standard normals, where two values are taken (size=2) at a
time in two separate runs. Because the same seed is set before each run, the two runs produce identical values.
import numpy as np

np.random.seed(seed=314)
print('run 1:', stats.norm.rvs(size=2))

np.random.seed(seed=314)
print('run 2:', stats.norm.rvs(size=2))
Figure 8.10 shows a histogram based on 100,000 random draws from the stan-
dard normal distribution. Notice the bell shape and domain where most values
fall (between –3 and 3). The source code is minimal.
sns.histplot(stats.norm.rvs(size=100000))
If we center the sample mean X̄n by subtracting µ and scale by its standard
deviation σ/√n, the resulting variable will converge to a standard normal.
Let's state the central limit theorem: as n → ∞,

√n (X̄n − µ)/σ → N(0, 1)
Probability is a vast subject and we have only scratched the surface. See [28]
for a broader treatment of the subject.
Matrices (plural of matrix) will be useful when we organize our data for calcu-
lations. They also provide a compact notation for solving systems of equations,
which we encounter in the upcoming regression modeling.
Here is a 2 × 3 matrix:

X = [ 1   5   0
      2  12  −1 ]

A matrix with one row is called a row vector, while a matrix with one column
is called a column vector. A vector with only one element is a scalar.
Matrices with the same dimensions can be added element by element. For the
3 × 3 matrices A and B used below,

A + B = [ 1   5   0 ]   [  1  4  0 ]
        [ 2  12  −1 ] + [ −1  0  6 ]
        [ 4  12   3 ]   [  0  1  0 ]

      = [ 1+1      5+4    0+0  ]
        [ 2+(−1)  12+0   −1+6 ]
        [ 4+0     12+1    3+0 ]

      = [ 2   9  0 ]
        [ 1  12  5 ]
        [ 4  13  3 ]
Matrix Multiplication
For an m × n matrix A and an n × p matrix B, the product AB is the m × p matrix whose (i, j) entry is the sum of the products of the elements in row i of A with the corresponding elements in column j of B; the number of columns of A must match the number of rows of B.
Next, let’s look at a code example. We create the matrices A and B as NumPy
arrays. The operator @ supports matrix multiplication. Note that the operator
∗ is element-wise multiplication and not matrix multiplication. The character
\n inserts carriage returns for nicer output.
A = np.array([[1,5,0],[2,12,-1],[4,12,3]])
B = np.array([[1,4,0],[-1,0,6],[0,1,0]])
print('A: \n',A)
print('')
print('B: \n',B)
print('')
print('AB: \n', A@B)
OUTPUT:
A:
[[ 1 5 0]
[ 2 12 -1]
[ 4 12 3]]
B:
[[ 1 4 0]
[-1 0 6]
[ 0 1 0]]
AB:
[[ -4 4 30]
[-10 7 72]
[ -8 19 72]]
Let’s define another matrix D with dimensions that do not support matrix
multiplication with A. Then we can see how Python handles such a case.
D = np.array([[1,4,0],[-1,0,6]])
print('D: \n', D)
print('')
print('A dimensions:', A.shape)
print('D dimensions:', D.shape)
print('AD: \n', A@D)
OUTPUT:
D:
[[ 1 4 0]
[-1 0 6]]
A dimensions: (3, 3)
D dimensions: (2, 3)

The final line raises an error: the inner dimensions (3 and 2) do not match, so NumPy reports a ValueError for the attempted matrix multiplication.
Transpose
The transpose of a matrix A, written Aᵀ, swaps its rows and columns. For example,
the first row of A, [1, 5, 0], becomes the first column of Aᵀ. Note
that [1, 5, 0] is a row vector, and taking its transpose made it a column vector.
Taking the transpose of a column vector makes it a row vector.
Matrix Diagonal
The diagonal of a square matrix consists of the elements whose row and column indices are equal.
Identity
The identity matrix Iₙ is the n × n matrix with ones on the diagonal and zeros elsewhere; multiplying any compatible matrix by Iₙ leaves the matrix unchanged.
The zero vector and zero matrix contain all zeros. Here are examples of each:
0ᵀ = [0  0  0]

0 = [ 0  0  0
      0  0  0
      0  0  0 ]
Inverse
For a square matrix A, the inverse A⁻¹ (when it exists) is the matrix satisfying AA⁻¹ = A⁻¹A = Iₙ.
8.7 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
1. Consider two sets A and B. How could you prove they are equal?
U = {a, b, c, d, e, f, g}
X = {a, b, c}
Y = {d, e, f }
Z = {b, c, g}
f(x) =  −x   if x ≤ −1
         x   if x > −1
22. S A couple has two children. One child is known to be a girl. What
is the probability that the other child is a girl?
23. A fair coin is flipped five times. What is the probability of three
heads?
24. S A random variable U follows a Uniform(0,1). What is the prob-
ability that U is between 1/3 and 2/3?
25. S True or False: For a random variable X following a normal dis-
tribution, the probability that X=0 is zero.
26. S A random variable X follows a normal distribution with mean 10
and standard deviation 2. What is the probability that X is between
8 and 14?
27. S A random variable follows a Weibull distribution with shape pa-
rameter 1 and scale parameter 2. One thousand independent draws
are made and the mean value is computed. This is repeated 100,000
times. What is the approximate distribution of the means?
28. Consider symmetric matrix A and its transpose B. True or False:
A is equal to B.
29. True or False: It is possible to add two matrices if they have different
dimensions.
30. S Consider a vector v which contains zeroes in the first 99 positions
and a value of 1 in the final position. A second vector w contains a
1 in the first position and zeroes in the next 99 positions. What is
the inner product of v and w?
31. Verify that for a square n × n matrix M and the identity matrix
In , the matrix product MIn = In M = M.
32. Use numpy to create 3×3 matrices A and B populated with random
values. Calculate A + B and A − B. Verify the results by hand.
33. Use numpy to create a 2×3 matrix A and a 3×3 matrix B. Populate
A and B with random values. Calculate AB. Verify the results by
hand.
9
Statistics Preliminaries
Frequency Count
For data with low cardinality (say below 50 unique levels), we can count
the number of times that each value is taken. These occurrences can be
called frequency counts or simply counts. It is also useful to calculate the
percentage of times each value is taken. The values with their percentages
are then equivalent to the probability mass function. For example, given
observed data S = (1, 1, 1, 2, 3, 5000), the values and associated counts are
F = {1 : 3, 2 : 1, 3 : 1, 5000 : 1}. The values, counts, and percentages can be
assembled in a table like the following:
value   count   percentage
1       3       50.0%
2       1       16.7%
3       1       16.7%
5000    1       16.7%
For data with high cardinality, the values can be placed into bins, and then
counts and percentages can be computed for each bin. This is what is done
when creating a histogram. The table below summarizes counts and percent-
ages for a large number of values measured to the thousandths place, such as
0.412 and 15.749. The bin [0, 5) will contain values including the left endpoint
of 0 and excluding the right endpoint of 5.
For a large number of bins, the data distribution can be better understood
with a graph of the bins on the x-axis and their occurrences on the y-axis (as
in a vertical bar chart).
Central Tendency
The central tendency or center of the data is one of the most common and
useful attributes. It can be measured by the mode, median, or mean. The mode
is the most common value taken by the variable. Given frequency counts as in
F, the mode is equal to the key with the highest count (1 in this case). The mode
will not be influenced by outliers, as they are rare by definition. A shortcoming
of the mode is that it is purely frequency based.
The median is the value in the “middle” of the data. To be more precise,
we can first define the pth percentile of the data to be the value which exceeds
p percent of the data. Then the median can be defined as the 50th percentile
of the data; half of the values are above and below the median. For an odd
number of data points, this can be found by sorting the values and taking the
middle point. For an even number of data points, the median is computed as
the midpoint of the two middle values. For the example S, the median will
be 1.5, which is the midpoint of 1 and 2. Like the mode, the median is not
influenced by outliers. The median gives a better estimate of center since it
uses the ranking of values.
The average or sample mean of the data applies equal weight to each
observation in estimating where the data “balances.” It is a statistic that
estimates the population mean, denoted µ. The sample mean is computed by
summing the values and dividing by the number of observations n. For values
x1 , x2 , . . . , xn drawn from random variable X, the formula is
x̄ = (1/n) Σ_{i=1}^{n} x_i
Extreme observations will have a large influence on the sample mean; for mea-
suring the center in such a case, the median will be better. From our example
S, the sample mean is approximately 835 due to the large outlier. While this
value might give a sense of where the data balances, it does not represent most
of the data. When we see the median and sample mean together for a dataset,
their similarity or difference provides additional information. When the two
statistics are similar, this indicates symmetry in the data distribution. When
the sample mean is much larger (smaller) than the median, this indicates one
or more large (small) outliers.
Spread
The dispersion or spread of the data gives a sense of how the values are
distributed about a central value. For a spread of zero, all values would be the
same; such a variable would not be random at all. The spread is a measure
of the variable’s uncertainty, and several statistics are used: the range, the
interquartile range (IQR), the variance, and the standard deviation.
The range is the difference between the maximum and minimum values of
the data. It is simple to calculate and intuitive, but influenced by outliers, as
we see with our sample data where range(S) = 5000 − 1 = 4999.
For measuring the interquartile range, we first sort the data from lowest
to highest and divide the sorted values into four equal parts, or quartiles.
The lowest quartile of the data ranges from the minimum to Q1 (the 25th
percentile), the second quartile ranges from Q1 to Q2 (the 50th percentile or
median), the third quartile ranges from Q2 to Q3 (the 75th percentile), and
the fourth quartile ranges from Q3 to Q4 (the maximum). The middle 50%
of the data is then found between Q1 and Q3.
The interquartile range is calculated as Q3 − Q1, which represents the
range covered by the middle 50% of the data. Since the IQR removes the
lowest and highest 25% of the data, it is not sensitive to outliers. For our data
sample, IQR(S) = Q3 − Q1 = 2.75 − 1.0 = 1.75. This is substantially lower
than the range as it removed the outlier.
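These statistics can be verified for the sample S; a quick sketch with numpy (whose default percentile method matches the values above):

import numpy as np

S = np.array([1, 1, 1, 2, 3, 5000])
print(np.median(S))                 # 1.5
print(round(S.mean(), 2))           # 834.67, pulled upward by the outlier
print(S.max() - S.min())            # range: 4999
q1, q3 = np.percentile(S, [25, 75])
print(q3 - q1)                      # IQR: 2.75 - 1.0 = 1.75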
A five-number summary is very common and useful for reducing a data
distribution to five statistics, namely the minimum, Q1, median, Q3, and max-
imum. These numbers can be depicted graphically with a boxplot as in Figure
9.1. The example boxplot summarizes randomly generated data following a
standard normal distribution. Notice that the median does not necessarily
need to place in the middle of the boxplot. Here is a list of the boxplot com-
ponents:
z = stats.norm.rvs(size=200)
sns.boxplot(z, orient='v', whis=1000)
Comovement
In a hypothesis test, a claim about a population parameter θ is stated as a null
hypothesis H0 and weighed against an alternative hypothesis HA. A common structure is

H0: θ = θ0
HA: θ ≠ θ0
For example, we might hypothesize that the mean batting average of a baseball
player is 0.300. In this case, the parameter of interest is the population mean
µ, the null hypothesis is H0: µ = 0.300, and the alternative is HA: µ ≠ 0.300.
To provide evidence to reject the null hypothesis H0 , we can collect a
sample of data from the population and compute an estimate called the test statistic.
1 Note that other hypothesis structures are possible, such as testing θ > θ0 versus θ ≤ θ0 .
Suppose, for example, that the hypotheses are

H0: θ ≥ 0
HA: θ < 0
and the test statistic is –0.25. The question is then: “What is the probability
of a test statistic of –0.25 or less assuming that θ ≥ 0?” Intuitively, if this
probability (or p-value) is very large, this means that H0 is likely true. How-
ever, if the p-value is very small, then there is evidence to reject H0 in favor
of HA .
There are two possible errors that can be made with this approach. If we
reject the null when it is true, this is called a Type I error . If we fail to reject
the null when it is false, this is called a Type II error . Since uncertainty cannot
be eliminated, the strategy is to accept some probability of making a Type
I error called the significance level , denoted α. The most common value for
α is 0.05, which amounts to potential error in 1 out of 20 decisions. A less
stringent value of 0.10 (1 out of 10) and a more stringent value of 0.01 (1 out
of 100) are also common. These numbers are not magical and should be taken
with a grain of salt when results are borderline. A two-way table of decision
versus actual (or truth) is shown below. For example, the decision to reject
H0 (accepting HA ) when HA is true results in a correct decision.
decision/actual H0 HA
H0 correct Type II error
HA Type I error correct
We now have all of the necessary ingredients for a hypothesis test. We can
now consider a test of the population mean µ with hypotheses

H0: µ = µ0
HA: µ ≠ µ0
Given the alternative hypothesis, this is a two-tailed test where we must con-
sider the parameter taking a value less than µ0 or greater than µ0 . When
we compute the p-value, we must calculate the probability of a test statistic
taking its value or something more extreme in either direction. For example,
if the test statistic t = −1, then the p-value is the probability that t < −1 or
t > 1 assuming that H0 is true. This can be written P (|t| > 1).
The sample mean X̄ will be used to estimate µ. As its distribution will depend
on the sample size n, we will denote this random variable as X̄n . The central
limit theorem makes this statement about the sampling distribution of the
mean:
As n → ∞,
√n (X̄n − µ)/σ → N(0, 1)
Before standardization, this says that as n increases, X̄ is approximately
normally distributed with mean µ and standard deviation σ/√n. The quantity
σ/√n is called the standard error (SE) of the sample mean. In general, the
SE of a statistic is the standard deviation of its sampling distribution. As the
sample size increases, the standard error decreases, with X̄n approaching the
population mean µ.
One complication is that we generally won’t know the population stan-
dard deviation σ. In such a case, we can replace it with the sample standard
deviation s. Now let’s look at the quantity
(X̄n − µ) / (s/√n)
2 It is also possible to test functions of multiple parameters, such as the difference θ1 − θ2 .
This is the standardized form of X̄n which measures its distance from µ in
units of standard error. For n >= 30, this quantity approximately follows the
standard normal distribution, and it otherwise follows a t-distribution. The
t-distribution has more probability in the higher and lower values, or tails, and
its shape depends on a parameter called the degrees of freedom, denoted ν.
Like the standard normal, the t-distribution is symmetric about zero. Figure
9.2 shows the probability density functions of different t-distributions; the
case where ν = +∞ depicts the standard normal distribution. Notice that
the standard normal distribution has much less probability in the tails where
P (X < −2) and P (X > 2).
As a concrete example, suppose the hypotheses are

H0: µ = 0
HA: µ ≠ 0

The parameters are n = 1000 and α = 0.05, and the sample statistics (computed
from the data sample) are X̄1000 = −0.5 and s = 10. From this sample,
the mean is different from zero, but the question is whether the difference is
statistically significant. We compute the test statistic as
z = (X̄n − µ) / (s/√n)
  = (−0.5 − 0) / (10/√1000)
  ≈ −1.581
Next, we can compute the probability of this value or lower under the stan-
dard normal distribution. This represents the left-tailed probability. We can
then double this value to include the right-tailed probability. From Python,
we can call the function stats.norm.cdf(z), which computes the left tail us-
ing the cumulative distribution function of a standard normal. Doubling the
probability gives a p-value of approximately 0.1138. As the p-value is greater
than α = 0.05, we cannot reject H0 as the risk of making a Type I error would
be too large. Thus, based on the data sample, the mean µ is not statistically
different from zero.
Python code is shown below for experimenting with different values and mea-
suring the impact on the conclusion. For example, changing the sample mean
from –0.5 to –1.0 will produce a p-value of 0.00157 and lead to rejecting H0 .
import numpy as np
from scipy import stats

n = 1000
alpha = 0.05
xbar = -0.5
mu = 0
s = 10

test_stat = (xbar - mu) / (s / np.sqrt(n))
pval = 2 * stats.norm.cdf(-abs(test_stat))   # double the tail probability

print('test-stat:', round(test_stat,4))
print('p-value (two-tailed):', round(pval,4))
OUT:
test-stat: -1.5811
p-value (two-tailed): 0.1138
organized as in Table 9.1. The table summarizes the contribution and impor-
tance of predictors in the model. For background, this model predicts Y using
a linear equation
A confidence interval takes the form of a point estimate plus or minus a margin of error,

θ̂ ± cv × SE

or equivalently (θ̂ − cv × SE, θ̂ + cv × SE), where cv is a critical value determined
by the confidence level and the sampling distribution, and SE is the standard error of the estimate.
Next, let’s consider this specific case: we want to construct a 95% confidence
interval for the population mean µ, the population standard deviation σ is
unknown, and we have drawn a small sample of size n = 20. We can use x̄ as
a point estimate for µ and the sample standard deviation s as a point estimate
for σ. Suppose x̄ = 15 and s = 2.
Since the sample is small, the sampling distribution of X̄ follows a t-
distribution with ν = n − 1 = 19 degrees of freedom. Then the confidence
interval can be expressed like this:
(x̄ − t_{n−1, α/2} · s/√n,  x̄ + t_{n−1, α/2} · s/√n)
import scipy.stats

# critical value for 19 degrees of freedom and a 95% confidence level
print(scipy.stats.t.ppf(0.975, df=19))

OUTPUT:
2.093024054408263
Simplifying gives this 95% confidence interval for the population mean:
(14.0640, 15.9360)
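The full interval can be reproduced in a few lines; a sketch:

import numpy as np
from scipy import stats

n, xbar, s = 20, 15, 2
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
print(round(xbar - half_width, 4), round(xbar + half_width, 4))   # 14.064 15.936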
were reviewed for building a conceptual understanding. The various tests will
differ, as the sampling distribution will depend on the size of the data sample,
the parameter being tested (e.g., the population mean), and the availability
of the population standard deviation. We examined a regression parameter
estimate table to understand what the output looks like and what it means.
The table contains quantities which are essential to any hypothesis test: point
estimates, standard errors, test statistics, and p-values. More generally, hy-
pothesis tests are routinely computed with software. Manual calculations are
not usually necessary, but it is essential to understand what is being done and
what assumptions are being made.
There are several other common hypothesis tests, such as the one-sample
test of proportion and the two-sample t-test. The interested reader should
explore these tests, but they won’t be necessary for advancing through this
book. Each of these tests will use the ideas of setting up a null and alterna-
tive hypothesis, working with a sample statistic, and a computing standard
error, test statistic, and p-value. For a broad, excellent resource on statistical
methods, please see [30].
9.4 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
10
Data Transformation
We have ingested and processed raw data into cleansed data that promises to
be useful in further analysis. The end goal might be visualizing data for review
by leadership, producing quantitative summaries to share with customers, or
a machine learning model for predicting an outcome. There is a large step
that is missing, however, and that is maximizing the signal in the cleansed
data. Specifically, we might want to know how two variables are correlated. It
might be important to understand if one or more variables are trending over
time (that is, if their means are increasing or decreasing).
Unfortunately, there are factors that get in the way of easily discerning a
signal in data, and chief among them are noise, scale, and data representation.
We will start by exploring transformations, or transforms, for handling these
challenges. In the interest of space, we will cover a selection that is common
and useful.
It is also possible to create new predictors which may produce insights
and improve models. This activity can be a fun and creative process. We will
consider some examples for illustration.
steps = 100
The stock price and moving average trajectories are shown in Figure 10.2. The
moving average curve is much smoother than the prices, which helps to see the
trend. The moving average exhibits a lag, responding slower to movements.
This is apparent where the stock price sharply dropped around time step 60,
while the moving average fell much more slowly. Finally, the moving average
curve does not begin until time step 20, which matches the number of points
used in the average. In general, a moving average with a longer term will
require more points to begin, and it will induce more smoothing over the raw
data.
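As a rough sketch of the idea, a price path and its 20-step moving average could be produced like this (the random-walk parameters are made up):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

steps = 100
rng = np.random.default_rng(314)

# simulate a price path as a random walk starting at 100
prices = 100 + np.cumsum(rng.normal(scale=2, size=steps))

# 20-step moving average; the first 19 entries are undefined, so the curve starts later
moving_avg = pd.Series(prices).rolling(window=20).mean()

plt.plot(prices)
plt.plot(moving_avg)
plt.show()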
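The masked-array output shown next is what scipy's mstats.winsorize produces. A sketch that would generate something like it (the data values here are a hypothetical reconstruction consistent with the means discussed below):

from scipy.stats import mstats

# hypothetical values with two large outliers
data = [-30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# replace the lowest 10% and highest 10% with the nearest retained values
mstats.winsorize(data, limits=[0.1, 0.1])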
OUT:
masked_array(data=[1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9],
mask=False, fill_value=999999)
If we included the outliers in the calculation, they would have exerted a large influence and produced a mean of approximately 4.09.
Trimming data is different from winsorizing, as it discards outliers. A 20%
trimmed mean would remove the lowest 10% and highest 10% of the data
and compute the mean. Continuing with our example data, the 20% trimmed
mean would also be 5. The code to implement this is the following:
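A minimal sketch using scipy, with the same hypothetical data values as above:

from scipy import stats

data = [-30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# remove the lowest 10% and highest 10%, then average; returns 5.0 here
stats.trim_mean(data, proportiontocut=0.1)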
Next, let's turn to transforms for treating scale. A common difficulty is data whose values span several orders of magnitude. Consider the following small dataset:

x = [0, 1, 2, 3, 4, 5]
y = [10, 10, 10, 20, 20, 10000]
We are expecting the y-value of 10,000 to cause trouble. Figure 10.3 graphs
this data. As we expected, the 10,000 makes the smaller y values appear
as zero or nearly zero. Depending on the context, this can be problematic.
If the question were: how many fatalities were there in a given year, and y
represented these fatalities, then an answer of zero is very different from one,
ten, or twenty.
Since we are trying to better understand all of this data, we are looking
for a way to better represent it. A common technique when facing values
with different orders of magnitude is to take the logarithm of the data using
base 10. This is consistent with measuring order of magnitude, which means
that a value of 10 will have a logarithm of 1, and a value of 10,000 will have a
logarithm of 4. Figure 10.4 demonstrates that the log(y) transformation better
represents the entirety of the data.
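A sketch of the transform, reusing the x and y lists defined above:

import numpy as np
import matplotlib.pyplot as plt

log_y = np.log10(y)      # order of magnitude of each y value
plt.plot(x, log_y)
plt.show()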
10.2.2 Standardization
In this section, we will spend more time on standardization, which we encoun-
tered when working with random variables following a normal distribution.
Imagine two different language classes that took the same exam. The grades
from each class roughly followed a normal distribution with the same standard
deviation of 2. In class 1, the mean was 90, while the mean for the second class
was 75. Suppose a student in each class scored a 90.
The student in class 1 scored at the mean, while the student in class 2 per-
formed remarkably better. To make a fair comparison, we standardize the
scores by computing z-scores according to z = (x − µ)/σ. A grade of 90 in
class 1 equates to a z-score of (90 − 90)/2 = 0, while the grade in class 2 is
equivalent to a z-score of (90 − 75)/2 = 7.5. Note that a z-score of 7.5 is ex-
tremely high, recalling that 99.7% of the time, a standard normally distributed
random variable will have a z-score between –3 and 3.
In this small example, we standardized one value from each class. In data
science, we will often standardize one or more columns of data, where each
column is a variable. This places each variable on a similar scale and retains
outliers. Standardization in Python is commonly done using the sklearn mod-
ule. The function StandardScaler() can be used to standardize one or more
columns. Here is a code example for illustration:
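A minimal setup consistent with the printed output below (the 4×2 integer array is read off that output):

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[4, 4], [6, 0], [0, 2], [1, 1]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)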
print(data)
print('')
print(scaled_data)
OUT:
[[4 4]
[6 0]
[0 2]
[1 1]]
[[ 0.52414242 1.52127766]
[ 1.36277029 -1.18321596]
[-1.15311332 0.16903085]
[-0.73379939 -0.50709255]]
By default, the scaler computes the mean and standard deviation of each
column and produces the z-scores. For this data, each column has values in
a range of [–3, 3], which is common for a variable with a standard normal
distribution. Additionally, each column will have a mean of zero and standard
deviation of one. We can compute the means and standard deviations of each
column to check one of the z-scores by “hand.”
data.mean(axis=0)
OUT: [2.75 1.75]
data.std(axis=0)
OUT: [2.384848 1.47901995]
The axis=0 parameter computes the statistic down each column (that is, across the rows). For the value of 4, its z-score is then (4 − 2.75)/2.384848 = 0.52414242, which matches the scaled data entry.
10.2.3 Normalization
Another common approach to scaling data is normalization. This approach
squeezes the data into a range which is commonly selected to be [0, 1] or
[–1, 1]. Unlike what the name may suggest, this transform has nothing to do
with the normal distribution, and the data does not need to follow a normal
distribution. Outliers are floored to the lower bound and capped at the upper
bound. This transformation is effective in scaling data, but when the data
is approximately normally distributed, standardization is recommended for
retaining symmetry about the origin and for retaining outliers. The formula
for normalizing data is the following:
xnorm = (x − xmin) / (xmax − xmin)
Here is a code snippet for normalizing the dataset from the previous example.
By default, MinMaxScaler() uses a lower bound of 0 and an upper bound of
1.
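A minimal sketch, reusing the data array from the standardization example:

from sklearn.preprocessing import MinMaxScaler

norm_data = MinMaxScaler().fit_transform(data)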
print(norm_data)
OUT:
[[0.66666667 1. ]
[1. 0. ]
[0. 0.5 ]
[0.16666667 0.25 ]]
For each column, the original minimum and maximum were mapped to zero
and one, respectively. As the columns are in the same range [0, 1], scaling is
achieved.
countvec = CountVectorizer()
[[1 1 0 1 2 0 0 0 0 2 1 1 1 0 1 2 0 1 1 1]
[0 0 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 0 0]]
When we call get_feature_names_out(), this returns the set of words ex-
tracted from the text, known as the vocabulary. From the matrix, each column
representing a word’s frequency, or count, in the documents can be used as a
predictor.
Let’s review the matrix of counts. We see for example that the first doc-
ument contains the word ‘california’ twice, while the second document has
a count of one (see column 5). Many of the words appear in one document
and not the other, such as ‘wildfires.’ These differences can help with topic
classification. In summary, we began with data in text form, and we converted
it to a numeric representation for eventual use in a model.
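As a self-contained sketch of this workflow (the documents here are made up, so the vocabulary and counts will differ from the matrix above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['wildfires spread across california as california declared an emergency',
        'the senate passed the budget bill after a long debate']

countvec = CountVectorizer()
counts = countvec.fit_transform(docs)

print(countvec.get_feature_names_out())
print(counts.toarray())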
breeds: ['persian', 'persian', 'siamese', 'himalayan', 'burmese']
To encode this categorical variable, we can form a column for each unique
breed: burmese, himalayan, persian, and siamese. Each row will refer to a
record, and each element will take value zero or one to indicate the breed. It
may look as follows:

record    burmese   himalayan   persian   siamese
1         0         0           1         0
2         0         0           1         0
3         0         0           0         1
4         0         1           0         0
5         1         0           0         0
The first two records are persian, and this is represented by the 1 in the persian
column and zeroes in the rest. Written as a vector, record 1 is encoded as [0, 0,
1, 0]. We can condense this further by observing that for a categorical variable
with n levels, or unique values, we need only n-1 columns to represent them.
In this case, we will drop the burmese column and update the matrix:

record    himalayan   persian   siamese
1         0           1         0
2         0           1         0
3         0           0         1
4         1           0         0
5         0           0         0
Record 5 takes value ‘burmese’, and this is reflected by each of the remaining
cat breeds taking values of zero. The columns in this matrix are sometimes
called indicator variables or dummy variables. These columns can now be
included in modeling techniques such as regression. Next, let’s look at a code
snippet that begins with the categorical data and converts it to numeric using
one-hot encoding.
import pandas as pd

cats = pd.DataFrame({'breed': ['persian', 'persian', 'siamese', 'himalayan', 'burmese']})
print('--categorical data')
print(cats)

cats = pd.get_dummies(cats, prefix='breed')   # add drop_first=True to keep only n-1 columns
print('\n')
print('--one hot encoded categorical data')
print(cats)
OUT:
--categorical data
breed
0 persian
1 persian
2 siamese
3 himalayan
4 burmese
Binarization maps a numeric value x to 0 or 1 based on a threshold c:

f(x) = 0 if x ≤ c
f(x) = 1 if x > c
Here is a code snippet to binarize data using sklearn. Binarizer() takes a
threshold which is zero by default. Values less than or equal to the threshold
will be set to zero, and set to one otherwise.
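A minimal sketch, again reusing the data array from the scaling examples:

from sklearn.preprocessing import Binarizer

# threshold defaults to 0
bin_data = Binarizer().fit_transform(data)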
print(data)
print('')
print(bin_data)
OUT:
[[4 4]
[6 0]
[0 2]
[1 1]]
[[1 1]
[1 0]
[0 1]
[1 1]]
10.4.2 Discretization
Discretizing a continuous variable into buckets can sometimes yield a more
powerful predictor. In particular, it can be effective when the variable has a
wide range and it is used in a regression model. The qcut() function from
pandas can discretize data into equal-sized buckets based on ranks or quan-
tiles.
The example below creates 1000 random values drawn from a standard
normal distribution. They are multiplied, or scaled, by 100 to increase their
range. The data is then divided into deciles, where decile 0 is the lowest 10%
of data values, decile 1 is the next 10%, and so on. The original data and
the deciles are plotted in Figures 10.5(a) and 10.5(b) to show the discretized
effect.
data = 100*np.random.normal(size=1000)
data.sort()
data_dec = pd.qcut(data, q=10, labels=False)   # deciles labeled 0 through 9
plt.plot(data)
plt.plot(data_dec)
Notice the sawtooth pattern in the quantized data. As the data are grouped
into deciles, this produces a series of step functions where values are constant
within each decile, and then a jump occurs at the next decile. This has the
effect of mapping many values to the same output, and narrowing the range
of the data to 0, 1, . . . , 9.
Transformation                   Example
Ratio of variables               Assets to liabilities
Difference of variables          Long-term minus short-term interest rate
Lagged variable                  Last month's medication dose
Cumulative sum of a variable     Total study time
Power of a variable              Square of time
semester1/week_09_10_transform
10.6 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
One of the first things that should be done with a new dataset is to get a
feel for the data. We did some of this during the preprocessing step, when
we looked for missing values and extreme observations, and when we trans-
formed variables. Every dataset will be different and might contain surprises;
understanding the nuances and the tools to treat them will allow for accurate,
insightful analysis.
The two activities we will explore are statistical summarization and vi-
sualization of the data. As the datasets grow larger, it can be difficult or
impossible to review all of the data points, and this is where statistics will
help. Statistical summaries, such as computing means and percentiles, can
provide a quick overview of the distribution of one or more variables. Ag-
gregations can summarize data along one or more axes, such as computing
average daily measurements on high-frequency data. We will see examples of
statistical applications in this chapter’s exercises.
Properly done graphics will lead the eye to important features of the data,
prompting further investigations. Python includes several modules for visu-
alization, with matplotlib and seaborn among the most popular. We will
show several examples of seaborn plots to give a sense of its power and
ease. Seaborn is based on matplotlib and it provides a high-level interface
for drawing attractive and informative statistical graphics. For customizing
seaborn graphics, there are times when it is necessary to include elements
from matplotlib. The space in this book does not allow for a comprehensive
treatment of the modules, but the examples to follow will serve as a starting
point. For an introduction to seaborn, please see [32]. To get started with
matplotlib, see [33]. Both of these packages will be pre-installed with Ana-
conda.
Exploratory data analysis (EDA) can be done on variables individually,
of course, but many of the interesting insights will come when we consider
how one variable changes, or co-varies, with another. This speaks to the im-
portance of understanding relationships between variables, which is essential
for prediction. To drive the point home, consider that each time a variable is
selected, that data is taken out of context from a large system. If the column
of data is a set of student midterm grades, the excluded data might be the
students that weren’t feeling well that day. For a series of stock prices, the
underlying economic regime has been left out. In each case, these excluded
variables might be essential to truly understanding the data. Thoroughly studying each variable individually and in groups will help build a more complete understanding of what insights the data holds.
When we conduct analysis on a single variable, this is often called a uni-
variate analysis. For the case of studying two variables together, it is often
called a bivariate analysis. Different statistics and graphs are relevant for each,
and they are both necessary for a complete study. In the next two sections,
we will conduct EDA on a binary target and a continuous target, respectively.
The first exercise will consider a synthetic dataset from a bank where the
records represent customer check deposits and the target variable is whether
the check was fraudulent or not. The second exercise will study population
happiness. These exercises should give a good feel for the data exploration
process and methods.
The synthetic dataset for this analysis lives in the course repo here:
semester1/datasets/check_deposit_fraud.csv
semester1/week_11_12_visualize_summarize/check_deposit_eda.ipynb
df = pd.read_csv('../datasets/check_deposit_fraud.csv')
df.head(3)
OUT:
A real-world dataset would include additional attributes like bank name, ac-
count number, routing number, check number, and so on. Here, attention is
drawn to a small number of interesting attributes, as they have exhibited his-
torical predictive ability. For the binary variable check_signed and the target variable is_fraud, we should compute the frequency and percentage of each value. We count the number of fraudulent and non-fraudulent checks with the value_counts() function:
df.is_fraud.value_counts()
0 1096
1 4
The rows are ordered from highest number of observations to lowest. In this
case, there are 1096 rows where is_fraud is 0 (no fraud; the negative label), and 4 rows where is_fraud is 1 (fraud; the positive label). Including the
parameter normalize=True will compute their percentages:
df.is_fraud.value_counts(normalize=True)
0 0.996364
1 0.003636
df.check_signed.value_counts()
1 1096
0 4
df.check_signed.value_counts(normalize=True)
1 0.996364
0 0.003636
For a larger dataset, there might be dozens or more binary variables and
categorical variables that can be summarized in this way. It can help to form
groupings of variables so they can be processed together. This example forms
a binary variable grouping with additional fictitious variables, and computes
their frequency distributions in a loop:
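A sketch of such a loop; the extra column names here are fictitious, as noted above:

binary_vars = ['check_signed', 'address_verified', 'duplicate_check_number']

for col in binary_vars:
    if col in df.columns:                              # skip fictitious columns not in the data
        print(df[col].value_counts(normalize=True))
        print('')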
df.check_amount.describe()
count 1100.000000
mean 5071.434100
std 2934.500657
min 2.710000
25% 2484.380000
50% 5077.185000
75% 7662.152500
max 10000.000000
The data seems to be uniformly distributed with a minimum close to zero and
a maximum of 10000. Next, we use the seaborn package and plot a histogram
to see the full distribution (see Figure 11.1). Since elements from matplotlib
will be brought into seaborn, the former package will be imported as well.
Notice the aliases used by convention: seaborn is referenced as sns, while
matplotlib.pyplot is referenced as plt.
sns.histplot(data=df.check_amount)
The plot confirms the approximate uniform distribution. The values are all
plausible (e.g., there are no negative check amounts).
To change the plot type of a single variable, we would change its name and
include the relevant parameters. For example, if we wanted to change from a
histogram to a boxplot, the code would change from:
sns.histplot(data=df.check_amount)
to
sns.boxplot(data=df.check_amount)
To understand how check_signed is distributed for the fraudulent and non-fraudulent checks, we can produce a two-way classification table with the pandas crosstab() function:

pd.crosstab(index=df['check_signed'], columns=df['is_fraud'])
is_fraud 0 1
check_signed
0 1 3
1 1095 1
To read the table, the check_signed = 0 row indicates unsigned checks. Read-
ing across that row, of the 4 unsigned checks, 1 was not fraudulent and 3 were
fraudulent. Of the 1096 signed checks, 1095 were not fraudulent and 1 was
fraudulent.
We can put this in percentage terms for greater clarity, dividing each count
by the row sum. Programmatically, this requires adding the parameter
normalize.
pd.crosstab(index=df['check_signed'],
            columns=df['is_fraud'], normalize='index')
is_fraud 0 1
check_signed
0 0.25 0.75
1 0.999088 0.000912
This says that 75% of the unsigned checks were fraudulent, while only 0.09% of
the signed checks were fraudulent. The ratio 0.75 / 0.000912 = 822 strongly
suggests that unsigned checks pose a heightened risk of fraud. If we were
going to build a predictive model, check_signed would be a good candidate
predictor.
Next, let’s further explore check amount to understand how the amounts vary
by the fraud outcome. The scatterplot in Figure 11.2 may clarify their rela-
tionship.
Gridlines are incorporated from matplotlib. For a bivariate plot (also called
a joint plot), the commonly specified parameters are x, y, and the dataframe
data. The values of x and y are the variable names entered as strings, while
data references the dataframe object. Similar to the univariate case, the type
of graphic can be changed by swapping the graph type and including the
relevant parameters.
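A sketch of a call that would produce a plot like Figure 11.2 (the exact styling is assumed):

sns.scatterplot(x='check_amount', y='is_fraud', data=df)
plt.grid(True)   # gridlines from matplotlib
plt.show()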
Most of the checks were not fraudulent, and their points are plotted on
the bottom horizontal grid line. We see the four fraudulent checks on the
top horizontal grid line which marks where is_fraud = 1. The check amounts appear to be roughly 1000, 4500, 5000, and 10000. We can also
inspect the fraud records in the dataset:
df[df.is_fraud==1]
The table confirms a key observation: For three of the four fraud cases, the
check amounts were multiples of 1000. Given this, it makes sense to create a
predictor variable which indicates if each check amount is a multiple of 1000.
We might name it check_amount_mult_1000.
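A one-line pandas sketch of creating this indicator:

df['check_amount_mult_1000'] = (df['check_amount'] % 1000 == 0).astype(int)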
From our exploratory work, we have uncovered that check_signed and check_amount_mult_1000 are useful predictors of check fraud. In making these
insights, we produced statistical summaries, filtered the data, and created
plots. We needed to be creative and try different approaches, as the predic-
tors were not immediately apparent.
https://worldhappiness.report/ed/2018/
https://www.kaggle.com/unsdsn/world-happiness
The 2018 dataset is saved in the course repo, and we will load it from there.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('../datasets/Country_Happiness.csv')
df.head(2)
OUT:
Each row represents a unique country, column names are abbreviated, and
some columns are suppressed due to margin constraints. We will focus on
the happiness score (Happy), which is the target variable, and six factors
that contribute to higher life evaluations: economic production (GDP), social
support (Social), life expectancy (Health), freedom (Free), generosity (Gener),
and perceptions of corruption (Corrupt). Respondents were asked to rate their
happiness from 0 (worst possible life) to 10 (best possible life). The hypothesis
is that the six factors contribute to the happiness score, either directly (they
rise together) or inversely (they move in opposition).
First, let’s produce histograms of the happiness score and the factors to un-
derstand their distributions. The code is shown below, along with a sample of
the histograms.
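A sketch of one way to produce the histograms (looping over the numeric columns):

for col in df.select_dtypes('number').columns:
    sns.histplot(data=df, x=col)
    plt.show()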
The happiness score ranges in value from 2.91 to 7.63, with most scores falling
between 4.5 and 6.5 (see Figure 11.3). The lowest score was well above the
potential low of zero, and the highest score was well below the potential high
of 10.
The social support factor ranges from 0 to 1.64 and it is left skewed (see Figure
11.4). This means that the distribution is longer to the left of the peak, which
occurs at about 1.3.
The corruption factor ranges from 0 to 0.46. Seventy-five percent of the values
are below 0.14, and the distribution is skewed to the right (see Figure 11.5).
Another useful plot for visualizing the distribution of the data is the box-
plot. Figure 11.6 shows a boxplot of happiness scores, with the five-number
summary shown:
• The 25th percentile forms the bottom of the box (value: 4.45)
A boxplot provides less detail than the histogram, but it can be easier to
interpret. Additionally, several boxplots can be easily compared together.
Next, let’s look at bivariate plots of the data to understand the relationships
between the factors and the target. We create a scatter plot of happiness
versus social support:
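A sketch of the call; the exact column name for social support is an assumption based on the dataset description:

sns.scatterplot(x='Social support', y='Happiness score', data=df)
plt.show()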
Figure 11.7 shows a direct relationship between social support and happiness;
they tend to move together. The correlation is 0.76, which indicates a strong
positive (linear) association.
A direct relationship is not apparent here, and there are some outliers to the
far right. A correlation of 0.41 suggests a weak direct relationship.
We have separately created univariate plots for the underlying variables and
a scatterplot to see their comovements. A jointplot (Figure 11.9) combines
the scatterplot with the histograms from each of the underlying, or marginal,
variables:
Notice that we changed from a scatter plot to a joint plot by changing the
word scatterplot to jointplot. The seaborn interface makes this very easy.
df.corr()[‘Happiness score’]
OUT:
This shows that the factor with strongest correlation is GDP, while Generosity
has the weakest correlation.
Lastly, seaborn can graph the joint and marginal distributions for all pair-
wise relationships and each variable, respectively, with the pairplot. The
data density of this plot in Figure 11.10 is very high and it is possible to un-
derstand overall patterns quickly. Since the column names are long, they are
first abbreviated with a passed dictionary of ‘old name’ : ‘new name’ pairs.
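A sketch of the renaming-plus-pairplot step; the dictionary keys are assumed long names, not necessarily the exact column names in the file:

short_names = {'Happiness score': 'Happy', 'GDP per capita': 'GDP',
               'Social support': 'Social', 'Healthy life expectancy': 'Health'}

sns.pairplot(df.rename(columns=short_names))
plt.show()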
Using statistics well means choosing the right statistic for the job, and understanding the limits of a statistic's usefulness. Let's consider two examples:
1. The correlation between a predictor X and target Y cannot tell
us whether X caused Y . This is particularly true in observational
data, which is not collected under the control of a researcher. A
positive correlation indicates that an increase in X is associated
with an increase in Y , but we cannot confirm a causative effect.
For example, in a study on whether coffee X lowers blood pressure
Y for a group of individuals, additional variables like exercise E or
stress S may be related to both X and Y and play a causative role.
These additional variables are called lurking variables, and they can
have a confounding effect in the study.
Stocks 65%
Bonds 35%
For the next example, consider plotting the ordered points (1,5), (2,8), (3,10),
and (5,12). Suppose we run this code to produce a line plot:
xval = [1, 2, 3, 5]
yval = [5, 8, 10, 12]
sns.lineplot(x=xval, y=yval)
We would do better with a scatterplot, which shows the actual data points,
or even a small table.
We also need to take care when using seaborn to create bar charts. Here is
an example:
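A sketch with made-up values (the actual data behind Figure 11.13 is not important for the point being made):

xval = ['Jan', 'Feb', 'Mar', 'Apr']
yval = [3, 7, 5, 9]

sns.barplot(x=xval, y=yval)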
The default behavior is to assign a different color to each data point as shown
in Figure 11.13. This effectively creates a fake data dimension. We can do
better by using one color for all of the bars:
sns.barplot(x=xval, y=yval, color='grey')
The plot in Figure 11.14 better conveys the data. To be sure, these points
might be better reported in a table, but this small example serves to illustrate
the following recommendations:
• Avoid fake dimensions
• Select the right plot for the data
• Question the output of statistical software. Default settings are not always
appropriate.
Adding layers to a plot should only be done if they are helpful in understanding
the data. Consider the histogram in Figure 11.16 with an overplotted set of
gridlines. The gridlines are so dark that they dominate the plot.
Lighter gridlines as shown in Figure 11.17 allow the data to come forward.
From the examples given above, we see the importance of making graphics clear and accurate so they do not mislead viewers. As we create graphics to show
data, we should challenge them with questions including those listed below.
By honestly addressing the questions, we can produce better graphics.
11.6 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
pd.crosstab(index=df['invalid_address'],
            columns=df['is_fraud'])
is_fraud 0 1
invalid_address
0 2 99
1 2050 1
3. S For each of the given scenarios, indicate whether the mean, me-
dian, or mode would be the most appropriate measure of central
tendency:
a) You would like to compute the central tendency of a set of home
sales in a certain neighborhood. The homes that sold consist of 20
nearly identical ranch-style 3-bedroom houses.
b) You would like to compute the central tendency of a set of home
sales in a certain neighborhood. The homes that sold consist of
20 nearly identical ranch-style 3-bedroom houses and one seaside
mansion.
c) A dataset has a categorical variable with some missing values.
You wish to use a measure of central tendency to impute the missing
values.
4. You hypothesize that stray marks on a bank check make it more
likely to be fraudulent. What would be a useful way to explore this
hypothesis?
5. True or False: You would like to display four numbers to a viewer.
It is generally preferable to show such a small set of data in a table
rather than a graph.
6. S True or False: The primary objective of graphics should be to
show the viewer something visually impressive.
7. You produce a 3-dimensional bar chart where the depth of the chart
is not related to the data. Which principle does this violate?
a) Graphics must be beautiful
b) Avoid fake dimensions
c) Question the output of statistical software
d) Do not show incomplete observations
8. You would like to show the relationship between two variables.
Which plot would be most suitable?
a) Bar plot
b) Candlestick plot
c) Pie chart
d) Scatter plot
9. True or False: A scatterplot indicates a weak correlation between a
predictor and target. This suggests that the predictor will not be
useful in any machine learning model.
10. S You produce a bar plot where each month of the year is repre-
sented with a different color. Is this a good practice? Explain why
or why not.
11. S From the World Happiness dataset, use seaborn to produce a
scatterplot with the Generosity variable on the x-axis and the
Happiness score on the y-axis.
12. From the World Happiness dataset, you will remove outliers and
then compute the correlation between the Generosity variable and
the Happiness score. For this exercise, an observation should be
flagged as an outlier if the Generosity value is more than 1.5×IQR
below Q1 or more than 1.5 × IQR above Q3.
12
An Overview of Machine Learning
In this chapter, we will review the branches of machine learning and the
archetypal problems that they solve. Each approach provides a way to learn
from data and use it for one or more of these purposes: making a predic-
tion, making a decision, discovering patterns, or simplifying structure. This
is fundamentally different from telling the computer how to solve a problem
through a set of instructions. Additionally, we will learn about the important
elements that arise in solving problems with machine learning. Here are the
branches we will explore:
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
We will begin with a simple approach to decision making that does not learn
from data and does not require an algorithm. This will surface some important
concepts, but it will leave us short of solving the problem. It will help us
understand why some data problems are too complex to solve using intuition
and human number-crunching. We will need tools from machine learning and
the help of computers.
I encourage thinking about a decision you personally made using a pros and
cons list.
To create the list, we make a pros column to contain the advantages of
taking the path, and a cons column for disadvantages. Then we think about
different attributes that we believe will be relevant, what those attributes are
like on that path, and how we would perceive them. We may decide that our
next home should maximize our overall life satisfaction. Perhaps moderate,
sunny weather is important, as well as being near family, having a lot of
work opportunities, and the presence of a great food scene. This gives four
attributes that seem most important. Are there others that could increase our
satisfaction or decrease our satisfaction? This is important, because missing
predictors can lead us to the incorrect decision. It turns out that last summer's internship included a heinous commute, and so we add commute time to the attribute list.
Let’s pause for a minute and answer a question: are we using data in
making this list? Well, we might not be directly collecting a massive dataset,
but we may research the location’s weather, traffic, companies where we can
work, and restaurants. We might visit Yelp or Google Maps to collect this
information. The other way we are using data is from our mental database of
life experiences. We recalled the summer commute, and perhaps what it was
like having long winters as a kid. However, we encounter some complications.
First, our memory might not be so accurate. The average commute might have
been 40 minutes, but we might recall it being 60 minutes. Second, we might
respond differently to the attributes in the future than we did in the past. A
snowy day as a kid may have involved snowball fights and snow angels, while
snow as a recently graduated college student or adult might mean shoveling
a driveway or canceling a get-together. In short, we will need to use possibly
faulty memories and predictions to measure our satisfaction. Sounds hard,
right?
Moving on, we decide that we have all of the important attributes and
how we feel about them. It turns out that our location under consideration
has the weather, food scene, traffic patterns, and job opportunities that we
would like, but it is far from family. Let’s construct the list:
Pros
1. Moderate, sunny weather
2. Light traffic and short commute times
3. Many job opportunities
4. Great selection of excellent restaurants
Cons
1. Far away from family
Once again, we remind ourselves to be sure that we are not missing any
important attributes. Next, we need to think about how to reach a conclusion,
which means somehow aggregating this data into a Go/No Go decision. This
is the hard part. Do we count the number of pros and subtract the number of
cons? This would assume that each attribute is equally important. Do we say
that family is ten times more important than each of the other attributes? Or
is the number twenty? One hundred? It becomes apparent that some kind of
weighting might play a role, even if each item carries an equal weight. At this
point, some people change the weighting to make the decision for them. They
increase one of the weights to the point where the decision is obvious. This
might be okay so long as the correct decision is reached.
Let’s recap and reflect on what we have learned. In making a decision,
we collected the relevant attributes. We needed to gather all of them, or risk
making the wrong decision. We might have gathered some data and done some
research. We likely reflected on past experience, but this was troublesome as
our memories can fool us. Additionally, the thought experiment required us
to predict how we would perceive somewhat related experiences in a different
environment. For experiences very far back in the past, they might not have
even been useful. Recency of data is valuable, as the world might have changed
from long ago. Lastly, after listing each attribute under the pros and cons
columns, we faced the decision of how to weigh them.
I hope you can appreciate how hard this task is. At the same time, I hope
you can start to see how machine learning might help. Now, machine learning
is not going to tell us what we should feel, but it might help us understand
our feelings, and act in a way that is consistent with our goals. How can it
do this? We still need to collect data, and we should keep only the data that
is relevant. We still need to create the predictors. However, machine learning
can be used to determine the optimal weights for any number of predictors.
This will require data, a mathematical objective function, an algorithm, and
a way for the algorithm to learn from the data. Depending on the algorithm,
learning will differ, but in general it follows this flow in the supervised learning
case we will discuss next.
1. Start with some weights, which are perhaps randomly assigned
2. Using the weights and predictors, combine them to make a predic-
tion. Each predictor might be multiplied by its weight, and then
the terms can be summed.
3. Compare each prediction to the correct answer. For example, dif-
ferences might be computed. This returns errors.
4. Change the weights to reduce a measure of the total error. The
measure might be the sum of squared errors, for example.
5. Repeat 2-4 until we run out of permitted iterations or the improve-
ment is marginal
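As an illustration of this loop, here is a minimal numpy sketch (gradient descent on squared error, with made-up data and a made-up learning rate):

import numpy as np

rng = np.random.default_rng(0)

# made-up data: 100 observations, 3 predictors, known true weights
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = rng.normal(size=3)          # step 1: start with random weights
lr = 0.01                       # learning rate

for _ in range(500):
    pred = X @ w                # step 2: weighted combination of predictors
    error = pred - y            # step 3: compare predictions to correct answers
    grad = 2 * X.T @ error / len(y)
    w -= lr * grad              # step 4: change weights to reduce squared error

print(w)                        # close to true_w after enough iterations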
Now, let us study supervised learning in more detail. This is a classical machine
learning approach that has wide applicability.
• The patient recovers (the positive class) or not (the negative class)
• A bank deposit is fraudulent (positive class) or not (negative class)
• A stock price rises over the next trading day (positive class) or not (negative
class)
These examples are called binary classification problems. The positive class is
assigned to the event of interest; it makes no judgment on good versus bad.
Programmatically, the positive class is coded as 1, and the negative class as
0.
In the case where the outcome has more than two possible values, it is
called a multiclass classification problem. Examples might include predicting
the fastest runner in a marathon, or the next number rolled on a six-sided die.
Coding proceeds by numbering each class from zero to the total classes minus
one. For the six-sided die problem, the classes would be numbered 0, 1, . . . , 5.
• Which of these data points are outliers? This can be used for outlier detec-
tion to flag suspicious activity, for example.
• If we wanted to group the data points according to some distance metric,
how should they best be grouped? This can help discover substructure to
understand different personas in the data, for example.
• For a new data point, which group provides the best fit? For example, in an
ideal case, we might find that bank transactions form two groups: fraudulent
transactions and non-fraudulent transactions. The new data point, based
on the values of the predictors, might clearly belong to the non-fraudulent
group.
If possible, the data points to be assigned labels should be reviewed for quality
control. After the newly labeled points are added to the training set, the model
can be refit. This process can be repeated to increase the size of the training
set. After some amount of iteration, it might not be possible or necessary to
continue iterating, and the trained model can be used.
To determine the optimal policy, the long-term value of the actions needs to
be computed. Thinking long term is essential, because selecting the action
that produces the highest next reward (greedy behavior ) might be suboptimal
in the long run. The long-term value is computed as the discounted expected
cumulative reward.
As a quick example, I used to love playing volleyball in high school. Fast
forward fifteen years, and I found myself playing a pickup game at the local
YMCA after not playing since college. The game lasted three hours and I
loved it. Compared to all the other sports I played, including basketball and
tennis, I ranked this at the top. That was, until I woke up the next morning
and could hardly walk for the next two days, let alone do other activities
like weightlifting. I realized that the short-term value of this action was very
different from its long-term value!
Another important concept in RL is exploration versus exploitation. For
example, imagine deciding the best Chinese restaurant in town by trying each
of them once. Perhaps restaurant A is the best restaurant but had an off
night, while restaurant B performed on par. Greedy behavior would dictate
that the agent should avoid restaurant A and return to restaurant B. This is
exploitative behavior, and it will produce a suboptimal policy, as restaurant
A is the better restaurant on average.
An exploratory strategy encourages the agent to sample each action to
learn the optimal policy. In an ε-greedy approach, the agent explores for a small fraction of the time ε and takes the highest-valued action for the remaining majority of the time 1 − ε. For the portion of time when the agent
explores, it randomly selects from all available actions. This allows the agent to
generally benefit from its learnings, while taking some time to explore actions
which may be more valuable.
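A tiny sketch of ε-greedy action selection (the array of estimated action values is assumed):

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    # explore with probability epsilon, otherwise exploit the best-known action
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

action = epsilon_greedy(np.array([1.2, 0.4, 2.0]))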
• Labeled data is not required. For some problems, such as the best way to
treat patients for a life-threatening condition like sepsis, the right answer is
not known. RL can help guide the patient to improvement based on observed
rewards.
• It models a sequence of actions for a complex process
Depending on the problem, there can also be several challenges such as:
• Navigating a drone to a safe landing zone while avoiding obstacles. The state
space is the location of the drone in three spatial dimensions (perhaps sur-
rounded by a bounding box) and the action space is the direction taken with
a remote controller. The reward might be the negative straight-line distance
to the landing zone (i.e., moving closer to the landing zone is better). The
epoch can terminate if the drone strikes an obstacle.
12.6 Generalization
We will focus most of our modeling attention on the supervised learning ap-
proach. A critical assumption in SL is that the patterns learned from the
training set will generalize to additional data. The additional data might be
a separate segment of data that has already been collected, or data yet to be
collected.
One of the greatest fears of a data scientist is that a model is trained on
some data, it fits the data well, it is deployed, and it fails to work as well in
production. This would mean that the model learned relationships between
the predictors and target from the training set, but those relationships were
not present in production data. This can happen, and when it does, it is a
major disappointment.
How can we avoid this lack of generalization? Unfortunately, it can’t be
eliminated with certainty, but there are steps a data scientist can take to
improve the chances of generalization. First, it is important to distinguish
between two cases:
1. The model fails out of the gate. This is often preventable.
2. The model works in production for some time and later fails
In the first case, the problem is often caused by overfitting. As an example,
imagine cramming for an exam by taking practice tests. If we learn the practice
exams perfectly, we can score perfectly on the real exam only if it matches
the practice questions. Suppose there are some quirky practice questions that
don’t deeply test the core concepts. We might learn those and have a false
sense of confidence that we know the material. It is unlikely, however, that
these questions will be recycled on the real exam.
Model overfitting can have the same effect as cramming the quirky ques-
tions. While the model is learning some patterns, it is also memorizing idiosyncratic effects that are unlikely to appear again. This can fool us into overes-
timating the model’s ability. As an example, consider Figure 12.1. There is a
data point at x = 3.5, y = 9 that stands apart from the rest of the data which
exhibits a clear linear relationship. A good model will fit the linear trend and
ignore the outlier. It will generalize well to new data, assuming that it con-
tinues to follow the relationship y = 2x. An overfitting model will attempt to
fit the outlier, which will take it off course from a linear relationship. As it is
unlikely that future data will follow the outlier, the more complex model will
not fit as well as the linear model.
• The training set is used for training the parameters, or weights, in the model
• The validation set is used for tuning hyperparameters in a model. A hy-
perparameter is an external configuration variable which is used to control
the learning process. Several models use hyperparameters, and their optimal
values are not known in advance. As an example, some models build decision
trees, and the number of trees to build is a hyperparameter. We will review
hyperparameter tuning shortly.
• The test set (or holdout set) is used for evaluating the model performance
after all of the modeling decisions have been made. That is, the model, the
• Geopolitical factors
• Changes in consumer preferences
• Business strategy changes
In practice, the expected loss can be estimated empirically by averaging the loss
over all of the training observations. The objective then becomes minimizing
the quantity Remp defined as:
Remp(θ) = (1/n) Σi=1..n L(yi, f(xi, θ))
This process is called empirical risk minimization. The fact that this operation
yields the appropriate estimator is an important result in machine learning.
Details can be found in [38].
There is flexibility in selecting a loss function, but some attributes are impor-
tant:
Beyond this, the choice of loss function can depend on whether the SL task is
regression or classification. The squared error, or L2 loss, is a popular choice for the regression task. Since the expected loss needs to be measured, the squared error can be averaged over the training set to produce the mean squared error (MSE) as follows:

MSE = (1/n) Σi=1..n (yi − ŷi)²
Given that the errors yi −ŷi are squared, their contributions are relatively small
between –1 and 1, but they grow quadratically for larger values. This means
that outliers will have a large influence on MSE (it is not a robust metric).
Mathematically, MSE is continuously differentiable, making it straightforward
to minimize with respect to parameters.
import numpy as np

def mse(y, yhat, logging=True):
    # compute errors
    error = y - yhat
    # squared errors
    sq_error = error ** 2
    mse = sq_error.mean()
    if logging:
        print('error:', error)
        print('sq_error:', sq_error)
        print('mse:', mse)
    return mse
Notice that the MSE of 0.02 is smaller than each of the errors. This is because
this loss function is squaring errors which are fractions.
The absolute error, or L1 loss, is another common choice in the regression task. We can average the absolute errors over the training set for an estimate of the expected loss. This metric is called mean absolute error (MAE), and it is defined as follows:

MAE = (1/n) Σi=1..n |yi − ŷi|
Since the absolute value is computed on the errors yi − ŷi , their contribu-
tions grow linearly. Outliers will have less influence on MAE when compared
to MSE. MAE is more challenging to minimize, however, as the function is
piecewise linear and not differentiable when the error is zero. Next, let’s look
at code to calculate MAE. We will use the same data for comparability.
def mae(y, yhat, logging=True):
    # compute errors
    error = y - yhat
    # absolute errors
    abs_error = np.abs(error)
    mae = abs_error.mean()
    if logging:
        print('error:', error)
        print('abs_error:', abs_error)
        print('mae:', mae)
    return mae
Aside from variable names, the only meaningful difference is the line that
computes the loss. Now let’s see the output:
mae(y_actual, y_hat)
For this case, as the errors are fractions, MAE is larger than MSE. Let us
now compare squared error loss to absolute error loss across a range of errors.
Figure 12.2 represents squared error with a solid curve and absolute error with
a dotted curve. This confirms what was explained earlier: for fractional errors,
absolute loss will be greater. Outside of this range, squared loss will dominate.
Next, let’s turn to the binary classification task. Let p̂i and 1 − p̂i denote
the predicted probability that observation i has positive and negative label,
respectively. The true label for this observation is yi . The cross-entropy loss
is defined as:

CE(yi, p̂i) = −[ yi log(p̂i) + (1 − yi) log(1 − p̂i) ]

The loss is zero in two cases:
• the observation has a positive label and the predicted probability of the
positive label is 1: yi = 1, p̂i = 1
• the observation has a negative label and the predicted probability of the
negative label is 1: yi = 0, 1 − p̂i = 1
When there is disagreement between the actual label and the predicted prob-
ability, there will be a loss. Figure 12.3 shows the cross entropy when yi = 1
for different values of p̂i . When p̂i = 1, the loss is zero. As the probability
decreases, the model is less confident that the label is 1 and the loss increases
asymptotically.
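A minimal numpy sketch of this loss (clipping is added to avoid log(0)):

import numpy as np

def cross_entropy(y, p_hat, eps=1e-15):
    # clip predicted probabilities away from 0 and 1 to avoid log(0)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# a confident correct prediction has near-zero loss; a confident wrong prediction has a large loss
print(cross_entropy(np.array([1, 1]), np.array([0.99, 0.01])))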
We briefly reviewed some common loss functions in this section. When fitting
regression and classification models in Python (for example with sklearn), it
will not be necessary to explicitly code the loss functions. It is useful, however,
to understand how they work.
K-fold cross validation is a best practice for model evaluation and hyper-
parameter tuning, but it can be expensive. From our earlier example with 15
combinations of hyperparameters, consider running 10-fold cross validation.
Each combination will require fitting and evaluating 10 models, for a total of
150 models. For a massive dataset, it might be sufficient to use k = 5 or even
k = 3.
The sklearn module provides functionality for grid search through the
GridSearchCV function. One of the important inputs is the grid containing
hyperparameter lists. Here is a small example:
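A sketch consistent with the 15-combination example mentioned above (the estimator and the grid values are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# hypothetical grid: 3 x 5 = 15 hyperparameter combinations
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [2, 4, 6, 8, 10]}

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=10)
# grid.fit(X_train, y_train)   # training data assumed to exist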
The cv parameter can be used to specify the number of folds to use in cross
validation (10 in this case). The fit() function will train the model using
each hyperparameter combination, and a performance metric is calculated.
The final model will use the optimal hyperparameters which produced the
best metric.
12.9 Metrics
We examined loss functions, which are designed to measure errors and opti-
mize model parameters. For the purpose of understanding how well a model
performs, not all loss functions are easily understood by a human. Cross-
entropy loss, for example, is a number that approaches infinity as the pre-
dicted probability of the correct label approaches zero. For this reason, we
use metrics to measure and report on model performance, and the relevant
metrics will depend on the task.
For the regression problem, the MSE and MAE are both reasonable met-
rics. As MSE represents a sum of squared errors, the square root is often taken,
producing the root mean square error (RMSE). The RMSE then measures the
average error, irrespective of the error directions.
Another useful regression metric is R-squared (R2 ), and it represents the
fraction of variation in the target variable explained by the predictors. R2
falls in range [0, 1], where 0 indicates that the predictors explain none of
the variation (and a useless model), and 1 indicates full explanation. It is
difficult to specify a good R2 in advance as it is problem dependent, but when
comparing two models, higher R2 is better, all else equal.
Increasing the number of parameters in the model can inflate the R2 with-
out increasing explanatory power. To combat this, the Adjusted R-squared
can be computed. We will discuss the detailed calculation of R-squared and
Adjusted R-squared in the next chapter.
For the binary classification task, each outcome and its associated pre-
diction will take values 0 or 1. Useful metrics will quantify the number and
fraction of predictions that are correct (these are cases where the actual label
is equal to the predicted label) and incorrect (cases where the actual label is
not equal to the predicted label). The fraction of correct predictions is called
the accuracy, which is simple to compute but sometimes not the best metric.
After introducing additional metrics, we will learn why this is the case.
• P : a positive value, which can be used for the actual or predicted label
• N : a negative value, which can be used for the actual or predicted label
• T P (true positive): the positive label is predicted, and this is the true label
• F P (false positive): the positive label is predicted, but actual label is nega-
tive
• T N (true negative): the negative label is predicted, and this is the true label
• F N (false negative): the negative label is predicted, but the actual label is
positive
To summarize, there are two outcomes that are correct (T P and T N ), and
two outcomes that are incorrect (F P and F N ). Which error is worse – a
false positive or a false negative? This depends on the problem. In the case
of a COVID-19 test, a false positive would cause unwarranted worry. A false
negative would suggest that the person isn’t infected, which could lead to
spreading the disease and not seeking treatment.
Let’s imagine that 100 individuals take a COVID-19 test, and the results
are TP=85, FP=5, TN=7, and FN=3. We notice that most individuals tested
positive, which could make sense if they don’t feel well and this prompts the
test. From this information, we can compute quantities such as:
               predicted
                 P      N
actual    P     TP     FN
          N     FP     TN
For example, the cell F N represents the number of false negatives. This is
where the actual label is P and the predicted label is N . We can fill in the
confusion matrix with our data:
               predicted
                 P      N
actual    P     85      3
          N      5      7
In this case, there were 3 false negatives where the actual label was P and
the predicted label was N . The diagonal of the table holds the correct counts
(85+7), while the errors appear on the off diagonal (3+5). The accuracy can
be computed as follows:
accuracy = (TP + TN) / (TP + FP + TN + FN)
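For the COVID-19 test example, accuracy = (85 + 7)/(85 + 5 + 7 + 3) = 92/100 = 0.92.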
Next, let’s discuss recall and precision. Recall is the fraction of true positives
that are predicted as positive. It measures how good the classifier is at detect-
ing actual positives. The trouble is that a model can always predict positive
and attain perfect recall. As a counterbalance, we use a second metric.
Precision is the fraction of predicted positives that are correct. It measures
the accuracy of positive predictions. The trouble here is that the classifier
can make positive predictions only when it is very confident, attaining high
precision. However, it would then miss most of the actual positives, resulting
in low recall. For a classifier to be strong, it must have high precision and high
recall. When it predicts the positive label, it needs to be accurate (precision),
and it also needs to identify the positive cases (recall). Precision and recall
can be calculated as follows:
precision = TP / (TP + FP)

recall = TP / (TP + FN)
For our example, precision = 85/(85+5) = 0.944 and recall = 85/(85+3) =
0.966.
The last metric we will discuss here is the F1 score (F1 ), which is the harmonic
mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
As a single number combining recall and precision, the F1 score is very popular
and effective for binary classifier evaluation. Additionally, the harmonic mean
is more punitive than the arithmetic mean, which results in a conservative
metric. As an example, for a classifier with recall=0.2 and precision=1.0, the
arithmetic mean is 0.6, while the harmonic mean is 0.333.
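For the COVID-19 test example, F1 = 2 × (0.944 × 0.966)/(0.944 + 0.966) ≈ 0.955; equivalently, F1 = 2TP/(2TP + FP + FN) = 170/178 ≈ 0.955.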
The sklearn module provides functions for computing regression and bi-
nary classifier metrics. However, it is very valuable to understand what they
mean so they can be used and interpreted appropriately.
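As a hedged sketch of the sklearn calls, with a tiny made-up set of labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))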
The optimal parameter values will minimize risk, and optimization techniques
can be used to find these values.
We discussed hyperparameter tuning and k-fold cross validation. Models
often include hyperparameters which are not known in advance and must be
fine-tuned. This can be an expensive step, particularly when the search space
is large and grid search is used. K-fold cross validation provides a thorough
method for evaluating model performance, as each fold of the training data is
held out for validation. The technique can be combined with grid search for
hyperparameter tuning, and this is a best practice in machine learning.
Metrics are used to report on model performance. We discussed important
metrics for the regression problem, such as RMSE and R-squared, and the
binary classification problem, such as recall and precision. We will see these
metrics again as we work with regression and classification models.
This chapter was very dense with concepts. In the next chapter, we will
look at linear regression modeling in detail. Many of the concepts we just
studied will be reinforced through code and examples.
12.11 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
1. Discuss a pros and cons list that you have made, or a list that you
could make. What was the decision you were trying to make? What
were the pros, what were the cons, and how did you ultimately
combine them to produce an answer?
2. S Discuss the major difference between supervised learning and un-
supervised learning. What is one reason why unsupervised learning
might be used in favor of supervised learning?
3. Given a dataset, how do we categorize whether a supervised learning
problem should be a regression problem or a classification problem?
4. S For each problem below, indicate whether it is a regression prob-
lem (R), classification problem (C), or clustering problem (S).
a) Summer campers are signing up for club activities. The activity
each camper will select will depend on interests and who is teaching
the activities. You are asked to predict the probability that a camper
signs up for each activity.
13
Modeling with Linear Regression
We have learned how to clean and prepare data, how to summarize and visu-
alize it, and how to transform it into useful predictors. We just learned about
the building blocks of machine learning. In this chapter, we fit supervised
learning models to data when the target variable is continuous. This is the
regression task, and we will use linear regression for solving the problem.
The chapter will begin with a discussion of the important concepts and
mathematics, and continue with Python code for implementation. We will
use sklearn for the modeling. As this chapter provides an outline, further
exploration is encouraged. Additional resources can be found in the course
repo in this folder:
semester1/week_13_14_model/
Tip: Be sure to confirm the validity of any data before training a model on
it. Describing the data with statistical summaries and visualizations will be
helpful here.
εi ∼ N(0, σ²)
The normality assumption is not necessary for fitting the model, but it is a
requirement in the statistical testing.
Next, let’s look at the mathematical form for linear regression, which is stated
for each observation i and p predictors:
yi = β0 + β1 xi1 + . . . + βp xip + εi ,   i = 1, . . . , n
where
In this setup, we assume that the target can be modeled by a linear combi-
nation of the predictors. There is a linear equation for each observation, for a
total of n equations. There are p + 1 unknowns, represented by the weights β0 ,
β1 , . . . , βp applied to each predictor. The weighted combination of predictors
is an attempt to match the target value.
For example, suppose we have three observations with predictor values x = 1, 2, 3 and targets y = 2, 3, 5. The system of equations is:

2 = β0 + β1(1) + ε1
3 = β0 + β1(2) + ε2
5 = β0 + β1(3) + ε3

Trying the values β0 = 0 and β1 = 2 gives:

2 = 0 + 2 + ε1
3 = 0 + 4 + ε2
5 = 0 + 6 + ε3
For the first equation, the weighted combination of predictors exactly matches
the target. For the next two equations, however, there are errors. In fact, this
system has more equations than unknowns, and it cannot be exactly solved. It
is called an overdetermined system. An overdetermined system is better than an underdetermined system, which results when there are more unknowns than equations. Underdetermined systems do not have a unique solution.
In algebra class, you might have seen many systems where the number of
equations matched the number of unknowns. These systems can be exactly
solved, which is wonderful. In practice, it won’t generally be possible to have
a matching number of equations and unknowns; we will collect observations
which generally outnumber the predictors: n > p. To solve the overdetermined
system, we will establish a metric and an objective, such as minimizing squared
errors. This procedure will yield estimates β̂i for each parameter βi . Then the
predictions can be estimated with a fitting equation:

ŷ_i = β̂_0 + β̂_1 x_i1 + … + β̂_p x_ip
In the case of a single predictor (p = 1), we can drop the subscript on the
predictor and write the fitting equation as:

ŷ_i = β̂_0 + β̂_1 x_i
This model is called simple linear regression. In the case of multiple predic-
tors (p > 1), the model is sometimes called multiple linear regression. The
prediction ŷi is called the fitted value. The difference between the target and
the fitted value, y_i − ŷ_i, is called the residual. The residual is an estimate of
the true, unobservable error.
Minimizing the residual sum of squares SS_res(β), the sum of squared differences between the targets and the fitted values, with respect to β can be done with differential calculus,
and this will yield a parameter estimate β̂i for each parameter βi . Next, we
will review the details for the simple linear regression case.
SS_res(β) = Σ_{i=1}^{n} (y_i − (β_0 + β_1 x_i))²
There are two parameters in this model: β0 and β1 . Our strategy is to simplify
the equation and look for a critical point. This is done by taking the partial
derivative with respect to each parameter (i.e., computing the gradient), set-
ting the equations to zero, and solving the system of equations.
SS_res(β) = Σ_{i=1}^{n} [ y_i² − 2y_i(β_0 + β_1 x_i) + (β_0 + β_1 x_i)² ]
          = Σ_{i=1}^{n} [ y_i² − 2y_i β_0 − 2y_i β_1 x_i + β_0² + 2β_0 β_1 x_i + β_1² x_i² ]
Next, take the partial derivative with respect to β0 and set equal to zero:
∂SS_res(β)/∂β_0 = Σ_{i=1}^{n} (−2y_i + 2β_0 + 2β_1 x_i)

0 = −Σ_{i=1}^{n} y_i + nβ̂_0 + β̂_1 Σ_{i=1}^{n} x_i   (dividing through by 2)

β̂_0 = ȳ − β̂_1 x̄
Next, we need to determine the parameter estimate β̂_1. We take the partial derivative of SS_res(β) with respect to β_1 and set it equal to zero:
∂SS_res(β)/∂β_1 = Σ_{i=1}^{n} (−2x_i y_i + 2β_0 x_i + 2β_1 x_i²)

0 = Σ_{i=1}^{n} (−x_i y_i + β̂_0 x_i + β̂_1 x_i²)   (dividing through by 2)

0 = Σ_{i=1}^{n} (−x_i y_i + (ȳ − β̂_1 x̄)x_i + β̂_1 x_i²)   (substituting for β̂_0)

0 = −Σ_{i=1}^{n} x_i y_i + Σ_{i=1}^{n} x_i ȳ − β̂_1 Σ_{i=1}^{n} x̄ x_i + β̂_1 Σ_{i=1}^{n} x_i²   (applying sums)

0 = −Σ_{i=1}^{n} x_i(y_i − ȳ) + β̂_1 Σ_{i=1}^{n} x_i(x_i − x̄)   (grouping terms)

β̂_1 = Σ_{i=1}^{n} x_i(y_i − ȳ) / Σ_{i=1}^{n} x_i(x_i − x̄)
This is equivalent to the common equation for β̂1 , which is left as an exercise:
β̂_1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²
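To see the two formulas in action, here is a minimal sketch (not from the course repo) that applies them to the small three-point dataset used above:

import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 3, 5])

# slope: sum of x_i*(y_i - ybar) over sum of x_i*(x_i - xbar)
b1 = np.sum(x * (y - y.mean())) / np.sum(x * (x - x.mean()))
# intercept: ybar - b1 * xbar
b0 = y.mean() - b1 * x.mean()
print(b0, b1)   # approximately 0.33 and 1.5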
We have now seen how to derive the simple linear regression parameter esti-
mates using differential calculus. This is done in software, but it is a valuable
exercise nonetheless. For the more general case, we can use matrices. This will
handle the simple linear regression case that we just completed, as well as the
multiple linear regression case. This involves creating:
• A design matrix X which holds the predictor data. The n rows represent
the observations and the p + 1 columns represent the intercept term (in the
first column) and the p predictors in the remaining columns. For example,
element x12 is the value of predictor 2 for observation 1.
X = [ 1  x_11  x_12  …  x_1p
      1  x_21  x_22  …  x_2p
      ⋮    ⋮     ⋮          ⋮
      1  x_n1  x_n2  …  x_np ]
• Vectors for the parameters β, targets y, and errors ε:
β = [β_0, β_1, …, β_p]^T    y = [y_1, y_2, …, y_n]^T    ε = [ε_1, ε_2, …, ε_n]^T
Putting the pieces together allows for writing the system of equations in com-
pact matrix notation:
y = Xβ + ε
Minimizing the sum of squared residuals (the derivation is shown later in this chapter) gives the parameter estimates:

β̂ = (X^T X)^{-1} X^T y
The notation X^T and (X^T X)^{-1} denotes the transpose of X and the inverse of X^T X, respectively. In practice, computation of the inverse may be intractable
for massive datasets. In this case, an iterative approach such as gradient de-
scent (GD) can be followed. We will discuss GD in the next chapter on logistic
regression. Software packages for big data such as Apache Spark will follow
the iterative approach.
Given the parameter estimates, the fitted values can be computed. The fitting
equation can be written in matrix form as follows:
ŷ = Xβ̂
The matrix notation aligns with how we organize data for computation. A
pandas dataframe will hold the data in a NumPy array which arranges the
observations on the rows and the predictors on the columns. When fitting a
model in sklearn, the matrix of predictors X and the vector of targets y will
be passed to a function. It is not necessary to include the column of 1s in X.
Now that we have briefly covered the conceptual and mathematical frame-
work of the linear regression model, we can think about predictors and code
implementation. For a deeper dive into linear regression, [39] and [40] are ex-
cellent references. In the next section, we will briefly learn some qualitative
best practices for predictor selection.
Being Thoughtful about Predictors

When selecting predictors, a few qualitative guidelines apply:
• They should make intuitive sense. In practice, the data scientist might dis-
cuss the predictors with the line of business, the product manager, and other
stakeholders.
• They should not introduce bias or ethical issues into the model. Examples of predictors that can do this include:
• Age, race, ethnicity, gender, religion, disability, color, national origin
• Employment type, where jobs like nursing can introduce a gender bias
• An indicator if a caller pressed 2 to hear a message in Spanish, which can introduce an ethnicity bias
• Salary, which can introduce a race/ethnicity bias
Predicting Housing Prices

To put these ideas into practice, we will build a regression model for the California housing dataset available through sklearn. We begin by loading the data:

import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
Checking the data type with type(housing) shows this specialized object:
sklearn.utils.Bunch. As the object appears to contain key:value pairs, we
request the keys:
housing.keys()
OUT:
The key DESCR provides descriptive data about the dataset. The name of the
target variable is MedHouseVal, which is the median house value in units of
$100K. The first two target values are 4.526 and 3.585. Predictors include
the age of the house (HouseAge), average number of rooms (AveRooms), and
average number of bedrooms (AveBedrms). Both the target values and the
predictor values are stored in NumPy arrays.
housing['data'].shape
OUT:
(20640, 8)
There are 20640 rows, or block groups, and 8 columns containing predictors.
Note that for this data type, we can reference an attribute like data in two
ways:
housing['data']
or
housing.data
INPUTS:
• housing.data – the predictor data
• housing.target – the target data
• train_size – the fraction of data used in the training set (the remainder is used in the test set)
• random_state – a seed for replicating the random split of rows into train and test sets
OUTPUTS:
The train_test_split() function split the data into 60% train and 40% test,
producing four new NumPy arrays. We can check the number of records in
each piece by computing their lengths:
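The split call and the length check are not reproduced above; a minimal sketch, assuming a 60/40 split and an arbitrary seed:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    housing.data, housing.target, train_size=0.6, random_state=0)
print(len(x_train), len(x_test))   # 12384 and 8256 rows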
scaler = StandardScaler()
x_train_s = scaler.fit_transform(x_train)
x_train_s
x_train_s.mean(axis=0)   # column means: approximately zero
x_train_s.std(axis=0)    # column standard deviations

OUT:
array([1., 1., 1., 1., 1., 1., 1., 1.])
The column means are approximately zero and the standard deviations are
one, as we expect.
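The model-fitting call itself is not shown here; a minimal sketch, assuming the scaled training data and the y_train array from the split:

from sklearn.linear_model import LinearRegression

# fit ordinary least squares on the scaled training data
reg = LinearRegression()
reg.fit(x_train_s, y_train)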
The reg object holds the trained regression model, and we can explore its
contents. We can list its attributes and methods like this:
dir(reg)
OUT:
['__abstractmethods__',
 '__class__',
 '__delattr__',
 ...,
 'fit',
 'fit_intercept',
 'get_params',
 'intercept_',
 ...,
 'score',
 'set_params',
 'singular_']
Among the list are methods for fitting the model with fit() and extracting
the intercept estimate. Next, let’s extract the intercept estimate:
reg.intercept_
OUT:
2.060088804909497
The slope estimates are stored in the coef_ attribute:

reg.coef_
OUT:
array([ 0.81403101, 0.11883616, -0.260123 , 0.31025271,
-0.00178077, -0.04600269, -0.91689468, -0.88930004])
Let’s code the mathematical formula for the parameter estimates. Then we
can verify that the calculated results match the estimates from sklearn. As
a reminder, here is the formula:
β̂ = (X^T X)^{-1} X^T y
import numpy as np

# assumes x_train_s and y_train from the earlier split; prepend a column of 1s for the intercept
X = np.hstack([np.ones((len(x_train_s), 1)), x_train_s])
XtX = X.T @ X
betas = np.linalg.inv(XtX) @ X.T @ y_train
betas

OUT:
array([ 2.06008880e+00, 8.14031008e-01, 1.18836164e-01,
-2.60123003e-01, 3.10252714e-01, -1.78076796e-03,
-4.60026851e-02, -9.16894684e-01, -8.89300037e-01])
The vector includes the intercept term as the first element, and the slopes as
the remaining elements. This matches the estimates from sklearn. Generally,
we won’t need to explicitly use the formula, but it is good to understand what
the software is doing. Next, let’s bring up a list of the predictors to understand
what the slopes mean:
housing.feature_names
OUT:
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']
As the order of the predictors corresponds with the slope estimates, the
‘MedInc’ variable (median income in the block group) has value 0.814 after
rounding. The positive sign indicates that higher median income is associated
with a higher median house value, on average. The parameter estimates can
be interpreted as the incremental change in target value for an incremental
change in the predictor. For each additional unit in median income in the
block group, the increase in median house value would be 0.814 x 100K =
$81,400, on average, assuming all other predictors are held constant.
The regression intercept estimate of 2.06, after rounding, indicates the
average target value when all of the predictor values are zero. This is only
meaningful if it’s reasonable for all predictor values to actually be equal to
zero. In this case, it does not make sense for the average number of rooms and
bedrooms to be zero, and so the intercept interpretation is not useful.
Next, let’s use the model to make predictions on the test set. First, we will
need to apply the StandardScaler to scale the data:
x_test_s = scaler.fit_transform(x_test)
Next, we run the scaled test data through the model to predict target values
ŷ. The predict() function is used for the fitting equation ŷ = Xβ̂.
y_test_predicted = reg.predict(x_test_s)
# import statsmodels
import statsmodels.api as sm
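The statsmodels fitting code is not reproduced here; a minimal sketch that would produce a summary table like the one discussed below, assuming the scaled training data and y_train from the earlier split:

# refit with statsmodels to obtain p-values and confidence intervals
X_sm = sm.add_constant(x_train_s)
ols_res = sm.OLS(y_train, X_sm).fit()
print(ols_res.summary())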
Table 13.1 shows the parameter estimate table from the model summary.
The p-values and confidence intervals indicate that all variables except x5
(‘Population’) are significant. For x5, the p-value is much greater than 0.05,
and zero is contained in the 95% confidence interval. The next step would be
to drop this variable and refit the model.
To quantify performance on the test set, we compute the R² with the r2_score() function:

from sklearn.metrics import r2_score

print(r2_score(y_test, y_test_predicted))
OUT:
0.6078399673783987
The R2 indicates that 61% of the variation in the median house value is ex-
plained by the predictors. The set of predictors is useful, but there is a large
fraction of unexplained variation. It can help to plot the data to understand
large discrepancies between actual and predicted values, so let’s make a scat-
terplot of these quantities.
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=y_test, y=y_test_predicted)
plt.plot((-2,7),(-2,7))
plt.xlabel('actual')
plt.ylabel('predicted')
A related diagnostic is the residual plot, which graphs the residuals against the fitted values. For a well-specified model, the residual scatter has mean zero. There should not be patterns in the data. The presence of a
trend line or curve would be problematic, as this would suggest one or more
missing predictors. The residual plot here is problematic, as the residuals do
not cluster about the zero line, and there is a negatively sloping set of points.
The sloping points are a result of the target maximum value of five.
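The residual plot itself is not reproduced here; a minimal sketch of one way to draw it (the book's version may differ):

# residuals versus predicted values for the test set
residuals = y_test - y_test_predicted
sns.scatterplot(x=y_test_predicted, y=residuals)
plt.axhline(0)
plt.xlabel('predicted')
plt.ylabel('residual')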
At this stage, an analyst may spend a fair bit of time doing the mentioned
error analysis. Creating various plots and filtering and sorting the data in
different ways will provide insight. Defects in the data may be discovered and
fixed, such as incorrect predictor or target values. When updates need to be
made to the data, it is best to make the changes in code and save the adjusted
data to separate files. This will make the changes transparent and repeatable,
and it will be possible to undo the changes if necessary. For further model
improvement, there may be additional predictors that can be engineered and
included. It may also be possible to collect more data, and focus should be
given where the model is weak.
x_train_subset = x_train[:,:3]
The notation [:,:3] references the rows and columns, respectively. The colon
denotes selection of all rows, while :3 denotes selection of columns zero
through two, inclusive.
scaler2 = StandardScaler()
x_train_subset_s = scaler2.fit_transform(x_train_subset)
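The fit of the smaller model is not shown in the text; a minimal sketch, assuming LinearRegression and y_train from earlier:

# fit the smaller model on the scaled three-predictor training data
reg2 = LinearRegression()
reg2.fit(x_train_subset_s, y_train)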
We now have our new model. To assess the performance, we will first subset
the test set and scale it.
x_test_subset_s = scaler2.fit_transform(x_test[:,:3])
Next, we use the reg2 model to make predictions on the test set:
y_test_predicted2 = reg2.predict(x_test_subset_s)
The R2 is useful for measuring the fraction of variance explained by the pre-
dictors, but when comparing models, the adjusted R2 is more appropriate. The
adjusted R2 is a modification of R2 that penalizes model complexity, which
increases as additional parameters are included in a model. It is calculated as
follows:
adjusted R² = 1 − (1 − R²) (n − 1)/(n − p − 1)
where n is the number of observations and p is the number of predictors.
Next, let’s compute the R2 and the adjusted R2 for this smaller model:
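The computation is not reproduced in the text; a minimal sketch that applies the formula on the test set, assuming the objects defined earlier (the same pattern applies to the larger model with y_test_predicted):

from sklearn.metrics import r2_score

r2_small = r2_score(y_test, y_test_predicted2)
n, p = x_test_subset_s.shape
adj_r2_small = 1 - (1 - r2_small) * (n - 1) / (n - p - 1)
print('R2:    ', r2_small)
print('adj R2:', adj_r2_small)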
OUT:
R2: 0.5206094971517826
adj R2: 0.5204352155826424
For comparison, we compute the R2 and adjusted R2 for the larger model:
OUT:
R2: 0.6078399673783987
adj R2: 0.6074595526505009
There are two important observations. First, the smaller model has a lower
adjusted R2 than the larger model. This tells us that the predictors we dropped
are important in explaining the target. Second, for each model, the adjusted R² is very close to the R²; this happens because the number of observations is far larger than the number of predictors, so the complexity penalty is small.

The R² can also be related to the sums of squares. Using the decomposition SS_tot = SS_exp + SS_res and denoting
• SStot = total sum of squares
• SSexp = explained sum of squares
• SSres = residual sum of squares
• R2 = SSexp /SStot
gives

R² = SS_exp/SS_tot = (SS_tot − SS_res)/SS_tot = 1 − SS_res/SS_tot

When the model fits the data well, SS_res will be small relative to SS_tot. In turn, this will result in a large R². If the model perfectly fits the data,
SSres will be zero and R2 will be one. If the model does not explain any of
the variation in the target, SSres will equal SStot and R2 = 1 − 1 = 0. Next,
we need to explain exactly how to compute SSres and SStot .
Recall the matrix form of the model,

y = Xβ + ε

so that the residual sum of squares can be written as SS_res(β) = (y − Xβ)^T (y − Xβ).
Now we take the derivative of SS_res with respect to β. This works like differentiating a quadratic: by the chain rule, the derivative of the inner term (y − Xβ) with respect to β contributes a factor of −X^T, giving
dSS_res(β)/dβ = −2X^T(y − Xβ)

0 = −2X^T(y − Xβ)   (set equal to zero)

X^T y = X^T X β
This produces the normal equations. The last step is to solve for β. We mul-
tiply on the left by the inverse of the matrix XT X. This is done to each side
of the equation:
β̂ = (X^T X)^{-1} X^T y
13.6 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
Show that the following two expressions for the slope estimate β̂_1 are equivalent:

β̂_1 = Σ_{i=1}^{n} x_i(y_i − ȳ) / Σ_{i=1}^{n} x_i(x_i − x̄)

β̂_1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²

Hint: After expansion, you might find terms involving Σ_{i=1}^{n} (x_i − x̄). Intuitively, what is the value of this expression?
12. What is a benefit of setting the seed when calling the
train_test_split() function?
13. S Explain how R2 is related to correlation.
14. True or False: If the regression predictors fail to explain all of the
variation in the target variable, then the model should never be
used.
15. What is the purpose of a residual plot?
16. S A residual plot shows a clear pattern. What does this mean?
17. What is a useful performance metric for comparing regression mod-
els? Explain your answer.
18. S Using the sklearn module, fit a linear regression model to this
dataset: {(3, 6), (3.5, 9), (4, 8)}. Show your code and print the inter-
cept and slope coefficient estimates and the R-squared.
19. Using a spreadsheet or calculator, compute the intercept and
slope coefficient estimates from a linear regression model given the
dataset: {(3, 6), (3.5, 9), (4, 8)}. Organize and show all of your cal-
culations.
14
Classification with Logistic Regression
In the last chapter, we learned a modeling approach when the target variable
is continuous. In this chapter, we will learn a model for the case when the
target is binary. This task is widespread across domains, such as predicting
survival, the movement of a stock price, or the outcome of a sporting event.
Logistic regression is an appealing model because it can work well in practice
and the results are interpretable. The model is a good starting point before
attempting more complex models.
We will begin by outlining the mathematical framework, which makes use
of the calculus concepts we covered earlier. Following along is encouraged, but
a high-level understanding will be enough. Estimation of the parameters in
logistic regression cannot be done in closed form, and we will review gradient
descent as an iterative solution. Next, we will build a model to detect malig-
nancy, and learn how to interpret the parameter estimates. Python packages
will do the model fitting for us, but we will confirm the results by running gra-
dient descent. Lastly, we will evaluate model performance with binary classifier
metrics. Additional resources can be found in the course repo in this folder:
semester2/week_03_04_classification_and_logistic_regression/
The sigmoid function is defined as

p(t) = σ(t) = 1/(1 + e^{−t})
An example sigmoid function is shown in Figure 14.1. The function is useful
because it has domain (−∞, ∞) and range (0, 1). Given this range, the out-
put can be treated as the probability of the event. Notice the S shape and
horizontal asymptotes at zero and one when the input tends to –∞ and ∞,
respectively.
The odds of an event is the ratio of the probability that the event takes place to the probability that it does not, denoted p/(1 − p). We first simplify 1 − p, showing the dependence on t:
1 − p(t) = 1 − 1/(1 + e^{−t})
        = (1 + e^{−t})/(1 + e^{−t}) − 1/(1 + e^{−t})
        = e^{−t}/(1 + e^{−t})

The odds are then

p(t)/(1 − p(t)) = [1/(1 + e^{−t})] / [e^{−t}/(1 + e^{−t})] = 1/e^{−t} = e^t

Taking the natural logarithm of both sides gives

ln( p(t)/(1 − p(t)) ) = t
This results from exp() and ln() being inverses of one another. We now assume
that t is a linear combination of the predictors:
t = β0 + β1 xi1 + . . . + βk xik
In fact, the right-hand side is the specification from the linear regression
model. Then the equation for the log odds is
ln( p(x)/(1 − p(x)) ) = β_0 + β_1 x_i1 + … + β_k x_ik
This is the equation for logistic regression. It assumes a linear relationship
between the predictors and the log odds of the probability of the event. In the
case of a single predictor (k = 1), we can drop the subscript on the predictor
and write the equation as
ln( p(x)/(1 − p(x)) ) = β_0 + β_1 x_i
Now that the form is specified, we will need parameter estimates β̂i for each
parameter βi where i = 0, 1, . . . , k. Our technique will be to write the proba-
bility of observing a sequence of outcomes as a function of the parameters (this
is called the likelihood function), and to apply gradient descent to maximize
a version of the likelihood.
Note: This section is more mathematically heavy than the earlier sec-
tions. It will be good to attempt the section and see the application of
preliminary material, such as calculus, but it is not required.
The likelihood of the observed outcomes, as a function of the parameters, is

L(β) = ∏_{i=1}^{n} p(x_i|β)^{y_i} (1 − p(x_i|β))^{1−y_i}
The term p(xi |β) denotes the probability of the event for outcome i given
the parameter values, and the term 1 − p(xi |β) denotes the probability of a
non-event for outcome i. The exponents yi and 1 − yi allow for a compact
expression for a Bernoulli trial. For example, if the event occurs for outcome
i, then yi = 1 and 1 − yi = 0. It follows that
L(β) = ∏_{i=1}^{n} p(x_i|β)^{y_i} (1 − p(x_i|β))^{1−y_i} = ∏_{i=1}^{3} p(x_i|β) = p(x_1|β) p(x_2|β) p(x_3|β)

where, for illustration, n = 3 and the event occurs for every outcome.
Taking the natural logarithm of the likelihood gives the log-likelihood:

l(β) = Σ_{i=1}^{n} [ y_i ln p(x_i|β) + (1 − y_i) ln(1 − p(x_i|β)) ]

This follows from the properties of logarithms:

l(β) = ln ∏_{i=1}^{n} p(x_i|β)^{y_i} (1 − p(x_i|β))^{1−y_i}
     = Σ_{i=1}^{n} ln [ p(x_i|β)^{y_i} (1 − p(x_i|β))^{1−y_i} ]
     = Σ_{i=1}^{n} [ ln p(x_i|β)^{y_i} + ln (1 − p(x_i|β))^{1−y_i} ]
     = Σ_{i=1}^{n} [ y_i ln p(x_i|β) + (1 − y_i) ln (1 − p(x_i|β)) ]
Rearranging terms, the log-likelihood can also be written as

l(β) = Σ_{i=1}^{n} [ ln(1 − p(x_i|β)) + y_i ln( p(x_i|β)/(1 − p(x_i|β)) ) ]
The term on the right is the log odds, which can be replaced by the linear
term β0 + β1 xi1 + . . . + βk xik like this:
l(β) = Σ_{i=1}^{n} ln(1 − p(x_i|β)) + Σ_{i=1}^{n} y_i (β_0 + … + β_k x_ik)
Finally, substituting 1 − p(x_i|β) = 1/(1 + exp(β_0 + … + β_k x_ik)) gives

l(β) = −Σ_{i=1}^{n} ln(1 + exp(β_0 + … + β_k x_ik)) + Σ_{i=1}^{n} y_i (β_0 + … + β_k x_ik)
Gradient Descent
Now that we have an expression for the log-likelihood, we want to select the parameter values that maximize it. Gradient descent does this iteratively, updating the parameter vector at each step:
β^(t+1) = β^(t) − α ∂l(β^(t))/∂β^(t)
where α is the learning rate and β^(t) denotes the vector of parameter estimates at iteration t.
A useful analogy for understanding gradient descent is a hiker finding the low-
est point in the mountains. Imagine that she moves downhill in the direction
of greatest steepness (the gradient) for a given number of paces (the learning
rate), and then repeats this process. After a certain number of iterations, she
will reach the lowest point (the global minimum) or a point which is lowest in
the nearby area (the local minimum).
The algorithm begins by initializing the parameter estimates as β^(0). Each parameter update moves in the direction of largest decrease, which is determined by the gradient. For a small enough step size, stepping in the direction of the negative gradient will provide an update to the parameter estimates which brings
them closer to their optimal values. A larger step size can lead to faster con-
vergence, but it might miss the minimum. When optimality is reached, the
algorithm is said to converge. In practice, it may not be possible to reach
optimality (due to a bad starting guess), or it may take a very long time (due
to a small α). Stopping criteria are used to prevent the algorithm from run-
ning indefinitely: (1) a maximum number of iterations is provided, and (2) a
tolerance is set to determine when a change in the objective function is small
enough to stop the algorithm.
We will need a formula for the partial derivative of the log-likelihood with
respect to each parameter βj . This includes taking the derivative of the func-
tion ln (1 + exp(β0 + β1 xi1 + . . . + βk xik )). Recall that the differentiation rule
for the logarithm looks like this:
d/dx ln(x) = 1/x
Here, we have a composite function, so the chain rule can be used. For example,
d/dβ ln(1 + βx) = 1/(1 + βx) · d/dβ (1 + βx) = x/(1 + βx)
The function we differentiate also includes an exponential term. As a reminder,
the differentiation rule for the exponential looks like this:
d/dx e^x = e^x
We have a composite function of the exponential, so the chain rule can again
be used. For example,
d/dβ e^{βx} = e^{βx} · d/dβ (βx) = x e^{βx}
We now have all the pieces that we need, and the equation for the log-likelihood
is repeated here:
l(β) = −Σ_{i=1}^{n} ln(1 + exp(β_0 + … + β_k x_ik)) + Σ_{i=1}^{n} y_i (β_0 + … + β_k x_ik)
Applying the differentiation rule for the logarithm with the chain rule yields
the partial derivative for parameter j:
∂l(β)/∂β_j = −Σ_{i=1}^{n} [ exp(β_0 + β_1 x_i1 + … + β_k x_ik) / (1 + exp(β_0 + β_1 x_i1 + … + β_k x_ik)) ] x_ij + Σ_{i=1}^{n} y_i x_ij
The term exp(β_0 + β_1 x_i1 + … + β_k x_ik) / (1 + exp(β_0 + β_1 x_i1 + … + β_k x_ik)) is equivalent to p(x_i|β) (this will be left as an exercise), and so we can simplify
the equation to:
∂l(β)/∂β_j = −Σ_{i=1}^{n} p(x_i|β) x_ij + Σ_{i=1}^{n} y_i x_ij = Σ_{i=1}^{n} (y_i − p(x_i|β)) x_ij
Code to run gradient descent is provided below. The essential functions are
the sigmoid p(xi |β), the log-likelihood, and the gradient descent algorithm.
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def log_likelihood(y, p):
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent_logistic(x, y, alpha, n_iter):
    '''OUTPUTS -- betas: numpy array of parameter estimates; ll: list of log-likelihoods'''
    ll = []
    X = np.hstack([np.ones((x.shape[0], 1)), x])   # column of 1s for the intercept
    betas = np.zeros(X.shape[1])
    for it in np.arange(n_iter):
        p = sigmoid(X @ betas)                     # event probability for each observation
        grad = -np.dot(X.T, y - p)                 # gradient of the negative log-likelihood
        betas = betas - alpha * grad               # step in the direction of the negative gradient
        if it % 1000 == 0:                         # periodically store the log-likelihood
            ll.append(log_likelihood(y, p))
    return betas, ll
Let’s walk through the gradient descent algorithm. For understanding the
calculations and doing matrix algebra, it helps to look at the object shapes.
We will take a particular dataset, which we will study in more detail in the
next section. The predictor data x_tr has 341 observations and 2 predictors for shape (341,2). The labels y_tr are in a vector of length 341. An empty list is created for storing the log-likelihood. A column of 1s is created for the intercept term. It is placed as the first column in the design matrix using hstack(); the shape of x_tr is now (341,3). The parameter estimates (betas)
are initialized to a column vector of zeros with length 3. It will hold the
intercept and two predictors. If we check the shape of the betas, it will be
reported as (3,).
The code then enters the loop, which will iteratively update the betas. The
linear combination of predictors is calculated by computing a matrix product
between the predictor values with shape (341,3) and the betas with shape (3,).
As the inner dimensions of the product match, this is a valid operation which
will produce a column vector with length 341. Next, the sigmoid function
computes the probability of the event for each of the 341 observations. It can
do this without a loop, as it is a vectorized function.
Next, we need to compute the gradient of the negative log-likelihood, which
provides the update direction of the parameter vector. Mathematically, this
has form
−Σ_{i=1}^{n} (y_i − p(x_i|β)) x_ij

In code, this is
-np.dot(X.T, y - p)
Let’s break this down. Both y and p have the same shape, and we can compute
their difference. The term y − p represents the prediction errors for each ob-
servation. For each parameter j, we need to compute the dot product between
the prediction errors and the values of the jth predictor; this is summing over
the observations. The end result is a column vector of length 3, where element
j represents a partial derivative with respect to parameter j.
Now that the gradient is computed, we update the parameter estimates
taking a step of size α in the direction of the negative gradient. This is expected
to minimize the negative log-likelihood (or maximize the log-likelihood). For
every 1000 iterations, we report the iteration number (the progress) and store
the log-likelihood. If we plot the log-likelihood versus the iterations, we should
see it converging to a value, or limit. For an appropriate α and number of iter-
ations, the vector of parameter estimates should converge to the true values.
In the next section, we will consider a numerical example. We will first fit
a model using sklearn, and then we will call gradient_descent_logistic()
to show that the parameter estimates converge.
The Wisconsin Diagnostic Breast Cancer dataset is available from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/
breast+cancer+wisconsin (diagnostic)

The notebook for this section can be found in the course repo:

semester2/week_03_04_classification_and_logistic_regression/
logistic_regression_w_breast_cancer_data.ipynb
The target variable is called diagnosis and it takes value ‘M’ for malignant
and ‘B’ for benign. Each patient has a unique identifier saved in the id column,
and the columns f1-f30 are cell measurements that can be used as predictors.
We begin by importing modules and reading in the data:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
datapath = '../datasets/wdbc.csv'
df = pd.read_csv(datapath)
We will need to code the target values as 1 for the event ‘M’ and 0 for the non-
event ‘B.’ The predictors are selected, and the dataset is split into a training
set and test set for performance measurement.
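A minimal sketch of this preparation step follows. The choice of the two predictors and the split seed are not shown in the text, so f1, f2 and the seed here are illustrative; a 60/40 split is consistent with the 341 training rows mentioned later.

# encode the target, select two predictors, and split
df['target'] = np.where(df['diagnosis'] == 'M', 1, 0)
x = df[['f1', 'f2']].values
y = df['target'].values
x_tr, x_te, y_tr, y_te = train_test_split(x, y, train_size=0.6, random_state=0)
print('x_tr:')
print(x_tr[:5])
print('y_tr:')
print(y_tr[:5])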
OUTPUT:
x_tr:
[[19.21 18.57]
[19.59 25. ]
[10.29 27.61]
[13.85 19.6 ]
[12.47 18.6 ]]
y_tr:
[1 1 0 0 0]
Next, we train the model on the training data. There are several parameters
that can be set, such as the maximum number of iterations, but the default
settings will be used here. To turn off the penalty for large parameter values,
we include penalty=‘none’.
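The fitting call is not reproduced here; a minimal sketch is:

# fit the classifier on the training data; penalty='none' disables regularization
# in the sklearn version used by the book (newer versions use penalty=None)
model = LogisticRegression(penalty='none')
model.fit(x_tr, y_tr)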
model.predict_proba(x_tr)[:5,:]
OUTPUT
array([[0.00893677, 0.99106323],
[0.00135816, 0.99864184],
[0.95568155, 0.04431845],
[0.7259081 , 0.2740919 ],
[0.93858145, 0.06141855]])
For each row, the values are the probabilities that the tumor is benign (nega-
tive class) and malignant, respectively. For example, for the first subject, the
probability of a benign cell is 0.00893677, and the probability of a malignant
cell is 0.99106323. Since the malignant probability is greater than the default
threshold of 0.5, the predicted cell type is 1 (malignant).
Suppose we want to change the threshold and predict malignancy if the prob-
ability of the positive label is greater than 0.85. The threshold adjustment can
be done like this:
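A minimal sketch of the adjustment, shown on the training data for illustration:

# predict the positive label only when its probability exceeds 0.85
threshold = 0.85
y_pred_085 = (model.predict_proba(x_tr)[:, 1] > threshold).astype(int)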
This will compute the predicted probabilities for each subject and compare
the positive-label probabilities against the threshold for each subject. Since
we now need to be more confident in predicting malignancy, we can expect the
precision to be higher (fewer false positives) but the recall to be lower (more
false negatives).
The sklearn package allows for easily training the model, but it abstracts
away the mathematics. To dive deeper, we will make the connection between
what the code is doing and how the sigmoid function works. We can extract
the parameter estimates for the intercept and slopes, and apply the sigmoid
function to calculate the probability of malignancy for a sample patient.
# parameter estimates
b0 = model.intercept_
b1 = model.coef_[0][0]
b2 = model.coef_[0][1]

# predictor values for the first subject in the training set
x1, x2 = x_tr[0]

# sigmoid
1 / ( 1 + np.exp(-(b0 + b1 * x1 + b2 * x2) ))
OUTPUT:
array([0.99106323])
This value matches the probability from the predict proba() function.
# import statsmodels
import statsmodels.api as sm
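The statsmodels call is not reproduced here; a minimal sketch that would produce a summary table like the one discussed below, assuming x_tr and y_tr from above:

# refit with statsmodels to obtain p-values and confidence intervals
X_sm = sm.add_constant(x_tr)
logit_res = sm.Logit(y_tr, X_sm).fit()
print(logit_res.summary())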
Table 14.1 shows the parameter estimate table from the model summary. The
p-values and confidence intervals indicate that the intercept and predictors are
significant. The positive sign on predictors x1 and x2 indicates that they both
increase the probability of malignancy. Next, we will learn how to compute
the magnitude of the increase.
Since the model relates the log odds to the linear combination of predictors,
the parameter interpretation is different from the linear regression case. We
can think about two subjects A and B with identical data except for a single
predictor where the value differs by 1 unit. Suppose the data looks like this:
subject x1 x2
A v+1 w
B v w
In this case, the x1 variable differs by one unit between the subjects. This can
be done with any predictor to yield an analogous interpretation. We will now
write a formula that compares these two subjects with their data. Recall the
formula for the odds of the event, with the linear combination of predictors
substituted:
p(x)/(1 − p(x)) = exp(β_0 + β_1 x_i1 + … + β_k x_ik)
Denote subject A data as x_A and subject B data as x_B. Next, we form the ratio of the odds, called the odds ratio (abbreviated OR), for the two subjects:

OR = [ p(x_A)/(1 − p(x_A)) ] / [ p(x_B)/(1 − p(x_B)) ] = exp(β_0 + β_1(v + 1) + β_2 w) / exp(β_0 + β_1 v + β_2 w)

Two properties of the exponential simplify this ratio. First,

exp(a)/exp(b) = exp(a − b)

This will simplify the ratio greatly, since the same term in the numerator and denominator yields a multiplicative factor of 1. Second,

exp(a + b) = exp(a)exp(b)

This allows for writing simple factors in the numerator and denominator, which cancel. Returning to the odds ratio, the equation becomes

OR = exp(β_1(v + 1) − β_1 v) = exp(β_1)
The odds ratio has a very simple form and interpretation: increasing the value
of predictor j by one unit will multiply the odds of the event by the factor
exp(βj ). Returning to the parameter estimate table, the coefficient on x1 is
1.1042. Increasing x1 by one unit will multiply the odds of malignancy by a
factor of e^1.1042 = 3.017.
To understand from the parameter estimate if the odds will increase, de-
crease, or remain unchanged, we can consider the case where the value is zero.
In this case, e0 = 1, which means that the odds are multiplied by a factor of
one (it is unchanged). A positive parameter estimate will increase the odds of
the event, while a negative parameter estimate will decrease the odds of the
event.
To confirm these estimates, we run our own gradient descent implementation on the training data:

betas, ll = gradient_descent_logistic(x_tr,
y_tr,
alpha=1e-4,
n_iter=2e6)
Next, we evaluate the sklearn model on the test set with a classification report:

from sklearn.metrics import classification_report

print(classification_report(y_te, model.predict(x_te)))
precision recall f1-score support
The first section in the report shows metrics for the cases with a negative
label (0) and a positive label (1). Focusing on row (1), the precision indicates
that 85% of the predicted positives were truly positive. The recall indicates
that 75% of the true positives were predicted as positive. The F1 score is
the harmonic mean of the precision and recall, valued at 0.80 or 80%. The
support indicates there were 77 positive-labeled cases. The calculations for
row (0) are analogous, and they represent the classification ability for the
negative-labeled cases. Given the F1 score of 0.91, we learn that the model
does better on negative labels than positive labels. From the second section
in the table, we see that the accuracy, or fraction of correct predictions, was
0.87, or 87%. For data where the number of positive labels is different from
the number of negative labels, the accuracy is not very useful.
In practice, it can be helpful to set a range of thresholds and compute recall
and precision for each. The threshold giving the highest F1 score, for example,
might then be selected. Next, we report metrics for a sample of thresholds:
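The code that computes the metric arrays is not reproduced here; a minimal sketch using sklearn's precision-recall utilities on the test set:

from sklearn.metrics import precision_recall_curve

# precision, recall, and the corresponding thresholds
pre, re, th = precision_recall_curve(y_te, model.predict_proba(x_te)[:, 1])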
print('threshold:', th[100:105])
print('')
print('precision:', pre[100:105])
print('')
print('recall:   ', re[100:105])
OUTPUT:
For threshold 0.523, the precision and recall are approximately 0.87 and 0.74,
respectively. Moving the threshold up to 0.564 (the last shown threshold)
286 Classification with Logistic Regression
increases the precision and decreases the recall. This makes sense, as a higher
threshold makes it more stringent to predict a positive label. This reduces the
number of false positives, which increases precision. At the same time, there
are fewer predicted positives, and this increases the number of false negatives
(cases which were actually positive but predicted as negative). The F1 scores
for the lower and higher thresholds are 0.8028 and 0.7971, respectively, which
suggests that the lower threshold is the better of the two.
A different model will likely produce different metrics. While the threshold
may be changed to give better results, it is always preferable to build the best
model possible, and then try different thresholds to tune it. The model and
optimal threshold can then be saved to predict outcomes on new data.
14.4 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
Show that Expression I is equivalent to Expression II.

Expression I:
exp(β_0 + β_1 x_i1 + … + β_k x_ik) / (1 + exp(β_0 + β_1 x_i1 + … + β_k x_ik))

Expression II:
1 / (1 + exp(−(β_0 + β_1 x_i1 + … + β_k x_ik)))
import numpy as np
a = np.array((1,2,3))
b = np.array([(1,2,3),(1,2,3)])
np.hstack((a,b))
15
Clustering with K-Means
We don’t always have the luxury of labeled data for supervised learning. A
more common use case is that a dataset {xi }ni=1 is not labeled, but we want to
identify substructure such as groupings. Clustering techniques are also helpful
in finding outliers; such points may be located in isolated clusters or small
groups. We will begin this chapter by studying concepts essential to clustering.
Then we will learn about k-means clustering and work out a small example.
Lastly, we will apply k-means to a nutrition dataset of roughly 9000 foods.
Let’s narrow the focus to three students who have taken two exams. Their
scores are as follows:

student   exam 1   exam 2
A         70       82
B         75       80
C         95       90
What is a reasonable way to group these students based on their exam scores?
First, we would probably agree that a group, or cluster , should consist of
students who are more similar to each other. If two students are different, then
they should be in different clusters. To measure similarity, we can represent
each student as a point based on data. We can then measure the distance
between points using a distance metric. There are many possible distance
metrics, and the particular problem might define the metric for us. In this
example, we will use Euclidean distance, which is a straight-line distance.
Figure 15.1 provides an illustration with students A and B as points.
Notice where the horizontal and vertical lines cross, as it shows three things:
The distance between A and B will be the hypotenuse of the formed right triangle. The horizontal side of the triangle represents the difference in scores for exam 1, 75 − 70, while the vertical side represents the difference in scores for exam 2, 82 − 80. The distance may then be calculated as √((75 − 70)² + (82 − 80)²) = √29 ≈ 5.39 according to the Pythagorean theorem (which you might recall as c = √(a² + b²)).
Euclidean distance can be stated as a formula for two vectors x and y, each
with length p:
d(x, y) = √( Σ_{i=1}^{p} (x_i − y_i)² )    (15.1)
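As a quick illustration (not from the course repo), the formula can be coded directly; the vectors below are the exam scores for students A and B:

import numpy as np

# Euclidean distance between two vectors of equal length (equation 15.1)
def euclidean(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean([70, 82], [75, 80]))   # about 5.39, matching the calculation above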
15.2 K-Means
The k-means algorithm divides n objects, or points, into k clusters by placing
each point in the cluster with the nearest centroid. Figure 15.2 shows k-means
clustering for the three-student example. The clustering uses two clusters, and
the × marks show the centroids. To understand how the centroids are decided,
we’ll need to study the steps of the algorithm, which we do next.
Given the number of clusters k, there are different methods for initializing the
centroids. A common approach is to randomly select k of the points and make
them centroids.
Each point is then assigned to the closest centroid based on Euclidean distance. This is the assignment step. Next, each centroid is recomputed as the mean of the points currently assigned to its cluster. This is the update step.
The pair of steps (assignment, update) is repeated until there are no assign-
ment changes or the algorithm has run the allowed number of iterations.
1) Since the algorithm uses Euclidean distance, the clustering pattern of the
data will be important. Specifically, it will do well when the clusters are spher-
ically shaped. It won’t do well if there are concentric rings of data or complex
shapes; an alternative method like DBSCAN will work better for these cases.
4) K-means will always give an answer. It may take a large amount of time to
evaluate the results, iterate, and understand if the clusters are useful.
5) The results can be highly dependent on the variables and observations used.
For example, including or dropping a variable can produce different clustering
results. This should be explored as part of testing.
Initialization
Let k = 2. Initialize the process by placing centroid 0 at A and centroid 1 at
C. These points are used as the initial averages.
Iteration 1: Assign
Based on the centroids, we need to assign the points to their closest centroid.
To do this, we first need to measure the distance of each point P to each
centroid as follows:
For example, d(A, centroid1) = √((95 − 70)² + (90 − 82)²) ≈ 26.2. Now we can make assignments:
Iteration 1: Update
Now we update the centroids by selecting students in each cluster and comput-
ing the averages of each exam. For centroid 0, we average exams for students
A and B. Only student C is assigned to centroid 1. Here are the updated
centroids:

centroid   exam 1   exam 2
0          72.5     81.0
1          95.0     90.0
Iteration 2: Assign
Given the updated centroids, we need to recompute distances between the
points and centroids. Here are the new values:
Since centroid 1 hasn’t moved, its distances from students A and B are un-
changed; this means the points’ assignments are unchanged. Student C is
already at a centroid, thus its assignment won’t change.
Iteration 2: Update
Since the assignments haven’t changed, the centroids won’t change and the
algorithm stops. We now have final cluster assignments:
student centroid
A 0
B 0
C 1
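The same result can be reproduced with sklearn. This is a minimal sketch (not from the book's materials) that initializes the centroids at students A and C, as in the worked example:

import numpy as np
from sklearn.cluster import KMeans

scores = np.array([[70, 82], [75, 80], [95, 90]])   # students A, B, C
init_centroids = scores[[0, 2]]                     # start centroids at A and C
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(scores)

print(km.labels_)            # [0 0 1]: A and B share a cluster, C is alone
print(km.cluster_centers_)   # [[72.5 81. ] [95.  90. ]]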
Each centroid provides predicted values for the points in its cluster. Student A
earned a 70 on exam 1 and an 82 on exam 2. The student is assigned to cluster
0, which has a centroid with a predicted exam 1 score of 72.5 and a predicted
exam 2 score of 81. Likewise, students B and C each have two exam scores
and two predicted scores. Similar to the case of regression, we can’t measure
accuracy by summing deviations between actual values and predictions, since
cancellations will occur. Instead, we compute sums of squared deviations.
1) For each data point i, calculate its mean distance from all other points in
its cluster, calling it a(i). This measures how well the point is aligned in its
cluster.
2) For each data point i, compute its mean distance from all of the points in a
different cluster. Repeat this mean calculation for each of the different clusters
and identify the cluster with smallest mean. This is the neighboring cluster of
point i, since it is the next-best cluster. The mean distance to points in the
neighboring cluster is denoted b(i).
3) The silhouette of point i is then

s(i) = (b(i) − a(i)) / max{a(i), b(i)}
A large difference b(i)−a(i) indicates a good fit between point i and its cluster.
In the extreme case where each point in a cluster is identical, a(i) = 0 for each
of these points. The denominator normalizes values to fall in the range [–1,1].
For the extreme case mentioned, the numerator would be b(i) − a(i) = b(i)
and the denominator would be max{a(i), b(i)} = b(i). The silhouettes are
then s(i) = b(i)/b(i) = 1.
4) The silhouette score is calculated as the mean s(i) over all i. For the case
where each point in a cluster is identical, s(i) = 1 for all i, and the silhouette
score will be 1.
The silhouette score can be calculated in Python using sklearn and we will
see this illustrated in the next section.
The nutrition dataset used in this section can be downloaded from Kaggle:

https://www.kaggle.com/datasets/trolukovich/
nutritional-values-for-common-foods-and-products?
resource=download
The Jupyter notebook for this demo can be found in the course repo:
/book_materials/kmeans_demo.ipynb
# import modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
Next we read in the CSV file and specify a list of variables. We will use a
subset of the variables to quickly show some results.
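The read step is not reproduced here; a minimal sketch, with an illustrative file path:

# read the nutrition CSV (the file name and location are illustrative)
df = pd.read_csv('../datasets/nutrition.csv')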
# specify variables
vars = ['calories', 'total_fat', 'saturated_fat', 'cholesterol',
        'sodium', 'choline', 'folate', 'folic_acid', 'niacin',
        'pantothenic_acid', 'riboflavin', 'thiamin',
        'vitamin_a', 'vitamin_a_rae', 'carotene_alpha']
Next, we will select only the columns of interest. This consists of the variables
and the name column, which holds the food names. Note that lists can be
added as list1 + list2.
df = df[['name'] + vars]
A thorough analysis might individually impute missing values for each vari-
able. Outliers should also be treated carefully. Here, we drop any row which
has a missing value, and outliers are not treated.
The calories variable contains integers, while the other variables contain
strings due to the inclusion of units such as “10.5mg” or “10.5 mg.” Note
that in a proper database, units would be in a separate field from values,
but data in the wild may have all sorts of complications. Our strategy will
be to loop over each variable, determine which are strings, retain only the
numeric portions, and cast the values to floats. Then “10.5mg” or “10.5 mg”
will become 10.5. The regular expression pattern, or regex, that we will use is
this: [.|\d]+. Regexes can be cryptic to the uninitiated, but with exposure
and practice comes skill. The plus sign (+) denotes one or more occurrences,
and it allows for retaining all dots (.) and digits (\d). For small tasks, simply
searching for a regex should give the right form. An excellent tool for building,
testing, and debugging regexes is https://regex101.com/
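A minimal sketch of the preparation just described (the book's notebook may differ): drop rows with missing values, then keep only the numeric portion of each string column and cast it to a float.

# drop rows with any missing values; outliers are left untreated here
df = df.dropna()

# retain the numeric portion of each string column using the regex described above
for v in vars:
    if df[v].dtype == object:
        df[v] = df[v].str.extract(r'([.|\d]+)', expand=False).astype(float)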
After this completes, all of the predictor data will be numeric. Table 15.1
shows a sample of the rows and columns.
Observe the different scales of the columns: calories and saturated fat may
differ by three orders of magnitude. Since k-means uses Euclidean distance
between points, it is essential to scale the data. We apply StandardScaler()
on the variables as follows:
# set up the scaler
scaler = StandardScaler()
# fit k-means
kmeans = KMeans(n_clusters=clus,
random_state=rand,
n_init=100).fit(scaled_data_sub)
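The surrounding pieces, the scaled array, the range of cluster counts, and the silhouette-score loop, are not shown above. A minimal sketch with assumed names (df_sub is taken to be the subset of rows used in the demo, and the seed and cluster range are illustrative):

rand = 0
scaled_data_sub = scaler.fit_transform(df_sub[vars])

clus_range = range(2, 11)
scores = []
for clus in clus_range:
    kmeans = KMeans(n_clusters=clus, random_state=rand, n_init=100).fit(scaled_data_sub)
    scores.append(silhouette_score(scaled_data_sub, kmeans.labels_))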
scores
Next we plot the silhouette scores, which are shown in Figure 15.4.
plt.bar(clus_range, scores)
plt.xlabel('number of clusters')
plt.ylabel('silhouette score')
The clustering with 3 groups edges out the others, and we will use this model.
The silhouette score is slightly above 0.6, which indicates cohesion in the
clustering. Since we didn’t save the earlier models, we refit k-means using
k = 3. Note the function np.argmax() which returns the location holding the
maximum score. For example, np.argmax([1,5,2])=1 since position 1 holds
the maximum of 5 (recall that arrays begin with position 0 in Python).
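A one-line sketch of this selection, assuming clus_range and scores from the loop above:

# cluster count with the highest silhouette score
k_best = clus_range[np.argmax(scores)]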
kmeans_final = KMeans(n_clusters=k_best,
random_state=rand,
n_init=100).fit(scaled_data_sub)
Now that we’ve trained the model, each food has been assigned to a cluster.
The number of foods in clusters 0, 1, and 2 is 162, 11, and 2, respectively.
The small number of foods in cluster 2 suggests that it may contain outliers;
it can be flagged for follow-up.
It will be helpful to review the foods in each cluster, and to observe how
their nutritional values differ in aggregate. To do this, we will return to the
dataframe, select the appropriate rows and columns, and append the cluster
assignments.
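A minimal sketch of attaching the assignments, assuming the df_sub dataframe used in the demo:

# append the cluster assignment for each food
df_sub['cluster'] = kmeans_final.labels_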
df_sub.groupby(df_sub.cluster).agg(func=np.mean).round(3)
Figures 15.5 and 15.6 show boxplots of select nutrition variables segmented
by cluster. Cluster 1 stands out from the others, with the majority of foods
having higher calories and total fat.
Lastly, we can print the list of foods in each cluster. In the interest of space,
we show up to ten foods per cluster.
Cluster 0 is large and diverse. Cluster 1 contains nuts, oils, and butter, which
are fattening foods. It is interesting that these foods were grouped. Cluster 2
is very small and may hold outliers. It contains freeze-dried chives, which are
roughly 10 times the calories of raw chives. Better clusters might be extracted
from this data with more effort.
15.5 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
16
Elements of Reproducible Data Science
It is a great outcome when a model from the data science team is identified
for deployment to production. At the same time, the process of moving the
model from development to production takes a lot of work. For organizations
that do not have a lot of experience productionizing machine learning models,
the steps can seem unclear, onerous, and manual. Concrete, common steps
include:
• Sharing code
• Setting up testing and production environments with sufficient software and
hardware (e.g., dependencies)
• Testing the model and data
• Ensuring that the data pipelines are properly processing the data and feeding
the model
• Setting up a secure API endpoint to serve requests
• Collecting and storing artifacts and predictions
• Returning results at the appropriate frequency (e.g., batch or real time)
• Standing up a model monitoring system
• Setting up a pipeline for model retraining
These steps fall in the domain of MLOps, which is often handled by members
of the engineering team. While a different team owns this process, data scien-
tists will need to actively contribute their knowledge. For example, they will
advise on which metrics should be monitored. The better that data scientists
understand the path to production, the more they can help. This can allow for
a smoother handoff of the model, saving time and effort. Additionally, tight
integration between data science and MLOps can help ensure that things are
working as expected.
In this chapter, we will explore some of the items listed above. In par-
ticular, we will focus on the practices that help make data science products
reproducible. The first habit is to maintain code in a collaborative reposi-
tory such as GitHub. The second habit is to develop and include tests with the code, the data, and the model output. A third is to run the work in a consistent environment, such as a container.
16.2 Testing
There are many kinds of testing in software development, and we will briefly
review some pertinent tests for data science. Unit testing gives developers
confidence that their code performs as expected. The process consists of writ-
ing unit tests, which are used to test discrete units of code such as functions.
Writing tests for each small piece of code enables rapid testing, isolation of
bugs, and the surfacing of unexpected results. For these reasons, it is good
practice for data scientists to write unit tests when developing code. It is also
important to include clear comments on what each test does. This will allow
for verifying that the tests match expectations.
Testing the data and the model output is also highly recommended. This
practice can help to identify issues that may arise when new data is collected,
when the code evolves, or when the model is updated. Particularly in machine
learning, it is possible for the model to receive malformed or intermittent
data and return an incorrect result. In such an event, the bad result may go
undetected since nothing breaks.
Testing the stability of model output is also essential. For example, a data
scientist may train a model that performs well. She may proceed to refine
the application code for release to engineering, while keeping the model un-
changed. Confirming that the model produces the same predictions for test
cases will be important. A set of test cases with model inputs and outputs
(e.g., predicted probabilities) can be created and saved with the code. Their
careful design is essential, since their coverage defines the scope of testing.
The tests may include:
The practice of writing tests on the model output allows the data scientist
to assert that the model predictions match known values, both at present
and over time. Additionally, this allows others to test the application, such
as quality assurance (QA) engineers. Since QA engineers are a step removed
from the data science project, it will be harder for them to detect issues; the
tests will provide a layer of support.
Python has modules for unit testing, and one of the modules is unittest. It
provides a rich set of tools for constructing and running tests. This includes a
set of methods for Boolean verification such as assertEqual(), which checks
if two values are equal. For example, we might assert that a given model
input will produce an expected output. The output may be known from an
earlier call to the model. For the case where the prediction does not match
the expected value, the test would fail and issue an AssertionError. The
code block below illustrates two simple unit tests. The first test passes and
the second test fails. An explanation follows the code.
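The code block referenced above is not reproduced in this excerpt; a minimal sketch of two such tests, with illustrative expressions (the second assertion is deliberately false, so that test fails):

import unittest

class SimpleTests(unittest.TestCase):

    def test_remainder_passes(self):
        actual = (100 % 2 == 0)      # True: 100 divided by 2 has remainder 0
        expected = True
        self.assertEqual(actual, expected)   # passes

    def test_remainder_fails(self):
        actual = (101 % 2 == 0)      # False: 101 divided by 2 has remainder 1
        expected = True
        self.assertEqual(actual, expected)   # fails: AssertionError: False != True

if __name__ == '__main__':
    unittest.main()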
The unittest module includes the TestCase class, which provides a number
of methods to check for and report failures. In the first test, we compute an
actual value which evaluates to True (since 100 divided by 2 has remainder 0).
The expected value is also True. The assertEqual() method checks that the
actual value matches the expected value. Since these values match, the test
passes. For the second test, we compute an actual value which evaluates to
False. The assertion checks if False is equal to True, which is a false statement.
The output includes the error:
raise self.failureException(msg)
AssertionError: False != True
Upon reviewing the output, the data scientist can investigate the code, correct
the error, and rerun the code. This can be repeated as necessary until all of
the tests pass.
The unit tests above used simple arithmetic to produce the actual values.
In practice, a more common way to generate actual values will be to pass
the inputs to the model and run inference (make the predictions). Next, the
relevant output can be used for the actual values. For a classification task,
the actuals might be predicted probabilities. For a regression task, the actuals
will be the predicted target values.
There are times when a test fails because it is written improperly. The
developer needs to carefully review the tests and run them, to ensure they
meet expectations. This may sound circular or excessive, but it can happen and
faulty logic is often the culprit. One way to mitigate these errors is to design
tests consistently. Specifically, testing some functions with assertEqual()
and others with assertNotEqual() can lead to confusion. It can be better to
design the latter tests to check for equality.
16.3 Containers
After the data science team completes a model, it will be delivered to engineer-
ing. It will be staged in a testing environment to ensure that everything works
properly. From there, the model will be promoted to a production environ-
ment for deployment. Oftentimes when running the code on another machine,
there may be compatibility issues which cause failure. There may be a differ-
ence in Python versions or Python modules, for example. It can take great
effort to keep all machines running the same package versions (consider some
organizations use dozens or more machines). Moreover, it may not even be
possible to sync all versions. For example, two different ML applications may
require different versions of a module.
The container emerged as a tool to maintain consistent, isolated project
environments. It is a standalone, executable package containing everything
needed to run an application. This will include software, configurations, and
variables. The package can then easily be shipped to another machine, and
run on that machine. It is even possible for a single machine to run multiple
containers with different package versions. For example, a cloud server might
run several machine learning models at the same time.
Virtualization is technology that uses software to simulate hardware func-
tionality, thereby creating a virtual computer system. Containers are designed
to share the operating system of a machine, rather than requiring their own
copies. This is particularly helpful when multiple containers run on the same
machine. Based on this feature, we say that containers are lightweight.
The universally accepted, open-source tool for containerization is Docker .
Getting started with Docker includes learning some of the terminology and
basic commands, which we outline next. A Docker image is a read-only file
that contains instructions for creating a container. We can think of the image
as a template. New images can be created, and they can also be shared.
The online repository DockerHub has a large array of Docker images for the
community (see dockerhub.com). It is similar to GitHub for Docker images.
A Docker container is a runtime instance of a Docker image. As we will
see shortly, it is created by running the docker run command. A Docker
container can run on any machine that has Docker installed. This allows great
portability, as Docker can run on machines including a laptop, local server,
or cloud server. Docker runs on operating systems including Linux, Windows,
and Mac.
Docker can be installed on a laptop, a local server, or a server in the cloud. Installing on a Linux server requires running steps at the command line. Details
can be found on the Linux installation page. For the purpose of this demonstra-
tion, we will install and run Docker Desktop for Windows. The steps involve
downloading an executable file and following the configuration steps in the
wizard. The steps for installing on a Mac are very similar.
After the Docker installation completes, we can test that it installed by
opening a terminal and typing docker at the command line. Figure 16.2 shows
some of the output in PowerShell. It shows that Docker is found, and there
is a listing of common commands. If Docker is not found, the install can be
rerun.
command purpose
docker build build an image from a Dockerfile
docker run run a container from an image
docker images list all images downloaded
docker ps list all running containers
docker stop stop a running container
docker rm remove a stopped container
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# load data, split (the split settings here are illustrative), fit, and report the test R-squared
housing = fetch_california_housing()
x_train, x_test, y_train, y_test = train_test_split(housing.data, housing.target, random_state=0)
reg = LinearRegression().fit(x_train, y_train)
y_test_pred = reg.predict(x_test)
r2 = r2_score(y_test, y_test_pred)
print('R-squared:', r2)
We can test the script by opening a terminal, navigating to the directory with
the script, and running the command:
python housing_script.py
R-squared: 0.6092116009090456
For this simple model, the only module in use is scikit-learn. We include
it in the requirements.txt file with its corresponding version:
scikit-learn==1.0.2
Next, we create the Dockerfile to set up the Python environment for running
the script. Note that the file name does not have an extension. The Dockerfile
uses specific commands, which are explained below.
FROM python:3.9
WORKDIR /src
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python","housing_script.py"]
Lastly, we will build the Docker image. We can first verify that Docker is
running by opening a terminal and running this command:
docker info
If Docker is not running, an example error like this might appear in the output:
Server:
ERROR: error during connect: this error may indicate that
the docker daemon is not running...
Once we verify that Docker is running, we can navigate to the directory with
the Dockerfile and run this command:

docker build -t housing .

where housing is the name given to the image. Note the dot . at the end of
the command to build in the current directory. As the image builds, logging
will stream to the console, ending with a message indicating a successful build.
We can then list the downloaded images with the command:

docker images
The output shows the housing image with its unique identifier (IMAGE ID)
and size. We can then build and run a new container from the image using the command:

docker run housing

The output is the same R-squared that we encountered when we ran the
Python script earlier:
R-squared: 0.6092116009090456
This validates that Docker has created a container and run the machine learn-
ing script successfully in that container. We can see a list of all containers with
the command
docker ps -a
Figure 16.2 shows the results following the creation of three containers based
on the housing image. The output includes a unique identifier (CONTAINER
ID) and a randomly created name for each container, among other informa-
tion.
16.5 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
The use of models invariably presents model risk, which is the potential for ad-
verse consequences from decisions based on incorrect or misused model outputs
and reports.
This definition comes from the Federal Reserve's supervisory guidance on model
risk management (SR 11-7): https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm
We can think about the entire process as the model. In this way, we consider
not only the risk from, say, a mathematical or statistical model, but also the
risk from the systems running the model and the input data. Each of these
components can also introduce risk that needs to be managed.
Some divisions or entire companies make decisions based on models run-
ning in production. Incorrect predictions can lead to poor decision making and
financial loss. Biased data and models can cause reputational damage and sys-
tematic harm to individuals. Copyright infringement can lead to lawsuits from
content owners. An enlightening book replete with hard-hitting case studies
on model risk is Weapons of Math Destruction [46].
Some companies never recover from bad models. One doesn’t need to
search for very long to find articles in the news about problematic models
which led to financial meltdowns, disparate treatment, legal risk, and reputa-
tional damage. Here are some examples:
• Zillow used predictive models to fuel the purchase and flipping of homes in
their Zillow Offers program. During the COVID pandemic, their predictions
were highly inaccurate, leading to losses in excess of $400 million and a 25%
reduction in staff.
• Stability AI developed a model to create images from a text description. The
firm is currently embroiled in a lawsuit with Getty Images, which claims that
Stability AI unlawfully scraped millions of images from its site. Getty Images
believes that Stability AI used these images in training its model.
The purpose of this chapter is not to scare the reader away from data science,
of course, but to bring awareness of what can go wrong and how to prevent
it. I spent the early part of my career building models for investment banks
and asset managers as a Quant. One of my roles a bit later was in model
validation, where I examined, tested, and challenged an array of models for a
major commercial bank. My group was independent from model development,
mandated by the federal government, and designed to provide another effective
line of defense against model risk. The work that my team did gave me a
tremendous appreciation for managing model risk, and it made me a better
model developer. It opened my eyes to the “gotchas” hidden in models, such
as predictors that might lead to unfair treatment, assumptions that might not
hold up in practice, and patterns that likely would vanish.
Model validation is an essential function that is not restricted to banks or
the financial sector. Anywhere that a model is used, it introduces risk that
needs to be measured and managed. The person doing this work must review
the model carefully to ensure proper use, correct design and logic, proper
data handling, robust performance measurement and testing, and complete
documentation. From the standpoint of a model developer, while validation
introduces extra work, it brings the benefit of a fresh perspective. An effective
validation will challenge the model and likely strengthen it.
A model development document (devdoc) captures the key information about
a model and the decisions made in building it. At a minimum, it might include
items such as:
• Model purpose
• Names and titles of stakeholders
A model validator will find the devdoc extremely useful in gaining a better
understanding of the end-to-end process. She can then write her own report
on findings from the validation process. For the lifetime of the model, the
documentation will grow and evolve to reflect the latest state.
It is helpful to summarize the high-level details of a model in one place,
and the model card is such a tool (though this name is not universal). His-
torically, organizations that developed models were typically the only ones
to use them. This is no longer the case. Increasingly, models are a shared
resource. A principal reason for this is cost: building state-of-the-art mod-
els, such as models that understand language, requires massive datasets and
expensive computing resources. It also requires specialized expertise. Fortu-
nately, many of these leading machine learning models are freely available on
platforms such as Hugging Face (https://huggingface.co). Repurposing an
appropriate model can provide large savings and accelerate development. An
important question is whether a model is appropriate for a given use case,
and this is where model cards can help. An informative model card can pro-
vide transparency and safety, which helps foster model use. In practice, the
depth of information provided in model cards runs the gamut from empty to
complete. A thorough card typically covers the model's intended use, training
data, evaluation metrics, and known limitations.
The model is often where a good portion of the complexity and risk is located.
This is because models are abstractions of reality where a tradeoff is made.
Consider face-recognition algorithms. How can they still recognize someone
with glasses or a slightly different hair style? The model does not attempt to
match every pixel, but rather it learns a representation, or simplified structure,
of the face. The model essentially condenses the information down into a subset
of information which may be sufficient to solve the task at hand. A similar
tradeoff happens when converting an image in RAW (uncompressed) format to a
JPEG, for example. Does the simplification work? In some cases, it does. The
JPEG is a lower-fidelity, compressed image, but it is sufficient for sharing photos
on a mobile app; it won't be good enough for museum-quality prints, however.
The models used in data science and machine learning will make assump-
tions to similarly abstract away details to simplify the problem. We might
assume that a variable follows a normal distribution, when this may not be
exactly true. Does it matter if a variable isn’t quite normally distributed? This
depends. If the variable under study is the percentage change of a stock price,
and the model will be used in a trading system, then it certainly matters.
Suppose the real-world probability of a large negative return is greater than what
the normal distribution would suggest; that is, the true distribution has a heavier
left tail. A model that assumes normality will underestimate this risk, which
could lead to serious financial losses.
Each model brings a level of complexity which depends on factors like the
architecture and the number of predictors. We should strive to use only the
level of complexity necessary to solve the problem. There may be a temptation
to use the most advanced methods to solve a problem, but if it can be solved
with linear regression, by all means use this method. However, if a non-linear
relationship exists between predictors and a target, then a more flexible (and
complex) model may be warranted. The data should be the guide, and not
publication trends.
When I was a graduate student, one of my areas of specialization was
regime-switching models. In the right circumstances, these models were very
useful and powerful. They use additional parameters compared to linear
regression models; the extra parameters add flexibility, but also complexity.
When reviewing the data and model inputs, we should verify conditions such
as the following:
• The variable will be readily available in the future. If the variable will need
to be processed in real time, but the system requires two hours to process
it, then we should not include the variable in the model.
• The variable does not induce bias.
• The variable makes intuitive sense in the model.
• The data is accurate. When possible, it should be cross-referenced.
• We are permitted to use the data. For example, unauthorized use of copy-
righted material can provoke a lawsuit.
After reviewing the data and model inputs, it is also important to ensure that
the dataset is sufficient. For example, if we are building a risk model, then the
data should include observations from a period of heightened risk. For cyclical
data, it is important to include at least one full cycle in the dataset.
The metrics we have discussed earlier can be used to understand how the
model is performing in aggregate. This may include adjusted R-squared and
RMSE for assessing fit for linear regression models. For logistic regression
models, precision, recall, and F1 score may be useful. Metrics should also be
collected over time for ongoing model performance monitoring, which we will
discuss shortly. As an example, suppose the out-of-sample errors from a
regression model yield an RMSE of 0.15. This can be verified by computing the
squared error between each prediction and its associated actual value, averaging
the squared errors, and taking the square root.
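To illustrate the calculation, the short sketch below computes an RMSE for a handful of made-up predictions and actual values (the numbers are illustrative only, not values from any example in the text).

import numpy as np

# Hypothetical out-of-sample actual values and predictions (illustrative only)
y_actual = np.array([2.10, 1.85, 3.40, 2.75, 1.95])
y_pred = np.array([2.25, 1.70, 3.30, 2.95, 1.80])

squared_errors = (y_actual - y_pred) ** 2   # squared error for each observation
mse = squared_errors.mean()                 # average squared error
rmse = np.sqrt(mse)                         # square root gives the RMSE
print('RMSE:', round(rmse, 4))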
During development we made a number of assumptions and choices, and
sensitivity analysis examines how the results change when these are varied.
Two examples, with a small sketch following the list:
• When we split the data into a training set and a test set, we used a seed
for the random selection of rows. A different seed likely would have split the
data differently.
• When we used a linear regression model, we assumed that the errors were
normally distributed.
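As a minimal sketch of the first point, the code below refits a model under several different random seeds and compares the resulting test metric. The dataset, model, and seeds here are placeholders rather than the ones used in the text; the idea is only that small variation across seeds is reassuring, while large swings warrant investigation.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

housing = fetch_california_housing()

# Refit the model under different random train/test splits and record the test R-squared
scores = []
for seed in [0, 1, 2, 3, 4]:
    x_train, x_test, y_train, y_test = train_test_split(
        housing.data, housing.target, test_size=0.3, random_state=seed)
    reg = LinearRegression().fit(x_train, y_train)
    scores.append(r2_score(y_test, reg.predict(x_test)))

print('R-squared across seeds:', np.round(scores, 3))
print('range:', round(max(scores) - min(scores), 3))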
Now that we have discussed both sensitivity analysis and stress testing, we
can revisit the Columbia University case. If one or more of the model inputs
between 2021 and 2022 changed drastically, this might have been more like
a stress test than sensitivity analysis. This could explain the large change in
ranking (but don’t quote me on this).
Model ID:         123
Model Name:       Student Loan Default Model
Date of Report:   2023-04-15
Period Covered:   2023-01-01 through 2023-03-31
Review Frequency: quarterly
In the event that the YELLOW threshold is crossed, these stakeholders will
be alerted:
Product Manager
Engineering Manager
Data Science Manager
The model development team and the engineering team will review the model
components and data. The model may be retrained to include recent data.
In the event that the RED threshold is crossed, these stakeholders will be
alerted:
VP of Product
VP of Engineering
VP of Data Science
Product Manager
Engineering Manager
Data Science Manager
{
  'baseline_date': {
    '2023-01-31': {
      'metrics': {
        'F1 score': 0.68
      },
      'predictors': {
        'age': {'min': 25,
                'q1': 30,
                'median': 37,
                'q3': 55,
                'max': 95}
      }
    }
  }
}
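As a sketch of how such a baseline might be assembled, the snippet below computes summary statistics for a hypothetical age predictor with pandas; the field names mirror the structure shown above, and the data and metric value are made up.

import pandas as pd

# Hypothetical scoring data for one baseline date
df = pd.DataFrame({'age': [25, 30, 34, 37, 41, 55, 62, 95]})

baseline = {
    'baseline_date': {
        '2023-01-31': {
            'metrics': {'F1 score': 0.68},   # placeholder metric value
            'predictors': {
                'age': {
                    'min': int(df['age'].min()),
                    'q1': float(df['age'].quantile(0.25)),
                    'median': float(df['age'].median()),
                    'q3': float(df['age'].quantile(0.75)),
                    'max': int(df['age'].max()),
                }
            }
        }
    }
}
print(baseline)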
In the event that metrics trigger an alert, data scientists can start leveraging
the logs and the baselines. They can first check the logs for indicators of
problems. Next, they can investigate if there was a change in the distribution of
the predictors, for example. To statistically test if the distribution of a variable
changed between times t1 and t2 , the two-sample Kolmogorov-Smirnov test
(K-S test) could be used. The K-S test is beyond the scope of this book, but
more details can be found in [30].
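While the details of the K-S test are beyond our scope, running it is straightforward. The sketch below applies the ks_2samp function from scipy.stats to two hypothetical samples of a predictor collected at times t1 and t2; the data is simulated here, and a small p-value suggests the distribution has changed.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(314)

# Hypothetical samples of a predictor at two points in time.
# The second sample is shifted to simulate factor drift.
age_t1 = rng.normal(loc=40, scale=10, size=500)
age_t2 = rng.normal(loc=45, scale=10, size=500)

statistic, p_value = ks_2samp(age_t1, age_t2)
print('K-S statistic:', round(statistic, 3))
print('p-value:', p_value)   # a small p-value suggests the distributions differ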
A change in a predictor’s distribution over time is called factor drift, and
it can be problematic. This is because the model learned specific relationships
between predictors and a target from a training set. If the distribution of a
predictor changes, its relationship with the target may change as well. We
would need to retrain the model with the new data to capture the change.
There is no guarantee, however, that the predictor will continue to provide
the same level of predictive ability.
Other potential areas for investigation occur earlier in the data pipeline.
There may be defects in the processed data, the raw data, or the data feeds.
Perhaps there was a step where an integer was expected, but a string was
input. If the code was not properly written to handle such a case, an issue
may arise. There might have been a period of time when a certain data feed
wasn’t working properly, and an important predictor was not supplied to the
model. Diving deeper into the data can take a lot of time and effort, and this
reinforces the importance of system observability.
• Race or color
• National origin
• Religion
• Sex
• Age (provided the applicant is old enough to enter into a contract)
• Women
Illegal, disparate treatment occurs when a lender bases its decision on one or
more discriminatory factors covered by fair lending laws. For example, a bank
using a lending process where females are offered an auto loan with a higher
interest rate than males would be a violation of fair lending laws.
In modern finance, banks and other lenders typically use machine learning
models to make lending decisions. The underwriting decision may use a lo-
gistic regression model. Given approval, the pricing decision may use a linear
regression model.
The model predictors should capture the ability and willingness of bor-
rowers to repay the loan. The predictors should not use protected class infor-
mation, as this can discriminate and promote unfair lending.
There will be two models discussed in our example. The underwriting
model MU will be used for making automated underwriting decisions. The
fair lending model MF L will be used to assess if the underwriting process was
systematically fair. Given this background information, we will now review
the data and modeling.
The dataset and an accompanying notebook can be found in the course repo at these paths:

semester2/datasets/mortgage_lending.csv
semester2/week_15_16_fair_lending_application/IDS_hw4_fair_lending_intro.ipynb

This dataset includes a subset of the fields found in a typical lending file. We
load and preview the data as follows:
import numpy as np
import pandas as pd
import statsmodels.api as sm
syn = pd.read_csv('../datasets/mortgage_lending.csv')
syn.head()
We conduct some exploratory data analysis to better understand the data and
uncover potential predictors. It will be helpful to understand the composition
of the records by occupation and gender, for example. We can build a two-way
table to explore this. For example, of the 12 accountants in the dataset, 5 are
female and 7 are male.
pd.crosstab(syn.occupation, syn.gender)
OUTPUT:
gender f m
occupation
accountant 5 7
contractor 1 4
... ... ...
lawyer 7 5
librarian 0 1
mason 0 1
mechanic 0 2
nurse 10 0
... ... ...
Notice that all of the nurses in this dataset are female. If nurses in the training
data show a high rate of default, the model will learn that nurses have a high
risk of default. This will lead to the model denying loan requests to nurses in
the future, which effectively denies lending to more females. The occupation
variable is acting as a discriminatory variable in this case, and it should not be
included in the model.
Let’s examine distributions of fico and loan to value for denied and
approved applicants. These are summarized as boxplots in Figures 17.2 and
17.3. Denied applicants tended to have lower FICO scores and higher loan-to-
value ratios, which makes sense.
The candidate predictors for the underwriting model are stored in a list:

L_U = ['fico','loan_to_value','loan_term','nurse']
We will use these predictors to build the model next, but first: which of these
predictors should be used? The variables fico and loan_to_value measure
creditworthiness and the ability to repay the loan, so these should be used. The
variable loan_term is a loan attribute, and it should be used. As discussed
earlier, nurse will introduce the risk of discriminating against females, and it
should NOT be used.
X = syn[['fico','loan_to_value','loan_term','nurse']]
X = sm.add_constant(X)
# target variable
y = syn['denied']
# fit the underwriting logistic regression (this step is implied but not shown in the excerpt)
result_uw = sm.Logit(y, X).fit()
print(result_uw.summary())
The parameter estimates are shown in Table 17.5. The only significant pre-
dictor in the model is fico as indicated by its p-value less than 0.05. The
negative coefficient estimate of –0.14 indicates that a higher FICO score is
associated with lower odds of denial, which makes sense.
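To make the interpretation concrete, recall that in logistic regression a coefficient acts multiplicatively on the odds. Holding the other predictors fixed, a one-unit increase in the fico variable multiplies the odds of denial by

$$e^{-0.14} \approx 0.87$$

so higher FICO scores correspond to lower odds of denial.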
The fair lending model replaces nurse with the protected attribute gender_female
to test whether gender is associated with the denial decision:

X = syn[['fico','loan_to_value','loan_term','gender_female']]
X = sm.add_constant(X)
y = syn['denied']
# fit the fair lending logistic regression (this step is implied but not shown in the excerpt)
result_fl = sm.Logit(y, X).fit()
print(result_fl.summary())
The parameter estimates in Table 17.6 indicate that only fico is significant.
The predictor gender_female has p-value 0.067. Since this value is close to
0.05, the process should be closely monitored over time for fair lending risk.
In practice, all of the relevant protected class variables would be rigorously
tested and examined.
In this chapter, we explored model risk, which can lead to financial,
reputational, and societal damage. We studied fair lending risk in detail, which
results from disparate treatment of protected classes.
We reviewed how machine learning models are used in lending decisions,
including underwriting and pricing. Diving into the data showed how discrim-
inatory behavior can unintentionally creep into models. In this case, all of the
nurses were female, and this could bias the lending practice. The case study
illustrated the application of data science to a critically important task in
finance. More broadly, it outlined the examination of one aspect of risk.
To properly validate a model, it must be effectively challenged. The concep-
tual soundness of the overall system and each component must be examined
and understood. The validator needs to understand the assumptions and lim-
itations of the data and model, as well as the development decisions made. A
model development document should capture this information, and it should
be updated over the model lifecycle.
The performance of the model needs to be measured and monitored over
time to ensure that it meets expectations. Outcomes analysis will compare pre-
dicted values to actual outcomes. If model users find that predictions exhibit
bias, or deviate from actual values in a predictable way, they might override
the model. Model users should be interviewed to capture and understand any
overrides, as this presents an opportunity to improve the model.
To make models easier to monitor, they should be observable, and their
baseline statistics should be captured. The important steps and intermediate
quantities can be logged. If metrics deteriorate, the logs can be searched first.
Other layers to investigate are the predictors, processed data, raw data, and
data feeds. The distributions of predictors can change over time, and this is
called factor drift. This phenomenon can be problematic as the relationship
between a predictor and target is fundamental to model performance.
Model benchmarking compares the model to an alternative model or met-
ric. The benchmarks will be relatively simple approaches or heuristics. A more
complex model should beat these benchmarks to be valuable. Sensitivity anal-
ysis is the exercise of slightly changing the model assumptions and measuring
the impact to the output. Small changes are expected, and large changes will
warrant further investigation. Stress testing is the practice of running the
model on plausible, extreme scenarios to measure its response. This may sur-
face issues with the model which can be investigated. It is also useful as a
“what if” exercise.
We have covered a lot of ground in this introduction to data science. In
the next chapter, we will plan some next steps for going deeper and broader.
The field is extremely active – particularly in deep learning. We will see some
resources for learning more, and some of the recent advances.
17.11 Exercises
Exercises with solutions are marked with S . Solutions can be found in the
book materials folder of the course repo at this location:
Congratulations, you’ve made it through the content of the book! You should
now have a good understanding of the field of data science, important data
literacy topics, and how to implement a data science pipeline. Data science,
and machine learning in particular, is vast and growing rapidly. The intention
of this chapter is to provide some ideas and suggestions of where to go next.
Over time, we can expect to see new areas of activity, applications, and tools.
To prepare for the more advanced models and applications, it will be necessary
to add to your skills in mathematics, statistics, computing, data analysis, and
communication. This section will be a bit of a laundry list, and I apologize in
advance. I don’t expect you to build all of these skills in a month; it will take
longer to fully absorb the ideas and have the ability to apply them. Some of
the topics may only come up for certain roles, and you’ll know this when you
see a job posting or have an interview.
Mathematics
Our treatment of calculus stopped short of power series and integral calculus.
These topics will be important to learn. An understanding of integration will
provide a foundation to better understand the continuous random variables
of probability theory. The expected value of a continuous random variable is
an integral, for example.
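For instance, if a continuous random variable X has probability density function f(x), its expected value is the integral

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$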
Many of the leading models in machine learning have deep underpinnings
in probability, as they model sequences of events subject to randomness. Over
time, you will need to learn advanced probability concepts and techniques.
We studied gradient descent for fitting parameters in machine learning mod-
els. Some of the more complex models will use more specialized optimization
techniques, such as stochastic gradient descent.
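As a small illustration of the idea, the sketch below fits a simple linear model with mini-batch stochastic gradient descent: rather than computing the gradient on the full dataset at each step, it uses a random batch of observations. This is a bare-bones sketch with made-up data, not a production implementation.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus noise
x = rng.uniform(-1, 1, size=1000)
y = 2 * x + 1 + rng.normal(scale=0.1, size=1000)

w, b = 0.0, 0.0           # parameters to learn
lr, batch_size = 0.1, 32  # learning rate and mini-batch size

for step in range(2000):
    idx = rng.integers(0, len(x), size=batch_size)  # random mini-batch
    xb, yb = x[idx], y[idx]
    error = (w * xb + b) - yb
    grad_w = 2 * np.mean(error * xb)  # partial derivative of MSE w.r.t. w
    grad_b = 2 * np.mean(error)       # partial derivative of MSE w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print('estimated slope and intercept:', round(w, 3), round(b, 3))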
It will be necessary to go deeper in linear algebra to understand topics
including norms, diagonalization, eigenvalues and eigenvectors, and determinants,
among other topics.
Statistics
Computer Science
A data scientist should be comfortable with common data structures and al-
gorithms, and should have an understanding of when to apply them. A good
understanding of time complexity and space complexity will help you write
better algorithms. The time complexity is the time it takes to run the algo-
rithm as a function of the input length. The space complexity of an algorithm
is the amount of memory consumed as a function of the input length. When
comparing two algorithms, the one with lower time complexity and space
complexity will be preferable.
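As a rough illustration, membership testing in a Python list scans the elements one by one (time complexity O(n)), while a set uses hashing for roughly constant-time lookups at the cost of extra memory. The sketch below times the difference for a made-up collection of identifiers; exact timings will vary by machine.

import time

n = 1_000_000
ids_list = list(range(n))
ids_set = set(ids_list)   # extra memory buys faster lookups

target = n - 1            # worst case for the list: the last element

start = time.perf_counter()
target in ids_list        # O(n): scans the list
list_time = time.perf_counter() - start

start = time.perf_counter()
target in ids_set         # O(1) on average: hash lookup
set_time = time.perf_counter() - start

print(f'list lookup: {list_time:.6f} s, set lookup: {set_time:.6f} s')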
We briefly covered databases and SQL. It will be very valuable to gain
mastery of SQL. I should point out that large tech companies love to ask
interview candidates about data structures, algorithms, and SQL.
It is essential for data scientists to be able to program well in Python, R, or
something similar. Many data scientists do not regularly apply software devel-
opment practices such as design patterns and unit testing, but these practices
are highly encouraged. Writing efficient, reusable code takes practice and the
study of great source material. There are many excellent open-source repos-
itories available for such study. Moreover, for ML Engineering roles, strong
software skills are a requirement.
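As a minimal sketch of unit testing with Python's built-in unittest module, suppose we had written a small rmse helper function (hypothetical; not from the text). The tests below check it against hand-computed values.

import unittest
import numpy as np

def rmse(y_actual, y_pred):
    """Root mean squared error between two arrays."""
    errors = np.asarray(y_actual) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(errors ** 2)))

class TestRmse(unittest.TestCase):
    def test_known_value(self):
        # errors are 1 and -1, so the RMSE should be exactly 1.0
        self.assertAlmostEqual(rmse([1, 2], [0, 3]), 1.0)

    def test_perfect_prediction(self):
        self.assertEqual(rmse([1, 2, 3], [1, 2, 3]), 0.0)

if __name__ == '__main__':
    unittest.main()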
Familiarity with a cloud-based system, such as AWS, Google Cloud, or
Microsoft Azure, will be very valuable, as many organizations use the cloud
for their computing and storage needs. The cloud providers evangelize their
services and products, and their documentation can be very informative on
topics including databases, analytics, machine learning, and hardware. As
many of their services overlap, don’t worry about learning each service offered
by each provider.
Data Analysis
Communication
Improving your communication skills can be done through regular practice and
by asking for feedback. The practice might come through meetings, meetups,
seminars, and other formal and informal events. Watch some skilled speakers
and observe their habits and delivery. TED talks can be good sources of inspi-
ration. Try to get experience presenting to both technical and non-technical
audiences. Take note of the kinds of questions that each audience asks, so that
you may better prepare in the future.
Lasso Regression

Recall that linear regression selects parameter estimates by minimizing the sum
of squared residuals. For a model with p predictors, this term is

$$SS_{res}(\beta) = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}) \right)^2$$
The penalty term in Lasso regression consists of two parts: the sum of
the absolute value of the parameter estimates, which penalizes magnitudes,
and a scalar multiple λ which controls the relative importance of the penalty
term versus the sum of squared residuals term. The regularization term for p
parameters looks like this:
$$\lambda \sum_{j=1}^{p} |\beta_j|, \qquad \lambda \ge 0$$
The complete loss function for Lasso regression with p predictors combines
the two terms as follows:
$$SS_{res}(\beta) = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}) \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad \lambda \ge 0$$
The first term can be minimized by selecting parameter estimates which bring
the predictions closer to the response values yi . At the same time, the second
term adds a positive quantity which increases with the sum of the magnitudes
of the parameter estimates. Specifically, increasing any absolute coefficient
|βj | will increase the penalty. Additionally, increasing λ will increase the value
of the loss function. Ultimately, the optimal parameter estimates must strike
a balance between these two terms.
Earlier, we encountered hyperparameters, which are important quantities
in the configuration of a model. Additionally, their values are not known in
advance. The λ term is an example of a hyperparameter, and a common way
to estimate its value is through k-fold cross validation. Here is an outline of
possible steps:
1. Provide a list of possible values for λ. We can start with powers
of 10 since we are unsure of the right order of magnitude. This is
the hyperparameter grid. Referring to this set as Λ, we might use
$\Lambda = \{1 \times 10^{-4}, 1 \times 10^{-3}, \ldots, 1 \times 10^{3}, 1 \times 10^{4}\}$.
2. Decide on the number of folds k
3. Set aside a portion of the dataset for final evaluation (a test set)
4. For each possible value λi ∈ Λ, repeat these steps: fit the model on
k − 1 folds using λi, compute the MSE on the held-out fold, rotate
through all k folds, and average the k values to obtain MSEi.
5. From the pairs (λi , M SEi ), select the λi with lowest value M SEi .
This yields the optimal value λ∗ .
6. The final model is the version with regularization value λ∗
For a more accurate λ∗, a finer grid can be used in the first step after the MSEs
are observed from the original coarse grid. Modern software packages support
Lasso regression, as well as the popular variants Ridge regression and elastic
net regularization. For more details on regularization, you might consult [4].
Tree-Based Models

A decision tree makes a prediction by routing an observation from the root
node through a series of splits, each of which sends it down the relevant path.
For example, an observation where age=19 and visits=10 would first travel
left and then right, receiving prediction N.
For example, suppose the relationship between a predictor x and the target y
changes at a threshold c:

$$y = \begin{cases} 2x & x \le c \\ 3x & x > c \end{cases}$$
In this case, the slope gets steeper when x exceeds c. Linear regression would
not handle this relationship properly, but a tree-based model could learn the
non-linear pattern from the data and use branching as needed.
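To illustrate, the sketch below simulates data from the piecewise relationship above (with c = 5, chosen arbitrarily) and compares a linear regression to a decision tree regressor; the tree can capture the kink while a single straight line cannot. The data and settings are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
c = 5  # arbitrary breakpoint for illustration

x = rng.uniform(0, 10, size=500)
y = np.where(x <= c, 2 * x, 3 * x) + rng.normal(scale=0.5, size=500)
X = x.reshape(-1, 1)

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)

print('linear R-squared:', round(linear.score(X, y), 3))
print('tree R-squared:  ', round(tree.score(X, y), 3))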
Regression models require the user to determine the best set of predic-
tors.1 Tree-based models use an algorithm to automatically determine useful
predictors, placing the most important predictors higher in the tree. Predictors
that aren’t sufficiently helpful for increasing explanatory power are dropped
from the model. For a large set of potential predictors, this can save a lot of
development time.
Regression models require imputation of missing values. For some imple-
mentations of tree-based models, missing values are treated as a separate level
for each predictor. In doing this, imputation is not necessary in the tree-based
model.
A challenge introduced by tree-based models is that interpretation is less
clear when compared to linear and logistic regression models. This is primarily
because tree-based models combine output from multiple trees when making
a prediction. There are specialized methods for measuring the importance of
predictors in these models, but they are still more challenging to understand
than the parameter estimates in a regression model.
Neural Networks

For a binary classification task, one output node is sufficient to provide the
probability of the positive class p; the probability of the negative class will
then be 1 − p.
Fitting a neural network amounts to optimizing its weights and biases so that
the predictions are close to the target values. The optimization uses a loss
function and a variant of gradient descent (most commonly).
To decide how the parameters should change, partial derivatives of the loss
function are computed with respect to each parameter. Much of the complex-
ity arises from the layers in the architecture: there is a sequence of derivatives
that must be computed on composed functions. You may recall that to cal-
culate the derivative of composed functions, the chain rule can be used. The
chain rule, combined with matrix algebra, is essential in optimizing the pa-
rameters of a neural network. The breakthrough method for enabling these
calculations is called backpropagation. When you study neural networks, I
highly recommend that you spend time understanding how backpropagation
works, as this is fundamental to fitting neural networks.
The model just described is a fully connected neural network, as each
node in a layer is connected to each node in the next layer. The leading
models in areas such as computer vision use neural networks that are not
fully connected. Specifically, the connections may be placed based on nearby
locations or time points. These architectures significantly reduce the number of
connections, which reduces storage and computation. For example, a computer
vision model used to detect objects may not need pixel information from one
image corner to predict the contents of a different image corner. These models
include other adaptations as well to improve performance and accommodate
inputs of different shapes. However, you have already learned many of the
required concepts to dive deeper into neural networks.
Apache Spark is a popular framework for distributed computing: work is divided
across worker machines, and the completed work can be collected back to a single
machine called the driver. Spark includes libraries that support analytics, SQL,
machine learning, graph processing, and stream processing. It can ingest a
variety of formats and send results downstream to
applications and dashboards. Spark is so useful that I developed big data
courses around it for the University of Virginia and the University of Califor-
nia, Santa Barbara. You can find the Big Data Systems course at UVA here:
https://github.com/UVADS/ds5110. An excellent book for learning Spark is
[50].
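As a hint of what Spark code looks like, here is a minimal PySpark sketch that builds a small DataFrame and aggregates it; the column names and values are made up, and running it requires a local installation of pyspark.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

# Hypothetical loan records; in practice the data would be read from storage
rows = [("approved", 250000.0), ("denied", 180000.0), ("approved", 320000.0)]
df = spark.createDataFrame(rows, schema=["decision", "loan_amount"])

# Average loan amount by decision, computed in parallel across the workers
df.groupBy("decision").agg(F.avg("loan_amount").alias("avg_amount")).show()

spark.stop()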
18.5 Resources
In addition to the textbooks mentioned throughout this book, there are several
other places to learn data science. There is a wide array of online courses from
vendors including Coursera, Udemy, Udacity, and Codecademy. There are
several free courses on YouTube. A course that I found particularly valuable
early in my data science career is the Stanford CS229 Machine Learning course
taught by Andrew Ng. This course offers a detailed, mathematical treatment
of many essential models and techniques.
Several companies and universities offer bootcamps, certifications, and de-
gree programs. There are several online master’s programs, such as the pro-
gram where I teach at the UVA School of Data Science. When deciding to
make a large investment of resources, it is best to understand costs, curricu-
lum content, support, networking opportunities, and exit opportunities. It can
be helpful to review testimonials and placement statistics from the various
programs.
Cloud Computing
There are many no-cost and low-cost resources for computing in the cloud.
AWS offers a free tier to try their services. For notebook-based computing in
the browser, Amazon SageMaker Studio Lab is available at no cost:
https://studiolab.sagemaker.aws/
Google Colab is another free option for computing in your browser. Users
can access CPUs and GPUs. Notebooks are saved in Google Drive for easy
sharing. Colab can be accessed here:
https://colab.research.google.com/
Rounding out the three largest cloud providers, Microsoft also offers free access
to some of their services. Some services are always free, while others are free
for the first 12 months. You can browse the services here:
https://azure.microsoft.com/en-us/pricing/free-services/
For each of these services, it is best to confirm the pricing structure in advance.
Communities
There are many other sites for online data science competitions (which I
haven’t tried), such as DrivenData, Devpost, and Numerai.
18.6 Applications
The breadth of data science and machine learning applications is staggering.
If you name a problem, there is likely a model for it, or the opportunity for a
model. Below, I provide a few broad, exciting areas with high activity. Deep
learning is the leading approach for each of them.
Recommendation
Computer Vision
Natural Language Processing

One of the earliest language models, called the bag of words model, uses the
count of each word in a document to provide information about the document.
An article containing several mentions of the word “football” might be classi-
fied as an article about sports. For a more robust understanding of the text,
the context of each word needs to be incorporated. One of the challenges for
a language model is understanding the relationship between far-apart words
or phrases in the text. For example, consider this text:
“Jacqui loved visiting Barcelona in the summertime. She enjoyed the way that
flamenco guitarists would play music in the streets. Yesterday, when I asked
what she would like to do next summer, she mentioned how much she would
like to go back there.”
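As a small sketch of the bag of words idea described above, scikit-learn's CountVectorizer produces word counts for a few short made-up documents; note that it keeps no information about word order or context.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the football match",
    "the football season starts soon",
    "interest rates rose again today",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # document-term count matrix

print(vectorizer.get_feature_names_out())
print(counts.toarray())                   # one row of counts per document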
Speech Recognition
Protein Folding
Generative AI
I hope that this chapter provided some good ideas of where to go next, and
that it stoked excitement about where the field is headed. Finally, I sincerely
hope that you enjoyed reading this book as much as I enjoyed writing it. Please
feel free to reach out to me on LinkedIn, including a note that you read the
book.
https://www.linkedin.com/in/adam-tashman-93a82722
You can learn about my company, PredictioNN, and browse data science re-
sources at this URL:
http://prediction-n.com/
The PredictioNN website includes links to the data science course, supporting
material for this book, and other helpful resources for learning data science.
Bibliography
[1] How much data do we create every day? The mind-blowing stats everyone
should read. https://tinyurl.com/2s35wxd4. Accessed: 2022-11-28.
[13] Kevin Mitnick. The art of invisibility: The world’s most famous hacker
teaches you how to be safe in the age of big brother and big data. Little,
Brown, 2017.
[14] FAQ: How do I know if my sources are credible/reliable? https://guides.lib.uw.edu/research/faq/reliable. Accessed: 2022-12-15.
[15] What is a REST API? https://www.redhat.com/en/topics/api/what-is-a-rest-api. Accessed: 2023-07-04.
[16] Wikipedia 1.4.0. https://pypi.org/project/wikipedia/. Accessed:
2023-07-04.
[17] Daniel Jurafsky and James H Martin. Speech and Language Processing:
An Introduction to Natural Language Processing, Computational Linguis-
tics, and Speech Recognition. Pearson, 2014.
[18] What is processor speed and why does it matter? https://tinyurl.com/mwaj3ufe. Accessed: 2022-12-20.
[19] CPU vs GPU: What’s the difference? https://www.intel.com/content/www/us/en/products/docs/processors/cpu-vs-gpu.html. Accessed: 2022-12-20.
[20] William Shotts. The Linux command line: A complete introduction. No
Starch Press, 2019.
[21] Scott Chacon and Ben Straub. Pro Git. Springer Nature, 2014.
[22] An intro to Git and GitHub for beginners. https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners. Accessed: 2022-12-20.
[23] Anna V Vitkalova, Limin Feng, Alexander N Rybin, Brian D Gerber,
Dale G Miquelle, Tianming Wang, Haitao Yang, Elena I Shevtsova,
Vladimir V Aramilev, and Jianping Ge. Transboundary cooperation im-
proves endangered species monitoring and conservation actions: A case
study of the global population of Amur leopards, volume 11. Wiley Online
Library, 2018.
[24] Wes McKinney. Python for data analysis: Data wrangling with Pandas,
NumPy, and IPython. O’Reilly Media, Inc., 2012.
[25] Rod Stephens. Beginning database design solutions. John Wiley & Sons,
2009.
[26] Richard Courant. Differential and Integral Calculus, Volume 1. Ishi Press,
2010.
[27] Richard Courant. Differential and Integral Calculus, Volume 2. John
Wiley & Sons, 2011.
[29] Stephen Friedberg, Arnold Insel, and Lawrence Spence. Linear Algebra,
volume 4. Prentice Hall, 2002.
[30] George W Snedecor and William G Cochran. Statistical methods. Iowa
State University Press, 1989.
Index
uncountable, 224
underwriting decision, 336
unique(), 99
unit circle, 134
unit testing, 310
unittest, 311
uptime, 25