Minimalist Data Wrangling with Python
Marek Gagolewski
v1.0.3
Dr habil. Marek Gagolewski
Deakin University, Australia
Systems Research Institute, Polish Academy of Sciences
Warsaw University of Technology, Poland
https://www.gagolewski.com
Product and company names mentioned herein may be the trademarks of their
respective owners. Rather than use a trademark symbol with every occurrence of
a trademarked name, the names are used in an editorial fashion to the benefit of the
trademark owner, with no intention of infringement of the trademark.
Weird is the world we live in, but the following had to be written.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is provided
without warranty, either express or implied. The author will of course not be held liable
for any damages caused or alleged to be caused directly or indirectly by this book.
Anyway, any bug reports/corrections/feature requests are welcome. To make this textbook even better, please file them at https://github.com/gagolews/datawranglingpy.
Typeset with XeLaTeX. Please be understanding: it was an algorithmic process, hence the results are ∈ [good enough, perfect).
Homepage: https://datawranglingpy.gagolewski.com/
Datasets: https://github.com/gagolews/teaching-data
Preface
0.1 The Art of Data Wrangling
0.2 Aims, Scope, and Design Philosophy
0.2.1 We Need Maths
0.2.2 We Need Some Computing Environment
0.2.3 We Need Data and Domain Knowledge
0.3 Structure
0.4 The Rules
0.5 About the Author
0.6 Acknowledgements
I Introducing Python
1 Getting Started with Python
1.1 Installing Python
1.2 Working with Jupyter Notebooks
1.2.1 Launching JupyterLab
1.2.2 First Notebook
1.2.3 More Cells
1.2.4 Edit vs Command Mode
1.2.5 Markdown Cells
1.3 The Best Note-Taking App
1.4 Initialising Each Session and Getting Example Data (!)
1.5 Exercises
II Unidimensional Data
4 Unidimensional Numeric Data and Their Empirical Distribution
4.1 Creating Vectors in numpy
4.1.1 Enumerating Elements
4.1.2 Arithmetic Progressions
4.1.3 Repeating Values
4.1.4 numpy.r_ (*)
4.1.5 Generating Pseudorandom Variates
4.1.6 Loading Data from Files
4.2 Mathematical Notation
4.3 Inspecting the Data Distribution with Histograms
4.3.1 heights: A Bell-Shaped Distribution
4.3.2 income: A Right-Skewed Distribution
Changelog
References
1 https://datawranglingpy.gagolewski.com/datawranglingpy.pdf
2 https://datawranglingpy.gagolewski.com/
3 https://deepr.gagolewski.com/
4 https://github.com/gagolews/datawranglingpy/issues
5 https://dx.doi.org/10.5281/zenodo.6451068
0
Preface
We primarily focus on methods and algorithms that have stood the test of time and
that continue to inspire researchers and practitioners. They all meet a reality check
that is comprised of the three following properties, which we believe are essential in
practice:
9 We might have entitled it Introduction to Data Science (with Python).
• simplicity (and thus interpretability, being equipped with no or only a few underlying tunable parameters; being based on some sensible intuitions that can be explained in our own words),
• mathematical analysability (at least to some extent; so that we can understand
their strengths and limitations),
• implementability (not too abstract on the one hand, but also not requiring any
advanced computer-y hocus-pocus on the other).
Note Many more complex algorithms are merely variations on or clever combinations
of the more basic ones. This is why we need to study the fundamentals in great detail.
We might not see it now, but this will become evident as we progress.
we need programming at all? Unfortunately, some mathematicians forgot that probability and statistics are
deeply rooted in the so-called real world. We should remember that theory beautifully supplements practice
and provides us with very deep insights, but we still need to get our hands dirty from time to time.
In this course, we will be writing code in Python, which we shall introduce from
scratch. Consequently, we do not require any prior programming experience.
The 2021 StackOverflow Developer Survey11 lists it as the 2nd most popular programming language nowadays (slightly behind JavaScript, whose primary use is in Web development). Over the last few years, Python has proven to be a very robust choice for learning and applying data wrangling techniques. This is possible thanks to the famous12 high-quality packages written by the devoted community of open-source programmers, including but not limited to numpy, scipy, pandas, matplotlib, seaborn, and scikit-learn.
Nevertheless, Python and its third-party packages are amongst many software tools which can help gain new knowledge from data. Other13 open-source choices include, e.g., R14 and Julia15. And many new ones will emerge in the future.
software should be free. Consequently, we are not going to talk about them here at all.
14 https://www.r-project.org/
15 https://julialang.org/
Yet, many textbooks introduce statistical concepts using carefully crafted datasets
where everything runs smoothly, and all models work out of the box. This gives a false
sense of security. In practice, however, most datasets are not only unpolished but also
(even after some careful treatment) uninteresting. Such is life. We will not be avoiding
the more difficult problems during our journey.
0.3 Structure
This book is a whole course and should be read from the beginning to the end.
The material has been divided into five parts.
1. Introducing Python:
• Chapter 1 discusses how to execute the first code chunks in Jupyter Notebooks, which are a flexible tool for the reproducible generation of reports from data analyses.
• Chapter 2 introduces the basic scalar types in base Python, ways to call existing and to write our own functions, and control a code chunk’s execution flow.
• Chapter 3 mentions sequential and other iterable types in base Python; more
advanced data structures (vectors, matrices, data frames) that we introduce
below will build upon these concepts.
2. Unidimensional Data:
• Chapter 4 introduces vectors from numpy, which we use for storing data on
the real line (think: individual columns in a tabular dataset). Then, we look at
the most common types of empirical distributions of data (e.g., bell-shaped,
right-skewed, heavy-tailed ones).
• In Chapter 5, we list the most basic ways for processing sequences of numbers, including methods for data aggregation, transformation (e.g., standardisation), and filtering. We also mention that a computer’s floating-point arithmetic is imprecise and what we can do about it.
• Chapter 6 reviews the most common probability distributions (normal, log-
normal, Pareto, uniform, and mixtures thereof), methods for assessing how
well they fit empirical data, and pseudorandom number generation that is
crucial for experiments based on simulations.
3. Multidimensional Data:
• Chapter 7 introduces matrices from numpy. They are a convenient means of storing multidimensional quantitative data (many points described by possibly many numerical features). We also present some methods for their
Note (*) The parts marked with a single or double asterisk can be skipped the first time
we read this book. They are of increased difficulty and are less essential for beginner
students.
things are actually free (see Rule #9). Therefore, this name is misleading.
0.6 Acknowledgements
Minimalist Data Wrangling with Python is based on my experience as an author of a quite
successful textbook Przetwarzanie i analiza danych w języku Python (Data Processing and
Analysis in Python; see [35]) that I wrote (in Polish, 2016, published by PWN) with my
former (successful) PhD students, Maciej Bartoszuk and Anna Cena – thanks! The current book is an entirely different work; however, its predecessor served as an excellent
testbed for many ideas conveyed here.
The teaching style exercised in this book has proven successful in many similar courses
that yours truly has been responsible for, including at Warsaw University of Technology, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank all my
students and colleagues for the feedback given over the last 10+ years.
A thank-you to all the authors and contributors of the Python packages that we use
throughout this course: numpy [45], scipy [90], matplotlib [51], seaborn [91], and
18 https://www.gagolewski.com
19 https://deepr.gagolewski.com/
20 https://stringi.gagolewski.com
21 https://genieclust.gagolewski.com
22 https://github.com/gagolews
pandas [62], amongst others (as well as the many C/C++/Fortran libraries they provide wrappers for). Their version numbers are given in Section 1.4.
This book was prepared using a Markdown superset called MyST23, Sphinx24, and TeX (XeLaTeX). Python code chunks were processed with the R (sic!) package knitr [97]. A little help from Makefiles, custom shell scripts, and Sphinx plugins (sphinxcontrib-bibtex25, sphinxcontrib-proof26) dotted the j’s and crossed the f’s.
The Ubuntu Mono27 font is used for the display of code. Typesetting of the main text
relies upon the Alegreya28 and Lato29 typefaces.
This book received no funding, administrative, technical, or editorial support from
Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or
any other source.
To my friends: Ania, Basia, Grzesiek, Fizz Grady, and Tessa – thanks for being so patient and for your comments about different things!
23 https://myst-parser.readthedocs.io/en/latest/index.html
24 https://www.sphinx-doc.org/
25 https://pypi.org/project/sphinxcontrib-bibtex/
26 https://pypi.org/project/sphinxcontrib-proof/
27 https://design.ubuntu.com/font/
28 https://www.huertatipografica.com/en
29 https://www.latofonts.com/
Part I
Introducing Python
1
Getting Started with Python
Users of Unix-like operating systems (GNU/Linux5, FreeBSD, etc.) may download Python via their native package manager (e.g., sudo apt install python3 in Debian and Ubuntu). Then, additional Python packages (see Section 1.4) can be installed6 by the said manager or directly from the Python Package Index (PyPI7) via the pip tool.
Users of other operating systems can download Python from the project’s website or
some other distribution available on the market, e.g., Anaconda or Miniconda.
Exercise 1.1 Install Python on your computer.
of joy.
4 (*) CPython was written in the C programming language. Many Python packages are just convenient
on the desktop and in the cloud. Switching to a free system at some point cannot be recommended highly
enough.
6 https://packaging.python.org/en/latest/tutorials/installing-packages/
7 https://pypi.org/
8 https://jupyterlab.readthedocs.io/en/stable/
9 https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
venient space for exercising data science in Python (writing standalone scripts in some
more advanced editors is the preferred option), we chose it here because of its educative advantages (interactive, easy to start with, etc.).
Note (*) More advanced students might consider, for example, jupytext12 as a means
to create .ipynb files directly from Markdown documents.
This should launch the JupyterLab server and open the corresponding web app in the
default web browser.
Important The file is stored relative to the current working directory of the running JupyterLab server instance. Make sure you can locate HelloWorld.ipynb on your disk using your favourite file explorer (by the way, .ipynb is just a JSON file that can also be edited using an ordinary text editor).
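For instance, we can verify the latter claim from Python itself (a small sketch; we assume HelloWorld.ipynb sits in the current working directory):
import json
with open("HelloWorld.ipynb") as f:
    notebook = json.load(f)   # an .ipynb file is just JSON underneath
print(list(notebook.keys()))  # typically: cells, metadata, nbformat, nbformat_minor
## ['cells', 'metadata', 'nbformat', 'nbformat_minor']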
print("G'day!")
12 https://jupytext.readthedocs.io/en/latest/
13 https://www.youtube.com/watch?v=Ag1AKIl_2GM
4. Press Ctrl+Enter (or Cmd+Return on macOS) to execute the code cell and display
the result; see Figure 1.2.
2. Press Ctrl+Enter to execute the code and replace the previous outputs with the
new ones.
3. Enter a command to print some other message that is to your liking. Note that character strings in Python must be enclosed in either double quotes or apostrophes.
4. Press Shift+Enter to execute the code cell, create a new one below, and then enter
the edit mode.
5. In the new cell, enter and then execute the following:
6. Add three more code cells, displaying some text or creating other bar plots.
Exercise 1.3 Change print(2+5) to PRINT(2+5). Execute the code chunk and see what happens.
Note In the Edit mode, JupyterLab behaves like an ordinary text editor. Most keyboard
shortcuts known from elsewhere are available, for example:
• Shift+LeftArrow, DownArrow, UpArrow, or RightArrow – select text,
• Ctrl+c – copy,
• Ctrl+x – cut,
• Ctrl+v – paste,
• Ctrl+z – undo,
• Ctrl+] – indent,
• Ctrl+[ – dedent,
• Ctrl+/ – toggle comment.
Important ESC and Enter switch between the Command and Edit modes, respectively.
# Section
## Subsection
* one
* two
1. aaa
2. bbbb
* three
```python
# some code to display (but not execute)
2+2
```
Let us not waste our time finding the best app for our computers, phones, or tablets.
The best and most versatile note-taking solution is an ordinary piece of A4 paper and
a pen or a pencil. Loose sheets of paper, 5 mm grid-ruled for graphs and diagrams,
work nicely. They can be held together using a cheap landscape clip folder (the one
with a clip on the long side). An advantage of this solution is that it can be browsed
through like an ordinary notebook. Also, new pages can be added anywhere, and their
ordering altered arbitrarily.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.environ["COLUMNS"] = "74"  # output width, in characters
np.set_printoptions(linewidth=74)
pd.set_option("display.width", 74)
_linestyles = [
    "solid", "dashed", "dashdot", "dotted"
]
plt.rcParams["axes.prop_cycle"] = plt.cycler(
    # each plotted line will have a different plotting style
    # (further styling arguments from the original listing are not shown in this excerpt)
    linestyle=_linestyles
)
The above imports the most frequently used packages (together with their usual aliases, we will get to that later). Then, it sets up some further options that yours truly is particularly fond of. On a side note, for the discussion on the reproducible pseudorandom number generation, please see Section 6.4.2.
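For example, fixing the seed of numpy’s pseudorandom number generator (the value 123 below is an arbitrary choice of ours) makes the generated sequences repeatable across sessions:
np.random.seed(123)
np.random.rand()   # the same value will be produced on every run
## 0.6964691855978616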
The software we use regularly receives feature upgrades, API changes, and bug fixes.
It is good to know which version of the Python environment was used to evaluate all
the code included in this book:
import sys
print(sys.version)
## 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
The versions of the packages that we use in this course are given below. They can usually be fetched by calling, for example, print(np.__version__), etc.
Package Version
numpy 1.24.1
scipy 1.10.0
pandas 1.5.2
matplotlib 3.6.2
seaborn 0.12.2
sklearn (scikit-learn) (*) 1.1.1
icu (PyICU) (*) 2.9
IPython (*) 8.4.0
mplfinance (*) 0.12.9b1
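For instance, they can all be listed at once with a short loop (a sketch; the versions printed will of course depend on the reader’s environment):
import numpy, scipy, pandas, matplotlib, seaborn
for m in [numpy, scipy, pandas, matplotlib, seaborn]:
    print(m.__name__, m.__version__)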
We expect 99% of the code listed in this book to work in future versions of our environment. If the kind reader discovers that this is not the case, filing a bug report at https://github.com/gagolews/datawranglingpy/issues will be much appreciated (for the benefit of other students).
the benefit of other students).
Important All example datasets that we use throughout this course are available for
download at https://github.com/gagolews/teaching-data.
Exercise 1.7 Ensure you are comfortable accessing raw data files from the above repository. Choose any file, e.g., nhanes_adult_female_height_2020.txt in the marek folder, and then
click Raw. It is the URL that you were redirected to, not the previous one, that includes the link to
be referred to from within your Python session.
Note that each dataset starts with several comment lines explaining its structure, the meaning of
the variables, etc.
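For example, once we have the raw-file URL, a dataset can be fetched directly from within Python (a sketch using one of the files mentioned above; comment lines starting with "#" are skipped by numpy.loadtxt by default):
heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
heights.shape   # the number of observations read
## (4221,)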
1.5 Exercises
Exercise 1.8 What is the difference between the Edit and the Command mode in Jupyter?
Exercise 1.9 How can we format a table in Markdown? How can we insert an image?
2
Scalar Types and Control Structures in Python
In this part, we introduce the basics of the Python language itself. As it is a general-purpose tool, various packages supporting data wrangling operations are provided as third-party extensions. In further chapters, based on the concepts discussed here, we
will be able to use numpy, scipy, pandas, matplotlib, seaborn, and other packages with
some healthy degree of confidence.
True
## True
to instantiate one of them. This might seem boring; unless, when trying to play with
the above code, we fell into the following pitfall.
Arithmetic Operators
Here is the list of available arithmetic operators:
1 + 2 # addition
## 3
1 - 7 # subtraction
## -6
4 * 0.5 # multiplication
## 2.0
7 / 3 # float division (the result is always of type float)
## 2.3333333333333335
7 // 3 # integer division
## 2
7 % 3 # division remainder
## 1
2 ** 4 # exponentiation
## 16
1 https://docs.python.org/3/reference/expressions.html#operator-precedence
We can check that x (great name, by the way: it means something of general interest in
mathematics) is now available for further reference by printing out the value that is
bound therewith:
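print(x)
## 7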
Also, existing variables can be re-bound to any other value whenever we please:
x = x/3 # let the new `x` be equal to the old `x` (7) divided by 3
print(x)
## 2.3333333333333335
Exercise 2.2 Create two named variables height (in centimetres) and weight (in kilograms).
Based on them, determine your BMI2 .
x *= 3
print(x)
## 7.0
In this context, the above is equivalent to x = x*3, i.e., a new variable has been created.
Nevertheless, in other scenarios, augmented assignments modify the objects they act
upon in place; compare Section 3.5.
"""
spam\\spam
tasty\t"spam"
lovely\t'spam'
"""
## '\nspam\\spam\ntasty\t"spam"\nlovely\t\'spam\'\n'
Exercise 2.3 Call the print function on the above object to reveal the special meaning of the
included escape sequences.
Important Many string operations are available. They are related, for example, to formatting, pattern searching, or extracting matching chunks. They are especially important in the art of data wrangling as oftentimes information comes to us in textual form. We shall be covering this topic in detail in Chapter 14.
x = 2
f"x is {x}"
## 'x is 2'
Notice the “f” prefix. The “{x}” part was replaced with the value stored in the x variable.
There are many options available. As usual, it is best to study the documentation4 in
search of interesting features. Here, let us just mention that we will frequently be
referring to placeholders like “{variable:width}” and “{variable:width.precision}”,
3 https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
4 https://docs.python.org/3/reference/lexical_analysis.html#f-strings
which specify the field width and the number of fractional digits of a number. This
can result in a series of values nicely aligned one below another.
π = 3.14159265358979323846
e = 2.71828182845904523536
print(f"""
π = {π:10.8f}
e = {e:10.8f}
""")
##
## π = 3.14159265
## e = 2.71828183
10.8f means that a value should be formatted as a float, be of at least width 10, and
use eight fractional digits.
e = 2.718281828459045
round(e, 2)
## 2.72
print(
round(e, 2), # two arguments matched positionally
round(e, ndigits=2), # positional and keyword argument
round(number=e, ndigits=2), # two keyword arguments
round(ndigits=2, number=e) # the order does not matter for keyword args
)
## 2.72 2.72 2.72 2.72
That no other form is allowed is left as an exercise, i.e., positionally matched arguments must be listed before the keyword ones.
import math  # the math module must be imported prior to its first use
print(math.log(2.718281828459045)) # the natural logarithm (base e)
## 1.0
print(math.floor(-7.33)) # the floor function
## -8
print(math.sin(math.pi)) # sin(pi) equals 0 (with some numeric error)
## 1.2246467991473532e-16
See the official documentation5 for the comprehensive list of objects defined therein.
On a side note, all floating-point computations in any programming language are subject to round-off errors and other inaccuracies. This is why the result of sin 𝜋 is not exactly 0, but some value very close thereto. We will elaborate on this topic in Section 5.5.6.
Packages can be given aliases, for the sake of code readability or due to our being lazy.
For instance, we are used to importing the numpy package under the np alias:
import numpy as np
And now, instead of writing, for example, numpy.random.rand(), we can call instead:
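np.random.rand()   # a single pseudorandom float in [0, 1); the value varies from call to call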
x = 1+2j
type(x)
## <class 'complex'>
5 https://docs.python.org/3/library/math.html
Exercise 2.5 Call help("complex") to reveal that the complex class features, amongst others,
the conjugate method and the real and imag slots.
Here is how we can read the two slots:
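x.real
## 1.0
x.imag
## 2.0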
Logical results might be combined using and (conjunction; for testing if both operands
are true) and or (alternative; for determining whether at least one operand is true).
Likewise, not (negation) is available too.
Notice that not 100 <= 3 is equivalent to 100 > 3. Also, based on De Morgan’s laws, not (1 > 2 and 2 < 3) is true if and only if 1 <= 2 or 2 >= 3 holds.
Exercise 2.6 Assuming that p, q, r are logical and a, b, c, d are float-type variables, simplify
the following expressions:
• not not p,
• not p and not q,
• not (not p or not q or not r),
• not a == b,
• not (b > a and b < c),
• not (a>=b and b>=c and a>=c),
• (a>b and a<c) or (a<c and a>d).
we can react enthusiastically to its being less than 0.5 (note the colon after the tested
condition):
if x < 0.5: print("spam!")
Multiple elif (else-if ) parts can also be added, followed by an optional else part, which
is executed if all the conditions tested are not true.
if x < 0.25: print("spam!")
elif x < 0.5: print("ham!") # i.e., x in [0.25, 0.5)
elif x < 0.75: print("bacon!") # i.e., x in [0.5, 0.75)
else: print("eggs!") # i.e., x >= 0.75
## bacon!
If more than one statement is to be executed conditionally, an indented code block can
be introduced.
Important The indentation must be neat and consistent. We recommend using four
spaces. The reader is encouraged to try to execute the following code chunk and note
what kind of error is generated:
if x < 0.5:
print("spam!")
print("ham!") # :(
Exercise 2.7 For a given BMI, print out the corresponding category as defined by the WHO
(underweight if below 18.5, normal range up to 25.0, etc.). Let us bear in mind that the BMI is
a simplistic measure. Both the medical and statistical communities point out its inherent limitations. Read the Wikipedia article thereon for more details (and appreciate the amount of data
wrangling required for its preparation – tables, charts, calculations; something that we will be
able to do quite soon, given good reference data, of course).
Exercise 2.8 (*) Check if it is easy to find on the internet (in reliable sources) some raw data sets
related to the body mass studies, e.g., measuring subjects’ height, weight, body fat and muscle
percentage, etc.
count = 0
while np.random.rand() > 0.01:
    count = count + 1
print(count)
## 117
Exercise 2.9 Using the while loop, determine the arithmetic mean of 10 randomly generated
numbers (i.e., the sum of the numbers divided by 10).
Example calls:
Note that the function returns a value. The result can be fetched and used in further
computations:
Exercise 2.10 Write a function named bmi which computes and returns a person’s BMI, given
their weight (in kilograms) and height (in centimetres). As documenting functions constitutes a
good development practice, do not forget about including a docstring.
We can also introduce new variables inside a function’s body. This can help the function perform what it has been designed to do.
def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs
    (alternative version).
    """
    m = a  # a local (temporary/auxiliary) variable
    if b < m:
        m = b
    if c < m:  # be careful! no `else` or `elif` here — it's a separate `if`
        m = c
    return m
Example call:
m = 7
n = 10
o = 3
min3(m, n, o)
## 3
All local variables cease to exist after the function is called. Notice that m inside the function is a variable independent of m in the global (calling) scope.
print(m) # this is still the global `m` from before the call
## 7
Exercise 2.11 Write a function max3 which determines the maximum of three given values.
Exercise 2.12 Write a function med3 which defines the median of three given values (the one
value that is in-between the other ones).
Exercise 2.13 (*) Write a function min4 to compute the minimum of four values.
they can be anonymous. This is useful when calling methods that take other functions
as their arguments. With lambdas, the latter can be generated on the fly.
2.5 Exercises
Exercise 2.14 What does import xxxxxx as x mean?
Exercise 2.15 What is the difference between if and while?
Exercise 2.16 Name the scalar types we introduced in this chapter.
Exercise 2.17 What is a docstring and how can we create and access it?
Exercise 2.18 What are keyword arguments?
3
Sequential and Other Types in Python
3.1.1 Lists
Lists consist of arbitrary Python objects. They are created using square brackets:
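x = [True, "two", 3, [4j, 5, "six"], None]
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None]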
Above is an example list featuring objects of type: bool, str, int, list (yes, it is possible to have a list inside another list), and None (the None object is the only one of its kind; it represents a placeholder for nothingness), in this order.
Note We will often be using lists when creating vectors in numpy or data frame columns
in pandas. Further, lists of lists of equal lengths can be used to create matrices.
Each list is mutable. Consequently, its state may be changed arbitrarily. For instance,
we can append a new object at its end:
x.append("spam")
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None, 'spam']
3.1.2 Tuples
Next, tuples are like lists, but they are immutable (read-only) – once created, they cannot
be altered.
This gave us a triple (a 3-tuple) featuring a string, an empty list, and a pair (a 2-tuple).
Let us stress that we can drop the round brackets and still get a tuple:
Also:
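42,
## (42,)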
Note the trailing comma; the above notation defines a singleton (a 1-tuple). It is not
the same as the simple 42 or (42), which is an object of type int.
Note Having a separate data type representing an immutable sequence makes sense
in certain contexts. For example, a data frame’s shape is its inherent property that
should not be tinkered with. If a tabular dataset has 10 rows and 5 columns, we
should not allow the user to set the former to 15 (without making further assumptions,
providing extra data, etc.).
When creating collections of items, we usually prefer lists, as they are more flexible
a data type. Yet, in Section 3.4.2, we will mention that many functions return tuples.
We should be able to handle them with confidence.
3.1.3 Ranges
Objects defined by calling range(from, to) or range(from, to, by) represent arithmetic progressions of integers. For the sake of illustration, let us convert a few of them
to ordinary lists:
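list(range(0, 5))        # the particular calls here are our own illustrations
## [0, 1, 2, 3, 4]
list(range(2, 10, 3))
## [2, 5, 8]
list(range(10, 0, -1))
## [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]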
Let us point out that the rightmost boundary (to) is exclusive and that by defaults to 1.
print("lovely\nspam")
## lovely
## spam
Strings are often treated as scalars (atomic entities, as in: a string as a whole). However, as we will soon find out, their individual characters can also be accessed by index. Furthermore, in Chapter 14, we will discuss a plethora of operations on text.
The valid indexes are 0, 1, …, n-2, n-1, where n is the length (size) of the sequence, which
can be fetched by calling len.
Important Think of an index as the distance from the start of a sequence. For example,
x[3] means “three items away from the beginning”, i.e., the fourth element.
"string"[3]
## 'i'
Indexing a string returns a string – that is why we classified strings as scalars too.
More examples:
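x = ["one", "two", "three", "four", "five"]   # an example list (its original definition is not shown in this excerpt)
x[0]
## 'one'
x[1]
## 'two'
x[len(x)-1]   # the last element
## 'five'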
Important The same “thing” can have different meanings in different contexts. Therefore, we should always be mindful.
For instance, raw square brackets are used to create a list (e.g., [1, 2, 3]) whereas
their presence after a sequential object indicates some form of indexing (e.g., x[1] or
even [1, 2, 3][1]).
Similarly, (1, 2) creates a 2-tuple and f(1, 2) denotes a call to a function f with two
arguments.
3.2.2 Slicing
We can also use slices of the form from:to or from:to:by to select a subsequence of a
given sequence. Slices are similar to ranges, but `:` can only be used within square
brackets.
In fact, from and to are optional – when omitted, they default to one of the sequence
boundaries.
Important Knowing the difference between element extraction and subsetting a sequence (creating a subsequence) is crucial.
For example:
x[0] # extraction (indexing with a single integer)
## 'one'
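extracts a single element (here, a string), whereas:
x[0:1]  # subsetting (indexing with a slice)
## ['one']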
gives the object of the same type as x (here, a list) featuring the items at those indexes (in this case, only the first object, but a slice can potentially select any number of elements, including none).
pandas data frames and numpy arrays will behave similarly, but there will be many more
indexing options (as discussed in Section 5.4, Section 8.2, and Section 10.5).
Exercise 3.1 There are quite a few methods that we can use to modify list elements: not only the
aforementioned append, but also insert, remove, pop, etc. Invoke help("list") to access their
descriptions and call them on a few example lists.
Exercise 3.2 Verify that we cannot perform anything similar to the above on tuples, ranges, and
strings. In other words, they are immutable.
7 in range(0, 10)
## True
[2, 3] in [ 1, [2, 3], [4, 5, 6] ]
## True
For strings, in tests whether a string features a specific substring, so we do not have to
restrict ourselves to single characters:
Exercise 3.3 Check out the count and index methods for the list and other classes.
"spam" * 3
## 'spamspamspam'
(1, 2) * 4
## (1, 2, 1, 2, 1, 2, 1, 2)
3.3 Dictionaries
Dictionaries (objects of type dict) are sets of key:value pairs, where the values (any
Python object) can be accessed by key (usually a string1 ).
1 Overall, hashable data types can be used as dictionary keys, e.g., integers, floats, strings, tuples, and
x = {
    "a": [1, 2, 3],
    "b": 7,
    "z": "spam!"
}
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}
We can also create a dictionary with string keys using the dict function which accepts
any keyword arguments:
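dict(a=[1, 2, 3], b=7, z="spam!")
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}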
x["a"]
## [1, 2, 3]
In this context, x[0] is not valid – it is not an object of sequential type; a key of 0 does
not exist in a given dictionary.
Example 3.4 (*) In practice, we often import JSON files (which is a popular data exchange
format on the internet) exactly in the form of Python dictionaries. Let us demo it quickly:
import requests
x = requests.get("https://api.github.com/users/gagolews/starred").json()
Now x is a sequence of dictionaries giving the information on the repositories starred by yours
truly on GitHub. As an exercise, the reader is encouraged to inspect its structure.
list("spam")
## ['s', 'p', 'a', 'm']
tuple(range(0, 10, 2))
## (0, 2, 4, 6, 8)
list({ "a": 1, "b": ["spam", "bacon", "spam"] })
## ['a', 'b']
Exercise 3.5 Take a look at the documentation of the extend method for the list class. The
manual page suggests that this operation takes any iterable object. Feed it with a list, tuple,
range, and a string and see what happens.
The notion of iterable objects is essential, as they appear in many contexts. There are
quite a few other iterable types that are, for example, non-sequential (we cannot access
their elements at random using the index operator).
Exercise 3.6 (*) Check out the enumerate, zip, and reversed functions and what kind of iterable objects they return.
Another example:
for i in range(len(x)):
    print(i, x[i], sep=": ")  # sep=" " is the default (element separator)
## 0: 1
## 1: two
One more example – computing the elementwise product of two vectors of equal lengths:
x = [1, 2, 3, 4, 5] # for testing
y = [1, 10, 100, 1000, 10000] # just a test
z = [] # result list – start with an empty one
for i in range(len(x)):
    z.append(x[i] * y[i])
print(z)
## [1, 20, 300, 4000, 50000]
Yet another example: here is a function that determines the minimum of a given iterable object (compare the built-in min function, see help("min")).
import math
def mymin(x):
    """
    The smallest element in an iterable object x.
    We assume that x consists of numbers only.
    """
    curmin = math.inf  # infinity is greater than any other number
    for e in x:
        if e < curmin:
            curmin = e  # a better candidate for the minimum
    return curmin
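An example call (the input list is of our own choosing):
mymin([5, 2, 8, 1])
## 1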
Exercise 3.7 Write your own basic versions (using the for loop) of the built-in max, sum, any,
and all functions.
Exercise 3.8 (*) The glob function in the glob module can be used to list all files in a given directory whose names match a specific wildcard, e.g., glob.glob("~/Music/*.mp3") ("~" points
to the current user’s home directory, see Section 13.6.1). Moreover, getsize from the os.path
module returns the size of a given file, in bytes. Write a function that determines the total size of
all the files in a given directory.
This is useful, for example, when the swapping of two elements is needed:
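a, b = 1, 2    # example values (ours)
a, b = b, a    # swap the values bound to `a` and `b`
print(a, b)
## 2 1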
Another use case is where we fetch outputs of functions that return many objects at
once. For instance, later we will learn about numpy.unique which (depending on arguments passed) may return a tuple of arrays:
import numpy as np
result = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)
print(result)
## (array([1, 2, 3]), array([5, 3, 1]))
That this is indeed a tuple of length two (which we should be able to tell already by
merely looking at the result: note the round brackets and two objects separated by a
comma) can be verified as follows:
type(result), len(result)
## (<class 'tuple'>, 2)
Instead of writing:
values = result[0]
counts = result[1]
we can write:
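values, counts = result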
print(values)
## [1 2 3]
print(counts)
## [5 3 1]
If only the second item was of our interest, we could have written:
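_, counts = result   # one possible idiom; `_` is a conventional name for a value we ignore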
Note (**) If there are too many values to unpack, we can use the notation like *name
inside the tuple_of_identifiers. This will serve as a placeholder that gathers all the
remaining values and wraps them up in a list:
a, b, *c, d = range(10)
print(a, b, c, d, sep="\n")
## 0
## 1
## [2, 3, 4, 5, 6, 7, 8]
## 9
This placeholder may appear only once on the left-hand side of the assignment operator.
Arguments to be matched positionally can be wrapped inside any iterable object and
then unpacked using the asterisk operator:
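args = (2.718281828459045, 2)   # example arguments (our own choice)
round(*args)                    # equivalent to round(2.718281828459045, 2)
## 2.72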
Keyword arguments can be wrapped inside a dictionary and unpacked with a double
asterisk:
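kwargs = dict(number=2.718281828459045, ndigits=2)   # again, our own example
round(**kwargs)
## 2.72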
The unpackings can be intertwined. For this reason, the following calls are equivalent:
For example:
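(a sketch: the function below is our own stand-in whose behaviour matches the description that follows)
def test(a, b, *args, **kwargs):
    print("a      =", a)
    print("b      =", b)
    print("args   =", args)
    print("kwargs =", kwargs)

test(1, 2, 3, 4, 5, spam=6, eggs=7)
## a      = 1
## b      = 2
## args   = (3, 4, 5)
## kwargs = {'spam': 6, 'eggs': 7}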
We see that *args gathers all the positionally matched arguments (except a and b,
which were set explicitly) into a tuple. On the other hand, **kwargs is a dictionary that
stores all keyword arguments not featured in the function’s parameter list.
Exercise 3.10 From time to time, we will be coming across *args and **kwargs in various contexts. Study what matplotlib.pyplot.plot uses them for (by calling help(plt.plot)).
x = [1, 2, 3]
y = x
the assignment operator does not create a copy of x; both x and y refer to the same
object in the computer’s memory.
Important If x is mutable, any change made to it will affect y (as, again, they are two
different means to access the same object). This will also be true for numpy arrays and
pandas data frames.
For example:
x.append(4)
print(y)
## [1, 2, 3, 4]
That now a call to print(x) gives the same result as above is left as an exercise.
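(The definition of myadd is not included in this excerpt; for the calls below to work as described, it must modify its first argument in place, e.g.:)
def myadd(lst, elem):
    lst.append(elem)   # appends to the very list passed, no copy is made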
And now:
myadd(x, 5)
myadd(y, 6)
print(x)
## [1, 2, 3, 4, 5, 6]
x = [1, 2, 3]
y = x.copy()
x.append(4)
print(y)
## [1, 2, 3]
This did not change the object referred to as y, because it is now a different entity.
x = [5, 3, 2, 4, 1]
print(sorted(x)) # returns a sorted copy of x (does not change x)
## [1, 2, 3, 4, 5]
print(x) # unchanged
## [5, 3, 2, 4, 1]
x = [5, 3, 2, 4, 1]
x.sort() # modifies x in place and returns nothing
print(x)
## [1, 2, 3, 4, 5]
Additionally, random.shuffle is a function (not: a method) that changes the state of the
argument:
x = [5, 3, 2, 4, 1]
import random
random.shuffle(x) # modifies x in place, returns nothing
print(x)
## [2, 1, 5, 4, 3]
Later we will learn about the Series class in pandas, which represents data frame
columns. It has the sort_values method which by default returns a sorted copy of the
object it acts upon:
import pandas as pd
x = pd.Series([5, 3, 2, 4, 1])
print(list(x.sort_values())) # inplace=False
## [1, 2, 3, 4, 5]
print(list(x)) # unchanged
## [5, 3, 2, 4, 1]
x = pd.Series([5, 3, 2, 4, 1])
x.sort_values(inplace=True) # note the argument now
print(list(x)) # changed
## [1, 2, 3, 4, 5]
long run, it is best to focus on developing the most transferable skills, as other software solutions might not enjoy all of Python’s syntactic sugar, and vice versa.
The reader is encouraged to skim through at least the following chapters in the official
Python 3 tutorial3 :
• 3. An Informal Introduction to Python4 ,
• 4. More Control Flow Tools5 ,
• 5. Data Structures6 .
3.7 Exercises
Exercise 3.11 Name the sequential objects we introduced.
Exercise 3.12 Is every iterable object sequential?
Exercise 3.13 Is dict an instance of a sequential type?
Exercise 3.14 What is the meaning of `+` and `*` operations on strings and lists?
Exercise 3.15 Given a list x featuring numeric scalars, how can we create a new list of the same
length giving the squares of all the elements in the former?
Exercise 3.16 (*) How can we make an object copy and when should we do so?
Exercise 3.17 What is the difference between x[0], x[1], x[:0], and x[:1], where x is a sequential object?
3 https://docs.python.org/3/tutorial/index.html
4 https://docs.python.org/3/tutorial/introduction.html
5 https://docs.python.org/3/tutorial/controlflow.html
6 https://docs.python.org/3/tutorial/datastructures.html
Part II
Unidimensional Data
4
Unidimensional Numeric Data and Their Empirical Distribution
Our data wrangling adventure starts the moment we get access to, or decide to collect,
dozens of data points representing some measurements, such as sensor readings for
some industrial processes, body measures for patients in a clinic, salaries of employees, sizes of cities, etc.
For instance, consider the heights of adult females (>= 18 years old, in cm) in the longitudinal study called National Health and Nutrition Examination Survey (NHANES1)
conducted by the US Centres for Disease Control and Prevention.
heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
This is an example of quantitative (numeric) data. They are in the form of a series of
numbers. It makes sense to apply various mathematical operations on them, including subtraction, division, taking logarithms, comparing which one is greater than the
other, and so forth.
Most importantly, here, all the observations are independent of each other. Each value
represents a different person. Our data sample consists of 4,221 points on the real
line (a bag of points whose actual ordering does not matter). In Figure 4.1, we see
that merely looking at the numbers themselves tells us nothing. There are too many
of them.
This is why we are interested in studying a multitude of methods that can bring some
insight into the reality behind the numbers. For example, inspecting their distribution.
1 https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx
Figure 4.1: The heights dataset is comprised of independent points on the real line; we
added some jitter on the y-axis for dramatic effects only: the points are too plentiful
Many other packages are built on top of numpy, including: scipy [90], pandas [62], and
sklearn [71]. This is why we should study it in great detail. Whatever we learn about
vectors will be beautifully transferable to the case of the processing of data frame
columns.
import numpy as np
Our code can now refer to the objects defined therein as np.spam, np.bacon, or np.spam.
2 https://numpy.org/doc/stable/reference/index.html
3 https://scipy.github.io/old-wiki/pages/History_of_SciPy
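For example (the original call is not reproduced in this excerpt; the six integer values below are our own stand-ins):
x = np.array([10, 20, 30, 40, 50, 60])
x
## array([10, 20, 30, 40, 50, 60])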
Here, the vector elements were specified by means of an ordinary list. Ranges and
tuples can also be used as content providers, which the kind reader is encouraged to
check themself.
len(x)
## 6
x.shape
## (6,)
Recall that Python lists, e.g., [1, 2, 3], represent simple sequences of objects of any
kind. Their use cases are very broad, which is both an advantage and something quite
the opposite. Vectors in numpy are like lists, but on steroids. They are powerful in scientific computing because of the underlying assumption that each object they store
is of the same type4 . Although it is possible to save references to arbitrary objects
therein, in most scenarios we will be dealing with vectors of logical values, integers,
and floating-point numbers. Thanks to this, a wide range of methods could have been
defined to enable the performing of the most popular mathematical operations.
And so, above we created a sequence of integers:
4 (*) Vectors are directly representable as simple arrays in the C programming language, in which numpy
procedures are written. Operations on vectors will be very fast provided that we are using the functions that
process them as a whole. The readers with some background in other lower-level languages will need to get
out of the habit of acting on individual elements using a for-like loop.
But other element types are possible too. For instance, we can convert the above to a
float vector:
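x.astype(float)   # one way to perform the conversion (a sketch)
## array([10., 20., 30., 40., 50., 60.])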
Let us emphasise that the above is now printed differently (compare the output of
print(x) above).
Furthermore:
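np.array([True, False, False, True])   # example values (ours)
## array([ True, False, False,  True])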
gave a logical vector. The constructor detected that the common type of all the elements is bool. Also:
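np.array(["spam", "bacon", "eggs"])   # example strings (ours)
## array(['spam', 'bacon', 'eggs'], dtype='<U5')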
This yielded an array of strings in Unicode (i.e., capable of storing any character in any
alphabet, emojis, mathematical symbols, etc.), each of no more than five code points
in length. We will point out in Chapter 14 that replacing any element with new content
will result in the too-long strings’ truncation. We will see that this can be remedied by
calling x.astype("<U10").
np.repeat(5, 6)
## array([5, 5, 5, 5, 5, 5])
np.repeat([1, 2], 3)
## array([1, 1, 1, 2, 2, 2])
np.repeat([1, 2], [3, 5])
## array([1, 1, 1, 2, 2, 2, 2, 2])
In each case, every element from the list passed as the 1st argument was repeated
the corresponding number of times, as defined by the 2nd argument. The kind reader
should not expect us to elaborate upon the obtained results any further, because
everything is evident: they need to look at the example calls, carefully study all the displayed outputs, and make the conclusions by themself. If something is unclear, they
should consult the official documentation and apply Rule #4.
Moving on. numpy.tile, on the other hand, repeats a whole sequence with recycling:
np.tile([1, 2], 3)
## array([1, 2, 1, 2, 1, 2])
Notice the difference between the above and the result of numpy.repeat([1, 2], 3).
See also7 numpy.zeros and numpy.ones for some specialised versions of the above.
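Another way to build a vector is with numpy.r_; for instance (the particular values below are our own illustration):
np.r_[1, 2, 3, np.nan, 5, np.inf]
## array([ 1.,  2.,  3., nan,  5., inf])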
Here, nan stands for a not-a-number and is used as a placeholder for missing values (discussed in Section 15.1) or wrong results, such as the square root of -1 in the domain of reals.
6 DuckDuckGo also supports search bangs like “!numpy linspace” which redirect to the official documentation automatically.
7 When we write See also, it means that this is an exercise for the reader (Rule #3), in this case: to look
The inf object, on the other hand, means infinity, ∞. We can think of it as a value that is too large to be represented in the set of floating-point numbers.
We see that numpy.r_ uses square brackets instead of the round ones. This is smart,
because we mentioned in Section 3.2.2 that slices (`:`) cannot be used outside them.
And so:
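np.r_[0:1:5j]   # our example endpoints: five equidistant points from 0 to 1
## array([0.  , 0.25, 0.5 , 0.75, 1.  ])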
Here, 5j does not have a literal meaning (a complex number). By an arbitrary convention, and only in this context, it denotes the output length of the sequence to be
generated. Could the numpy authors do that? Well, they could, and they did. End of
story.
Finally, we can combine many chunks into one:
and to pick a few values from a given set with replacement (so that any number can be
generated multiple times):
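np.random.choice([10, 20, 30], size=5, replace=True)   # example arguments (ours); the output is random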
It is worth knowing, though, that arrays with elements of the same type can be read
efficiently from text files (e.g., CSV) using numpy.loadtxt. See the code chunk at the
beginning of this chapter for an example.
Exercise 4.2 Use numpy.loadtxt to read the population_largest_cities_unnamed8 dataset from GitHub (click Raw to get access to its contents and use the URL you were redirected to,
not the original one).
𝒙 = (𝑥1 , 𝑥2 , … , 𝑥𝑛 ),
where 𝑥𝑖 is the 𝑖-th element therein and 𝑛 is the length (size) of the tuple. Using the
programming syntax, 𝑛 corresponds to len(x) or, equivalently, x.shape[0]. Furthermore, 𝑥𝑖 is x[i-1] (because the first element is at index 0).
The bold font (hopefully visible) is to emphasise that 𝒙 is not an atomic entity (𝑥), but rather a collection thereof. For brevity, instead of saying “let 𝒙 be a real-valued sequence9 of length 𝑛”, we shall write “let 𝒙 ∈ ℝⁿ”. Here:
• the “∈” symbol stands for “is in” or “is a member of”,
• ℝ denotes the set of real numbers (the very one that features 0, −358745.2394, 42, and 𝜋, amongst uncountably many others), and
• ℝⁿ is the set of real-valued sequences of length 𝑛 (i.e., 𝑛 such numbers considered at a time); e.g., ℝ² includes pairs such as (1, 2), (𝜋/3, √2/2), and (1/3, 10³).
Note Mathematical notation is pleasantly abstract (general) in the sense that 𝒙 can be
anything, e.g., data on the incomes of households, sizes of the largest cities in some
country, or heights of participants in some longitudinal study. At first glance, such a
representation of objects from the so-called real world might seem overly simplistic,
especially if we wish to store information on very complex entities. Nonetheless, in
most cases, expressing them as vectors (i.e., establishing a set of numeric attributes
that best describe them in a task at hand) is not only natural but also perfectly sufficient for achieving whatever we aim at.
• How would you represent a car in an insurance company’s database (to determine how much
a driver should pay annually for the mandatory policy)?
• How would you represent a student in a university (to grant them scholarships)?
In each case, list a few numeric features that best describe the reality of concern. On a side note,
descriptive (categorical) labels can always be encoded as numbers, e.g., female = 1, male = 2, but
this will be the topic of Chapter 11.
By 𝑥(𝑖) (notice the bracket10 ) we will denote the i-th smallest value in 𝒙 (also called the 𝑖-
th order statistic). In particular, 𝑥(1) is the sample minimum and 𝑥(𝑛) is the maximum.
The same in Python:
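heights_sorted = np.sort(heights)        # a sketch: order statistics via sorting
heights_sorted[0], heights_sorted[-1]    # the sample minimum and maximum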
To avoid the clutter of notation, in certain formulae (e.g., in the definition of the type-7
quantiles in Section 5.1.1), we will be assuming that 𝑥(0) is the same as 𝑥(1) and 𝑥(𝑛+1)
is equivalent to 𝑥(𝑛) .
Note It is customary to call a single function from seaborn and then perform a series of
additional calls to matplotlib to tweak the display details. It is important to remember
that the former uses the latter to achieve its goals, not the other way around. In many
exercises, seaborn might not even have the required functionality at all, and we will be
using matplotlib only, and nothing else.
10 Some textbooks denote the i-th order statistic with 𝑥𝑖∶𝑛 , but we will not.
11 https://seaborn.pydata.org
12 https://matplotlib.org/
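The histogram in Figure 4.2 can be drawn along the following lines (a sketch; the styling details of the original call are not reproduced here):
sns.histplot(heights, bins=11, color="lightgray")
plt.show()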
Figure 4.2: Histogram of the heights dataset; the empirical distribution is nicely bell-
shaped
The data were split into 11 bins and plotted in such a way that the bar heights are proportional to the number of observations falling into each interval. The bins are non-overlapping, adjacent to each other, and of equal lengths. We can read their coordinates by looking at the bottom side of each rectangular bar. For example, ca. 1200 observations fall into the interval [158, 163] (more or less) and ca. 400 into [168, 173] (approximately).
This distribution is bell-shaped13 – nicely symmetrical around about 160 cm. The most
typical (normal) observations are somewhere in the middle, and the probability mass
decreases quickly on both sides. As a matter of fact, in Chapter 6, we will model this
dataset using a normal distribution and obtain an excellent fit. In particular, we will
mention that observations outside the interval [139, 181] are very rare (probability less
than 1%; via the 3σ rule, i.e., expected value ± 3 standard deviations).
case14 , e.g., in psychology (IQ or personality tests), physiology (the above heights), or
when measuring stuff with not-so-precise devices (distribution of errors). We might
be tempted to think now that everything is normally distributed, but this is very much
untrue.
Let us consider another dataset. In Figure 4.3, we depict the distribution of a simulated15 sample of disposable income of 1,000 randomly chosen UK households, in British Pounds, for the financial year ending 2020.
income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/uk_income_simulated_2020.txt")
sns.histplot(income, stat="percent", bins=20, color="lightgray")
plt.show()
Figure 4.3: A probability histogram of the income dataset (y-axis: Percent; x-axis: income in £, from 0 to 200,000)
We normalised (stat="percent") the bar heights so that they all sum to 1 (or, equivalently, 100%), which resulted in a probability histogram.
We notice that the probability density quickly increases, reaches its peak at around
£15,500–£35,000, and then slowly goes down. We say that it has a long tail on the right
or that it is right- or positive-skewed. Accordingly, there are several people earning good
money. It is quite a non-normal distribution. Most people are rather unwealthy: their
14 In fact, we have a proposition stating that the sum or average of many observations or otherwise simpler components of some more complex entity, assuming that they are independent and follow the same
(any!) distribution with finite variance, is approximately normally distributed. This is called the Central
Limit Theorem and it is a very strong mathematical result.
15 For privacy and other reasons, the UK Office for National Statistics does not publish details on individual taxpayers. This is why we needed to guesstimate them based on data from a report published at
https://www.ons.gov.uk/peoplepopulationandcommunity.
income is way below the per-capita revenue (being the average income for the whole
population).
Note Looking at Figure 4.3, we might have taken note of the relatively higher bars, as
compared to their neighbours, at ca. £100,000 and £120,000. We might be tempted to
try to invent a story about why there can be some difference in the relative probability
mass, but we should refrain from it. As our data sample is quite small, they might
merely be due to some natural variability (Section 6.4.4). Of course, there might be
some reasons behind it (theoretically), but we cannot read this only by looking at a
single histogram. In other words, it is a tool that we use to identify some rather general
features of the data distribution (like the overall shape), not the specifics.
Exercise 4.4 There is also the nhanes_adult_female_weight_202016 dataset in our data repository, giving corresponding weights (in kilograms) of the NHANES study participants. Draw
a histogram. Does its shape resemble the income or heights distribution more?
For example, in the histogram with five bins, we miss the information that the ca. £20,000 income is more popular than the ca. £10,000 one (as given by the first two bars in Figure 4.3).
On the other hand, the histogram with 200 bins already seems too fine-grained.
2020.txt
Figure 4.4: Too few and too many histogram bins (the income dataset)
the data being presented in the most unambiguous fashion possible. Providing two or
three histograms can sometimes be a much better idea.
Further, let us be aware that someone might want to trick us by choosing the number
of bins that depict the reality in good light, when the truth is quite the opposite. For
instance, the histogram on the left above hides the poorest households inside the first
bar – the first income bracket is very wide. If we cannot request access to the original
data, the best thing we can do is to simply ignore such a data visualisation instance
and warn others not to trust it. A true data scientist must be sceptical.
Thus, there are 238 observations both in the [15,461; 25,172) and [25,172; 34,883) intervals.
Note A table of ranges and the corresponding counts can be effective for data reporting. It is more informative and takes less space than a series of raw numbers, especially if we present them like in the table below.
Table 4.1: Incomes of selected British households; the bin edges are pleasantly round numbers

income bracket [£1000s]    count
0–20                         236
20–40                        459
40–60                        191
60–80                         64
80–100                        26
100–120                       11
120–140                       10
140–160                        2
160–180                        0
180–200                        1
Reporting data in tabular form can also increase the privacy of the subjects (making
subjects less identifiable, which is good) or hide some uncomfortable facts (which is
not so good; “there are ten people in our company earning more than £200,000 p.a.” –
this can be as much as £10,000,000, but shush).
Exercise 4.5 Find how we can provide the seaborn.histplot and numpy.histogram func-
tions with custom bin breaks. Plot a histogram where the bin edges are 0, 20,000, 40,000, etc.
(just like in the above table). Also let us highlight the fact that bins do not have to be of equal sizes:
set the last bin to [140,000; 200,000].
Example 4.6 Let us also inspect the bin edges and counts that we see in Figure 4.2:
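A sketch of such an inspection (the bin count below is a placeholder for whatever was used to draw Figure 4.2):

counts, bins = np.histogram(heights, 11)
counts   # the number of observations falling into each bin
bins     # the corresponding bin edges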
Exercise 4.7 (*) There are quite a few heuristics to determine the number of bins automagic-
ally, see numpy.histogram_bin_edges for a few formulae. Check out how different values of the
bins argument (e.g., "sturges", "fd") affect the histogram shapes on both income and heights
datasets. Each has its limitations, none is perfect, but some might be a good starting point for fur-
ther fine-tuning.
We will get back to the topic of manual data binning in Section 11.1.4.
Let us also consider a dataset that comes pre-aggregated: the (average) hourly pedestrian counts registered near Melbourne's Southern Cross Station in December 2019¹⁷.
peds = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/southern_cross_station_peds_2019_dec.txt")
peds
## array([ 31.22580645, 18.38709677, 11.77419355, 8.48387097,
## 8.58064516, 58.70967742, 332.93548387, 1121.96774194,
## 2061.87096774, 1253.41935484, 531.64516129, 502.35483871,
## 899.06451613, 775. , 614.87096774, 825.06451613,
## 1542.74193548, 1870.48387097, 884.38709677, 345.83870968,
## 203.48387097, 150.4516129 , 135.67741935, 94.03225806])
This time, data have already been binned by somebody else. Consequently, we cannot
use seaborn.histplot to depict them. Instead, we can rely on a more low-level func-
tion, matplotlib.pyplot.bar; see Figure 4.5.
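A sketch of such a call (the styling mirrors the matura example that follows):

plt.bar(np.arange(0, 24), width=1, height=peds,
    color="lightgray", edgecolor="black", alpha=0.8)
plt.xlabel("hour of the day")
plt.show()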
Here is another example of data that come pre-summarised:
matura = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/matura_2019_polish.txt")
plt.bar(np.arange(0, 71), width=1, height=matura,
color="lightgray", edgecolor="black", alpha=0.8)
plt.show()
17 http://www.pedestrian.melbourne.vic.gov.au/
Figure 4.5: A bar plot of the pre-binned peds dataset (hourly pedestrian counts)

Figure 4.6: A bar plot of the pre-binned matura dataset (exam scores, as percentages of students)
This gives the distribution18 of the 2019 Matura (end of high school) exam scores in
Poland (in %) – Polish literature19 at the basic level.
It seems that the distribution should be bell-shaped, but someone tinkered with it.
Still, knowing that:
• the examiners are good people – we teachers love our students,
• 20 points were required to pass,
• 50 points were given for an essay – and beauty is in the eye of the beholder,
it all starts to make sense. Without graphically depicting this dataset, we would not
know that such (albeit lucky for some students) anomalies occurred.
marathon = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/37_pzu_warsaw_marathon_mins.txt")
Plotting the histogram of the data on the participants who finished the 42.2 km run in
less than three hours, i.e., a truncated version of this dataset, reveals that the data are
highly left-skewed, see Figure 4.7.
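A sketch of the truncation and the plot:

marathon_truncated = marathon[marathon < 180]   # finish times below three hours
sns.histplot(marathon_truncated, color="lightgray")
plt.show()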
This was of course expected – there are only a few elite runners in the game. Yours truly
wishes his personal best will be less than 180 minutes someday. We shall see. Running
is fun, and so is walking; why not take a break for an hour and go outside?
Exercise 4.8 Plot the histogram of the untruncated (complete) version of this dataset.
Another interesting dataset gives the populations of the US cities and other settlements (as of the year 2000):
cities = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/other/us_cities_2000.txt")
18 https://cke.gov.pl/images/_EGZAMIN_MATURALNY_OD_2015/Informacje_o_wynikach/2019/
sprawozdanie/Sprawozdanie%202019%20-%20J%C4%99zyk%20polski.pdf
19 Gombrowicz, Nałkowska, Miłosz, Tuwim, etc.; I recommend.
Figure 4.7: Histogram of a truncated version of the marathon dataset; the distribution
is left-skewed
Let us restrict ourselves only to the cities whose population is not less than 10,000 (an-
other instance of truncating, this time on the other side of the distribution). It turns
out that, even though they constitute ca. 14% of all the US settlements, as much as
about 84% of all the citizens live there.
Here are the populations of the five largest cities (can we guess which ones they are?):
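A sketch of the subsetting and the preview (the variable name large_cities is also referred to below):

large_cities = cities[cities >= 10000]
np.sort(large_cities)[-5:]   # the five largest populations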
The histogram is depicted in Figure 4.8. It is virtually unreadable because the dis-
tribution is not just right-skewed; it is extremely heavy-tailed: most cities are small,
and those that are large – such as New York – are really unique. Had we plotted the
whole dataset (cities instead of large_cities), the results’ intelligibility would be
even worse.
This is why we should rather draw such a distribution on the logarithmic scale, see Fig-
ure 4.9.
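A sketch using seaborn.histplot's log_scale option:

sns.histplot(large_cities, log_scale=True, color="lightgray")
plt.show()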
Figure 4.8: A histogram of the large_cities dataset on the linear scale; the distribution is extremely heavy-tailed
Figure 4.9: A histogram of the large_cities dataset on the logarithmic scale
The log-scale on the x-axis does not increase linearly: it is not based on steps of equal size like 0, 1,000,000, 2,000,000, and so forth. Instead, the increases are geometric: 10,000, 100,000, 1,000,000, etc.
This is a right-skewed distribution even on the logarithmic scale. Many real-world
datasets have similar behaviour, for instance, the frequencies of occurrences of words
in books. On a side note, in Chapter 6, we will discuss the Pareto distribution family
which yields similar histograms.
Exercise 4.9 Draw the histogram of income on the logarithmic scale. Does it resemble a bell-
shaped distribution?
Exercise 4.10 (*) Use numpy.geomspace and numpy.histogram to apply logarithmic binning
of the large_cities dataset manually, i.e., to create bins of equal lengths on the log-scale.
Figure 4.10: A cumulative histogram of the heights dataset (percentages)
The plot of the empirical cumulative distribution function (ECDF) is very similar. For a sample $\boldsymbol{x} = (x_1, \dots, x_n)$, we denote it by $\hat{F}_n$. At any given point $t \in \mathbb{R}$, $\hat{F}_n(t)$ is a step function²⁰ that gives the proportion of observations in our sample that are not greater than $t$:
$$\hat{F}_n(t) = \frac{|i : x_i \le t|}{n}.$$
We read |𝑖 ∶ 𝑥𝑖 ≤ 𝑡| as the number of indexes like 𝑖 such that the corresponding 𝑥𝑖 is
less than or equal to $t$. It can be shown that, given the ordered inputs $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, it holds:
$$\hat{F}_n(t) = \begin{cases} 0 & \text{for } t < x_{(1)}, \\ k/n & \text{for } x_{(k)} \le t < x_{(k+1)}, \\ 1 & \text{for } t \ge x_{(n)}. \end{cases}$$
Let us underline the fact that drawing the ECDF does not involve binning – we only
need to arrange the observations in an ascending order. Then, assuming that all ob-
servations are unique (there are no ties), the arithmetic progression 1/𝑛, 2/𝑛, … , 𝑛/𝑛
is plotted against them; see Figure 4.11²¹.
n = len(heights)
heights_sorted = np.sort(heights)
plt.plot(heights_sorted, np.arange(1, n+1)/n, drawstyle="steps-post")
plt.xlabel("$t$") # LaTeX maths
plt.ylabel("$\\hat{F}_n(t)$, i.e., Prob(height $\\leq$ t)")
plt.show()
Exercise 4.11 Check out seaborn.ecdfplot for a built-in method implementing the drawing
of an ECDF.
Note (*) Quantiles (which we introduce in Section 5.1.1) can be considered a general-
ised inverse of the ECDF.
4.4 Exercises
Exercise 4.12 What is the difference between numpy.arange and numpy.linspace?
Exercise 4.13 (*) What happens when we convert a logical vector to a numeric one? And what
about when we convert a numeric vector to a logical one? We will discuss that later, but you might
want to check it yourself now.
20 We cannot see the steps in Figure 4.11, because the points are too plentiful.
21 (*) We are using (La)TeX maths typesetting within "$...$" to obtain nice plot labels, see [68] for a good introduction.
Figure 4.11: Empirical cumulative distribution function for the heights dataset
Exercise 4.14 Check what happens when we try to create a vector featuring a mix of logical,
integer, and floating-point values.
Exercise 4.15 What is a bell-shaped distribution?
Exercise 4.16 What is a right-skewed distribution?
Exercise 4.17 What is a heavy-tailed distribution?
Exercise 4.18 What is a multi-modal distribution?
Exercise 4.19 (*) When does logarithmic binning make sense?
5
Processing Unidimensional Data
It is extremely rare for our datasets to bring interesting and valid insights out of the
box. The ones we are using for illustrational purposes in the first part of our book have
already been curated. After all, this is an introductory course, and we need to build
up the necessary skills and not overwhelm the kind reader with too much information
at the same time. We learn simple things first, learn them well, and then we move to
more complex matters with a healthy level of confidence.
In real life, various data cleansing and feature engineering techniques will need to be per-
formed on data. Most of them are based on the simple operations on vectors that we
cover in this chapter:
• summarising data (for example, computing the median or sum),
• transforming values (applying mathematical operations on each element, such as
subtracting a scalar or taking the natural logarithm),
• filtering (selecting or removing observations that meet specific criteria, e.g., those
that are larger than the arithmetic mean ± 5 standard deviations).
Important We will be applying the same operations to individual data frame columns in Chapter 10.

The two most commonly used measures of location are:
• the arithmetic mean (the average), being the sum of all the values divided by the sample size:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^n x_i,$$
• the median, being the middle value in a sorted version of the sample if its length is odd, or the arithmetic mean of the two middle values otherwise:
$$m = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd}, \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even}. \end{cases}$$
heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
np.mean(heights), np.median(heights)
## (160.13679222932953, 160.1)
income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/uk_income_simulated_2020.txt")
np.mean(income), np.median(income)
## (35779.994, 30042.0)
• for symmetric distributions, the arithmetic mean and the median are expected to
be more or less equal,
• for skewed distributions, the arithmetic mean will be biased towards the heavier
tail.
Exercise 5.1 Get the arithmetic mean and median for the 37_pzu_warsaw_marathon_mins
dataset mentioned in Chapter 4.
Exercise 5.2 (*) Write a function that computes the median without the use of numpy.median
(based on its mathematical definition and numpy.sort).
Note (*) Technically, the arithmetic mean can also be computed using the mean method
for the numpy.ndarray class – it will sometimes be the case that we have many ways
to perform the same operation. We can even “implement” it manually using the sum
function. Thus, all the following expressions are equivalent:
print(
np.mean(income),
income.mean(),
np.sum(income)/len(income),
income.sum()/income.shape[0]
)
## 35779.994 35779.994 35779.994 35779.994
On the other hand, there exists the numpy.median function but, unfortunately, the me-
dian method for vectors is not available. This is why we prefer sticking with functions.
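The comparison that follows is based on a copy of the income data with one extremely large observation added; a sketch (the appended amount is an assumption):

income2 = np.append(income, [1_000_000_000])   # add one billionaire (illustrative value)
np.mean(income2)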
Comparing this new result to the previous one, oh we all feel much richer now, right?
In fact, the arithmetic mean reflects the income each of us would get if all the wealth
were gathered inside a single Santa Claus’s (or Robin Hood’s) sack and then distrib-
uted equally amongst all of us. A noble idea provided that everyone contributes equally
to the society, which unfortunately is not the case.
On the other hand, the median is the value such that 50% of the observations are less than or equal to it and the remaining 50% are not less than it. Hence, it is
completely insensitive to most of the data points – on both the left and the right side
of the distribution:
print(np.median(income), np.median(income2))
## 30042.0 30076.0
Because of this, we cannot say that one measure is better than the other. Certainly, for
symmetrical distributions with no outliers (e.g., heights), the mean will be better as it
uses all data (and its efficiency can be proven for certain statistical models). For skewed
distributions (e.g., income), the median has a nice interpretation, as it gives the value
in the middle of the ordered sample. Let us still remember that these data summaries
allow us to look at a single data aspect only, and there can be many different, valid
perspectives. The reality is complex.
Sample Quantiles
Quantiles generalise the notions of the sample median and of the inverse of the em-
pirical cumulative distribution function (Section 4.3.8). They provide us with the value
that is not exceeded by the elements in a given sample with a predefined probability.
Before proceeding with a formal definition, which is quite technical, let us point out
that for larger sample sizes, we have the following rule of thumb.
Important For any 𝑝 between 0 and 1, the 𝑝-quantile, denoted 𝑞𝑝 , is a value dividing
the sample in such a way that approximately 100𝑝% of observations are not greater
than 𝑞𝑝 , and the remaining ca. 100(1 − 𝑝)% are not less than 𝑞𝑝 .
Quantiles appear under many different names, but they all refer to the same concept.
In particular, we can speak about the 100𝑝-th percentiles, e.g., the 0.5-quantile is the
same as the 50th percentile.
Furthermore:
• 0-quantile (𝑞0 ) – the minimum (also: numpy.min),
• 0.25-quantile (𝑞0.25 ) – the 1st quartile (denoted 𝑄1 ),
• 0.5-quantile (𝑞0.5 ) – the 2nd quartile a.k.a. median (denoted 𝑚 or 𝑄2 ),
• 0.75-quantile (𝑞0.75 ) – the 3rd quartile (denoted 𝑄3 ),
• 1.0-quantile (𝑞1 ) – the maximum (also: numpy.max).
Here are the above five aggregates for our two datasets:
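A sketch using numpy.quantile:

np.quantile(heights, [0, 0.25, 0.5, 0.75, 1])
np.quantile(income, [0, 0.25, 0.5, 0.75, 1])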
Exercise 5.3 What is the income bracket for 95% of the most typical UK taxpayers? In other
words, determine the 2.5- and 97.5-percentiles.
Exercise 5.4 Compute the midrange of income and heights, being the arithmetic mean of the
minimum and the maximum (this measure is extremely sensitive to outliers).
Note (*) As we do not like the approximately part in the “asymptotic definition” given above, in this course we shall assume that for any $p \in [0, 1]$, the $p$-quantile is given by
$$q_p = x_{(\lfloor k \rfloor)} + (k - \lfloor k \rfloor)\left(x_{(\lfloor k \rfloor + 1)} - x_{(\lfloor k \rfloor)}\right),$$
where $k = (n-1)p + 1$ and $\lfloor k \rfloor$ denotes the floor function, i.e., the greatest integer less than or equal to $k$ (e.g., ⌊2.0⌋ = ⌊2.001⌋ = ⌊2.999⌋ = 2, ⌊3.0⌋ = ⌊3.999⌋ = 3, ⌊−3.0⌋ = ⌊−2.999⌋ = ⌊−2.001⌋ = −3, and ⌊−2.0⌋ = ⌊−1.001⌋ = −2).
𝑞𝑝 is a function that linearly interpolates between the points featuring the consecutive
order statistics, ((𝑘 − 1)/(𝑛 − 1), 𝑥(𝑘) ) for 𝑘 = 1, … , 𝑛. For instance, for 𝑛 = 5, we
connect the points (0, 𝑥(1) ), (0.25, 𝑥(2) ), (0.5, 𝑥(3) ), (0.75, 𝑥(4) ), (1, 𝑥(5) ). For 𝑛 = 6,
we do the same for (0, 𝑥(1) ), (0.2, 𝑥(2) ), (0.4, 𝑥(3) ), (0.6, 𝑥(4) ), (0.8, 𝑥(5) ), (1, 𝑥(6) ).
See Figure 5.1 for an illustration.
Notice that for 𝑝 = 0.5 we get the median regardless of whether 𝑛 is even or not.
Figure 5.1: 𝑞𝑝 as a function of 𝑝 for example vectors of length 5 (left subfigure) and 6
(right)
Note (**) There are many definitions of quantiles across statistical software; see the
method argument to numpy.quantile. They were nicely summarised in [53] as well as in
the corresponding Wikipedia2 article. They are all approximately equivalent for large
sample sizes (i.e., asymptotically), but the best practice is to be explicit about which
variant we are using in the computations when reporting data analysis results. Ac-
cordingly, in our case, we say that we are relying on the type-7 quantiles as described
in [53], see also [44].
In fact, simply mentioning that our computations are done with numpy version 1.xx (as specified in Section 1.4) implies that the default method parameters are used everywhere, unless otherwise stated. In many contexts, that is good enough.
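The summaries that follow measure the spread of a distribution. As a reminder, the (uncorrected-for-bias) sample standard deviation used below is
$$ s = \sqrt{ \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 }. $$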
np.std(heights), np.std(income)
## (7.062021850008261, 22888.77122437908)
2 https://en.wikipedia.org/wiki/Quantile
3 (**) Based on the so-called uncorrected for bias version of the sample variance. We prefer it here for didactical reasons (simplicity, better interpretability). Plus, it is the default one in numpy. Passing ddof=1 (delta degrees of freedom) to numpy.std will apply division by 𝑛 − 1 instead of by 𝑛. When used as an estimator of the distribution's standard deviation, the latter has slightly better statistical properties (which we normally explore in a course on mathematical statistics, which this one is not). On the other hand, we will see later that the std methods in the pandas package have ddof=1 by default. Therefore, we might be interested in setting ddof=0 therein.
The standard deviation quantifies the typical amount of spread around the arithmetic
mean. It is overall adequate for making comparisons across different samples meas-
uring similar things (e.g., heights of males vs of females, incomes in the UK vs in
South Africa). However, without further assumptions, it is quite difficult to express
the meaning of a particular value of 𝑠 (e.g., the statement that the standard deviation
of income is £22,900 is hard to interpret). This measure therefore makes the most sense for data distributions that are symmetric around the mean.
Note (*) For bell-shaped data (more precisely: for normally-distributed samples; see
the next chapter) such as heights, we sometimes report 𝑥 ̄ ±2𝑠, because the theoretical
expectancy is that ca. 95% of data points fall into the [𝑥 ̄ − 2𝑠, 𝑥 ̄ + 2𝑠] interval (the so-
called 2σ rule).
Further, the variance is the square of the standard deviation, 𝑠2 . Mind that if data are
expressed in centimetres, then the variance is in centimetres squared, which is not very
intuitive. The standard deviation does not have this drawback. Mathematicians find
the square root annoying though (for many reasons); that is why we will come across
the 𝑠2 every now and then too.
Interquartile Range
The interquartile range (IQR) is another popular measure of dispersion. It is defined
as the difference between the 3rd and the 1st quartile:
$$\mathrm{IQR} = Q_3 - Q_1 = q_{0.75} - q_{0.25}.$$
The IQR has an appealing interpretation: it is the range comprised of the 50% most typ-
ical values. It is a quite robust measure, as it ignores the 25% smallest and 25% largest
observations. Standard deviation, on the other hand, is extremely sensitive to out-
liers.
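For instance, a quick way to compute it (a sketch):

np.quantile(income, 0.75) - np.quantile(income, 0.25)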
Furthermore, by range (or support) we will mean a measure based on extremal
quantiles: it is the difference between the maximal and minimal observation.
Furthermore, the skewness measures the degree of asymmetry of a distribution; it is given by:
$$g = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^3}{\left(\sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2}\right)^3}.$$
scipy.stats.skew(heights)
## 0.0811184528074054
scipy.stats.skew(income)
## 1.9768735693998942
Note (*) It is worth stressing that 𝑔 > 0 does not imply that the sample mean is greater
than the median. As an alternative measure of skewness, sometimes the practitioners
use:
$$g' = \frac{\bar{x} - m}{s}.$$
Yule's coefficient is an example of a robust skewness measure:
$$g'' = \frac{Q_3 + Q_1 - 2m}{Q_3 - Q_1}.$$
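Location, spread, and skewness can also be judged graphically by means of a box-and-whisker plot; a sketch of how such a figure (drawn below for the heights and income datasets) can be produced with seaborn, assuming the usual imports:

plt.subplot(2, 1, 1)
sns.boxplot(x=heights, color="lightgray")
plt.xlabel("heights")
plt.subplot(2, 1, 2)
sns.boxplot(x=income, color="lightgray")
plt.xlabel("income")
plt.show()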
Figure 5.2: Box-and-whisker plots of the heights and income datasets
Note We are used to referring to the individually marked points as outliers. Still, it does not automatically mean there is anything anomalous about them. They are atypical in the sense that they are considerably farther away from the box. But this might as well indicate some problems in data quality (e.g., when someone made a typo entering the data). Actually, box plots are calibrated (via the nicely round magic constant 1.5⁴) in such a way that we expect there to be no or only few outliers if the data are normally distributed. For skewed distributions, there will naturally be many outliers on either side; see Section 15.4 for more details.

4 The 1.5IQR rule is the most popular in the statistical literature, but some plotting software may use
Box plots are based solely on sample quantiles. Most of the statistical packages do not
draw the arithmetic mean. If they do, it is marked with a distinctive symbol.
Exercise 5.5 Call matplotlib.pyplot.plot(numpy.mean(..data..), 0, "wX") after
seaborn.boxplot to mark the arithmetic mean with a white cross.
Box plots are particularly beneficial for comparing data samples with each other (e.g.,
body measures of men and women separately), both in terms of the relative shift (loc-
ation) as well as spread and skewness; see, e.g., Figure 12.1.
Example 5.6 (*) We may also sometimes be interested in a violin plot (Figure 5.3), which com-
bines a box plot (although with no outliers marked) and the so-called kernel density estimator
(which is a smoothened version of a histogram; see Section 15.4.2).
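A sketch of how such a figure can be produced:

plt.subplot(2, 1, 1)
sns.violinplot(x=heights, color="lightgray")
plt.xlabel("heights")
plt.subplot(2, 1, 2)
sns.violinplot(x=income, color="lightgray")
plt.xlabel("income")
plt.show()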
Figure 5.3: Violin plots of the heights and income datasets
As far as spread measures are concerned, the interquartile range (IQR) is a robust statistic. If necessary, the standard deviation might be replaced with:
• the mean absolute deviation from the mean: $\frac{1}{n} \sum_{i=1}^n |x_i - \bar{x}|$,
• the mean absolute deviation from the median: $\frac{1}{n} \sum_{i=1}^n |x_i - m|$,
• the median absolute deviation from the median: the median of $(|x_1 - m|, |x_2 - m|, \dots, |x_n - m|)$.
The coefficient of variation, being the standard deviation divided by the arithmetic mean,
is an example of a relative (or normalised) spread measure. It can be appropriate for
comparing data on different scales, as it is unit-less (think how standard deviation
changes when you convert between metres and centimetres).
The Gini index, widely used in economics, can also serve as a measure of relative dispersion, but it assumes that all data points are nonnegative:
$$G = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|}{2(n-1)\, n\, \bar{x}} = \frac{\sum_{i=1}^n (n - 2i + 1)\, x_{(n-i+1)}}{(n-1) \sum_{i=1}^n x_i}.$$
It is normalised so that it takes values in the unit interval. An index of 0 reflects the
situation where all values in a sample are the same (0 variance; perfect equality). If
there is a single entity in possession of all the “wealth”, and the remaining ones are 0,
then the index is equal to 1.
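For illustration, a direct implementation based on the second (sorted-data) formula above (a sketch):

def gini(x):
    """the Gini index of a vector of nonnegative values (a sketch)"""
    x = np.sort(np.asarray(x, dtype=float))   # the order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    return np.sum((n - 2*np.arange(1, n+1) + 1) * x[::-1]) / ((n - 1)*np.sum(x))

gini(np.array([1, 1, 1, 1])), gini(np.array([0, 0, 0, 5]))
## (0.0, 1.0)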
Overall, numerical aggregates should be used in cases where data are unimodal. For multimodal mixtures or data in groups, they should rather be applied to summarise each cluster/class separately; compare Chapter 12. Also, in Chapter 8, we will consider a few summaries for multidimensional data.
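The operations referred to in the next paragraph are numpy's vectorised mathematical functions (numpy.abs, numpy.sqrt, numpy.log, numpy.exp, and so forth), which are applied on each element of a given vector; for instance:

np.log(income[:3])       # the natural logarithm of the first three incomes
np.sqrt(np.array([1., 4., 9., 16.]))
## array([1., 2., 3., 4.])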
We will frequently be using such operations for adjusting the data, e.g., as in Fig-
ure 6.7, where we discover that the logarithm of the UK incomes has a bell-shaped dis-
tribution.
Important Thanks to the vectorised functions, our code is not only more readable,
but also runs faster: we do not have to employ the generally slow Python-level while or
for loops to traverse through each element in a given sequence.
Here are some significant properties of the natural logarithm and its inverse, the exponential function. By convention, Euler's number $e \simeq 2.718$, $\log x = \log_e x$, and $\exp(x) = e^x$.
• $\log 1 = 0$ and $\log e = 1$; note that logarithms are only defined for $x > 0$: in the limit as $x \to 0$, we have $\log x \to -\infty$,
• $\log x^y = y \log x$, and hence $\log e^x = x$,
• $\log(xy) = \log x + \log y$, and thus $\log(x/y) = \log x - \log y$,
• $e^0 = 1$, $e^1 = e$, and $e^x \to 0$ as $x \to -\infty$,
• $e^{\log x} = x$,
• $e^{x+y} = e^x e^y$, and so $e^{x-y} = e^x / e^y$,
• $e^{xy} = (e^x)^y$.
Both functions are strictly increasing. For 𝑥 ≥ 1, the logarithm grows very slowly
whereas the exponential function increases very rapidly; see Figure 5.4.
plt.subplot(1, 2, 1)
x = np.linspace(np.exp(-2), np.exp(3), 1001)
plt.plot(x, np.log(x), label="$y=\\log x$")
plt.legend()
plt.subplot(1, 2, 2)
x = np.linspace(-2, 3, 1001)
plt.plot(x, np.exp(x), label="$y=\\exp(x)$")
plt.legend()
plt.show()
Figure 5.4: The natural logarithm (left) and the exponential function (right)
Logarithms of different bases and non-natural exponential functions are also available. In particular, when drawing plots, we used the base-10 logarithmic scales on the axes. It holds $\log_{10} x = \frac{\log x}{\log 10}$, and its inverse is $10^x = e^{x \log 10}$. For example:
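A sketch (the input values are illustrative):

np.log10([1, 10, 100, 1000])
## array([0., 1., 2., 3.])
10.0**np.array([0, 1, 2, 3])
## array([   1.,   10.,  100., 1000.])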
Moving on, the trigonometric functions in numpy accept angles in radians. If $x$ is an angle given in degrees, to compute its cosine we should instead write $\cos(x \pi / 180)$; see Figure 5.5.
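A sketch of how such a figure can be drawn (the plotting range is a guess):

x = np.linspace(-2*np.pi, 4*np.pi, 1001)
plt.plot(x, np.cos(x))
plt.show()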
Figure 5.5: The cosine function
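The next topic is arithmetic operators applied on a vector and a single scalar; a small sketch of the behaviour being described (hypothetical values):

x = np.array([-2, -1, 0, 1, 2])
x**2    # each element squared
## array([4, 1, 0, 1, 4])
x/5     # each element divided by 5
## array([-0.4, -0.2,  0. ,  0.2,  0.4])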
In such a case, each element in the vector is being operated upon (e.g., squared, di-
vided by 5) and we get a vector of the same length in return. Hence, in this case, the
operators behave just like the vectorised mathematical functions discussed above.
Mathematically, it is common to assume that the scalar multiplication and, less com-
monly, the addition are performed in this way.
We will also become used to writing $(\boldsymbol{x} - t)/c$, which is of course equivalent to $(1/c)\boldsymbol{x} + (-t/c)$.
Figure 5.6: An example vector x together with its scaled and shifted versions 2x, 0.5x, and 0.5x + 2
Note Let $\boldsymbol{y} = c\boldsymbol{x} + t$ and let $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ denote the two vectors' arithmetic means and standard deviations. The following properties hold.
• The arithmetic mean and all the quantiles (including, of course, the median) are equivariant with respect to translation and scaling; for instance, it holds $\bar{y} = c\bar{x} + t$.
• The standard deviation, the interquartile range, and the range are invariant to translations and equivariant to scaling; e.g., $s_y = c s_x$.
heights[-5:] # preview
## array([157. , 167.4, 159.6, 168.5, 147.8])
np.mean(heights), np.std(heights)
## (160.13679222932953, 7.062021850008261)
heights_std = (heights-np.mean(heights))/np.std(heights)
heights_std[-5:] # preview
## array([-0.44417764, 1.02848843, -0.07601113, 1.18425119, -1.74692071])
What we obtained is sometimes referred to as the z-scores. They are nicely inter-
pretable:
• z-score of 0 corresponds to an observation equal to the sample mean (perfectly
average);
• z-score of 1 is obtained for a datum 1 standard deviation above the mean;
• z-score of -2 means that it is a value 2 standard deviations below the mean;
and so forth.
Because of the way they emerge, the mean of the z-scores is always 0 and standard
deviation is 1 (up to a tiny numerical error, as usual; see Section 5.5.6):
np.mean(heights_std), np.std(heights_std)
## (1.8920872660373198e-15, 1.0)
Even though the original heights were measured in centimetres, the z-scores are unit-
less (centimetres divided by centimetres).
Exercise 5.8 We have a patient whose height z-score is 1 and weight z-score is -1. How can we
interpret this information?
Exercise 5.9 How about a patient whose weight z-score is 0 but BMI z-score is 2?
On a side note, sometimes we might be interested in performing some form of robust
standardisation (e.g., for skewed data or those that feature some outliers). In such a
case, we can replace the mean with the median and the standard deviation with the
IQR.
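The statement below refers to min-max scaling; a sketch (the example vector was chosen to be consistent with the clipping output displayed further down):

x = np.array([-0.5, 0.5, 1.5, -0.5, 0.25, 0.8])
(x - np.min(x)) / (np.max(x) - np.min(x))   # min-max scaling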
Here, the smallest value is mapped to 0 and the largest becomes equal to 1. Let us stress
that, in this context, 0.5 does not represent the value which is equal to the mean (unless
we are incredibly lucky).
Also, clipping can be used to replace all values less than 0 with 0 and those greater than
1 with 1.
np.clip(x, 0, 1)
## array([0. , 0.5 , 1. , 0. , 0.25, 0.8 ])
The function is of course flexible; another popular choice is clipping to [−1, 1]. This
can also be implemented manually by means of the vectorised pairwise minimum and
maximum functions.
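A sketch (illustrative input):

np.minimum(1, np.maximum(-1, np.array([-3, -0.5, 0.5, 3.0])))
## array([-1. , -0.5,  0.5,  1. ])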
Exercise 5.10 Normalisation is similar to standardisation if data are already centred (when
the mean was subtracted). Show that we can obtain one from the other via the scaling by √𝑛.
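The next snippet rescales a vector so that the absolute values of its elements sum to 1; an example vector consistent with the output shown below is:

x = np.array([2, 10, -8, 4, 5])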
x / np.sum(np.abs(x))
## array([ 0.06896552, 0.34482759, -0.27586207, 0.13793103, 0.17241379])
5 (*) A Box–Cox transformation can help achieve this in some datasets; see [10]. In Chapter 6, we ap-
ply its particular case: it will turn out that the logarithm of incomes follow a normal distribution (hence,
incomes follow a log-normal distribution). Generally, there is nothing “wrong” or “bad” about data’s being
not-normally distributed. It is just a nice feature to have in certain contexts.
For data that are already nonnegative, applying numpy.abs would of course not be necessary.
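The description that follows concerns the elementwise application of binary operators on two vectors of identical lengths; a sketch consistent with it (the third pair of values is illustrative):

np.array([2, 3, 4]) * np.array([10, 100, 1000])
## array([  20,  300, 4000])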
We see that the first element in the left operand (2) was multiplied by the first element
in the right operand (10). Then, we multiplied 3 by 100 (the second corresponding ele-
ments), and so forth.
Such a behaviour of the binary operators is inspired by the usual convention in vector
algebra where applying + (or −) on 𝒙 = (𝑥1 , … , 𝑥𝑛 ) and 𝒚 = (𝑦1 , … , 𝑦𝑛 ) means
exactly:
𝒙 + 𝒚 = (𝑥1 + 𝑦1 , 𝑥2 + 𝑦2 , … , 𝑥𝑛 + 𝑦𝑛 ).
Using other operators this way (elementwisely) is less standard in mathematics (for
instance multiplication might denote the dot product), but in numpy it is really con-
venient.
Example 5.12 Let us compute the value of the expression $h = -(p_1 \log p_1 + \dots + p_n \log p_n)$, i.e., $h = -\sum_{i=1}^n p_i \log p_i$ (the so-called entropy):
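A sketch (the probability vector is illustrative):

p = np.array([0.1, 0.3, 0.25, 0.2, 0.15])   # an example probability distribution (sums to 1)
-np.sum(p * np.log(p))                      # its entropy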
The above involves the use of a unary vectorised minus (change sign), an aggregation function
(sum), a vectorised mathematical function (log), and an elementwise multiplication of two vec-
tors of the same lengths.
Example 5.13 Let us assume that – for whatever reason – we would like to plot two mathematical functions, the sine, $f(x) = \sin x$, and a polynomial of degree 7, $g(x) = x - x^3/6 + x^5/120 - x^7/5040$, for $x$ in the interval $[-\pi, 3\pi/2]$.
To do this, we can probe the values of 𝑓 and 𝑔 at sufficiently many points using the vectorised
operations discussed so far and then use the matplotlib.pyplot.plot function to draw what
we see in Figure 5.7.
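A sketch (the line styles are arbitrary):

x = np.linspace(-np.pi, 1.5*np.pi, 1001)
f = np.sin(x)
g = x - x**3/6 + x**5/120 - x**7/5040
plt.plot(x, f, "k-", label="$f(x)$")
plt.plot(x, g, "k--", label="$g(x)$")
plt.legend()
plt.show()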
Figure 5.7: With vectorised functions, it is easy to generate plots like this one; we used
different line styles so that the plot is readable also when printed in black and white
Decreasing the number of points in x will reveal that the plotting function merely draws a series
of straight-line segments. Computer graphics is essentially discrete.
Exercise 5.14 Using a single line of code, compute the vector of BMIs of all persons based on the nhanes_adult_female_height_2020⁶ and nhanes_adult_female_weight_2020⁷ datasets. The BMI is the weight in kilograms divided by the square of the height in metres.
numpy vectors support two additional indexing schemes: using integer and boolean se-
quences.
We can also use lists or vectors of integer indexes, which return a subvector with the elements at the specified indexes:
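For example, consider the vector (consistent with the outputs that follow):

x = np.array([10, 20, 30, 40, 50])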
x[ [0] ]
## array([10])
x[ [0, 1, -1, 0, 1, 0, 0] ]
## array([10, 20, 50, 10, 20, 10, 10])
x[ [] ]
## array([], dtype=int64)
We added some spaces between the square brackets, but only because, for example,
6 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_height_
2020.txt
7 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_weight_
2020.txt
x[[0]] might seem slightly more enigmatic. (What are these double square brackets?
Nah, it is a list inside the index operator.)
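Logical (boolean) indexing selects the elements corresponding to True; for example:

x[ [True, False, True, True, False] ]
## array([10, 30, 40])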
This returned the 1st, 3rd, and 4th element (select 1st, omit 2nd, select 3rd, select 4th,
omit 5th).
This is particularly useful as a data filtering technique. Knowing that the relational op-
erators `<`, `<=`, `==`, `!=`, `>=`, and `>` on vectors are performed elementwisely (just
like `+`, `*`, etc.), for instance:
x >= 30
## array([False, False, True, True, True])
we can write:
x[ x >= 30 ]
## array([30, 40, 50])
to mean “select the elements in x which are not less than 30”.
Of course, the indexed vector and the vector specifying the filter do not8 have to be the
same:
y = (x/10) % 2 # whatever
y  # equal to 0 if the number is an even multiple of 10 (i.e., a multiple of 20)
## array([1., 0., 1., 0., 1.])
x[ y == 0 ]
## array([20, 40])
Important If we would like to combine many logical vectors, sadly we cannot use the
and, or, and not operators, because they are not vectorised (this is a limitation of our
language per se).
In numpy, we use the `&`, `|`, and `~` operators instead. Unfortunately, they have a
lower order of precedence than `<`, `<=`, `==`, etc. Therefore, the bracketing of the comparisons is obligatory.

8 (*) This is because the indexer is computed first, and its value is passed as an argument to the index operator. Python neither is a symbolic programming language, nor does it feature any nonstandard evaluation techniques. In other words, [...] does not care how the indexer was obtained.
For example:
x[ (20 <= x) & (x <= 40) ] # check what happens if we skip the brackets
## array([20, 30, 40])
means “elements in x between 20 and 40” (greater than or equal to 20 and less than or
equal to 40).
Exercise 5.15 Compute the BMIs only of the women whose height is between 150 and 170 cm.
5.4.3 Slicing
Just as with ordinary lists, slicing with “:” can be used to fetch the elements at indexes
in a given range like from:to or from:to:by.
x[::-1]
## array([50, 40, 30, 20, 10])
x[3:]
## array([40, 50])
x[1:4]
## array([20, 30, 40])
Important For efficiency reasons, slicing returns a view on existing data. It does not
have to make an independent copy of the subsetted elements, because sliced ranges
are regular by definition.
In other words, both x and its sliced version share the same memory. This is import-
ant when we apply operations which modify a given vector in place, such as the sort
method.
y = np.array([6, 4, 8, 5, 1, 3, 2, 9, 7])
y[::2] *= 10 # modifies parts of y in place
y # has changed
## array([60, 4, 80, 5, 10, 3, 20, 9, 70])
This multiplied every second element in y by 10 (i.e., [6, 8, 1, 2, 7]). On the other
hand, indexing with an integer or logical vector always returns a copy.
This did not modify the original vector, because we applied `*=` on a different object,
which has not even been memorised after that operation took place.
This gave, in this order: the first element, the sum of first two elements, the sum of
first three elements, …, the sum of all elements.
np.diff([5, 8, 4, 5, 6, 9])
## array([ 3, -4, 1, 1, 3])
returned the difference between the 2nd and 1st element, then the difference between
the 3rd and the 2nd, and so forth. The resulting vector is one element shorter than the
input one.
We often make use of cumulative sums and iterated differences when processing
time series, e.g., stock exchange data (e.g., by how much the price changed since the
previous day?; Section 16.3.1) or determining cumulative distribution functions (Sec-
tion 4.3.8).
5.5.2 Sorting
The numpy.sort function returns a sorted copy of a given vector, i.e., determines the
order statistics.
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.sort(x)
## array([10, 10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40, 40, 50, 70])
The sort method (as in: x.sort()), on the other hand, sorts the vector in place (and
returns nothing).
Exercise 5.16 Readers interested more in chaos than in bringing order should give numpy.
random.permutation a try. This function shuffles the elements in a given vector.
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.unique(x)
## array([10, 20, 30, 40, 50, 70])
This can be used to determine if all the values in a vector are unique:
np.all(np.unique(x, return_counts=True)[1] == 1)
## False
Exercise 5.17 Play with the return_index argument to numpy.unique that allows pinpoint-
ing the indexes of the first occurrences of each unique value.
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.argsort(x)
## array([ 1, 9, 10, 2, 6, 5, 12, 14, 0, 3, 4, 7, 13, 8, 11])
Which means that the smallest element is at index 1, then the 2nd smallest is at index
9, 3rd smallest at index 10, etc. Therefore:
x[np.argsort(x)]
## array([10, 10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40, 40, 50, 70])
is equivalent to numpy.sort(x).
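Next, scipy.stats.rankdata returns the rank of each element (1 for the smallest); a small, tie-free sketch (illustrative input):

scipy.stats.rankdata([40, 10, 20, 30])
## array([4., 1., 2., 3.])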
Element 10 is the smallest (“the winner”, say, the quickest racer). Hence, it ranks first.
Element 40 is the 4th on the podium. Thus, its rank is 4. And so on.
On a side note, there are many methods in nonparametric statistics (those that do
not make any too particular assumptions about the underlying data distribution) that
are based on ranks. In particular, in Section 9.1.4, we cover the Spearman correlation
coefficient.
Exercise 5.18 Consult the manual page of scipy.stats.rankdata and test various methods
for dealing with ties.
Note (**) Readers with some background in discrete mathematics will be inter-
ested in the fact that calling numpy.argsort on a vector representing a permutation
of elements in fact generates its inverse. In particular, np.argsort(np.argsort(x,
kind="stable"))+1 is equivalent to scipy.stats.rankdata(x, method="ordinal").
We write $i = \operatorname{arg\,min}_j x_j$ and read it as: let $i$ be the index of the smallest element in the sequence. Alternatively, it is the argument of the minimum: an index $i$ such that $x_i = \min_j x_j$.
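In numpy, the indexes of the smallest and the largest element can be fetched with numpy.argmin and numpy.argmax; a sketch (the vector is hypothetical, chosen consistently with the tied maxima discussed below):

x = np.array([5, 1, 9, 3, 7, 9])
np.argmin(x), np.argmax(x)
## (1, 2)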
We can use numpy.flatnonzero to fetch the indexes where a logical vector has elements
equal to True (in Section 11.1.2, we mention that a value equal to zero is treated as the
logical False, and as True in all other cases). For example:
np.flatnonzero(x == np.max(x))
## array([2, 5])
It is a version of numpy.argmax that lets us decide what we would like to do with the
tied maxima (there are two).
Exercise 5.19 Let x be a vector with possible ties. Write an expression that returns a randomly
chosen index pinpointing one of the sample maxima.
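When we centre the heights by subtracting their mean, the mean of the result should be exactly zero mathematically; a sketch of checking what we get numerically:

heights_centred = heights - np.mean(heights)
np.mean(heights_centred)   # mathematically 0; in practice, a tiny number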
which is almost zero (0.0000000000000134), but not exactly zero (it is zero for an en-
gineer, not a mathematician). We saw a similar result when performing standardisa-
tion (which involves centring) in Section 5.3.2.
Important All floating-point operations on a computer12 (not only in Python) are per-
formed with finite precision of 15–17 decimal digits.
12 Double precision float64 format as defined by the IEEE Standard for Floating-Point Arithmetic (IEEE
754).
When a comparison is needed, we need to take some error margin into account. Ideally, instead of testing x == y, we should inspect either the absolute error,
$$|x - y| \le \varepsilon,$$
or the relative error,
$$\frac{|x - y|}{|y|} \le \varepsilon,$$
where $\varepsilon$ is some small number. The numpy.allclose function tests the closeness of two values up to small absolute and relative tolerances; for example:
np.allclose(np.mean(heights_centred), 0)
## True
To avoid unwelcome surprises, even the testing of inequalities like x >= 0 should rather be performed as, say, x >= -1e-8.
Note Our data are often imprecise by nature. When asked about people’s heights,
rarely will they provide a non-integer answer (assuming they know how tall they are
and are not lying about it, but it is a different story). We will most likely get data roun-
ded to 0 decimal digits. In our dataset the precision is a bit higher:
heights[:6] # preview
## array([160.2, 152.7, 161.2, 157.4, 154.6, 144.7])
But still, there is an inherent observational error. Even if, for example, the mean thereof
was computed exactly, the fact that the inputs themselves are not necessarily ideal
makes the estimate approximate as well. We can only hope that these errors will more
or less cancel out in the computations.
Exercise 5.20 Compute the BMIs of all females in the NHANES study. Determine their arith-
metic mean. Compare it to the arithmetic mean computed for BMIs rounded to 1, 2, 3, 4, etc.,
decimal digits.
Note (*) Another problem is related to the fact that floats on a computer use the binary
base, not the decimal one. Therefore, some fractional numbers that we believe to be
representable exactly, require an infinite number of bits. As a consequence, they are
subject to rounding.
0.1 + 0.1 + 0.1 == 0.3 # obviously
## False
This is because 0.1, 0.1+0.1+0.1, and 0.3 are literally represented as, respectively:
print(f"{0.1:.19f}, {0.1+0.1+0.1:.19f}, and {0.3:.19f}.")
## 0.1000000000000000056, 0.3000000000000000444, and 0.2999999999999999889.
A good introductory reference to the topic of numerical inaccuracies is [41]; see also
[48, 56] for a more comprehensive treatment of numerical analysis.
If we wish to filter out all elements that are not greater than 0, we can write:
[ e for e in x if e > 0 ]
## [0.86, 0.14, 0.19, 0.93, 0.31, 0.5, 0.31]
We can also use the ternary operator of the form x_true if cond else x_false to
return either x_true or x_false depending on the truth value of cond.
e = -2
e**0.5 if e >= 0 else (-e)**0.5
## 1.4142135623730951
There is also a tool which vectorises a scalar function so that it can be used on numpy
vectors:
def clip01(x):
    """clip to the unit interval"""
    if x < 0: return 0
    elif x > 1: return 1
    else: return x
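A sketch of applying it (hypothetical input):

clip01_vectorised = np.vectorize(clip01)
clip01_vectorised(np.array([0.25, -2, 5]))
## array([0.25, 0.  , 1.  ])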
In the above cases, it is much better (faster, more readable code) to rely on vector-
ised numpy functions. Still, if the corresponding operations are unavailable (e.g., string
processing, reading many files), list comprehensions provide a reasonable replace-
ment therefor.
Exercise 5.21 Write equivalent versions of the above expressions using vectorised numpy func-
tions.
Exercise 5.22 Write equivalent versions of the above expressions using base Python lists, the
for loop and the list.append method (start from an empty list that will store the result).
5.6 Exercises
Exercise 5.23 What are some benefits of using a numpy vector over an ordinary Python list?
What are the drawbacks?
Exercise 5.24 How can we interpret the possibly different values of the arithmetic mean, me-
dian, standard deviation, interquartile range, and skewness, when comparing between heights
of men and women?
Exercise 5.25 There is something scientific and magical about numbers that make us ap-
proach them with some kind of respect. However, taking into account that there are many possible
data aggregates, there is a risk that a party may be cherry-picking – reporting the one that por-
trays the analysed entity in a good or bad light. For instance, reporting the mean instead of the
median or vice versa. Is there anything that can be done about it?
Exercise 5.26 Even though, mathematically speaking, all measures can be computed on all
data, it does not mean that it always makes sense to do so. For instance, some distributions will
have skewness of 0, but we should not automatically assume that they are delightfully symmetric
and bell-shaped (e.g., this can be a bimodal distribution). This is why we always need to visual-
ise our data. Give some examples of datasets and measures where we should be critical of the
obtained results.
Exercise 5.27 Give some examples where simple data preprocessing can drastically change the
values of chosen sample aggregates.
Exercise 5.28 Give the mathematical definitions, use cases, and interpretations of standard-
isation, normalisation, and min-max scaling.
Exercise 5.29 How are numpy.log and numpy.exp related to each other? How about numpy.
log vs numpy.log10, numpy.cumsum vs numpy.diff, numpy.min vs numpy.argmin, numpy.sort
vs numpy.argsort, and scipy.stats.rankdata vs numpy.argsort?
Exercise 5.30 What is the difference between numpy.trunc, numpy.floor, numpy.ceil, and
numpy.round?
Exercise 5.31 What happens when we apply `+` on two vectors of different lengths?
Exercise 5.32 List the four ways to index a vector.
Exercise 5.33 What is wrong with the expression x[ x >= 0 and x <= 1 ], where x is a
numeric vector? How about x[ x >= 0 & x <= 1 ]?
Exercise 5.34 What does it mean that slicing returns a view on existing data?
Exercise 5.35 (**) Reflect on the famous13 saying: not everything that can be counted
counts, and not everything that counts can be counted.
Exercise 5.36 (**) Being a data scientist can be a frustrating job, especially when you care for
some causes. Reflect on: some things that count can be counted, but we will not count
them, because there’s no budget for them.

13 https://quoteinvestigator.com/2010/05/26/everything-counts-einstein/
Exercise 5.37 (**) Being a data scientist can be a frustrating job, especially when you care for
the truth. Reflect on: some things that count can be counted, but we will not count them,
because some people might be offended or find it unpleasant.
Exercise 5.38 (**) Assume you were to establish your own nation on some island and become
the benevolent dictator thereof. How would you measure if your people are happy or not? Let us
say that you need to come up with 3 quantitative measures (key performance indicators). What
would happen if your policy-making was solely focused on optimising those KPIs? How about the
same problem but with regard to your company and employees? Think about what can go wrong
in other areas of life.
6
Continuous Probability Distributions
Each successful data analyst will deal with hundreds or thousands of datasets in their
lifetime. In the long run, at some level, most of them will be deemed boring. This is
because only a few common patterns will be occurring over and over again.
In particular, the previously mentioned bell-shapedness and right-skewness are quite
prevalent in the so-called real world. Surprisingly, however, this is exactly when things
become scientific and interesting – allowing us to study various phenomena at an ap-
propriate level of generality.
Mathematically, such idealised patterns in the histogram shapes can be formalised
using the notion of a probability density function (PDF) of a continuous, real-valued random
variable.
Intuitively1 , a PDF is a smooth curve that would arise if we drew a histogram for the
entire population (e.g., all women living currently on Earth and beyond or otherwise an
extremely large data sample obtained by independently querying the same underlying
data generating process) in such a way that the total area of all the bars is equal to 1
and the bin sizes are very small.
As stated at the beginning, we do not intend this to be a course in probability theory
and mathematical statistics. Rather, it precedes and motivates them (e.g., [21, 38, 40,
78]). Therefore, our definitions are out of necessity simplified so that they are digest-
ible. For the purpose of our illustrations, we will consider the following characterisa-
tion.
Some distributions appear more frequently than others and tend to fit empirical data or parts thereof particularly well; compare [27]. In this chapter, we review a few of the most noteworthy ones.
1 (*) This intuition is of course theoretically grounded and is based on the asymptotic behaviour of the
histograms as the estimators of the underlying probability density function, see, e.g., [28] and the many
references therein.
A normal distribution, denoted N(μ, σ), is parameterised by its expected value μ ∈ ℝ and standard deviation σ > 0. Its probability density function is given by:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
Figure 6.1: The probability density functions of some normal distributions N(μ, σ); note that μ is responsible for shifting and σ affects the scaling/stretching of the probability mass

It turns out that the sample mean $\bar{x}$ and standard deviation $s$ are natural, statistically well-behaved estimators of the said parameters: if all observations are really drawn independently from N(μ, σ) each, then we expect $\bar{x}$ and $s$ to be equal to, more or less, μ and σ (the larger the sample size, the smaller the estimation error).
Recall the heights (females from the NHANES study) dataset and its bell-shaped his-
togram in Figure 4.2.
heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
n = len(heights)
n
## 4221
mu = np.mean(heights)
sigma = np.std(heights, ddof=1)
mu, sigma
## (160.13679222932953, 7.062858532891359)
Mathematically, we will denote these two with 𝜇̂ and 𝜎̂ (mu and sigma with a hat) to
emphasise that they are merely guesstimates2 of the unknown respective parameters
𝜇 and 𝜎. On a side note, we use ddof=1, because in this context this estimator has
slightly better statistical properties.
Let us draw the fitted density function (i.e., the PDF of N(160.1, 7.06) which we can
compute using scipy.stats.norm.pdf), on top of the histogram; see Figure 6.2. We
pass stat="density" to seaborn.histplot so that the histogram bars are normalised
(i.e., the total area of these rectangles sums to 1).
At first glance, this is a genuinely nice match. Before proceeding with an overview of
the ways to assess the goodness-of-fit more rigorously, we should praise the potential
benefits of having an idealised model of our dataset at our disposal.
2 (*) It might be the case that we will have to obtain the estimates of the probability distribution’s para-
𝑛 (𝑥𝑖 −𝜇)2
meters by numerical optimisation, for example, by minimising ℒ(𝜇, 𝜎) = ∑𝑖=1 ( + log 𝜎 2 )
𝜎2
with respect to 𝜇 and 𝜎 (corresponding to the objective function in the maximum likelihood estimation
problem for the normal distribution family). In our case, however, we are lucky; there exist open-form for-
mulae expressing the solution to the above, exactly in the form of the sample mean and standard deviation.
For other distributions, things can get a little trickier, though. Furthermore, sometimes we will have many
options for point estimators to choose from, which might be more suitable if data are not of top quality
(e.g., contain outliers). For instance, in the normal model, it can be shown that we can also estimate 𝜇 and
𝜎 via the sample median and IQR/1.349.
Figure 6.2: A histogram and the probability density function of the fitted normal dis-
tribution for the heights dataset
Furthermore, there are many statistical methods that could additionally be used if we assumed data normality, e.g., the t-test to compare the expected values.
Exercise 6.1 How different manufacturing industries (e.g., clothing) can make use of such
models? Are simplifications necessary when dealing with complexity? What are the alternatives?
Important We should always verify the assumptions of a model that we wish to apply
in practice. In particular, we will soon note that incomes are not normally distributed.
Therefore, we must not refer to the above 2σ or 3σ rule in their case. A cow neither barks
nor can it serve as a screwdriver. Period.
For the normal distribution family, the values of the theoretical CDF can be computed
by calling scipy.stats.norm.cdf; see Figure 6.3.
3 The probability distribution of any real-valued random variable 𝑋 can be uniquely defined by means
of a nondecreasing, right (upward) continuous function 𝐹 ∶ ℝ → [0, 1] such that lim𝑥→−∞ 𝐹(𝑥) = 0
and lim𝑥→∞ 𝐹(𝑥) = 1, in which case Pr(𝑋 ≤ 𝑥) = 𝐹(𝑥). The probability density function only exists for
continuous random variables and is defined as the derivative of 𝐹.
Figure 6.3: The empirical CDF and the fitted normal CDF for the heights dataset; the
fit is superb
Indeed, almost all observations are within [𝜇 − 3𝜎, 𝜇 + 3𝜎], if data are normally distributed.
Note A common way to summarise the discrepancy between the empirical and a given theoretical CDF is by computing the greatest absolute deviation:
$$\hat{D}_n = \sup_{t \in \mathbb{R}} \left| \hat{F}_n(t) - F(t) \right|.$$
It holds:
$$\hat{D}_n = \max_{k=1,\dots,n} \max\left\{ \left| \tfrac{k-1}{n} - F(x_{(k)}) \right|,\ \left| \tfrac{k}{n} - F(x_{(k)}) \right| \right\},$$
i.e., 𝐹 needs to be probed only at the 𝑛 points from the sorted input sample.
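A sketch of a function computing this quantity, together with the CDF of the fitted normal model used below:

def compute_Dn(x, F):   # sup_t |F_n(t) - F(t)| (a sketch)
    n = len(x)
    Fx = F(np.sort(x))                  # F evaluated at the order statistics
    k = np.arange(1, n+1)               # F_n jumps from (k-1)/n to k/n at x_(k)
    return np.max(np.maximum(np.abs(k/n - Fx), np.abs((k-1)/n - Fx)))

F = lambda x: scipy.stats.norm.cdf(x, mu, sigma)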
Dn = compute_Dn(heights, F)
Dn
## 0.010470976524201148
If the difference is sufficiently small (what this means is discussed further in Section 6.2.3), then we can assume that a normal model describes the data quite well. This is indeed the case here: we may estimate the probability of someone being as tall as any given height with an error less than about 1.05%.
Another useful tool is the quantile function, $Q$, defined as $Q(p) = \inf\{x \in \mathbb{R} : F(x) \ge p\}$, i.e., the smallest $x$ such that the probability of drawing a value not greater than $x$ is at least $p$.
Important If a CDF 𝐹 is continuous, and this is the assumption in the current chapter,
then 𝑄 is exactly its inverse, i.e., it holds 𝑄(𝑝) = 𝐹−1 (𝑝) for all 𝑝 ∈ (0, 1); compare
Figure 6.4.
Figure 6.4: The cumulative distribution functions (left) and the quantile functions (be-
ing the inverse of the CDF; right) of some normal distributions
For instance, in our N(160.1, 7.06)-distributed heights dataset, 𝑄(0.9) is the height
not exceeded by 90% of the female population. In other words, only 10% of American
women are taller than:
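A sketch:

scipy.stats.norm.ppf(0.9, mu, sigma)   # roughly 169 cm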
5 (*) scipy.stats.probplot uses a slightly different definition (there are many other ones in common
use).
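A Q-Q (quantile-quantile) plot draws the sample quantiles against the corresponding theoretical quantiles. The helper below is a sketch consistent with how it is called later on (the choice of quantile orders is only one of many in common use):

def qq_plot(x, Q):
    """draw a Q-Q plot of sample x against a theoretical quantile function Q (a sketch)"""
    n = len(x)
    p = (np.arange(1, n+1) - 0.5) / n      # quantile orders (one possible choice)
    q_theor = Q(p)
    q_sample = np.quantile(x, p)
    plt.plot(q_theor, q_sample, "o", color="lightgray", markeredgecolor="black")
    plt.axline((q_theor[n//2], q_theor[n//2]), slope=1,
        linestyle=":", color="gray")       # the reference line y = x

qq_plot(heights, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.ylabel("Sample quantiles")
plt.show()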
Figure 6.5 depicts the Q-Q plot for our example dataset.
Figure 6.5: Q-Q plot for the heights dataset; it’s a nice fit
Ideally, the points are expected to be arranged on the 𝑦 = 𝑥 line (which was added
for readability). This would happen if the sample quantiles matched the theoretical
ones perfectly. In our case, there are small discrepancies6 in the tails (e.g., the smal-
lest observation was slightly smaller than expected, and the largest one was larger than
expected), although it is quite a normal behaviour for small samples and certain distribution families. Still, we can say that we observe a very good fit.

6 (*) We can quantify (informally) the goodness of fit by using the Pearson linear correlation coefficient.
The popular goodness-of-fit test by Kolmogorov and Smirnov can give us a conser-
vative interval of the acceptable values of 𝐷̂ 𝑛 (again: the largest deviation between the
empirical and theoretical CDF) as a function of 𝑛 (within the framework of frequentist
hypothesis testing).
Namely, if the test statistic 𝐷̂ 𝑛 is smaller than some critical value 𝐾𝑛 , then we shall deem
the difference insignificant. This is to take into account the fact that reality might devi-
ate from the ideal. In Section 6.4.4, we mention that even for samples that truly come
from a hypothesised distribution, there is some inherent variability. We need to be
somewhat tolerant.
A good textbook in statistics will tell us (and prove) that, under the assumption that 𝐹𝑛̂
is the ECDF of a sample of 𝑛 independent variables really generated from a continu-
ous CDF 𝐹, the random variable 𝐷̂ 𝑛 = sup𝑡∈ℝ |𝐹𝑛̂ (𝑡) − 𝐹(𝑡)| follows the Kolmogorov
distribution with parameter 𝑛 (available via scipy.stats.kstwo).
In other words, if we generate many samples of length 𝑛 from 𝐹, and compute 𝐷̂ 𝑛 s for
each of them, we expect it to be distributed like in Figure 6.6.
The choice 𝐾𝑛 involves a trade-off between our desire to:
• accept the null hypothesis when it is true (data really come from 𝐹), and
• reject it when it is false (data follow some other distribution, i.e., the difference is
significant enough).
These two needs are, unfortunately, mutually exclusive.
In practice, we assume some fixed upper bound (significance level) for making the
former kind of mistake, the so-called type-I error. A nicely conservative (in a good way7 )
value that we suggest employing is 𝛼 = 0.001 = 0.1%, i.e., only 1 out of 1,000 samples
that really come from 𝐹 will be rejected as not coming from 𝐹.
Such a $K_n$ may be determined by considering the inverse of the CDF of the Kolmogorov distribution, $\Xi_n$. Namely, $K_n = \Xi_n^{-1}(1 - \alpha)$:
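A sketch:

alpha = 0.001                          # the assumed significance level
scipy.stats.kstwo.ppf(1-alpha, n)      # the critical value K_n for n = 4221 (roughly 0.03)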
Figure 6.6: Densities (left) and cumulative distribution functions (right) of some
Kolmogorov distributions; the greater the sample size, the smaller the acceptable de-
viations between the theoretical and empirical CDFs
In our case 𝐷̂ 𝑛 < 𝐾𝑛 , because 0.01047 < 0.02996. We conclude that our empirical
(heights) distribution does not differ significantly (at significance level 0.1%) from
the assumed one, i.e., N(160.1, 7.06). In other words, we do not have enough evidence
against the statement that data are normally distributed. It is the presumption of in-
nocence: they are normal enough.
We will go back to this discussion in Section 6.4.4 and Section 12.2.6.
It is often claimed that incomes tend to be distributed, at least approximately, log-normally (the Pareto distribution is a related heavy-tailed model). Let us investigate whether this is the case for the UK taxpayers.
income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/uk_income_simulated_2020.txt")
The plotting of the histogram of the logarithm of income is left as an exercise (we can
pass log_scale=True to seaborn.histplot; we will plot it soon anyway in a different
way). We proceed directly with the fitting of a log-normal model, LN(μ, σ). The fit-
ting process is similar to the normal case, but this time we determine the mean and
standard deviation based on the logarithms of data:
lmu = np.mean(np.log(income))
lsigma = np.std(np.log(income), ddof=1)
lmu, lsigma
## (10.314409794364623, 0.5816585197803816)
Figure 6.7 depicts the fitted probability density function together with the histograms
on the log- and original scale. When creating this plot, there are two pitfalls, though.
Firstly, scipy.stats.lognorm encodes the distribution via the parameter 𝑠 equal to 𝜎
and scale equal to 𝑒^𝜇 . Secondly, the histogram on the log-scale needs explicitly provided
(log-spaced) bin edges; compare the "own bins!" remark below. Computing the PDF at
different points can be done as follows:
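The objects required by the plotting code below can be prepared, e.g., like this (a sketch:
the variable names b, x, and fx match their later use, but the exact numbers of bins and
grid points are our assumption):
b = np.geomspace(np.min(income), np.max(income), 31)    # log-spaced bin edges
x = np.geomspace(np.min(income), np.max(income), 101)   # grid on which to evaluate the PDF
fx = scipy.stats.lognorm.pdf(x, s=lsigma, scale=np.exp(lmu))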
And now:
plt.subplot(1, 2, 1)
sns.histplot(income, stat="density", bins=b, color="lightgray") # own bins!
plt.xscale("log") # log-scale on the x-axis
plt.plot(x, fx, "r--")
plt.subplot(1, 2, 2)
sns.histplot(income, stat="density", color="lightgray")
plt.plot(x, fx, "r--", label=f"PDF of LN({lmu:.1f}, {lsigma:.2f})")
plt.legend()
plt.show()
Overall, this fit is not too bad. Nonetheless, we are only dealing with a sample of 1,000
Figure 6.7: A histogram and the probability density function of the fitted log-normal
distribution for the income dataset, on log- (left) and original (right) scale
households; the original UK Office of National Statistics data9 could tell us more about
the quality of this model in general, but it is beyond the scope of our simple exercise.
Furthermore, Figure 6.8 gives the quantile-quantile plot on a double logarithmic scale
for the above log-normal model. Additionally, we (empirically) verify the hypothesis of
normality (using a “normal” normal distribution, not its “log” version).
plt.subplot(1, 2, 1)
qq_plot( # see above for the definition
income,
lambda q: scipy.stats.lognorm.ppf(q, s=lsigma, scale=np.exp(lmu))
)
plt.xlabel(f"Quantiles of LN({lmu:.1f}, {lsigma:.2f})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")
plt.subplot(1, 2, 2)
mu = np.mean(income)
sigma = np.std(income, ddof=1)
qq_plot(income, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.show()
9 https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/bulletins/householddisposableincomeandinequality/financialyear2020
Figure 6.8: Q-Q plots for the income dataset vs a fitted log-normal (good fit; left) and
normal (bad fit; right) distribution
Exercise 6.3 Graphically compare the empirical CDF for income and the theoretical CDF of
LN(10.3, 0.58).
Exercise 6.4 (*) Perform the Kolmogorov–Smirnov goodness-of-fit test as in Section 6.2.3, to
verify that the hypothesis of log-normality is not rejected at the 𝛼 = 0.001 significance level. At
the same time, the income distribution significantly differs from a normal one.
The hypothesis that our data follow a normal distribution is most likely false. On the
other hand, the log-normal model might be quite adequate. It again reduces the
whole dataset to merely two numbers, μ and σ, based on which (and probability the-
ory), we may deduce that:
• the expected (mean) income is 𝑒^(𝜇+𝜎²/2),
• the median income is 𝑒^𝜇 ,
• the most probable income (mode) is 𝑒^(𝜇−𝜎²),
and so forth.
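For instance, these quantities can be computed from the fitted parameters and compared
with their empirical counterparts (a quick sketch; the printed values depend on lmu and
lsigma obtained above):
np.exp(lmu + lsigma**2/2), np.exp(lmu), np.exp(lmu - lsigma**2)  # mean, median, mode
np.mean(income), np.median(income)                               # empirical mean and median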
Note Recall again that for skewed distributions such as this one, reporting the mean
might be misleading. This is why most people get angry when they read the news about
the prospering economy (“yeah, we’d like to see that kind of money in our pockets”).
Hence, it is not only μ that matters, it is also σ that quantifies the discrepancy between
the rich and the poor (too much inequality is bad, but also too much uniformity is to
be avoided).
For a normal distribution, the situation is vastly different: the mean, the median, and
the most probable outcome (mode) are all equal – the distribution is symmetric
around μ.
Exercise 6.5 What is the fraction of people with earnings below the mean in our LN(10.3, 0.58)
model? Hint: use scipy.stats.lognorm.cdf to get the answer.
cities = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/us_cities_2000.txt")
len(cities), sum(cities) # number of cities, total population
## (19447, 175062893.0)
Figure 6.9 gives the histogram of the city sizes with the populations on the log-scale. It
kind of looks like a log-normal distribution again, which the reader can inspect
themselves when they are feeling playful.
Figure 6.9: Histogram of the unabridged cities dataset; note the log-scale on the x-axis
This time, however, we will be interested not in what is typical, but in what is in some sense
anomalous or extreme. Let us look again at the truncated version of the city size distribu-
tion by considering the cities with 10,000 or more inhabitants (i.e., we will only study
the right tail of the original data, just like in Section 4.3.7).
s = 10_000
large_cities = cities[cities >= s]
len(large_cities), sum(large_cities) # number of cities, total population
## (2696, 146199374.0)
A Pareto distribution with shape α > 0 and scale s > 0 has the probability density function
given for x ≥ s by:
𝑓(𝑥) = α s^α / x^(α+1),
and 𝑓(𝑥) = 0 otherwise.
𝑠 is usually taken as the sample minimum (i.e., 10,000 in our case). 𝛼 can be estimated
through the reciprocal of the mean of the scaled logarithms of our observations:
alpha = 1/np.mean(np.log(large_cities/s))
alpha
## 0.9496171695997675
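A plot like the one in Figure 6.10 can be obtained, for instance, as follows (a sketch; the
bin and grid choices below are arbitrary):
b = np.geomspace(s, np.max(large_cities), 31)   # log-spaced bins
sns.histplot(large_cities, bins=b, stat="density", color="lightgray")
x = np.geomspace(s, np.max(large_cities), 101)
plt.plot(x, scipy.stats.pareto.pdf(x, alpha, scale=s), "r--",
    label=f"fitted Pareto PDF (alpha={alpha:.3f})")
plt.legend()
plt.xscale("log")
plt.yscale("log")
plt.show()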
Figure 6.10: Histogram of the large_cities dataset and the fitted density on a double
log-scale
Figure 6.11 gives the corresponding Q-Q plot on a double logarithmic scale.
We see that the populations of the largest cities are overestimated. The model could
be better, but the cities are still growing, right?
Example 6.6 (*) It might also be interesting to see how well we can predict the probability of a
randomly selected city being at least a given size. Let us denote with 𝑆(𝑥) = 1 − 𝐹(𝑥) the com-
plementary cumulative distribution function (CCDF; sometimes referred to as the survival
function), and with 𝑆𝑛̂ (𝑥) = 1 − 𝐹𝑛̂ (𝑥) its empirical version. Figure 6.12 compares the empir-
ical and the fitted CCDFs with probabilities on the linear- and log-scale.
Figure 6.11: Q-Q plot for the large_cities dataset vs the fitted Paretian model
In terms of the maximal absolute distance between the two functions, 𝐷̂ 𝑛 , from the left plot we see
that the fit looks fairly good (let us stress that the log-scale overemphasises the relatively minor
differences in the right tail and should not be used for judging the value of 𝐷̂ 𝑛 ).
That the Kolmogorov–Smirnov goodness-of-fit test rejects the hypothesis of Paretianity (at a sig-
nificance level 0.1%) is left as an exercise to the reader.
Figure 6.12: The empirical and fitted CCDFs for the large_cities dataset, with probabilities on the linear- (left) and log-scale (right); the x-axis is on the log-scale
lotto = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/lotto_table.txt")
lotto
## array([720., 720., 714., 752., 719., 753., 701., 692., 716., 694., 716.,
## 668., 749., 713., 723., 693., 777., 747., 728., 734., 762., 729.,
## 695., 761., 735., 719., 754., 741., 750., 701., 744., 729., 716.,
## 768., 715., 735., 725., 741., 697., 713., 711., 744., 652., 683.,
## 744., 714., 674., 654., 681.])
Each event seems to occur more or less with the same probability. Of course, the num-
bers on the balls are integer, but in our idealised scenario, we may try modelling this
dataset using a continuous uniform distribution, which yields arbitrary real numbers on
a given interval (a, b), i.e., between some a and b. We denote such a distribution with
U(a, b). It has the probability density function given for 𝑥 ∈ (𝑎, 𝑏) by:
𝑓(𝑥) = 1/(𝑏 − 𝑎),
and 𝑓(𝑥) = 0 otherwise.
Notice that scipy.stats.uniform uses parameters a and scale equal to 𝑏 − 𝑎 instead.
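For example (a small illustration; the probed points are arbitrary):
x = np.array([0.0, 1.0, 25.0, 50.0, 51.0])
scipy.stats.uniform.pdf(x, loc=1, scale=49)  # equals 1/49 inside [1, 50] and 0 outside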
In our case, it makes sense to set 𝑎 = 1 and 𝑏 = 50 and interpret an outcome like
49.1253 as representing the 49th ball (compare the notion of the floor function, ⌊𝑥⌋).
Figure 6.13: The empirical distribution of the lotto dataset vs the probability density function of U(1, 50)
Visually (see Figure 6.13), this model makes much sense. But again, some more rigorous
statistical testing would be required to determine whether someone has been tampering
with the lottery results, i.e., whether the data deviate from the uniform distribution
significantly.
Unfortunately, we cannot use the Kolmogorov–Smirnov test in the version defined
above, because data are not continuous. See, however, Section 11.4.3 for the Pearson
chi-squared test that is applicable here.
Exercise 6.7 Does playing lotteries and engaging in gambling make rational sense at all, from
the perspective of an individual player? Well, we see that 16 is the most frequently occurring out-
come in Lotto, maybe there's some magic in it? Also, some people do become millionaires from
time to time, right?
The uniform distribution is frequently chosen as a placeholder for "we know nothing about
a phenomenon, so let us just assume that every event is equally likely". Nonetheless, it is
quite fascinating that the real world tends to be structured after all. Emerging patterns
are plentiful, and most often they are far from being uniformly distributed. Even more
strikingly, they are subject to quantitative analysis.
peds = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/southern_cross_station_peds_2019_dec.txt")
It might not be a bad idea to try to fit a probabilistic (convex) combination of three
normal distributions 𝑓1 , 𝑓2 , 𝑓3 , corresponding to the morning, lunch-time, and even-
ing pedestrian count peaks. This yields the PDF:
𝑓(𝑥) = 𝑤1 𝑓1 (𝑥) + 𝑤2 𝑓2 (𝑥) + 𝑤3 𝑓3 (𝑥),
for some weights 𝑤1 , 𝑤2 , 𝑤3 ≥ 0 such that 𝑤1 + 𝑤2 + 𝑤3 = 1.
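Such a guesstimated mixture can be evaluated, e.g., as follows (a sketch only: the weights
and the component parameters below are illustrative guesses, not the values underlying
Figure 6.14):
x = np.linspace(0, 24, 101)
w  = [0.35, 0.25, 0.40]     # mixture weights (must sum to 1); guesses
mu = [8.5, 12.5, 17.0]      # morning, lunch-time, and evening peaks; guesses
sd = [1.0, 1.5, 1.5]        # component standard deviations; guesses
fx = sum(wi*scipy.stats.norm.pdf(x, mi, si) for wi, mi, si in zip(w, mu, sd))
plt.plot(x, fx, "r--")
plt.show()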
Figure 6.14: Histogram of the peds dataset and a guesstimated mixture of three normal
distributions
Important It will frequently be the case in data wrangling that more complex entities
(models, methods) arise as combinations of simpler (primitive) components.
This is why we should spend a great deal of time studying the fundamentals.
Note Some data clustering techniques (in particular, the k-means algorithm that we
briefly discuss later in this course) could be used to split a data sample into disjoint
chunks corresponding to different mixture components.
Also, it might be the case that the mixture components can in fact be explained by
another categorical variable that divides the dataset into natural groups; compare
Chapter 12.
np.random.seed(123)  # set the seed first; see the discussion below
np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
gives five observations sampled independently from the uniform distribution on the
unit interval, i.e., U(0, 1).
The same with scipy, but this time the support will be (-10, 15).
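A possible call (a sketch; loc gives the left endpoint and scale the interval's width):
scipy.stats.uniform.rvs(size=5, loc=-10, scale=25, random_state=123)  # support (-10, 15)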
Alternatively, we could do that ourselves by shifting and scaling the output of the
random number generator on the unit interval using the formula numpy.random.
rand(5)*25-10.
Then, we set the seed once again via the same number and see how “random” the next
values are:
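For example (a minimal illustration):
np.random.seed(123)  # reset the generator's state
np.random.rand(5)    # exactly the same values as before will be generated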
Note If we do not set the seed manually, it will be initialised based on the current wall
time, which is different every… time. As a result, the numbers will seem random to us.
Many Python packages that we will be using in the future, including pandas and
sklearn, rely on numpy's random number generator. We will become used to calling
numpy.random.seed to make them predictable.
The same values can be obtained with scipy by passing the random_state argument explicitly:
scipy.stats.uniform.rvs(size=5, random_state=123)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
Pseudorandom deviates from the standard normal distribution, i.e., N(0, 1), can also
be generated using numpy.random.randn. As N(100, 16) is a scaled and shifted version
thereof, a sample from the latter can be obtained as follows:
np.random.seed(50489)
np.random.randn(3)*16 + 100
## array([113.41134015, 46.99328545, 157.1304154 ])
Important Conclusions based on simulated data are trustworthy, because they can-
not be manipulated. Or can they?
The pseudorandom number generator’s seed used above, 50489, is quite suspicious. It
might suggest that someone wanted to prove some point (in this case, the violation of
the 3σ rule).
This is why we recommend sticking to only one seed most of the time, e.g., 123, or –
when performing simulations – setting consecutive seeds for each iteration: 1, 2, ….
Exercise 6.8 Generate 1,000 pseudorandom numbers from the log-normal distribution and
draw a histogram thereof.
Note (*) Having a good pseudorandom number generator from the uniform distribu-
tion on the unit interval is crucial, because sampling from other distributions usually
involves transforming independent U(0, 1) variates.
For instance, realisations of random variables following any continuous cumulative
distribution function 𝐹 can be constructed through the inverse transform sampling (see
[37, 77]):
1. Generate a sample 𝑥1 , … , 𝑥𝑛 independently from U(0, 1).
2. Transform each 𝑥𝑖 by applying the quantile function, 𝑦𝑖 = 𝐹−1 (𝑥𝑖 ).
Now 𝑦1 , … , 𝑦𝑛 follow the CDF 𝐹.
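As an illustration, here is a minimal sketch that generates a sample from the standard
normal distribution this way (the sample size and the seed are arbitrary):
np.random.seed(123)
u = np.random.rand(1000)      # step 1: a sample from U(0, 1)
y = scipy.stats.norm.ppf(u)   # step 2: apply the quantile function of N(0, 1)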
Exercise 6.9 (*) Generate 1,000 pseudorandom numbers from the log-normal distribution us-
ing inverse transform sampling.
Exercise 6.10 (**) Generate 1,000 pseudorandom numbers from the distribution mixture dis-
cussed in Section 6.3.4.
There is some ruggedness in the bar sizes that a naïve observer might try to interpret as
something meaningful. A competent data scientist must train their eye to ignore such
impurities (but should always be ready to detect those which are worth attention). In
this case, they are only due to random effects.
Exercise 6.11 Repeat the above experiment for samples of sizes 10, 1,000, and 10,000.
Example 6.12 (*) Using a simple Monte Carlo simulation, we can verify (approximately) that
the Kolmogorov–Smirnov goodness-of-fit test introduced in Section 6.2.3 has been calibrated
properly, i.e., that for samples that really follow the assumed distribution, the null hypothesis
is rejected only in ca. 0.1% of the cases.
Let us say we are interested in the null hypothesis referencing the standard normal distribution,
N(0, 1), and sample size 𝑛 = 100. We need to generate many (we assume 10,000 below) such
samples, for each of which we compute and store the maximal absolute deviation from the theor-
etical CDF, i.e., 𝐷̂ 𝑛 .
10 Compare the Fundamental Theorem of Statistics (the Glivenko–Cantelli theorem).
n = 100
distrib = scipy.stats.norm(0, 1) # assumed distribution - N(0, 1)
Dns = []
for i in range(10000): # increase this for better precision
x = distrib.rvs(size=n, random_state=i+1) # really follows distrib
Dns.append(compute_Dn(x, distrib.cdf))
Dns = np.array(Dns)
Now let us compute the proportion of cases which lead to 𝐷̂ 𝑛 greater than the critical value 𝐾𝑛 :
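A possible way to do so (a sketch; alpha = 0.001 as before):
alpha = 0.001
K_n = scipy.stats.kstwo.ppf(1-alpha, n)  # the critical value
np.mean(Dns > K_n)                       # proportion of samples (wrongly) rejected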
In theory, this should be equal to 0.001. But our values are necessarily approximate, because we
rely on randomness. Increasing the number of trials from 10,000 to, say, 1,000,000 will make
the above estimate more precise.
It is also worth checking out that the density histogram of Dns resembles the Kolmogorov distri-
bution that we can compute via scipy.stats.kstwo.pdf.
Exercise 6.13 (*) It might also be interesting to check out the test's power, i.e., the probabil-
ity that when the null hypothesis is false, it will actually be rejected. Modify the above code in
such a way that x in the for loop is not generated from N(0, 1), but N(0.1, 1), N(0.2, 1), etc., and
check the proportion of cases where we deem the sample distribution different from N(0, 1). Small
differences in the location parameter 𝜇 are usually hard to detect, but the power improves as the
sample size 𝑛 grows.
Adding noise also might be performed for aesthetic reasons, e.g., when drawing scat-
terplots.
For a more comprehensive introduction to exploratory data analysis, see the classical
books by Tukey [86, 87] and Tufte [85].
We took the logarithm of the log-normally distributed incomes and obtained a nor-
mally distributed sample. In statistical practice, it is not rare to apply different non-
linear transforms of the input vectors at the data preprocessing stage (see, e.g., Sec-
tion 9.2.6). In particular, the Box–Cox (power) transform [10] is of the form
𝑥 ↦ (𝑥^𝜆 − 1)/𝜆 for some 𝜆. Interestingly, in the limit as 𝜆 → 0, this formula yields
𝑥 ↦ log 𝑥, which is exactly what we were applying in this chapter.
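For the record, scipy implements this transform as well; calling it with λ = 0 is equivalent
to taking the logarithm (a small illustration):
scipy.stats.boxcox(income, lmbda=0.0)  # the same as np.log(income)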
[14, 67] give a nice overview of the power-law-like behaviour of some “rich” or oth-
erwise extreme datasets. It is worth noting that the logarithm of a Paretian sample
divided by the minimum follows an exponential distribution (which we discuss in
Chapter 16). For a comprehensive catalogue of statistical distributions, their proper-
ties, and relationships between them, see [27].
6.6 Exercises
Exercise 6.14 Why can the notion of the mean income be confusing to the general public?
Exercise 6.15 When does manually setting the seed of a random number generator make sense?
Exercise 6.16 Given a log-normally distributed sample x, how can we turn it to a normally dis-
tributed one, i.e., y=f(x), with f being… what?
Exercise 6.17 What is the 3σ rule for normally distributed data?
Exercise 6.18 (*) How can we verify graphically if a sample follows a hypothesised theoretical
distribution?
Exercise 6.19 (*) Explain the meaning of type I error, significance level, and a test’s power.
Part III
Multidimensional Data
7
Multidimensional Numeric Data at a Glance
Important Just like vectors, matrices were designed to store data of the same type. In
Chapter 10, we will cover data frames, which further increase the degree of complexity
(and freedom) by not only allowing for mixed data types (e.g., numerical and categor-
ical; this will enable us to perform data analysis in subgroups more easily) but also for
the rows and columns to be named.
Many data analysis algorithms convert data frames to matrices automatically and
deal with them as such. From the computational side, it is numpy that does most of
the “mathematical” work. pandas implements many recipes for basic data wrangling
tasks, but we want to go way beyond that. After all, we would like to be able to tackle
any problem.
1 Assuming we solved all the suggested exercises, which we did, didn’t we? See Rule #3.
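Let us load the dataset studied in this chapter (the very same file is read in exactly this
way at the beginning of Chapter 9):
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
    comment="#")
body = np.array(body)  # convert the data frame to a matrix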
Notice that we converted the data frame to a matrix by calling the numpy.array func-
tion. Here is a preview of the first few rows:
body[:6, :] # six first rows, all columns
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
This is an extended version of the National Health and Nutrition Examination Survey
(NHANES2 ), where the consecutive columns give the following body measurements
of adult females:
body_columns = np.array([
"weight (kg)",
"standing height (cm)",
"upper arm length (cm)",
"upper leg length (cm)",
"arm circumference (cm)",
"hip circumference (cm)",
"waist circumference (cm)"
])
numpy matrices do not support column naming. This is why we noted them down sep-
arately. It is only a minor inconvenience. pandas data frames will have this capability,
but from the algebraic side, they are not as convenient as matrices for the purpose of
scientific computing.
2 https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx
body.shape
## (4221, 7)
The above gave the total number of rows and columns, respectively.
For example, np.array([ [1], [2], [3] ]) yields a 3-by-1 matrix (we call it a column
vector, but it is a special matrix; we will soon learn that shapes can make a significant
difference), and
np.array([ [1, 2, 3, 4] ])
## array([[1, 2, 3, 4]])
gives a 1-by-4 one (a row vector).
Note An ordinary vector (a unidimensional array) only uses a single pair of square
brackets:
np.array([1, 2, 3, 4])
## array([1, 2, 3, 4])
repeats a row vector rowwisely (i.e., over axis 0 – the first one). Replicating a column
vector columnwisely (i.e., over axis 1 – the second one) is possible as well.
Exercise 7.2 Using numpy.tile and numpy.repeat, generate the following matrices:
⎡ 1 2 ⎤
⎢ 1 2 ⎥
⎢ 1 2 ⎥     ⎡ 1 2 1 2 1 2 ⎤     ⎡ 1 1 1 2 2 2 ⎤
⎢ 3 4 ⎥,    ⎣ 1 2 1 2 1 2 ⎦,    ⎣ 3 3 3 4 4 4 ⎦.
⎢ 3 4 ⎥
⎣ 3 4 ⎦
Exercise 7.3 Using numpy.insert, add a new row/column at the beginning, end, and in the
middle of an array. Let us stress that this function returns a new array.
np.random.seed(123)
np.random.rand(2, 5) # not: rand((2, 5))
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
## [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])
The way we specify the output shapes might differ across functions and packages.
Consequently, as usual, it is always best to refer to their documentation.
Exercise 7.4 Check out the documentation of the following functions: numpy.eye, numpy.diag,
numpy.zeros, numpy.ones, and numpy.empty.
A = np.array([
[ 1, 2, 3, 4 ],
[ 5, 6, 7, 8 ],
[ 9, 10, 11, 12 ]
])
Internally, a matrix is represented using a long flat vector where elements are stored
in the row-major3 order:
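For instance, it can be revealed by flattening the matrix (a sketch; ravel returns the
flattened form):
A.ravel()  # the underlying flat sequence of elements (row-major order)
## array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])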
It is the shape slot that is causing the 12 elements to be treated as if they were arranged
on a 3-by-4 grid, for example in different algebraic computations and during the print-
ing thereof. This arrangement can be altered anytime without modifying the under-
lying array:
A.shape = (4, 3)
A
## array([[ 1, 2, 3],
## [ 4, 5, 6],
## [ 7, 8, 9],
## [10, 11, 12]])
For convenience, there is also the reshape method that returns a modified version of
the object it is applied on:
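The call described below might look like this (a sketch):
A.reshape(-1, 6)  # six columns; the number of rows is deduced automatically
## array([[ 1,  2,  3,  4,  5,  6],
##        [ 7,  8,  9, 10, 11, 12]])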
Here, “-1” means that numpy must deduce by itself how many rows we want in the res-
ult. Twelve elements are supposed to be arranged in six columns, so the maths behind
it is not rocket science.
Thanks to this, generating row or column vectors is straightforward:
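For example (a small illustration):
np.array([1, 2, 3, 4]).reshape(1, -1)  # a row vector (1-by-4 matrix)
## array([[1, 2, 3, 4]])
np.array([1, 2, 3, 4]).reshape(-1, 1)  # a column vector (4-by-1 matrix)
## array([[1],
##        [2],
##        [3],
##        [4]])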
3 (*) Sometimes referred to as a C-style array, as opposed to Fortran-style which is used in, e.g., R.
Reshaping is not the same as matrix transpose, which also changes the order of ele-
ments in the underlying array:
A # before
## array([[ 1, 2, 3],
## [ 4, 5, 6],
## [ 7, 8, 9],
## [10, 11, 12]])
A.T # transpose of A
## array([[ 1, 4, 7, 10],
## [ 2, 5, 8, 11],
## [ 3, 6, 9, 12]])
Arrays of higher dimensionality are also possible. For example:
np.arange(24).reshape(2, 4, 3)
## array([[[ 0, 1, 2],
## [ 3, 4, 5],
## [ 6, 7, 8],
## [ 9, 10, 11]],
##
## [[12, 13, 14],
## [15, 16, 17],
## [18, 19, 20],
## [21, 22, 23]]])
is an array of “depth” 2, “height” 4, and “width” 3; we can see it as two 4-by-3 matrices
stacked together. Theoretically, they can be used for representing contingency tables
for products of many factors. Still, in our application areas, we prefer to stick with long
data frames instead; see Section 10.6.2. This is due to their more aesthetic display and
better handling of sparse data.
Mathematically, we will be denoting a matrix with 𝑛 rows and 𝑚 columns, 𝐗 ∈ ℝ𝑛×𝑚 , by:
      ⎡ 𝑥1,1   𝑥1,2   ⋯   𝑥1,𝑚 ⎤
𝐗 =  ⎢ 𝑥2,1   𝑥2,2   ⋯   𝑥2,𝑚 ⎥ .
      ⎢  ⋮      ⋮     ⋱    ⋮   ⎥
      ⎣ 𝑥𝑛,1   𝑥𝑛,2   ⋯   𝑥𝑛,𝑚 ⎦
(*) For data with many zeroes (sparse data), there exist efficient data structures for handling them;
see scipy.sparse for more details, as this goes beyond the scope of our introductory course.
• graphs (networks), where 𝑥𝑘,𝑙 can denote, e.g., the presence of a strong or
weak connection between 𝑘 and 𝑙 (e.g., who is a friend of whom, whether a user
recommends a particular item);
• images, where 𝑥𝑖,𝑗 represents the intensity of a colour component (e.g., red, green,
blue or shades of grey or hue, saturation, brightness; compare Section 16.4) of a
pixel in the (𝑛 − 𝑖 + 1)-th row and the 𝑗-th column.
Note In practice, more complex and less-structured data can quite often be mapped
to a tabular form. For instance, a set of audio recordings can be described by meas-
uring the overall loudness, timbre, and danceability of each song. Also, a collection
of documents can be described by means of the degrees of belongingness to some
automatically discovered topics (e.g., someone said that Joyce’s Ulysses is 80% travel
literature, 70% comedy, and 50% heroic fantasy, but let us not take it for granted).
The 𝑗-th column of 𝐗 will be denoted by 𝐱⋅,𝑗 :
                                    ⎡ 𝑥1,𝑗 ⎤
𝐱⋅,𝑗 = [ 𝑥1,𝑗  𝑥2,𝑗  ⋯  𝑥𝑛,𝑗 ]𝑇 =  ⎢ 𝑥2,𝑗 ⎥ ,
                                    ⎢  ⋮   ⎥
                                    ⎣ 𝑥𝑛,𝑗 ⎦
where ⋅𝑇 denotes the transpose of a given matrix (thanks to which we can save some
vertical space; we do not want this book to be 1000 pages long, do we?).
Also, recall that we are used to denoting vectors of length 𝑚 with 𝒙 = (𝑥1 , … , 𝑥𝑚 ).
Note To avoid notation clutter, we will often be implicitly promoting vectors like 𝒙 =
(𝑥1 , … , 𝑥𝑚 ) to row vectors 𝐱 = [𝑥1 ⋯ 𝑥𝑚 ], because this is the behaviour that numpy5
uses; see Chapter 8.
7.3.2 Transpose
The transpose of a matrix 𝐗 ∈ ℝ𝑛×𝑚 is an (𝑚 × 𝑛)-matrix 𝐘 given by:
          ⎡ 𝑥1,1   𝑥2,1   ⋯   𝑥𝑚,1 ⎤
𝐘 = 𝐗𝑇 =  ⎢ 𝑥1,2   𝑥2,2   ⋯   𝑥𝑚,2 ⎥ .
          ⎢  ⋮      ⋮     ⋱    ⋮   ⎥
          ⎣ 𝑥1,𝑛   𝑥2,𝑛   ⋯   𝑥𝑚,𝑛 ⎦
np.eye(5) # I
## array([[1., 0., 0., 0., 0.],
## [0., 1., 0., 0., 0.],
## [0., 0., 1., 0., 0.],
## [0., 0., 0., 1., 0.],
## [0., 0., 0., 0., 1.]])
The identity matrix is a neutral element of the matrix multiplication (Section 8.3).
More generally, any diagonal matrix, diag(𝑎1 , … , 𝑎𝑛 ), can be constructed from a given
sequence of elements by calling:
np.diag([1, 2, 3, 4])
## array([[1, 0, 0, 0],
## [0, 2, 0, 0],
## [0, 0, 3, 0],
## [0, 0, 0, 4]])
body[:6, :] # preview
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
body.shape
## (4221, 7)
This is an example of tabular (“structured”) data. The important property is that the
elements in each row describe the same person. We can freely change the order of the
participants (reorder the rows), as long as all the columns are permuted in the same way.
Still, sorting a single column and leaving the other ones unchanged will be semantically invalid.
Mathematically, we consider the above as a set of 4221 points in a seven-dimensional
space, ℝ7 . Let us discuss how we can try visualising different natural projections
thereof.
7.4.1 2D Data
A scatterplot can be used to visualise one variable against another one.
Figure 7.1 depicts upper leg length (the y-axis) vs (versus; against; as a function of)
standing height (the x-axis) in the form of a point cloud with (𝑥, 𝑦) coordinates like
(body[i, 1], body[i, 3]).
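A plot of this kind can be drawn, e.g., as follows (a sketch; the styling is arbitrary):
plt.plot(body[:, 1], body[:, 3], "o", color="black", alpha=0.05)
plt.xlabel(body_columns[1])
plt.ylabel(body_columns[3])
plt.show()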
Example 7.6 Here are the exact coordinates of the point corresponding to the person of the smal-
lest height:
and here is the one with the greatest upper leg length:
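They can be fetched, for instance, like this (a sketch):
body[np.argmin(body[:, 1]), [1, 3]]  # (height, upper leg length) of the shortest person
body[np.argmax(body[:, 3]), [1, 3]]  # ... of the person with the longest upper leg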
Figure 7.1: A scatterplot of upper leg length (cm) vs standing height (cm)
fig = plt.figure()
ax = fig.add_subplot(projection="3d", facecolor="#ffffff00")
ax.scatter(body[:, 1], body[:, 3], body[:, 0], color="#00000011")
ax.view_init(elev=30, azim=60, vertical_axis="y")
ax.set_xlabel(body_columns[1])
ax.set_ylabel(body_columns[3])
ax.set_zlabel(body_columns[0])
plt.show()
Infrequently will such a 3D plot provide us with readable results, though. We are pro-
jecting a three-dimensional reality onto a two-dimensional screen or page. Some in-
formation must inherently be lost. Also, what we see is relative to the position of the
virtual camera.
Exercise 7.7 (*) Try finding an interesting elevation and azimuth angle by playing with the
arguments passed to the mpl_toolkits.mplot3d.axes3d.Axes3D.view_init function. Also,
depict arm circumference, hip circumference, and weight on a 3D plot.
Note (*) Sometimes there might be facilities available to create an interactive scat-
terplot (running the above from the Python’s console enables this), where the virtual
camera can be freely repositioned with a mouse/touchpad. This can give some more
insight into our data. Also, there are means of creating animated sequences, where
we can fly over the data scene. Some people find it cool, others find it annoying, but
the biggest problem therewith is that they cannot be included in printed material. Yet,
if we are only targeting the display for the Web (this includes mobile devices), we can
try some Python libraries6 that output HTML+CSS+JavaScript code to be rendered by
a browser engine.
Example 7.8 Instead of drawing a 3D plot, it might be better to play with different marker col-
ours (or sometimes sizes: think of them as bubbles). Suitable colour maps7 can be used to distin-
guish between low and high values of an additional variable, as in Figure 7.3.
6 https://wiki.python.org/moin/NumericAndScientific/Plotting
7 https://matplotlib.org/stable/tutorials/colors/colormaps.html
We can see some tendency for the weight to be greater as both the arm and the hip circumferences
increase.
Exercise 7.9 Play around with different colour palettes. However, be wary that ca. every 1 in
12 men (8%) and 1 in 200 women (0.5%) have colour vision deficiencies, especially in the red-
green or blue-yellow spectrum. For this reason, some diverging colour maps might be worse than
others.
A piece of paper is two-dimensional. We only have height and width. Looking around
7 MULTIDIMENSIONAL NUMERIC DATA AT A GLANCE 143
us, we also understand the notion of depth. So far so good. But when the case of more-
dimensional data is concerned, well, suffice it to say that we are three-dimensional
creatures and any attempts towards visualising them will simply not work, don’t even
trip.
Luckily, this is where mathematics comes to our rescue. With some more knowledge
and intuitions, and this book helps us develop them, it will be as easy8 as imagining a
generic m-dimensional space, and then assuming that, say, m=7 or 42.
This is exactly why data science relies on automated methods for knowledge/pattern
discovery. Thanks to them, we can identify, describe, and analyse the structures that
might be present in the data, but cannot be perceived with our imperfect senses.
sns.pairplot(
data=pd.DataFrame( # sns.pairplot needs a DataFrame...
body[:, [0, 1, 4, 5]],
columns=body_columns[[0, 1, 4, 5]]
),
plot_kws=dict(alpha=0.1)
)
# plt.show() # not needed :/
Plotting variables against themselves is uninteresting (exercise: what would that be?).
This is why we included histograms on the main diagonal to see how they are distrib-
uted (the marginal distributions).
A scatterplot matrix can be a valuable tool for identifying interesting combinations
of columns in our datasets. We see that some pairs of variables are more “structured”
than others, e.g., hip circumference and weight are more or less aligned on a straight
line. This is why in Chapter 9 we will be interested in describing the possible relation-
ships between the variables.
Exercise 7.10 (*) Use matplotlib.pyplot.subplot and other functions we learnt in the pre-
vious part to create a scatterplot matrix manually. Draw weight, arm circumference, and hip
circumference on a logarithmic scale.
8 This is an old funny joke that most funny mathematicians find funny. Ha.
Figure 7.4: Scatterplot matrix for selected columns in the body dataset: scatterplots for
all unique pairs of variables together with histograms on the main diagonal
7.5 Exercises
Exercise 7.11 What is the difference between [1, 2, 3], [[1, 2, 3]], and [[1], [2], [3]]
in the context of array creation?
Exercise 7.12 If A is a matrix with 5 rows and 6 columns, what is the difference between A.
reshape(6, 5) and A.T?
Exercise 7.13 If A is a matrix with 5 rows and 6 columns, what is the meaning of: A.
reshape(-1), A.reshape(3, -1), A.reshape(-1, 3), A.reshape(-1, -1), A.shape = (3,
10), and A.shape = (-1, 3)?
Exercise 7.14 List some methods to add a new row and add a new column to an existing matrix.
A = np.array([
[0.2, 0.6, 0.4, 0.4],
[0.0, 0.2, 0.4, 0.7],
[0.8, 0.8, 0.2, 0.1]
]) # example matrix that we will be using below
For example:
np.square(A)
## array([[0.04, 0.36, 0.16, 0.16],
## [0. , 0.04, 0.16, 0.49],
## [0.64, 0.64, 0.04, 0.01]])
np.mean(A)
## 0.39999999999999997
np.mean(A, axis=0)
## array([0.33333333, 0.53333333, 0.33333333, 0.4 ])
np.mean(A, axis=1)
## array([0.4 , 0.325, 0.475])
Important Let us repeat, axis=1 does not mean that we get the column means (even
though columns constitute the 2nd axis, and we count starting at 0). It denotes the
axis along which the matrix is sliced. Sadly, even yours truly sometimes does not get it
right on the first attempt.
Exercise 8.1 Given the nhanes_adult_female_bmx_20201 dataset, compute the mean, stand-
ard deviation, minimum, and maximum of each body measurement.
We will get back to the topic of the aggregation of multidimensional data in Sec-
tion 8.4.
1 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_bmx_2020.
csv
More generally, a set of rules referred to in the numpy manual as broadcasting2 describes
how this package handles arrays of different shapes.
Important Generally, for two matrices, their column/row numbers must match or
be equal to 1. Also, if one operand is a one-dimensional array, it will be promoted to a
row vector.
Matrix vs Scalar
If one operand is a scalar, then it is going to be propagated over all matrix elements,
for example:
(-1)*A
## array([[-0.2, -0.6, -0.4, -0.4],
## [-0. , -0.2, -0.4, -0.7],
## [-0.8, -0.8, -0.2, -0.1]])
changes the sign of every element, which is, mathematically, an instance of multiply-
ing a matrix 𝐗 by a scalar 𝑐:
       ⎡ 𝑐𝑥1,1   𝑐𝑥1,2   ⋯   𝑐𝑥1,𝑚 ⎤
𝑐𝐗 =  ⎢ 𝑐𝑥2,1   𝑐𝑥2,2   ⋯   𝑐𝑥2,𝑚 ⎥ .
       ⎢   ⋮       ⋮     ⋱     ⋮   ⎥
       ⎣ 𝑐𝑥𝑛,1   𝑐𝑥𝑛,2   ⋯   𝑐𝑥𝑛,𝑚 ⎦
Furthermore:
A**2
## array([[0.04, 0.36, 0.16, 0.16],
## [0. , 0.04, 0.16, 0.49],
## [0.64, 0.64, 0.04, 0.01]])
A >= 0.25
## array([[False, True, True, True],
## [False, False, True, True],
## [ True, True, False, False]])
Matrix vs Matrix
For two matrices of identical sizes, we act on the corresponding elements:
2 https://numpy.org/devdocs/user/basics.broadcasting.html
3 This is not the same as matrix-multiply by itself which we cover in Section 8.3.
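For instance, let B be a lower-triangular 0–1 matrix of the same shape (one possible
choice, consistent with the result below):
B = np.tri(3, 4)  # an assumed example matrix
B
## array([[1., 0., 0., 0.],
##        [1., 1., 0., 0.],
##        [1., 1., 1., 0.]])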
And now:
A * B
## array([[0.2, 0. , 0. , 0. ],
## [0. , 0.2, 0. , 0. ],
## [0.8, 0.8, 0.2, 0. ]])
Example 8.2 (*) Figure 8.1 depicts a (filled) contour plot of the Himmelblau’s function,
𝑓 (𝑥, 𝑦) = (𝑥2 + 𝑦 − 11)2 + (𝑥 + 𝑦2 − 7)2 , for 𝑥 ∈ [−5, 5] and 𝑦 ∈ [−4, 4]. To draw it, we
probed 250 points from the two said ranges and called numpy.meshgrid to generate two matrices,
both of shape 250 by 250, giving the x- and y-coordinates of all the points on the corresponding
two-dimensional grid. Thanks to this, we were able to use vectorised mathematical operations to
compute the values of 𝑓 thereon.
x = np.linspace(-5, 5, 250)
y = np.linspace(-4, 4, 250)
xg, yg = np.meshgrid(x, y)
z = (xg**2 + yg - 11)**2 + (xg + yg**2 - 7)**2
plt.contourf(x, y, z, levels=20)
CS = plt.contour(x, y, z, levels=[1, 5, 10, 20, 50, 100, 150, 200, 250])
plt.clabel(CS, colors="black")
plt.show()
To understand the result generated by numpy.meshgrid, here is its output for a smaller number
of probe points:
Figure 8.1: An example filled contour plot with additional labelled contour lines
x = np.linspace(-5, 5, 3)
y = np.linspace(-4, 4, 5)
xg, yg = np.meshgrid(x, y)
xg
## array([[-5., 0., 5.],
## [-5., 0., 5.],
## [-5., 0., 5.],
## [-5., 0., 5.],
## [-5., 0., 5.]])
yg
## array([[-4., -4., -4.],
## [-2., -2., -2.],
## [ 0., 0., 0.],
## [ 2., 2., 2.],
## [ 4., 4., 4.]])
gives a matrix 𝐙 such that 𝑧𝑖,𝑗 is generated by considering the 𝑖-th element in y and the 𝑗-th item
in x, which is exactly what we desired.
The above propagated the column vector over all columns (left to right).
Similarly, combining with a 1×m row vector:
                                ⎡ 𝑥1,1 + 𝑡1   𝑥1,2 + 𝑡2   …   𝑥1,𝑚 + 𝑡𝑚 ⎤
𝐗 + 𝐭 = 𝐗 + [𝑡1 𝑡2 ⋯ 𝑡𝑚 ] =  ⎢ 𝑥2,1 + 𝑡1   𝑥2,2 + 𝑡2   …   𝑥2,𝑚 + 𝑡𝑚 ⎥ .
                                ⎢     ⋮           ⋮       ⋱       ⋮     ⎥
                                ⎣ 𝑥𝑛,1 + 𝑡1   𝑥𝑛,2 + 𝑡2   …   𝑥𝑛,𝑚 + 𝑡𝑚 ⎦
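For example (a small illustration):
t = np.array([10, 20, 30, 40])
A + t  # t is added to each row of A
## array([[10.2, 20.6, 30.4, 40.4],
##        [10. , 20.2, 30.4, 40.7],
##        [10.8, 20.8, 30.2, 40.1]])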
Exercise 8.4 Check out that numpy.where relies on similar shape broadcasting rules as the
binary operators we discussed here, in its case with respect to all three arguments.
Example 8.5 (*) Himmelblau’s function in Figure 8.1 is only defined by means of arithmetic
operators, which all accept the kind of shape broadcasting that we discuss in this section. Con-
sequently, calling numpy.meshgrid in that example to evaluate 𝑓 on a grid of points was not really
necessary:
x = np.linspace(-5, 5, 3)
y = np.linspace(-4, 4, 5)
xg = x.reshape(1, -1)
yg = y.reshape(-1, 1)
(xg**2 + yg - 11)**2 + (xg + yg**2 - 7)**2
## array([[116., 306., 296.],
## [208., 178., 148.],
## [340., 170., 200.],
## [320., 90., 260.],
## [340., 130., 520.]])
See also the sparse parameter in numpy.meshgrid and Figure 12.9 where this function turns out
useful after all.
np.sort(A, axis=1)
## array([[0.2, 0.4, 0.4, 0.6],
##        [0. , 0.2, 0.4, 0.7],
##        [0.1, 0.2, 0.8, 0.8]])
4 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_bmx_2020.csv
scipy.stats.rankdata(A, axis=0)
## array([[2. , 2. , 2.5, 2. ],
## [1. , 1. , 2.5, 3. ],
## [3. , 3. , 1. , 1. ]])
Still, the aforementioned numpy.mean is amongst the many exceptions to this rule.
Compare the above with:
np.diff(A, axis=0)
## array([[-0.2, -0.4, 0. , 0.3],
## [ 0.8, 0.6, -0.2, -0.6]])
which gives the iterated differences for each column separately (along the rows).
If a function (built-in or custom) in not equipped with the axis argument and – in-
stead – it was designed to work with individual vectors, we can propagate it over all
the rows or columns by calling numpy.apply_along_axis.
For instance, here is another (did you solve the suggested exercise?) way to compute
the column z-scores:
def standardise(x):
return (x-np.mean(x))/np.std(x)
np.round(np.apply_along_axis(standardise, 0, A), 2)
## array([[-0.39, 0.27, 0.71, -0. ],
## [-0.98, -1.34, 0.71, 1.22],
## [ 1.37, 1.07, -1.41, -1.22]])
Note (*) Matrices are iterable (in the sense of Section 3.4), but in an interesting way.
Namely, an iterator traverses through each row in a matrix. Writing:
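for instance (one possible statement, consistent with the description that follows):
r1, r2, r3 = A  # unpack the rows into three separate variables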
creates three variables, each representing a separate row in A, the second of which is:
r2
## array([0. , 0.2, 0.4, 0.7])
Important Generally:
• each scalar index reduces the dimensionality of the subsetted object by 1;
• slice-slice and slice-scalar indexing returns a view on the existing array, so we need
to be careful when modifying the resulting object;
• usually, indexing returns a submatrix (subblock), which is a combination of ele-
ments at given rows and columns;
• indexing with two integer or logical vectors at the same time should be avoided.
A = np.array([
[0.2, 0.6, 0.4, 0.4],
[0.0, 0.2, 0.4, 0.7],
[0.8, 0.8, 0.2, 0.1]
])
A[::2, 3:] # every second row, skip the first three columns
## array([[0.4],
## [0.1]])
A[:, 3]
## array([0.4, 0.7, 0.1])
selects the 4th column and gives a flat vector (we can always use the reshape method
to convert the resulting object back to a matrix).
Furthermore:
A[0, -1]
## 0.4
yields the element (scalar) in the first row and the last column.
Moreover, A[ [0, -1, 0], ::-1 ] selects the first, the last, and the first row again, and
reverses the order of the columns, whereas A[ A[:, 0] > 0.1, : ] selects the rows from A
where the values in the first column are greater than 0.1.
A[np.argsort(A[:, 0]), : ]
## array([[0. , 0.2, 0.4, 0.7],
## [0.2, 0.6, 0.4, 0.4],
## [0.8, 0.8, 0.2, 0.1]])
orders the matrix with respect to the values in the first column (all rows permuted in
the same way, together).
Exercise 8.6 In the nhanes_adult_female_bmx_20205 dataset, select all the participants
whose heights are within their mean ± 2 standard deviations.
5 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_bmx_2020.
csv
6 https://numpy.org/doc/stable/user/basics.indexing.html
For instance, A[ [0, -1, 0, 2, 0], [1, 2, 0, 2, 1] ] yields A[0, 1], A[-1, 2], A[0, 0], A[2, 2], and A[0, 1].
To select a submatrix using integer indexes, it is best to make sure that the first indexer
is a column vector, and the second one is a row vector (or some objects like these, e.g.,
compatible lists of lists).
A[ [[0], [-1]], [[1, 3]] ] # column vector-like list, row vector-like list
## array([[0.6, 0.4],
## [0.8, 0.1]])
Further, if indexing involves logical vectors, it is best to convert them to integer ones
first (e.g., by calling numpy.flatnonzero).
The necessary reshaping can be done automatically with the numpy.ix_ function:
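For instance (equivalent to the previous call):
A[np.ix_([0, -1], [1, 3])]  # rows 0 and -1, columns 1 and 3
## array([[0.6, 0.4],
##        [0.8, 0.1]])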
This is only a mild inconvenience. We will be forced to apply such double indexing
anyway in pandas whenever selecting rows by position and columns by name is required;
see Section 10.5.
B = A[:, ::2]
B
## array([[0.2, 0.4],
## [0. , 0.4],
## [0.8, 0.2]])
B *= -1
A
## array([[-0.2, 0.6, -0.4, 0.4],
## [-0. , 0.2, -0.4, 0.7],
## [-0.8, 0.8, -0.2, 0.1]])
This is time and memory efficient, but might lead to some unexpected results if we are
being rather absent-minded. We have been warned.
With numpy arrays, however, brand new rows or columns cannot be added via the index
operator. Instead, the whole array needs to be created from scratch using, e.g., one of
the functions discussed in Section 7.1.4. For example:
The matrix multiplication of 𝐀 ∈ ℝ𝑛×𝑝 and 𝐁 ∈ ℝ𝑝×𝑚 yields 𝐂 = 𝐀𝐁 ∈ ℝ𝑛×𝑚 such that
𝑐𝑖,𝑗 = ∑_{𝑘=1}^{𝑝} 𝑎𝑖,𝑘 𝑏𝑘,𝑗 ,
for 𝑖 = 1, … , 𝑛 and 𝑗 = 1, … , 𝑚.
For example:
A = np.array([
[1, 0, 1],
[2, 2, 1],
[3, 2, 0],
[1, 2, 3],
[0, 0, 1],
])
B = np.array([
[1, 0, 0, 0],
[0, 4, 1, 3],
[2, 0, 3, 1],
])
And now:
C = A @ B # or: A.dot(B)
C
## array([[ 3, 0, 3, 1],
## [ 4, 8, 5, 7],
## [ 3, 8, 2, 6],
## [ 7, 8, 11, 9],
## [ 2, 0, 3, 1]])
⎡ 1 0 1 ⎤                        ⎡ 3 0  3 1 ⎤
⎢ 2 2 1 ⎥    ⎡ 1 0 𝟎 0 ⎤        ⎢ 4 8  5 7 ⎥
⎢ 3 2 0 ⎥    ⎢ 0 4 𝟏 3 ⎥    =   ⎢ 3 8  2 6 ⎥ .
⎢ 𝟏 𝟐 𝟑 ⎥    ⎣ 2 0 𝟑 1 ⎦        ⎢ 7 8 𝟏𝟏 9 ⎥
⎣ 0 0 1 ⎦                        ⎣ 2 0  3 1 ⎦
For example, the element in the 4th row and 3rd column, 𝑐4,3 takes the 4th row in the
left matrix 𝐚4,⋅ = [1 2 3] and the 3rd column in the right matrix 𝐛⋅,3 = [0 1 3]𝑇 (they
are marked in bold), multiplies the corresponding elements and computes their sum,
i.e., 𝑐4,3 = 1 ⋅ 0 + 2 ⋅ 1 + 3 ⋅ 3 = 11.
Another example:
A = np.array([
[1, 2],
[3, 4]
])
I = np.eye(2)  # the 2-by-2 identity matrix (an assumed definition, consistent with the chunk below)
Important In most textbooks, just like in this one, 𝐀𝐁 always denotes the matrix mul-
tiplication. This is a very different operation from the elementwise multiplication.
A * I # elementwise multiplication
## array([[1, 0],
## [0, 4]])
Exercise 8.7 (*) Show that (𝐀𝐁)𝑇 = 𝐁𝑇 𝐀𝑇 . Also notice that, typically, matrix multiplica-
tion is not commutative.
The dot product (scalar product) of two vectors 𝒙, 𝒚 ∈ ℝ𝑚 is given by 𝒙 ⋅ 𝒚 = ∑_{𝑖=1}^{𝑚} 𝑥𝑖 𝑦𝑖 .
In matrix multiplication terms, if 𝐱 is a row vector and 𝐲𝑇 is a column vector, then the
above can be written as 𝐱𝐲𝑇 . The result is a single number.
In particular, the dot product of a vector and itself, 𝒙 ⋅ 𝒙 = ∑_{𝑖=1}^{𝑚} 𝑥𝑖^2 ,
is the square of the Euclidean norm of 𝒙, which – as we said in Section 5.3.2 – can be
used to measure the magnitude of a vector.
Exercise 8.8 Show that 𝐀𝑇 𝐀 gives the matrix that consists of the dot products of all the pairs
of columns in 𝐀 and 𝐀𝐀𝑇 stores the dot products of all the pairs of rows.
In Section 9.3.2, we will see that matrix multiplication can be used as a way to express
certain geometrical transformations of points in a dataset, e.g., scaling and rotating.
Also, in Section 9.3.3, we briefly discuss the concept of the inverse of a matrix and in
Section 9.3.4, we introduce its singular value decomposition.
Important Given two vectors of equal lengths 𝒙, 𝒚 ∈ ℝ𝑚 , the dot product of their
difference:
(𝒙 − 𝒚) ⋅ (𝒙 − 𝒚) = (𝐱 − 𝐲)(𝐱 − 𝐲)𝑇 = ∑_{𝑖=1}^{𝑚} (𝑥𝑖 − 𝑦𝑖 )^2 ,
8 There are many possible distances, allowing to measure the similarity of points not only in ℝ𝑚 , but
also character strings (e.g., the Levenshtein metric), ratings (e.g., cosine dissimilarity), etc.; there is even
an encyclopedia of distances [23].
is nothing else than the square of the Euclidean distance between them.
In particular, for unidimensional data (𝑚 = 1), we have ‖𝒖 − 𝒗‖ = |𝑢1 − 𝑣1 |, i.e., the
absolute value of the difference.
Exercise 8.9 Consider the matrix:
      ⎡   0    0 ⎤
𝐗 =  ⎢   1    0 ⎥ .
      ⎢ −3/2   1 ⎥
      ⎣   1    1 ⎦
Calculate (by hand): ‖𝐱1,⋅ − 𝐱2,⋅ ‖, ‖𝐱1,⋅ − 𝐱3,⋅ ‖, ‖𝐱1,⋅ − 𝐱4,⋅ ‖, ‖𝐱2,⋅ − 𝐱4,⋅ ‖, ‖𝐱2,⋅ − 𝐱3,⋅ ‖,
‖𝐱1,⋅ − 𝐱1,⋅ ‖, and ‖𝐱2,⋅ − 𝐱1,⋅ ‖.
The distances between all the possible pairs of rows in two matrices 𝐗 ∈ ℝ𝑛×𝑚 and
𝐘 ∈ ℝ𝑘×𝑚 can be computed by calling scipy.spatial.distance.cdist. We need to
be careful, though, because they result in a distance matrix of size 𝑛 × 𝑘, which can
become quite large (e.g., for 𝑛 = 𝑘 = 100,000 we would need ca. 80 GB of RAM to
store it).
Here are the distances between all the pairs of points in the same dataset.
X = np.array([
[0, 0],
[1, 0],
[-1.5, 1],
[1, 1]
])
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, X)
D
## array([[0. , 1. , 1.80277564, 1.41421356],
## [1. , 0. , 2.6925824 , 1. ],
## [1.80277564, 2.6925824 , 0. , 2.5 ],
## [1.41421356, 1. , 2.5 , 0. ]])
Hence, 𝑑𝑖,𝑗 = ‖𝐱𝑖,⋅ − 𝐱𝑗,⋅ ‖. That we have zeros on the diagonal is due to the fact that
‖𝒖 − 𝒗‖ = 0 if and only if 𝒖 = 𝒗. Furthermore, ‖𝒖 − 𝒗‖ = ‖𝒗 − 𝒖‖, which implies the
symmetry of 𝐃, i.e., it holds 𝐃𝑇 = 𝐃.
Figure 8.2 illustrates all six non-trivial pairwise distances. Let us emphasise that our
perception of distance is disturbed because the aspect ratio (the ratio between the
range of the x-axis to the range of the y-axis) is not 1:1. This is why it is very import-
ant, when judging spatial relationships between the points, to call matplotlib.pyplot.
axis("equal") or set the axis limits manually (which is left as an exercise).
Figure 8.2: Distances between four example points; their perception is disturbed be-
cause the aspect ratio is not 1:1
Important Some popular techniques in data science rely on computing pairwise dis-
tances, including:
• multidimensional data aggregation (see below),
• k-means clustering (Section 12.4),
• k-nearest neighbour regression (Section 9.2.1) and classification (Section 12.3.1),
• missing value imputation (Section 15.1),
• density estimation (which we can use for outlier detection, see Section 15.4).
In the sequel, whenever we apply them, we will be assuming that data have been ap-
propriately preprocessed: in particular, that columns are on the same scale (e.g., have
been standardised).
8.4.2 Centroids
So far we have been only discussing ways to aggregate unidimensional data (for in-
stance, each matrix column separately). It turns out that some summaries can be gen-
eralised to the multidimensional case.
For instance, it can be shown that the arithmetic mean of a vector (𝑥1 , … , 𝑥𝑛 ) is a point
𝑐 that minimises the sum of the squared unidimensional distances between itself and
all the 𝑥𝑖 s, i.e., ∑_{𝑖=1}^{𝑛} ‖𝑥𝑖 − 𝑐‖^2 = ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝑐)^2 .
We can define the centroid of a dataset 𝐗 ∈ ℝ𝑛×𝑚 as the point 𝒄 ∈ ℝ𝑚 to which the
overall squared distance is the smallest:
minimise ∑_{𝑖=1}^{𝑛} ‖𝐱𝑖,⋅ − 𝒄‖^2 w.r.t. 𝒄.
It can be shown that the solution is the componentwise arithmetic mean, i.e., the j-th
component of 𝒄 is:
𝑐𝑗 = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖,𝑗 .
For instance, the centroid of the dataset depicted in Figure 8.2 is:
c = np.mean(X, axis=0)
c
## array([0.125, 0.5 ])
Centroids are, amongst others, a basis for the k-means clustering method that we dis-
cuss in Section 12.4.
Note (**) Generalising other aggregation functions is not a trivial task, because,
amongst others, there is no natural linear ordering relation in the multidimensional
space (see, e.g., [74]). For instance, any point on the convex hull of a dataset could serve
as an analogue of the minimal and maximal observation.
Furthermore, the componentwise median does not behave nicely (it may, for example,
fall outside the convex hull). Instead, we usually consider a different generalisation
of the median: the point 𝒎 which minimises the sum of distances (not squared),
∑_{𝑖=1}^{𝑛} ‖𝐱𝑖,⋅ − 𝒎‖. Sadly, it does not have an analytic solution, but it can be determined
algorithmically.
Note (**) A bag plot [79] is one of the possible multidimensional generalisations of the
box-and-whisker plot. Unfortunately, its use is quite limited due to its low popularity
amongst practitioners.
Given a pivot point 𝒙′ ∈ ℝ𝑚 and a dataset 𝐗 ∈ ℝ𝑛×𝑚 , two types of queries are of particular interest:
• fixed-radius search: for a given radius 𝑟 > 0, we seek the indexes of all the points in 𝐗
whose distance to 𝒙′ does not exceed 𝑟:
𝐵𝑟 (𝒙′ ) = {𝑖 ∶ ‖𝐱𝑖,⋅ − 𝒙′ ‖ ≤ 𝑟} ;
• few nearest neighbour search: for some (usually small) integer 𝑘 ≥ 1, we seek the
indexes of the 𝑘 points in 𝐗 which are the closest to 𝒙′ :
𝑁𝑘 (𝒙′ ) = {𝑖1 , 𝑖2 , … , 𝑖𝑘 },
i.e., such that ‖𝐱𝑖1 ,⋅ − 𝒙′ ‖ ≤ ‖𝐱𝑖2 ,⋅ − 𝒙′ ‖ ≤ … are the smallest of all the distances to 𝒙′ .
Here is an example dataset, consisting of some randomly generated points (see Fig-
ure 8.3).
np.random.seed(777)
X = np.random.randn(25, 2)
x_test = np.array([0, 0])
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, x_test.reshape(1, -1))
For instance, here are the indexes of the points in 𝐵0.75 (𝒙′ ):
r = 0.75
B = np.flatnonzero(D <= r)
B
## array([ 1, 11, 14, 16, 24])
And here are the indexes of the 𝑘 = 11 nearest neighbours of 𝒙′ :
k = 11
N = np.argsort(D.reshape(-1))[:k]
N
## array([14, 24, 16, 11, 1, 22, 7, 19, 0, 9, 15])
See Figure 8.3 for an illustration (observe that the aspect ratio is set to 1:1 as otherwise
the circle would look like an ellipse).
fig, ax = plt.subplots()
ax.add_patch(plt.Circle(x_test, r, color="red", alpha=0.1))
for i in range(k):
plt.plot(
[x_test[0], X[N[i], 0]],
[x_test[1], X[N[i], 1]],
"r:", alpha=0.4
)
plt.plot(X[:, 0], X[:, 1], "bo", alpha=0.1)
for i in range(X.shape[0]):
plt.text(X[i, 0], X[i, 1], str(i), va="center", ha="center")
plt.plot(x_test[0], x_test[1], "rX")
plt.text(x_test[0], x_test[1], "$\\mathbf{x}'$", va="center", ha="center")
plt.axis("equal")
plt.show()
Figure 8.3: An example dataset: the pivot point 𝒙′ together with its 𝑘 = 11 nearest neighbours and the ball of radius 𝑟 = 0.75 centred at it
Note (*) In K-d trees, the data space is partitioned into hyperrectangles along the axes
of the Cartesian coordinate system (standard basis). Thanks to such a representation,
all subareas which are too far from the point of interest can be pruned to speed up the
search.
Let us create the data structure for searching within the above 𝐗 matrix.
import scipy.spatial
T = scipy.spatial.KDTree(X)
Assume we would like to make queries with regard to the 3 following pivot points.
X_test = np.array([
[0, 0],
[2, 2],
    # (and a third pivot point)
])
9 In our context, we should prefer referring to them as m-d trees, but let us stick with the traditional name.
Here are the results for the fixed radius searches (𝑟 = 0.75):
T.query_ball_point(X_test, 0.75)
## array([list([1, 11, 14, 16, 24]), list([20]), list([])], dtype=object)
We see that the search was nicely vectorised: we made a query about three points at
the same time. As a result, we received a list-like object storing three lists representing
the indexes of interest. Note that in the case of the 3rd point, there are no elements in
𝐗 within the range (ball) of interest, hence the empty index list.
And here are the 5 nearest neighbours:
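The objects inspected below can be obtained via the query method (a sketch):
distances, indexes = T.query(X_test, k=5)  # the 5 nearest neighbours of each pivot point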
distances
## array([[0.31457701, 0.44600012, 0.54848109, 0.64875661, 0.71635172],
## [0.20356263, 1.45896222, 1.61587605, 1.64870864, 2.04640408],
## [1.2494805 , 1.35482619, 1.93984334, 1.95938464, 2.08926502]])
indexes
## array([[14, 24, 16, 11, 1],
## [20, 5, 13, 2, 9],
## [17, 3, 21, 12, 22]])
Each of them is a matrix with 3 rows (corresponding to the number of pivot points)
and 5 columns (the number of neighbours sought).
Note (*) We expect the K-d trees to be much faster than the brute-force approach
(where we compute all pairwise distances) in low-dimensional spaces. Nonetheless,
due to the phenomenon called the curse of dimensionality, sometimes already for 𝑚 ≥ 5
the speed gains might be very small; see, e.g., [9].
8.5 Exercises
Exercise 8.10 Does numpy.mean(A, axis=0) compute rowwise or columnwise means?
Exercise 8.11 How does shape broadcasting work? List the most common pairs of shape cases
when performing arithmetic operations like addition or multiplication.
Exercise 8.12 What are the possible matrix indexing schemes and how do they behave?
Exercise 8.13 Which kinds of matrix indexers return a view on an existing array?
Exercise 8.14 (*) How can we select a submatrix comprised of the first and the last row and the
first and the last column?
Exercise 8.15 Why is appropriate data preprocessing required when computing the Euclidean
distance between points?
Exercise 8.16 What is the relationship between the dot product, the Euclidean norm, and the
Euclidean distance?
Exercise 8.17 What is a centroid? How is it defined by means of the Euclidean distance between
the points in a dataset?
Exercise 8.18 What is the difference between the fixed-radius and few nearest-neighbours
search?
Exercise 8.19 (*) When might K-d trees or other spatial search data structures be better than a
brute-force search with scipy.spatial.distance.cdist?
9
Exploring Relationships Between Variables
Let us go back to the National Health and Nutrition Examination Survey (NHANES study)
excerpt that we were playing with in Section 7.4:
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body = np.array(body) # convert to matrix
body[:6, :] # preview: 6 first rows, all columns
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
body.shape
## (4221, 7)
We thus have 𝑛 = 4221 participants and 7 different features describing them, in this
order:
body_columns = np.array([
"weight",
"height",
"arm len",
"leg len",
"arm circ",
"hip circ",
"waist circ"
])
We expect the data in different columns to be related to each other (e.g., a taller person
usually tends to weigh more). This is why we will now be interested in quantifying the
degree of association between the variables, modelling the possible functional rela-
tionships, and finding new interesting combinations of columns.
The Pearson linear correlation coefficient is given by:
𝑟(𝒙, 𝒚) = (1/𝑛) ∑_{𝑖=1}^{𝑛} ((𝑥𝑖 − 𝑥̄)/𝑠𝑥 ) ((𝑦𝑖 − 𝑦̄)/𝑠𝑦 ),
with 𝑠𝑥 , 𝑠𝑦 denoting the standard deviations and 𝑥̄, 𝑦̄ being the means of 𝒙 =
(𝑥1 , … , 𝑥𝑛 ) and 𝒚 = (𝑦1 , … , 𝑦𝑛 ), respectively.
Note Look carefully: we are computing the mean of the pairwise products of stand-
ardised versions of the two vectors. It is a normalised measure of how they vary together
(co-variance).
(*) Furthermore, in Section 9.3.1, we mention that 𝑟 is the cosine of the angle between
centred and normalised versions of the vectors.
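For instance, it can be computed manually as follows (a sketch; judging by the output of
pearsonr below, x and y stand for the arm and hip circumference columns):
x = body[:, 4]  # arm circumference
y = body[:, 5]  # hip circumference
np.mean((x - np.mean(x))/np.std(x) * (y - np.mean(y))/np.std(y))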
And here is a built-in function (for the lazy, in a good sense) that implements the same
formula:
scipy.stats.pearsonr(x, y)[0]
## 0.8680627457873241
Note the [0] part: the function returns more than we actually need.
To get more insight, below we shall illustrate some interesting correlations using a
function that draws a scatterplot and prints out Pearson's 𝑟 (as well as Spearman's
𝜌, which we discuss in Section 9.1.4; let us ignore it until then):
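A possible definition of this helper (a sketch, consistent with how it is called below and
with the panel titles in the figures):
def plot_corr(x, y, axes_eq=False):
    """Draw a scatterplot of y vs x and report r and rho in the panel title."""
    r = scipy.stats.pearsonr(x, y)[0]
    rho = scipy.stats.spearmanr(x, y)[0]
    plt.plot(x, y, "o", alpha=0.5)
    plt.title(f"r = {r:.3}, rho = {rho:.3}", fontsize=9)
    if axes_eq:
        plt.axis("equal")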
x = np.random.rand(100)
plt.subplot(1, 2, 1); plot_corr(x, -0.5*x+3, axes_eq=True) # negative slope
plt.subplot(1, 2, 2); plot_corr(x, 3*x+10, axes_eq=True) # positive slope
plt.show()
A negative correlation means that when one variable increases, the other one de-
creases (like: the amount of fuel left in a car's tank vs the distance travelled so far).
Notice again that the arm and hip circumferences enjoy quite high positive degree of
linear correlation. Their scatterplot (Figure 7.4) looks somewhat similar to one of the
cases presented here.
Exercise 9.1 Draw a series of similar plots but for the case of negatively correlated point pairs,
e.g., 𝑦 = −2𝑥 + 5.
Important As a rule of thumb, linear correlation degree of 0.9 or greater (or -0.9 or
smaller) is quite decent. Between -0.8 and 0.8 we probably should not be talking about
two variables being linearly correlated at all. Some textbooks are more lenient, but we
have higher standards. In particular, it is not uncommon in social sciences to consider
0.6 a decent degree of correlation, but this is like building on sand. If a dataset at hand
does not provide us with strong evidence, it is our ethical duty to refrain from making
unjustified statements.
Figure 9.2: Linear correlation coefficients for data with different amounts of noise
plt.subplot(2, 2, 1)
plot_corr(x, np.random.rand(100)) # independent (not correlated)
plt.subplot(2, 2, 2)
# ... (the remaining panels)
plt.show()
1 Note that in Section 6.2.3, we were also testing one concrete hypothesis: whether a distribution was
normal or whether it was anything else. We only know that if the data really follow that distribution, the
null hypothesis will not be rejected in 0.1% of the cases. The rest is silence.
Figure 9.3: Practically uncorrelated variable pairs (two of the panels are annotated with r = 0.0194, ρ = 0.0147 and r = 0.00917, ρ = 0.0231)
plt.subplot(2, 2, 1)
plot_corr(x, np.sin(0.6*np.pi*x)) # sine
plt.subplot(2, 2, 2)
plot_corr(x, np.log(x+1)) # logarithm
plt.subplot(2, 2, 3)
plot_corr(x, np.exp(x**2)) # exponential of square
plt.subplot(2, 2, 4)
plot_corr(x, 1/(x/2+0.2)) # reciprocal
plt.show()
Figure 9.4: Example non-linear relationships that look linear, at least to Pearson's r (two of the panels are annotated with r = 0.926, ρ = 1.0 and r = 0.949, ρ = 1.0)
No single measure is perfect: we are trying to compress 2n data points into a single
number, so there will obviously be many different datasets, sometimes remarkably
diverse, that yield the same correlation value.
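The correlation matrix and the heatmap referred to below are not computed anywhere in this excerpt; presumably (an assumption; sns is the seaborn alias used elsewhere in this book), something along these lines was involved:
R = np.corrcoef(body, rowvar=False)   # Pearson's r for every pair of columns
order = [4, 5, 6, 0, 2, 1, 3]         # arm circ, hip circ, waist circ, weight, arm len, height, leg len
sns.heatmap(R[np.ix_(order, order)], annot=True, fmt=".2f",
    xticklabels=body_columns[order], yticklabels=body_columns[order])
plt.show()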
Notice that we ordered2 the columns to reveal some naturally occurring variable
clusters: for instance, arm, hip, and waist circumferences as well as weight are all quite
strongly correlated.

Figure 9.5: A heatmap of the linear correlation coefficients for all pairs of variables; the values it encodes are:

            arm circ  hip circ  waist circ  weight  arm len  height  leg len
arm circ        1.00      0.87        0.85    0.91     0.45    0.15     0.08
hip circ        0.87      1.00        0.90    0.95     0.46    0.20     0.10
waist circ      0.85      0.90        1.00    0.90     0.43    0.13    -0.03
weight          0.91      0.95        0.90    1.00     0.55    0.35     0.19
arm len         0.45      0.46        0.43    0.55     1.00    0.67     0.48
height          0.15      0.20        0.13    0.35     0.67    1.00     0.66
leg len         0.08      0.10       -0.03    0.19     0.48    0.66     1.00

Of course, we have 1.0s on the main diagonal because a variable is trivially correlated
with itself. Interestingly, this heatmap is symmetric, which is due to the property
$r(\boldsymbol{x}, \boldsymbol{y}) = r(\boldsymbol{y}, \boldsymbol{x})$.

2 (**) This can be done automatically via some hierarchical clustering algorithm applied onto the
transformed correlation matrix.
Example 9.3 (*) To fetch the row and column index of the most correlated pair of variables
(either positively or negatively), we should first take the upper (or lower) triangle of the correl-
ation matrix (see numpy.triu or numpy.tril) to ignore the irrelevant and repeating items:
Ru = np.triu(np.abs(R), 1)
np.round(Ru, 2)
## array([[0. , 0.35, 0.55, 0.19, 0.91, 0.95, 0.9 ],
## [0. , 0. , 0.67, 0.66, 0.15, 0.2 , 0.13],
## [0. , 0. , 0. , 0.48, 0.45, 0.46, 0.43],
## [0. , 0. , 0. , 0. , 0.08, 0.1 , 0.03],
## [0. , 0. , 0. , 0. , 0. , 0.87, 0.85],
## [0. , 0. , 0. , 0. , 0. , 0. , 0.9 ],
## [0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
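The lookup step that the following note refers to is not shown above; it presumably boiled down to:
which = np.unravel_index(np.argmax(Ru), Ru.shape)
which                       # (row, column) of the greatest absolute correlation: here, (0, 5)
body_columns[list(which)]   # i.e., weight and hip circumference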
Note that numpy.argmax returns an index in the flattened (unidimensional) array. We had to use
numpy.unravel_index to convert it to a two-dimensional one.
world = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/world_factbook_2020_subset1.csv",
comment="#")
world = np.array(world) # convert to matrix
world[:6, :] # preview
## array([[ 2000. , 52.8],
## [12500. , 79. ],
## [15200. , 77.5],
## [11200. , 74.8],
## [49900. , 83. ],
## [ 6800. , 61.3]])
plt.subplot(1, 2, 1)
plot_corr(world[:, 0], world[:, 1])
plt.xlabel("per capita GDP PPP")
plt.ylabel("life expectancy (years)")
plt.subplot(1, 2, 2)
plot_corr(np.log(world[:, 0]), world[:, 1])
plt.xlabel("log(per capita GDP PPP)")
plt.yticks()
plt.show()
If we compute Pearson’s 𝑟 between these two, we will note a quite weak linear correl-
ation:
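A sketch of the omitted call (an assumption):
scipy.stats.pearsonr(world[:, 0], world[:, 1])[0]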
Anyhow, already the logarithm of GDP is quite strongly linearly correlated with life
expectancy:
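Presumably (an assumption):
scipy.stats.pearsonr(np.log(world[:, 0]), world[:, 1])[0]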
3 https://www.cia.gov/the-world-factbook/
Figure 9.6: Scatterplots for life expectancy vs gross domestic product (purchasing
power parity) on the linear (lefthand side) and log-scale (righthand side)
which means that modelling our data via 𝒚 = 𝑎 log 𝒙 + 𝑏 could be an idea worth con-
sidering.
Spearman's rank correlation coefficient, $\rho(\boldsymbol{x}, \boldsymbol{y}) = r(R(\boldsymbol{x}), R(\boldsymbol{y}))$, is4 simply the Pearson coefficient computed over
the vectors of the corresponding ranks of all the elements in $\boldsymbol{x}$ and $\boldsymbol{y}$ (denoted with $R(\boldsymbol{x})$ and $R(\boldsymbol{y})$, respectively).
Hence, the two following calls are equivalent:
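Presumably (an assumption), they were of the following kind:
scipy.stats.spearmanr(x, y)[0]
scipy.stats.pearsonr(scipy.stats.rankdata(x), scipy.stats.rankdata(y))[0]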
4 If a method Y is nothing else than X on transformed data, we should not consider it a totally new
method.
Let us point out that this measure is invariant with respect to monotone transforma-
tions of the input variables (up to the sign):
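For instance (a sketch, assuming x and y as before):
scipy.stats.spearmanr(x, y)[0]
scipy.stats.spearmanr(np.exp(x), y)[0]   # unchanged: exp is monotonically increasing
scipy.stats.spearmanr(x, -y)[0]          # only the sign flips: negation is decreasing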
This is because such transformations do not change the observations’ ranks (or only
reverse them).
Exercise 9.4 We included the 𝜌s in all the outputs generated by our plot_corr function. Re-
view all the above figures.
Exercise 9.5 Apply numpy.corrcoef and scipy.stats.rankdata (with the appropriate axis
argument) to compute the Spearman correlation matrix for all the variable pairs in body. Draw
it on a heatmap.
Exercise 9.6 (*) Draw the scatterplots of the ranks of each column in the world and body data-
sets.
In the k-nearest neighbour regression, the prediction at a new point $\boldsymbol{x}'$ is determined as follows:
1. Find the indices $N_k(\boldsymbol{x}') = \{i_1, \dots, i_k\}$ of the $k$ points from $\mathbf{X}$ closest to $\boldsymbol{x}'$, i.e., ones
that fulfil $\|\mathbf{x}_{i_l,\cdot} - \boldsymbol{x}'\| \le \|\mathbf{x}_{j,\cdot} - \boldsymbol{x}'\|$ for all $j \notin \{i_1, \dots, i_k\}$ and all $l = 1, \dots, k$.
2. Return the arithmetic mean of $y_{i_1}, \dots, y_{i_k}$ as the result.
For example, let us try expressing weight (the 1st column) as a function of hip circumference
(the 6th column) in the body dataset, weight = f(hip circumference) + (some error).
We can also model the life expectancy at birth in different countries (world dataset) as
a function of their GDP per capita (PPP), life expectancy = f(GDP per capita) + (some error).
Both are instances of the simple regression problem, i.e., where there is only one inde-
pendent variable (𝑚 = 1). We can easily create an appealing visualisation thereof by
means of the following function:
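Its definition is not included in this excerpt; a minimal sketch of such a helper (an assumption, for one-dimensional inputs only) could read:
def knn_regress(x_new, x, y, k):
    # for each query point, average the ys of its k nearest neighbours (w.r.t. x)
    y_pred = np.empty(len(x_new))
    for i, q in enumerate(x_new):
        nn = np.argsort(np.abs(x - q))[:k]    # indices of the k closest points
        y_pred[i] = np.mean(y[nn])
    return y_pred

def knn_regress_plot(x, y, ks, num_grid_points=1001):
    # scatter plot of the data plus the k-nearest neighbour regression curves
    plt.plot(x, y, "o", alpha=0.1)
    _x = np.linspace(np.min(x), np.max(x), num_grid_points)
    for k in ks:
        plt.plot(_x, knn_regress(_x, x, y, k), label=f"k={k}")
    plt.legend()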
Figure 9.7 depicts the fitted functions for a few different 𝑘s.
plt.subplot(1, 2, 1)
knn_regress_plot(body[:, 5], body[:, 0], [5, 25, 100])
plt.xlabel("hip circumference")
plt.ylabel("weight")
plt.subplot(1, 2, 2)
knn_regress_plot(world[:, 0], world[:, 1], [5, 25, 100])
plt.xlabel("per capita GDP PPP")
plt.ylabel("life expectancy (years)")
plt.show()
Figure 9.7: K-nearest neighbour regression curves for example datasets; the greater
the k, the more coarse-grained the approximation
We obtained a smoothened version of the original dataset. The fact that we do not re-
produce the reference data points in an exact manner is reflected by the (figurative)
error term in the above equations. Its role is to emphasise the existence of some nat-
ural data variability; after all, one’s weight is not purely determined by their hip size
and life is not all about money.
For small 𝑘 we adapt to the data points better. This can be a good thing unless data
are very noisy. The greater the 𝑘, the smoother the approximation at the cost of losing
fine detail and restricted usability at the domain boundaries (here: in the left and right
part of the plots).
Usually, the number of neighbours is chosen by trial and error (just like the number of
bins in a histogram; compare Section 4.3.3).
Note (**) Some methods use weighted arithmetic means for aggregating the 𝑘 refer-
ence outputs, with weights inversely proportional to the distances to the neighbours
(closer inputs are considered more important).
Also, instead of few nearest neighbours, we can easily implement some form of
fixed-radius search regression, by simply replacing 𝑁𝑘 (𝒙′ ) with 𝐵𝑟 (𝒙′ ); compare Sec-
tion 8.4.4. Yet, note that this way we might make the function undefined in sparsely
populated regions of the domain.

A linear model involving m independent variables takes the form
$$y = f(x_1, x_2, \dots, x_m) = c_1 x_1 + c_2 x_2 + \dots + c_m x_m + c_{m+1},$$
or, in matrix form,
$$y = \mathbf{c}\mathbf{x}^T + c_{m+1};$$
in the simple case of m = 1, this is just the equation of a straight line,
$$y = ax + b.$$
Important A separate intercept “+𝑐𝑚+1 ” term in the defining equation can be quite
inconvenient, notationwisely. We will thus restrict ourselves to linear maps like:
𝑦 = 𝐜𝐱𝑇 ,
but where we can possibly have an explicit constant-1 component somewhere inside 𝐱,
for instance:
𝐱 = [𝑥1 𝑥2 ⋯ 𝑥𝑚 1].
Together with 𝐜 = [𝑐1 𝑐2 ⋯ 𝑐𝑚 𝑐𝑚+1 ], as trivially 𝑐𝑚+1 ⋅ 1 = 𝑐𝑚+1 , this new setting
is equivalent to the original one.
Without loss in generality, from now on we assume that 𝐱 is 𝑚-dimensional, regard-
less of its having a constant-1 inside or not.
The sum of squared residuals can be written compactly as
$\mathrm{SSR}(\mathbf{c}; \mathbf{X}, \mathbf{y}) = (\mathbf{y} - \mathbf{c}\mathbf{X}^T)(\mathbf{y} - \mathbf{c}\mathbf{X}^T)^T$,
because $\hat{\mathbf{y}} = \mathbf{c}\mathbf{X}^T$ gives the predicted values as a row vector (the kind reader is encouraged
to check that on a piece of paper now), $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$ computes all the $n$ residuals,
and $\mathbf{r}\mathbf{r}^T$ gives their sum of squares.
The method of least squares is one of the simplest and most natural approaches to
regression analysis (curve fitting). Its theoretical foundations (calculus…) were de-
veloped more than 200 years ago by Gauss and then were polished by Legendre.
Note (*) Had the points lain on a hyperplane exactly (the interpolation problem),
5 To memorise the model for further reference, we only need to serialise its m coefficients, e.g., in a JSON
or CSV file.
6 Due to computability and mathematical analysability, which we usually explore in more advanced
𝐲 = 𝐜𝐗𝑇 would have an exact solution, equivalent to solving the linear system of equa-
tions 𝐲−𝐜𝐗𝑇 = 𝟎. However, in our setting we assume that there might be some meas-
urement errors or other discrepancies between the reality and the theoretical model.
To account for this, we are trying to solve a more general problem of finding a hyper-
plane for which ‖𝐲 − 𝐜𝐗𝑇 ‖2 is as small as possible.
This optimisation task can be solved analytically (compute the partial derivatives of
SSR with respect to each 𝑐1 , … , 𝑐𝑚 , equate them to 0, and solve a simple system of
linear equations). This results in 𝐜 = 𝐲𝐗(𝐗𝑇 𝐗)−1 , where 𝐀−1 is the inverse of a mat-
rix 𝐀, i.e., the matrix such that 𝐀𝐀−1 = 𝐀−1 𝐀 = 𝐈; compare numpy.linalg.inv. As
inverting larger matrices directly is not too robust, numerically speaking, we prefer
relying upon some more specialised algorithms to determine the solution.
The scipy.linalg.lstsq function that we use below provides a quite numerically stable
(yet, see Section 9.2.9) procedure that is based on the singular value decomposition of
the model matrix.
Let us go back to the NHANES study excerpt and express weight (the 1st column) as
function of hip circumference (the 6th column) again, but this time using an affine
map of the form7 :
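The omitted equation and data-setup code presumably (an assumption) looked along these lines, with the model being roughly weight ≃ a · (hip circumference) + b:
x_original = body[:, 5]                      # hip circumference
X_train = x_original.reshape(-1, 1)**[1, 0]  # each row becomes (x_i, 1)
y_train = body[:, 0]                         # weight
(The preview_indices vector used below is assumed to be some fixed set of row indices chosen earlier; it is not defined in this excerpt.)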
We used the vectorised exponentiation operator to convert each 𝑥𝑖 (the i-th hip cir-
cumference) to a pair 𝐱𝑖,⋅ = (𝑥𝑖1 , 𝑥𝑖0 ) = (𝑥𝑖 , 1), which is a nice trick to append a
column of 1s to a matrix. This way, we included the intercept term in the model (as
discussed in Section 9.2.2). Here is a preview:
7 We sometimes explicitly list the error term that corresponds to the residuals. This is to assure the
reader that we are not naïve and that we know what we are doing. We see from the scatterplot of the
involved variables that the data do not lie on a straight line perfectly. Each model is merely an idealisa-
tion/simplification of the described reality. It is wise to remind ourselves about that every so often.
import scipy.linalg
res = scipy.linalg.lstsq(X_train, y_train)
That’s it. The optimal coefficients vector (the one that minimises the SSR) is:
c = res[0]
c
## array([ 1.3052463 , -65.10087248])
Let us contemplate the fact that the model is nicely interpretable. For instance, as hip
circumference increases, we expect the weights to be greater and greater. As we said
before, it does not mean that there is some causal relationship between the two (for
instance, there can be some latent variables that affect both of them). Instead, there
is some general tendency regarding how the data align in the sample space. For in-
stance, that the “best guess” (according to the current model – there can be many; see
below) weight for a person with hip circumference of 100 cm is 65.4 kg. Thanks to such
models, we might understand certain phenomena better or find some proxies for dif-
ferent variables (especially if measuring them directly is tedious, costly, dangerous,
etc.).
Let us determine the predicted weights for all of the participants:
y_pred = c @ X_train.T
np.round(y_pred[preview_indices], 2) # preview
## array([55.63, 74.17, 60.59, 68.03, 58.64, 62.16])
The scatterplot and the fitted regression line in Figure 9.8 indicate a quite good fit,
but of course there is some natural variability.
Figure 9.8: The least squares line for weight vs hip circumference
Exercise 9.7 The Anscombe quartet8 is a famous example dataset, where we have four pairs of
variables that have almost identical means, variances, and linear correlation coefficients. Even
though they can be approximated by the same straight line, their scatter plots are vastly different.
Reflect upon this toy example.
We wanted the squared residuals (on average – across all the points) to be as small
as possible. The least squares method assures that this is the case relative to the chosen
model, i.e., a linear one. Nonetheless, it still does not mean that what we obtained con-
stitutes a good fit to the training data. Thus, we need to perform the analysis of residuals.
Interestingly, the average of residuals is always zero:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0.$$
Therefore, if we want to summarise the residuals into a single number, we should instead aggregate, e.g., their squares or absolute values. In particular, the root mean squared error is given by $\mathrm{RMSE}(\mathbf{y}, \hat{\mathbf{y}}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$:
8 https://github.com/gagolews/teaching-data/raw/master/r/anscombe.csv
Figure 9.9: The fitted line, a predicted output, a residual, and an observed value (reference output) for the weight vs hip circumference data
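The residuals, r, were computed in a part not reproduced here; presumably simply:
r = y_train - y_pred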
np.sqrt(np.mean(r**2))
## 6.948470091176111
Hopefully we can see that RMSE is a function of SSR that we sought to minimise above.
Alternatively, we can compute the mean absolute error:
$$\mathrm{MAE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.$$
np.mean(np.abs(r))
## 5.207073583769202
Note Generally, fitting simple (involving one independent variable) linear models can
only make sense for highly linearly correlated variables. Interestingly, if 𝒚 and 𝒙 are
both standardised, and 𝑟 is their Pearson’s coefficient, then the least squares solution
is given by 𝑦 = 𝑟𝑥.
To verify whether a fitted model is not extremely wrong (e.g., when we fit a linear
model to data that clearly follows a different functional relationship), a plot of resid-
uals against the fitted values can be of help; see Figure 9.10. Ideally, the points should
be aligned totally at random therein, without any dependence structure (homosce-
dasticity).
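Such a plot can be generated, e.g., as follows (a sketch, not necessarily the original code):
plt.plot(y_pred, r, "o", alpha=0.1)
plt.axhline(0, ls="--", color="gray")   # residuals should oscillate around 0
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()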
Figure 9.10: Residuals vs fitted values for the linear model explaining weight as a func-
tion of hip circumference; the variance of residuals slightly increases as 𝑦𝑖̂ increases,
which is not ideal, but it could be much worse than this
Exercise 9.9 Compare9 the RMSE and MAE for the k-nearest neighbour regression curves de-
picted in the lefthand side of Figure 9.7. Also, draw the residuals vs fitted plot.
9 In k-nearest neighbour regression, we are not aiming to minimise anything in particular. If the model
is good with respect to some metrics such as RMSE or MAE, we can consider ourselves lucky. Nevertheless,
some asymptotic results guarantee the optimality of the outcomes generated for large sample sizes (e.g.,
consistency); see, e.g., [22].
For linear models fitted using the least squares method, it can be shown that it holds:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 + \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
In other words, the variance of the dependent variable (left) can be decomposed into
the sum of the variance of the predictions and the averaged squared residuals. Multiplying
the above by $n$, we have that the total sum of squares is equal to the explained
sum of squares plus the residual sum of squares, $\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$. On this basis,
the coefficient of determination (R-squared) can be defined as $R^2(\mathbf{y}, \hat{\mathbf{y}}) = 1 - \mathrm{RSS}/\mathrm{TSS}$,
which in our case equals:
1 - np.var(y_train-y_pred)/np.var(y_train)
## 0.8959634726270759
The coefficient of determination in the current context10 is thus the proportion of vari-
ance of the dependent variable explained by the independent variables in the model.
The closer it is to 1, the better. A dummy model that always returns the mean of 𝒚 gives
R-squared of 0.
In our case, 𝑅2 ≃ 0.9 is quite high, which indicates a rather good fit.
Note (*) There are certain statistical results that can be relied upon provided that
the residuals are independent random variables with expectation zero and the same
variance (e.g., the Gauss–Markov theorem). Further, if they are normally distributed,
then we have several hypothesis tests available (e.g., for the significance of coeffi-
cients). This is why in various textbooks such assumptions are additionally verified.
But we do not go that far in this introductory course.
particularly when the fit is extremely bad. Also, note that this measure is dataset-dependent. Therefore, it
should not be used for comparing models explaining different dependent variables.
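The code that fitted this model is not reproduced above; presumably (an assumption), it proceeded along these lines, with an explicit column of ones for the intercept:
X_train = np.column_stack((body[:, [4, 5]], np.ones(body.shape[0])))  # arm circ, hip circ, 1
y_train = body[:, 0]                                                  # weight
c = scipy.linalg.lstsq(X_train, y_train)[0]                           # three coefficients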
We skip the visualisation part, because we do not expect it to result in a readable plot:
these are multidimensional data. The coefficient of determination is:
y_pred = c @ X_train.T
r = y_train - y_pred
1-np.var(r)/np.var(y_train)
## 0.9243996585518783
np.sqrt(np.mean(r**2))
## 5.923223870044694
np.mean(np.abs(r))
## 4.431548244333893
It is a slightly better model than the previous one. We can predict the participants'
weights with better precision, at the cost of increased model complexity.
$$f(x_1, x_2, \dots, x_6) = c_1 x_1 + c_2 x_2 + \dots + c_6 x_6.$$
The design matrix is made of rubber: it can handle almost anything. If we have a linear
model, but with respect to transformed data, the algorithm does not care. This is the
beauty of the underlying mathematics; see also [10].
A creative modeller can also turn models such as 𝑢 = 𝑐𝑒𝑎𝑣 into 𝑦 = 𝑎𝑥 + 𝑏 by replacing
𝑦 = log 𝑢, 𝑥 = 𝑣, and 𝑏 = log 𝑐. There are numerous possibilities based on the proper-
ties of the log and exp functions that we listed in Section 5.2. We call them linearisable
models.
As an example, let us model the life expectancy at birth in different countries as a func-
tion of their GDP per capita (PPP).
We will consider four different models:
1. 𝑦 = 𝑐1 + 𝑐2 𝑥 (linear),
2. 𝑦 = 𝑐1 + 𝑐2 𝑥 + 𝑐3 𝑥2 (quadratic),
3. 𝑦 = 𝑐1 + 𝑐2 𝑥 + 𝑐3 𝑥2 + 𝑐4 𝑥3 (cubic),
4. 𝑦 = 𝑐1 + 𝑐2 log 𝑥 (logarithmic).
Here are the helper functions that create the model matrices:
def make_model_matrix1(x):
return x.reshape(-1, 1)**[0, 1]
def make_model_matrix2(x):
return x.reshape(-1, 1)**[0, 1, 2]
def make_model_matrix3(x):
return x.reshape(-1, 1)**[0, 1, 2, 3]
def make_model_matrix4(x):
return (np.log(x)).reshape(-1, 1)**[0, 1]
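The labels printed below ("linear model", and so forth) suggest that the functions' __name__ attributes were additionally overridden in a part not shown here; presumably something like:
make_model_matrix1.__name__ = "linear model"
make_model_matrix2.__name__ = "quadratic model"
make_model_matrix3.__name__ = "cubic model"
make_model_matrix4.__name__ = "logarithmic model"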
model_matrix_makers = [
make_model_matrix1,
make_model_matrix2,
make_model_matrix3,
make_model_matrix4
]
x_original = world[:, 0]
Xs_train = [ make_model_matrix(x_original)
for make_model_matrix in model_matrix_makers ]
y_train = world[:, 1]
cs = [ scipy.linalg.lstsq(X_train, y_train)[0]
for X_train in Xs_train ]
for i in range(len(Xs_train)):
R2 = 1 - np.var(y_train - cs[i] @ Xs_train[i].T)/np.var(y_train)
print(f"{model_matrix_makers[i].__name__:20} R2={R2:.3f}")
## linear model R2=0.431
## quadratic model R2=0.567
## cubic model R2=0.607
## logarithmic model R2=0.651
The logarithmic model is thus the best (out of the models we considered). The four
models are depicted in Figure 9.11.
Exercise 9.10 Draw box plots and histograms of residuals for each model as well as the scatter-
plots of residuals vs fitted values.
Figure 9.11: The fitted models for life expectancy (years) vs per capita GDP PPP (legend: linear model, quadratic model, cubic model, logarithmic model)
In line with Occam's razor, out of models that explain the data comparably well, the
simplest explanation should be chosen (do not multiply entities [here: introduce inde-
pendent variables] without necessity).
In particular, the more independent variables we have in the model, the greater the
𝑅2 coefficient will be. We can try correcting for this phenomenon by considering the
adjusted $R^2$:
$$\bar{R}^2(\mathbf{y}, \hat{\mathbf{y}}) = 1 - \left(1 - R^2(\mathbf{y}, \hat{\mathbf{y}})\right) \frac{n-1}{n-m-1},$$
which, to some extent, penalises more complex models.
Note (**) Model quality measures adjusted for the number of model parameters, m,
can also be useful in automated variable selection; for example, the Akaike Information
Criterion is one popular measure.
We should also be interested in a model’s predictive power – how well does it generalise
to data points that we do not have now (or pretend we do not have), but might face
in the future. As we observe the modelled reality only at a few different points, the
question is how the model performs when filling the gaps between the dots it connects.
In particular, we should definitely be careful when extrapolating the data, i.e., making
predictions outside of its usual domain. For example, the linear model predicts the
following life expectancy for an imaginary country with $500,000 per capita GDP:
cs[0] @ model_matrix_makers[0](np.array([500000])).T
## array([164.3593753])
The quadratic model fares no better:
cs[1] @ model_matrix_makers[1](np.array([500000])).T
## array([-364.10630779])
Nonsense.
Example 9.11 Let us consider the following theoretical illustration. Assume that a true model
of some reality is 𝑦 = 5 + 3𝑥3 .
def true_model(x):
return 5 + 3*(x**3)
Still, for some reason we are only able to gather a small (𝑛 = 25) sample from this model. What
is even worse, it is subject to some measurement error:
np.random.seed(42)
x = np.random.rand(25) # random xs on [0, 1]
y = true_model(x) + 0.2*np.random.randn(len(x)) # true_model(x) + noise
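The code fitting a model of the form y = c1 + c2*x**3 is not reproduced above; a sketch of what it presumably looked like (the name X03 is an assumption; c03 and ssr03 are used later in the text):
X03 = x.reshape(-1, 1)**[0, 3]
c03 = scipy.linalg.lstsq(X03, y)[0]       # the two estimated coefficients
ssr03 = np.sum((y - c03 @ X03.T)**2)      # sum of squared residuals on the training data
c03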
which is not too far, but still somewhat11 distant from the true coefficients, 5 and 3.
We can also fit a more flexible cubic polynomial, 𝑦 = 𝑐1 + 𝑐2 𝑥 + 𝑐3 𝑥2 + 𝑐4 𝑥3 :
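Again, the corresponding code is omitted; presumably:
X0123 = x.reshape(-1, 1)**[0, 1, 2, 3]
c0123 = scipy.linalg.lstsq(X0123, y)[0]
ssr0123 = np.sum((y - c0123 @ X0123.T)**2)
c0123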
11 For large n, we expect to pinpoint the true coefficients exactly. This is because, in our scenario (independent,
normally distributed errors with the expectation of 0), the least squares method is the maximum
likelihood estimator of the model parameters. As a consequence, it is consistent.
In terms of the SSR, this more complex model of course explains the training data better:
ssr03, ssr0123
## (1.0612111154029558, 0.9619488226837544)
Yet, it is farther away from the truth (which, whilst performing the fitting task based only on
given 𝒙 and 𝒚, is unknown). We may thus say that the first model generalises better on yet-to-
be-observed data; see Figure 9.12 for an illustration.
_x = np.linspace(0, 1, 101)
plt.plot(x, y, "o")
plt.plot(_x, true_model(_x), "--", label="true model")
plt.plot(_x, c0123 @ (_x.reshape(-1, 1)**[0, 1, 2, 3]).T,
label="fitted model y=x**[0, 1, 2, 3]")
plt.plot(_x, c03 @ (_x.reshape(-1, 1)**[0, 3]).T,
label="fitted model y=x**[0, 3]")
plt.legend()
plt.show()
Figure 9.12: The true (theoretical) model vs some guesstimates (fitted based on noisy
data); more degrees of freedom is not always better
Example 9.12 (**) We defined the sum of squared residuals (and its function, the root mean
squared error) by means of the averaged deviation from the reference values, which in fact are
themselves subject to error. Even though they are our best-shot approximation of the truth, they
should be taken with a degree of scepticism.
In the above example, given the true (reference) model $f$ defined over the domain $D$ (in our case,
$f(x) = 5 + 3x^3$ and $D = [0, 1]$) and an empirically fitted model $\hat{f}$, we can compute the square
root of the integrated squared error over the whole $D$:
$$\mathrm{RMSE}(f, \hat{f}) = \sqrt{\int_D \left(\hat{f}(x) - f(x)\right)^2 \, dx}.$$
For polynomials and other simple functions, RMSE can be computed analytically. More gener-
ally, we can approximate it numerically by sampling the above at sufficiently many points and
applying the trapezoidal rule (e.g., [73]). As this can be an educative programming exercise, be-
low we consider a range of polynomial models of different degrees.
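The sketch below (an assumption; the original experiment also measured the error against the true model) fits polynomials of degrees 1 through 9 and stores the objects that the next code chunk relies on:
ps = np.arange(1, 10)    # the considered polynomial degrees
cs = []                  # the fitted coefficients, one vector per degree
for p in ps:
    X_poly = x.reshape(-1, 1)**np.arange(p+1)       # 1, x, x**2, ..., x**p
    cs.append(scipy.linalg.lstsq(X_poly, y)[0])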
In Figure 9.13, we see that a model's ability to make correct generalisations onto unseen data
initially improves with the increased complexity, but then becomes worse. It is quite a typical
behaviour. In fact, the model with the smallest RMSE on the training set overfits to the input
sample; see Figure 9.14.
plt.plot(x, y, "o")
plt.plot(_x, true_model(_x), "--", label="true model")
for i in [0, 1, 8]:
plt.plot(_x, cs[i] @ (_x.reshape(-1, 1)**np.arange(ps[i]+1)).T,
        label=f"degree-{ps[i]} polynomial")    # (assumed continuation of the truncated listing)
plt.legend()
plt.show()
Figure 9.13: Small RMSE on training data does not necessarily imply good generalisation
abilities (error as a function of the model complexity (polynomial degree))
Figure 9.14: The model with the smallest RMSE on the training set overfits to the input sample
Overall, models should never be blindly trusted – common sense must always be ap-
plied. The fact that we fitted something using a sophisticated procedure on a dataset
that was hard to obtain does not justify its use. Mediocre models must be discarded,
and we should move on, regardless of how much time/resources we have invested
whilst developing them. Too many bad models go into production and make our daily
lives harder. We should end this madness.
Important sklearn is very convenient but allows for fitting models even if we do
not understand the mathematics behind them. This is dangerous – it is like driving
a sports car without the necessary skills and, at the same time, wearing a blindfold.
Advanced students and practitioners will appreciate it, but if used by beginners, it
needs to be handled with care; we should not mistake something’s being easily access-
ible with its being safe to use. Remember that if we are given a function implement-
ing some procedure for which we are not able to provide its definition/mathematical
properties/explain its idealised version using pseudocode, we should refrain from us-
ing it (see Rule#7).
12 https://scikit-learn.org/stable/index.html
Because of the above, we shall only present a quick demo of scikit-learn’s API. Let us
do that by fitting a multiple linear regression model for, again, weight as a function of
the arm and the hip circumference:
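Presumably (an assumption), the training data here are just the two raw columns, without an explicit column of ones, as scikit-learn fits the intercept separately:
X_train = body[:, [4, 5]]   # arm circumference, hip circumference
y_train = body[:, 0]        # weight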
import sklearn.linear_model
lm = sklearn.linear_model.LinearRegression(fit_intercept=True)
lm.fit(X_train, y_train)
lm.intercept_, lm.coef_
## (-63.383425410947524, array([1.30457807, 0.8986582 ]))
y_pred = lm.predict(X_train)
import sklearn.metrics
sklearn.metrics.r2_score(y_train, y_pred)
## 0.9243996585518783
The above function is convenient, but can we really recall the formula for the score and
what it measures? We should always be able to do that.
Let us fit a degree-4 polynomial to the life expectancy vs per capita GDP dataset.
x_original = world[:, 0]
X_train = (x_original.reshape(-1, 1))**[0, 1, 2, 3, 4]
y_train = world[:, 1]
cs = dict()
13 There are methods in statistical learning where there might be multiple local minima – this is even
We store the estimated model coefficients in a dictionary, because many methods will
follow next. First, scipy:
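A sketch of the omitted call (an assumption; the key name scipy_X matches the legend of Figure 9.15):
res = scipy.linalg.lstsq(X_train, y_train)
cs["scipy_X"] = res[0]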
If we drew the fitted polynomial now (see Figure 9.15), we would see that the fit is
unbelievably bad. The result returned by scipy.linalg.lstsq is now not at all optimal.
All coefficients are approximately equal to 0.
It turns out that the fitting problem is extremely ill-conditioned (and it is not the
algorithm's fault): GDPs range from very small to very large ones. Furthermore, raising
them to powers of up to 4 results in numbers of an ever greater range. Finding the least
squares solution involves some form of matrix inverse (not necessarily directly) and
our model matrix may be close to singular (one that is not invertible).
As a measure of the model matrix’s ill-conditioning, we often use the so-called con-
dition number, denoted 𝜅(𝐗𝑇 ), being the ratio of the largest to the smallest so-called
singular values14 of 𝐗𝑇 . They are in fact returned by the scipy.linalg.lstsq method
itself:
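Presumably (an assumption, mirroring the later use of resZ[3]):
s = res[3]   # the singular values of the model matrix
s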
Note that they are already sorted nonincreasingly. The condition number 𝜅(𝐗𝑇 ) is
equal to:
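That is, roughly (a sketch):
s[0] / s[-1]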
As a rule of thumb, if the condition number is 10𝑘 , we are losing 𝑘 digits of numerical
precision when performing the underlying computations. We are thus currently faced
with a very ill-conditioned problem, because the above number is exceptionally large.
We expect that if the values in 𝐗 or 𝐲 are perturbed even slightly, it can result in very
large changes in the computed regression coefficients.
Note (**) The least squares regression problem can be solved by means of the singular
value decomposition of the model matrix; see Section 9.3.4. Namely, if $\mathbf{U}\mathbf{S}\mathbf{Q}$ is the SVD of $\mathbf{X}$,
then (provided all the singular values are nonzero) the solution can be written as
$\mathbf{c} = \mathbf{y}\mathbf{U}\mathbf{S}^{-1}\mathbf{Q}$, which avoids forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly.
14 (**) Being themselves the square roots of eigenvalues of 𝐗𝑇 𝐗. Equivalently, 𝜅(𝐗𝑇 ) = ‖(𝐗𝑇 )−1 ‖ ‖𝐗𝑇 ‖
with respect to the spectral norm. Seriously, we really need linear algebra when we even remotely think
about practising data science. Let us add it to our life skills bucket list.
Let us verify the method used by scikit-learn. As it fits the intercept separately, we
expect it to be slightly better-behaving. Nevertheless, let us keep in mind that it is
merely a wrapper around scipy.linalg.lstsq with a different API.
import sklearn.linear_model
lm = sklearn.linear_model.LinearRegression(fit_intercept=True)
lm.fit(X_train[:, 1:], y_train)
cs["sklearn"] = np.r_[lm.intercept_, lm.coef_]
cs["sklearn"]
## array([ 6.92257708e+01, 5.05752755e-13, 1.38835643e-08,
## -2.18869346e-13, 9.09347772e-19])
lm.singular_[0] / lm.singular_[-1]
## 1.4026032298428496e+16
The condition number is also enormous. Still, scikit-learn did not warn us about this
being the case (insert frowning face emoji here). Had we trusted the solution returned
by it, we would end up with conclusions from our data analysis built on sand. As we
said in Section 9.2.8, the package design assumes that its users know what they are
doing. This is okay, we are all adults here, although some of us are still learning.
Overall, if the model matrix is close to singular, the computation of its inverse is prone
to enormous numerical errors. One way of dealing with this is to remove highly cor-
related variables (the multicollinearity problem). Interestingly, standardisation can
sometimes make the fitting more numerically stable.
Let 𝐙 be a standardised version of the model matrix 𝐗 with the intercept part (the
column of 1s) not included, i.e., with 𝐳⋅,𝑗 = (𝐱⋅,𝑗 − 𝑥𝑗̄ )/𝑠𝑗 where 𝑥𝑗̄ and 𝑠𝑗 denotes
the arithmetic mean and standard deviation of the j-th column in 𝐗. If (𝑑1 , … , 𝑑𝑚−1 )
is the least squares solution for 𝐙, then the least squares solution to the underlying
original regression problem is:
$$\boldsymbol{c} = \left( \bar{y} - \sum_{j=1}^{m-1} \frac{d_j}{s_j} \bar{x}_j,\; \frac{d_1}{s_1},\; \frac{d_2}{s_2},\; \dots,\; \frac{d_{m-1}}{s_{m-1}} \right),$$
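A sketch of the omitted computations (an assumption; the key name scipy_Z matches the legend of Figure 9.15):
means = np.mean(X_train[:, 1:], axis=0)
stds = np.std(X_train[:, 1:], axis=0)
Z = (X_train[:, 1:] - means)/stds               # standardised columns, no column of 1s
resZ = scipy.linalg.lstsq(Z, y_train)
d = resZ[0]
cs["scipy_Z"] = np.r_[np.mean(y_train) - (d/stds) @ means, d/stds]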
s = resZ[3]
s[0] / s[-1]
## 139.42792257372338
This is still far from perfect (we would prefer a value close to 1) but nevertheless way
better.
Figure 9.15 depicts the three fitted models, each claiming to be the solution to the ori-
ginal regression problem. Note that, luckily, we know that in our case the logarithmic
model is better than the polynomial one.
Exercise 9.13 Check the condition numbers of all the models fitted so far in this chapter via the
least squares method.
To be strict, if we read a paper in, say, social or medical sciences (amongst others)
where the researchers fit a regression model but do not provide the model matrix’s
condition number, we should doubt the conclusions they make.
Figure 9.15: An ill-conditioned model matrix can result in the fitted models' being very
wrong (legend: scipy_X SSR=562307.49, sklearn SSR=6018.16, scipy_Z SSR=4334.68;
life expectancy (years) vs per capita GDP PPP)

On a final note, we might wonder why the standardisation is not done automatically
by the least squares solver. As usual with most numerical methods, there is no one-
fits-all solution: e.g., when there are columns of extremely small variance or there are
outliers in data. This is why we need to study all the topics deeply: to be able to respond
flexibly to many different scenarios ourselves.
The dot product of two vectors can be expressed as $\boldsymbol{x} \cdot \boldsymbol{y} = \|\boldsymbol{x}\|\, \|\boldsymbol{y}\| \cos\alpha$,
where $\alpha$ is the angle between the given vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^n$. In plain English, it is the
product of the magnitudes of the two vectors and the cosine of the angle between them.
We can obtain the cosine part by computing the dot product of the normalised vectors,
i.e., such that their magnitudes are equal to 1:
$$\cos\alpha = \frac{\boldsymbol{x}}{\|\boldsymbol{x}\|} \cdot \frac{\boldsymbol{y}}{\|\boldsymbol{y}\|}.$$
For example, consider two vectors in ℝ2 , 𝒖 = (1/2, 0) and 𝒗 = (√2/2, √2/2), which
are depicted in Figure 9.16.
u = np.array([0.5, 0])
v = np.array([np.sqrt(2)/2, np.sqrt(2)/2])
np.sum(u*v)
## 0.3535533905932738
The dot product of their normalised versions, i.e., the cosine of the angle between
them is:
u_norm = u/np.sqrt(np.sum(u*u))
v_norm = v/np.sqrt(np.sum(v*v)) # BTW: this vector is already normalised
np.sum(u_norm*v_norm)
## 0.7071067811865476
The angle itself can be determined by referring to the inverse of the cosine function,
i.e., arccosine.
np.arccos(np.sum(u_norm*v_norm)) * 180/np.pi
## 45.0
Figure 9.16: Two example vectors: u = [0.500, 0.000] and v = [0.707, 0.707]
Important If two vectors are collinear (codirectional, one is a scaled version of another,
angle 0), then $\cos 0 = 1$. If they point in opposite directions ($\pm\pi = \pm 180°$ angle), then
$\cos(\pm\pi) = -1$. For vectors that are orthogonal (perpendicular, $\pm\frac{\pi}{2} = \pm 90°$ angle),
we get $\cos(\pm\frac{\pi}{2}) = 0$.
Note (**) The standard deviation 𝑠 of a vector 𝒙 ∈ ℝ𝑛 that has already been centred
(whose components’ mean is 0) is a scaled version of its magnitude, i.e., 𝑠 = ‖𝒙‖/√𝑛.
Looking at the definition of the Pearson linear correlation coefficient in Section 9.1.1,
we see that it is the dot product of the standardised versions of two vectors 𝒙 and 𝒚
divided by the number of elements therein. If the vectors are centred, we can rewrite
the formula equivalently as $r(\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x}}{\|\boldsymbol{x}\|} \cdot \frac{\boldsymbol{y}}{\|\boldsymbol{y}\|}$ and thus $r(\boldsymbol{x}, \boldsymbol{y}) = \cos\alpha$. It is not easy to
imagine vectors in high-dimensional spaces, but from this observation we can at least
imply the fact that 𝑟 is bounded between -1 and 1. In this context, being not linearly
correlated corresponds to the vectors' orthogonality.

If $\mathbf{S}$ is a diagonal matrix,
$$\mathbf{S} = \mathrm{diag}(s_1, s_2, \dots, s_m) = \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_m \end{bmatrix},$$
then $\mathbf{X}\mathbf{S}$ represents scaling (stretching) with respect to the individual axes of the
coordinate system, because:
$$\mathbf{X}\mathbf{S} = \begin{bmatrix} s_1 x_{1,1} & s_2 x_{1,2} & \cdots & s_m x_{1,m} \\ s_1 x_{2,1} & s_2 x_{2,2} & \cdots & s_m x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ s_1 x_{n-1,1} & s_2 x_{n-1,2} & \cdots & s_m x_{n-1,m} \\ s_1 x_{n,1} & s_2 x_{n,2} & \cdots & s_m x_{n,m} \end{bmatrix}.$$
In particular, for any angle 𝛼, the matrix representing the corresponding rotation in
ℝ2 :
$$\mathbf{R}(\alpha) = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix},$$
is orthonormal (which can be easily verified using the basic trigonometric equalities).
Furthermore:
$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix},$$
represent the two reflections, one against the x- and the other against the y-axis, re-
spectively. Both are orthonormal matrices as well.
Consider a dataset 𝐗′ in ℝ2 :
np.random.seed(12345)
Xp = np.random.randn(10000, 2) * 0.25
$$\mathbf{X} = \mathbf{X}' \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix} \begin{bmatrix} \cos\frac{\pi}{6} & \sin\frac{\pi}{6} \\ -\sin\frac{\pi}{6} & \cos\frac{\pi}{6} \end{bmatrix} + \begin{bmatrix} 3 & 2 \end{bmatrix}.$$
t = np.array([3, 2])
S = np.diag([2, 0.5])
S
## array([[2. , 0. ],
## [0. , 0.5]])
alpha = np.pi/6
Q = np.array([
[ np.cos(alpha), np.sin(alpha)],
[-np.sin(alpha), np.cos(alpha)]
])
X = Xp @ S @ Q + t   # (assumed continuation: apply the scaling, the rotation, and the shift)
Figure 9.17: A dataset and its scaled, rotated, and shifted version
The computing of such linear combinations of columns is not rare during a dataset’s
preprocessing step, especially if they are on the same scale or are unitless. As a matter
of fact, the standardisation itself is a form of scaling and translation.
Exercise 9.14 Assume that we have a dataset with two columns, giving the number of apples
and the number of oranges in clients’ baskets. What orthonormal and scaling transforms should
be applied to obtain a matrix bearing the total number of fruits and surplus apples (e.g., a row
(4, 7) should be converted to (11, −3))?

The inverse of a square matrix $\mathbf{A}$, if it exists, is the matrix $\mathbf{A}^{-1}$ such that
$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.$$
Noting that the identity matrix $\mathbf{I}$ is the neutral element of the matrix multiplication,
the above is thus the analogue of the inverse of a scalar: something like
$3 \cdot 3^{-1} = 3 \cdot \frac{1}{3} = \frac{1}{3} \cdot 3 = 1$.
Important For any invertible matrices of admissible shapes, it might be shown that
the following noteworthy properties hold:
• (𝐀−1 )𝑇 = (𝐀𝑇 )−1 ,
• (𝐀𝐁)−1 = 𝐁−1 𝐀−1 ,
• a matrix equality 𝐀 = 𝐁𝐂 holds if and only if 𝐀𝐂−1 = 𝐁𝐂𝐂−1 = 𝐁; this is also
equivalent to $\mathbf{B}^{-1}\mathbf{A} = \mathbf{B}^{-1}\mathbf{B}\mathbf{C} = \mathbf{C}$.

Consequently, the original dataset can be recovered by applying the inverse transformations
in the reverse order: $\mathbf{X}' = (\mathbf{X} - \mathbf{t})\mathbf{Q}^T \mathbf{S}^{-1}$.
Let us verify this numerically (testing equality up to some inherent round-off error):
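A sketch of such a check (an assumption, relying on the objects defined above):
np.allclose(Xp, (X - t) @ Q.T @ np.diag(1/np.diag(S)))
## True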
𝐗 = 𝐔𝐒𝐐,
where:
• $\mathbf{U}$ is an n-by-m semi-orthonormal matrix (its columns are orthonormal vectors; it
holds $\mathbf{U}^T \mathbf{U} = \mathbf{I}$),
• $\mathbf{S}$ is an m-by-m diagonal matrix with nonnegative elements on the main diagonal, sorted nonincreasingly,
• $\mathbf{Q}$ is an m-by-m orthonormal matrix.
Important In data analysis, we usually apply the SVD on matrices that have already
been centred (so that their column means are all 0).
For example:
import scipy.linalg
n = X.shape[0]
X_centred = X - np.mean(X, axis=0)
U, s, Q = scipy.linalg.svd(X_centred, full_matrices=False)
And now:
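Presumably (an assumption), a sanity check that the factorisation reconstructs the input:
np.allclose(X_centred, U @ np.diag(s) @ Q)
## True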
The norms of all the columns in 𝐔 are all equal to 1 (and hence standard deviations are
1/√𝑛). Consequently, they are on the same scale:
What is more, they are orthogonal: their dot products are all equal to 0. Regard-
ing what we said about Pearson’s linear correlation coefficient and its relation to dot
products of normalised vectors, we imply that the columns in 𝐔 are not linearly cor-
related. In some sense, they form independent dimensions.
Now, it holds 𝐒 = diag(𝑠1 , … , 𝑠𝑚 ), with the elements on the diagonal being:
s
## array([49.72180455, 12.5126241 ])
The elements on the main diagonal of 𝐒 are used to scale the corresponding columns
in 𝐔. The fact that they are ordered decreasingly means that the first column in 𝐔𝐒 has
the greatest standard deviation, the second column has the second greatest variability,
and so forth.
S = np.diag(s)
US = U @ S
Multiplying 𝐔𝐒 by 𝐐 simply rotates and/or reflects the dataset. This brings 𝐔𝐒 to a new
coordinate system where, by construction, the dataset projected onto the direction
determined by the first row in 𝐐, i.e., 𝐪1,⋅ has the largest variance, projection onto
𝐪2,⋅ has the second largest variance, and so on.
Q
## array([[ 0.86781968, 0.49687926],
## [-0.49687926, 0.86781968]])
This is why we refer to the rows in 𝐐 as principal directions (or components). Their scaled
versions (proportional to the standard deviations along them) are depicted in Fig-
ure 9.18. Note that we have more or less recreated the steps needed to construct 𝐗
from 𝐗′ above (by the way we generated 𝐗′ , we expect it to have linearly uncorrelated
columns; yet, 𝐗′ and 𝐔 have different column variances).
chainlink = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/clustering/fcps_chainlink.csv")
As we said in Section 7.4, the plotting is always done on a two-dimensional surface (be
it the computer screen or book page). We can look at the dataset only from one angle
at a time.
In particular, a scatterplot matrix only depicts the dataset from the perspective of the
axes of the Cartesian coordinate system (standard basis); see Figure 9.19.
sns.pairplot(data=pd.DataFrame(chainlink))
# plt.show() # not needed :/
These viewpoints by no means must reveal the true geometric structure of the dataset.
However, we know that we can rotate the virtual camera and find some more interesting
Figure 9.18: Principal directions of an example dataset (scaled so that they are propor-
tional to the standard deviations along them)
angle. It turns out that our dataset represents two nonintersecting rings, hopefully
visible in Figure 9.20.
fig = plt.figure()
ax = fig.add_subplot(1, 3, 1, projection="3d", facecolor="#ffffff00")
ax.scatter(chainlink[:, 0], chainlink[:, 1], chainlink[:, 2])
ax.view_init(elev=45, azim=45, vertical_axis="z")
ax = fig.add_subplot(1, 3, 2, projection="3d", facecolor="#ffffff00")
ax.scatter(chainlink[:, 0], chainlink[:, 1], chainlink[:, 2])
ax.view_init(elev=37, azim=0, vertical_axis="z")
ax = fig.add_subplot(1, 3, 3, projection="3d", facecolor="#ffffff00")
ax.scatter(chainlink[:, 0], chainlink[:, 1], chainlink[:, 2])
ax.view_init(elev=10, azim=150, vertical_axis="z")
plt.show()
It turns out that we may find a noteworthy viewpoint using the SVD. Namely, we can
perform the decomposition of a centred dataset which we denote with 𝐗:
𝐗 = 𝐔𝐒𝐐.
import scipy.linalg
X_centered = chainlink-np.mean(chainlink, axis=0)
U, s, Q = scipy.linalg.svd(X_centered, full_matrices=False)
Figure 9.20: The chainlink dataset viewed from three different angles (two nonintersecting rings in three dimensions)

For the rotated dataset $\mathbf{P} = \mathbf{X}\mathbf{Q}^{-1} = \mathbf{U}\mathbf{S}$,
we know that its first column has the highest variance, the second column has the
second highest variability, and so on. It might indeed be worth looking at that dataset
from that most informative perspective.
Figure 9.21 gives the scatter plot for 𝐩⋅,1 and 𝐩⋅,2 . Maybe this does not reveal the true
geometric structure of the dataset (no single two-dimensional projection can do that),
but at least it is better than the initial ones (from the pairplot).
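A sketch of the code behind that figure (an assumption):
P = U @ np.diag(s)            # the projections onto all the principal directions
plt.plot(P[:, 0], P[:, 1], "o")
plt.axis("equal")
plt.show()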
Figure 9.21: The chainlink dataset projected onto its first two principal components
What we just did is a kind of dimensionality reduction. We found a viewpoint (in the form
of an orthonormal matrix, being a mixture of rotations and reflections) on 𝐗 such that
its orthonormal projection onto the first two axes of the Cartesian coordinate system
is the most informative16 (in terms of having the highest variance along these axes).
ssi = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/ssi_2016_indicators.csv",
comment="#")
X = np.array(ssi.iloc[:, [3, 5, 13, 15, 19] ]) # select columns, make matrix
n = X.shape[0]
X[:6, :] # preview
## array([[ 9.32 , 8.13333333, 8.386 , 8.5757 , 5.46249573],
## [ 8.74 , 7.71666667, 7.346 , 6.8426 , 6.2929302 ],
## [ 5.11 , 4.31666667, 8.788 , 9.2035 , 3.91062849],
## [ 9.61 , 7.93333333, 5.97 , 5.5232 , 7.75361284],
## [ 8.95 , 7.81666667, 8.032 , 8.2639 , 4.42350654],
## [10. , 8.65 , 1. , 1. , 9.66401848]])
Each index is on the scale from 0 to 10. These are, in this order:
1. Safe Sanitation,
2. Healthy Life,
3. Energy Use,
4. Greenhouse Gases,
5. Gross Domestic Product.
Above we displayed the data corresponding to the 6 following countries:
countries = list(ssi.iloc[:, 0]) # select the 1st column from the data frame
countries[:6] # preview
## ['Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia']
This is a five-dimensional dataset. We cannot easily visualise it. That the pairplot does
not reveal much is left as an exercise. Let us thus perform the SVD decomposition of
16 (**) The Eckart–Young–Mirsky theorem states that $\mathbf{U}_{\cdot,:k} \mathbf{S}_{:k,:k} \mathbf{Q}_{:k,\cdot}$ (where ":k" denotes "first k rows or
columns") is the best rank-k approximation of $\mathbf{X}$ with respect to both the Frobenius and spectral norms.
17 https://ssi.wi.th-koeln.de/
a standardised version of this dataset, 𝐙 (recall that the centring is necessary, at the
very least).
Z = (X - np.mean(X, axis=0))/np.std(X, axis=0)
U, s, Q = scipy.linalg.svd(Z, full_matrices=False)
The standard deviations of the data projected onto the consecutive principal compon-
ents (columns in 𝐔𝐒) are:
s/np.sqrt(n)
## array([2.02953531, 0.7529221 , 0.3943008 , 0.31897889, 0.23848286])
It is customary to check the ratios of the cumulative variances explained by the con-
secutive principal components, which is a normalised measure of their importances.
We can compute them by calling:
np.cumsum(s**2)/np.sum(s**2)
## array([0.82380272, 0.93718105, 0.96827568, 0.98862519, 1. ])
As in some sense the variability within the first two components covers ca. 94% of the
variability of the whole dataset, we can restrict ourselves only to a two-dimensional
projection of this dataset (actually, we are quite lucky here – or someone has selected
these countrywise indices for us in a very clever fashion).
The rows in 𝐐 feature the so-called loadings. They give the coefficients defining the
linear combinations of the rows in 𝐙 that correspond to the principal components.
Let us try to interpret them.
np.round(Q[0, :], 2) # loadings – the 1st principal axis
## array([-0.43, -0.43, 0.44, 0.45, -0.47])
The first row in 𝐐 consists of similar values, but with different signs. We can consider
them a scaled version of the average Energy Use (column 3), Greenhouse Gases (4), and
MINUS Safe Sanitation (1), MINUS Healthy Life (2), MINUS Gross Domestic Product
(5). We could call this a measure of a country’s overall eco-unfriendliness(?), because
countries with low Healthy Life and high Greenhouse Gases will score highly on this
scale.
np.round(Q[1, :], 2) # loadings – the 2nd principal axis
## array([ 0.52, 0.5 , 0.52, 0.45, -0.02])
The second row in 𝐐 defines a scaled version of the average of Safe Sanitation (1),
Healthy Life (2), Energy Use (3), and Greenhouse Gases (4), almost completely ignor-
ing the GDP (5). Should we call it a measure of industrialisation? Something like this.
But this naming is just for fun18 .
18 Although someone might take these results seriously and write, for example, a research thesis about
it. Mathematics – unlike the brains of ordinary mortals – does not need our imperfect interpretations/fairy
tales to function properly. We need more maths in our lives.
Figure 9.22 is a scatter plot of the countries projected onto the said two principal dir-
ections. For readability, we only display a few chosen labels. This is merely a projec-
tion/approximation, but it might be an interesting one for some practitioners.
P2 = U[:, :2] @ np.diag(s[:2]) # == Y @ Q[:2, :].T
plt.plot(P2[:, 0], P2[:, 1], "o", alpha=0.1)
which = [ # hand-crafted/artisan
141, 117, 69, 123, 35, 80, 93, 45, 15, 2, 60, 56, 14,
104, 122, 8, 134, 128, 0, 94, 114, 50, 34, 41, 33, 77,
64, 67, 152, 135, 148, 99, 149, 126, 111, 57, 20, 63
]
for i in which:
plt.text(P2[i, 0], P2[i, 1], countries[i], ha="center")
plt.axis("equal")
plt.xlabel("1st principal component (eco-unfriendliness?)")
plt.ylabel("2nd principal component (industrialisation?)")
plt.show()
Figure 9.22: The countries projected onto the first two principal directions: the 1st principal component (eco-unfriendliness?) vs the 2nd principal component (industrialisation?)
There are of course many other approaches to dimensionality reduction, also nonlin-
ear ones, including kernel PCA, feature agglomeration via hierarchical clustering, au-
toencoders, t-SNE, etc.
A popular introductory text in statistical learning is [47]. We recommend [2, 8, 9, 22,
24] for more advanced students. Computing-oriented students should check out [64].
9.5 Exercises
Exercise 9.15 Why is correlation not causation?
Exercise 9.16 What does the linear correlation of 0.9 mean? How about the rank correlation of
0.9? And the linear correlation of 0.0?
Exercise 9.17 How is Spearman’s coefficient related to Pearson’s one?
Exercise 9.18 State the optimisation problem behind the least squares fitting of linear models.
Exercise 9.19 What are the different ways of summarising the residuals numerically?
Exercise 9.20 Why is it important for the residuals to be homoscedastic?
Exercise 9.21 Is a more complex model always better?
Exercise 9.22 Why should extrapolation be handled with care?
Exercise 9.23 Why did we say that novice users should refrain from using scikit-learn?
Exercise 9.24 What is the condition number of a model matrix and why should we always
check it?
Exercise 9.25 What is the geometrical interpretation of the dot product of two normalised vec-
tors?
Exercise 9.26 How can we verify if two vectors are orthonormal? What is an orthonormal pro-
jection? What is the inverse of an orthonormal matrix?
Exercise 9.27 What is the inverse of a diagonal matrix?
Exercise 9.28 Characterise the general properties of the three matrices obtained by performing
the singular value decomposition of a given matrix of shape n-by-m.
Exercise 9.29 How can we obtain the first principal component of a given centred matrix?
Exercise 9.30 How can we compute the ratios of the variances explained by the consecutive
principal components?
Part IV
Heterogeneous Data
10
Introducing Data Frames
numpy arrays are an extremely versatile tool for performing data analysis exercises and
other numerical computations of various kinds. Although theoretically possible oth-
erwise, in practice we only store elements of the same type therein, most often num-
bers.
pandas1 [62] is amongst over one hundred thousand2 open-source packages and re-
positories that use numpy to provide additional data wrangling functionality. It was
originally written by Wes McKinney but was heavily inspired by the data.frame3 ob-
jects in S and R as well as tables in relational (think: SQL) databases and spreadsheets.
Before we delve into the world of pandas, let us point out that it is customary to load
this package under the following alias:
import pandas as pd
Important Let us repeat: pandas is built on top of numpy and most objects therein
can be processed by numpy functions as well. Many other functions (e.g., in sklearn)
accept both DataFrame and ndarray objects, but often convert the former to the latter
internally to enable data processing using fast C/C++/Fortran routines.
What we have learnt so far4 still applies. But there is of course more, hence this part.
np.random.seed(123)
pd.DataFrame(
np.random.rand(4, 3),
columns=["a", "b", "c"]
)
## a b c
## 0 0.696469 0.286139 0.226851
## 1 0.551315 0.719469 0.423106
## 2 0.980764 0.684830 0.480932
## 3 0.392118 0.343178 0.729050
Notice that rows and columns are labelled (and how readable that is).
A dictionary of vector-like objects of equal lengths is another common option:
np.random.seed(123)
df = pd.DataFrame(dict(
a = np.round(np.random.rand(5), 2),
b = [1, 2.5, np.nan, 4, np.nan],
c = [True, True, False, False, True],
d = ["A", "B", "C", None, "E"],
e = ["spam", "spam", "bacon", "spam", "eggs"],
f = np.array([
        "2021-01-01", "2022-02-02", "2023-03-03", "2024-04-04", "2025-05-05"
    ], dtype="datetime64[s]"),   # (assumed continuation, reconstructed from the previews shown later)
    g = [
        ["spam"], ["bacon", "spam"], None, ["eggs", "bacon", "spam"], ["ham"]
    ]
))
4 If by any chance the kind reader frivolously decided to start their journey with this superb book at this
chapter, it is now the time to go back to the Preface and learn everything in the right order. See you later.
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.head() # display first few rows (5 by default)
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
Reading from URLs and local files is of course supported; compare Section 13.6.1.
Exercise 10.2 Check out the other pandas.read_* functions in the pandas documentation. We
will be discussing some of them later.
df.shape
## (5, 7)
Recall that numpy arrays are equipped with the dtype slot.
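A data frame, on the other hand, reports the type of each column separately; the omitted snippet was presumably along the lines of:
df.dtypes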
10.1.2 Series
There is a separate class for storing individual data frame columns: it is called Series.
Data frames with one column are printed out slightly differently. We get the column
name at the top, but do not have the dtype information at the bottom.
Important It is crucial to know when we are dealing with a Series and when with a
DataFrame object, because each of them defines a slightly different set of methods.
We will now be relying upon object-oriented syntax (compare Section 2.2.3) much
more frequently than before.
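In the examples below, s is assumed to denote a single column extracted from df, e.g.:
s = df.loc[:, "a"]   # (an assumption, consistent with the outputs that follow)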
s.mean()
## 0.49800000000000005
df.mean(numeric_only=True)
## a 0.498
## b 2.500
## c 0.600
## dtype: float64
s.shape
## (5,)
s.dtype
## dtype('float64')
s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])
np.mean(s)
## 0.49800000000000005
As a consequence, what we covered in the part of this book that dealt with vector pro-
cessing still holds for data frame columns (but there will be more).
Series can also be named.
s.name
## 'a'
This is convenient, especially when we convert them to a data frame, because the name
sets the label of the newly created column:
s.rename("spam").to_frame()
## spam
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
10.1.3 Index
Another important class is called Index6 . We use it for storing element or axes labels.
The index (lowercase) slot of a data frame stores an object of class Index (or one of its
derivatives) that gives the row names:
s.index
## RangeIndex(start=0, stop=5, step=1)
The set_index method can be applied to make a data frame column act as a sequence
of row labels:
df2 = df.set_index("e")
df2
## a b c d f g
## e
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
also the concept of an index in relational databases. In pandas, we can have non-unique row names.
df2.index.name
## 'e'
df2.rename_axis(index="ROWS", columns="COLS")
## COLS a b c d f g
## ROWS
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
Having a named index slot is handy when we decide that we want to convert the vector
of row labels back to a standalone column:
df2.rename_axis(index="NEW_COLUMN").reset_index()
## NEW_COLUMN a b c d f g
## 0 spam 0.70 1.0 True A 2021-01-01 [spam]
## 1 spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 bacon 0.23 NaN False C 2023-03-03 None
## 3 spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 eggs 0.72 NaN True E 2025-05-05 [ham]
There is also an option to get rid of the current index and to replace it with the default
label sequence, i.e., 0, 1, 2, …:
df2.reset_index(drop=True)
## a b c d f g
## 0 0.70 1.0 True A 2021-01-01 [spam]
## 1 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 0.23 NaN False C 2023-03-03 None
## 3 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 0.72 NaN True E 2025-05-05 [ham]
Take note of the fact that reset_index, and many other methods that we have used so
far, do not modify the data frame in place.
Exercise 10.4 Use the pandas.DataFrame.rename method to change the name of the a column
in df to spam.
Also, a hierarchical index – one that is comprised of more than one level – is possible.
For example, here is a sorted (see Section 10.6.1) version of df with a new index based
on two columns at the same time:
df.sort_values("e", ascending=False).set_index(["e", "c"])
## a b d f g
## e c
## spam True 0.70 1.0 A 2021-01-01 [spam]
## True 0.29 2.5 B 2022-02-02 [bacon, spam]
## False 0.55 4.0 None 2024-04-04 [eggs, bacon, spam]
## eggs True 0.72 NaN E 2025-05-05 [ham]
## bacon False 0.23 NaN C 2023-03-03 None
For the sake of readability, the consecutive repeated spams were not printed.
Example 10.5 Hierarchical indexes might arise after aggregating data in groups. For example:
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#").rename({
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1)
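A sketch of the omitted aggregation (an assumption, consistent with the result previewed below):
res = nhanes.groupby(["gender", "usborn"])["bmival"].mean()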
This returned a Series object with a hierarchical index. Let us fret not, though: reset_index
always comes to our rescue:
res.reset_index()
## gender usborn bmival
## 0 1 1 25.734110
## 1 1 2 27.405251
## 2 2 1 27.120261
## 3 2 2 27.579448
## 4 2 77 28.725000
## 5 2 99 32.600000
np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
df
## u v w
## a 0.70 0.32 spam
## b 0.29 -0.05 bacon
## c 0.23 -0.20 spam
## d 0.55 1.98 eggs
## e 0.72 -1.62 sausage
All numpy functions can be applied directly on individual columns, i.e., objects of type
Series, because they are vector-like.
u = df.loc[:, "u"] # extract the `u` column (gives a Series; see below)
np.quantile(u, [0, 0.5, 1])
## array([0.23, 0.55, 0.72])
Most numpy functions also work if they are fed with data frames, but we will need to
extract the numeric columns manually.
Sometimes the results will automatically be coerced to a Series object with the index
slot set appropriately:
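Here, uv is assumed to be the data frame restricted to its numeric columns, e.g.:
uv = df.loc[:, ["u", "v"]]   # (an assumption)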
np.mean(uv, axis=0)
## u 0.498
## v 0.086
## dtype: float64
Many operations, for convenience, were also implemented as methods for the Series
and DataFrame classes, e.g., mean, median, min, max, quantile, var, std, and skew.
df.mean(numeric_only=True)
## u 0.498
## v 0.086
## dtype: float64
df.quantile([0, 0.5, 1], numeric_only=True)
## u v
## 0.0 0.23 -1.62
## 0.5 0.55 -0.05
## 1.0 0.72 1.98
Also note the describe method, which returns a few statistics at the same time.
df.describe()
## u v
## count 5.000000 5.000000
## mean 0.498000 0.086000
## std 0.227969 1.289643
## min 0.230000 -1.620000
## 25% 0.290000 -0.200000
## 50% 0.550000 -0.050000
## 75% 0.700000 0.320000
## max 0.720000 1.980000
Exercise 10.6 Check out the pandas.DataFrame.agg method that can apply all aggregates
given by a list of functions. Write a call equivalent to df.describe().
Note (*) Let us stress that above we see the corrected for bias (but still only asymptotically unbiased) version of the standard deviation, given by $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$; compare Section 5.1. In pandas, the std methods assume ddof=1 by default, whereas we recall that numpy uses ddof=0.
This is an unfortunate inconsistency between the two packages, but please do not
blame the messenger.
np.exp(df.loc[:, "u"])
## a 2.013753
## b 1.336427
## c 1.258600
## d 1.733253
## e 2.054433
## Name: u, dtype: float64
np.exp(df.loc[:, ["u", "v"]])
## u v
## a 2.013753 1.377128
## b 1.336427 0.951229
## c 1.258600 0.818731
## d 1.733253 7.242743
## e 2.054433 0.197899
When applying the binary arithmetic, relational, and logical operators on an object of
class Series and a scalar or a numpy vector, the operations are performed elementwisely
– a style with which we are already familiar.
For instance, here is a standardised version of the u column:
u = df.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a 0.990672
## b -1.020098
## c -1.314357
## d 0.255025
## e 1.088759
## Name: u, dtype: float64
Binary operators act on the elements with corresponding labels. For two objects hav-
ing identical index slots (this is the most common scenario), this is the same as ele-
mentwise vectorisation. For instance:
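A minimal sketch (not the original listing): adding two columns of df, whose index slots are identical, behaves just like elementwise addition in numpy:
df.loc[:, "u"] + df.loc[:, "v"]  # labels match, hence purely elementwise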
For transforming many numerical columns at once, it is a good idea either to convert
them to a numeric matrix explicitly and then use the basic numpy functions:
uv = np.array(df.loc[:, ["u", "v"]])
uv2 = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv2
## array([[ 0.99067229, 0.20286225],
## [-1.0200982 , -0.11790285],
## [-1.3143573 , -0.24794275],
## [ 0.25502455, 1.64197052],
## [ 1.08875866, -1.47898717]])
Anticipating what we cover in the next section, in both cases, we can write df.loc[:,
["u", "v"]] = uv2 to replace the old content. Also, new columns can be added based
on the transformed versions of the existing ones, for instance:
df.loc[:, "uv_squared"] = (df.loc[:, "u"] * df.loc[:, "v"])**2
df
## u v w uv_squared
## a 0.70 0.32 spam 0.050176
## b 0.29 -0.05 bacon 0.000210
## c 0.23 -0.20 spam 0.002116
## d 0.55 1.98 eggs 1.185921
## e 0.72 -1.62 sausage 1.360489
Example 10.7 (*) Binary operations on objects with different index slots are in fact vectorised
labelwisely:
x = pd.Series([1, 10, 1000, 10000, 100000], index=["a", "b", "a", "a", "c"])
x
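The printout of x and the definition of the second series, y, are not shown here; a definition of y consistent with the product displayed below (the value paired with label d is immaterial, as it has no counterpart in x) is:
y = pd.Series([3, 1, 2, 5, 4], index=["a", "b", "b", "c", "d"])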
And now:
x * y
## a 3.0
## a 3000.0
## a 30000.0
## b 10.0
## b 20.0
## c 500000.0
## d NaN
## dtype: float64
Here, each element in the first Series named a was multiplied by each (there was only one)
element labelled a in the second Series. For d, there were no matches, hence the result’s being
marked as missing; compare Chapter 15. Thus, this behaves like a full outer join-type operation;
see Section 10.6.4.
The above is different from elementwise vectorisation in numpy:
np.array(x) * np.array(y)
## array([ 1, 20, 3000, 40000, 500000])
Labelwise vectorisation can be useful in certain contexts, but we should be aware of this (yet an-
other) incompatibility between the two packages.
Consider a Series object b whose ten values are labelled with integers given in a scrambled order, and:
c = b.copy()
c.index = list("abcdefghij")
c
## a 0.70
## b 0.29
## c 0.23
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## h 0.68
## i 0.48
## j 0.39
## dtype: float64
They consist of the same values, in the same order, but have different labels (index
slots). In particular, b’s labels are integers that do not match the physical element pos-
itions (where 0 would denote the first element, etc.).
Important For numpy vectors, we had four different indexing schemes: via a scalar
(extracts an element at a given position), a slice, an integer vector, and a logical vector.
Series objects are additionally labelled. Therefore, they can also be accessed through
the contents of the index slot.
For example, calls like b[0] do not select the first item, but the item labelled 0. Moreover, the behaviour of slicing with [...], e.g., b[:1], is scheduled to change in a backward-incompatible manner: in future versions of the package, the same code will generate the result corresponding to b.loc[:1]. Hence, we will get a different number of items. Compare:
b.iloc[:1]
## 2 0.7
## dtype: float64
b.loc[:1]
## 2 0.70
## 1 0.29
## dtype: float64
Just never apply [...] directly on Series nor DataFrame objects and you will not have to worry about re-
membering all the exceptions.
Important We should never apply [...] directly on Series nor DataFrame objects.
To avoid ambiguity, we should be referring to the loc[...] and iloc[...] accessors
for the label- and position-based filtering, respectively.
10.4.2 loc[...]
Series.loc[...] implements label-based indexing.
b.loc[0]
## 0.39
This returned the element labelled 0. On the other hand, c.loc[0] will raise a KeyError,
because c consists of string labels only. But in this case, we can write:
c.loc["j"]
## 0.39
b.loc[ [0, 1, 0] ]
## 0 0.39
## 1 0.29
## 0 0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j 0.39
## b 0.29
## j 0.39
## dtype: float64
b.loc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## dtype: float64
b.loc[0:4:-1]
## 0 0.39
The above calls return all elements between the two indicated labels.
Note Be careful that if there are repeated labels, then we will be returning all (sic!9 )
the matching items:
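A minimal illustration (not the original example):
s = pd.Series([1, 2, 3], index=["a", "a", "b"])
s.loc["a"]
## a    1
## a    2
## dtype: int64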
10.4.3 iloc[...]
Here are some examples of position-based indexing with the iloc[...] accessor. It is
worth stressing that, fortunately, its behaviour is consistent with its numpy counter-
part, i.e., the ordinary square brackets applied on objects of class ndarray.
For example, a call like b.iloc[1:7] returns the 2nd, 3rd, …, 7th element (not including b.iloc[7], i.e., the 8th one).
For iloc[...], the indexer must be unlabelled, e.g., be an ordinary numpy vector.
And now:
df.loc[ df.loc[:, "u"] > 0.5, "u":"w" ]
## u v w
selects the rows where the values in the u column are greater than 0.5 and then returns
all columns between u and w (inclusive!).
Furthermore, a chained call along the lines of df.iloc[:3, :].loc[:, ["u", "w"]] fetches the first three rows (by position; iloc is necessary) and then selects the two indicated columns.
Compare this to:
df.loc[:3, ["u", "w"]] # df[:3, ["u", "w"]] does not even work - don't
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
## 3 0.55 eggs
Important Getting a scrambled numeric index that does not match the physical pos-
itions is quite easy, for instance in the context of data frame sorting which we discuss
in Section 10.6.1:
df2 = df.sort_values("v")
df2
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
## 0 0.70 0.32 spam True
## 3 0.55 1.98 eggs False
Important We can frequently write df.u as a shorter version of df.loc[:, "u"]. This
improves the readability in contexts such as:
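An illustrative expression (not necessarily the one used originally):
df.loc[(df.u > 0.5) & (df.w == "spam"), :]  # shorter than repeating df.loc[:, "..."]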
This accessor is, sadly, not universal. We can verify this by considering a data frame
featuring a column named, e.g., mean (which clashes with a built-in method).
Exercise 10.10 In the tips10 dataset, select data on male customers where the total bills were in
the [10, 20] interval. Also, select Saturday and Sunday records where the tips were greater than
$5.
10 https://github.com/gagolews/teaching-data/raw/master/other/tips.csv
Important Notation like “df.new_column = ...” does not work. As we said, only loc
and iloc are universal. For other accessors, this is not necessarily the case.
Exercise 10.11 Use pandas.DataFrame.insert to add a new column not necessarily at the end
of df.
Exercise 10.12 Use pandas.DataFrame.append to add a few more rows to df.
Modifying a single datum through chained indexing, e.g., by writing df.loc[:, "u"].iloc[0] = 42, is not guaranteed to alter the underlying data frame, as such a chain may operate on a temporary copy. In order to remedy this, it is best to create a copy of a column, modify it, and then replace the old contents with the new ones.
u = df.loc[:, "u"].copy()
u.iloc[0] = 42 # or a whole for loop to process them all, or whatever
df.loc[:, "u"] = u
df.loc[:, "u"].iloc[0] # testing
## 42.0
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.sample(5, random_state=123) # 5 rows without replacement
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4214 58.4 156.2 35.2 34.7 27.2 99.5 77.5
## 3361 73.7 161.0 36.5 34.5 29.0 107.6 98.2
## 3759 61.4 164.6 37.5 40.4 26.9 93.5 84.4
## 3733 120.4 158.8 33.5 34.6 40.5 147.2 129.3
## 1121 123.5 157.5 35.5 29.0 50.5 143.0 136.4
Notice the random_state argument which controls the seed of the pseudorandom
number generator so that we get reproducible results. Alternatively, we could call
numpy.random.seed.
Exercise 10.13 Show how the three aforementioned scenarios can be implemented manually
using iloc[...] and numpy.random.permutation or numpy.random.choice.
In machine learning practice, we are used to training and evaluating machine learning
models on different (mutually disjoint) subsets of the whole data frame.
For instance, in Section 12.3.3, we mention that we may be interested in performing
the so-called training/test split (partitioning), where 80% (or 60% or 70%) of the ran-
domly selected rows would constitute the first new data frame and the remaining 20%
(or 40% or 30%, respectively) would go to the second one.
Given a data frame like:
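The example data frame and the shuffled index vector are not reproduced here; presumably they were created along these lines (the seed and the number of rows are assumptions):
np.random.seed(123)
df = body.head(10)                        # a small example data frame
idx = np.random.permutation(df.shape[0])  # shuffled row positions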
And then to pick the first 80% of them to construct the data frame number one:
k = int(df.shape[0]*0.8)
df.iloc[idx[:k], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
df.iloc[idx[k:], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
Exercise 10.14 In the wine_quality_all11 dataset, leave out all but the white wines. Parti-
tion the resulting data frame randomly into three data frames: wines_train (60% of the rows),
wines_validate (another 20% of the rows), and wines_test (the remaining 20%).
Exercise 10.15 Write a function kfold which takes a data frame df and an integer 𝑘 > 1 as
arguments. Return a list of data frames resulting from randomly partitioning df
into 𝑘 disjoint chunks of equal (or almost equal if that is not possible) sizes.
np.random.seed(123)
df = pd.DataFrame(dict(
year = np.repeat([2023, 2024, 2025], 4),
quarter = np.tile(["Q1", "Q2", "Q3", "Q4"], 3),
data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
df
## data
11 https://github.com/gagolews/teaching-data/raw/master/other/wine_quality_all.csv
The index has both levels named, but this is purely for aesthetic reasons.
Indexing using loc[...] by default relates to the first level of the hierarchy:
df.loc[2023, :]
## data
## quarter
## Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
Note that we selected all rows corresponding to a given label and dropped (!) this level
of the hierarchy.
Another example:
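An illustrative call (the original example is not shown), slicing over the first level of the hierarchy:
df.loc[2023:2024, :]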
Let us stress again that the `:` operator can only be used directly within the square
brackets. Nonetheless, we can always use the slice constructor to create a slice in any
context:
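For instance, a sketch relying on the built-in slice constructor (the original listing is not reproduced here):
df.loc[(slice(2023, 2024), slice(None)), :]  # the same as df.loc[2023:2024, :]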
10.6.1 Sorting
Let us consider another example dataset. Here are the yearly (for 2018) average air
quality data12 in the Australian state of Victoria.
air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
air = (
air.
loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
reset_index(drop=True)
)
air
## sp_name param_id value
## 0 Alphington BPM2.5 7.848758
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 3 Churchill BPM2.5 6.391230
## 4 Dandenong NO2 9.800705
## 5 Footscray BPM2.5 7.640948
## 6 Footscray NO2 10.274531
## 7 Geelong South BPM2.5 6.502762
## 8 Geelong South NO2 5.681722
## 9 Melbourne CBD BPM2.5 8.072998
## 10 Moe BPM2.5 6.427079
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 13 Morwell South NO2 5.124430
## 14 Traralgon BPM2.5 8.024735
## 15 Traralgon NO2 5.776333
sort_values is a convenient means to order the rows with respect to one criterion, be
it numeric or categorical.
12 https://discover.data.vic.gov.au/dataset/epa-air-watch-all-sites-air-quality-hourly-averages-yearly
air.sort_values("value", ascending=False)
## sp_name param_id value
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
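The two-criteria ordering discussed next was produced by a call that is not shown above; presumably something like:
air.sort_values(["param_id", "value"], ascending=[True, False])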
Here, in each group of identical parameters, we get a decreasing order with respect to
the value.
Exercise 10.16 Compare the ordering with respect to param_id and value vs value and then
param_id.
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id")
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Melbourne CBD 8.072998
## 1 BPM2.5 Moe 6.427079
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Morwell East 6.784596
## 4 BPM2.5 Churchill 6.391230
## 5 BPM2.5 Morwell South 6.512849
## 6 BPM2.5 Traralgon 8.024735
## 7 BPM2.5 Alphington 7.848758
## 8 BPM2.5 Geelong South 6.502762
## 9 NO2 Morwell South 5.124430
## 10 NO2 Traralgon 5.776333
## 11 NO2 Geelong South 5.681722
## 12 NO2 Altona North 9.467912
## 13 NO2 Alphington 9.558120
## 14 NO2 Dandenong 9.800705
## 15 NO2 Footscray 10.274531
We lost the ordering based on station names in the two subgroups. To switch to a
mergesort-like method (timsort), we should pass kind="stable".
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id", kind="stable") # !
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
Exercise 10.17 (*) Perform identical reorderings but using only loc[...], iloc[...], and
numpy.argsort.
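The wide-format object discussed next, air_wide, is not constructed above; one way to obtain a table of this kind (the exact call and its orientation are assumptions) is:
air_wide = air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]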
The missing values are denoted with NaNs (not-a-number); see Section 15.1 for more
details. Interestingly, we got a hierarchical index in the columns (sic!) slot, hence the
loc[...] part to drop the last level of the hierarchy. Also notice that the index and
columns slots are named.
air_wide.T.rename_axis(index="location", columns="param").\
stack().rename("value").reset_index()
## location param value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
We used the data frame transpose (T) to get a location-major order (less boring an out-
come in this context). Missing values are gone now. We do not need them anymore.
Nevertheless, passing dropna=False would help us identify the combinations of loca-
tion and param for which the readings are not provided.
A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates1.csv",
comment="#")
and:
B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates2.csv",
comment="#")
B
## Name BirthDate
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 2 Micha Kitchen 17.09.1930
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 5 Katarzyna Lasko 20.10.1971
## 6 Duchanee Panomyaong 19.06.1952
## 7 Mefodiy Shachar 01.10.1914
## 8 Paul Meckler 29.09.1968
## 9 Noe Tae-Woong 11.07.1970
## 10 Åge Trelstad 07.03.1935
In both datasets, there is a single categorical column whose elements uniquely identify
each record (i.e., Name). In the language of relational databases, we would call it the
primary key. In such a case, implementing the set-theoretic operations is relatively
easy, as we can refer to the pandas.Series.isin method.
First, 𝐴 ∩ 𝐵 (intersection), includes only the rows that are both in A and in B:
A.loc[A.Name.isin(B.Name), :]
## Name BirthDate
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
Second, 𝐴 ∖ 𝐵 (difference), gives all the rows that are in A but not in B:
A.loc[~A.Name.isin(B.Name), :]
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
We could have stored them alongside the air data frame, but that would be a waste of space.
Also, if we wanted to modify some datum (note, e.g., the annoying double space in param_name
for BPM2.5), we would have to update all the relevant records.
Instead, we can always match the records in both data frames that have the same param_ids, and
join (merge) these datasets only when we really need this.
Let us discuss the possible join operations by studying the two following toy data sets:
A = pd.DataFrame({
"x": ["a0", "a1", "a2", "a3"],
"y": ["b0", "b1", "b2", "b3"]
})
A
## x y
## 0 a0 b0
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
and:
B = pd.DataFrame({
"x": ["a0", "a2", "a2", "a4"],
"z": ["c0", "c1", "c2", "c3"]
})
The inner join of A and B matches the pairs of rows that share the same value in the common column x; this is what pandas.merge performs by default:
pd.merge(A, B, on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
The left join of A with B guarantees to return all the records from A, even those which are not matched by anything in B. The right join of A with B is the same as the left join of B with A. Finally, the full outer join is the set-theoretic union of the left and the right join.
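The calls themselves are not reproduced here; minimal sketches relying on the how parameter of pandas.merge (outputs omitted; the exact invocations are assumptions):
pd.merge(A, B, how="left", on="x")    # all rows of A; unmatched ones get z=NaN
pd.merge(A, B, how="right", on="x")   # all rows of B; unmatched ones get y=NaN
pd.merge(A, B, how="outer", on="x")   # the union of the left and the right join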
Nevertheless, the methods are probably too plentiful to our taste. Their developers
were overgenerous. They wanted to include a list of all the possible verbs related to
data analysis, even if they can be trivially expressed by a composition of 2-3 simpler
operations from numpy or scipy instead.
As strong advocates of minimalism, we would rather save ourselves from being over-
loaded with too much new information. This is why our focus in this book is on devel-
oping the most transferable24 skills. Our approach is also slightly more hygienic. We do
not want the reader to develop a hopeless mindset, the habit of looking everything up
15 https://github.com/gagolews/teaching-data/raw/master/marek/air_quality_2018_value.csv.gz
16 https://github.com/gagolews/teaching-data/raw/master/marek/air_quality_2018_point.csv
17 https://github.com/gagolews/teaching-data/raw/master/marek/air_quality_2018_param.csv
18 https://github.com/gagolews/teaching-data/raw/master/marek/air_quality_2018.csv.gz
19 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DEMO.htm
20 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm
21 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/AUX_J.htm
22 https://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2017
23 https://pandas.pydata.org/pandas-docs/stable/reference/index.html
24 This is also in line with the observation that Python with pandas is not the only environment where we
can work with data frames (e.g., base R or Julia with DataFrame.jl allows that too).
on the internet when faced with even the simplest kinds of problems. We have brains
for a reason.
10.7 Exercises
Exercise 10.24 How are data frames different from matrices?
Exercise 10.25 What are the use cases of the name slot in Series and Index objects?
Exercise 10.26 What is the purpose of set_index and reset_index?
Exercise 10.27 Why is learning numpy crucial for anyone who wants to become a proficient user
of pandas?
Exercise 10.28 What is the difference between iloc[...] and loc[...]?
Exercise 10.29 Why is applying the index operator [...] directly on a Series or DataFrame ob-
ject not necessarily a good idea?
Exercise 10.30 What is the difference between index, Index, and columns?
Exercise 10.31 How can we compute the arithmetic mean and median of all the numeric
columns in a data frame, using a single line of code?
Exercise 10.32 What is a training/test split and how to perform it using numpy and pandas?
Exercise 10.33 What is the difference between stacking and unstacking? Which one yields a
wide (as opposed to long) format?
Exercise 10.34 Name different data frame join (merge) operations and explain how they work.
Exercise 10.35 How does sorting with respect to more than one criterion work?
Exercise 10.36 Name the basic set-theoretic operations on data frames.
11
Handling Categorical Data
So far, we have been mostly dealing with quantitative (numeric) data, on which we
were able to apply various mathematical operations, such as computing the arithmetic
mean or taking the square thereof. Naturally, not every transformation must always
make sense in every context (e.g., multiplying temperatures – what does it mean when
we say that it is twice as hot today as compared to yesterday?), but still, the possibilities
were plenty.
Qualitative data (also known as categorical data, factors, or enumerated types) such
as eye colour, blood type, or a flag whether a patient is ill, on the other hand, take a
small number of unique values. They support an extremely limited set of admissible
operations. Namely, we can only determine whether two entities are equal or not.
In datasets involving many features, which we shall cover in Chapter 12, categorical
variables are often used for observation grouping (e.g., so that we can compute the
best and average time for marathoners in each age category or draw box plots for fin-
ish times of men and women separately). Also, they may serve as target variables in
statistical classification tasks (e.g., so that we can determine if an email is “spam” or
“not spam”).
Usually, one would use integers between 1 and 𝑙 (inclusive). Nevertheless, a dataset creator is free to encode the labels
however they want. For example, DMDBORN4 in NHANES has: 1 (born in 50 US states or Washington, DC), 2
(others), 77 (refused to answer), and 99 (do not know).
Let us consider the data on the original whereabouts of the top 16 marathoners (the
37th PZU native Marathon dataset):
marathon = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/37_pzu_warsaw_marathon_simplified.csv",
comment="#")
cntrs = np.array(marathon.country, dtype="str")
cntrs16 = cntrs[:16]
cntrs16
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
These are two-letter ISO 3166 country codes, encoded of course as strings (notice the
dtype="str" argument).
cat_cntrs16 = pd.unique(cntrs16)
cat_cntrs16
## array(['KE', 'ET', 'MA', 'PL', 'IL'], dtype='<U2')
Note We could have also used numpy.unique discussed in Section 5.5.3, but it would
sort the distinct values lexicographically. In other words, they would not be listed in
the order of appearance.
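The integer codes used below were computed in a step that is not reproduced here; pandas.factorize, which enumerates the levels in their order of appearance, yields values consistent with everything that follows:
codes_cntrs16 = pd.factorize(cntrs16)[0]
codes_cntrs16
## array([0, 0, 0, 1, 0, 0, 1, 2, 3, 3, 4, 3, 0, 0, 3, 3])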
The code sequence 0, 0, 0, 1, … corresponds to the 1st, 1st, 1st, 2nd, … level in
cat_cntrs16, i.e., Kenya, Kenya, Kenya, Ethiopia, ….
Even though we can represent categorical variables using a set of integers, it does
not mean that they become instances of a quantitative type. Arithmetic operations
thereon do not really make sense.
The values between 0 (inclusive) and 5 (exclusive) can be used to index a given array of
length 𝑙 = 5. As a consequence, to decode our factor, we can simply write:
cat_cntrs16[codes_cntrs16]
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
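The reordered level set referred to next is not defined above; values consistent with the outputs that follow are:
new_codes = np.array([3, 0, 2, 4, 1])  # a new arrangement of the levels
new_cat_cntrs16 = cat_cntrs16[new_codes]
new_cat_cntrs16
## array(['PL', 'KE', 'MA', 'IL', 'ET'], dtype='<U2')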
Then we make use of the fact that numpy.argsort applied on a vector representing a permutation,
determines its very inverse:
new_codes_cntrs16 = np.argsort(new_codes)[codes_cntrs16]
new_codes_cntrs16
## array([1, 1, 1, 4, 1, 1, 4, 2, 0, 0, 3, 0, 1, 1, 0, 0])
Verification:
np.all(cntrs16 == new_cat_cntrs16[new_codes_cntrs16])
## True
Exercise 11.2 (**) Determine the set of unique values in cntrs16 in the order of appearance
(and not sorted lexicographically), but without using pandas.unique nor pandas.factorize.
Then, encode cntrs16 using this level set.
Hint: check out the return_index argument to numpy.unique and numpy.searchsorted.
Furthermore, pandas includes2 a special dtype for storing categorical data. Namely, we
can write:
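The first variant is not shown here; an equivalent call (this exact form is an assumption) would be:
cntrs16_series = pd.Series(cntrs16, dtype="category")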
or, equivalently:
cntrs16_series = pd.Series(cntrs16).astype("category")
These two yield a Series object displayed as if it was represented using string labels:
cntrs16_series.head() # preview
## 0 KE
## 1 KE
## 2 KE
## 3 ET
## 4 KE
## dtype: category
## Categories (5, object): ['ET', 'IL', 'KE', 'MA', 'PL']
np.array(cntrs16_series.cat.codes)
## array([2, 2, 2, 0, 2, 2, 0, 3, 4, 4, 1, 4, 2, 2, 4, 4], dtype=int8)
cntrs16_series.cat.categories
## Index(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='object')
(marathon.iloc[:16, :].country.astype("category")
.cat.reorder_categories(
["KE", "IL", "MA", "ET", "PL"]
)
.cat.rename_categories(
["Kenya", "Israel", "Morocco", "Ethiopia", "Poland"]
).astype("str")
).head()
## 0 Kenya
## 1 Kenya
## 2 Kenya
## 3 Ethiopia
## 4 Kenya
## Name: country, dtype: object
2 https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
Important When converting logical to numeric, False becomes 0 and True becomes
1. Conversely, 0 is converted to False and anything else (including -0.326) to True.
Hence, instead of working with vectors of 0s and 1s, we might equivalently be playing
with logical arrays. For example:
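A minimal sketch (not the original listing): turning a 0/1 vector into a logical one.
np.array([1, 0, 1, 0]) != 0
## array([ True, False,  True, False])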
or, equivalently:
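Again only a sketch, this time with an explicit type conversion:
np.array([1, 0, 1, 0]).astype(bool)
## array([ True, False,  True, False])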
Important It is not rare to work with vectors of probabilities, where the i-th element
therein, say p[i], denotes the likelihood of an observation’s belonging to class 1. Con-
sequently, the probability of being a member of class 0 is 1-p[i]. In the case where we
would rather work with crisp classes, we can simply apply the conversion (p>=0.5) to
get a logical vector.
Exercise 11.3 Given a numeric vector x, create a vector of the same length as x whose i-th element
is equal to "yes" if x[i] is in the unit interval and to "no" otherwise. Use numpy.where, which
can act as a vectorised version of the if statement.
$$\mathbf{R} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$
One can easily verify that each row consists of one and only one 1 (the number of 1s
per one row is 1). Such a representation is adequate when solving a multiclass classi-
fication problem by means of l binary classifiers. For example, if spam, bacon, and hot
dogs are on the menu, then spam is encoded as (1, 0, 0), i.e., yeah-spam, nah-bacon,
and nah-hot dog. We can build three binary classifiers, each narrowly specialising in
sniffing one particular type of food.
Example 11.4 Write a function to one-hot encode a given categorical vector represented using
character strings.
Example 11.5 Write a function to decode a one-hot encoded matrix.
Example 11.6 (*) We can also work with matrices like 𝐏 ∈ [0, 1]𝑛×𝑙 , where 𝑝𝑖,𝑗 denotes the
probability of the 𝑖-th object’s belonging to the 𝑗-th class. Given an example matrix of this kind,
verify that in each row the probabilities sum to 1 (up to a small numeric error). Then, decode such
a matrix by choosing the greatest element in each row.
Nonetheless, rounding can easily introduce tied observations, which are problematic
on their own; see Section 5.5.3.
numpy.searchsorted can be used to determine the interval where each value in mins
falls.
bins = [130, 140, 150]
codes_mins16 = np.searchsorted(bins, mins16)
codes_mins16
## array([0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3])
By default, the intervals are of the form (𝑎, 𝑏] (not including 𝑎, including 𝑏). Code 0
corresponds to values less than the first bin bound, whereas code 3 – greater than or
equal to the last bound:
pandas.cut gives us another interface to the same binning method. It returns a vector-
like object with dtype="category", with very readable labels generated automatically
(and ordered; see Section 11.4.7):
cut_mins16 = pd.Series(pd.cut(mins16, [-np.inf, 130, 140, 150, np.inf]))
cut_mins16.iloc[ [0, 1, 6, 7, 13, 14, 15] ].astype("str") # preview
## 0 (-inf, 130.0]
## 1 (130.0, 140.0]
## 6 (130.0, 140.0]
## 7 (140.0, 150.0]
## 13 (140.0, 150.0]
## 14 (150.0, inf]
## 15 (150.0, inf]
## dtype: object
cut_mins16.cat.categories.astype("str")
## Index(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
## '(150.0, inf]'],
## dtype='object')
Example 11.7 (*) We can create a set of the corresponding categories manually, for example, as
follows:
bins2 = np.r_[-np.inf, bins, np.inf]
np.array(
    # (reconstructed continuation; the exact formatting is an assumption)
    [f"({bins2[i]}, {bins2[i+1]}]" for i in range(len(bins2)-1)]
)
Exercise 11.8 (*) Check out the numpy.histogram_bin_edges function which tries to determ-
ine some informative interval bounds based on a few simple heuristics. Recall that numpy.
linspace and numpy.geomspace can be used for generating equidistant bounds on linear and
logarithmic scales, respectively.
np.random.seed(123)
np.random.choice(
a=["spam", "bacon", "eggs", "tempeh"],
p=[ 0.7, 0.1, 0.15, 0.05],
replace=True,
size=16
)
## array(['spam', 'spam', 'spam', 'spam', 'bacon', 'spam', 'tempeh', 'spam',
## 'spam', 'spam', 'spam', 'bacon', 'spam', 'spam', 'spam', 'bacon'],
## dtype='<U6')
If we generate a sufficiently long vector, we will expect "spam" to occur ca. 70% times,
and "tempeh" to be drawn in 5% of the cases, etc.
counts_cntrs16 = np.bincount(codes_cntrs16)
counts_cntrs16
## array([7, 2, 1, 5, 1])
A vector of counts can easily be turned into a vector of proportions (fractions) or per-
centages (if we multiply them by 100):
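A sketch of this step (the original call is not shown):
counts_cntrs16 / np.sum(counts_cntrs16)  # proportions
## array([0.4375, 0.125 , 0.0625, 0.3125, 0.0625])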
Almost 31.25% of the top runners were from Poland (this marathon is held in Warsaw
after all…).
Exercise 11.9 Using numpy.argsort, sort counts_cntrs16 increasingly together with the cor-
responding items in cat_cntrs16.
The three columns are: sex, age (in 10-year brackets), and country. We can of course
analyse the data distribution in each column individually, but this we leave as an exer-
cise. Instead, we note that some interesting patterns might also arise when we study
the combinations of levels of different variables.
Here are the levels of the sex and age variables:
pd.unique(marathon.sex)
## array(['M', 'F'], dtype=object)
pd.unique(marathon.age)
## array(['20', '30', '50', '40', '60+'], dtype=object)
These can be converted to a two-way contingency table, which is a matrix that gives the
number of occurrences of each pair of values from the two factors:
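The intermediate counts2 object is not constructed above; presumably it was obtained by grouping (the exact call is an assumption):
counts2 = marathon.groupby(["sex", "age"]).size()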
V = counts2.unstack(fill_value=0)
V
## age 20 30 40 50 60+
## sex
## F 240 449 262 43 19
## M 879 2200 1708 541 170
For example, there were 19 women aged at least 60 amongst the marathoners. Jolly
good.
The marginal (one-dimensional) frequency distributions can be recreated by comput-
ing the rowwise and columnwise sums of V:
np.sum(V, axis=1)
## sex
## F 1013
## M 5498
## dtype: int64
np.sum(V, axis=0)
## age
## 20 1119
## 30 2649
## 40 1970
## 50 584
## 60+ 189
## dtype: int64
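The three-way summary counts3 referred to below is not built above; a construction consistent with what follows (an assumption) is given here. Printing it yields a long table with one row per (country, sex, age) combination:
counts3 = (
    marathon.groupby(["country", "sex", "age"]).size().rename("count").reset_index()
)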
The display is in the long format (compare Section 10.6.2), because we cannot nicely
print a three-dimensional array. Yet, we can always partially unstack the dataset, for
aesthetic reasons:
counts3.set_index(["country", "sex", "age"]).unstack()
## count
## age 20 30 40 50 60+
## country sex
## PL F 222.0 422.0 248.0 26.0 8.0
## M 824.0 2081.0 1593.0 475.0 134.0
## SK F NaN NaN NaN 1.0 NaN
## M NaN NaN 1.0 NaN 1.0
Let us again appreciate how versatile the concept of a data frame is. Not only can we represent the data to be investigated (one row per observation, variables possibly of mixed types), but we can also store the results of such analyses (neatly formatted tables).
x = (marathon.age.astype("category")
.cat.reorder_categories(["20", "30", "40", "50", "60+"])
.value_counts(sort=False)
)
x
## 20 1119
## 30 2649
## 40 1970
## 50 584
## 60+ 189
## Name: age, dtype: int64
Bar plots are self-explanatory and hence will do the trick most of the time; see Fig-
ure 11.1.
ind = np.arange(len(x)) # 0, 1, 2, 3, 4
plt.bar(ind, height=x, color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks(ind, x.index)
plt.show()
The ind vector gives the x-coordinates of the bars; here: consecutive integers. By calling
matplotlib.pyplot.xticks we assign them readable labels.
Exercise 11.10 Draw a bar plot for the five most prevalent foreign (i.e., excluding Polish) mara-
thoners’ original whereabouts which features an additional bar that represents “all other” coun-
tries. Depict percentages instead of counts, so that the total bar height is 100%. Assign a different
colour to each bar.
A bar plot is a versatile tool for visualising the counts also in the two-variable case;
see Figure 11.2. Let us use seaborn.barplot now, which is a pleasant wrapper around
matplotlib.pyplot.bar (but, as usual, gives less control over the details):
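The call that produced the figure is not reproduced above; a sketch consistent with it (the exact arguments are assumptions):
sns.barplot(
    x="age", y="count", hue="sex",
    data=marathon.groupby(["sex", "age"]).size().rename("count").reset_index()
)
plt.show()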
Figure 11.3: An example stacked bar plot: Age distribution for different sexes amongst
all the runners
Figure 11.4: Such a great victory! Wait… Just look at the y-axis tick marks.
Important We should always read the axis tick marks. And when drawing our own
bar plots, we must never trick the reader; this is unethical; compare Rule#9.
cat_med = np.array([
"Unauthorised drug", "Wrong IV rate", "Wrong patient", "Dose missed",
"Underdose", "Wrong calculation","Wrong route", "Wrong drug",
"Wrong time", "Technique error", "Duplicated drugs", "Overdose"
])
3 https://www.cec.health.nsw.gov.au/CEC-Academy/quality-improvement-tools/pareto-charts
Let us display the dataset ordered with respect to the counts, decreasingly:
Pareto charts may aid in visualising the datasets where the Pareto principle is likely to
hold, at least approximately. They include bar plots with some extras:
• bars are listed in decreasing order,
• the cumulative percentage curve is added.
The plotting of the Pareto chart is a little tricky, because it involves using two different
Y axes (as usual, fine-tuning the figure and studying the manual of the matplotlib
package is left as an exercise.)
x = np.arange(len(med)) # 0, 1, 2, ...
p = 100.0*med/np.sum(med) # percentages
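# The figure-drawing part itself is not reproduced here; what follows is only a
# sketch of a twin-axis Pareto chart (colours, markers, etc. are assumptions):
fig, ax1 = plt.subplots()
ax1.bar(x, p, color="lightgray", edgecolor="black")  # percentages as bars
ax1.set_ylabel("%")
ax2 = ax1.twinx()                                    # the second Y axis
ax2.plot(x, np.cumsum(p), "o-", color="black")       # cumulative percentage curve
ax2.set_ylabel("cumulative %")
ax1.set_xticks(x)
ax1.set_xticklabels(med.index, rotation=90)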
fig.tight_layout()
plt.show()
Figure 11.6: An example Pareto chart: the most frequent causes for medication errors
In Figure 11.6, we can read that the first five causes (less than 40%) correspond to
ca. 85% of the medication errors. More precisely, the cumulative probabilities are:
med.cumsum()/np.sum(med)
## Dose missed 0.213953
## Wrong time 0.406977
## Wrong drug 0.583721
## Overdose 0.720930
## Wrong patient 0.844186
## Wrong route 0.906977
## Wrong calculation 0.944186
## Duplicated drugs 0.965116
## Underdose 0.981395
## Wrong IV rate 0.990698
## Technique error 0.997674
## Unauthorised drug 1.000000
## dtype: float64
Note that there is an explicit assumption here that a single error is only due to a single
cause. Also, we presume that each medication error has a similar degree of severity.
Policymakers and quality controllers often rely on similar simplifications. They most
probably are going to be addressing only the top causes. If we ever wondered why some
processes (mal)function the way they do, there is a hint above. Inventing something
more effective yet so simple at the same time requires much more effort.
It would be also nice to report the number of cases where no mistakes are made and
the cases where errors are insignificant. Healthcare workers are doing a wonderful job
for our communities, especially in the public system. Why add to their stress?
Figure 11.7: A heatmap for the marathoners’ sex and age category
counts = marathon.country.value_counts()
counts.head()
## PL 6033
## GB 71
Therefore, as far as qualitative data aggregation is concerned, what we are left with is
the mode, i.e., the most frequently occurring value.
It turns out that amongst the fastest 22 runners (a nicely round number), there is a tie
between Kenya and Poland – both meet our definition of a mode:
counts = marathon.country.iloc[:22].value_counts()
counts
## KE 7
## PL 7
## ET 3
## IL 3
## MA 1
## MD 1
## Name: country, dtype: int64
To avoid any bias, it is always best to report all the potential mode candidates:
counts.loc[counts == counts.max()].index
## Index(['KE', 'PL'], dtype='object')
If one value is required, though, we can pick one at random (calling numpy.random.
choice).
np.sum(marathon.country == "PL")
## 6033
This gave the number of elements that are equal to "PL" (because the sum of 0s and 1s
is equal to the number of 1s in the sequence). Note that (country == "PL") is a logical
vector that represents a binary categorical variable with levels: not-Poland (False) and
Poland (True).
If we divide the above result by the length of the vector, we will get the proportion:
np.mean(marathon.country == "PL")
## 0.9265857779142989
About 93% of the runners were from Poland. As this is greater than 0.5, "PL" is definitely the mode.
Exercise 11.14 What is the meaning of numpy.all, numpy.any, numpy.min, numpy.max,
numpy.cumsum, and numpy.cumprod applied on logical vectors?
Note (**) Having the 0/1 (or zero/nonzero) vs False/True correspondence allows us
to perform some logical operations using integer arithmetic. In mathematics, 0 is
the annihilator of multiplication and the neutral element of addition, whereas 1 is the
neutral element of multiplication. In particular, assuming that p and q are logical val-
ues and a and b are numeric ones, we have what follows:
• p+q != 0 means that at least one value is True and p+q == 0 if and only if both are
False;
• more generally, p+q == 2 if both elements are True, p+q == 1 if only one is True (we
call it exclusive-or, XOR), and p+q == 0 if both are False;
• p*q != 0 means that both values are True and p*q == 0 holds whenever at least one
is False;
• 1-p corresponds to the negation of p (changes 1 to 0 and 0 to 1);
• p*a + (1-p)*b is equal to a if p is True and equal to b otherwise.
A goodness-of-fit test verifies whether the differences between the observed proportions 𝑝1̂ , … , 𝑝𝑙̂ and the theoretical ones 𝑝1 , … , 𝑝𝑙 are significantly large or not.
Having such a test is beneficial, e.g., when the data we have at hand are based on small
surveys that are supposed to serve as estimates of what might be happening in a larger
population.
Going back to our political example from Section 11.3.2, it turns out that one of the
pre-election polls indicated that 𝑐 = 516 out of 𝑛 = 1017 people would vote for the
first candidate. We have 𝑝1̂ = 50.74% (Duda) and 𝑝2̂ = 49.26% (Trzaskowski). If we
would like to test whether the observed proportions are significantly different from
each other, we could test them against the theoretical distribution 𝑝1 = 50% and
𝑝2 = 50%, which states that there is a tie between the competitors (up to a sampling
error).
A natural test statistic is based on the relative squared differences:
𝑙 2
(𝑝𝑖̂ − 𝑝𝑖 )
𝑇̂ = 𝑛 ∑ .
𝑖=1
𝑝𝑖
c, n = 516, 1017
p_observed = np.array([c, n-c]) / n
p_expected = np.array([0.5, 0.5])
T = n * np.sum( (p_observed-p_expected)**2 / p_expected )
T
## 0.2212389380530986
Similarly to the continuous case in Section 6.2.3, we should reject the null hypothesis,
if:
𝑇̂ ≥ 𝐾.
The critical value 𝐾 is computed based on the fact that, if the null hypothesis is true, 𝑇̂
follows the 𝜒 2 (chi-squared, hence the name of the test) distribution with 𝑙−1 degrees
of freedom, see scipy.stats.chi2.
We thus need to query the theoretical quantile function to determine the test statistic
that is not exceeded in 99.9% of the trials (under the null hypothesis):
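The query itself is not shown here; presumably it relied on the quantile function of the chi-squared distribution, e.g.:
import scipy.stats
K = scipy.stats.chi2.ppf(1-0.001, len(p_expected)-1)  # about 10.83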
As 𝑇̂ < 𝐾 (because 0.22 < 10.83), we cannot deem the two proportions significantly
different. In other words, this poll did not indicate (at the significance level 0.1%) any
of the candidates as a clear winner.
Exercise 11.15 Assuming 𝑛 = 1017, determine the smallest 𝑐, i.e., the number of respondents
claiming they would vote for Duda, that leads to the rejection of the null hypothesis.
There are 𝑙 = 5 age categories. First, denote the total number of observations in both
groups with 𝑛′ and 𝑛″ .
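The count vectors c1 and c2 are not defined above; values consistent with everything below can be taken from the contingency table V built earlier (this reuse is an assumption):
c1 = np.array(V.loc["F", :])  # female counts per age category
c2 = np.array(V.loc["M", :])  # male counts per age category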
n1 = c1.sum()
n2 = c2.sum()
n1, n2
## (1013, 5498)
The observed proportions in the first group (females), denoted as 𝑝′1̂ , … , 𝑝′𝑙̂ , are, re-
spectively:
p1 = c1/n1
p1
## array([0.23692004, 0.44323791, 0.25863771, 0.04244817, 0.01875617])
Here are the proportions in the second group (males), 𝑝″1̂ , … , 𝑝″𝑙̂ :
p2 = c2/n2
p2
## array([0.15987632, 0.40014551, 0.31065842, 0.09839942, 0.03092033])
We would like to verify whether the corresponding proportions are equal (up to some
sampling error):
In other words, we are interested whether the categorical data in the two groups come
from the same discrete probability distribution.
Taking the estimated expected proportions:
pp = (n1*p1+n2*p2)/(n1+n2)
T = n1 * np.sum( (p1-pp)**2 / pp ) + n2 * np.sum( (p2-pp)**2 / pp )
T
## 75.31373854741857
It can be shown that, if the null hypothesis is true, the test statistic approximately fol-
lows the 𝜒 2 distribution with 𝑙 − 1 degrees of freedom5 . The critical value 𝐾 is equal
to:
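Again only a sketch of the query (not the original listing):
K = scipy.stats.chi2.ppf(1-0.001, 5-1)  # about 18.47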
As 𝑇̂ ≥ 𝐾 (because 75.31 ≥ 18.47), we reject the null hypothesis. And so, the runners’
age distribution differs across sexes (at significance level 0.1%).
l = [
["Arthritis", "Asthma", "Back problems", "Cancer (malignant neoplasms)",
"Chronic obstructive pulmonary disease", "Diabetes mellitus",
"Heart, stroke and vascular disease", "Kidney disease",
"Mental and behavioural conditions", "Osteoporosis"],
["15-44", "45-64", "65+"]
]
C = 1000*np.array([
[ 360.2, 1489.0, 1772.2],
[1069.7, 741.9, 433.7],
[1469.6, 1513.3, 955.3],
[ 28.1, 162.7, 237.5],
[ 103.8, 207.0, 251.9],
5 Notice that [73] in Section 14.3 recommends 𝑙 degrees of freedom, but we do not agree with this rather
informal reasoning. Also, simple Monte Carlo simulations suggest that 𝑙 − 1 is a better candidate.
6 https://www.abs.gov.au/statistics/health/health-conditions-and-risks/
national-health-survey-first-results/2017-18
Cramér’s 𝑉 is one of a few ways to measure the degree of association between two
categorical variables. It is equal to 0 (the lowest possible value) if the two variables are
independent (there is no association between them) and 1 (the highest possible value)
if they are tied.
Given a two-way contingency table 𝐶 with 𝑛 rows and 𝑚 columns and assuming that

$$T = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(c_{i,j} - e_{i,j})^2}{e_{i,j}}, \qquad \text{where} \qquad e_{i,j} = \frac{\left(\sum_{k=1}^{m} c_{i,k}\right)\left(\sum_{k=1}^{n} c_{k,j}\right)}{\sum_{i=1}^{n}\sum_{j=1}^{m} c_{i,j}},$$

the Cramér coefficient is given by $V = \sqrt{T \,/\, \left(\min(n-1, m-1) \sum_{i=1}^{n}\sum_{j=1}^{m} c_{i,j}\right)}$.
scipy.stats.contingency.association(C)
## 0.316237999724298
The above means that there might be a small association between age and the preval-
ence of certain conditions. In other words, it might be the case that some conditions
are more prevalent in different age groups than others.
Exercise 11.16 Compute the Cramér 𝑉 using only numpy functions.
Example 11.17 (**) We can easily verify the hypothesis whether 𝑉 does not differ significantly
from 0, i.e., whether the variables are independent. Looking at 𝑇, we see that this is essentially
the test statistic in Pearson’s chi-squared goodness-of-fit test.
If the data are really independent, 𝑇 follows the chi-squared distribution with 𝑛 + 𝑚 − 1 degrees of freedom. As a con-
sequence, the critical value 𝐾 is equal to:
As 𝑇 is much greater than 𝐾, we conclude (at significance level 0.1%) that the health conditions
are not independent of age.
Exercise 11.18 (*) Take a look at Table 19: Comorbidity of selected chronic conditions in
the National Health Survey 20187 , where we clearly see that many disorders co-occur. Visualise
them on some heatmaps and bar plots (including data grouped by sex and age).
7 https://www.abs.gov.au/statistics/health/health-conditions-and-risks/national-health-survey-first-results/2017-18
8 https://github.com/gagolews/teaching-data/raw/master/marek/uk_income_simulated_2020.txt
grades = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/grades_results.txt", dtype="str")
grades = pd.Series(pd.Categorical(grades,
categories=["F", "P", "C", "D", "HD"], ordered=True))
grades.value_counts() # note the order of labels
## F 30
## P 29
## C 23
## HD 22
## D 19
## dtype: int64
How would you determine the average grade represented as a number between 0 and 100, taking
into account that for a P you need at least 50%, C is given for ≥ 60%, D for ≥ 70%, and HD for only
(!) 80% of the points. Come up with a pessimistic, optimistic, and best-shot estimate, and then
compare your result to the true corresponding scores listed in the grades_scores10 dataset.
9 https://github.com/gagolews/teaching-data/raw/master/marek/grades_results.txt
10 https://github.com/gagolews/teaching-data/raw/master/marek/grades_scores.txt
11.5 Exercises
Exercise 11.21 Does it make sense to compute the arithmetic mean of a categorical variable?
Exercise 11.22 Name the basic use cases for categorical data.
Exercise 11.23 (*) What is a Pareto chart?
Exercise 11.24 How can we deal with the case of the mode being nonunique?
Exercise 11.25 What is the meaning of the sum and mean for binary data (logical vectors)?
Exercise 11.26 What is the meaning of numpy.mean((x > 0) & (x < 1)), where x is a numeric
vector?
Exercise 11.27 List some ways to visualise multidimensional categorical data (combinations of
two or more factors).
Exercise 11.28 (*) State the null hypotheses verified by the one- and two-sample chi-squared
goodness-of-fit tests.
Exercise 11.29 (*) How is Cramér’s V defined and what values does it take?
12
Processing Data in Groups
Let us consider another subset of the US Centres for Disease Control and Prevention
National Health and Nutrition Examination Survey, this time carrying some body
measures (P_BMX1 ) together with demographics (P_DEMO2 ).
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#")
nhanes = (
nhanes
.loc[
(nhanes.DMDBORN4 <= 2) & (nhanes.RIDAGEYR >= 18),
["RIDAGEYR", "BMXWT", "BMXHT", "BMXBMI", "RIAGENDR", "DMDBORN4"]
] # age >= 18 and only US and non-US born
.rename({
"RIDAGEYR": "age",
"BMXWT": "weight",
"BMXHT": "height",
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1) # rename columns
.dropna() # remove missing values
.reset_index(drop=True)
)
We consider only the adult (at least 18 years old) participants, whose country of birth
(the US or not) is well-defined. Let us recode the usborn and gender variables (for read-
ability) and introduce the BMI categories:
nhanes.loc[:, "usborn"] = (
nhanes.usborn.astype("category")
.cat.rename_categories(["yes", "no"]).astype("str") # recode usborn
)
nhanes.loc[:, "gender"] = (
nhanes.gender.astype("category")
    .cat.rename_categories(["male", "female"]).astype("str")  # recode gender; label order assumes RIAGENDR: 1=male, 2=female
)
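The bmicat column previewed below must also have been created around here; a sketch consistent with the categories shown (the bin boundaries are assumptions based on the usual BMI thresholds):
nhanes.loc[:, "bmicat"] = pd.cut(
    nhanes.bmival,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["underweight", "normal", "overweight", "obese"]
)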
1 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm
2 https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DEMO.htm
nhanes.head()
## age weight height bmival gender usborn bmicat
## 0 29 97.1 160.2 37.8 female no obese
## 1 49 98.8 182.3 29.7 male yes overweight
## 2 36 74.3 184.2 21.9 male yes normal
## 3 68 103.7 185.3 30.2 male yes obese
## 4 76 83.3 177.1 26.6 male yes overweight
type(nhanes.groupby("gender"))
## <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
type(nhanes.groupby("gender").height) # or (...)["height"]
## <class 'pandas.core.groupby.generic.SeriesGroupBy'>
Important When we wish to browse the list of available attributes in the pandas
manual, it is worth knowing that DataFrameGroupBy and SeriesGroupBy are separate
types. Still, they have many methods and slots in common, because they both inherit
from (extend) the GroupBy class.
nhanes.groupby("gender").size()
## gender
## female 4514
## male 4271
## dtype: int64
It returns an object of type Series. We can also perform the grouping with respect to
a combination of levels in two qualitative columns:
nhanes.groupby(["gender", "bmicat"]).size()
## gender bmicat
## female underweight 93
## normal 1161
## overweight 1245
## obese 2015
## male underweight 65
## normal 1074
## overweight 1513
## obese 1619
## dtype: int64
This yields a Series with a hierarchical index (as discussed in Section 10.1.3). Never-
theless, we can always call reset_index to convert it to standalone columns:
nhanes.groupby(["gender", "bmicat"]).size().rename("counts").reset_index()
## gender bmicat counts
## 0 female underweight 93
## 1 female normal 1161
## 2 female overweight 1245
## 3 female obese 2015
## 4 male underweight 65
## 5 male normal 1074
## 6 male overweight 1513
## 7 male obese 1619
3 https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
Take note of the rename part. It gave us some readable column names.
Furthermore, it is possible to group rows in a data frame using a list of any Series
objects, i.e., not just column names in a given data frame; see Section 16.2.3 for an
example.
Exercise 12.2 (*) Note the difference between pandas.GroupBy.count and pandas.GroupBy.
size methods (by reading their documentation).
nhanes.groupby("gender").mean(numeric_only=True).reset_index()
## gender age weight height bmival
## 0 female 48.956580 78.351839 160.089189 30.489189
## 1 male 49.653477 88.589932 173.759541 29.243620
The arithmetic mean was computed only on numeric columns4 . Further, a few com-
mon aggregates are generated by describe:
nhanes.groupby("gender").height.describe().reset_index()
## gender count mean std ... 25% 50% 75% max
## 0 female 4514.0 160.089189 7.035483 ... 155.3 160.0 164.8 189.3
## 1 male 4271.0 173.759541 7.702224 ... 168.5 173.8 178.9 199.6
##
## [2 rows x 9 columns]
(nhanes.
loc[:, ["gender", "height", "weight"]].
groupby("gender").
aggregate([np.mean, np.median, len, lambda x: (np.max(x)+np.min(x))/2]).
reset_index()
)
## gender height ... weight
## mean median len ... mean median len <lambda_0>
## 0 female 160.089189 160.0 4514 ... 78.351839 74.1 4514 143.45
## 1 male 173.759541 173.8 4271 ... 88.589932 85.0 4271 139.70
##
## [2 rows x 9 columns]
4 (*) In this example, we called pandas.GroupBy.mean. Note that it has slightly different functionality from
Note The column names in the output object are generated by reading the applied
functions’ __name__ slots, see, e.g., print(np.mean.__name__).
mr = lambda x: (np.max(x)+np.min(x))/2
mr.__name__ = "midrange"
(nhanes.
loc[:, ["gender", "height", "weight"]].
groupby("gender").
aggregate([np.mean, mr]).
reset_index()
)
## gender height weight
## mean midrange mean midrange
## 0 female 160.089189 160.2 78.351839 143.45
## 1 male 173.759541 172.1 88.589932 139.70
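The helper std0 used below (the standard deviation with ddof=0) is not defined above; a definition consistent with its later use would be:
def std0(x, axis=None):
    # standard deviation with the denominator n (not n-1)
    return np.std(x, axis=axis, ddof=0)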
def standardise(x):
return (x-np.mean(x, axis=0))/std0(x, axis=0)
nhanes["height_std"] = (
nhanes.
loc[:, ["height", "gender"]].
groupby("gender").
transform(standardise)
)
nhanes.head()
## age weight height bmival gender usborn bmicat height_std
## 0 29 97.1 160.2 37.8 female no obese 0.015752
## 1 49 98.8 182.3 29.7 male yes overweight 1.108960
## 2 36 74.3 184.2 21.9 male yes normal 1.355671
## 3 68 103.7 185.3 30.2 male yes obese 1.498504
## 4 76 83.3 177.1 26.6 male yes overweight 0.433751
The new column gives the relative z-scores: a woman with a relative z-score of 0 has
height of 160.1 cm, whereas a man with the same z-score has height of 173.8 cm.
We can check that the means and standard deviations in both groups are equal to 0
and 1:
(nhanes.
loc[:, ["gender", "height", "height_std"]].
groupby("gender").
aggregate([np.mean, std0])
)
## height height_std
## mean std0 mean std0
## gender
## female 160.089189 7.034703 -1.351747e-15 1.0
## male 173.759541 7.701323 3.145329e-16 1.0
grouped = (nhanes.head()
.loc[:, ["gender", "weight", "height"]].groupby("gender")
)
list(grouped)
## [('female', gender weight height
## 0 female 97.1 160.2), ('male', gender weight height
## 1 male 98.8 182.3
## 2 male 74.3 184.2
## 3 male 103.7 185.3
## 4 male 83.3 177.1)]
The way Python formatted the above output is imperfect, so we need to contemplate it
for a tick. We see that when iterating through a GroupBy object, we get access to pairs
giving all the levels of the grouping variable and the subsets of the input data frame
corresponding to these categories.
Here is a simple example where we make use of the above fact:
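A minimal sketch (not the original listing): iterating over the grouped object manually.
for level, subdf in grouped:
    print(f"{level}: {len(subdf)} row(s), mean weight {subdf.weight.mean():.1f}")
## female: 1 row(s), mean weight 97.1
## male: 4 row(s), mean weight 90.0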
We see that splitting followed by manual processing of the chunks in a loop is quite te-
dious in the case where we would merely like to compute some basic aggregates. These
scenarios are extremely common. No wonder why the pandas developers introduced a
convenient interface in the form of the pandas.DataFrame.groupby and pandas.Series.
groupby methods and the DataFrameGroupBy and SeriesGroupBy classes. Still, for more
ambitious tasks, the low-level way to perform the splitting will come in handy.
Exercise 12.4 (**) Using the manual splitting and matplotlib.pyplot.boxplot, draw a
box-and-whisker plot of heights grouped by BMI category (four boxes side by side).
Exercise 12.5 (**) Using the manual splitting, compute the relative z-scores of the height
column separately for each BMI category.
Example 12.6 Let us also demonstrate that the splitting can be done manually without the use
of pandas. Namely, calling numpy.split(a, ind) returns a list with a (being an array-like ob-
ject, e.g., a vector, a matrix, or a data frame) partitioned rowwisely into len(ind)+1 chunks at
indexes given by ind. For example:
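A toy illustration (not the original example):
np.split(np.arange(10), [3, 7])
## [array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])]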
To split a data frame into groups defined by a categorical column, we can first sort it with respect
to the criterion of interest, for instance, the gender data:
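Presumably (the exact call is an assumption; a stable sort keeps the original relative order within each group):
nhanes_srt = nhanes.sort_values("gender", kind="stable")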
Then, we can use numpy.unique to fetch the indexes of first occurrences of each series of identical
labels:
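A sketch (levels and where are the names used in what follows):
levels, where = np.unique(nhanes_srt.gender, return_index=True)
# levels – the unique labels; where – the position of each label's first occurrence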
This can now be used for dividing the sorted data frame into chunks:
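A sketch:
nhanes_grp = np.split(nhanes_srt, where[1:])  # numpy.split accepts data frames, as noted above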
We obtained a list of data frames split at rows specified by where[1:]. Here is a preview of the
first and the last row in each chunk:
for i in range(len(levels)):
    # process (levels[i], nhanes_grp[i])
    print(f"level='{levels[i]}'; preview:")
    print(nhanes_grp[i].iloc[[0, -1], :], end="\n\n")
## level='female'; preview:
## age weight height bmival gender usborn bmicat height_std
## 0 29 97.1 160.2 37.8 female no obese 0.015752
## 8781 67 82.8 147.8 37.9 female no obese -1.746938
##
## level='male'; preview:
## age weight height bmival gender usborn bmicat height_std
## 1 49 98.8 182.3 29.7 male yes overweight 1.108960
## 8784 74 59.7 167.5 21.3 male no normal -0.812788
Within each subgroup, we can apply any operation we have learnt so far: our imagination is the
only major limiting factor. For instance, we can aggregate some columns:
nhanes_agg = [
dict(
level=t.gender.iloc[0], # they are all the same here – take first
height_mean=np.round(np.mean(t.height), 2),
weight_mean=np.round(np.mean(t.weight), 2)
)
for t in nhanes_grp
]
print(nhanes_agg[0])
## {'level': 'female', 'height_mean': 160.09, 'weight_mean': 78.35}
print(nhanes_agg[1])
## {'level': 'male', 'height_mean': 173.76, 'weight_mean': 88.59}
pd.DataFrame(nhanes_agg)
## level height_mean weight_mean
## 0 female 160.09 78.35
## 1 male 173.76 88.59
Furthermore, a simple trick to allow grouping with respect to more than one column is to apply
numpy.unique on a string vector that combines the levels of the grouping variables, e.g., by con-
catenating them like nhanes_srt.gender + "___" + nhanes_srt.bmicat (assuming that
nhanes_srt is ordered with respect to these two criteria).
Figure 12.1: The distribution of BMIs (bmival) for different genders (gender) and countries of birth (usborn)
Let us contemplate for a while how easy it is now to compare the BMI distribution in
different groups. Here, we have two grouping variables, as specified by the y and hue
arguments.
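The figure could have been generated with a call along these lines (a sketch; the palette is an assumption):
sns.boxplot(data=nhanes, x="bmival", y="gender", hue="usborn", palette="Paired")
plt.show()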
Exercise 12.7 Create a similar series of violin plots.
Exercise 12.8 (*) Add the average BMIs in each group to the above box plot using matplotlib.
pyplot.plot. Check ylim to determine the range on the y-axis.
sns.barplot(
y="counts", x="gender", hue="bmicat", palette="Paired",
data=(
nhanes.
groupby(["gender", "bmicat"]).
size().
rename("counts").
reset_index()
)
)
plt.show()
Figure 12.2: Number of persons (counts) for each gender and BMI category (bmicat)
Exercise 12.9 Draw a similar bar plot where the bar heights sum to 100% for each gender.
Exercise 12.10 Using the two-sample chi-squared test, verify whether the BMI category distri-
butions for men and women differ significantly from each other.
Figure 12.3: The weight distribution of the US-born participants has a higher mean and variance (density estimates of weight, grouped by usborn)
Important Grid plots can bear any kind of data visualisation we have discussed so far
(e.g., histograms, bar plots, scatterplots).
Exercise 12.12 Draw a trellis plot with scatterplots of weight vs height grouped by BMI cat-
egory and gender.
We have used manual splitting of the weight column into subgroups and then
plotted the two ECDFs separately, because a call to seaborn.ecdfplot(data=nhanes,
x="weight", hue="usborn") does not honour our wish to use alternating lines styles
(most likely due to a bug).
A two-sample Kolmogorov–Smirnov test can be used to check whether two ECDFs $\hat{F}_n'$
(e.g., of the weight of the US-born participants) and $\hat{F}_m''$ (e.g., of the weight of
non-US-born persons) are significantly different from each other. The test statistic is a
variation of the one used in the one-sample setting discussed earlier:

$$ \hat{D}_{n,m} = \sup_{t \in \mathbb{R}} \left| \hat{F}_n'(t) - \hat{F}_m''(t) \right|. $$
Figure 12.5: Distribution of weights for different genders and countries of birth (density estimates in panels defined by usborn and gender)
Computing the above is slightly trickier than in the previous case5 , but luckily an ap-
propriate procedure is already implemented in scipy.stats:
x12 = nhanes.set_index("usborn").weight
x1 = x12.loc["yes"] # first sample
x2 = x12.loc["no"] # second sample
Dnm = scipy.stats.ks_2samp(x1, x2)[0]
5 Remember that this is an introductory course, and we are still being very generous here. We encourage
the readers to upskill themselves (later, of course) not only in mathematics, but also in programming (e.g.,
algorithms and data structures).
Figure: The empirical cumulative distribution functions (Proportion vs weight) of the US-born (usborn=yes) and non-US-born (usborn=no) participants
Assuming significance level $\alpha = 0.001$, the critical value is approximately (for larger
$n$ and $m$) equal to:

$$ K_{n,m} = \sqrt{ -\frac{\log(\alpha/2)\,(n+m)}{2nm} }. $$
alpha = 0.001
np.sqrt(-np.log(alpha/2) * (len(x1)+len(x2)) / (2*len(x1)*len(x2)))
## 0.04607410479813944
As usual, we reject the null hypothesis when $\hat{D}_{n,m} \ge K_{n,m}$, which is exactly the case
here (at significance level 0.1%). In other words, weights of US- and non-US-born
participants differ significantly.
Important Frequentist hypothesis testing only takes into account the deviation
between distributions that is explainable due to sampling effects (the assumed ran-
domness of the data generation process). For large sample sizes, even very small de-
viations6 will be deemed statistically significant, but it does not mean that we should
consider them as practically significant. For instance, if a very costly, environmentally
unfriendly, and generally inconvenient for everyone upgrade leads to a process' improvement
such that we reject the null hypothesis stating that two distributions are equal, but it
turns out that the gains are ca. 0.5%, the good old common sense should be applied.
6 Including those that are merely due to round-off errors.
Exercise 12.13 Compare between the ECDFs of weights of men and women who are between 18
and 25 years old. Determine whether they are significantly different.
Important Some statistical textbooks and many research papers in the social sci-
ences (amongst many others) employ the significance level of 𝛼 = 5%, which is of-
ten criticised as too high7 . Many stakeholders aggressively push towards constant im-
provements in terms of inventing bigger, better, faster, more efficient things. In this
context, larger 𝛼 allows for generating more sensational discoveries. This is because it
considers smaller differences as already significant. This all adds to what we call the
reproducibility crisis in the empirical sciences.
We, on the other hand, claim that it is better to err on the side of being cautious. This,
in the long run, is more sustainable.
x = nhanes.weight.loc[nhanes.usborn == "yes"]
y = nhanes.weight.loc[nhanes.usborn == "no"]
xd = np.sort(x)
yd = np.sort(y)
if len(xd) > len(yd): # interpolate between quantiles in a longer sample
xd = np.quantile(xd, np.arange(1, len(yd)+1)/(len(yd)+1))
else:
yd = np.quantile(yd, np.arange(1, len(xd)+1)/(len(xd)+1))
plt.plot(xd, yd, "o")
plt.axline((xd[len(xd)//2], xd[len(xd)//2]), slope=1,
linestyle=":", color="gray") # identity line
plt.xlabel(f"Sample quantiles (weight; usborn=yes)")
plt.ylabel(f"Sample quantiles (weight; usborn=no)")
plt.show()
7 For similar reasons, we do not introduce the notion of p-values. Most practitioners tend to misunderstand them anyway.
Figure: A Q–Q plot of the sample quantiles of weight for usborn=yes (x-axis) vs usborn=no (y-axis)
Notice that we interpolated between the quantiles in a larger sample to match the
length of the shorter vector.
wine_train = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/sweetwhitewine_train2.csv",
comment="#")
wine_train.head()
## alcohol sugar bad
## 0 10.625271 10.340159 0
## 1 9.066111 18.593274 1
## 2 10.806395 6.206685 0
## 3 13.432876 2.739529 0
## 4 9.578162 3.053025 0
We are given each wine’s alcohol and residual sugar content, as well as a binary cat-
egorical variable stating whether a group of sommeliers deem a given beverage quite
bad (1) or not (0). Figure 12.8 reveals that subpar wines are rather low in… alcohol and,
to some extent, sugar.
8 http://archive.ics.uci.edu/ml/datasets/Wine+Quality
Figure 12.8: Scatterplot for sugar vs alcohol content for white, rather sweet wines, and
whether they are considered bad (1) or drinkable (0) by some experts
Someone answer the door! We have a delivery of a few new wine bottles. Interestingly,
their alcohol and sugar contents have been given on their respective labels.
wine_test = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/sweetwhitewine_test2.csv",
comment="#").iloc[:, :-1]
wine_test.head()
## alcohol sugar
## 0 9.315523 10.041971
## 1 12.909232 6.814249
## 2 9.051020 12.818683
## 3 9.567601 11.091827
## 4 9.494031 12.053790
We would like to determine which of the wines from the test set might be not-bad
without asking an expert for their opinion. In other words, we would like to exercise
a classification task (see, e.g., [8, 47]). More formally:
Important Assume we are given a set of training points $\mathbf{X} \in \mathbb{R}^{n \times m}$ and the cor-
responding reference outputs $\boldsymbol{y} \in \{L_1, L_2, \dots, L_l\}^n$ in the form of a categorical vari-
able with $l$ distinct levels. The aim of a classification algorithm is to predict what the
outputs for each point from a possibly different dataset $\mathbf{X}' \in \mathbb{R}^{n' \times m}$ should be,
i.e., to determine $\hat{\boldsymbol{y}}' \in \{L_1, L_2, \dots, L_l\}^{n'}$.
In other words, we are asked to fill the gaps in a categorical variable. Recall that in a
regression problem (Section 9.2), the reference outputs were numerical.
Exercise 12.14 Which of the following are instances of classification problems and which are
regression tasks?
• Detect email spam.
• Predict a market stock price (good luck with that).
• Assess credit risk.
• Detect tumour tissues in medical images.
• Predict time-to-recovery of cancer patients.
• Recognise smiling faces on photographs (kind of creepy).
• Detect unattended luggage in airport security camera footage.
What kind of data should you gather to tackle them?
The k-nearest neighbour classifier assigns a label to a new point $\boldsymbol{x}'$ as follows:
1. Find the indexes $i_1, \dots, i_k$ of the $k$ points in the training set that are closest to $\boldsymbol{x}'$
(its $k$ nearest neighbours).
2. Classify $\boldsymbol{x}'$ as $\hat{y}' = \mathrm{mode}(y_{i_1}, \dots, y_{i_k})$, i.e., assign it the label that most frequently
occurs amongst its $k$ nearest neighbours. If a mode is nonunique, resolve the ties,
for example, at random.
It is thus a similar algorithm to k-nearest neighbour regression (Section 9.2.1). We
only replaced the quantitative mean with the qualitative mode.
This is a variation on the theme: if you don’t know what to do in a given situation, try to
mimic what the majority of people around you are doing. Or, if you don’t know what
to think about a particular wine, but amongst the five most similar ones (in terms of
alcohol and sugar content) three were said to be awful, say that you don’t like it because
it’s not sweet enough. Thanks to this, others will take you for a very refined wine taster.
Let us apply a 5-nearest neighbour classifier on the standardised version of the data-
set: as we are about to use a technique based on pairwise distances, it would be best
if the variables were on the same scale. Thus, we first compute the z-scores for the
training set:
X_train = np.array(wine_train.loc[:, ["alcohol", "sugar"]])
means = np.mean(X_train, axis=0)
sds = np.std(X_train, axis=0)
Z_train = (X_train-means)/sds
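The test sample must be standardised in exactly the same way (a sketch; the names X_test and Z_test are assumed here, with Z_test referred to later):
X_test = np.array(wine_test.loc[:, ["alcohol", "sugar"]])
Z_test = (X_test-means)/sds  # the training sample's means and sds, not the test ones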
Let us stress that we referred to the aggregates computed for the training set. This
is a good example of a situation where we cannot simply use a built-in method from
pandas. Instead, we apply what we have learned about numpy.
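The helper function knn_class called below is not reproduced in this excerpt; here is a sketch consistent with the description that follows (the exact implementation may differ; scipy.stats.mode is assumed, as suggested by the note further down):
import scipy.spatial.distance
import scipy.stats
def knn_class(X_test, X_train, y_train, k):
    # distances between each test point and all the training points:
    D = scipy.spatial.distance.cdist(X_test, X_train)
    # indexes of each test point's k nearest neighbours:
    nnis = np.argsort(D, axis=1)[:, :k]
    # the neighbours' labels – a matrix with k columns:
    nnls = y_train[nnis]
    # the mode of each row gives the predicted label:
    return scipy.stats.mode(nnls, axis=1)[0].reshape(-1)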
First, we fetched the indexes of each test point’s nearest neighbours (amongst the
points in the training set). Then, we read their corresponding labels; they are stored
in a matrix with 𝑘 columns. Finally, we computed the modes in each row. As a con-
sequence, we have each point in the test set classified.
And now:
k = 5
y_train = np.array(wine_train.bad)
y_pred = knn_class(Z_test, Z_train, y_train, k)
y_pred[:10] # preview
## array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
Note Unfortunately, scipy.stats.mode does not resolve possible ties at random: e.g.,
the mode of (1, 1, 1, 2, 2, 2) is always 1. Nevertheless, in our case, 𝑘 is odd and the num-
ber of possible classes is 𝑙 = 2, so the mode is always unique.
Figure 12.9 shows how nearest neighbour classification categorises different regions
of a section of the two-dimensional plane. The greater the 𝑘, the smoother the de-
cision boundaries. Naturally, in regions corresponding to few training points, we do
not expect the classification accuracy to be good enough9 .
x1 = np.linspace(Z_train[:, 0].min(), Z_train[:, 0].max(), 100)
x2 = np.linspace(Z_train[:, 1].min(), Z_train[:, 1].max(), 100)
xg1, xg2 = np.meshgrid(x1, x2)
Xg12 = np.column_stack((xg1.reshape(-1), xg2.reshape(-1)))
ks = [5, 25]
for i in range(len(ks)):
    plt.subplot(1, len(ks), i+1)
    yg12 = knn_class(Xg12, Z_train, y_train, ks[i])
    plt.scatter(Z_train[y_train == 0, 0], Z_train[y_train == 0, 1],
        c="black", marker="o", alpha=0.5)
    plt.scatter(Z_train[y_train == 1, 0], Z_train[y_train == 1, 1],
        c="#DF536B", marker="v", alpha=0.5)
    plt.contourf(x1, x2, yg12.reshape(len(x2), len(x1)),
        cmap="gist_heat", alpha=0.5)
    plt.title(f"$k={ks[i]}$", fontdict=dict(fontsize=10))
    plt.xlabel("alcohol")
    if i == 0: plt.ylabel("sugar")
plt.show()
import sklearn.neighbors
knn = sklearn.neighbors.KNeighborsClassifier(k)
knn.fit(Z_train, y_train)
y_pred2 = knn.predict(Z_test)
We can verify that the results are identical to the ones above by calling:
np.all(y_pred2 == y_pred)
## True
y_test = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/sweetwhitewine_test2.csv",
comment="#")
y_test = np.array(y_test.bad)
y_test[:10] # preview
## array([1, 0, 0, 1, 0, 0, 1, 0, 1, 1])
The accuracy score is the most straightforward measure of the similarity between these
true labels (denoted $\boldsymbol{y}' = (y_1', \dots, y_{n'}')$) and the ones predicted by the classifier (de-
noted $\hat{\boldsymbol{y}}' = (\hat{y}_1', \dots, \hat{y}_{n'}')$). It is defined as a ratio between the correctly classified in-
stances and all the instances:

$$ \mathrm{Accuracy}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{ \sum_{i=1}^{n'} \mathbf{1}(y_i' = \hat{y}_i') }{ n' }, $$

where the indicator function $\mathbf{1}(y_i' = \hat{y}_i') = 1$ if and only if $y_i' = \hat{y}_i'$ and 0 otherwise.
Computing the above for our test sample gives:
np.mean(y_test == y_pred)
## 0.706
Thus, 71% of the wines were correctly classified with regard to their true quality. Before
we get too enthusiastic, let us note that our dataset is slightly imbalanced in terms of
the distribution of label counts:
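For example (a sketch; the counts follow from the figures quoted in the next paragraph):
np.bincount(y_test)  # the number of 0s ("good") and 1s ("bad")
## array([330, 170])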
It turns out that the majority of the wines (330 out of 500) in our sample are truly good.
Notice that a dummy classifier which labels all the wines as great would have accur-
acy of 66%. Our k-nearest neighbour approach to wine quality assessment is not that
usable after all.
C = pd.DataFrame(
dict(y_pred=y_pred, y_test=y_test)
).value_counts().unstack(fill_value=0)
C
## y_test 0 1
## y_pred
## 0 272 89
## 1 58 81
In the binary classification case (𝑙 = 2) such as this one, its entries are usually referred
to as (see also the table below):
• TN – the number of cases where the true $y_i' = 0$ and the predicted $\hat{y}_i' = 0$ (true
negative),
• TP – the number of instances such that the true $y_i' = 1$ and the predicted $\hat{y}_i' = 1$
(true positive),
• FN – how many times the true $y_i' = 1$ but the predicted $\hat{y}_i' = 0$ (false negative),
• FP – how many times the true $y_i' = 0$ but the predicted $\hat{y}_i' = 1$ (false positive).
The terms positive and negative refer to the output predicted by a classifier, i.e., they
indicate whether some 𝑦𝑖′̂ is equal to 1 and 0, respectively.
Table 12.1: The different cases of true vs predicted labels in a binary classification task
(l = 2)

                          true y_i' = 0           true y_i' = 1
  predicted ŷ_i' = 0      true negative (TN)      false negative (FN)
  predicted ŷ_i' = 1      false positive (FP)     true positive (TP)
Ideally, the number of false positives and false negatives should be as low as possible.
The accuracy score only takes the raw number of true negatives (TN) and true positives
(TP) into account:
$$ \mathrm{Accuracy}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\mathrm{TN} + \mathrm{TP}}{\mathrm{TN} + \mathrm{TP} + \mathrm{FN} + \mathrm{FP}}. $$
Consequently, it might not be a good metric in imbalanced classification problems.
There are, fortunately, some more meaningful measures in the case where class 1 is
less prevalent and where mispredicting it is considered more hazardous than making
an inaccurate prediction with respect to class 0. This is because most will agree that
it is better to be surprised by a vino mislabelled as bad, than be disappointed with a
highly recommended product where we have already built some expectations around
it. Further, not getting diagnosed as having COVID-19 where we are genuinely sick
can be more dangerous for the people around us than being asked to stay at home
with nothing but a headache.
Precision answers the question: If the classifier outputs 1, what is the probability that
this is indeed true?
$$ \mathrm{Precision}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{ \sum_{i=1}^{n'} y_i' \hat{y}_i' }{ \sum_{i=1}^{n'} \hat{y}_i' }. $$
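A sketch of the corresponding computation (assuming C is first converted to an ordinary numpy matrix; positional indexing is also relied upon in the snippets below):
C = np.array(C)
C[1,1]/(C[1,1]+C[1,0])  # precision: 81/(81+58), i.e., roughly 0.58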
Recall (sensitivity, hit rate, or true positive rate) addresses the question: If the true class
is 1, what is the probability that the classifier will detect it?
$$ \mathrm{Recall}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{ \sum_{i=1}^{n'} y_i' \hat{y}_i' }{ \sum_{i=1}^{n'} y_i' }. $$
C[1,1]/(C[1,1]+C[0,1]) # recall
## 0.4764705882352941
np.sum(y_test*y_pred)/np.sum(y_test) # equivalently
## 0.4764705882352941
Only 48% of the really bad wines will be filtered out by the classifier.
F-measure (or $F_1$-measure) is the harmonic10 mean of precision and recall in the case
where we would rather have them aggregated into a single number:

$$ \mathrm{F}(\boldsymbol{y}', \hat{\boldsymbol{y}}') = \frac{1}{ \frac{1}{2} \left( \frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}} \right) } = \left( \frac{ \mathrm{Precision}^{-1} + \mathrm{Recall}^{-1} }{ 2 } \right)^{-1} = \frac{\mathrm{TP}}{ \mathrm{TP} + \frac{\mathrm{FP} + \mathrm{FN}}{2} }. $$

10 (*) For any vector of nonnegative values, its minimum ≤ its harmonic mean ≤ its arithmetic mean.
C[1,1]/(C[1,1]+0.5*C[0,1]+0.5*C[1,0]) # F
## 0.5242718446601942
Exercise 12.17 Determine the best parameter setting for the k-nearest neighbour classification
of the color variable based on standardised versions of some physicochemical features (chosen
columns) of wines in the wine_quality_all11 dataset. Create a 60/20/20% dataset split. For
each 𝑘 = 1, 3, 5, 7, 9, compute the corresponding F-measure on the validation set. Evaluate the
quality of the best classifier on the test set.
Exercise 12.18 (**) Redo the above exercise (assessing the wine colour classifiers), but this time
maximise the F-measure obtained by a 5-fold cross-validation.
means, l-means, or ü-means method too. Nevertheless, some mainstream practitioners consider k-means
as a kind of a brand name, let us thus refrain from adding to their confusion. Interestingly, another widely
known algorithm is called fuzzy (weighted) c-means [6].
And this is why we refer to the above objective function as the (total) within-cluster sum
of squares (WCSS). This problem looks easier, but let us not be tricked; the $l_i$s depend on
the $\boldsymbol{c}_j$s. They vary together. We have just made it less explicit.
It can be shown that, given a fixed label vector $\boldsymbol{l}$ representing a partition, $\boldsymbol{c}_j$ must be
the centroid (Section 8.4.2) of the points assigned thereto:

$$ \boldsymbol{c}_j = \frac{1}{n_j} \sum_{i: l_i = j} \mathbf{x}_{i,\cdot}, $$

where $n_j = |\{i : l_i = j\}|$ gives the number of $i$s such that $l_i = j$, i.e., the size of the $j$-th
cluster.
X = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/blobs1.txt", delimiter=",")
import scipy.cluster.vq
C, l = scipy.cluster.vq.kmeans2(X, 2)
The discovered cluster centres are stored in a matrix with 𝑘 rows and 𝑚 columns, i.e.,
the 𝑗-th row gives 𝐜𝑗 .
C
## array([[ 0.99622971, 1.052801 ],
## [-0.90041365, -1.08411794]])
l
## array([1, 1, 1, ..., 0, 0, 0], dtype=int32)
As usual in Python, indexing starts at 0. So for 𝑘 = 2 we only obtain the labels 0 and 1.
Figure 12.10 depicts the two clusters together with the cluster centroids. We use l as
a colour selector in my_colours[l] (this is a clever instance of the integer vector-based
indexing). It seems that we correctly discovered the very natural partitioning of this
dataset into two clusters.
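The plotting code is not reproduced here; a sketch (my_colours and the centroids' styling are assumptions):
my_colours = np.array(["black", "#DF536B"])  # one colour per cluster label
plt.scatter(X[:, 0], X[:, 1], c=my_colours[l], alpha=0.5)
plt.scatter(C[:, 0], C[:, 1], marker="X", s=200, c="gold", edgecolors="black")
plt.axis("equal")
plt.show()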
Figure 12.10: Two clusters discovered by the k-means method; cluster centroids are
marked separately
The label vector l can be added as a new column in the dataset. Here is a preview:
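A sketch (the column names X1, X2, and l match the output below):
Xl = pd.DataFrame(X, columns=["X1", "X2"])
Xl.loc[:, "l"] = l
Xl.head()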
We can now enjoy all the techniques for processing data in groups that we have dis-
cussed so far. In particular, computing the columnwise means gives nothing else than
the above cluster centroids:
Xl.groupby("l").mean()
## X1 X2
## l
## 0 0.996230 1.052801
## 1 -0.900414 -1.084118
The label vector l can be recreated by computing the distances between all the points
and the centroids and then picking the indexes of the closest pivots:
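A sketch:
import scipy.spatial.distance
D = scipy.spatial.distance.cdist(X, C)  # distances to the two centroids
np.all(np.argmin(D, axis=1) == l)       # should give True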
Important By construction13 , the k-means method can only detect clusters of convex
shapes (such as Gaussian blobs).
Exercise 12.19 Perform the clustering of the wut_isolation14 dataset and notice how non-
sensical, geometrically speaking, the returned clusters are.
Exercise 12.20 Determine a clustering of the wut_twosplashes15 dataset and display the res-
ults on a scatterplot. Compare them with those obtained on the standardised version of the data-
set. Recall what we said about the Euclidean distance and its perception being disturbed when a
plot’s aspect ratio is not 1:1.
Note (*) An even simpler classifier than the k-nearest neighbours one described
above builds upon the concept of the nearest centroids. Namely, it first determines the
centroids (componentwise arithmetic means) of the points in each class. Then, a new
point (from the test set) is assigned to the class whose centroid is the closest thereto.
The implementation of such a classifier is left as a rather straightforward exercise to the reader.
13 (*) And its relation to Voronoi diagrams.
14 https://github.com/gagolews/teaching-data/raw/master/clustering/wut_isolation.csv
15 https://github.com/gagolews/teaching-data/raw/master/clustering/wut_twosplashes.csv
3. Compute the centroids of the clusters defined by the label vector $\boldsymbol{l}$, i.e., for every
$j = 1, 2, \dots, k$:

$$ \boldsymbol{c}_j = \frac{1}{n_j} \sum_{i: l_i = j} \mathbf{x}_{i,\cdot}. $$
Figure 12.11: An example function (of only one variable; our problem is much higher-
dimensional) with many local minima; how can we be sure there is no better minimum
outside of the depicted interval?
Typically, the initial cluster centres are picked at random. Let us test the algorithm's behaviour by analysing three
chosen country-wise categories from the 2016 Sustainable Society Indices16 dataset.
ssi = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/ssi_2016_categories.csv",
comment="#")
X = ssi.set_index("Country").loc[:,
["PersonalDevelopmentAndHealth", "WellBalancedSociety", "Economy"]
].rename({
"PersonalDevelopmentAndHealth": "Health",
"WellBalancedSociety": "Balance",
"Economy": "Economy"
}, axis=1) # rename columns
n = X.shape[0]
X.loc[["Australia", "Germany", "Poland", "United States"], :] # preview
## Health Balance Economy
## Country
## Australia 8.590927 6.105539 7.593052
## Germany 8.629024 8.036620 5.575906
## Poland 8.265950 7.331700 5.989513
## United States 8.357395 5.069076 3.756943
k = 3
np.random.seed(123) # reproducibility matters
C1, l1 = scipy.cluster.vq.kmeans2(X, k)
C1
## array([[7.99945084, 6.50033648, 4.36537659],
## [7.6370645 , 4.54396676, 6.89893746],
## [6.24317074, 3.17968018, 3.60779268]])
The objective function (total within-cluster sum of squares) at the returned cluster
centres is equal to:
import scipy.spatial.distance
def get_wcss(X, C):
    D = scipy.spatial.distance.cdist(X, C)**2
    return np.sum(np.min(D, axis=1))
get_wcss(X, C1)
## 446.5221283436733
Is it good or not necessarily? We are unable to tell. What we can do, however, is to run
the algorithm again, this time from a different starting point:
16 https://ssi.wi.th-koeln.de/
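The code for this second run is not included in this excerpt; a sketch (the seed is arbitrary, so the resulting centres C2, labels l2, and WCSS will vary):
C2, l2 = scipy.cluster.vq.kmeans2(X, k, seed=1)
get_wcss(X, C2)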
It is a better solution (we are lucky; it might as well have been worse). But is it the best
possible? Again, we cannot tell, alone in the dark.
Does a potential suboptimality affect the way the data points are grouped? It is indeed
the case here. Let us look at the contingency table for the two label vectors:
pd.DataFrame(dict(l1=l1, l2=l2)).value_counts().unstack(fill_value=0)
## l2 0 1 2
## l1
## 0 8 0 43
## 1 39 6 0
## 2 0 57 1
Important Clusters are essentially unordered. The label vector (1, 1, 2, 2, 1, 3) repres-
ents the same clustering as the label vectors (3, 3, 2, 2, 3, 1) and (2, 2, 3, 3, 2, 1).
Much better. It turns out that 8+6+1 countries are categorised differently. We would
definitely not want to initiate any diplomatic crisis because of our not knowing that
the above algorithm might return suboptimal solutions.
Exercise 12.22 (*) Determine which countries are affected.
wcss, Cs = [], []
for i in range(1000):
    C, l = scipy.cluster.vq.kmeans2(X, k, seed=i)
    Cs.append(C)
    wcss.append(get_wcss(X, C))
The best of the local minima (no guarantee that it is the global one, again) is:
np.min(wcss)
## 437.51120966832775
Cs[np.argmin(wcss)]
## array([[7.80779013, 5.19409177, 6.97790733],
## [7.92606993, 6.35691349, 3.91202972],
## [6.31794579, 3.12048584, 3.84519706]])
They are the same as C2 above (up to a permutation of labels). We were lucky18 , after
all.
It is very educational to look at the distribution of the objective function at the iden-
tified local minima to see that, proportionally, in the case of this dataset it is not rare
to end up in a quite bad solution; see Figure 12.12.
plt.hist(wcss, bins=100)
plt.show()
Also, Figure 12.13 depicts all the cluster centres to which the algorithm converged. We
see that we should not be trusting the results generated by a single run of a heuristic
solver to the k-means problem.
Example 12.23 (*) The scikit-learn package implements an algorithm that is similar to the
Lloyd’s one. The method is equipped with the n_init parameter (which defaults to 10) which
automatically applies the aforementioned restarting.
17 If we have many different heuristics, each aiming to approximate a solution to the k-means problem,
from the practical point of view it does not really matter which one returns the best solution – they are
merely our tools to achieve a higher goal. Ideally, we should run all of them many times and get the result
that corresponds to the smallest WCSS. It is crucial to do our best to find the optimal set of cluster centres –
the more approaches we test, the better the chance of success.
18 Mind who is the benevolent dictator of the pseudorandom number generator’s seed.
Figure 12.12: Within-cluster sum of squares at the results returned by different runs
of the k-means algorithm; sometimes we might be very unlucky
import sklearn.cluster
np.random.seed(123)
km = sklearn.cluster.KMeans(k) # KMeans(k, n_init=10)
km.fit(X)
## KMeans(n_clusters=3)
km.inertia_ # WCSS – not optimal!
## 437.5467188958928
Still, there are no guarantees: the solution is suboptimal too. As an exercise, pass n_init=100,
n_init=1000, and n_init=10000 and determine the returned WCSS.
Note It is theoretically possible that a developer from the scikit-learn team, when
they see the above result, will make a tweak in the algorithm so that after an update
to the package, the returned minimum will be better. This cannot be deemed a bug
fix, though, as there are no bugs here. Improving the behaviour of the method in this
example will lead to its degradation in others. There is no free lunch in optimisation.
Note Some datasets are more well-behaving than others. The k-means method is over-
all quite usable, but we must always be cautious.
We recommend always performing at least 100 random restarts. Also, if a report from
data analysis does not say anything about the number of tries performed, we should
Figure 12.13: Traces of different cluster centres our k-means algorithm converged to;
some are definitely not optimal, and therefore the method must be restarted a few
times to increase the likelihood of pinpointing the true solution
assume that the results are gibberish19 . People will complain about our being a pain,
but we know better; compare Rule#9.
Exercise 12.24 Run the k-means method, 𝑘 = 8, on the sipu_unbalance20 dataset from many
random sets of cluster centres. Note the value of the total within-cluster sum of squares. Also, plot
the cluster centres discovered. Do they make sense? Compare these to the case where you start the
method from the following cluster centres, which are close to the global minimum:

$$ \mathbf{C} = \begin{bmatrix}
-15 &  5 \\
-12 & 10 \\
-10 &  5 \\
 15 &  0 \\
 15 & 10 \\
 20 &  5 \\
 25 &  0 \\
 25 & 10
\end{bmatrix}. $$

19 For instance, R's stats::kmeans automatically uses nstart=1. It is not rare, unfortunately, that data
12.6 Exercises
Exercise 12.25 Name the data type of the objects that the DataFrame.groupby method returns.
Exercise 12.26 What is the relationship between the GroupBy, DataFrameGroupBy, and
SeriesGroupBy classes?
Exercise 12.27 What are relative z-scores and how can we compute them?
Exercise 12.28 Why and when might the accuracy score not be the best way to quantify a classifier's performance?
Exercise 12.29 What is the difference between recall and precision, both in terms of how they
are defined and where they are the most useful?
Exercise 12.30 Explain how the k-nearest neighbour classification and regression algorithms
work. Why do we say that they are model-free?
Exercise 12.31 In the context of k-nearest neighbour classification, why might it be important
to resolve the potential ties at random when computing the mode of the neighbours' labels?
Exercise 12.32 What is the purpose of a training/test and a training/validation/test set split?
Exercise 12.33 Give the formula for the total within-cluster sum of squares.
Exercise 12.34 Are there any cluster shapes that cannot be detected by the k-means method?
Exercise 12.35 Why do we say that solving the k-means problem is hard?
Exercise 12.36 Why is restarting Lloyd's algorithm many times necessary? Why are reports
from data analysis that do not mention the number of restarts not trustworthy?
13
Accessing Databases
pandas is convenient for working with data that fit into memory and which can be
stored in individual CSV files. Still, larger information banks in a shared environment
will often be made available to us via relational (structured) databases such as
PostgreSQL or MariaDB, or a wide range of commercial products.
Most commonly, we use SQL (Structured Query Language) to define the data chunks1
we wish to analyse. Then, we fetch them from the database driver in the form of a
pandas data frame. This enables us to perform the operations we are already familiar
with, e.g., various transformations or visualisations.
Below we make a quick introduction to the basics of SQL using SQLite2 , which is
a lightweight, flat-file, and server-less open-source database management system.
Overall, SQLite is a sensible choice for data of even hundreds or thousands of giga-
bytes in size that fit on a single computer’s disk. This is more than enough for playing
with our data science projects or prototyping more complex solutions.
Important In this chapter, we will learn that the syntax of SQL is very readable: it is
modelled after the natural (English) language. The purpose of this introduction is not
to write one's own queries nor to design one's own databanks: this should be covered by a separate
course on database systems; see, e.g., [15, 19].
a more versatile choice. If we have too much data, we can always fetch random samples (this is what
statistics is for) thereof or pre-aggregate the information on the server side. This should be sufficient for most
intermediate-level users.
2 https://sqlite.org
3 https://travel.stackexchange.com
4 https://archive.org/details/stackexchange
First, Tags gives, amongst others, topic categories (TagName) and how many questions
mention them (Count):
Tags = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Tags.csv.gz",
comment="#")
Tags.head(3)
## Count ExcerptPostId Id TagName WikiPostId
## 0 104 2138.0 1 cruising 2137.0
## 1 43 357.0 2 caribbean 356.0
## 2 43 319.0 4 vacations 318.0
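Second, the Users table (its loading code is omitted in this excerpt) can be fetched analogously; a sketch, assuming the file sits in the same repository under the name Users.csv.gz:
Users = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Users.csv.gz",
    comment="#")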
Third, Badges recalls all rewards handed to the users (UserId) for their engaging in vari-
ous praiseworthy activities:
Badges = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Badges.csv.gz",
comment="#")
Badges.head(3)
## Class Date Id Name TagBased UserId
## 0 3 2011-06-21T20:16:48.910 1 Autobiographer False 2
## 1 3 2011-06-21T20:16:48.910 2 Autobiographer False 3
## 2 3 2011-06-21T20:16:48.910 3 Autobiographer False 4
Fourth, Posts lists all the questions and answers (the latter do not have ParentId set to
NaN).
Posts = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Posts.csv.gz",
comment="#")
Posts.head(3)
## AcceptedAnswerId ... ViewCount
## 0 393.0 ... 419.0
## 1 NaN ... 1399.0
Fifth, Votes lists all the up-votes (VoteTypeId equal to 2) and down-votes (VoteTypeId of
3) to all the posts.
Votes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/travel_stackexchange_com_2017/Votes.csv.gz",
comment="#")
Votes.head(3)
## BountyAmount CreationDate Id PostId UserId VoteTypeId
## 0 NaN 2011-06-21T00:00:00.000 1 1 NaN 2
## 1 NaN 2011-06-21T00:00:00.000 2 1 NaN 2
## 2 NaN 2011-06-21T00:00:00.000 3 2 NaN 2
Exercise 13.1 See the README5 file for a detailed description of each column. Note that rows
are uniquely defined by their respective Ids. There are relations between the data frames, e.g.,
Users.Id vs Badges.UserId, Posts.Id vs Votes.PostId, etc. Moreover, for privacy reasons,
some UserIds might be missing. In such a case, they are encoded with a not-a-number; compare
Chapter 15.
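The code defining the database file path does not appear in this excerpt; a sketch consistent with the description below:
import os.path
import tempfile
dbfile = os.path.join(tempfile.mkdtemp(), "travel.db")  # a file inside a randomly named directory under /tmp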
The above defines the file path (compare Section 13.6.1) where the database is going to
be stored. We use a randomly generated filename inside the local file system’s (we are
on Linux) temporary directory, /tmp. This is just a pleasant exercise, and we will not be
using this database afterwards. The reader might prefer setting a filename relative to
the current working directory (as given by os.getcwd), e.g., dbfile = "travel.db".
We can now connect to the database:
5 https://github.com/gagolews/teaching-data/raw/master/travel_stackexchange_com_2017/
README.md
import sqlite3
conn = sqlite3.connect(dbfile)
The database might now be queried: we can add new tables, insert new rows, and re-
trieve records.
Our data are already in the form of pandas data frames. Therefore, exporting them to
the database is straightforward. We only need to make a series of calls to the pandas.
DataFrame.to_sql method.
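For instance (a sketch; the table names simply mirror the data frames'):
Tags.to_sql("Tags", conn, index=False)
Badges.to_sql("Badges", conn, index=False)
Posts.to_sql("Posts", conn, index=False)
Users.to_sql("Users", conn, index=False)
Votes.to_sql("Votes", conn, index=False)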
Note (*) It is possible to export data that do not fit into memory by reading them
in chunks of considerable, but not too large, sizes. In particular pandas.read_csv has
the nrows argument that lets us read several rows from a file connection; see Sec-
tion 13.6.4. Then, pandas.DataFrame.to_sql(..., if_exists="append") can be used to
append new rows to an existing table.
Exporting data can of course be done without pandas as well, e.g., when they are to be
fetched from XML or JSON files (compare Section 13.5) and processed manually, row
by row. Intermediate-level SQL users can call conn.execute("CREATE TABLE t..."),
followed by conn.executemany("INSERT INTO t VALUES(?, ?, ?)", l), and then conn.
commit(). This will create a new table (here: named t) populated by a list of records
(e.g., in the form of tuples or numpy vectors). For more details, see the manual6 of the
sqlite3 package.
pd.read_sql_query("""
SELECT * FROM Tags LIMIT 3
(continues on next page)
6 https://docs.python.org/3/library/sqlite3.html
The above query selected all columns (SELECT *) and the first three rows (LIMIT 3) from
the Tags table.
Exercise 13.2 For the above and all the following SQL queries, write the equivalent Python code
that generates the same result using pandas functions and methods. In each case, there might be
more than one equally fine solution. In case of any doubt about the meaning of the queries, please
refer to the SQLite documentation7 . Example solutions are provided at the end of this section.
Example 13.3 For a reference query:
res1a = pd.read_sql_query("""
SELECT * FROM Tags LIMIT 3
""", conn)
res1b = Tags.head(3)
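We can then verify that the two data frames are identical (the call below mirrors the pattern used in the later examples):
pd.testing.assert_frame_equal(res1a, res1b)  # no error == OK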
No error message means that the test is passed. The cordial thing about the assert_frame_equal
function is that it ignores small round-off errors introduced by arithmetic operations.
Nonetheless, the results generated by pandas might be the same up to the reordering of
rows. In such a case, before calling pandas.testing.assert_frame_equal, we can invoke
DataFrame.sort_values on both data frames to sort them with respect to 1 or 2 chosen columns.
13.3.1 Filtering
Exercise 13.4 From Tags, select two columns TagName and Count and rows for which TagName
is equal to one of the three choices provided.
res2a = pd.read_sql_query("""
SELECT TagName, Count
FROM Tags
WHERE TagName IN ('poland', 'australia', 'china')
""", conn)
7 https://sqlite.org/lang.html
13.3.2 Ordering
Exercise 13.6 Select the Title and Score columns from Posts where ParentId is missing (i.e.,
the post is in fact a question) and Title is well-defined. Then, sort the results by the Score column,
decreasingly (descending order). Finally, return only the first five rows (e.g., top five scoring ques-
tions).
res4a = pd.read_sql_query("""
SELECT Title, Score
FROM Posts
WHERE ParentId IS NULL AND Title IS NOT NULL
ORDER BY Score DESC
LIMIT 5
""", conn)
res5a = pd.read_sql_query("""
SELECT DISTINCT Name
FROM Badges
WHERE UserId=23
""", conn)
res5a
## Name
## 0 Supporter
## 1 Student
## 2 Teacher
## 3 Scholar
## 4 Beta
## 5 Nice Question
## 6 Editor
## 7 Nice Answer
## 8 Yearling
## 9 Popular Question
## 10 Taxonomist
## 11 Notable Question
res6a = pd.read_sql_query("""
SELECT DISTINCT
Name,
CAST(strftime('%Y', Date) AS FLOAT) AS Year
FROM Badges
WHERE UserId=23
""", conn)
Exercise 13.10 Count how many unique combinations of pairs (Name, Year) for the badges
won by the user with Id=23 are there. Then, return only the rows having Count greater than 1
and order the results by Count decreasingly. In other words, list the badges received more than
once in any given year.
res8a = pd.read_sql_query("""
SELECT
Name,
CAST(strftime('%Y', Date) AS FLOAT) AS Year,
COUNT(*) AS Count
FROM Badges
WHERE UserId=23
GROUP BY Name, Year
HAVING Count > 1
ORDER BY Count DESC
""", conn)
res8a
## Name Year Count
## 0 Popular Question 2014.0 3
## 1 Notable Question 2015.0 2
Note that WHERE is performed before GROUP BY, and HAVING is applied thereafter.
13.3.5 Joining
Exercise 13.11 Join (merge) Tags, Posts, and Users for all posts with OwnerUserId not equal
to -1 (i.e., the tags which were created by “alive” users). Return the top six records with respect to
Tags.Count.
res9a = pd.read_sql_query("""
SELECT Tags.TagName, Tags.Count, Posts.OwnerUserId,
Users.Age, Users.Location, Users.DisplayName
FROM Tags
JOIN Posts ON Posts.Id=Tags.WikiPostId
JOIN Users ON Users.AccountId=Posts.OwnerUserId
WHERE OwnerUserId != -1
ORDER BY Tags.Count DESC, Tags.TagName ASC
LIMIT 6
""", conn)
res9a
Exercise 13.12 First, create an auxiliary (temporary) table named UpVotesTab, where we store
the information about the number of up-votes (VoteTypeId=2) that each post has received. Then,
join (merge) this table with Posts and fetch some details about the five questions (PostTypeId=1)
with the most up-votes.
res10a = pd.read_sql_query("""
SELECT UpVotesTab.*, Posts.Title FROM
(
SELECT PostId, COUNT(*) AS UpVotes
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId
) AS UpVotesTab
JOIN Posts ON UpVotesTab.PostId=Posts.Id
WHERE Posts.PostTypeId=1
ORDER BY UpVotesTab.UpVotes DESC LIMIT 5
""", conn)
res10a
## PostId UpVotes Title
## 0 3080 307 OK we're all adults here, so really, how on ea...
## 1 38177 254 How do you know if Americans genuinely/literal...
## 2 24540 221 How to intentionally get denied entry to the U...
## 3 20207 211 Why are airline passengers asked to lift up wi...
## 4 96447 178 Why prohibit engine braking?
Example 13.14 To generate res3a with pandas only, we need some more complex filtering with
loc[...]:
res3b = (
Posts.
loc[
(Posts.PostTypeId == 1) & (Posts.ViewCount >= 10000) &
(Posts.FavoriteCount >= 35) & (Posts.FavoriteCount <= 100),
["Title", "Score", "ViewCount", "FavoriteCount"]
].
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res3a, res3b) # no error == OK
Example 13.15 For res4a, some filtering and sorting is all we need:
res4b = (
Posts.
loc[
Posts.ParentId.isna() & (~Posts.Title.isna()),
["Title", "Score"]
].
sort_values("Score", ascending=False).
head(5).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res4a, res4b) # no error == OK
res5b = (
Badges.
loc[Badges.UserId == 23, ["Name"]].
drop_duplicates().
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res5a, res5b) # no error == OK
Example 13.17 For res6a, we first need to add a new column to the copy of Badges:
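The relevant snippet (identical to the one repeated in the later examples) is:
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float")
)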
Then, we apply some basic filtering and the removal of duplicated rows:
res6b = (
Badges2.
loc[Badges2.UserId == 23, ["Name", "Year"]].
drop_duplicates().
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res6a, res6b) # no error == OK
Example 13.18 For res7a, we again need the Year column in a copy of Badges:
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float")
)
res7b = (
Badges2.
loc[Badges2.UserId == 23, ["Name", "Year"]].
groupby("Name")["Year"].
aggregate([len, np.min, np.mean, np.max]).
sort_values("len", ascending=False).
head(4).
reset_index()
)
res7b.columns = ["Name", "Count", "MinYear", "MeanYear", "MaxYear"]
Had we not converted Year to float, we would obtain a meaningless average year, without any
warning.
Unfortunately, the rows in res7a and res7b are ordered differently. For testing, we need to reorder
them in the same way:
pd.testing.assert_frame_equal(
res7a.sort_values(["Name", "Count"]).reset_index(drop=True),
res7b.sort_values(["Name", "Count"]).reset_index(drop=True)
) # no error == OK
Example 13.19 For res8a, we first count the number of values in each group:
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float")
)
res8b = (
Badges2.
loc[ Badges2.UserId == 23, ["Name", "Year"] ].
groupby(["Name", "Year"]).
size().
rename("Count").
reset_index()
)
res8b = (
res8b.
loc[ res8b.Count > 1, : ].
sort_values("Count", ascending=False).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res8a, res8b) # no error == OK
Example 13.20 To obtain a result equivalent to res9a, we need to merge Posts with Tags, and
then merge the result with Users:
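A sketch of the two merges (based on the join conditions in the SQL query above; overlapping column names receive pandas' default suffixes, which does not affect the columns selected below):
res9b = pd.merge(Posts, Tags, left_on="Id", right_on="WikiPostId")
res9b = pd.merge(Users, res9b, left_on="AccountId", right_on="OwnerUserId")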
res9b = (
res9b.
loc[
(res9b.OwnerUserId != -1) & (~res9b.OwnerUserId.isna()),
["TagName", "Count", "OwnerUserId", "Age", "Location", "DisplayName"]
].
sort_values(["Count", "TagName"], ascending=[False, True]).
head(6).
reset_index(drop=True)
)
Example 13.21 To obtain a result equivalent to res10a, we first need to create an auxiliary data
frame that corresponds to the subquery.
UpVotesTab = (
Votes.
loc[Votes.VoteTypeId==2, :].
groupby("PostId").
size().
rename("UpVotes").
reset_index()
)
And now:
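A sketch of the remaining steps (merging the auxiliary table with Posts and fetching the top five questions):
res10b = pd.merge(UpVotesTab, Posts, left_on="PostId", right_on="Id")
res10b = (
    res10b.
    loc[res10b.PostTypeId == 1, ["PostId", "UpVotes", "Title"]].
    sort_values("UpVotes", ascending=False).
    head(5).
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res10a, res10b)  # no error == OK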
conn.close()
fetch data in these formats. Sadly, often this will require some quite tedious labour,
neither art nor science; see also [89] and [18].
Exercise 13.22 Consider the Web API for accessing8 the on-street parking bay sensor data in
Melbourne, VIC, Australia. Using, for example, the json package, convert the data9 in the JSON
format to a data frame.
Exercise 13.23 Australian Radiation Protection and Nuclear Safety Agency publishes10 UV
data for different Aussie cities. Using, for example, the xml package, convert this XML dataset11
to a data frame.
Exercise 13.24 (*) Check out the English Wikipedia article featuring a list of 20th-century clas-
sical composers12 . Using pandas.read_html, convert the Climate Data table included therein
to a data frame.
Exercise 13.25 (*) Using, for example, the lxml package, write a function that converts each
bullet list featured in a given Wikipedia article (e.g., this one13 ), to a list of strings.
Exercise 13.26 (**) Import an archived version of a Stack Exchange14 site that you find inter-
esting and store it in an SQLite database. You can find the relevant data dumps here15 .
Exercise 13.27 (**) Download16 and then import an archived version of one of the wikis hosted
by the Wikimedia Foundation17 (e.g., the whole English Wikipedia) so that it can be stored in an
SQLite database.
ultraviolet-radation-data-information
11 https://uvdata.arpansa.gov.au/xml/uvvalues.xml
12 https://en.wikipedia.org/wiki/List_of_20th-century_classical_composers
13 https://en.wikipedia.org/wiki/Category:Fr%C3%A9d%C3%A9ric_Chopin
14 https://stackexchange.com/
15 https://archive.org/details/stackexchange
16 https://meta.wikimedia.org/wiki/Data_dumps
17 https://wikimediafoundation.org/
import os.path
os.path.join("~", "Desktop", "file.csv") # we are on GNU/Linux
## '~/Desktop/file.csv'
Important We will frequently be referring to file paths relative to the working direct-
ory of the currently executed Python session (e.g., from which IPython/Jupyter note-
book server was started); see os.getcwd.
All non-absolute file names (ones that do not start with `~`, `/`, `c:\\`, and the like),
for example, "filename.csv" or os.path.join("subdir", "filename.csv") are always
relative to the current working directory.
For instance, if the working directory is "/home/marek/projects/python", then
"filename.csv" refers to "/home/marek/projects/python/filename.csv".
Also, `..` denotes the current working directory’s parent directory. Thus, "../
filename2.csv" resolves to "/home/marek/projects/filename2.csv".
Exercise 13.28 Print the current working directory by calling os.getcwd. Next, download the
file air_quality_2018_param18 and save it in the current Python session’s working directory
(e.g., in your web browser, right-click on the web page’s canvas and select Save Page As…). Load
with pandas.read_csv by passing "air_quality_2018_param.csv" as the input path.
Exercise 13.29 (*) Download the aforementioned file programmatically (if you have not done
so yet) using the requests module.
18 https://github.com/gagolews/teaching-data/raw/master/marek/air_quality_2018_param.csv
try:
    # statements to execute
    x = pd.read_csv("file_not_found.csv")
    print(x.head())  # this will not be executed if the above raises an error
except OSError:
    # if an exception occurs, we can handle it here
    print("File has not been found")
## File has not been found
13.7 Exercises
Exercise 13.31 Find an example of an XML and JSON file. Which one is more human-
readable? Do they differ in terms of capabilities?
19 https://docs.python.org/3/tutorial/errors.html
20 https://docs.python.org/3/tutorial/inputoutput.html
Exercise 13.32 What is wrong with constructing file paths like "~" + "\\" + "filename.
csv"?
Exercise 13.33 What are the benefits of using a SQL database management system in data sci-
ence activities?
Exercise 13.34 (*) How can we populate a database with gigabytes of data read from many CSV
files?
Part V
Other Data Types
14
Text Data
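The examples below operate on an example string x whose definition precedes this excerpt; judging from the outputs, it is presumably:
x = "spam"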
There are a few binary operators overloaded for strings, e.g., `+` stands for string con-
catenation:
x + " and eggs"
## 'spam and eggs'
Strings are immutable, but parts thereof can always be reused in conjunction with the
concatenation operator:
x[:2] + "ecial"
## 'special'
Note Despite the wide support for Unicode, sometimes our own or other readers’ dis-
play (e.g., web browsers when viewing an HTML version of the output report) might
not be able to render all code points properly, e.g., due to missing fonts. Still, we should
rest assured that they are processed correctly if string functions are applied thereon.
Note (*) More advanced string transliteration2 can be performed by means of the ICU3
(International Components for Unicode) library, which the PyICU package provides
wrappers for.
For instance, converting all code points to ASCII (English) might be necessary when
identifiers are expected to miss some diacritics that would normally be included (as
in "Gągolewski" vs "Gagolewski"):
icu.Transliterator.createInstance("NFKD; NFC").transliterate("¼ąr²")
## '¼ąr2'
1 (*) More precisely, Python strings are UTF-8-encoded. Most web pages and APIs are nowadays served
in UTF-8. But we can still occasionally encounter files encoded in ISO-8859-1 (Western Europe), Windows-
1250 (Eastern Europe), Windows-1251 (Cyrillic), GB18030 and Big5 (Chinese), EUC-KR (Korean), Shift-JIS
and EUC-JP (Japanese), amongst others. They can be converted using the str.decode method.
2 https://unicode-org.github.io/icu/userguide/transforms/general/
3 https://icu.unicode.org/
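The next few examples use a string named food; its definition is not included in this excerpt, but the outputs below imply:
food = "bacon, spam, spam, srapatapam, eggs, and spam"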
food.count("spam")
## 3
food.index("spam")
## 7
food.replace("spam", "veggies")
## 'bacon, veggies, veggies, srapatapam, eggs, and veggies'
4 https://www.unicode.org/faq/normalization.html
Exercise 14.1 Read the manual of the following methods: str.startswith, str.endswith,
str.find, str.rfind, str.rindex, str.removeprefix, and str.removesuffix.
The splitting of long strings at specific fixed delimiter strings can be done via:
food.split(", ")
## ['bacon', 'spam', 'spam', 'srapatapam', 'eggs', 'and spam']
See also str.partition. The str.join method implements the inverse operation:
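For example (a sketch):
", ".join(["spam", "bacon", "eggs", "spam"])
## 'spam, bacon, eggs, spam'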
Moreover, in Section 14.4, we will discuss pattern matching with regular expressions,
which can be useful in, amongst others, extracting more abstract data chunks (num-
bers, URLs, email addresses, IDs) from within strings.
We have: "a" < "aa" < "aaaaaaaaaaaaa" < "ab" < "aba" < "abb" < "b" < "ba" < "baaaaaaa"
< "bb" < "Spanish Inquisition".
The lexicographic ordering (character-by-character, from left to right) is not necessar-
ily appropriate for strings featuring numerals:
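For instance (a sketch):
"a9" < "a123"  # lexicographically, "9" is greater than "1"
## False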
Additionally, it only takes into account the numeric codes (see Section 14.4.3) corres-
ponding to each Unicode character. Consequently, it does not work well with non-
English alphabets:
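For instance (a sketch; the relevant code points are discussed next):
"Ą" < "B"  # in Polish, Ą should come before B
## False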
In Polish, A with ogonek (Ą) should sort after A and before B, let alone I. However, their
corresponding numeric codes in the Unicode table are: 260 (Ą), 65 (A), 66 (B), and 73
(I). The resulting ordering is thus incorrect, as far as natural language processing is
concerned.
It is best to perform string collation using the services provided by ICU. Here is an
example of German phone book-like collation where "ö" is treated the same as "oe":
c = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
c.setStrength(0) # ignore case and some diacritics
c.compare("Löwe", "loewe")
## 0
In some languages, contractions occur, e.g., in Slovak and Czech, two code points "ch"
are treated as a single entity and are sorted after "h":
icu.Collator.createInstance(icu.Locale("sk_SK")).compare("chladný", "hladný")
## 1
This means that we have "chladný" > "hladný" (the 1st argument is greater than the
2nd one). Compare the above to something similar in Polish:
icu.Collator.createInstance(icu.Locale("pl_PL")).compare("chłodny", "hardy")
## -1
That is, "chłodny" < "hardy" (the first argument is less than the 2nd one).
c = icu.Collator.createInstance()
c.setAttribute(
icu.UCollAttribute.NUMERIC_COLLATION,
icu.UCollAttributeValue.ON
)
c.compare("a9", "a123")
## -1
Which is the correct result: "a9" is less than "a123" (compare the above to the example
where we used the ordinary `<`).
5 https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
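The examples below operate on a Series of strings whose definition precedes this excerpt; judging from the outputs that follow, it is presumably:
x = pd.Series(["spam", "bacon", None, "buckwheat", "spam"])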
This allows for the encoding of missing values by means of the None object (which is of
type None, not str); compare Section 15.1.
Vectorised versions of base string operations are available via the pandas.Series.str
accessor. We thus have pandas.Series.str.strip, pandas.Series.str.split, pandas.
Series.str.find, and so forth. For instance:
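A sketch:
x.str.find("a")  # the index of the first "a" in each string (-1 if absent); missing values remain missing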
But there is more. For example, a function to compute the length of each string:
x.str.len()
## 0 4.0
## 1 5.0
## 2 NaN
## 3 9.0
## 4 4.0
## dtype: float64
Vectorised concatenation of strings can be performed using the overloaded `+` oper-
ator:
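For example:
x + " and spam"  # elementwise concatenation; missing values propagate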
x.str.cat(sep="; ")
## 'spam; bacon; buckwheat; spam'
Conversion to numeric:
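A sketch (with hypothetical data):
pd.Series(["1.3", "-7", "3523"]).astype(float)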
Select substrings:
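A sketch:
x.str.slice(2, -1)  # like s[2:-1] for each string s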
Replace substrings:
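A sketch:
x.str.slice_replace(2, -1, "...")  # replace s[2:-1] with "..." in each string s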
Exercise 14.2 Consider the nasaweather_glaciers6 data frame. All glaciers are assigned
11/12-character unique identifiers as defined by the WGMS convention that forms the glacier ID
number by combining the following five elements.
1. 2-character political unit (first two letters of the ID),
2. 1-digit continent code (the third letter),
3. 4-character drainage code (next four),
4. 2-digit free position code (next two),
5. 2- or 3-digit local glacier code (the remaining ones).
Extract the five chunks and store them as independent columns in the data frame.
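The snippet that the next paragraph refers to is not included in this excerpt; it was presumably along the lines of:
x = np.array(["spam", "bacon", "spam"])
x
## array(['spam', 'bacon', 'spam'], dtype='<U5')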
Here, the data type “<U5” means that we deal with Unicode strings of length no greater
than 5. Unfortunately, replacing elements with too long a content will result in trun-
cated strings:
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckw'], dtype='<U5')
x = x.astype("<U10")
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckwheat'], dtype='<U10')
The numpy.char7 module includes several vectorised versions of string routines, most
of which we discussed above. For example:
x = np.array([
"spam", "spam, bacon, and spam",
"spam, eggs, bacon, spam, spam, and spam"
])
np.char.split(x, ", ")
## array([list(['spam']), list(['spam', 'bacon', 'and spam']),
## list(['spam', 'eggs', 'bacon', 'spam', 'spam', 'and spam'])],
## dtype=object)
np.char.count(x, "spam")
## array([1, 2, 4])
Vectorised operations that we would normally perform through the binary operators
(i.e., `+`, `*`, `<`, etc.) are available through standalone functions:
7 https://numpy.org/doc/stable/reference/routines.char.html
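For instance (a sketch):
np.char.add(x, "!")       # like x + "!", elementwise
np.char.equal(x, "spam")  # like x == "spam"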
The function that returns the length of each string is also noteworthy:
np.char.str_len(x)
## array([ 4, 21, 39])
x = pd.Series([
"spam",
"spam, bacon, spam",
"potatoes",
None,
"spam, eggs, bacon, spam, spam"
])
xs = x.str.split(", ", regex=False)
xs
## 0 [spam]
## 1 [spam, bacon, spam]
## 2 [potatoes]
## 3 None
## 4 [spam, eggs, bacon, spam, spam]
## dtype: object
xs.str.get(0)
## 0 spam
## 1 spam
## 2 potatoes
## 3 None
## 4 spam
## dtype: object
or slicing:
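A sketch:
xs.str[:2]  # the first two items of each list; missing values remain missing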
Exercise 14.3 (*) Using pandas.merge, join the countries8 , world_factbook_20209 , and
ssi_2016_dimensions10 datasets based on the country names. Note that some manual data
cleansing will be necessary beforehand.
Exercise 14.4 (**) Given a Series object xs featuring lists of strings, convert it to a 0/1 repres-
entation.
1. Determine the list of all unique strings; let us call it xu.
2. Create a data frame x with xs.shape[0] rows and len(xu) columns such that x.iloc[i,
j] is equal to 1 if xu[j] is amongst xs.loc[i] and equal to 0 otherwise. Set the column
names to xs.
3. Given x (and only x: neither xs nor xu), perform the inverse operation.
For example, for the above xs object, x should look like:
8 https://github.com/gagolews/teaching-data/raw/master/other/countries.csv
9 https://github.com/gagolews/teaching-data/raw/master/marek/world_factbook_2020.csv
10 https://github.com/gagolews/teaching-data/raw/master/marek/ssi_2016_dimensions.csv
pi = 3.14159265358979323846
f"π = {pi:.2f}"
## 'π = 3.14'
creates a string showing the value of the variable pi formatted as a float rounded to
two places after the decimal separator.
Note (**) Similar functionality can be achieved using the str.format method:
"π = {:.2f}".format(pi)
## 'π = 3.14'
as well as the `%` operator overloaded for strings, which uses sprintf-like value place-
holders known to some readers from other programming languages (such as C):
"π = %.2f" % pi
## 'π = 3.14'
x = np.array([1, 2, 3])
str(x)
## '[1 2 3]'
repr(x)
## 'array([1, 2, 3])'
The former is more human-readable, and the latter is slightly more technical. Note
that repr often returns an output that can be interpreted as executable Python code
with no or few adjustments. Nonetheless, pandas objects are amongst the many ex-
ceptions to this rule.
import IPython.display
x = 2+2
out = f"*Result*: $2^2=2\\cdot 2={x}$." # LaTeX math
IPython.display.Markdown(out)
Result: 2² = 2 ⋅ 2 = 4.
Recall from Section 1.2.5 that Markdown is a very flexible markup11 language that al-
lows us to define itemised and numbered lists, mathematical formulae, tables, im-
ages, etc.
On a side note, data frames can be nicely prepared for display in a report using pandas.
DataFrame.to_markdown.
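For example (a tiny, made-up data frame):
print(pd.DataFrame(dict(a=[1, 2], b=["spam", "eggs"])).to_markdown())  # a pipe-style Markdown table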
11 (*) Markdown is amongst many markup languages. Other learn-worthy ones include HTML (for the
Web) and LaTeX (especially for the beautiful typesetting of maths, print-ready articles, and books, e.g.,
PDF; see [68] for a good introduction).
We can convert it to other formats, including HTML, PDF, EPUB, ODT, and even
presentations by running12 the pandoc13 tool. We may also embed it directly inside an
IPython/Jupyter notebook:
IPython.display.Markdown(out)
• bacon
• eggs
• spam
And now for something completely different:
Rank Food
1 maps
2 nocab
3 sgge
4 maps
Note Figures created in matplotlib can be exported to PNG, SVG, or PDF files using
the matplotlib.pyplot.savefig function. We can include them manually in a Mark-
down document using the  syntax.
Note (*) IPython/Jupyter Notebooks can be converted to different formats using the
jupyter-nbconvert14 command line tool. jupytext15 can create notebooks from ordin-
ary text files. Literate programming with mixed R and Python is possible with the R
packages knitr16 and reticulate17 . See [72] for an overview of many more options.
import re
x = "We're the knights who say ni! niiiii! ni! niiiiiiiii!"
re.findall(r"\bni+\b", x)
## ['ni', 'niiiii', 'ni', 'niiiiiiiii']
The order of arguments is (look for what, where), not the other way around.
Important We used the r"..." prefix to input a string so that “\b” is not treated as an
escape sequence which denotes the backspace character. Otherwise, the above would
have to be written as “\\bni+\\b”.
If we had not insisted on matching at the word boundaries (i.e., if we used the simple
"ni+" regex instead), we would also match the "ni" in "knights".
The re.search function returns an object of class re.Match that enables us to get some
more information about the first match:
r = re.search(r"\bni+\b", x)
r.start(), r.end(), r.group()
## (26, 28, 'ni')
The above includes the start and end position (index) and the match itself. If the regex
contains capture groups (see below for more details), we can also pinpoint the matches
thereto.
Moreover, re.finditer returns an iterable object that includes the same details, but
now about all the matches:
18 https://kate-editor.org/
19 https://www.eclipse.org/ide/
20 https://vscodium.com/
rs = re.finditer(r"\bni+\b", x)
for r in rs:
print((r.start(), r.end(), r.group()))
## (26, 28, 'ni')
## (30, 36, 'niiiii')
## (38, 40, 'ni')
## (42, 52, 'niiiiiiiii')
re.split(r"!\s+", x)
## ["We're the knights who say ni", 'niiiii', 'ni', 'niiiiiiiii!']
The “!\s+” regex matches an exclamation mark followed by one or more whitespace characters.
re.sub(r"\bni+\b", "nu", x)
## "We're the knights who say nu! nu! nu! nu!"
Note (**) More flexible replacement strings can be generated by passing a custom
function as the second argument:
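For instance (an illustrative replacement function of our own):
re.sub(r"\bni+\b", lambda m: m.group().upper(), x)  # uppercase each match
## "We're the knights who say NI! NIIIII! NI! NIIIIIIIII!"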
x.str.contains(r"\bni+\b")
## 0 True
## 1 True
## 2 None
## 3 False
## 4 True
## dtype: object
x.str.count(r"\bni+\b")
## 0 1.0
## 1 3.0
## 2 NaN
## 3 0.0
## 4 2.0
## dtype: float64
x.str.replace(r"\bni+\b", "nu", regex=True)
## 0 nu!
## 1 nu, nu, nu!
## 2 None
## 3 spam, bacon
## 4 nu, nu!
## dtype: object
x.str.findall(r"\bni+\b")
## 0 [ni]
## 1 [niiii, ni, nii]
## 2 None
## 3 []
## 4 [nii, ni]
## dtype: object
x.str.split(r",\s+") # a comma, one or more whitespaces
## 0 [ni!]
## 1 [niiii, ni, nii!]
## 2 None
## 3 [spam, bacon]
## 4 [nii, ni!]
## dtype: object
Note (*) If we intend to seek matches to the same pattern in many different strings
without the use of pandas, it might be faster to pre-compile a regex first, and then use
the re.Pattern.findall method instead of re.findall:
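For example (a minimal sketch of this pattern):
p = re.compile(r"\bni+\b")  # compile once...
p.findall("We're the knights who say ni! niiiii! ni! niiiiiiiii!")  # ...reuse many times
## ['ni', 'niiiii', 'ni', 'niiiiiiiii']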
Important The following characters have special meaning to the regex engine: “.”,
“\”, “|”, “(”, “)”, “[”, “]”, “{”, “}”, “^”, “$”, “*”, “+”, and “?”.
Any regular expression that contains none of the above behaves like a fixed pattern:
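For instance (an illustrative string of our own):
re.findall("spam", "spammity spam, spam, and eggs")
## ['spam', 'spam', 'spam']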
There are three occurrences of a pattern that is comprised of four code points, “s” fol-
lowed by “p”, then by “a”, and ending with “m”.
re.findall(r"\.", "spam...")
## ['.', '.', '.']
The above extracts non-overlapping length-4 substrings that end with “am”, case-
insensitively.
The dot’s insensitivity to the newline character is motivated by the need to maintain
21 https://docs.python.org/3/library/re.html
compatibility with tools such as grep (when searching within text files in a line-by-line
manner). This behaviour can be altered by setting the DOTALL flag.
re.findall("[hj]am", x)
## ['ham', 'jam']
the “[hj]am” regex matches: “h” or “j”, followed by “a”, followed by “m”. In other words,
"ham" and "jam" are the only two strings that are matched by this pattern (unless
matching is done case-insensitively).
Important The following characters, if used within square brackets, may be treated
non-literally: “\”, “[”, “]”, “^”, “-”, “&”, “~”, and “|”.
To include them as-is in a character set, the backslash-escape must be used. For ex-
ample, “[\[\]\\]” matches a backslash or a square bracket.
Complementing Sets
Including “^” (the caret) after the opening square bracket denotes a set’s complement.
Hence, “[^abc]” matches any code point except “a”, “b”, and “c”. Here is an example
where we seek any substring that consists of four non-spaces:
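For instance (an illustrative example of our own):
re.findall("[^ ][^ ][^ ][^ ]", "spam, eggs, and spam")
## ['spam', 'eggs', 'spam']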
22 https://www.unicode.org/charts/
re.findall("[0-9A-Za-z]", "Gągolewski")
## ['G', 'g', 'o', 'l', 'e', 'w', 's', 'k', 'i']
The above pattern denotes the union of three code ranges: ASCII upper- and lower-
case letters and digits. Nowadays, in the processing of text in natural languages, this
notation should be avoided. Note the missing “ą” (Polish “a” with ogonek) in the result.
x = "aąb߯AĄB�12��,.;'! \t-+=\n[]©��”„"
Some glyphs are not available in the PDF version of this book (because we did not in-
stall the required fonts, e.g., the Arabic digit 4 or left and right arrows; this is an edu-
cational example), but they are well-defined at the program level.
Noteworthy Unicode-aware code point classes include the “word” characters:
re.findall(r"\w", x)
## ['a', 'ą', 'b', 'ß', 'Æ', 'A', 'Ą', 'B', '�', '1', '2', '�', '�']
decimal digits:
re.findall(r"\d", x)
## ['1', '2', '�', '�']
and whitespaces:
re.findall(r"\s", x)
## [' ', '\t', '\n']
Moreover, e.g., “\W” is equivalent to “[^\w]” , i.e., denotes the set’s complement.
x = "spam, egg, ham, jam, algae, and an amalgam of spam, all al dente"
re.findall("spam|ham", x)
## ['spam', 'ham', 'spam']
Grouping Subexpressions
The “|” operator has very low precedence (otherwise, we would match "spamam" or
"spaham" above instead). If we wish to introduce an alternative of subexpressions, we
need to group them using the “(?:...)” syntax. For instance, “(?:sp|h)am” matches
either "spam" or "ham".
Notice that the bare use of the round brackets, “(...)” (i.e., without the “?:”) part, has
the side-effect of creating new capturing groups; see below for more details.
Also, matching is always done left-to-right, on a first-come, first-served (greedy)
basis. Consequently, if the left branch is a subset of the right one, the latter will never
be matched. In particular, “(?:al|alga|algae)” can only match "al". To fix this, we
can write “(?:algae|alga|al)”.
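To illustrate (using the x defined above):
re.findall("(?:al|alga|algae)", x)
## ['al', 'al', 'al', 'al']
re.findall("(?:algae|alga|al)", x)
## ['algae', 'alga', 'al', 'al']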
Non-grouping Parentheses
Some parenthesised subexpressions – those in which the opening bracket is followed
by the question mark – have a distinct meaning. In particular, “(?#...)” denotes a
free-format comment that is ignored by the regex parser:
re.findall(
"(?# match 'sp' or 'h')(?:sp|h)(?# and 'am')am|(?# or match 'egg')egg",
x
)
## ['spam', 'egg', 'ham', 'spam']
re.findall(
"(?:sp|h)" + # match either 'sp' or 'h'
"am" + # followed by 'am'
"|" + # ... or ...
"egg", # just match 'egg'
x
)
## ['spam', 'egg', 'ham', 'spam']
14.4.5 Quantifiers
More often than not, a variable number of instances of the same subexpression needs
to be captured or its presence should be made optional. This can be achieved by means
of the following quantifiers:
• “?” matches 0 or 1 time;
• “*” matches 0 or more times;
• “+” matches 1 or more times.
By default, the quantifiers are greedy: they match as many characters as possible. Appending “?” to a quantifier (as in “??”, “*?”, “+?”) makes it lazy, i.e., it then consumes as few characters as possible. Compare:
Greedy:
x = "sp(AM)(maps)(SP)am"
re.findall(r"\(.+\)", x)
## ['(AM)(maps)(SP)']
Lazy:
re.findall(r"\(.+?\)", x)
## ['(AM)', '(maps)', '(SP)']
re.findall(r"\([^)]+\)", x)
## ['(AM)', '(maps)', '(SP)']
The first regex is greedy: it matches an opening bracket, then as many characters as
possible (including “)”) that are followed by a closing bracket. The two other patterns
terminate as soon as the first closing bracket is found.
More examples:
x = "spamamamnomnomnomammmmmmmmm"
re.findall("sp(?:am|nom)+", x)
## ['spamamamnomnomnomam']
re.findall("sp(?:am|nom)+?", x)
## ['spam']
And:
re.findall("sp(?:am|nom)+?m*", x)
## ['spam']
re.findall("sp(?:am|nom)+?m+", x)
## ['spamamamnomnomnomammmmmmmmm']
Let us stress that the quantifier is applied to the subexpression that stands directly
before it. Grouping parentheses can be used in case they are needed.
re.findall(r"\d+(?:\.\d+)?", x)
## ['12', '34.5', '678.901234', '37', '629']
finds digit sequences that are possibly (but not necessarily) followed by a dot and another digit sequence.
Exercise 14.5 Write a regex that extracts all #hashtags from a string #omg #SoEasy.
This returned the matches to the individual capture groups, not the whole matching
substrings.
re.search and re.finditer can pinpoint each component:
r = re.search(r"(\w+)='(.+?)'", x)
print("whole (0):", (r.start(), r.end(), r.group()))
print(" 1 :", (r.start(1), r.end(1), r.group(1)))
print(" 2 :", (r.start(2), r.end(2), r.group(2)))
## whole (0): (0, 20, "name='Sir Launcelot'")
## 1 : (0, 4, 'name')
## 2 : (6, 19, 'Sir Launcelot')
Here is a vectorised version of the above from pandas, returning the first match:
y = pd.Series([
    "name='Sir Launcelot'",
    "quest='Seek Grail'",
    "favcolour='blue', favcolour='yel.. Aaargh!'"  # remaining entries reconstructed from the outputs below
])
y.str.extract(r"(\w+)='(.+?)'")
We see that the findings are conveniently presented in the data frame form. The first
column gives the matches to the first capture group. All matches can be extracted too:
y.str.extractall(r"(\w+)='(.+?)'")
## 0 1
## match
## 0 0 name Sir Launcelot
## 1 0 quest Seek Grail
## 2 0 favcolour blue
## 1 favcolour yel.. Aaargh!
Recall that if we just need the grouping part of “(...)”, i.e., without the capturing
feature, “(?:...)” can be applied.
y.str.extract("(?:\\w+)='(?P<value>.+?)'")
## value
## 0 Sir Launcelot
## 1 Seek Grail
## 2 blue
re.sub(r"(?P<key>\w+)='(?P<value>.+?)'",
r"\g<value> is a \g<key>", x)
## 'Sir Launcelot is a name, Seek Grail is a quest, blue is a favcolour'
Back-Referencing
Matches to capture groups can also be part of the regexes themselves. In such a con-
text, e.g., “\1” denotes whatever has been consumed by the first capture group.
In general, parsing HTML code with regexes is not recommended, unless it is well-
structured (which might be the case if it is generated programmatically; but we can
always use the lxml package). Despite this, let us consider the following examples:
x = "<p><em>spam</em></p><code>eggs</code>"
re.findall(r"<[a-z]+>.*?</[a-z]+>", x)
## ['<p><em>spam</em>', '<code>eggs</code>']
This did not match the correct closing HTML tag. But we can make this happen by
writing:
re.findall(r"(<([a-z]+)>.*?</\2>)", x)
## [('<p><em>spam</em></p>', 'p'), ('<code>eggs</code>', 'code')]
This regex guarantees that the match will include all characters between the opening
"<tag>" and the corresponding (not: any) closing "</tag>".
14.4.7 Anchoring
Lastly, let us mention the ways to match a pattern at a given abstract position within
a string.
The five regular expressions match "spam", respectively, anywhere within the string,
at the beginning, at the end, at the beginning or end, and in strings that are equal to
the pattern itself. We can check this by calling:
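For instance (an illustrative check; the test strings below are ours):
tests = ["spam", "spammer", "expect spam", "eggs"]  # made-up inputs
for p in ["spam", "^spam", "spam$", "^spam|spam$", "^spam$"]:
    print(p, [bool(re.search(p, s)) for s in tests])
## spam [True, True, True, False]
## ^spam [True, True, False, False]
## spam$ [True, False, True, False]
## ^spam|spam$ [True, True, True, False]
## ^spam$ [True, False, False, False]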
Exercise 14.6 Write a regex that does the same job as str.strip.
re.findall(r"\b\w+\b(?![,.])", x)
## ['I', 'like', 'and']
This time, we matched the words that end with neither a comma nor a dot.
14.5 Exercises
Exercise 14.7 List some ways to normalise character strings.
Exercise 14.8 (**) What are the challenges of processing non-English text?
Exercise 14.9 What are the problems with the "[A-Za-z]" and "[A-z]" character sets?
Exercise 14.10 Name the two ways to turn on case-insensitive regex matching.
Exercise 14.11 What is a word boundary?
Exercise 14.12 What is the difference between the "^" and "$" anchors?
Exercise 14.13 When would we prefer using "[0-9]" instead of "\d"?
Exercise 14.14 What is the difference between the "?", "??", "*", "*?", "+", and "+?" quanti-
fiers?
Exercise 14.15 Does "." match all the characters?
Exercise 14.16 What are named capture groups and how can we refer to the matches thereto in
re.sub?
Exercise 14.17 Write a regex that extracts all standalone numbers accepted by Python, includ-
ing 12.123, -53, +1e-9, -1.2423e10, 4. and .2.
Exercise 14.18 Write a regex that matches all email addresses.
Exercise 14.19 Write a regex that matches all URLs starting with http:// or https://.
Exercise 14.20 Cleanse the warsaw_weather23 dataset so that it contains analysable numeric
data.
23 https://github.com/gagolews/teaching-data/raw/master/marek/warsaw_weather.csv
15
Missing, Censored, and Questionable Data
Up to now, we have been mostly assuming that observations are of decent quality, i.e.,
trustworthy. It would be nice if that was always the case, but it is not.
In this chapter, we briefly address the most basic methods for dealing with suspicious
observations: outliers, missing, censored, imprecise, and incorrect data.
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#")
nhanes.loc[:, ["BMXWT", "BMXHT", "RIDAGEYR", "BMIHEAD", "BMXHEAD"]].head()
## BMXWT BMXHT RIDAGEYR BMIHEAD BMXHEAD
## 0 NaN NaN 2 NaN NaN
## 1 42.2 154.7 13 NaN NaN
## 2 12.0 89.3 2 NaN NaN
## 3 97.1 160.2 29 NaN NaN
## 4 13.6 NaN 2 NaN NaN
Some of the columns feature NaN (not-a-number) values. They are used here to encode
missing (not available) data. Previously, we decided not to be bothered by them: a shy
call to dropna resulted in their removal. But we are curious now.
The reasons behind why some items are missing might be numerous, for instance:
• a participant did not know the answer to a given question;
• someone refused to answer a given question;
• a person did not take part in the study anymore (attrition, death, etc.);
• an item was not applicable (e.g., number of minutes spent cycling weekly when
someone answered they did not learn to ride a bike yet);
• a piece of information was not collected, e.g., due to the lack of funding or a failure
of a piece of equipment.
Looking at the column descriptions on the data provider’s website1 , for example,
BMIHEAD stands for “Head Circumference Comment”, whereas BMXHEAD is “Head Cir-
cumference (cm)”, but these were only collected for infants.
Exercise 15.1 Read the column descriptions (refer to the comments in the CSV file for the relev-
ant URLs) to identify the possible reasons for some of the records in nhanes being missing.
Exercise 15.2 Learn about the difference between the pandas.DataFrameGroupBy.size and
pandas.DataFrameGroupBy.count methods.
There are versions of certain aggregation functions that ignore missing values altogether: numpy.nanmean, numpy.nanmin, numpy.nanmax, numpy.nanpercentile, numpy.nanstd, etc. Note that the corresponding aggregation methods in pandas skip missing values by default (skipna=True). This is quite unfortunate behaviour, as this way we might miss (sic!) the presence of missing values. Therefore, it is crucial to have the dataset carefully pre-inspected.
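For example, compare (a tiny illustration on a made-up vector):
x_with_nan = np.array([1.0, np.nan, 3.0])
np.mean(x_with_nan)     # gives nan – the presence of a missing value is still visible
np.nanmean(x_with_nan)  # gives 2.0 – the missing value is silently ignored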
x # preview
## 0 NaN
## 1 154.7
## 2 89.3
## 3 160.2
## 4 NaN
## Name: BMXHT, dtype: float64
y = (x > 100)
y
## 0 False
## 1 True
## 2 False
## 3 True
## 4 False
## Name: BMXHT, dtype: bool
Unfortunately, comparisons against missing values yield False, instead of the more
semantically valid missing value. Hence, if we want to retain the missingness inform-
ation (we do not know if a missing value is greater than 100), we need to do it manually:
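One possible workaround (a sketch; not necessarily the approach taken in the omitted listing) is to store the comparison results as generic Python objects and re-introduce the missing values ourselves:
y = (x > 100).astype("object")  # allow values other than True/False
y[x.isna()] = None              # we do not know the answer for the missing inputs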
Exercise 15.3 Read the pandas documentation3 about missing value handling.
Important In all kinds of reports from data analysis, we need to be explicit about
the way we handle the missing values. This is because sometimes they might strongly
affect the results.
Let us consider an example vector with missing values, comprised of heights of the
adult participants of the NHANES study.
x = nhanes.loc[nhanes.loc[:, "RIDAGEYR"] >= 18, "BMXHT"]
The simplest approach is to replace each missing value with the corresponding
column’s mean. This does not change the overall average but decreases the variance.
xi = x.copy()
xi[np.isnan(xi)] = np.nanmean(xi)
Similarly, we could consider replacing missing values with the median, or – in the case
of categorical data – the mode.
Unfortunately, such simple imputation can distort the data distribution and introduce some kind of bias; see Figure 15.1 for the histograms of x, xi, and xg. These effects can be obscured if we increase the histogram bins’ widths, but they will still be present in the data. No surprise here: we added to the sample many identical values.
Note (**) Rubin (e.g., in [60]) suggests the use of a procedure called multiple imputation
6 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_p_demo_bmx_2020.csv
(see also [88]), where copies of the original datasets are created, missing values are
imputed by sampling from some estimated distributions, the inference is made, and
then the results are aggregated. An example implementation of such an algorithm is
available in sklearn.impute.IterativeImputer.
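Here is a rough sketch of its use (the column selection and settings are illustrative only):
from sklearn.experimental import enable_iterative_imputer  # noqa: enables the estimator
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, random_state=0)
imputed = imp.fit_transform(nhanes.loc[:, ["BMXWT", "BMXHT", "RIDAGEYR"]])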
• IDs of entities that simply do not exist (e.g., unregistered or deleted clients’ ac-
counts);
and so forth.
To be able to identify and handle incorrect data, we need specific knowledge of a par-
ticular domain. Optimally, basic data validation techniques should already be employed at the data collection stage, for instance when a user submits an online form. There are many tools that can assist us with identifying erroneous observations, e.g., spell checkers such as hunspell7.
For smaller datasets, observations can also be inspected manually. In other cases, we
might have to develop our own algorithms for detecting such bugs in data.
Exercise 15.6 Given some data frame with numeric columns only, perform what follows.
1. Check if all numeric values in each column are between 0 and 1,000.
2. Check if all values in each column are unique.
3. Verify that all the rowwise sums add up to 1.0 (up to a small numeric error).
4. Check if the data frame consists of 0s and 1s only. Provided that this is the case, verify that
for each row, if there is a 1 in some column, then all the columns to the right are filled with 1s
too.
Many data validation methods can be reduced to operations on strings; see Chapter
14. They may be as simple as writing a single regular expression or checking if a label
is in a dictionary of possible values but also as difficult as writing your own parser for
a custom context-sensitive grammar.
Exercise 15.7 Once we import the data fetched from dirty sources, relevant information will
have to be extracted from raw text, e.g., strings like "1" should be converted to floating-point
numbers. Below we suggest several tasks that can aid in developing data validation skills in-
volving some operations on text.
Given an example data frame with text columns (manually invented, please be creative), perform
what follows.
1. Remove trailing and leading whitespaces from each string.
2. Check if all strings can be interpreted as numbers, e.g., "23.43".
3. Verify if a date string in the YYYY-MM-DD format is correct.
4. Determine if a date-time string in the YYYY-MM-DD hh:mm:ss format is correct.
5. Check if all strings are of the form (+NN) NNN-NNN-NNN or (+NN) NNNN-NNN-NNN, where N
denotes any digit (valid telephone numbers).
6. Inspect whether all strings are valid country names.
7 https://hunspell.github.io/
7. (*) Given a person’s date of birth, sex, and Polish ID number PESEL8 , check if that ID is
correct.
8. (*) Determine if a string represents a correct International Bank Account Number (IBAN9 )
(note that IBANs feature two check digits).
9. (*) Transliterate text to ASCII, e.g., "żółty ©" to "zolty (C)".
10. (**) Using an external spell checker, determine if every string is a valid English word.
11. (**) Using an external spell checker, ascertain that every string is a valid English noun in the
singular form.
12. (**) Resolve all abbreviations by means of a custom dictionary, e.g., "Kat." → "Katherine",
"Gr." → "Grzegorz".
15.4 Outliers
Another group of inspection-worthy observations consists of outliers. We can define
them as the samples that reside in the areas of substantially lower density than their
neighbours.
Outliers might be present due to an error or because they are otherwise anomalous, but
they may also simply be interesting, original, or novel. After all, statistics does not
give any meaning to data items; humans do.
What we do with outliers is a separate decision. We can get rid of them, correct them,
replace them with a missing value (and then possibly impute), or analyse them separ-
ately. In particular, there is a separate subfield in statistics called extreme value the-
ory that is interested in predicting the distribution of very large observations (e.g., for
modelling floods, extreme rainfall, or temperatures); see, e.g., [5]. But this is a topic
for a more advanced course; see, e.g., [50]. For now, let us stick with some simpler settings.
Note (*) We can of course choose a different threshold. For instance, for the normal
distribution N(10, 1), even though the probability of observing a value greater than 15 is
theoretically non-zero, it is smaller than 0.000029%, so it is sensible to treat this observation
as suspicious. On the other hand, we do not want to mark too many observations as
outliers because inspecting them manually will be too labour-intense.
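For the record, this probability can be computed directly (a quick illustration):
import scipy.stats
scipy.stats.norm.sf(15, loc=10, scale=1)  # ca. 2.9e-07, i.e., roughly 0.000029%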
Exercise 15.8 For each column in nhanes_p_demo_bmx_202010 , inspect a few smallest and
largest observations and see if they make sense.
Exercise 15.9 Perform the above separately for data in each group as defined by the RIAGENDR
column.
x = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/blobs2.txt")
plt.subplot(1, 2, 1)
sns.boxplot(data=x, orient="h", color="lightgray")
plt.yticks([])
plt.subplot(1, 2, 2)
sns.histplot(x, binwidth=1, color="lightgray")
plt.show()
Fixed-radius search techniques, which we discussed in Section 8.4, can be used for
estimating the underlying probability density function. Given a data sample 𝒙 =
10 https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_p_demo_bmx_2020.csv
Figure 15.2: With box plots, we may fail to detect some outliers
$$ \hat{f}_r(z) = \frac{1}{2rn}\, |B_r(z)|, $$
where |𝐵𝑟 (𝑧)| denotes the number of observations from 𝒙 whose distance to 𝑧 is not
greater than 𝑟, i.e., fall into the interval [𝑧 − 𝑟, 𝑧 + 𝑟].
n = len(x)
r = 1 # radius – feel free to play with different values
import scipy.spatial
t = scipy.spatial.KDTree(x.reshape(-1, 1))
dx = pd.Series(t.query_ball_point(x.reshape(-1, 1), r)).str.len() / (2*r*n)
dx[:6] # preview
## 0 0.000250
## 1 0.116267
## 2 0.116766
## 3 0.166667
## 4 0.076098
## 5 0.156188
## dtype: float64
Then, points in the sample lying in low-density regions (i.e., all 𝑥𝑖 such that 𝑓𝑟̂ (𝑥𝑖 ) is
small) can be flagged for further inspection:
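For instance (a sketch; the 0.001 cut-off is illustrative, not a recommendation):
x[dx.to_numpy() < 0.001]  # observations lying in very low-density regions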
11 This is an instance of a kernel density estimator, with the simplest kernel – a rectangular one.
See Figure 15.3 for an illustration of 𝑓𝑟̂ . Of course, 𝑟 should be chosen with care – just
like the number of bins in a histogram.
sns.histplot(x, binwidth=1, stat="density", color="lightgray")
z = np.linspace(np.min(x)-5, np.max(x)+5, 1001)
dz = pd.Series(t.query_ball_point(z.reshape(-1, 1), r)).str.len() / (2*r*n)
plt.plot(z, dz, label=f"density estimator ($r={r}$)")
plt.show()
The scatterplot in Figure 15.5 reveals that the data consist of two quite well-separable
blobs:
There are a few observations that we might mark as outliers. The truth is that yours
truly injected eight junk points at the very end of the dataset. Ha.
X[-8:, :]
## array([[-3. , 3. ],
## [ 3. , 3. ],
## [ 3. , -3. ],
## [-3. , -3. ],
## [-3.5, 3.5],
## [-2.5, 2.5],
## [-2. , 2. ],
## [-1.5, 1.5]])
t = scipy.spatial.KDTree(X)
n = t.query_ball_point(X, 0.2) # r=0.2 (radius) – play with it yourself
c = np.array(pd.Series(n).str.len())
c[[0, 1, -2, -1]] # preview
## array([42, 30, 1, 1])
c[i] gives the number of points within X[i, :]’s 𝑟-radius (with respect to the Euclidean dis-
tance), including the point itself. Consequently, c[i]==1 denotes a potential outlier; see Fig-
ure 15.6 for an illustration.
Figure 15.6: Outlier detection based on a fixed-radius search for the blobs1 dataset
12 (**) We can easily normalise the outputs to get a true 2D kernel density estimator, but multivariate
statistics is beyond the scope of this course. In particular, the fact that data might have fixed marginal distributions (projections onto 1D) while their multidimensional (joint) distributions can still be very different is beautifully described by the copula theory; see [66].
15.5 Exercises
Exercise 15.12 How can missing values be represented in numpy and pandas?
Exercise 15.13 Explain some basic strategies for dealing with missing values in numeric vec-
tors.
Exercise 15.14 Why should we be very explicit about the way we handle missing and other suspicious data? Is it a good idea to mark as missing (or remove completely) the observations that
we dislike or otherwise deem inappropriate, controversial, dangerous, incompatible with
our political views, etc.?
Exercise 15.15 Is replacing missing values with the sample arithmetic mean for income data
(as in, e.g., the uk_income_simulated_202013 dataset) a sensible strategy?
Exercise 15.16 What are the differences between data missing completely at random, missing
at random, and missing not at random?
Exercise 15.17 List some basic strategies for dealing with data that might contain outliers.
13 https://github.com/gagolews/teaching-data/raw/master/marek/uk_income_simulated_2020.txt
16
Time Series
So far, we have been using numpy and pandas mostly for storing:
• independent measurements, where each row gives, e.g., weight, height, … records
of a different subject; we often consider these a sample of a representative subset
of one or more populations, each recorded at a particular point in time;
• data summaries to be reported in the form of tables or figures, e.g., frequency
distributions giving counts for the corresponding categories or labels.
In this chapter, we will explore the most basic concepts related to the wrangling of
time series, i.e., signals indexed by discrete time. Usually, a time series is a sequence of
measurements sampled at equally spaced moments, e.g., a patient’s heart rate probed
every second, daily average currency exchange rates, or highest yearly temperatures
recorded in some location.
temps = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/spokane_temperature.txt")
Here are some data aggregates for the whole sample. First, the popular quantiles:
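For example, the quartiles together with the minimum and the maximum can be obtained as follows (a sketch; output omitted):
np.quantile(temps, [0, 0.25, 0.5, 0.75, 1])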
1 Note that midrange, being the mean of the lowest and the highest observed temperature on a given day,
is not a particularly good estimate of the average daily reading. This dataset is considered for illustrational
purposes only.
Figure 16.1: Distribution of the midrange daily temperatures in Spokane in the period
1889-2021; observations are treated as a bag of unrelated items (temperature on a “ran-
domly chosen day” in a version of planet Earth where there is no climate change)
When computing data aggregates or plotting histograms, the order of elements does
not matter. Contrary to the case of the independent measurements, vectors represent-
ing time series do not have to be treated simply as mixed bags of unrelated items.
Important In time series, for any given item 𝑥𝑖 , its neighbouring elements 𝑥𝑖−1 and
𝑥𝑖+1 denote the recordings occurring directly before and after it. We can use this
temporal ordering to model how consecutive measurements depend on each other, de-
scribe how they change over time, forecast future values, detect seasonal and long-
time trends, and so forth.
In Figure 16.2, we depict the data for 2021, plotted as a function of time. What we
see is often referred to as a line chart (line graph): data points are connected by straight
line segments. There are some visible seasonal variations, such as, well, obviously, that
winter is colder than summer. There is also some natural variability on top of seasonal
patterns typical for the Northern Hemisphere.
plt.plot(temps[-365:])
plt.xticks([0, 181, 364], ["2021-01-01", "2021-07-01", "2021-12-31"])
plt.show()
Figure 16.2: Line chart of midrange daily temperatures in Spokane for 2021
d = np.array([
"1889-08-01", "1970-01-01", "1970-01-02", "2021-12-31", "today"
], dtype="datetime64")
d
2 https://numpy.org/doc/stable/reference/arrays.datetime.html
Important Internally, the above are represented as the number of days or seconds
since the so-called Unix Epoch, 1970-01-01T00:00:00 in the UTC time zone.
d.astype(float)
## array([-2.9372e+04, 0.0000e+00, 1.0000e+00, 1.8992e+04, 1.9387e+04])
dt.astype(float)
## array([7.26500000e+03, 1.67505928e+09])
spokane = pd.DataFrame(dict(
date=np.arange("1889-08-01", "2022-01-01", dtype="datetime64[D]"),
temp=temps
))
spokane.head()
## date temp
## 0 1889-08-01 21.1
## 1 1889-08-02 20.8
## 2 1889-08-03 22.2
## 3 1889-08-04 21.7
## 4 1889-08-05 18.3
Interestingly, if we ask the date column to become the data frame’s index (i.e., row
labels), we will be able to select date ranges quite easily with loc[...] and string slices (refer to the manual of pandas.DatetimeIndex for more details).
spokane.set_index("date").loc["2021-12-25":, :].reset_index()
## date temp
## 0 2021-12-25 -1.4
## 1 2021-12-26 -5.0
## 2 2021-12-27 -9.4
## 3 2021-12-28 -12.8
## 4 2021-12-29 -12.2
## 5 2021-12-30 -11.4
## 6 2021-12-31 -11.4
Example 16.2 Based on the above, we can plot the data for the last five years quite easily; see
Figure 16.3. Note that the x-axis labels are generated automatically.
x = spokane.set_index("date").loc["2017-01-01":, "temp"]
plt.plot(x)
plt.show()
The pandas.to_datetime function can also convert arbitrarily formatted date strings,
e.g., "MM/DD/YYYY" or "DD.MM.YYYY" to Series of datetime64s.
Figure 16.3: Line chart of midrange daily temperatures in Spokane for 2017–2021
Exercise 16.3 From the birth_dates3 dataset, select all people less than 18 years old (as of the
current day).
Several datetime functions and related properties can be referred to via the pandas.
Series.dt accessor (similarly to pandas.Series.str discussed in Chapter 14). In par-
ticular, they deliver a convenient means for extracting different date or time fields,
such as date, time, year, month, day, dayofyear, hour, minute, second, etc. For instance:
dates_ymd = pd.DataFrame(dict(
year = dates.dt.year,
month = dates.dt.month,
day = dates.dt.day
))
dates_ymd
## year month day
## 0 1991 4 5
## 1 2022 7 14
## 2 2042 12 21
dates.dt.strftime("%d.%m.%Y")
## 0 05.04.1991
## 1 14.07.2022
## 2 21.12.2042
## dtype: object
3 https://github.com/gagolews/teaching-data/raw/master/marek/birth_dates.csv
Interestingly, pandas.to_datetime can also convert data frames with columns named
year, month, day, etc., back to datetime objects directly:
pd.to_datetime(dates_ymd)
## 0 1991-04-05
## 1 2022-07-14
## 2 2042-12-21
## dtype: datetime64[ns]
Example 16.4 Let us extract the month and year parts of dates to compute the average monthly temperatures in the last 50-ish years:
x = spokane.set_index("date").loc["1970":, ].reset_index()
mean_monthly_temps = x.groupby([
x.date.dt.year.rename("year"),
x.date.dt.month.rename("month")
]).temp.mean().unstack()
mean_monthly_temps.head().round(1) # preview
## month 1 2 3 4 5 6 7 8 9 10 11 12
## year
## 1970 -3.4 2.3 2.8 5.3 12.7 19.0 22.5 21.2 12.3 7.2 2.2 -2.4
## 1971 -0.1 0.8 1.7 7.4 13.5 14.6 21.0 23.4 12.9 6.8 1.9 -3.5
## 1972 -5.2 -0.7 5.2 5.6 13.8 16.6 20.0 21.7 13.0 8.4 3.5 -3.7
## 1973 -2.8 1.6 5.0 7.8 13.6 16.7 21.8 20.6 15.4 8.4 0.9 0.7
## 1974 -4.4 1.8 3.6 8.0 10.1 18.9 19.9 20.1 15.8 8.9 2.4 -0.8
Figure 16.4 depicts these data on a heatmap. We rediscover the ultimate truth that winters are
cold, whereas in the summertime the living is easy, what a wonderful world.
sns.heatmap(mean_monthly_temps)
plt.show()
d = np.diff(x)
d
## array([-3.6, -4.4, -3.4, 0.6, 0.8, 0. ])
For instance, between the first and the second day of the last week, the midrange temperature dropped by 3.6°C.
The other way around, here are the cumulative sums of the deltas:
np.cumsum(d)
## array([ -3.6, -8. , -11.4, -10.8, -10. , -10. ])
This turned deltas back to a shifted version of the original series. But we will need the
first (root) observation therefrom to restore the dataset in full:
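For instance (a sketch; assuming x holds the last seven readings, temps[-7:]):
np.append(x[0], x[0] + np.cumsum(d))  # recovers the original seven temperatures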
An exponential distribution is identified by the scale parameter s > 0, being at the same time its expected value. The probability density function of Exp(s) is given for x ≥ 0 by:
$$ f(x) = \frac{1}{s}\, e^{-x/s}, $$
and f(x) = 0 otherwise. We should be careful: some textbooks choose the parametrisation by the rate λ = 1/s instead of s. The scipy package, however, relies on the scale (s) parametrisation.
Here is a pseudorandom sample where there are five events per minute on average:
np.random.seed(123)
λ = 60/5 # 5 events per 60 seconds on average
d = scipy.stats.expon.rvs(size=1200, scale=λ)
np.round(d[:8], 3) # preview
## array([14.307, 4.045, 3.087, 9.617, 15.253, 6.601, 47.412, 13.856])
np.mean(d)
## 11.839894504211724
The result is close to what we expected, i.e., 𝑠 = 12 seconds between the events.
We can convert the above to datetime (starting at a fixed calendar date) as follows. Note that we
will measure the deltas in milliseconds so that we do not lose precision; datetime64 is based on
integers, not floating-point numbers.
t0 = np.array("2022-01-01T00:00:00", dtype="datetime64[ms]")
d_ms = np.round(d*1000).astype(int) # in milliseconds
t = t0 + np.array(np.cumsum(d_ms), dtype="timedelta64[ms]")
4 https://github.com/gagolews/teaching-data/raw/master/marek/euraud-20200101-20200630-no-na.txt
As an exercise, let us apply binning and count how many events occur in each hour:
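One possible approach (a sketch; the original listing is not reproduced here):
hours, counts = np.unique(t.astype("datetime64[h]"), return_counts=True)
counts  # the number of events in each consecutive hour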
We expect five events per minute, i.e., 300 of them per hour. On a side note, from a course in statistics
we know that for exponential inter-event times, the number of events per unit of time follows a
Poisson distribution.
Exercise 16.7 (*) Consider the wait_times5 dataset that gives the times between some consec-
utive events, in seconds. Estimate the event rate per hour. Draw a histogram representing the
number of events per hour.
Exercise 16.8 (*) Consider the btcusd_ohlcv_2021_dates6 dataset which gives the daily
BTC/USD exchange rates in 2021:
btc = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/btcusd_ohlcv_2021_dates.csv",
comment="#").loc[:, ["Date", "Close"]]
btc["Date"] = btc["Date"].astype("datetime64[D]")
btc.head(12)
## Date Close
## 0 2021-01-01 29374.152
## 1 2021-01-02 32127.268
## 2 2021-01-03 32782.023
## 3 2021-01-04 31971.914
## 4 2021-01-05 33992.430
## 5 2021-01-06 36824.363
5 https://github.com/gagolews/teaching-data/raw/master/marek/wait_times.txt
6 https://github.com/gagolews/teaching-data/raw/master/marek/btcusd_ohlcv_2021_dates.csv
Write a function that converts it to a lagged representation, being a convenient form for some
machine learning algorithms.
1. Add the Change column that gives by how much the price changed since the previous day.
2. Add the Dir column indicating if the change was positive or negative.
3. Add the Lag1, …, Lag5 columns which give the Changes in the five preceding days.
The first few rows of the resulting data frame should look like this (assuming we do not want any
missing values):
In the 6th row (representing 2021-01-12), Lag1 corresponds to Change on 2021-01-11, Lag2 gives
the Change on 2021-01-10, and so forth.
To spice things up, make sure your code can generate any number (as defined by another para-
meter to the function) of lagged variables.
$$ y_i = \frac{1}{k}\left(x_i + x_{i+1} + \cdots + x_{i+k-1}\right) = \frac{1}{k}\sum_{j=1}^{k} x_{i+j-1}, $$
For example, here are the temperatures in the last 7 days of December 2021:
x = spokane.set_index("date").iloc[-7:, :]
x
## temp
## date
## 2021-12-25 -1.4
## 2021-12-26 -5.0
## 2021-12-27 -9.4
## 2021-12-28 -12.8
## 2021-12-29 -12.2
## 2021-12-30 -11.4
## 2021-12-31 -11.4
x.rolling(3, center=True).mean().round(2)
## temp
## date
## 2021-12-25 NaN
## 2021-12-26 -5.27
## 2021-12-27 -9.07
## 2021-12-28 -11.47
## 2021-12-29 -12.13
## 2021-12-30 -11.67
## 2021-12-31 NaN
We get, in this order: the mean of the first three observations; the mean of the 2nd,
3rd, and 4th items; then the mean of the 3rd, 4th, and 5th; and so forth. Notice that
the observations were centred in such a way that we have the same number of miss-
ing values at the start and end of the series. This way, we treat the first 3-day moving
average (the average of the temperatures on the first three days) as representative of
the 2nd day.
And now for something completely different; the 5-moving average:
x.rolling(5, center=True).mean().round(2)
## temp
## date
## 2021-12-25 NaN
## 2021-12-26 NaN
## 2021-12-27 -8.16
## 2021-12-28 -10.16
## 2021-12-29 -11.44
## 2021-12-30 NaN
## 2021-12-31 NaN
Applying the moving average has the nice effect of smoothing out all kinds of broadly
conceived noise. To illustrate this, compare the temperature data for the last five years
in Figure 16.3 to their averaged versions in Figure 16.5.
x = spokane.set_index("date").loc["2017-01-01":, "temp"]
x30 = x.rolling(30, center=True).mean()
x100 = x.rolling(100, center=True).mean()
plt.plot(x30, label="30-day moving average")
plt.plot(x100, "r--", label="100-day moving average")
plt.legend()
plt.show()
Figure 16.5: Line chart of 30- and 100-moving averages of the midrange daily temper-
atures in Spokane for 2017-2021
Exercise 16.9 (*) Other aggregation functions can be applied in rolling windows as well. Draw,
in the same figure, the plots of the 1-year moving minimums, medians, and maximums.
x = spokane.set_index("date").loc["1970-01-01":, "temp"]
x10y = x.rolling(3653, center=True).mean()
xd = x - x10y
Seasonal patterns can be revealed by smoothening out the detrended version of the
data, e.g., using a 1-year moving average:
xd1y = xd.rolling(365, center=True).mean()  # a 1-year moving average of the detrended series
plt.plot(x10y, label="trend")
plt.plot(xd1y, "r--", label="seasonal pattern")
plt.legend()
plt.show()
Figure 16.6: Trend and seasonal pattern for the Spokane temperatures in recent years
Also, if we know the length of the seasonal pattern (in our case, 365-ish days), we can
draw a seasonal plot, where we have a separate curve for each season (here: year) and
where all the series share the same x-axis (here: the day of the year); see Figure 16.7.
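Here is a rough sketch of how such a figure can be generated (only a few selected years; not the exact code behind Figure 16.7):
for year in [1970, 1980, 1990, 2000, 2010, 2020]:
    y = spokane.loc[spokane.date.dt.year == year, :]
    plt.plot(y.date.dt.dayofyear, y.temp, label=str(year), alpha=0.7)
plt.xlabel("Day of year")
plt.ylabel("Temperature")
plt.legend()
plt.show()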
Figure 16.7: Seasonal plot: Temperatures in Spokane vs the day of the year; years
between 1970 and 2021
Exercise 16.10 Draw a similar plot for the whole data range, i.e., 1889–2021.
Exercise 16.11 Try using pd.Series.dt.strftime with a custom formatter instead of pd.
Series.dt.dayofyear.
Example 16.12 The classic air_quality_19737 dataset gives some daily air quality measure-
ments in New York, between May and September 1973. Let us impute the first few observations
in the solar radiation column:
air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/r/air_quality_1973.csv",
comment="#")
x = air.loc[:, "Solar.R"].iloc[:12]
pd.DataFrame(dict(
original=x,
ffilled=x.fillna(method="ffill"),
bfilled=x.fillna(method="bfill"),
interpolated=x.interpolate(method="linear")
))
## original ffilled bfilled interpolated
## 0 190.0 190.0 190.0 190.000000
## 1 118.0 118.0 118.0 118.000000
## 2 149.0 149.0 149.0 149.000000
## 3 313.0 313.0 313.0 313.000000
## 4 NaN 313.0 299.0 308.333333
## 5 NaN 313.0 299.0 303.666667
## 6 299.0 299.0 299.0 299.000000
## 7 99.0 99.0 99.0 99.000000
## 8 19.0 19.0 19.0 19.000000
## 9 194.0 194.0 194.0 194.000000
## 10 NaN 194.0 256.0 225.000000
## 11 256.0 256.0 256.0 256.000000
eurxxx = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/eurxxx-20200101-20200630-no-na.csv",
delimiter=",")
eurxxx[:6, :] # preview
## array([[1.6006 , 7.7946 , 0.84828, 4.2544 ],
## [1.6031 , 7.7712 , 0.85115, 4.2493 ],
## [1.6119 , 7.8049 , 0.85215, 4.2415 ],
## [1.6251 , 7.7562 , 0.85183, 4.2457 ],
## [1.6195 , 7.7184 , 0.84868, 4.2429 ],
## [1.6193 , 7.7011 , 0.85285, 4.2422 ]])
This gives EUR/AUD (how many Australian Dollars we pay for 1 Euro), EUR/CNY
(Chinese Yuans), EUR/GBP (British Pounds), and EUR/PLN (Polish Złotys), in this or-
der. Let us draw the four time series; see Figure 16.8.
dates = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/euraud-20200101-20200630-dates.txt",
dtype="datetime64")
labels = ["AUD", "CNY", "GBP", "PLN"]
styles = ["solid", "dotted", "dashed", "dashdot"]
for i in range(eurxxx.shape[1]):
plt.plot(dates, eurxxx[:, i], ls=styles[i], label=labels[i])
plt.legend(loc="upper right", bbox_to_anchor=(1, 0.9)) # a bit lower
plt.show()
Unfortunately, they are all on different scales. This is why the plot is not necessarily
readable. It would be better to draw these time series on four separate plots (compare
the trellis plots in Section 12.2.5).
Another idea is to depict the currency exchange rates relative to the prices on some day,
say, the first one; see Figure 16.9.
for i in range(eurxxx.shape[1]):
plt.plot(dates, eurxxx[:, i]/eurxxx[0, i],
ls=styles[i], label=labels[i])
plt.legend()
plt.show()
Figure 16.8: EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates in the first
half of 2020
Figure 16.9: EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates relative to
the prices on the first day
This way, e.g., a relative EUR/AUD rate of ca. 1.15 in mid-March means that if an Aussie
bought some Euros on the first day, and then sold them three-ish months later, they
would have 15% more wealth (the Euro became 15% stronger relative to the AUD).
Exercise 16.14 Based on the EUR/AUD and EUR/PLN records, compute and plot the AUD/PLN
as well as PLN/AUD rates.
Exercise 16.15 (*) Draw the EUR/AUD and EUR/GBP rates on a single plot, but where each
series has its own9 y-axis.
Exercise 16.16 (*) Draw the EUR/xxx rates for your favourite currencies over a larger period.
Use data10 downloaded from the European Central Bank. Add a few moving averages. For each
year, identify the lowest and the highest rate.
btcusd = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/btcusd_ohlcv_2021.csv",
delimiter=",")
btcusd[:6, :4] # preview (we skip the Volume column for readability)
## array([[28994.01 , 29600.627, 28803.586, 29374.152],
## [29376.455, 33155.117, 29091.182, 32127.268],
## [32129.408, 34608.559, 32052.316, 32782.023],
## [32810.949, 33440.219, 28722.756, 31971.914],
## [31977.041, 34437.59 , 30221.188, 33992.43 ],
## [34013.613, 36879.699, 33514.035, 36824.363]])
This gives the open, high, low, and close (OHLC) prices on the 365 consecutive days,
which is a common way to summarise daily rates.
The mplfinance11 package (matplotlib-finance) features a few functions related to the
plotting of financial data. Here, let us briefly describe the well-known candlestick plot.
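For instance, a plot like the one in Figure 16.10 can be obtained along the following lines (a sketch; not necessarily the exact invocation used to generate the figure):
import mplfinance as mpf
df = pd.DataFrame(
    btcusd[:31, :4],  # January 2021
    columns=["Open", "High", "Low", "Close"],
    index=np.arange("2021-01-01", "2021-02-01", dtype="datetime64[D]")
)
mpf.plot(df, type="candle")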
Figure 16.10 depicts the January 2021 data. Let us stress that this is not a box and
9 https://matplotlib.org/stable/gallery/subplots_axes_and_figures/secondary_axis.html
10 https://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/html/index.en.html
11 https://github.com/matplotlib/mplfinance
Figure 16.10: Candlestick plot for the BTC/USD exchange rates in January 2021
whisker plot. The candlestick body denotes the difference in the market opening and
the closing price. The wicks (shadows) give the range (high to low). White candle-
sticks represent bullish days – where the closing rate is greater than the opening one
(uptrend). Black candles are bearish (decline).
Exercise 16.17 Draw the BTC/USD rates for the entire year and add the 10-day moving aver-
ages.
Exercise 16.18 (*) Draw a candlestick plot manually, without using the mplfinance package.
Hint: matplotlib.pyplot.fill might be helpful.
Exercise 16.19 (*) Using matplotlib.pyplot.fill_between add a semi-transparent poly-
gon that fills the area bounded between the Low and High prices on all the days.
pare at some level of generality (think: health data on patients being subject to two
treatment plans that we wish to evaluate).
From this perspective, time series are already quite distinct, because there is some
dependence observed in the time domain: a price of a stock that we observe today is
influenced by what was happening yesterday. There might also be some seasonal pat-
terns or trends under the hood. For a good introduction to forecasting, see [52, 70].
Also, for data of this kind, employing statistical modelling techniques (stochastic pro-
cesses) can make a lot of sense; see, e.g., [84].
Signals such as audio, images, and video are different, because structured randomness
does not play a dominant role there (unless it is noise that we would like to filter out).
Instead, what is happening in the frequency (think: perceiving pitches when listening
to music) or spatial (seeing green grass and sky in a photo) domain will play a key role
there.
Signal processing thus requires a distinct set of tools, e.g., Fourier analysis and finite
impulse response (discrete convolution) filters. This course obviously cannot be about
everything (also because it requires some more advanced calculus skills that we did
not assume the reader to have at this time); but see, e.g., [82, 83].
Nevertheless, we should keep in mind that these are not completely independent do-
mains. For example, we can extract various features of audio signals (e.g., overall
loudness, timbre, and danceability of each recording in a large song database) and
then treat them as tabular data to be analysed using the techniques described in this
course. Moreover, machine learning (e.g., convolutional neural networks) algorithms
may also be used for tasks such as object detection on images or optical character re-
cognition; see, e.g., [42].
16.5 Exercises
Exercise 16.20 Assume we have a time series with n observations. What is a 1- and an n-
moving average? Which one is smoother, a (0.01n)- or a (0.1n)- one?
Exercise 16.21 What is the Unix Epoch?
Exercise 16.22 How can we recreate the original series when we are given its numpy.diff-
transformed version?
Exercise 16.23 (*) In your own words, describe the key elements of a candlestick plot.
Changelog
Note that the most up-to-date version of this book can be found at https://
datawranglingpy.gagolewski.com/.
Below is the list of the most noteworthy changes.
• 2023-02-06 (v1.0.3):
– Numeric reference style; updated bibliography.
– Reduce the file size of the screen-optimised PDF at the cost of a slight de-
crease of the quality of some figures.
– The print-optimised PDF now uses selective rasterisation of parts of figures,
not whole pages containing them. This should result in a much better quality
of the printed version.
– Bug fixes.
– Minor extensions, including: pandas.Series.dt.strftime, more details how
to avoid pitfalls in data frame indexing, etc.
• 2022-08-24 (v1.0.2):
– First printed (paperback) version can be ordered from Amazon13 .
– Fixed page margin and header sizes.
– Minor typesetting and other fixes.
• 2022-08-12 (v1.0.1):
– Cover.
– ISBN 978-0-6455719-1-2 assigned.
• 2022-07-16 (v1.0.0):
– Preface complete.
– Handling tied observations.
12 https://github.com/gagolews/datawranglingpy/issues
13 https://www.amazon.com/dp/0645571911
[1] Abramowitz, M., Stegun, I.A., editors. (1972). Handbook of Mathematical Functions
with Formulas, Graphs, and Mathematical Tables. Dover Publications. URL: https://
personal.math.ubc.ca/~cbm/aands/intro.htm.
[2] Aggarwal, C.C. (2015). Data Mining: The Textbook. Springer.
[3] Arnold, T.B., Emerson, J.W. (2011). Nonparametric goodness-of-fit tests for dis-
crete null distributions. The R Journal, 3(2):34–39. DOI: 10.32614/RJ-2011-016.
[4] Bartoszyński, R., Niewiadomska-Bugaj, M. (2007). Probability and Statistical In-
ference. Wiley.
[5] Beirlant, J., Goegebeur, Y., Teugels, J., Segers, J. (2004). Statistics of Extremes: The-
ory and Applications. Wiley. DOI: 10.1002/0470012382.
[6] Bezdek, J.C., Ehrlich, R., Full, W. (1984). FCM: The fuzzy c-means cluster-
ing algorithm. Computer and Geosciences, 10(2–3):191–203. DOI: 10.1016/0098-
3004(84)90020-7.
[7] Billingsley, P. (1995). Probability and Measure. John Wiley & Sons.
[8] Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer-Verlag. URL:
https://www.microsoft.com/en-us/research/people/cmbishop/.
[9] Blum, A., Hopcroft, J., Kannan, R. (2020). Foundations of Data Science. Cambridge
University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.
[10] Box, G.E.P., Cox, D.R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society. Series B (Methodological), 26(2):211–252.
[11] Bullen, P.S. (2003). Handbook of Means and Their Inequalities. Springer Sci-
ence+Business Media, Dordrecht.
[12] Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J. (2015). Hierarchical dens-
ity estimates for data clustering, visualization, and outlier detection. ACM Trans-
actions on Knowledge Discovery from Data, 10(1):5:1–5:51. DOI: 10.1145/2733381.
[13] Chambers, J.M., Hastie, T. (1991). Statistical Models in S. Wadsworth &
Brooks/Cole.
[14] Clauset, A., Shalizi, C.R., Newman, M.E.J. (2009). Power-law distributions in
empirical data. SIAM Review, 51(4):661–703. DOI: 10.1137/070710111.
[15] Connolly, T., Begg, C. (2015). Database Systems: A Practical Approach to Design, Im-
plementation, and Management. Pearson.
[16] Conover, W.J. (1972). A Kolmogorov goodness-of-fit test for discontinuous dis-
tributions. Journal of the American Statistical Association, 67(339):591–596. DOI:
10.1080/01621459.1972.10481254.
[17] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press.
URL: https://archive.org/details/in.ernet.dli.2015.223699.
[18] Dasu, T., Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John
Wiley & Sons.
[19] Date, C.J. (2003). An Introduction to Database Systems. Pearson.
[20] Deisenroth, M.P., Faisal, A.A., Ong, C.S. (2020). Mathematics for Machine Learn-
ing. Cambridge University Press. URL: https://mml-book.github.io/.
[21] Dekking, F.M., Kraaikamp, C., Lopuhaä, H.P., Meester, L.E. (2005). A Modern
Introduction to Probability and Statistics: Understanding Why and How. Springer.
[22] Devroye, L., Györfi, L., Lugosi, G. (1996). A Probabilistic Theory of Pattern Recogni-
tion. Springer. DOI: 10.1007/978-1-4612-0711-5.
[23] Deza, M.M., Deza, E. (2014). Encyclopedia of Distances. Springer.
[24] Efron, B., Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence,
and Data Science. Cambridge University Press.
[25] Ester, M., Kriegel, H.P., Sander, J., Xu, X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. In: Proc. KDD'96, pp.
226–231.
[26] Feller, W. (1950). An Introduction to Probability Theory and Its Applications: Volume I.
Wiley.
[27] Forbes, C., Evans, M., Hastings, N., Peacock, B. (2010). Statistical Distributions.
Wiley.
[28] Freedman, D., Diaconis, P. (1981). On the histogram as a density estimator: L₂
theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57:453–476.
[29] Friedl, J.E.F. (2006). Mastering Regular Expressions. O'Reilly.
[30] Gagolewski, M. (2015). Data Fusion: Theory, Methods, and Applications. Institute of
Computer Science, Polish Academy of Sciences. DOI: 10.5281/zenodo.6960306.
[31] Gagolewski, M. (2015). Spread measures and their relation to aggrega-
tion functions. European Journal of Operational Research, 241(2):469–477. DOI:
10.1016/j.ejor.2014.08.034.
[32] Gagolewski, M. (2021). genieclust: Fast and robust hierarchical cluster-
ing. SoftwareX, 15:100722. URL: https://genieclust.gagolewski.com, DOI:
10.1016/j.softx.2021.100722.
[33] Gagolewski, M. (2022). stringi: Fast and portable character string processing in
R. Journal of Statistical Software, 103(2):1–59. URL: https://stringi.gagolewski.com,
DOI: 10.18637/jss.v103.i02.
[53] Hyndman, R.J., Fan, Y. (1996). Sample quantiles in statistical packages. American
Statistician, 50(4):361–365. DOI: 10.2307/2684934.
[54] Kleene, S.C. (1951). Representation of events in nerve nets and finite automata.
Technical Report RM-704, The RAND Corporation, Santa Monica, CA. URL:
https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/
RM704.pdf.
[55] Knuth, D.E. (1992). Literate Programming. CSLI.
[56] Knuth, D.E. (1997). The Art of Computer Programming II: Seminumerical Algorithms.
Addison-Wesley.
[57] Kuchling, A.M. (2023). Regular Expression HOWTO. URL: https://docs.python.
org/3/howto/regex.html.
[58] Lee, J. (2011). A First Course in Combinatorial Optimisation. Cambridge University
Press.
[59] Ling, R.F. (1973). A probability theory of cluster analysis. Journal of the American
Statistical Association, 68(341):159–164. DOI: 10.1080/01621459.1973.10481356.
[60] Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data. John Wiley
& Sons.
[61] Lloyd, S.P. (1957 (1982)). Least squares quantization in PCM. IEEE Transactions on
Information Theory, 28:128–137. Originally a 1957 Bell Telephone Laboratories Re-
search Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.
[62] McKinney, W. (2022). Python for Data Analysis. O'Reilly. URL: https://wesmckinney.com/book/.
[63] Modarres, M., Kaminskiy, M.P., Krivtsov, V. (2016). Reliability Engineering and Risk
Analysis: A Practical Guide. CRC Press.
[64] Monahan, J.F. (2011). Numerical Methods of Statistics. Cambridge University Press.
[65] Müllner, D. (2011). Modern hierarchical, agglomerative clustering algorithms.
arXiv:1109.2378 [stat.ML]. URL: https://arxiv.org/abs/1109.2378v1.
[66] Nelsen, R.B. (1999). An Introduction to Copulas. Springer-Verlag.
[67] Newman, M.E.J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351. DOI: 10.1080/00107510500052444.
[68] Oetiker, T., et al. (2021). The Not So Short Introduction to LaTeX 2ε. URL: https://tobi.oetiker.ch/lshort/lshort.pdf.
[69] Olver, F.W.J., et al. (2023). NIST Digital Library of Mathematical Functions. URL:
https://dlmf.nist.gov/.
[70] Ord, J.K., Fildes, R., Kourentzes, N. (2017). Principles of Business Forecasting.
Wessex Press.
[71] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. (2011). Scikit-learn: Ma-
chine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
[72] Poore, G.M. (2019). Codebraid: Live code in pandoc Markdown. In: Proc. 18th Python in Science Conf., pp. 54–61. URL: https://conference.scipy.org/proceedings/scipy2019/pdfs/geoffrey_poore.pdf.
[73] Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. (2007). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.
[74] Pérez-Fernández, R., De Baets, B., Gagolewski, M. (2019). A taxonomy of monotonicity properties for the aggregation of multidimensional data. Information Fusion, 52:322–334. DOI: 10.1016/j.inffus.2019.05.006.
[75] Rabin, M., Scott, D. (1959). Finite automata and their decision problems. IBM
Journal of Research and Development, 3:114–125.
[76] Ritchie, D.M., Thompson, K.L. (1970). QED text editor. Technical Report 70107-002, Bell Telephone Laboratories, Inc. URL: https://wayback.archive-it.org/all/20150203071645/http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf.
[77] Robert, C.P., Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag.
[78] Ross, S.M. (2020). Introduction to Probability and Statistics for Engineers and Scientists.
Academic Press.
[79] Rousseeuw, P.J., Ruts, I., Tukey, J.W. (1999). The bagplot: A bivariate boxplot. The
American Statistician, 53(4):382–387. DOI: 10.2307/2686061.
[80] Rubin, D.B. (1976). Inference and missing data. Biometrika, 63(3):581–590.
[81] Sandve, G.K., Nekrutenko, A., Taylor, J., Hovig, E. (2013). Ten simple rules for re-
producible computational research. PLOS Computational Biology, 9(10):1–4. DOI:
10.1371/journal.pcbi.1003285.
[82] Smith, S.W. (2002). The Scientist and Engineer's Guide to Digital Signal Processing.
Newnes. URL: https://www.dspguide.com/.
[83] Steiglitz, K. (1996). A Digital Signal Processing Primer: With Applications to Digital Au-
dio and Computer Music. Pearson.
[84] Tijms, H.C. (2003). A First Course in Stochastic Models. Wiley.
[85] Tufte, E.R. (2001). The Visual Display of Quantitative Information. Graphics Press.
[86] Tukey, J.W. (1962). The future of data analysis. Annals of Mathematical Statistics, 33(1):1–67. URL: https://projecteuclid.org/journalArticle/Download?urlId=10.1214%2Faoms%2F1177704711, DOI: 10.1214/aoms/1177704711.
[87] Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.
[88] van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press. URL: https://stefvanbuuren.name/fimd/.
[89] van der Loo, M., de Jonge, E. (2018). Statistical Data Cleaning with Applications in R.
John Wiley & Sons.
[90] Virtanen, P., et al. (2020). SciPy 1.0: Fundamental algorithms for scientific com-
puting in Python. Nature Methods, 17:261–272. DOI: 10.1038/s41592-019-0686-2.
[91] Waskom, M.L. (2021). seaborn: Statistical data visualization. Journal of Open
Source Software, 6(60):3021. DOI: 10.21105/joss.03021.
[92] Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal
of Statistical Software, 40(1):1–29. DOI: 10.18637/jss.v040.i01.
[93] Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10):1–23. DOI:
10.18637/jss.v059.i10.
[94] Wierzchoń, S.T., Kłopotek, M.A. (2018). Modern Algorithms for Cluster Analysis.
Springer. DOI: 10.1007/978-3-319-69308-8.
[95] Wilson, G., et al. (2014). Best practices for scientific computing. PLOS Biology,
12(1):1–7. DOI: 10.1371/journal.pbio.1001745.
[96] Wilson, G., et al. (2017). Good enough practices in scientific computing. PLOS
Computational Biology, 13(6):1–20. DOI: 10.1371/journal.pcbi.1005510.
[97] Xie, Y. (2015). Dynamic Documents with R and knitr. Chapman and Hall/CRC.