All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is
available at http://www.wiley.com/go/permissions.
The right of Philippe J.S. De Brouwer to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this
book may not be available in other formats.
10 9 8 7 6 5 4 3 2 1
Short Overview
Foreword xxv
Acknowledgements xxix
Preface xxxi
I Introduction 1
1 The Big Picture with Kondratiev and Kardashev 3
3 Conventions 11
6 The Implementation of OO 87
13 RDBMS 219
14 SQL 223
V Modelling 373
21 Regression Models 375
22 Classification Models 387
26 Labs 495
32 R Markdown 699
IX Appendices 819
A Create your own R package 821
Bibliography 859
Nomenclature 869
Index 881
Contents
Foreword xxv
Acknowledgements xxix
Preface xxxi
I Introduction 1
1 The Big Picture with Kondratiev and Kardashev 3
3 Conventions 11
6 The Implementation of OO 87
6.1 Base Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 S3 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.1 Creating S3 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.2 Creating Generic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.3 Method Dispatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.4 Group Generic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 S4 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1 Creating S4 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.2 Using S4 Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.3 Validation of Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.4 Constructor functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.5 The .Data slot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3.6 Recognising Objects, Generic Functions, and Methods . . . . . . . . . . . . 108
6.3.7 Creating S4 Generics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.8 Method Dispatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.4 The Reference Class, refclass, RC or R5 Model . . . . . . . . . . . . . . . . . . . . . 113
6.4.1 Creating RC Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4.2 Important Methods and Attributes . . . . . . . . . . . . . . . . . . . . . . . 117
6.5 Conclusions about the OO Implementation . . . . . . . . . . . . . . . . . . . . . . 119
7 Tidy R with the Tidyverse 121
7.1 The Philosophy of the Tidyverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2 Packages in the Tidyverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2.1 The Core Tidyverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2.2 The Non-core Tidyverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3 Working with the Tidyverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.3.1 Tibbles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.3.2 Piping with R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.3.3 Attention Points When Using the Pipe . . . . . . . . . . . . . . . . . . . . . 133
7.3.4 Advanced Piping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.4.1 The Dollar Pipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3.4.2 The T-Pipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3.4.3 The Assignment Pipe . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
13 RDBMS 219
14 SQL 223
14.1 Designing the Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
14.2 Building the Database Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
14.2.1 Installing a RDBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
V Modelling 373
21 Regression Models 375
21.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
21.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
21.2.1 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
21.2.2 Non-linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
21.3 Performance of Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
21.3.1 Mean Square Error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
21.3.2 R-Squared . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
21.3.3 Mean Average Deviation (MAD) . . . . . . . . . . . . . . . . . . . . . . . . 386
26 Labs 495
26.1 Financial Analysis with quantmod . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
26.1.1 The Basics of quantmod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
26.1.2 Types of Data Available in quantmod . . . . . . . . . . . . . . . . . . . . . . 496
26.1.3 Plotting with quantmod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
26.1.4 The quantmod Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 500
26.1.4.1 Sub-setting by Time and Date . . . . . . . . . . . . . . . . . . . . 500
26.1.4.2 Switching Time Scales . . . . . . . . . . . . . . . . . . . . . . . . 501
26.1.4.3 Apply by Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
26.1.5 Support Functions Supplied by quantmod . . . . . . . . . . . . . . . . . . . 502
26.1.6 Financial Modelling in quantmod . . . . . . . . . . . . . . . . . . . . . . . 504
26.1.6.1 Financial Models in quantmod . . . . . . . . . . . . . . . . . . . . 504
26.1.6.2 A Simple Model with quantmod . . . . . . . . . . . . . . . . . . . 504
26.1.6.3 Testing the Model Robustness . . . . . . . . . . . . . . . . . . . . 507
32 R Markdown 699
IX Appendices 819
A Create your own R Package 821
A.1 Creating the Package in the R Console . . . . . . . . . . . . . . . . . . . . . . . . . 823
A.2 Update the Package Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
A.3 Documenting the Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826
A.4 Loading the Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
A.5 Further Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 828
Bibliography 859
Nomenclature 869
Index 881
Foreword
This book brings together skills and knowledge that can help to boost your career. It is an excellent
tool for people working as database managers, data scientists, quants, modellers, statisticians, analysts,
and more, who are knowledgeable about certain topics, but want to widen their horizon and
understand what the others in this list do. A wider understanding means that we can do our job
better and eventually open doors to new or enhanced careers.
The student who graduated from a science, technology, engineering, mathematics, or similar
program will find that this book helps to make a successful step from the academic world into
any private or governmental company.
This book uses the popular (and free) software R as leitmotif to build up essential programming
proficiency, understand databases, collect data, wrangle data, build models, and select models
from a suite of possibilities such as linear regression, logistic regression, neural networks, decision
trees, multi-criteria decision models, etc., and ultimately evaluate a model and report on it.
We will go the extra mile by explaining some essentials of accounting in order to build up to the
pricing of assets such as bonds, equities and options. This helps to deepen the understanding of how
a company functions, is useful to be more result-oriented in a private company, helps for one’s own
investments, and provides a good example of the theories mentioned before. We also spend time
on the presentation of results and we use R to generate slides, text documents and even interactive
websites! Finally, we explore big data and provide handy tips on speeding up code.
I hope that this book helps you to learn faster than I did, and build a great and interesting career.
Enjoy reading!
Philippe De Brouwer
2020
About the Author
Dr. Philippe J.S. De Brouwer leads expert teams in the service centre of HSBC in Krakow, is Honorary
Consul for Belgium in Krakow, and is also guest professor at the University of Warsaw,
Jagiellonian University, and AGH University of Science and Technology. He teaches both at executive
MBA programs and mathematics faculties.
He studied theoretical physics, and later acquired his second Master degree while working.
Finishing this Master, he solved the “fallacy of large numbers puzzle” that was formulated by P.A.
Samuelson 38 years earlier and had remained unsolved since then. In his Ph.D., he successfully challenged
the assumptions of the Nobel Prize-winning “Modern Portfolio Theory” of H. Markowitz,
by creating “Maslowian Portfolio Theory.”
His career brought him into insurance, banking, investment management, and back to banking,
while his specialization shifted from IT, over data science, to people management.
For Fortis (now BNP), he created one of the first capital guaranteed funds and got promoted
to director in 2000. In 2002, he joined KBC, where he merged four companies into one and subsequently
became CEO of the merged entity in 2005. Under his direction, the company climbed
from number 11 to number 5 on the market, while the number of competitors increased by 50%.
In the aftermath of the 2008 crisis, he helped create a new asset manager for KBC in Ireland that
soon accommodated the management of ca. 1000 investment funds and had about €120 billion
under management. In 2012, he widened his scope by joining the risk management of the bank
and specialized in statistics and numerical methods. Later, Philippe worked for the Royal Bank
of Scotland (RBS) in London and specialized in Big Data, analytics and people management. In
2016, he joined HSBC and is passionate about building up a Centre of Excellence in risk management
in the service centre in Krakow. One of his teams, the independent model review team,
validates the most important models used in the banking group worldwide.
Married and father of two, he invests his private time in the future of education by volunteering
as a board member of the International School of Krakow. This way, he contributes modestly
to the cosmopolitan ambitions of Krakow. He gives back to society by assuming the responsibility
of Honorary Consul for Belgium in Krakow, where he mainly helps travellers in need.
In his free time, he teaches at the mathematics departments of AGH University of Science
and Technology and Jagiellonian University in Krakow, and at the executive MBA programs of the
Krakow Business School of the University of Economics in Krakow and the University of Warsaw.
He teaches subjects like finance, behavioural economics, decision making, Big Data, bank
management, structured finance, corporate banking, financial markets, financial instruments,
team-building, and leadership. What stands out is his data and analytics course: with this course
he manages to provide similar content with passion to both undergraduate mathematics students and
experienced professionals of an MBA program. This variety of professional and teaching experience
in both business and mathematics is what lays the foundations of this book: the passion to bridge
the gap between theory and practice.
Acknowledgements
Writing a book that is so eclectic and holds so much information would not have been possible
without tremendous support from so many people: mentors, family, colleagues, and ex-colleagues
at work or at universities. This book is in the first place a condensation of a few decades of interesting
work in asset management and banking, and mixes things that I have learned in C-level
jobs and more technical assignments.
I thank the colleagues of the faculties of applied mathematics at the AGH University of Science
and Technology, the faculty of mathematics of the Jagiellonian University of Krakow, and the
colleagues of HSBC for the many stimulating discussions and shared insights in mathematical
modelling and machine learning.
To the MBA program of the Cracovian Business School, the University of Warsaw, and to
the many leaders that marked my journey, I am indebted for the business insight, stakeholder
management and commercial wit that make this book complete.
A special thanks goes to Piotr Kowalczyk, FRM and Dr. Grzegorz Goryl, PRM, for reading
large chunks of this book and providing detailed suggestions. I am also grateful for the general
remarks and suggestions from Dr. Jerzy Dzieża, faculty of applied mathematics at the AGH
University of Science and Technology of Krakow and the fruitful discussions with Dr. Tadeusz
Czernik, from the University of Economics of Katowice and also Senior Manager at HSBC, Inde-
pendent Model Review, Krakow.
This book would not be what it is now without the many years of experience, the stimulating
discussions with so many friends, and in particular my wife, Joanna De Brouwer, who encouraged
me to move from London in order to work for HSBC in Krakow, Poland. Somehow, I feel that I
should thank the city council and all the people for the wonderful and dynamic environment that
attracts so many new service centres and that makes the ones that had already chosen Krakow
grow their successful investments. This dynamic environment has certainly been an important
stimulating factor in writing this book.
However, nothing would have been possible without the devotion and support of my family:
my wife Joanna, both children, Amelia and Maximilian, were wonderful and are a constant source
of inspiration and support.
Finally, I would like to thank the thousands of people who contribute to free and open source
software, people that spend thousands of hours to create and improve software that others can use
for free. I profoundly believe that these selfless acts make this world a better and more inclusive
place, because they make computers, software, and studying more accessible for the less fortunate.
A special honorary mention should go to the people that have built Linux, LaTeX, R, and
the ecosystems around each of them, as well as the companies that contribute to those projects,
such as Microsoft, which has embraced R, and RStudio, which enhances R and never fails to share the
fruits of their efforts with the larger community.
Preface
The author has written this book based on his experience that spans roughly three decades in
insurance, banking, and asset management. During his career, the author worked in IT, structured
and managed highly technical investment portfolios (at some point oversaw €24 billion
in a thousand investment funds), fulfilled many C-level roles (e.g. was CEO of KBC TFI SA [an
asset manager in Poland], was CIO and COO for Eperon SA [a fund manager in Ireland], sat
on boards of investment funds, and was involved in big-data projects in London), and did
quantitative analysis in risk departments of banks. This gave the author a unique and in-depth
view of many areas ranging from analytics, big data, databases, business requirements, financial
modelling, etc.
In this book, the author presents a structured overview of his knowledge and experience for
anyone who works with data and invites the reader to understand the bigger picture, and discover
new aspects. This book also demystifies the hype around machine learning and AI, by helping the
reader to understand the models and program them in R without spending too much time on the
theory.
This book aims to be a starting point for quants, data scientists, modellers, etc. It aims to
be the book that bridges different disciplines so that a specialist in one domain can grab this
book, understand how his/her discipline fits in the bigger picture, and get enough material to
understand the person who is specialized in a related discipline. Therefore, it could be the ideal
book that helps you to make a career move to another discipline so that in a few years you are that
person who understands the whole data-chain. In short, the author wants to give you a short-cut
to the knowledge that he spent 30 years accumulating.
Another important point is that this book is written by and for practitioners: people that work
with data, programming and mathematics for a living in a corporate environment. So, this book
would be most interesting for anyone interested in data-science, machine learning, statistical
learning and mathematical modelling, and for whoever wants to convey technical matters in a clear
and concise way to non-specialists.
This also means that this book is not necessarily the best book in any of the disciplines that it
spans. In every specialisation there are already good contenders.
• More formal introductions to statistics are for example in: Cyganowski, Kloeden, and
Ombach (2001) and Andersen et al. (1987). There are also many books about specific
stochastic processes and their applications in financial markets: see e.g. Wolfgang and
Baschnagel (1999), Malliaris and Brock (1982), and Mikosch (1998). While knowledge of
stochastic processes and their importance in asset pricing are important, this covers only
a very narrow spot of applications and theory. This book is more general, more gentle on
theoretical foundations, and focusses more on the use of data to answer real-life problems
in an everyday business environment.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:14pm Page xxxii
❦
xxxii Preface
• This is not simply a book about programming and/or any related techniques. If you just
want to learn programming in R, then Grolemund (2014) will get you started faster. Our
Part II will also get you started in programming, though it assumes a certain familiarity
with programming and mainly zooms in on aspects that will be important in the rest of the
book.
• This book is not a comprehensive book about financial modelling. Other books do a better
job of listing all types of possible models. No book does a better job here than Bernard Marr’s
publication, Marr (2016): “Key Business Analytics, the 60+ business analysis tools every
manager needs to know.” That book lists all the words that some managers might use
and what they mean, without any of the mathematics or any of the programming behind them. I
warmly recommend keeping that book next to ours. Whenever someone comes up with a
term like “customer churn analytics,” for example, you can use Bernard’s book to find out
what it actually means and then turn to ours to “get your hands dirty” and actually do it.
• If you are only interested in statistical learning and modelling, you will find the following
books more focused: Hastie, Tibshirani, and Friedman (2009) or also James, Witten, Hastie,
and Tibshirani (2013), who also use R.
• Data science is more elaborately treated in Baesens (2014) and the recent book by Wickham
and Grolemund (2016), which provides an excellent introduction to R and data science in
general. This last book is a great add-on to this book as it focusses more on the data-aspects
(but less on the statistical learning part). We also focus more on the practical aspects and
real data problems in a corporate environment.
A book that comes close to ours in purpose is the book that my friend professor Bart Baesens
has compiled, “Analytics in a Big Data World, the Essential Guide to Data Science and its Applications”:
Baesens (2014). If the mathematics, programming, and R itself scare you in this book,
then Bart’s book is for you. Bart’s book covers different methods, but above all, for the reader it is
sufficient to be able to use a spreadsheet to do some basic calculations. Therefore, it will not help
you to tackle big data nor to program a neural network yourself, but you will understand very
well what it means and how things work.
Another book that might work well if the maths in this one are prohibitive to you is Provost
and Fawcett (2013). It will give you some insight in what statistical learning is and how it
works, but will not prepare you to use it on real data.
Summarizing, I suggest you buy, next to this book, also Marr (2016) and Baesens (2014).
This will provide you with a complete chain from business and buzzwords (Bernard’s book), over
understanding what modelling is and what practical issues one will encounter (Bart’s book), to
implementing this in a corporate setting and solving the practical problems of a data scientist and
modeller on sizeable data (this book).
In a nutshell, this book does it all: it is gentle on theoretical foundations and aims to be a one-stop
shop to show the big picture, learn all those things, and actually apply them. It aims to serve as
a basis for later picking up more advanced books in certain narrow areas. This book will take
you on a journey of working with data in a real company, and hence it will also discuss practical
problems such as people filling in forms or extracting data from a SQL database.
It should be readable for any person that finished (or is finishing) university-level education in
a quantitative field such as physics, civil engineering, mathematics, econometrics, etc. It should
also be readable by the senior manager with a technical background, who tries to understand
what his army of quants, data scientists, and developers are up to, while having fun learning
R. After reading this book you will be able to talk to all of them, challenge their work, and make most
analyses yourself, or be part of a bigger entity and specialize in one of the steps of modelling or
data-manipulation.
In some way, this book can also be seen as a celebration of FOSS (Free and Open Source Software).
We proudly mention that for this book no commercial software was used at all. The operating
system is Linux, the window manager Fluxbox (sometimes LXDE or KDE), Kile and vi helped
the editing process, Okular displayed the PDF-file, even the database servers and Hadoop/Spark
are FOSS . . . and of course R and LaTeX provided the icing on the cake. FOSS makes this world a
more inclusive place as it makes technology more attainable in poorer places of this world.
Hence, we extend a warm thanks to all people that spend so much time contributing to free
software.
Companion Website: www.wiley.com/go/De Brouwer/The Big R-Book
PART I
Introduction
1 The Big Picture with Kondratiev and Kardashev
You have certainly heard the words: “data is the new oil,” and you probably wondered “are we
indeed on the verge of a new era of innovation and wealth creation or . . . is this just hype and will
it blow over soon enough?”
Since our ancestors left the trees about 6 million years ago, we roamed the African steppes and
we evolved a more upright position and limbs better suited for walking than climbing. However,
for about 4 million years physiological changes did not include a larger brain. It is only in the last
million years that we gradually evolved a more potent frontal lobe capable of abstract and logical
thinking.
The first good evidence of abstract thinking is the Makapansgat pebble, a jasperite cobble –
roughly 260 g and 5 by 8 cm – that by geological wear and tear shows a few holes and lines that
vaguely resemble (to us) a human face. About 2.5 million years ago, one of our australopithecine
ancestors not only realized this resemblance but also deemed it interesting enough to pick up the
pebble, keep it, and finally leave it in a cave miles from the river where it was found.
This development of abstract thinking that goes beyond vague resemblance was a major milestone.
As history unfolded, it became clear that this was only the first of many steps that would
lead us to the era of data and knowledge that we live in today. Many more steps towards more
complex and abstract thinking, gene mutations, and innovation would be needed.
Soon we developed language. With language, we were able to transform learning from an individual
level to a collective level. Now, experiences could be passed on to the next generation or
peers much more efficiently; it became possible to prepare someone for something that he or she
did not yet encounter, and to accumulate more knowledge with every generation.
More than ever before, this abstract thinking and accumulation of collective experiences led
to a “knowledge advantage,” and smartness became an attractive trait in a mate. This allowed our
brain to develop further, and great innovations such as the wheel, writing, bronze, agriculture,
iron, and the specialisation of labour soon started to transform not only our societal coherence but also
the world around us.
Without those innovations, we would not be where we are now. While it is debatable whether to
classify these inventions as the fruit of scientific work, it is equally hard to deny that some kind
of scientific approach was necessary. For example, by realizing the patterns in the movements of the
sun, we could predict the seasons and the weather changes to come, and this allowed us to put the grains
in the ground at the right moment. This was based on observations and experience.
Science and progress flourished, but the fall of the Western Roman Empire made Europe
sink into the dark medieval period, where thinking was dominated by religious fear and superstition.
Scientific progress came to a grinding halt, and with it improvements in medical care,
food production and technology.
The Arab world continued the legacy of Aristotle (384–322 BCE, Greece) and Alhazen (Ibn
al-Haytham, 965–1039, Iraq), who is considered by many as the father of the modern scientific
method.1 It was this modern scientific method that became a catalyst for scientific and technological
development.
A class of people that accumulated wealth through smart choices emerged. This was made
possible by private enterprise and an efficient way of sharing risks and investments. In 1602, the
East Indies Company became the first common stock company and in 1601 the Amsterdam Stock
Exchange created a platform where innovative, exploratory and trade ideas could find the necessary
capital to flourish.
In 1775, James Watt’s improvement of the steam engine made it possible to leverage the progress
made around the joint stock company and the stock exchange. This combination powered the
rise of a new societal organization, capitalism, and fueled the first industrial wave based on
automation (mainly in the textile industry).
While this first industrial wave brought much misery and social injustice, as a species we were
preparing for the next stage. It created wealth on a scale never seen before. From
England, the industrialization spread fast over Europe and the young state in North America. It
all ended in “the Panic of 1873,” which brought the “Long Depression” to Europe and the United
States of America. This depression was so deep that it would indirectly give rise to the invention
of a new economic order: communism.
The same steam engine, however, had another trick up its sleeve: after industrialisation it
was able to provide mass transport by railway. This fuelled a new wave of wealth creation
that lasted till the 1900s, when it ended in the “Panic of 1901” and
the “Panic of 1907” – the first stock market crashes to start in the United States of America. The
internal combustion engine, electricity and magnetism became the cornerstones of a new wave
of exponential growth based on innovation. The “Wall Street Crash of 1929” ended this wave and
started the “Great Depression.”
It was about 1935 when Kondratiev noticed these long-term waves of exponential growth
and devastating market crashes and published his findings in Kondratieff and Stolper (1935) –
republished in Kondratieff (1979). The work became prophetic as the automobile industry and
chemistry fuelled a new wave of development that gave us individual mobility and lasted till the
1973–1974 stock market crash.
The scenario repeated itself like clockwork when it was the turn of the electronic computer and
information technology (IT) to fuel exponential growth till the crashes of 2002–2008.
Now, momentum is gathering pace with a few strong contenders to pull a new wave of economic
development and wealth creation, a new phase of exponential growth. These contenders
include, in our opinion:
• quantum computing (if we manage to get it to work, that is),
• machine learning (statistical learning) and data.
1 The modern scientific method is based on scepticism and scrutiny from peers and on reproducibility of results.
The idea is to formulate a hypothesis, based on logical induction from observations, then allow peers to review and
publish the results, so that others can falsify or confirm them.
This book is about the last group: machine learning (statistical learning) and data, and if you
are reading it then you are certainly playing a significant role in designing the newest wave of
exponential growth. If regulators listen to scientists2 and stop using incoherent risk measures
in legislation (such as the Basel agreement and UCITS IV), then it might even be the first phase
of exponential growth that does not end in a dramatic crash of financial markets.
Working with data is helping a new Kondratiev wave of wealth creation to take off. The Kondratiev
waves of development seem to bring us step by step closer to a Type I civilisation, as
described by Nikolai Kardashev (see Kardashev, 1964). Kardashev describes a compelling
model of how intelligent societies could develop. He recognizes the following stages of development:
1. Type 0 society is the tribal organization form that we know today and has lasted since the
dawn of the Homo genus.
2. A Type I society unites all people of a planet in one community and wields technology to
influence weather and harvest energy of that planet.
3. A Type II society has colonized more than one planet of one solar system, is able to capture
the energy of a star, and in some sense rules its solar system.
4. A Type III society has a galaxy in its control, is able to harvest energy from multiple stars,
maybe even feeds stars into black holes to generate energy, etc.
5. There is no Type IV society described by Kardashev, though logically one can expect
that the next step would be to spread over the local cluster of galaxies.3 However, we would
argue that, more probably, we will by then have found that our own species’ survival is a
legacy problem and that, in order to satisfy the deepest self-actualization, there is no need
for enormous amounts of energy – on which Kardashev seems to focus. Therefore, we argue
that the fourth stage would be very different.
While both Kondratiev’s and Kardashev’s theories are stylised abstractions of a much more complex
underlying reality, they paint a picture that is recognizable and allows us to realize how important
scientific progress is. Indeed, if we do not make sure that we have alternatives to this planet, then
our species is doomed to extinction rather soon. A Yellowstone explosion, an asteroid the size of
a small country, a rogue planet, a travelling black hole, a nearby supernova explosion and so
many more things can make our earth disappear or at least render it unsuitable to sustain life as
we have known it for many thousands of years or – in the very best case – trigger a severe mass
extinction.4 Only science will be able to help us to transcend the limitations of the thin crust of
this earth.
But what is that “science”?
2 See the elaboration on coherent risk measures, for example, in Artzner et al. (1997), Artzner et al. (1999), and De
Brouwer (2012).
3 Dominating the universe has multiple problems, as there is a large part of the universe that will be forever
invisible, even at the speed of light, due to Dark Energy and the expansion of the universe.
4 With this statement we do not want to take a stance in the debate whether the emergence of Homo
sapiens coincided with the start of an extinction event – one that is still ongoing – nor do we want to ignore that for the
last 300 000 years species have been disappearing at an alarming rate, nor do we want to take a stance in the debate whether
Homo sapiens is designing its own extinction. We only want to point out that dramatic events might occur on
our lovely planet and make it unsuitable to sustain intelligent life.
2 The Scientific Method and Data
The world around us is constantly changing, and making the wrong decisions can be disastrous for
any company or person. At the same time, it is more important than ever to innovate. Innovating
provides a diversity of ideas that – just as in biological evolution – might hold the mutation best
suited for the changing environment.
There are many ways to come to a view on what to do next. Some of the more popular
methods include instinct and prejudice, juiced up with psychological biases both in perception
and decision making. Other popular methods include decision by authority (“let the boss
decide”), deciding by decibels (“the loudest employee is heard”) and dogmatism (“we did
this in the past” or “we have a procedure that says this”). While these methods of creating
an opinion and deciding might coincidentally work out, in general they are sub-optimal by
design. Indeed, the best solution might not even be considered, or it might be pre-emptively ruled out
based on flawed arguments.
Looking at scientific development throughout time as well as human history, one is compelled
to conclude that the only workable construct so far is what is known as the scientific method.
No other methods have brought the world so many innovations and progress, no other methods
have stood up in the face of scrutiny.
Aristotle (384–322 BCE, Greece) can be seen as the father of the scientific method, because of
his rigorous logical method which was much more than natural logic. But it is fair to credit Ibn
al-Haytham (aka Alhazen — 965–1039, Iraq) for preparing the scientific method for collaborative
use. His emphasis on collecting empirical data and reproducibility of results laid the foundation
for a scientific method that is much more successful. This method allows people to check each
other and confirm or reject previous results.
However, both the scientific method and the word “scientist” only came into common use
in the nineteenth century, and the scientific method only became the standard method in the
twentieth century. Therefore, it should not come as a surprise that this also became a period of
inventions and development as never seen before.
Indeed, while previous inventions such as fire, agriculture, the wheel, bronze and steel
might not have followed the scientific method explicitly, they created a society ready to embrace
the scientific method and fuel an era of accelerated innovation and expansion. The internal
combustion engine, electricity and magnetism fuelled economic growth as never seen before.
Figure 2.1: A view on the steps in the scientific method for the data scientist and mathematical
modeller, aka “quant”: formulate a question (hypothesis), wrangle the data, make a model, and draw
conclusions. In a commercial company, the communication and convincing or putting the
model in production bear a lot of importance.
The electronic computer brought us to the twenty-first century, and now a new era of growth
is being prepared by big data, machine learning, nanotechnology and – maybe – quantum
computing.
Indeed, with huge power comes huge responsibility. Once an invention is made, it is impossible
to “un-invent” it. Once the atomic bomb exists, it cannot be forgotten; it is forever part of
our knowledge. What we can do is promote peaceful applications of quantum technology, such
as sensors to open doors, diodes, computers and quantum computers.
For example, as information and data technology advances, the singularity1 draws closer. It is our responsibility
to foresee potential dangers and do all that is in our power to avoid that these dangers
1 The term “singularity” refers to the point in time where an intelligent system would be able to produce an even
more intelligent system that can itself create another system that is a certain percentage smarter in a time that is
a certain percentage shorter. This inevitably leads to the exponentially faster creation of better systems. This time
series converges to one point in time, where the “intelligence” of the machine would hit its absolute limits. The first
record of the subject is by Stanislaw Ulam in a discussion with John von Neumann in the 1950s, and an early and convincing
publication is Good (1966). It is also elaborately explored in Kurzweil (2010).
become an extinction event. Many inventions had a dark side and have led to more efficient ways
of killing people, degenerating the ozone layer or polluting our ecosystem. Humanity has had
many difficult times and very dark days; however, never before did humanity become extinct. That
would be the greatest disaster, for there would be no recovery possible.
So the scientific method is important. This method has brought us into the information age
and we are only scratching the surface of possibilities. It is only logical that all corporates try to
stay abreast of changes and put a strong emphasis on innovation. This leads to an ever-increasing
focus on data, algorithms, and mathematical models such as machine learning.
Data, statistics and the scientific method are powerful tools. The company that has the best
data and uses its data best is the company that will be the most adaptable to the changes and
hence the one to survive. This is not biological evolution, but guided evolution. We do not have
to rely on a huge number of companies with random variations, but we can use data to see trends
and react to them.
The role of the data-analyst in any company cannot be overestimated. It is on the shoulders of the
reader of this book not only to read those patterns from the data but also to convince
decision makers to act on this fact-based insight.
Because the role of data and analytics is so important, it is essential to follow scientific rigour.
This means in the first place following the scientific method for data analysis. An interpretation
of the scientific method for data-science is in Figure 2.2 on page 10.
Till now we discussed the role of the data scientists and actions that they would take. But how
does it look from the point of view of data itself?
Using that scientific method for data-science, the most important thing is probably to make
sure that one understands the data very well. Data in itself is meaningless. For example, 930 is
just a number. It could be anything: from the age of Adam in Genesis, to the price of a chair, or the
code to unlock your bike-chain. It could be a time, and 930 could mean “9:30” (assume “am” if your
time-zone habits require so). Knowing that interpretation, the numbers become information, but
we cannot understand this information till we know what it means (it could be the time I woke up
– after a long party, the time of a plane to catch, a meeting at work, etc.). We can only understand
the data if we know that it is a bus schedule of the bus “843-my-route-to-work,” for example. This
understanding, together with the insight that this bus always runs 15 minutes late and my will
to catch the bus, can lead to action: to go out and wait for that bus and get on it.
This simple example shows us how the data cycle in any company or within any discipline
should work. We first have a question, such as for example “to which customers can we lend
money without pushing them into a debt-spiral?” Then one will collect data (from our own systems or a
credit bureau). This data can then be used to create a model that allows us to reduce the complexity
of all observations to the explaining variables only: a projection into a space of lower dimensions.
That model helps us to get the insight from the data and, once put in production, allows us to
decide on the right action for each credit application.
This institution will end up with a better credit approval process, where fewer loss events occur.
That is the role of data-science: to drive companies to the creation of more sustainable wealth in
a future where all have a place and plentifulness.
This cycle – visualized in Figure 2.2 on page 10 – brings into evidence the importance of
data-science. Data science is a way to bring the scientific method into a private company, so that
decisions do not have to be based on gut feeling alone. It is the role of the data scientist to take data,
transform that data into information, and create understanding from that data that can lead to actionable
insight. It is then up to the management of the business to decide on the actions and follow
them through. The importance of being connected to reality via contact with the business cannot
be overstated. In each and every step, mathematics will serve as tools, such as screwdrivers
Figure 2.2: The role of data-science in a company is to take data and turn it into actionable insight.
At every step – apart from technical issues that will be discussed in this book – it is of utmost importance
to understand the context and limitations of data, business, regulations and customers. For
effectiveness in every step, paying attention to communication and permanent contact with
all stakeholders and the environment is key.
and hammers. However, the choice about which one to use depends on a good understanding
of what we are working with and what we are trying to achieve.
3 Conventions
This book is formatted with LaTeX. The people who know this markup language will have high
expectations for the consistency and format of this book.
This is a book with a programming language as leitmotif and hence you might expect to find
a lot of chunks of code. R is an interpreted language and it is usually accessed by opening the
software R (simply type R on the command prompt and press enter).1
# This is code
1+pi
## [1] 4.141593
Sys.getenv(c("EDITOR","USER","SHELL", "LC_NUMERIC"))
## EDITOR USER SHELL LC_NUMERIC
## "vi" "root" "/bin/bash" "pl_PL.UTF-8"
As you can see, the code is highlighted: not all things have the same colour, and it is easier to read
and understand what is going on. The first line is a “comment,” which means
that R will not do anything with it; it is for human use only. The next line is a simple sum. In your
R terminal, this is what you will type or copy after the > prompt. It will rather look like this:
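A minimal sketch of that console view – using the sum from above – would be:
> 1 + pi
[1] 4.141593
Here the prompt > precedes what you type, and the result is printed without the ## signs used in this book.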
1 You will, of course, first have to install the base software R. More about this in Chapter 4 “The Basics of R” on
page 21.
In this book, there is nothing in front of a command, and the reply of R is preceded by two
pound signs: “##.”2 The pound sign ( # ) is also the symbol used by R to precede a comment;
hence R will ignore such a line if fed into the command prompt. This allows you to copy and paste
lines or whole chunks if you are working from an electronic version of the book. If the > sign
would precede the command, then R would not understand it, and if you accidentally copy the
output from the book, nothing will happen because the #-sign indicates to R to ignore the
rest of the line (this is a comment for humans, not for the machine).
The function Sys.getenv() returns all environment variables if no parameter is given. If
it is supplied with a list of parameters, then it will only return those.
In the example above, the function got three variables supplied, and hence it only reports on these
three. You will also notice that the variables are wrapped in a special function c(...) . This is
because the function Sys.getenv() expects one vector as argument and the function c() will
create the vector out of a list of supplied arguments.
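As a small, self-contained illustration (not one of the book’s own examples), the function c() alone combines its arguments into one character vector:
c("EDITOR", "USER", "SHELL")
## [1] "EDITOR" "USER"   "SHELL"
It is this single vector that is then handed to Sys.getenv().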
Note that in the paragraph above, the name of the function Sys.getenv() is mono-spaced. That
is our convention for code within text. Even in the index, at the end of this book, we follow
that convention.
You will also have noticed that in the text – such as in this line – we refer to code fragments and
functions using a fixed-width font, such as for example “the function mean() calculates the average.”
When this part of the code needs emphasizing or is used as a word in the sentence, we might
want to highlight it additionally as follows: mean(1 + pi) .
Some other conventions also follow from this small piece of code. We will assume that you are
using Linux (unless mentioned otherwise). But do not worry: that is not something that will stand
in your way. In Chapter 4 “The Basics of R” on page 21, we will get you started in Windows, and
all other things will be pretty much the same. Also, while most books are United States centric,
we want to be as inclusive as possible and not assume that you live in the United States working
with United States data.
As a rule, we take a country-agnostic stance and follow the ISO standards3 for dates and the
dimensions of other variables. For example, we will use meters instead of feet.
Learning works best when you can do something with the knowledge that you are acquiring.
Therefore, we will usually even show the code of a plot that is mainly there for illustrative
purposes, so you can immediately try everything yourself.
When the code produces a plot (chart or graph), the plot will generally appear at that
point between the code lines. For example, consider that we want to show the generator function for
the normal distribution.
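A minimal sketch of such a code chunk – the exact sample size and plotting options behind Figure 3.1 are an assumption here – could be:
# generate pseudo-random numbers from the standard normal distribution
x <- rnorm(1000)

# plot the histogram of this sample
hist(x)
Running these two lines yourself will produce a histogram similar to Figure 3.1 (the exact shape will differ from run to run, since the numbers are random).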
2 The number sign, #, is also known as the “hash sign” or “pound sign.” It probably evolved from the “libra
ponda” (a pound weight). It is currently used in many different fields: as part of phone numbers, in programming
languages (e.g. in a URL it indicates a sub-address, in R it precedes a comment, etc.), as the command prompt for the
root user in Unix and Linux, in set theory (#S is the cardinality of the set S), in topology (A#B is the connected
sum of manifolds A and B), in number theory (#n is the primorial of n), as a keyword in some social media, etc. The
pronunciation hence varies widely: “hash” when used to tag keywords (#book would be the hash sign and the tag
book; hence, reading the “#”-sign as “hashtag” is at least superfluous). Most often, it is pronounced as “pound.”
Note that the musical notation is another symbol, ♯, which is pronounced “sharp” as in the music (e.g. C♯).
3 ISO standards refer to the standards published by the International Organization for Standardization (ISO). This
is an international standard-defining body; founded on 23 February 1947, it promotes worldwide proprietary, industrial
and commercial standards. At this point, 164 countries are members and decisions are made by representatives
of those countries. ISO is also recognised by the United Nations.
Figure 3.1: An example showing the histogram of data generated from the normal distribution.
In most cases, the plot will be just after the code that generates it – even if the code continues
after the plot(...) command. Therefore, the plot will usually sit exactly where the code creates
it. However, in some rare cases, this will not be possible (it would create a page layout that would
not be aesthetically appealing). The plot will then appear near the code chunk (maybe on the
next page). To help you find and identify the plot in such cases, we will usually add a numbered
caption to the plot.
The R code is so ubiquitous and integrated in the text that it will appear just where it should
be (though charts might move). The code chunks are an integral part of the text, and the comments that appear
in them might not be repeated in the normal text later.
There is also some other code from the command prompt and/or from SQL environments.
That code appears much less frequently, so it is numbered and appears as in Listings 3.1 and 3.2.
$ R
>
Listing 3.1: This is what you would see if you start R in the command line terminal. Note that the
last sign is the R-prompt, inviting you to type commands. This code fragment is typical for how code
that is not in the R-language has been typeset in this book.
$ factor 1492
1492: 2 2 373
$ calc 2*2*373
1492
$ pi 60
3.14159265358979323846264338327950288419716939937510582097494
Listing 3.2: Another example of command line instructions: factor, calc, and pi. This example only
has CLI code and does not start R.
Note that in these environments, we do not “comment out” the output. We promise to avoid
mixing input and output, but in some cases, the output will just be there. So, in general, it is only
possible to copy the commands line by line to see the output on the screen. Copying the whole
block and pasting it in the command prompt leads to error messages, rather than the code being
executed. This is unlike the R code, which can be copied as a whole, pasted in the R-command
prompt, and it should all work fine.
Questions or tasks look as follows:
Question #1 Histogram
Consider Figure 3.1 on page 13. Now, imagine that you did not generate the data, but
someone gave it to you, so that you do not know how it was generated. Then what could
this data represent? Or, rephrased, what could x be? Does it look familiar?
Questions or tasks can be answered with the tools and methods explained previously. Note that
they might require you to do some research on your own, such as looking into the help files or other
documentation (we will of course explain how to access these). If you are using this book to prepare
for an exam or test, working through them is probably a good preparation, but if you are in a hurry it is possible to skip
them (in this book they do not add to the material explained). However, in general, thinking about
the issues presented will help you to solve your data-problems more efficiently.
Note that the answers to most questions can be found in Appendix E “Answers to Selected Questions”
on page 1061. The answer might not always be detailed, but it should be enough to solve the issue
raised.
Definition: This is a definition
This is not a book about exact mathematics. This is a pragmatic book with a focus on
practical applications. Therefore, we use the word “definition” also in a practical sense.
Definitions are not always rigorous definitions such as a mathematician would be used to. We
rather use practical definitions (e.g. how a function is implemented).
The use of a function is – mainly at the beginning of the book – highlighted as follows. For
example:
Function use for mean()
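A minimal sketch of the box’s first line – assuming only the parameters discussed below plus the dots – is:
mean(x, na.rm = FALSE, ...)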
• x is an R-object,
• na.rm is a boolean (setting this to TRUE will remove missing values),
From this example, it should be clear how the function mean() is used. Note the following:
• On the first line we repeat the function with its most important parameters.
• The parameter na.rm can be omitted. When it is not provided, the default FALSE is used.
A parameter with a default can be recognised in the first line via the equal sign.
• The three dots indicate that other parameters can be passed that will not be used by the
function mean(), but they are passed on to further methods.
• Some functions can take a lot of parameters. In some cases, we only show the most
important ones.
Later on in the book, we assume that the reader is able to use the examples and find more
about the function in its documentation. For example, ?mean will display information about the
function mean.
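As a brief aside – these are standard R commands, not something specific to this book – the following are equivalent ways to reach the documentation:
?mean         # open the help page of the function mean
help("mean")  # the same, written as an explicit function call
??average     # search all installed help pages for the word "average"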
When a new concept or idea is built up with examples, they generally appear just like that in
the text. Additional examples after the point is made are usually highlighted as follows:
Some example environments are split in two parts: the question and the solution as follows:
Example: Mean
What is the mean of all integer numbers from one to 100? Use the function mean().
mean(1:100)
## [1] 50.5
There are a few more special features in the layout that might be of interest.
A hint is something that adds additional practical information that is not part of the normal
flow of the text.
Hint – Using the hint boxes
When first studying a section, skip the hints, and when reading it a second time pay more
attention to the hints.
When we want to draw attention to something that might or might not be clear from the
normal flow of the text, we put it in a “notice environment.” This looks as follows:
Note that hints, notes, and warnings all look similar, but for your convenience, we have
differentiated them with colours and layout details.
There are more such environments and we let them speak for themselves.
Skip the digressions when you read the text first, and come back to them later.
When reading the book, always read the comments in the code.
In general, a warning is important to read once you will start working on your own.
Note – Shadow
Note that the boxes with a shadow are “lifted off the page” and are a little independent
from the flow of the main text. Those without a shadow are part of the main flow of the
text (definitions, examples, etc.)
PART II
♣4♣
The Basics of R
In this book we will approach data and analytics from a practitioner's point of view, and our tool
of choice is R. R is in some sense a re-implementation of S – a programming language written in
1976 by John Chambers at Bell Labs – with added lexical scoping semantics. Usually, code written
in S will also run in R.
R is a modern language with a rather short history. In 1992, the R-project was started by Ross
Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The first version was
available in 1995 and the first stable version was available in 2000.
Now, the R Development Core Team (of which Chambers is a member) develops R further and
maintains the code. For a few years now, Microsoft has embraced the project and provides MRAN
(Microsoft R Application Network). This distribution is also free and open source software (FOSS)
and has some advantages over standard R, such as enhanced performance (e.g. multi-thread
support, and the checkpoint package that makes results more reproducible).
Essentially, R offers the following:
• integration with procedures written in the C, C++, .Net, Python, or FORTRAN
languages for efficiency;
• zero purchase cost (available under the GNU General Public License), and pre-compiled
binary versions are provided for various operating systems such as Linux, Windows, and Mac;
• simplicity and effectiveness;
• graphical facilities for data analysis and display either directly at the computer or printing;
• the ability for you to stand on the shoulders of giants (e.g. by using libraries).
R is arguably the most widely used statistics programming language: it is used from universities
to business applications, and it still rapidly gains in popularity.
If at any point you are trying to solve a particular issue and you are stuck, the online
community will be very helpful. To get unstuck, do the following:
• First, look up your problem by adding the keyword “R” to the search string. Most
probably, someone else encountered the very same problem before you, and the
answer is already posted. Avoid posting a question that has been answered before.
• If you need to ask your question in a forum such as, for example,
www.stackexchange.com, then you will need to add a minimal reproducible example.
The package reprex can help you to do just that.
Before we can start, we need a working installation of R on our computer. On Linux, this can be
done via the command line. On Debian and its many derivatives, such as Ubuntu or Mint, this
looks as follows:1
sudo apt-get install r-base
R can also be tried online, without installing anything, for example at:
• https://www.tutorialspoint.com/execute_r_online.php
• http://www.r-fiddle.org
RStudio
For the user who is not familiar with the command line, it is highly recommended to use
an IDE such as RStudio (see https://www.rstudio.com). Later on – for example in Chapter
32 “R Markdown” on page 879 – we will see that RStudio has some unique advantages over the
R console in store that will convince even the most traditional command-line users.
Whether you use standard R or MRAN, using RStudio will enhance your performance and
help you to be more productive.
IDE
RStudio is an integrated development environment (IDE) for R and provides a console, an editor
with syntax highlighting, a window to show plots, and some workspace management.
RStudio is also in the repositories of most Linux distributions. That means that there is no
need to go to the website of RStudio: it can be downloaded and installed with just one
line in the CLI of your OS. For example, in Debian and its derivatives, this would be:
# Note that the first 2 lines are just to make sure that
# you are using the latest version of the repository.
sudo apt-get update
sudo apt-get upgrade
# Install RStudio:
sudo apt-get install rstudio
1 There are different package management systems for different flavours of Linux, and discussing them all is not
only beyond the scope of this book, but not really necessary. We assume that if you use Gentoo, you will know
what to do or what to choose.
This process is very simple: provide your admin password and the script will take care of
everything. Then use your preferred way to start the new software. R can be started by typing R
at the command prompt, and RStudio will launch with the command rstudio .
An important note is that RStudio will work with the latest version of R that is installed. So,
when you install MRAN, RStudio will automatically pick up this new version.
Basic arithmetic
The basic operators work as one would expect. Simply type 2 + 3 in the R terminal followed
by ENTER and R will immediately display the result.
#addition
2 + 3
#product
2 * 3
#power
2**3
2^3
#logic
2 < 3
x <- c(1,3,4,3)
x.mean <- mean(x)
x.mean
y <- c(2,3,5,1)
x+y
x <- scan()
will start an interface that invites you to type all values of the vector one by one.
In order to get back to the command prompt, press ENTER without typing a number (i.e. leave
one entry empty to end).
This is only one of the many ways to get data into R. Most probably you will use a mix of
defining variables in the code and reading in data from files. See for example Chapter 15
“Connecting R to an SQL Database” on page 327.
To modify an existing variable, one can use the edit() function:
edit(x)
The edit() function will open the editor that is defined in the options. While in RStudio this
is a specially designed pop-up window with buttons to save and cancel, in the command
line interface (CLI) this might be vi . The heyday of this fantastic editor is over and
you might never have seen it before. It is not really possible to use vi without reading
the manual (e.g. via the man vi command on the OS CLI or an online tutorial). To get
out of vi, type: [ESC]:q![ENTER] . Note that we show the name of a key within square
brackets and that all the other strings are just one keystroke each.
Batch mode
R is an interpreted language, and while the usual interaction is typing commands and seeing
the reply appear on the screen, it is also possible to use R in batch mode.
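For instance, a script can be run non-interactively from the operating system's command line; the script name below is hypothetical:
# Run a script in batch mode (output goes to myscript.Rout):
R CMD BATCH myscript.R
# Or, alternatively:
Rscript myscript.R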
In this section we will present a practical introduction to R; it is not a formal introduction.
If you would like to learn more about the foundations, then we recommend the documentation
provided by the R Core Team here: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf.
4.2 Variables
As in any computer language, R allows you to use variables to store information. The variable is
referred to by its name. Valid names in R consist of numbers and characters; even most special
characters can be used.
In R, variable names
• can contain letters as well as "_" (underscore) and "." (dot), and
• must start with a letter (which can be preceded by a dot).
For example, my_var.1 and my.Cvar are valid variable names, but _myVar, my%var and 1.var are not
acceptable.
Assignment
Assignment can be made left or right:
x.1 <- 5
x.1 + 3 -> .x
print(.x)
## [1] 8
R programmers will use the arrow sign <- most often; however, R also allows left assignment with
the = sign.
x.3 = 3.14
x.3
## [1] 3.14
There are also occasions where we must use the = operator, for example, when assigning values
to named arguments of functions.
v1 <- c(1,2,3,NA)
mean(v1, na.rm = TRUE)
## [1] 2
There are more nuances, and therefore we will come back to assignment in Chapter 4.4.4
“Assignment Operators” on page 78. These nuances are better understood with some more back-
ground, and for now it is enough to be able to assign values to variables.
Variable Management
With what we have seen so far, it is already possible to make simple calculations and to define and
modify variables. There is still a lot to follow, and it is important to have some basic tools to keep
things tidy. One such tool is the possibility to see the defined variables and, if needed, remove
unused ones.
# Remove a variable
rm(x.1)          # removes the variable x.1
ls()             # x.1 is not there any more
rm(list = ls())  # removes all variables
ls()
Note – What are invisible variables
A variable whose name starts with a dot (e.g. .x ) is in all respects the same as a variable
that starts with a letter. The only difference is that the first will be hidden with the standard
arguments of the function ls().
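A minimal illustration (the variable name is arbitrary):
.x <- 9
ls()                  # does not show .x
ls(all.names = TRUE)  # shows .x as well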
As with most computer languages, R has some built-in data types. While it is possible to do certain
things in R without worrying about data types, understanding and consciously using these base
types will help you to write bug-free code that is more robust, and it will certainly speed up the
debugging process. In this section we will highlight the most important ones.
# Complex numbers use the letter i (without multiplication sign):
x <- 2.2 + 3.2i
class(x)
## [1] "complex"
While R allows to change the type of a variable, doing so is not a good practice. It makes
code difficult to read and understand.
# Avoid this:
x <- 3L # x defined as integer
x
## [1] 3
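The example presumably continued by changing the type of x; a sketch (not the book's original code) could be:
x <- "three"  # x silently becomes a character
class(x)
## [1] "character"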
So, keep your code tidy and do not change data types.
Dates
Working with dates is a complex subject. We explain the essence of the issues in Section 17.6
“Dates with lubridate” on page 407. For now, it is sufficient to know that dates are one of the base
types of R.
Make sure to read Section 17.6 “Dates with lubridate” on page 407 for more information
about working with dates as well as the inevitable problems related to dates.
4.3.2 Vectors
4.3.2.1 Creating Vectors
Simply put, vectors are sequences of objects that are all of the same type. They can be the result of a
calculation or be declared with the function c().
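The objects used in the next lines are not created in this extract; a plausible reconstruction is:
# A numeric vector:
v <- c(1, 2, 3, 4, 5)
# Mixing types coerces all elements to one type, here character:
y <- c(1, "apple", TRUE)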
class(y)
## [1] "character"
More about lists can be found in Section 4.3.6 “Lists” on page 53.
v[c(1,5)]
## [1] 1 5
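The lookup below assumes a named character vector, presumably created along these lines:
v <- c(apple = "green", banana = "yellow", cherry = "red")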
v["banana"]
## banana
## "yellow"
v1 <- c(1,2,3)
v2 <- c(4,5,6)
# Standard arithmetic
v1 + v2
## [1] 5 7 9
v1 - v2
## [1] -3 -3 -3
v1 * v2
## [1] 4 10 18
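The warning and result below were presumably produced by recycling vectors of incompatible lengths, for example:
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(1, 2)
v1 + v2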
## Warning in v1 + v2: longer object length is not a multiple of shorter object length
## [1] 2 4 4 6 6
This behaviour is most probably different from what the experienced programmer will
expect. Not only can we add or multiply vectors of a different nature (e.g. integer and real),
but we can also do arithmetic on vectors of different sizes. This is usually not what you
have in mind, and it does lead to programming mistakes. Do make an effort to avoid vector
recycling by explicitly building vectors of the right size.
# Example 1:
v1 <- c(1, -4, 2, 0, pi)
sort(v1)
## [1] -4.000000 0.000000 1.000000 2.000000 3.141593
The time series nottem (from the package “datasets” that is usually loaded when R starts)
contains the temperatures in Nottingham from 1920 to 1939 in Fahrenheit. Create a new
object that contains a list of all temperatures in Celsius.
Note that nottem is a time series object (see Chapter 10 “Time Series Analysis”
on page 255) and not a matrix. Its elements are addressed with nottem[n], where
n is between 1 and length(nottem). However, when printed it will look like a temperature
matrix with months in the columns and years in the rows. This is because the
print function will use functionality specific to the time series object.a
Remember that T(C) = 5/9 · (T(F) − 32).
a This behaviour is caused by the dispatcher-function implementation of an object-oriented
programming model. To understand how this works and what it means, we refer to Section 6.2
“S3 Objects” on page 122.
4.3.4 Matrices
Matrices are a very important class of objects. They appear in all sorts of practical problems: invest-
ment portfolios, landscape rendering in games, image processing in the medical sector, fitting of
neural networks, etc.
# Create a matrix.
M = matrix( c(1:6), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
It is also possible to create a unit or zero vector with the same function. If we supply one
scalar instead of a vector to the first argument of the function matrix(), it will be recycled as many
times as necessary.
# Unit vector:
matrix(1, 2, 1)
## [,1]
## [1,] 1
## [2,] 1
# R warns if the supplied vector does not fit an exact number of times in the matrix:
matrix(1:3, 4, 4)
Once the matrix exists, the columns and rows can be renamed with the functions colnames()
and rownames().
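The matrix M shown below differs from the one created above; it was presumably redefined along these lines (a reconstruction consistent with the output):
M <- matrix(10:21, nrow = 4, byrow = TRUE)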
colnames(M) <- c('C1', 'C2', 'C3')
rownames(M) <- c('R1', 'R2', 'R3', 'R4')
M
## C1 C2 C3
## R1 10 11 12
## R2 13 14 15
## R3 16 17 18
## R4 19 20 21
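The matrices M1 and M2 used below are not defined in this extract; a reconstruction consistent with the output is:
M1 <- matrix(10:21, nrow = 4, byrow = TRUE)
M2 <- matrix(0:11,  nrow = 4, byrow = TRUE)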
M1 * M2
## [,1] [,2] [,3]
## [1,] 0 11 24
## [2,] 39 56 75
## [3,] 96 119 144
## [4,] 171 200 231
M1 / M2
## [,1] [,2] [,3]
## [1,] Inf 11.000000 6.000000
## [2,] 4.333333 3.500000 3.000000
## [3,] 2.666667 2.428571 2.250000
## [4,] 2.111111 2.000000 1.909091
Write a function for the dot product of two matrices. Also add some security checks. Finally,
compare your results with the %*% operator.
The dot product is pre-defined via the %*% operator. Note that the function t() creates the
transposed vector or matrix.
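The object a used below is presumably a simple numeric vector (redefined after the earlier array example), for instance:
a <- 1:3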
a %*% t(a)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 2 4 6
## [3,] 3 6 9
t(a) %*% a
## [,1]
## [1,] 14
# Define A:
A <- matrix(0:8, nrow = 3, byrow = TRUE)
# Test products:
A %*% a
## [,1]
## [1,] 8
## [2,] 26
## [3,] 44
A %*% A
## [,1] [,2] [,3]
## [1,] 15 18 21
## [2,] 42 54 66
## [3,] 69 90 111
There are also other operations possible on matrices. For example the quotient works as
follows:
A %/% A
## [,1] [,2] [,3]
## [1,] NA 1 1
## [2,] 1 1 1
## [3,] 1 1 1
Note that matrices will accept both normal operators and specific matrix operators.
A / A
## [,1] [,2] [,3]
## [1,] NaN 1 1
## [2,] 1 1 1
## [3,] 1 1 1
Note that, while the exponential of a matrix is well defined as the sum of the series
exp(A) = ∑_{n=0}^{+∞} Aⁿ/n!,
R applies functions such as exp() and sin() simply element-wise to a matrix:
# The exponential of A:
exp(A)
## [,1] [,2] [,3]
## [1,] 1.00000 2.718282 7.389056
## [2,] 20.08554 54.598150 148.413159
## [3,] 403.42879 1096.633158 2980.957987
sin(A)
## [,1] [,2] [,3]
## [1,] 0.0000000 0.8414710 0.9092974
## [2,] 0.1411200 -0.7568025 -0.9589243
## [3,] -0.2794155 0.6569866 0.9893582
Note also that some operations will collapse the matrix to another (simpler) data type.
# Collapse to a vector:
colSums(A)
## [1] 9 12 15
rowSums(A)
## [1] 3 12 21
min(A)
## [1] 0
We already saw the function t() to transpose a matrix. There are a few other matrix functions
available in base R. For example, the function diag() returns the diagonal of a matrix, det()
calculates the determinant, etc. The function solve() will solve the equation A %*% x = b, but
when b is missing, it will assume the identity matrix and hence return the inverse of A.
M <- matrix(c(1,1,4,1,2,3,3,2,1), 3, 3)
M
## [,1] [,2] [,3]
## [1,] 1 1 3
## [2,] 1 2 2
## [3,] 4 3 1
# The diagonal of M:
diag(M)
## [1] 1 2 1
# Inverse:
solve(M)
## [,1] [,2] [,3]
## [1,] 0.3333333 -0.66666667 0.33333333
## [2,] -0.5833333 0.91666667 -0.08333333
## [3,] 0.4166667 -0.08333333 -0.08333333
# Determinant:
det(M)
## [1] -12
# The QR composition:
QR_M <- qr(M)
QR_M$rank
## [1] 3
ncol(M)
## [1] 3
rowSums(M)
## [1] 5 5 8
rowMeans(M)
## [1] 1.666667 1.666667 2.666667
mean(M)
## [1] 2
cbind(M, M)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 3 1 1 3
## [2,] 1 2 2 1 2 2
## [3,] 4 3 1 4 3 1
4.3.5 Arrays
Matrices are very useful; however, there will be times when data has more dimensions than just two.
R has a solution with the base type “array.” Unlike matrices, which always have two dimensions,
arrays can have any number of dimensions. However, the requirement that all elements are of
the same data type also holds for arrays.
Note that words like “array” are used as keywords in many computer languages, and it is
important to understand exactly how the concept is implemented in the language that you want to use. In
this section we will introduce you to the practical aspects of working with arrays.
# Create an array:
a <- array(c('A','B'),dim = c(3,3,2))
print(a)
## , , 1
##
## [,1] [,2] [,3]
## [1,] "A" "B" "A"
## [2,] "B" "A" "B"
## [3,] "A" "B" "A"
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] "B" "A" "B"
## [2,] "A" "B" "A"
## [3,] "B" "A" "B"
M1 <- a[,,1]
M2 <- a[,,2]
M2
## col1 col2 col3
## R1 1 10 12
## R2 1 11 13
2. MARGIN: a vector giving the subscripts which the function will be applied over.
E.g., for a matrix ’1’ indicates rows, ’2’ indicates columns, ’c(1, 2)’ indicates rows
and columns. Where ’X’ has named dimnames, it can be a character vector selecting
dimension names.
3. FUN: the function to be applied: see ’Details’. In the case of functions like ’+’, the
function name must be backquoted or quoted.
It is sufficient to provide the data, the dimension over which to apply it, and the function that has to
be applied. To show how this works, we construct a simple example to calculate sums of rows and
sums of columns.
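The example itself is not reproduced in this extract; a minimal sketch of what it might look like is:
x <- data.frame(a = 1:3, b = 4:6)
apply(x, 1, sum)   # row sums: 5 7 9
apply(x, 2, sum)   # column sums: a = 6, b = 15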
The reader will notice that in the example above the variable x is actually not an array but
rather a data frame. The function apply() works, however, the same: instead of two dimensions,
there can be more.
Consider the previous example with the array a, and remember that a has three dimensions:
2 rows, 3 columns, and 2 matrices, then the following should be clear.
# Demonstrate apply:
apply(a, 1, sum)
## R1 R2
## 46 50
apply(a, 2, sum)
## col1 col2 col3
## 4 42 50
apply(a, 3, sum)
## Matrix1 Matrix2
## 48 48
4.3.6 Lists
Where vectors, arrays, and matrices can only contain variables of the same sort (numeric, character,
integer, etc.), the list object allows us to mix different types in one object. The concept of a list
is similar to the concept of an “object” in many programming languages such as C++. Notice, however,
that there is no abstraction, only instances.
Definition: List
In R, lists are objects which are sets of elements that are not necessarily all of the same
type. Lists can mix numbers, strings, vectors, matrices, functions, boolean variables, and
even lists.
Lists might be reminiscent of how objects work in other languages (e.g. a list looks similar to
the struct in C). Indeed, everything is an object in R. However, to understand how R
implements different styles of objects and object-oriented programming, we recommend
reading Chapter 6 on page 117.
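The list L used below is not defined in this extract; a plausible reconstruction, consistent with the output, is a small named list whose third element is called approx:
L <- list(value = pi, name = "pi", approx = 3.14)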
print(L[3])
## $approx
## [1] 3.14
print(L$approx)
## [1] 3.14
V1 <- c(1,2,3)
L2 <- list(V1, c(2:7))
L3 <- list(L2,V1)
print(L3)
## [[1]]
## [[1]][[1]]
## [1] 1 2 3
##
## [[1]][[2]]
## [1] 2 3 4 5 6 7
##
##
## [[2]]
## [1] 1 2 3
print(L3[[1]][[2]][3])
## [1] 4
Note how the list L3 is a list of lists, rather than the concatenation of two lists. Instead of
adding the elements of L2 after those of V1 and having nine slots for data, it has two slots. Those
slots contain the objects L2 and V1 respectively.
The double square brackets will inform R that we want one element returned as its most
elementary class, and they are limited to returning one position. The single square brackets will
return a list and can be given a range.
class(L[2])
## [1] "list"
class(L2[[2]])
## [1] "integer"
# range
L2[1:2]
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 2 3 4 5 6 7
L <- list(1,2)
L[4] <- 4 # position 3 is NULL
L
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## NULL
##
## [[4]]
## [1] 4
L$pi_value <- pi
L
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## NULL
##
## [[4]]
## [1] 4
##
## $pi_value
## [1] 3.141593
It is also possible to delete an element via the square brackets. Note that if we address the
elements of a list by their number, we need to recalculate the numbers after a deletion. If we
address the elements of the list by name, nothing needs to be changed.
L <- L[-2]
L
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## $pi_value
## [1] 3.141593
When deleting an element of a list, the numbering will change so that it appears that the
deleted element was never there. This implies that, when accessing elements of the list by
number, deleting elements is unsafe and can lead to unwanted side effects in the code.
Lists are more complex than vectors: instead of failing with a warning or requiring additional
options to be set, the unlist() function will silently make some decisions for you.
Apart from performance considerations, it might also be necessary to convert parts of a list to
a vector, because some functions will expect vectors and will not work on lists.
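A small illustration of this silent coercion (a sketch):
mixed <- list(1, c(2, 3), "a")
unlist(mixed)   # everything is coerced to character here
## [1] "1" "2" "3" "a"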
4.3.7 Factors
Factors are objects that hold a series of labels. They store the vector along with the distinct
values of the elements in the vector as labels. Factors are in many ways similar to the enum data
type in C, C++ or Java; here they are mainly used to store named constants. The labels are always
of the character type,2 irrespective of the data type of the elements in the input vector.
From the aforementioned example it is clear that the factor-object “is aware” of all the labels
for all observations as well as the different levels (or different labels) that exist. The next code
fragment makes clear that some functions – such as plot() – will recognize the factor-object
and produce results that make sense for this type of object. The following line of code is enough
to produce the output that is shown in Figure 4.1.
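Neither the factor example nor the plot command is reproduced in this extract; a minimal sketch of what such a fragment could look like is:
gender <- factor(c("male", "female", "female", "male", "female"))
gender
## [1] male   female female male   female
## Levels: female male
plot(gender)   # produces a bar chart of the level counts (cf. Figure 4.1)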
2 The character type in R is what in most other languages would be called a “string.” In other words, that is text.
There are a few functions specific to the factor object. For example, the function nlevels()
returns the number of levels in the factor object.
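For example (a standalone sketch):
nlevels(factor(c("bad", "good", "good", "average")))
## [1] 3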
In Figure 4.2 on page 63 we notice that the order is now as desired (it is the order that we have
provided via the argument labels in the function factor()).
gl(3,2,,c("bad","average","good"),TRUE)
## [1] bad bad average average good good
## Levels: bad < average < good
Question #4
Use the dataset mtcars (from the package datasets, which is loaded by default) and explore the
distribution of the number of gears. Then explore the correlation between gears and transmission.
Question #5
Then focus on the transmission and create a factor object with the words “automatic” and
“manual” instead of the numbers 0 and 1.
Use ?mtcars to find out the exact definition of the data.
Question #6
Use the dataset mtcars (from the package datasets) and explore the distribution of the horse-
power (hp). How would you proceed to make a factoring (e.g. Low, Medium, High) for
this attribute? Hint: use the function cut().
Figure 4.3: The standard plot for a data frame in R shows each column plotted in function of each
other. This is useful to see correlations or how the data is generally structured.
Most data is rectangular, and in almost any analysis we will encounter data that is structured in
a data frame. Several functions can be helpful to extract information from a data frame,
investigate its structure, and study its content; a short sketch follows the example below.
d <- data.frame(
Name = c("Piotr", "Pawel","Paula","Lisa","Laura"),
Gender = c("Male", "Male","Female", "Female","Female"),
Score = c(78,88,92,89,84),
Age = c(42,38,26,30,35),
stringsAsFactors = FALSE
)
d$Gender <- factor(d$Gender) # manually factorize gender
str(d)
## 'data.frame': 5 obs. of 4 variables:
## $ Name : chr "Piotr" "Pawel" "Paula" "Lisa" ...
## $ Gender: Factor w/ 2 levels "Female","Male": 2 2 1 1 1
## $ Score : num 78 88 92 89 84
## $ Age : num 42 38 26 30 35
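The book's list of such functions is not reproduced in this extract; a non-exhaustive sketch, using the data frame d defined above, is:
head(d, 2)    # the first two observations
tail(d, 2)    # the last two observations
summary(d)    # summary statistics per column
nrow(d)       # number of observations
ncol(d)       # number of variables
names(d)      # the column names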
a See Chapter 7 “Tidy R with the Tidyverse” on page 161 and note that tibbles (the data-frame alternative
of the tidyverse) behave slightly differently here.
4.3.8.3 Editing Data in a Data Frame
While one usually reads in large amounts of data and uses an IDE such as RStudio that facilitates
the visualization and manual modification of data frames, it is useful to know how this is done
when no graphical interface is available. Even when working on a server, functions such as
edit(), data.entry(), and de() will always be available.
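A minimal sketch of their use (the exact editor that opens depends on the platform and options):
d <- edit(d)      # edit a copy of d and store the result back in d
# data.entry(d)   # edits d in place in a spreadsheet-like window (GUI only)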
Adding rows corresponds to adding observations. This is done via the function rbind().
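For example, adding one (hypothetical) observation to the data frame d from above:
d <- rbind(d,
           data.frame(Name = "Amaia", Gender = "Female",
                      Score = 95, Age = 28,
                      stringsAsFactors = FALSE))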
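The data frame data_test and the merge that produced the output below are not shown in this extract; they were presumably created roughly along these lines (a reconstruction consistent with the output, where data_test2 is a hypothetical second data frame):
data_test <- data.frame(
  Name   = c("Piotr", "Pawel", "Paula", "Lisa", "Laura"),
  Gender = c("Male", "Male", "Female", "Female", "Female"),
  Score  = c(78, 88, 92, 89, 84),
  Age    = c(42, 38, 26, 30, 35)
  )
data_test2 <- data_test[data_test$Age > 36, ]
data_test.merged <- merge(x = data_test, y = data_test2,
                          by = c("Name", "Age"))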
# Only records that match in name and age are in the merged table:
print(data_test.merged)
## Name Age Gender.x Score.x Gender.y Score.y
## 1 Pawel 38 Male 88 Male 88
## 2 Piotr 42 Male 78 Male 78
Short-cuts
R will allow the use of short-cuts, provided that they are unique. For example, in the data frame
data_test there is a column Name. There are no other columns whose names start with
the letter “N”; hence, this one letter is enough to address this column.
data_test$N
## [1] Piotr Pawel Paula Lisa Laura
## Levels: Laura Lisa Paula Pawel Piotr
Use “short-cuts” sparingly and only when working interactively (not in functions or code
that will be saved and re-run later). When another column is later added, the short-cut
will no longer be unique; the behaviour is then hard to predict, and it is even harder to spot the
programming error in a part of your code that previously worked fine.
rownames(data_test)
## [1] "1" "2" "3" "4" "5"
colnames(data_test)[2]
## [1] "Gender"
rownames(data_test)[3]
## [1] "3"
Question #7
2. Convert it to a data-frame,
• a string ends when the same quotes are encountered the next time,
a <- "Hello"
b <- "world"
paste(a, b, sep = ", ")
## [1] "Hello, world"
Note – Paste
In many cases we do not need anything between strings that are concatenated. We can of
course supply an empty string as separator ( sep = '' ), but it is also possible to use the
function paste0():
paste0(12, '%')
## [1] "12%"
• nsmall is the minimum number of digits to the right of the decimal point.
Formatting examples
a <- format(100000000,
            big.mark = " ",
            nsmall = 3,
            width = 20,
            scientific = FALSE,
            justify = "r")
print(a)
## [1] " 100 000 000.000"
4.4 Operators
While we already encountered operators in previous sections when we introduced the data types,
here we give a systematic overview of the operators on base types.
v1 <- c(2,4,6,8)
v2 <- c(1,2,3,5)
v1 + v2 # addition
## [1] 3 6 9 13
v1 - v2 # subtraction
## [1] 1 2 3 3
v1 * v2 # multiplication
## [1] 2 8 18 40
v1 / v2 # division
## [1] 2.0 2.0 2.0 1.6
v1 %% v2 # remainder of division
## [1] 0 0 0 3
While the result of the sum will not surprise anyone, the result of the multiplication might
come as a surprise for users of matrix-oriented software such as Matlab or Octave. In R, an
operation is always element-wise – unless explicitly requested otherwise. For example, the
dot product can be obtained as follows.
v1 %*% v2
## [,1]
## [1,] 68
v1 <- c(8,6,3,2)
v2 <- c(1,2,3,5)
v1 > v2 # bigger than
## [1] TRUE TRUE FALSE FALSE
v1 == v2 # equal
## [1] FALSE FALSE TRUE FALSE
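The vectors used in the next lines were presumably redefined as logical vectors; a reconstruction consistent with the output is:
v1 <- c(TRUE, TRUE, FALSE, FALSE)
v2 <- c(TRUE, FALSE, FALSE, TRUE)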
v1 | v2 # or
## [1] TRUE TRUE FALSE TRUE
!v1 # not
## [1] FALSE FALSE TRUE TRUE
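Before the next lines, v1 was presumably redefined once more as a numeric vector containing zeros and a missing value, for example:
v1 <- c(1, 0, 2, 0, 3, 4, 5, 0, NA)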
v2 <- c(TRUE)
as.logical(v1) # coerce to logical (only 0 is FALSE)
## [1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE NA
v1 & v2
## [1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE NA
v1 | v2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Note that numbers different from zero are considered as TRUE, and only zero is considered
as FALSE. Further, NA is implemented in a smart way. For example, in order to assess
TRUE & NA, we need to know what the second element is; hence it will yield NA. However,
TRUE | NA will be true regardless of what the second element is, hence R will show
the result.
FALSE | NA
## [1] NA
TRUE | NA
## [1] TRUE
FALSE & NA
## [1] FALSE
TRUE & NA
## [1] NA
# right assignment
3 -> x
3 ->> x
#chained assignment
x <- y <- 4
The <<- or ->> operators change a variable in the current environment and in the environment
above the current one. Environments will be discussed in Section 5.1 “Environments in R”
on page 110. Till now we have always been working at the command prompt. This is the root
environment. A function will create a new (subordinate) environment and might, for example,
use a new variable. When the function stops running, that environment ceases to exist and the
variable exists no longer.3
3 If your programming experience is limited, this might seem a little confusing. It is best to accept this and read
further. Then, after reading Section 5.1 “Environments in R” on page 110 have another look at this section.
Hint – Assignment
Generally, it is best to keep it simple: most people will expect to see a left assignment, and
while it might make code a little shorter, the right assignment will be a little confusing
for most readers. It is best to stick to = or <- and, of course, <<- whenever assignment in a
higher environment is also intended.
In some special cases, such as providing values for the named parameters of a function, it is not
possible to use the “arrow” and one must revert to the = sign. This makes sense, because that is not
the same as a traditional assignment.
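The call that produced the error below is not shown in this extract; it was presumably of this form:
v1 <- c(1, 2, 3, NA)
mean(v1, na.rm <- TRUE)  # the arrow does not name the argument:
                         # TRUE is passed positionally to trim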
## Error in mean.default(v1, na.rm <- TRUE): ’trim’ must be numeric of length one
While <<- seems to do exactly the same, it also changes the value of the variable in the
environment above the current one. The following example makes clear how <- only changes the
value of x while the function is active, but <<- also changes the value of the variable x in the
environment that the function was called from.
# f
# Assigns in the current and superior environment 10 to x,
# then prints it, then makes it 0 only in the function environment
# and prints it again.
# arguments:
# x -- numeric
f <- function(x) {x <<- 10; print(x); x <- 0; print(x)}
x <- 3
x
## [1] 3
While it is certainly cool and most probably efficient to change a variable in the envi-
ronment above the function, it will make your code harder to read and if the function is
hidden in a package it might lead to unexpected side effects. This superpower is best used
sparingly.
# +-+
# This function is a new operator
# arguments:
# x -- numeric
# y -- numeric
# returns:
# x - y
`+-+` <- function(x, y) x - y
5 +-+ 5
## [1] 0
5 +-+ 1
## [1] 4
The following are some common operators that help when working with data.
# Create a vector:
x <- c(10:20)
x
## [1] 10 11 12 13 14 15 16 17 18 19 20
# between 2 and 4
R is Turing complete and hence offers a range of tools to make choices and repeat certain parts
of code. Knowing the different ways to change the flow of the code with if-statements and loops is
essential knowledge for each R programmer.
4.5.1 Choices
4.5.1.1 The if-Statement
The workhorse to control the flow of actions is the if() function.
The construct is both simple and efficient.
if (logical statement) {
executed if logical statement is true
} else {
executed if the logical statement if false
}
This basic construct can also be enriched with else if statements. For example, we draw a
random number from the normal distribution and check if it is bigger than zero.
set.seed(1890)
x <- rnorm(1)
if (x < 0) {
print('x is negative')
} else if (x > 0) {
print('x is positive')
} else {
print('x is zero')
}
## [1] "x is positive"
It is possible to have more than one else-if statement and/or use nested statements.
x <- 122
if (x < 10) {
print('less than ten')
} else if (x < 100) {
print('between 10 and 100')
} else if (x < 1000) {
print('between 100 and 1000')
} else {
print('bigger than 1000 (or equal to 1000)')
}
## [1] "between 10 and 1000"
Note that the statements do not necessarily have to be encapsulated by curly brackets if the
statement only takes one line.
x <- TRUE
y <- pi
y <- if (x) 1 else 2
y # y is now 1
## [1] 1
Note that hybrid forms are possible, but it gets confusing very fast. In the following piece of
code the variable y will not get the value one, but rather six.
z <- 0
y <- if (x) {1; z <- 6} else 2
y # y is now 6
## [1] 6
z # z is also 6
## [1] 6
x <- 1:6
ifelse(x %% 2 == 0, 'even', 'odd')
❦ ## [1] "odd" "even" "odd" "even" "odd" "even" ❦
The ifelse function can also use vectors as parameters in the output.
x <- 1:6
y <- LETTERS[1:3]
ifelse(x %% 2 == 0, 'even', y)
## [1] "A" "even" "C" "even" "B" "even"
x <- 'b'
x_info <- switch(x,
'a' = "One",
'b' = "Two",
'c' = "Three",
stop("Error: invalid `x` value")
)
# x_info should be "Two" now:
x_info
## [1] "Two"
The switch statement can always be written as an if-else-if construct. The following code does
the same as the aforementioned code.
x <- 'b'
x_info <- if (x == 'a' ) {
"One"
} else if (x == 'b') {
"Two"
} else if (x == 'c') {
"Three"
} else {
stop("Error: invalid `x` value")
}
# x_info should be "Two" now:
x_info
## [1] "Two"
The switch() statement can always be written with the if-else-if construction, which in
its turn can always be written with nested if-else statements. The same logic also applies to
loops (that repeat parts of code): all loop constructs can be written with a for-loop and an
if-statement, but the more advanced structures help to keep code clean and readable.
4.5.2 Loops
One of the most common constructs in programming is repeating a certain block of code a few
times. This repeating of code is a “loop.” Loops can repeat code a fixed number of times or do this
only when certain conditions are fulfilled.
As in most programming languages, there is a “for-loop” that repeats a certain block of code a
given number of times. Interestingly, the counter does not have to follow a pre-defined increment;
the counter will rather follow the values supplied in a vector. R's for-loop is an important tool to
add to your toolbox.
add to your toolbox.
The for-loop is useful to repeat a block of code a certain number of times. R will iterate a given
variable through elements of a vector.
Function use for for()
The for-loop will execute the statements for each value in the given vector.
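The general form is (a sketch):
for (value in vector) {
  # statements that use value
}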
x <- LETTERS[1:5]
for ( j in x) {
print(j)
}
## [1] "A"
## [1] "B"
## [1] "C"
## [1] "D"
## [1] "E"
Unlike most computer languages, R does not need a numeric “counter” to run a for-loop.
repeat {
commands
if(condition) {break}
}
Example: Repeat loop
x <- c(1,2)
c <- 2
repeat {
print(x+c)
c <- c+1
if(c > 4) {break}
}
## [1] 3 4
## [1] 4 5
## [1] 5 6
Do not forget the {break} statement; it is an integral part of the repeat loop.
4.5.2.3 While
The while-loop is similar to the repeat-loop. However, the while-loop will first check the
condition and only then run the code to be repeated. So, this code might not be executed at all.
while (test_expression) {
statement
}
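A small illustration (a sketch):
n <- 3
while (n > 0) {
  print(n)
  n <- n - 1
}
## [1] 3
## [1] 2
## [1] 1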
v <- c(1:5)
for (j in v) {
if (j == 3) {
print("--break--")
break
}
print(j)
}
## [1] 1
## [1] 2
## [1] "--break--"
The next statement will skip the remainder of the current iteration of a loop and start the next
iteration of the loop.
v <- c(1:5)
for (j in v) {
if (j == 3) {
print("--skip--")
next
}
print(j)
}
## [1] 1
## [1] 2
## [1] "--skip--"
## [1] 4
## [1] 5
n <- 10^7
v1 <- 1:n
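The timing code itself is not reproduced in this extract; a sketch of such an experiment (the exact ratio will differ per machine) is:
t_loop <- system.time({
  v2 <- numeric(n)
  for (i in 1:n) v2[i] <- v1[i] * 2   # element by element in a for-loop
  })
t_vec <- system.time({
  v3 <- v1 * 2                        # the same, as one vector operation
  })
t_loop["elapsed"] / t_vec["elapsed"]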
Note that in our simple experiment the for-loop is 47.3 times slower than the vector operation!
So, use operators at the highest level (vectors, matrices, etc.) and especially make an
effort to understand the apply-family of functions (see Chapter 4.3.5 “Arrays” on page 50) or
their tidyverse equivalents: the map-family of functions of the package purrr. An example
for the apply-family can be found in Chapter 22.2.6 on page 513 and one for the map-
family is in Chapter 19.3 on page 459.
Further information about optimising code for speed and more elegant and robust
timing of code can be found in Chapter 40 “The Need for Speed” on page 997.
4.6 Functions
More than in any other programming language, functions play a prominent role in R. Part of the
reason is the implementation of the dispatcher-function-based object model with the S3 objects
— see Chapter 6 “The Implementation of OO” on page 117.
The user can both use the built-in functions and define his or her own bespoke functions.
4.6.1 Built-in Functions
Right after starting R, some functions are available. We call these the “built-in functions.” Some
examples are:
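The book's list of examples is not reproduced in this extract; a few illustrations of functions that are available out of the box:
sum(1, 2, 3)        # 6
max(c(4, 7, 1))     # 7
seq(1, 9, by = 2)   # 1 3 5 7 9
paste("built", "in")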
R has many “packages” that act as a library of functions that can be loaded with one
command. For more information refer to Chapter 4.7 “Packages” on page 96.
# c_surface
# Calculates the surface of a circle
# Arguments:
# radius -- numeric, the radius of the circle
# Returns
# the surface of the circle
c_surface <- function(radius) {
x <- radius ^ 2 * pi
return (x)
}
c_surface(2) + 2
## [1] 14.56637
Note that it is not necessary to explicitly “return” something. A function will automatically
return the value of the last expression that it evaluates. So, the following fragment would do
exactly the same:
# c_surface
# Calculates the surface of a circle
# Arguments:
# radius -- numeric, the radius of the circle
# Returns
# the surface of the circle
c_surface <- function(radius) {
radius ^ 2 * pi
}
Usually, we will keep functions in a separate file that is then loaded in our code with the command
source(). Editing a function is then done by changing this file and reloading it – and hence
overwriting the existing function content.
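For example (the file name is hypothetical):
source("my_functions.R")  # (re)loads all functions defined in that file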
Most probably you will work in a modern environment such as the IDE RStudio, which makes
editing a text file with code and running that code a breeze. However, there might be cases where
one has only terminal access to R. In that case, the following functions might come in handy.
# Edit the function with vi:
fix(c_surface)
# Or use edit:
c_surface <- edit(c_surface)
Hint
The edit() function uses the vi editor when using the CLI on Linux. This editor is not
so popular any more and you might not immediately know how to close it. To get out of
it: press [esc] , then type :q and press [enter] .
Example
The function paste() collates the arguments provided and returns one string that is a
concatenation of all strings supplied, separated by a separator. This separator is supplied
in the function via the argument sep . What is the default separator used in paste()?
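The output below suggests that c_surface was redefined with a default value for its argument; a reconstruction could be:
c_surface <- function(radius = 2) radius^2 * pi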
c_surface()
## [1] 12.56637
4.7 Packages
One of the most important advantages of R is that it allows you to stand on the shoulders of
giants: you can load a library of additional functionality so that you do not waste time
writing and debugging something that has been solved before. This allows you to focus on your
research and analysis.
Unlike environments such as spreadsheets, R is a programming language that is
extremely flexible, modular, and customizable.
The number of available packages increases fast. At the time of writing, there are about 15
thousand packages available (see the next “Further information” section). We can of course not
explain each package in just one book. Below we provide a small selection as illustration, and in
the rest of the book we will use a selection of 60 packages (which pull in a few hundred upstream
packages). The choice of packages is rather opinionated and personal: R is free software and there
are always many ways to achieve the same result.
More information about the packages as well as the packages themselves can be found
on the CRAN server https://cran.r-project.org/web/packages.
R also provides functionality to get a list of all packages – there is no need to use a web-
crawling or scraper interface.
We can use the function library() to get a list of all packages that are installed on our
machine.
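For example (a sketch; the output is omitted because it depends on your installation):
library()                                        # packages installed on this machine
head(installed.packages()[, c("Package", "Version")])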
While in the rest of the book most code is “active” – in the sense that the output that
appears under a line or the plot that appears close to it was generated while the book was
compiled – the code in this section is “cold”: the code is not executed. The reason is that the
commands from this section would produce long and irrelevant output. The lists would be
long, because the author's computer has many packages installed, but also of little relevance
to you, because you certainly have a different configuration. Other commands would even
change packages as a side effect of compiling this book.
Once we know which packages can be updated, we can execute this update:
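The commands themselves are not reproduced in this extract; presumably something along these lines:
old.packages()      # shows which installed packages have a newer version
update.packages()   # asks, package per package, whether to update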
If you are very certain that you want to update all packages at once, use the ask argument:
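Presumably:
update.packages(ask = FALSE)  # updates all packages without asking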
Most analyses will start with reading in data. This can be done from many types of electronic
formats such as databases, spreadsheets, CSV files, fixed-width text files, etc.
Reading text from a file into a variable can be done by asking R to request the user to provide
the file name as follows:
t <- readLines(file.choose())
This will load the text of the file into the character vector t (one element per line). However,
typically that is not exactly what we need. In order to manipulate data and numbers, it will be
necessary to load the data in a vector or data frame, for example.
In further sections – such as Chapter 15 “Connecting R to an SQL Database” on page 327 – we
will provide more details about data input. Below, we provide a short overview that will certainly
come in handy.
In the aforementioned example, we have first copied the file to our local computer, but
that is not necessary. The function read.csv() is able to read a file directly from the
Internet.
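The call that loaded the data frame d1 used below is not shown in this extract; it was presumably of this form (the exact file name or path is an assumption):
d1 <- read.csv("eurofxref-hist.csv")
# read.csv() also accepts a URL instead of a local file name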
eurofxref-hist.zip?c6f8f9a0a5f970e31538be5271051b3c.
5 With R it is also possible to read files directly from the Internet by supplying the URL to the function
read.csv().
Finding data
Once the data is loaded in R, it is important to be able to make selections and further prepare
the data. We will come back to this in much more detail in Part IV “Data Wrangling” on page 335,
but we already present some essentials here.
d2 <- data.frame(d1$Date, d1$CAD)
d2
## d1.Date d1.CAD
## 1 2008-12-30 1.7331
## 2 2008-12-29 1.7408
## 3 2008-12-18 1.7433
## 4 1999-02-03 1.7151
## 5 1999-01-29 1.7260
## 6 1999-01-28 1.7374
## 7 1999-01-27 1.7526
## 8 1999-01-26 1.7609
## 9 1999-01-25 1.7620
## 10 1999-01-22 1.7515
## 11 1999-01-21 1.7529
## 12 1999-01-20 1.7626
## 13 1999-01-19 1.7739
## 14 1999-01-18 1.7717
## 15 1999-01-15 1.7797
## 16 1999-01-14 1.7707
## 17 1999-01-13 1.8123
## 18 1999-01-12 1.7392
## 19 1999-01-11 1.7463
## 20 1999-01-08 1.7643
## 21 1999-01-07 1.7602
## 22 1999-01-06 1.7711
## 23 1999-01-05 1.7965
## 24 1999-01-04 1.8004
It is also possible to write data back into a file. It is best to use a structured format such as a
CSV file.
## 3 2008-12-18 1.7433
## 4 1999-02-03 1.7151
## 5 1999-01-29 1.7260
## 6 1999-01-28 1.7374
## 7 1999-01-27 1.7526
## 8 1999-01-26 1.7609
## 9 1999-01-25 1.7620
## 10 1999-01-22 1.7515
## 11 1999-01-21 1.7529
## 12 1999-01-20 1.7626
## 13 1999-01-19 1.7739
## 14 1999-01-18 1.7717
## 15 1999-01-15 1.7797
## 16 1999-01-14 1.7707
## 17 1999-01-13 1.8123
## 18 1999-01-12 1.7392
## 19 1999-01-11 1.7463
## 20 1999-01-08 1.7643
## 21 1999-01-07 1.7602
## 22 1999-01-06 1.7711
## 23 1999-01-05 1.7965
## 24 1999-01-04 1.8004
Figure 4.6: The histogram of the most recent values of the CAD only.
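The note below refers to a write.csv() call of roughly this form (the file name is hypothetical):
write.csv(d2, "CAD_rates.csv", row.names = FALSE)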
Without the row.names = FALSE argument, the function write.csv() would add a
column with the row names (which gets the name “X” when the file is read back in).
4.8.3 Databases
Spreadsheets and CSV files are good carriers for reasonably small datasets. Any company that
holds a lot of data will use a database system – see for example Chapter 13 “RDBMS” on page
285. Importing data from a database system is somewhat different. The data is usually structured
in “tables” (logical units of rectangular data); however, they will seldom contain all the information
that we need and usually have too many rows. We also need some protocol to communicate with
the database: that is the role of the structured query language (SQL) – see Chapter 14 “SQL” on page 291.
R can connect to many popular database systems. For example, for MySQL there is – as usual –
a package that provides this functionality.
❦ if(!any(grepl("xls", installed.packages()))){ ❦
install.packages("RMySQL")}
library(RMySQL)
RMySQL
Connecting to the Database
This is explained in more detail in Chapter 15 “Connecting R to an SQL Database” on page
327, but the code segment below will already get you the essential ingredients.
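The code segment itself is not reproduced in this extract; a sketch of the essential ingredients (database name, host, and credentials are of course hypothetical):
myConnection <- dbConnect(MySQL(),
                          user     = "student",
                          password = "secret",
                          dbname   = "library",
                          host     = "localhost")
dbListTables(myConnection)                  # list the tables
df <- dbGetQuery(myConnection,              # run a SELECT and fetch results
                 "SELECT * FROM tbl_students;")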
Update Queries
It is also possible to manipulate data in the database directly from R. This allows us to prepare
data in the database first and then download it and do our analysis, while we keep all the code in
one file.
The dbSendQuery() function can be used to send any query, including UPDATE, INSERT,
CREATE TABLE, and DROP TABLE queries, so we can push results back to the database.
sSQL = ""
sSQL[1] <- "UPDATE tbl_students
SET score = 'A' WHERE raw_score > 90;"
sSQL[2] <- "INSERT INTO tbl_students
(name, class, score, raw_score)
VALUES ('Robert', 'Grade 0', 88,NULL);"
sSQL[3] <- "DROP TABLE IF EXISTS tbl_students;"
for (k in c(1:3)){
  dbSendQuery(myConnection, sSQL[k])
}
The function dbWriteTable() writes a whole data frame to a table in the database (creating or
overwriting it):
dbWriteTable(myConnection, "tbl_name",
             data_frame_name[, ], overwrite = TRUE)
Even though connections will be closed automatically when the scope of the database object
is lost, it is a good idea to close a connection explicitly. Closing a connection explicitly
makes sure that it is closed and does not remain open if something went wrong, and hence
that the resources of our RDBMS are freed. Closing a database connection can be done with
dbDisconnect(myConnection, ...).
♣5♣
5.1 Environments in R
Environments can be thought of as a set of existing or declared objects (functions, variables, etc.).
When we start R, it will create an environment before the command prompt is available to the
user.
The top-level environment is the R command prompt. This is the “global environment,” known
as R_GlobalEnv, and it can be accessed as .GlobalEnv.
As mentioned earlier, the function ls() shows which variables and functions are defined in
the current environment. The function environment() will tell us which environment is the
current one.
environment() # get the environment
## <environment: R_GlobalEnv>
a <- "a"
f <- function (x) print(x)
ls() # note that x is not part of .GlobalEnv
## [1] "a" "f"
When a function starts to be executed, this will create a new environment that is subordinate
to the environment that calls the function.
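The variables b, c, and d used in the example were presumably defined beforehand; a reconstruction consistent with the output further down is:
b <- 0
c <- pi
d <- pi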
# f
# Multiple actions and side effects to illustrate environments
# Arguments:
# x -- single type
f <- function(x){
# define a local function g() within f()
g <- function(y){
b <- "local variable in g()"
print(" -- function g() -- ")
print(environment())
print(ls())
print(paste("b is", b))
    print(paste("c is", c))
  }
# actions in function f:
a <<- 'new value for a, also in Global_env'
x <- 'new value for x'
b <- d # d is taken from the environment higher
c <- "only defined in f(), not in g()"
g("parameter to g")
print(" -- function f() -- ")
print(environment())
print(ls())
print(paste("a is", a))
print(paste("b is", b))
print(paste("c is", c))
print(paste("x is", x))
}
f(a)
## [1] " -- function g() -- "
## <environment: 0x557538bf9e28>
## [1] "b" "y"
## [1] "b is local variable in g()"
## [1] "c is only defined in f(), not in g()"
❦ ## [1] " -- function f() -- " ❦
## <environment: 0x557536bcc808>
## [1] "b" "c" "g" "x"
## [1] "a is new value for a, also in Global_env"
## [1] "b is 3.14159265358979"
## [1] "c is only defined in f(), not in g()"
## [1] "x is new value for x"
b
## [1] 0
c
## [1] 3.141593
Each function within a function or environment will create a new environment that has its
own variables. Variable names can be the same but the local environment will always take prece-
dence. A few things stand out in the example above.
• The variable a does not appear in the scope of the functions.
• However, a function can access a variable defined one level higher if it is not re-defined
in the function itself (see what happens with the variable c: it is not defined in g(), so
R will automatically search the environment above).
• The function g() can use the variable b without it being passed as an
argument. When it changes that variable, it can use that new value, but once we are back
in the function f(), the old value is still there.
Just as any programming language, R has rules for lexical scoping. R is extremely flexible, and this
can be quite intimidating when starting, but it is possible to master this flexibility.
First, a variable does not necessarily need to be declared: R will silently create it or even change
its type.
x <- 'Philippe'
rm(x) # make sure the definition is removed
x # x is indeed not there (generates an error message)
This can, of course, lead to mistakes in our code: we do not have to declare variables, so we
cannot group those declarations so that errors become obvious. This means that if there is a
mistake, one might see strange results that are hard to explain. In such a case, debugging is
not easy. However, this is quite unlikely to get in your way. Follow the rule that one function
is never longer than half an A4 page, and most likely this feature of R will save time instead of
increasing debugging time.
Next, one will expect that each variable has a scope.
# f
# Demonstrates the scope of variables
f <- function() {
a <- pi # define local variable
print(a) # print the local variable
print(b) # b is not in the scope of the function
}
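The definitions of a and b and the call to f() are not shown in this extract; presumably something along these lines (the value of b is an assumption):
a <- 1
b <- 2   # any value: b is simply found one environment up
f()      # prints pi (the local a) and then the value of b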
# f() did not change the value of a in the environment that called f():
print(a)
## [1] 1
This illustrates how variable lookup in R behaves dynamically: when a variable is used, R will first try
to find it in the local scope; if that fails, it goes one step up and continues to do so until it reaches
the root level or finds a definition.
To take this a step further we will study how the scoping within S3 objects works.1 The
example below is provided by the R Core Team.
# Citation from the R documentation:
# Copyright (C) 1997-8 The R Core Team
open.account <- function(total) {
list(
deposit = function(amount) {
if(amount <= 0)
stop("Deposits must be positive!\n")
total <<- total + amount
cat(amount,"deposited. Your balance is", total, "\n\n")
},
withdraw = function(amount) {
if(amount > total)
stop("You do not have that much money!\n")
total <<- total - amount
cat(amount,"withdrawn. Your balance is", total, "\n\n")
},
balance = function() {
cat("Your balance is", total, "\n\n")
    }
)
}
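The two accounts used below were presumably created along these lines (a reconstruction consistent with the output):
ross   <- open.account(100)
robert <- open.account(200)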
ross$withdraw(30)
## 30 withdrawn. Your balance is 70
ross$balance()
## Your balance is 70
robert$balance()
## Your balance is 200
ross$deposit(50)
## 50 deposited. Your balance is 120
ross$balance()
## Your balance is 120
try(ross$withdraw(500)) # no way..
## Error in ross$withdraw(500) : You do not have that much money!
This is a prime example of how flexible R is. At first this seems quite bizarre, until we notice the
<<- operator. This operator indicates that the definition one level higher (in the enclosing environment) is to be used. Also the
1 To fully appreciate what is going on here, it is best to read the section on the object models (Chapter 6 “The
Implementation of OO” on page 87) in R first and more especially Section 6.2 “S3 Objects” on page 91.
variable passed to the function automatically becomes an attribute of the object. Or maybe it was
there because the object is actually defined as a function itself and that function got the parameter
“total” as a variable.
This example is best understood by realizing that this can also be written with a declared value
“total” at the top level of our object.
While this scoping flexibility has its advantages when working interactively, it makes code
more confusing and harder to read and maintain. It is best to be very clear about what is
intended: if an object has an attribute, then declare it. Also, never use variables of a higher
level; rather pass them to your function as a parameter, so that it is clear what is intended.
In fact, a lot is going on in this short example, especially in the line
total <<- total + amount . First of all, open.account is defined as a function. In some
sense that function has only one line of code, which is also its last line, and hence
this is what the function will return. So it will return a list of three functions: “deposit,”
“withdraw” and “balance.”
What might be confusing in the way this example is constructed is that one could expect that when the
function open.account is invoked, it will immediately execute the first function of that list. Well,
that is not what happens. What happens is that the open.account object gets a variable total,
because it is passed as an argument. So, it will exist within the scope of that function.
The odd thing is that rather than behaving as a function, this construct behaves like a normal
object that has an attribute total and a creator function open.account(). This function then
sets this attribute to be equal to the value passed to it.
R will store variable definitions in its memory, and it may happen that you manage to read
in a file, for example, but then a mistake remains in that file. If you do not notice the mistake,
it will go unnoticed (since your values are still there and you can work with them). The
next person who tries to reproduce your code will fail. Start each file that contains your
code with:
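The command to use here is not fixed; a common choice, and a minimal sketch of what is meant, is to clear the workspace so that left-over objects cannot hide mistakes:

rm(list = ls())   # remove all objects from the current workspace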
♣6♣
The Implementation of OO
R is an object oriented (OO) language, but if you know how objects work in, for example, C++, it
might take some mental flexibility to get your mind around how it works in R. R is not a compiled
language, and it is built with versatility in mind. In some sense, most things in R can be considered
as objects and can readily be extended without much formalism and without recompiling.1
Personally, I find it useful to think of R as a functional language with odd object possibilities.
This means that if you want to make some simple analysis, then you might skip this section. We
did our best to keep the required knowledge of OO to a minimum for the whole book.
Programming languages that provide support for object oriented programming allow code
to be data-centric, as opposed to functional languages that are in essence logic-oriented. They
do that by introducing the concept of a “class.” A manifestation of that class then becomes the
object that we can work with. Objects represent real life things. For example, if we want to create
software that manages bank accounts, it might be possible to have one object that is the account,
another that is the customer, etc.
The main idea is that in other parts of the code we can work with the object “account” and
ask that object for its balance. This has multiple advantages. First of all, the logic of the way a
balance is found is only in one place and the same in every call. Second, it becomes easier to pass
on information and keep code lean: if you need the balance, all you have to do is import the
object account and you inherit all its functionality.
There are other ways to keep code clean: for example, an object that is a savings
account will automatically inherit the functionality and data that all accounts share. So it
becomes easy to create other types of accounts that are based on one primordial object account.
For example, current accounts, savings accounts and investment accounts can all inherit from
the object “account.” One of the basic things that all accounts will share is, for example, the way
ownership works and how transactions are allowed. This can be programmed once and used in
all types of accounts that inherit from this one. If necessary, this can even be overridden if there
is one type of account that uses another logic.
1 Object oriented programming refers to the programming style that provides a methodology that enables a logical
system (a real life concept, such as for example “student”) to be modelled as “objects.” In further code it will then
be possible to address this object via its methods and attributes. For example, the object student can have a method
“age” that returns the age of the student based on its attribute birth-date.
Another example could be how we can keep meta-data together with data.
The following code creates, for example, an attribute data_source within the object df.
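A minimal sketch of how this can be done is shown below; the attribute value is just an illustration, and df is assumed to be an existing data frame.

attr(df, "data_source") <- "illustrative example data"
attr(df, "data_source")
## [1] "illustrative example data"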
In many ways, the list object (and many other objects) acts as a manifestation of an object that
can freely be expanded. So, already in Section 4.3.6 “Lists” on page 41, we have used the object
oriented capabilities of R explicitly. This might be a little bewildering and probably leaves the
reader wondering what the true object model behind R is. Well, the answer is that there is not one,
but rather four types of classes.
There are multiple OO systems ready to use in R. They differ in how classes and methods are
defined:
1. Base types. This is not a true OO system as it misses critical features like inheritance and
flexibility. However, it underlies all other OO systems, and hence, it is worth understanding
it a little. Base types are typically thought of as a struct in C.
2. S3. S3 is a popular class type in R. It is very different from the OO implementation that
is in most programming languages such as C++, C#, PHP, Java, etc. In those languages
one would expect to pass a message to an object (for example, ask the object my_account
for its balance via the method my_curr_acc.balance()). The object my_account is of the
type account and hence it is the logic that sits there that will determine what function
balance() is used. These implementations are called message-passing systems. S3 uses a
different logic: it is a generic-function implementation of the OO logic. The S3 object can
still have its own methods, but there are generic functions that will decide which method
to use. For example, the function print() will do different things for a linear model, a
dataset, or a scalar.
3. Then there is also S4, which works very similarly to S3, but there is a formal declaration
of the object (its definition) and it has special helper functions for defining generics and
methods. S4 also allows for “multiple dispatch,” which means that generic functions can
pick methods based on the class of any number of arguments, not just one.
4. Reference classes (RC) are probably closest to what you would expect based on your C++
or C# experience. RC implements a message-passing OO system in R. This means that a
method (function) belongs to a class, and it is not a generic method that decides how
it behaves on different classes. The RC implementation uses the $. This means that a call to
the function balance of an object of class account will look like my_account$balance().
RC objects in R are “mutable.” This means that they do not follow R’s usual “copy-
on-modify” semantics, but are modified in place. This can lead to code that is more difficult
to read, but it is invaluable to solve problems that cannot be solved in S3 or S4.
The base types are built upon the “structures” (structs) from the language C that underlies R.2 Knowing
this inheritance, the possibilities and limitations of the base types should not be a mystery. A
struct is basically a collection of variables that are gathered under one name. In our example
(the bank account) it could hold the name of the account holder and the balance as a number, but
not the balance as a function. The following works:
# Define a string:
acc <- "Philippe"
This means that the base type holds information on how the object is stored in memory (and
hence how many bytes it occupies), what variables it has, etc. The base types are part of R’s code
and compiled, so it is only possible to create new ones by modifying R’s source code and recompiling.
When thinking about the base types, one readily recalls all the types that we studied in
the previous sections: integers, vectors and matrices are base types. However, there are more
exotic ones such as environments, functions and calls.
Some conventions are not straightforward but are deeply embedded in R and many people’s code,
so some things might be somewhat surprising. Consider the following code:
# a function built into core R:
typeof(mean)
## [1] "closure"
is.primitive(mean)
## [1] FALSE
is.function(add1)
## [1] TRUE
is.object(add1)
## [1] FALSE
2 The reader that has knowledge of C might want to know that this is the object-like functionality that is provided
by the struct keyword in C. R is written in C and to program base-R one indeed uses those structures.
3 Experienced C-users might want to think of this as something like the statement switch(TYPEOF(x)) in C.
6.2 S3 Objects
S3 is probably the most simple implementation of an OO system that is still useful. In its simplicity,
it is extremely versatile and user friendly (once you get your old C and C++ reflexes under
control).
The function is.object() returns TRUE both for S3 and S4 objects. There is no base function
that allows us to test directly whether an object is S3, but there is one to test whether an object is S4.
So we can test if something is S3 as follows.
# is.S3
# Determines if an object is S3
# Arguments:
# x -- an object
# Returns:
# boolean -- TRUE if x is S3, FALSE otherwise
is.S3 <- function(x){is.object(x) & !isS4(x)}
is.S3(df)
## [1] TRUE
However, it is not really necessary to create such a function ourselves. We can leverage the
library pryr, which provides a function otype() that returns the OO type of an object.
library(pryr)
otype(M)
## [1] "base"
otype(df)
## [1] "S3"
df$fac <-factor(df$X4)
otype(df$fac) # a factor is S3
## [1] "S3"
The methods are provided by the generic function.4 Those functions will do different things
for different S3 objects.
If you would like to determine whether a function is S3 generic, you can check its source code
for a call to the function UseMethod(). This function takes care of the dispatching and hence
decides which method to call for the given object.
However, this check is not foolproof, because some primitive functions have this switch
statement embedded in their C-code. For example, [ , sum() , rbind() , and cbind() are
generic functions, but this is not visible in their R code.
4 A good way to see a generic function is as an overloaded function with a twist.
Alternatively, it is possible to use the function ftype from the package pryr:
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x563423e48908>
## <environment: namespace:base>
ftype(mean)
## [1] "s3" "generic"
sum
## function (..., na.rm = FALSE) .Primitive("sum")
ftype(sum)
## [1] "primitive" "generic"
R calls the functions that have this switch in their C-code "internal" "generic".
The S3 generic function basically decides to what other function to dispatch its task. For example,
the function print() can be called with any base or S3 object and print() will decide what to do
based on its class. Try the function apropos() to find out what different methods exist (or type
print. in RStudio).
apropos("print.")
## [1] "print.AsIs"
## [2] "print.by"
## [3] "print.condition"
## [4] "print.connection"
## [5] "print.data.frame"
## [6] "print.Date"
## [7] "print.default"
## [8] "print.difftime"
## [9] "print.Dlist"
## [10] "print.DLLInfo"
## [11] "print.DLLInfoList"
## [12] "print.DLLRegisteredRoutines"
## [13] "print.eigen"
## [14] "print.factor"
## [15] "print.function"
## [16] "print.hexmode"
## [17] "print.libraryIQR"
## [18] "print.listof"
## [19] "print.NativeRoutineList"
## [20] "print.noquote"
## [21] "print.numeric_version"
## [22] "print.octmode"
## [23] "print.packageInfo"
## [24] "print.POSIXct"
## [25] "print.POSIXlt"
## [26] "print.proc_time"
## [27] "print.restart"
## [28] "print.rle"
## [29] "print.simple.list"
## [30] "print.srcfile"
## [31] "print.srcref"
## [32] "print.summary.table"
## [33] "print.summaryDefault"
## [34] "print.table"
## [35] "print.warnings"
## [36] "printCoefmat"
## [37] "sprintf"
apropos("mean.")
## [1] ".colMeans" ".rowMeans" "colMeans"
## [4] "kmeans" "mean.Date" "mean.default"
## [7] "mean.difftime" "mean.POSIXct" "mean.POSIXlt"
## [10] "rowMeans"
This approach shows all functions that include “print.” in their name and that are not necessarily
methods of the function print(). Hence, a more elegant way is to use the purpose-built function
methods().
methods(methods)
## no methods found
methods(mean)
## [1] mean.Date mean.default mean.difftime
## [4] mean.POSIXct mean.POSIXlt
## see '?methods' for accessing help and source code
Do not use the dot “.” in function names, because it makes them look like S3 methods.
This might lead to confusion with the convention that methods are
named as <<generic function>>.<<class name>> , especially if there is more than
one dot in the name. For example, print.data.frame() is not univocal: is it the data-
frame method for the generic function print() or is it the frame method for the generic
function print.data()? Another example is the existence of the function t.test() to run
t-tests as well as t.data.frame(), which is the S3 method for the generic function t() to
transpose a data frame.
To access the source code of a class-specific method, one can use the function
getS3method().
getS3method("print","table")
## function (x, digits = getOption("digits"), quote = FALSE, na.print = "",
## zero.print = "0", justify = "none", ...)
## {
## d <- dim(x)
## if (any(d == 0)) {
## cat("< table of extent", paste(d, collapse = " x "),
## ">\n")
## return(invisible(x))
## }
## xx <- format(unclass(x), digits = digits, justify = justify)
## if (any(ina <- is.na(x)))
## xx[ina] <- na.print
## if (zero.print != "0" && any(i0 <- !ina & x == 0))
## xx[i0] <- zero.print
## if (is.numeric(x) || is.complex(x))
## print(xx, quote = quote, right = TRUE, ...)
## else print(xx, quote = quote, ...)
## invisible(x)
## }
## <bytecode: 0x5634250f12e8>
## <environment: namespace:base>
The other way around, it is also possible to list all generics that have a method for
a given class:
methods(class = "data.frame")
## [1] [ [[ [[<-
## [4] [<- $ $<-
## [7] aggregate anyDuplicated as.data.frame
## [10] as.list as.matrix by
## [13] cbind coerce dim
## [16] dimnames dimnames<- droplevels
## [19] duplicated edit format
## [22] formula head initialize
## [25] is.na Math merge
## [28] na.exclude na.omit Ops
## [31] plot print prompt
## [34] rbind row.names row.names<-
## [37] rowsum show slotsFromS3
## [40] split split<- stack
## [43] str subset summary
## [46] Summary t tail
## [49] transform unique unstack
## [52] within
## see '?methods' for accessing help and source code
It is also possible to create a class and set its class attribute simultaneously with the function
structure().
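For example, a sketch of how an account object could be created in one step (the object name an_acc and its values are ours, purely for illustration; the field names mirror the constructor function shown further below):

an_acc <- structure(list("name" = "Philippe", "balance" = 100),
                    class = "account")
class(an_acc)
## [1] "account"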
To create a method for an S3 generic function, all we have to do is follow the naming convention
<<generic function>>.<<class name>> . R will then make sure that if the generic function is
called with an object of “class name”, it will dispatch the action to this
specific function.
# print.account
# Prints an object of type 'account' (a sketch, assuming the account is a
# list with elements 'name' and 'balance' as in the constructor below)
# Arguments:
#    x -- an object of type account
print.account <- function(x, ...) {
  cat("Account of", x$name, "with balance", x$balance, "\n")
}
S3 objects are always built on other, more elementary, types. The function inherits
(x, "classname") allows the user to determine whether the object x inherits from the class
“classname.”
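For example, using the data frame df from earlier in this chapter:

inherits(df, "data.frame")
## [1] TRUE
inherits(df, "account")
## [1] FALSE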
You probably remember that R returned "internal" "generic" as the class for some functions;
so the class can be a vector. That means that the behaviour of that object can depend on
different class-specific methods. The classes have to be listed from most to least specific, so
that the behaviour can follow this cascade and will always execute the most specific behaviour if
it is present.
For example, the class of a glm() object is c("glm", "lm") . This means that the most
specific class is the generalised linear model, but that some behaviour might be inherited from
linear models. When a generic function is called, it will first try to find a glm-specific method.
If that fails, it will look for the lm-method.
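This can be verified directly; the model below is just a dummy example on the built-in dataset cars:

m <- glm(dist ~ speed, data = cars)
class(m)
## [1] "glm" "lm"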
It is possible to provide a constructor function for an S3 class. This constructor function can,
for example, be used to check if we use the right data-type for its attributes.
# account
# Constructor function for an object of type account
# Arguments:
#    x -- character (the name of the account holder)
#    y -- numeric (the initial balance of the account)
# Returns:
#    an object of class 'account', or an error in case of invalid input
account <- function(x,y) {
if (!is.numeric(y)) stop("Balance must be numeric!")
if (!is.atomic(x)) stop("Name must be atomic!!")
if (!is.character(x)) stop("Name must be a string!")
structure(list("name" = x, "balance" = y), class = "account")
}
The advantage of using a creator function for an instance of an object is obvious: it will
perform some checks and will avoid problems later on. Unlike in message-passing OO imple-
mentations, the S3 implementation allows one to bypass the creator function or, worse, it allows you
to change the class all too easily. Consider the following example.
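The code below is a sketch of what is meant: we first create an account with the constructor function (with a balance of 100, so that it matches the output further below), and then show how easily the S3 class can be set or changed by hand.

# Create an account via the constructor function:
my_curr_acc <- account("Philippe", 100)

# Nothing, however, stops us from bypassing the constructor ...
fake_acc <- list("name" = "Philippe", "balance" = 100)
class(fake_acc) <- "account"     # no checks are performed here

# ... or even from changing the class of an existing object:
class(my_curr_acc) <- "list"
class(my_curr_acc) <- "account"  # change it back before we continue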
# add_balance
# Dispatcher function to handle the action of adding a given amount
# to the balance of an account object.
# Arguments:
# x -- account -- the account object
# amount -- numeric -- the amount to add to the balance
add_balance <- function(x, amount) UseMethod("add_balance")
This construct will do nothing else than try to dispatch the real action to other functions.
However, since we did not program them yet, there is nothing to dispatch to. To add those
methods, it is sufficient to create a function that follows the right naming convention.
# add_balance.account
# Object specific function for an account for the dispatcher
# function add_balance()
# Arguments:
# x -- account -- the account object
# amount -- numeric -- the amount to add to the balance
add_balance.account <- function(x, amount) {
  x[[2]] <- x[[2]] + amount
  # Note that much more testing and logic can go here.
  # It is not so easy to pass a pointer to a function, so we
  # return the new balance:
  x[[2]]
}
my_curr_acc <- add_balance(my_curr_acc, 225)
print(my_curr_acc)
## [1] 325
Leaving the code at this level is not really safe. It is wise to foresee a default action in case
the function add_balance() is called with an object of another class.
# add_balance.default
# The default action for the dispatcher function add_balance
# Arguments:
# x -- account -- the account object
# amount -- numeric -- the amount to add to the balance
add_balance.default <- function(x, amount) {
stop("Object provided not of type account.")
}
So in S3, it is not the object that has to know how a call to it has to be handled, but it is the generic
function5 that gets the call and has to dispatch it.
The way this works in R is by calling UseMethod() in the dispatching function. This creates
a vector of function names, like
paste0("generic", ".", c(class(x), "default"))
and dispatches to the most specific handling function available. The default class is the last in the
list and is the last resort: if R does not find a class-specific method, it will call the default action.
Below you can see this in action:
# probe
# Dispatcher function
# Arguments:
# x -- account object
# Returns
# confirmation of object type
probe <- function(x) UseMethod("probe")
# probe.account
# action for account object for dispatcher function probe()
# Arguments:
# x -- account object
# Returns
# confirmation of object "account"
probe.account <- function(x) "This is a bank account"
# probe.default
# action if an incorrect object type is provided to probe()
# Arguments:
# x -- account object
# Returns
# error message
probe.default <- function(x) "Sorry. Unknown class"
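Calling probe() on different objects then shows the dispatching at work (the account created below is just an illustration):

probe(account("Paul", 200))
## [1] "This is a bank account"
probe(pi)
## [1] "Sorry. Unknown class"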
As you can see from the above, methods are normal R functions with a specific name.
So, you might be tempted to call them directly (e.g. call print.data.frame() directly
when working with a data frame). Actually, that is not such a good idea: if,
for example, you later improve the dispatch method, then this call will never see those
improvements.
However, you might find that in some cases there is a significant performance gain when
skipping the dispatch method . . . well, in that case you might consider bypassing the dis-
patching and adding a remark in the code to watch this instance.a
a How to measure and improve speed is described in Chapter 40 “The Need for Speed” on page 793.
1. Group Math : Members of this group dispatch on x . Most members accept only one argu-
ment, except log , round and signif that accept one or two arguments, while trunc
accepts one or more. Members of this group are:
• abs , sign , sqrt , floor , ceiling , trunc , round , signif
• exp , log , expm1 , log1p
• cos , sin , tan , cospi , sinpi , tanpi , acos , asin , atan , cosh , sinh , tanh , acosh , asinh , atanh
• lgamma , gamma , digamma , trigamma
• cumsum , cumprod , cummax , cummin
2. Group Ops : This group contains both binary and unary operators ( + , - and ! ): when a
unary operator is encountered, the Ops method is called with one argument and e2 is
missing. The classes of both arguments are considered in dispatching any member of this
group. For each argument, its vector of classes is examined to see if there is a matching
specific (preferred) or Ops method. If a method is found for just one argument or the same
method is found for both, it is used. If different methods are found, there is a warning about
incompatible methods : in that case or if no method is found for either argument, the
internal method is used. If the members of this group are called as functions, any argument
names are removed to ensure that positional matching is always used.
• + , - , * , / , ^ , %% , %/%
6 Notice that there are no objects of these names in base R, but, for example, you will find some in the methods
package. This package provides formally defined methods and objects for R.
• &, |, !
• == , != , < , <= , >= , >
3. Group Summary : Members of this group dispatch on the first argument supplied.
• all , any
• sum , prod
• min , max
• range
Of course, a method defined for an individual member of the group takes precedence over a
method defined for the group as a whole, because it is more specific.
Math, Ops, Summary, and Complex aren’t functions themselves, but instead represent
groups of functions. Also note that inside a group generic function a special variable
.Generic provides the actual generic function that is called.
If you have complex class hierarchies, it is sometimes useful to call the parent method. This
parent method is the method that would have been called if the object-specific one did not exist.
For example, if the object is of class savings_account, which is a child of account, then calling the
function with savings_account will use the method associated to account if there is no
specific method, and it will call the specific method if it exists.
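In S3, calling the parent method is done with the function NextMethod(). Below is a minimal sketch that builds on the print.account() method defined earlier; the class savings_account and its method are our own illustration.

# A savings account gets its own print method that then falls back
# on the print method of its parent class:
print.savings_account <- function(x, ...) {
  cat("Savings account:\n")
  NextMethod()   # dispatch to the next class in the class vector: account
}
my_sav_acc <- structure(list("name" = "Philippe", "balance" = 100),
                        class = c("savings_account", "account"))
print(my_sav_acc)
## Savings account:
## Account of Philippe with balance 100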
6.3 S4 Objects
The S4 system is very similar to the S3 system, but it adds a certain obligatory formalism. For
example, it is necessary to define the class before using it. This adds some lines of code but the
payoff is increased clarity.
In S4:
1. classes have formal definitions that describe their data fields and inheritance structures
(parent classes);
2. method dispatch is more flexible and can be based on multiple arguments to a generic
function, not just one; and
While the methods package is always available when running R interactively (like in
RStudio or in the R terminal), it is not necessarily loaded when running R in batch mode.
So, you might want to include an explicit library(methods) statement in your code
when using S4.
• Representation: A list of slots (or attributes), giving their names and classes. For example,
a person class might be represented by a character name and a numeric age, as follows:
representation(name = "character", age = "numeric")
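A complete, minimal class definition along these lines could then look as follows (the class name person is just an example):

setClass("person",
         representation(name = "character",
                        age  = "numeric"))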
This will only create the definition of the objects. So, to create a variable in your code that can
be used to put your data or models inside, instances have to be created with the function new().
Note the difference in syntax – for the function setClass – between how the argument
representation and the argument contains take values. The representation argu-
ment takes a function and hence, more arguments can be passed by adding them comma
separated. In order to pass more than one parent class to contains, one needs to provide
a character vector (for example c("InvAcc","Acc") ).
Both the arguments slots and contains will readily use S4 classes and the implicit class of
a base type. In order to use S3 classes, one needs first to register them with setOldClass(). If we
do not want type control when an instance of a class is generated, we can provide to the slots
argument a special class “ANY” (this tells R not to restrict the input).
You might not have noticed right away, but we started off with a complex problem where
some objects depend on others (in OO we speak about “parents” and “children”) and where
some objects even take others as attributes. Those two things are very different and a little tricky to
understand.
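A minimal sketch of such definitions, consistent with the slots that are used below (the exact slot types are assumptions), is the following:

# A bank; the phone slot is deliberately left unrestricted ("ANY"),
# because both numbers and strings are used for it in this chapter:
setClass("Bnk",
         representation(name  = "character",
                        phone = "ANY"))

# The account: the structure that all types of accounts have in common:
setClass("Acc",
         representation(holder       = "character",
                        branch       = "character",
                        opening_date = "Date"))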
At this point, the classes Bnk and Acc exist and we can create a first instance for both.
# Create an instance of Bnk:
my_cust_bank <- new("Bnk",
name = "HSBC",
phone = 123456789)
There is also a specific function to get attributes from an object: attr(). This function allows
one to create attributes, change them, or even remove them (by setting them to NULL).
While the function attr() allows partial matching, it is never a good idea to use partial
matching in a batch environment: this can lead to hard-to-detect programming errors.
Some attributes – such as class, comment, dim, dimnames, names, row.names and tsp (for time
series objects) – are special: they can only take certain values. This knowledge can even be used to
change those attributes.
x <- 1:9
x # x is a vector
## [1] 1 2 3 4 5 6 7 8 9
class(x)
## [1] "integer"
Alternative ways to access slots (attributes) include the function slot() , which works like [[
for regular objects.
slot(my_acc, "holder")
## [1] "Philippe"
The object my_acc is actually not very useful. It is a structure that is common to
all types of accounts (e.g. investment accounts, savings accounts and current accounts). However,
no bank would just sell an empty structure account. So, let us open a current account first.
# Note that the following does not work and is bound to fail:
also_an_account <- new("CurrAcc",
holder = "Philippe",
interest_rate = 0.01,
balance=0, Acc=my_acc)
## Error in initialize(value, ...): invalid name for slot of class "CurrAcc": Acc
Question #8
Why does the second approach fail? Would you expect it to work?
It appears that while the object my_acc exists, it is not possible to insert it in the definition
of a new object, even though this new object inherits from the first. This makes sense, because the object
“account” is not an attribute of the object “current account”: its attributes directly become
attributes of the current account.
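The creation of the current account itself is done with new(), supplying both the slots of CurrAcc and the inherited ones; the values below are chosen to match the slots that are printed further in this section.

my_curr_acc <- new("CurrAcc",
                   holder        = "Philippe",
                   interest_rate = 0.01,
                   balance       = 0,
                   branch        = "PAR01",
                   opening_date  = as.Date("2020-01-30"))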
The object my_curr_acc is now ready to be used. For example, we can change the balance.
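For instance (the new balance is arbitrary):

my_curr_acc@balance <- 100
my_curr_acc@balance
## [1] 100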
Now, we will create an investment account. At this point, it becomes crucial to see that the
object “custodian bank” is not a parent class, but rather an attribute. This means that before we
can create an investment account, we need to define at least one custodian.
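A sketch of how this can look follows below; the class definition of InvAcc and the exact values are assumptions, chosen so that they are consistent with the output of str() later in this chapter.

# The investment account: it inherits from Acc and has a custodian bank
# as one of its slots:
setClass("InvAcc",
         representation(custodian = "Bnk"),
         contains = "Acc")

# First, we need (at least) one custodian bank:
my_cust_bank <- new("Bnk",
                    name  = "HSBC Custody",
                    phone = "123123123")

# Now, we can create the investment account:
my_inv_acc <- new("InvAcc",
                  custodian    = my_cust_bank,
                  holder       = "Philippe",
                  branch       = "DUB01",
                  opening_date = as.Date("2019-02-21"))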
Question #9
If you look carefully at the code fragment above this question, you will notice that it is
possible to provide the object my_cust_bank as an attribute to the object my_inv_acc. This
situation is similar to the code just above the previous question, but unlike in the creation of
also_an_account, now it works. Why is this?
The function getSlots() will return a description of all the slots of a class: getSlots()
getSlots("Acc")
## holder branch opening_date
## "character" "character" "Date"
Did you notice that R is silent about the missing balance? This is something to be careful
with. If you forget that a default value has been assigned then this might lead to confusing
mistakes.
An empty value for balance is not very useful and it can even lead to errors. Therefore, it is
possible to assign default values with a prototype when creating the class definition.
setClass("CurrAcc",
representation(interest_rate = "numeric",
balance = "numeric"),
contains = "Acc",
prototype(holder = NA_character_,
interst_rate = NA_real_,
balance = 0))
Most programming languages implement an OO system where class definitions are cre-
ated when the code is compiled and instances of classes are created at runtime. During
runtime, it is not possible to change the class definitions.
However, R is an interpreted language that is interactive and functional. The conse-
quence is that it is possible to change class definitions at runtime (“while working in the
R-terminal”). So it is possible to call setClass() again with the same class name, and
R will assume that you want to change the previously defined class definition and will silently
override it. This can lead, for example, to the situation where different objects pretend to
be of the same class, while they are not.
To make sure that a previous class definition cannot be changed, add sealed = TRUE to
the call to setClass().
## Slot "branch":
## [1] "PAR01"
##
## Slot "opening_date":
## [1] "2020-01-30"
Unlike in C++, for example, a call to new() will not automatically invoke the constructor
function (its existence is not enough to invoke it automatically). Make it a good habit
to always use the constructor function explicitly for S4 objects (provided it exists, of
course).
[email protected]
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4 5 6
##
## [[3]]
## [1] 7 8 9
xdf@data_owner
## [1] "customer relationship team"
To recognise an S4 object, note that:
• isS4() returns TRUE; note that this is not the same as is.S3(), which is the class-specific
method of the function is();
• pryr::otype() returns "S4" .
S4 generics and methods are also easy to identify because they are S4 objects with well-defined
classes.
There aren’t any S4 classes in the commonly used base packages (stats, graphics, utils,
datasets, and base), so we will continue to use our previous example of the bank accounts.
str(my_inv_acc)
## Formal class 'InvAcc' [package ".GlobalEnv"] with 4 slots
## ..@ custodian :Formal class 'Bnk' [package ".GlobalEnv"] with 2 slots
## .. .. ..@ name : chr "HSBC Custody"
## .. .. ..@ phone: chr "123123123"
## ..@ holder : chr "Philippe"
## ..@ branch : chr "DUB01"
## ..@ opening_date: Date[1:1], format: "2019-02-21"
isS4(my_inv_acc)
## [1] TRUE
pryr::otype(my_inv_acc)
## [1] "S4"
The package methods provides the function is(). This function takes one object as argument
and lists all classes that the object provided as argument inherits from. Using is() with two
arguments tests whether an object inherits from the class specified in the second argument.
is(my_inv_acc)
## [1] "InvAcc" "Acc"
is(my_inv_acc, "Acc")
## [1] TRUE
The downside of the function-centric OO system is that some things become a little subtle.
Earlier we explained how to use isS4() . There is no function isS3(), but one will notice
that is.S3() exists. Now you will understand that is.S3() is the S3-specific method of
the function is().
Looking up the source code can be helpful:
is.S3
## function(x){is.object(x) & !isS4(x)}
## <bytecode: 0x5634256e9f60>
There are many functions related to S4 objects, and it is not our aim to provide a full list;
however, the following might be useful for your code.
• getClasses() lists all S4 classes (it does, however, include shim classes for S3 classes and
base types);
• showMethods() shows the methods for one or more generic functions, possibly restricted
to those involving specified classes. Note that the argument where can be used to restrict
the search to the current environment by using where = search();
• setGeneric() creates a new generic or converts an existing function into a generic.
• setMethod() creates a method for a generic function aligned to a certain class. It takes as
arguments the function, the signature of the class and the function definition.
We will build further on the example of the bank accounts as used in the previous sections of
this chapter. As a first step, we can create methods to credit and debit a current account S4 object.
# credit
# Generic function to credit the ledger of an object of
# type 'account'.
# Arguments:
#    x -- account object
#    y -- numeric -- the amount to be credited
setGeneric("credit", function(x, y) standardGeneric("credit"))
While the functionality for credit might seem trivial, in reality crediting an account will
require a lot of checks (e.g. for sanctioned countries and terrorist financing). So, let us now create a
slightly more useful example with a function debet(), because before debiting an account, one will
need to check if there is enough balance.
# debet
# Generic function to debet an account
# Arguments:
#    x -- account object
#    y -- numeric -- the amount to be taken from the account
# Returns
#    confirmation of action or lack thereof
setGeneric("debet", function(x, y) standardGeneric("debet"))
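The class-specific methods are then registered with setMethod(). The method for credit below matches the definition that selectMethod() returns at the end of this section; the method for debet is a sketch of the balance check described above (the error message is our own).

setMethod("credit",
          "CurrAcc",
          function (x, y) {
            new_bal <- x@balance + y
            new_bal
          })

setMethod("debet",
          "CurrAcc",
          function (x, y) {
            if (y > x@balance) {
              stop("Not enough balance on the account!")
            }
            new_bal <- x@balance - y
            new_bal
          })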
Warning – Overloading functions
If you want to overload an existing function such as union() or exp(), then you should
of course not create a new generic skeleton as in the code above: doing so will make the
original definition unavailable. Instead, call setGeneric() with only the name of the
existing function, so that it becomes the default method.
These aspects require utmost care. Personally, the author believes that this is where the S4
OO implementation stops being practically relevant: when one needs further dependencies and
complexity, the S4 approach all too easily becomes too complex and it might be hard to predict which
method will be called.
For this reason, it might be better to avoid multiple inheritance and multiple dispatch unless
absolutely necessary.
Finally, there are methods that allow us to identify which method gets called, given the spec-
ification of a generic call:
selectMethod("credit", list("CurrAcc"))
## Method Definition:
##
## function (x, y)
## {
## new_bal <- x@balance + y
## new_bal
## }
##
## Signatures:
## x
## target "CurrAcc"
## defined "CurrAcc"
There is a lot more to say about S4 inheritance and method dispatch, though it is not
really necessary in the rest of the book. Therefore, we refer to other literature. For example,
“Advanced R” by Wickham (2014) is a great source.
6.4 The Reference Class, refclass, RC or R5 Model
The reference class OO system is also known as “RC,” “refclass” or “R5.” It is the most recent7
OO implementation in R and it introduces a message-passing OO system.
Reference classes are reasonably new to R and will therefore develop further after the
publication of this book. So, for the most up-to-date information, we refer to R itself: type
?ReferenceClasses at the command prompt to see the latest updates in your version
of R.
People with an OOP (object oriented programming) background will naturally feel more comfortable with RC: it is
what people with a C++, C#, PHP, Java, Python, etc. background will be familiar with, namely a message-
passing OO implementation. However, that sense of comfort has to be mended: in many ways,
the refclass system in R is a combination of S4 and environments.
That said, the RC implementation brings R programming to the next level. This system is par-
ticularly suited for larger projects, and it will seamlessly collaborate with S4, S3 and base types.
However, note that the vast majority of packages in R do not use RC; actually, none of the most-
often used packages do. This is not only because they pre-date the refclass system but also because
they do not need it (even while some are rather complex).
Using RC in R will add some complexity to your code and many people advise to use the
refclass system only where mutable state is required. This means that even while using R5, it
is still possible to keep most of the code functional.
The function setRefClass() creates a new reference class. Besides the name of the class, its most
important arguments are:
1. contains: the classes from which it inherits; note that only other refclass objects are
allowed;
2. fields: these are the attributes of the class; they are the equivalent of “slots” in S4; one can
supply them via a vector of field names or a named list of field types;
3. methods: these functions are the equivalent of the dispatched methods in S4, and they
operate within the context of the object and can modify its fields. While it is possible to add
these later, this is not good programming practice: it makes the code less readable and will
lead to misunderstandings.
We will illustrate this with a remake of the example about bank accounts.
7 R5 has recently been added to R (2015). It responds to the need for mutable objects and as such makes packages
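A sketch of the refclass version of the account is the following; the field types mirror the S4 class Acc that we used before, and the generator object is stored in the variable account.

account <- setRefClass("account",
                       fields = list(holder       = "character",
                                     branch       = "character",
                                     opening_date = "Date"))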
Note
Fields can also be supplied as a vector (we use a list in the example above). Using a vector does
not allow one to specify a type. It looks as follows:
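For example (all fields then accept values of any type):

setRefClass("account",
            fields = c("holder", "branch", "opening_date"))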
Hint
It is possible to leave the input type of a field undecided by giving it the special class
"ANY" in the list. This could look like this:
setRefClass("account",
            fields = list(holder       = "ANY",   # accepts all types
                          branch       = "ANY",   # accepts all types
                          opening_date = "Date"   # dates only
                          )
            )
Let us now explore this object that was returned by the function setRefClass().
isS4(account)
## [1] TRUE
The object returned by setRefClass() (or retrieved later by getRefClass()) is called a generator
object. This object automatically gets the following methods.
• new to create instances of the class. Its arguments are the named arguments specifying
initial values for the fields;
• methods to add methods or modify existing ones;
• lock to lock the named fields so that their value can only be set once;
• accessors to set up accessor-methods in the form of getxxx and setxxx (where “xxx” is replaced
by the field names).
Refclass objects are mutable, or in other words, they have reference semantics. To understand
this, we shall create the current account class and the investment account class.
Let us now come back to the current account and define it while adding some methods
(though, as we will see later, it is possible to add them later too).
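The definition below is a sketch: the class name currAcc, the extra fields and the methods deposit() and withdraw() are our own choices, but they show the pattern of the methods argument and the <<- operator that is discussed next.

currAcc <- setRefClass("currAcc",
              fields   = list(interest_rate = "numeric",
                              balance       = "numeric"),
              contains = c("account"),
              methods  = list(
                deposit = function(amount) {
                  balance <<- balance + amount
                },
                withdraw = function(amount) {
                  if (amount > balance) stop("Not enough balance!")
                  balance <<- balance - amount
                }
              ))

# Create an instance and use its methods:
my_rc_acc <- currAcc$new(holder = "Philippe", branch = "PAR01",
                         opening_date = as.Date("2020-01-30"),
                         interest_rate = 0.01, balance = 0)
my_rc_acc$deposit(100)
my_rc_acc$balance
## [1] 100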
Notice that the assignment operator to use for the fields within an object's methods is
<<- . The operator <<- will call the accessor functions if they are defined (via the
object$accessors() function).
The Reference Class OO implementation (R5) uses the $ to access attributes and methods.
It is also possible – though not recommended – to create methods after the creation of the class.
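This is done with the methods() function of the generator object; the method shown below is, again, only an illustration.

currAcc$methods(
  print_balance = function() {
    cat("The balance is", balance, "\n")
  }
)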
Other common methods for R5 objects that are always available (because they are inherited
from envRefClass) are the following, illustrated for an object of RC class named RC_obj:
• RC_obj$initFields(): this method initializes the values of the fields if the super-class has no
initialization methods.
• RC_obj$copy(): this method creates a copy of the current object. This is necessary
because Reference Class objects do not behave like most R objects, which are copied on
assignment or modification.
• RC_obj$field(): this method provides access to named fields; it is the equivalent of slots
for S4. RC_obj$field("xxx") is the same as RC_obj$xxx, and RC_obj$field("xxx", 5) is the
same as assigning the value via RC_obj$xxx <- 5.
The OO system that R provides is unlike what other OO languages provide. In the first place, it
offers not only a method-dispatching system but also a message-passing system. Secondly, it is
of great importance that it is possible to use R without even knowing that the OO system exists. In
fact, for most of the following chapters in this book, it is enough to know that the generic-function
implementation of the OO logic exists.
Digression – R6
Note that we did not discuss the R6 OO system. R6 is largely the same as R5 but adds some
important OO features to the mix, such as private and public functions and properties.
However, it is not widely used, and most probably it will become obsolete in further versions
of R5.a
a The development of R6 can be followed up here: https://www.r-project.org/nosvn/pandoc/R6.html
Three or even four OO systems in one language is a lot, but that complexity does not stand in
the way of practical applications. It seems that a little common sense allows the user to take the
best of what is available, even when mixing systems.
So, what system to use?
1. For a casual calculation of a few hundred lines of code, it is probably not necessary to define
your own classes. You probably will use S3 implicitly in the packages that you load, but
you will not have to worry about it at all: just use plot() and print() and expect it to work.
2. If your inheritance structure is not too complex (and so you do not need multi-argument
method signatures) and if your objects will not change themselves then S3 is the way to go.
Actually, most packages on CRAN use S3.
3. If your objects are still static (not self-modifying), but you need multi-argument method
signatures, then S4 is the way to go. You might want to opt for S4 also for the formalism,
as it can make code easier to understand for other people. However, be careful with the
inheritance structure; it might get quite complex.
4. If your objects are self-modifying, then RC is the best choice. In S3 and S4, you would need
to use replacement methods instead; see the example after this list.
5. For larger projects, where many people work on the same code it makes sense to have a
look at R6 (it allows private methods, for example).
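As announced in point 4 above, a sketch of such a replacement method in S3, building on the account class used earlier in this chapter, could look as follows.

# A replacement function for the balance of an S3 account object:
`balance<-` <- function(x, value) {
  x$balance <- value
  x
}

my_acc <- structure(list("name" = "Philippe", "balance" = 100),
                    class = "account")
balance(my_acc) <- 150
my_acc$balance
## [1] 150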
In fact, the OO system implemented in R is such that it does not get in the way of what you
want to do, and you can do all your analysis of great complexity and practical applicability without
even knowing about the OO systems. For the casual user and the novice alike, the OO system
takes much of the complexity away. For example, one can type summary(foo) , and it will work
regardless of what this “foo” is, and you will be provided with a summary that is relevant for the
type of object that foo represents.
♣7♣
R is Free and Open Source Software (FOSS); that implies that it is free to use, but also that you have
access to the code, if desired. As most FOSS projects, R is also easy to expand. Fortunately, it is
also a popular language, and some of these millions of R users1 might have created a package that
enhances R's functionality to do just what you need. This allows any R user to stand on the shoul-
ders of giants: you do not have to re-invent the wheel, but you can just pick a package and expand
your knowledge and that of humanity. That is great, and that is one of the most important reasons
to use R. However, this also has a dark side: the popularity and the ease of expanding the language
mean that there are literally thousands of packages available. It is easy to be overwhelmed by the
variety and vast amount of packages available, and this is also one of the key weaknesses of R.
Most of those packages will require one or more other packages to be loaded first. These
packages will in their turn also have dependencies on yet other (or the same) packages. These
dependencies might require a certain version of the upstream package. This package maintenance
problem used to be known as the “dependency hell.” The package manager of R does a good job,
however, and it usually works as expected.
Using the same code again after a few years is usually more challenging. In the meanwhile,
you might have updated R to a newer version, and most packages will have been updated too. It might
happen that some packages have become obsolete and are not maintained any more, and that therefore
no compatible version is available. This can cause some other packages to fail.
Maintaining code is not a big challenge if you just write a project for a course at the university
and will never use it again. Code maintenance becomes an issue when you want to use the code
later . . . but it becomes a serious problem if other colleagues need to review your work, expand it
and change it later (while you might not be available).
Another issue is that, because of this flexibility, core R is not very consistent (though people
will argue that Linux does an even worse job here and still is the best operating system).
Consistency does matter, and it follows from the choice of a programming philosophy. For
example, R is software to do things with data, so each function should have a first argument
that refers to the data. Many functions will follow this rule, but not all. Similar issues exist for
arguments to functions and names of objects and classes (e.g. there is vector and Date, etc.).
1 According to the Tiobe-index (see https://www.tiobe.com/tiobe-index), R is the 14th most popular programming language.
Then there is the tidyverse. It is a recent addition to R that is both a collection of often used
functionalities and a philosophy.
The developers of the tidyverse promote2:
• Use existing and common data structures. So all the packages in the tidyverse will share
common S3 class types; this means that in general functions will accept data frames (or
tibbles). More low-level functions will work with the base R vector types.
• Reuse data structures in your code. The idea here is that there is a better option than always
over-writing a variable or creating a new one in every line: pass the output of one line on to
the next with a “pipe”: %>% (a short example follows after this list). To be accepted in the
tidyverse, the functions in a package need to be able to use this pipe.3
• Keep functions concise and clear. For example, do not mix side-effects and transformations,
function names should be verbs where ever possible (unless they become too generic or
meaningless, of course), and keep functions short (they do only one thing, but do it well).
• Embrace R as a functional programming language. This means that reflexes that you might
have from, say, C++, C#, Python, PHP, etc., will have to be mended. It is, for example, best
to use immutable objects and copy-on-modify semantics and avoid using the
refclass model (see Section 6.4 “The Reference Class, refclass, RC or R5 Model” on page 113).
Use where possible the generic functions provided by S3 and S4. Avoid writing loops (such
as repeat and for) and use the apply family of functions instead (or refer to the package purrr).
• Keep code clean and readable for humans. For example, prefer meaningful but long variable
names over short but meaningless ones, be considerate towards people using auto-complete
in RStudio (so add an identifier in the first and not the last letters of a function name), etc.
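As announced above, here is a minimal example of such a pipe, using dplyr and the built-in dataset mtcars:

library(dplyr)
mtcars %>%
  filter(cyl == 6) %>%
  summarise(avg_mpg = mean(mpg))
##   avg_mpg
## 1 19.74286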
The tidyverse is in permanent development, as are core R itself and many other packages. For further
and most up-to-date information, we refer to the website of the tidyverse: http://tidyverse.tidyverse.org.
Tidy Data
Tidy data is in essence data that is easy to understand by people and that is formatted and structured
with the following rules in mind:
1. each variable has its own column;
2. each observation has its own row;
3. each type of observational unit forms a table;
4. a value (or NA) in each cell (a “cell” is the intersection between a row and a column).
The concept of tidy data is so important that we will devote a whole section to tidy data (Sec-
tion 17.2 “Tidy Data” on page 275) and how to make data tidy (Chapter 17 “Data Wrangling in
the tidyverse” on page 265). For now, it is sufficient to have the previous rules in mind. This will
allow us to introduce the tools of the tidyverse first and then later come back to making data tidy
by using these tools.
2 More information can be found in this article by Hadley Wickham: https://tidyverse.tidyverse.org/articles/manifesto.html.
3 A notable exception here is ggplot2. This package uses operator overloading instead of piping (overloading of
the + operator).
Tidy Conventions
The tidyverse also enforces some rules to keep code tidy. The aims are to make code easier to read,
to reduce potential misunderstandings, etc.
For example, we remember the convention that R uses to implement its S3 object oriented
programming framework from Section 6.2 “S3 Objects” on page 91. In that section we have
explained how R finds, for example, the right method (function) to use when printing an object via
the generic dispatcher function print(). When an object of class “glm” is passed to print(),
then the function will dispatch the handling to the function print.glm().
However, this is also true for data frames: the handling is dispatched to print.data.frame().
This example illustrates how at this point it becomes unclear whether the function print.data.frame()
is the data.frame-specific method for the print() function or the frame-specific method of a
hypothetical generic print.data(). Therefore, the tidyverse recommends naming conventions
that avoid the dot (.) and use snake_case or UpperCase style instead.
More about programming style in the tidyverse can be found in the online manifesto on
the tidyverse website: https://tidyverse.tidyverse.org/articles/manifesto.html.
Loading the tidyverse will report on which packages are included:
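For example (the exact start-up message depends on the versions installed, so it is only described in the comment):

library(tidyverse)
# The start-up message lists the attached core packages (ggplot2, tibble,
# tidyr, readr, purrr, dplyr, stringr and forcats) and warns about
# conflicts such as dplyr::filter() masking stats::filter().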
So, loading the library tidyverse actually loads a series of other packages. The collection of
these packages is called the “core-tidyverse.”
Further, loading the tidyverse also informs you about which potential conflicts may occur.
For example, we see that calling the function filter() will dispatch to dplyr::filter()
(i.e. “the function filter in the package dplyr”), while before loading the tidyverse, the function
stats::filter() would have been called.4
Digression – Calling methods of not loaded packages
When a package is not loaded, it is still possible to call its member functions. To call a
function from a certain package, we can use the :: operator.
In other words, when we use the :: operator, we specify in which package this function
should be found. Therefore, it is possible to use a function from a package that is not loaded
or that is superseded by a function with the same name from a package that got loaded later.
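For example, the two calls below use the function filter() from two different packages; the small data frame d is just an illustration.

d <- data.frame(x = 1:5)
dplyr::filter(d, x > 3)          # filter() from dplyr: select rows of d
stats::filter(1:5, rep(1/3, 3))  # filter() from stats: a moving average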
R allows you to stand on the shoulders of giants: when making your analysis, you can rely
on existing packages. It is best to use packages that are part of the tidyverse whenever there is a
choice. Doing so, your code will be more consistent and readable, and it will become overall a more
satisfying experience to work with R.
• tidyr provides a set of functions that help you to tidy up data and make adhering to
the rules of tidy data easier.
4 Here we use the notation package1::function1() to make clear that function1 is the one defined in
package1.
The idea of tidy data is really simple: it is data where every variable has its own column,
and every column is a variable. For more information, see Chapter 17.3 “Tidying Up Data
with tidyr” on page 277.
• dplyr provides a grammar of data manipulation, providing a consistent set of verbs that
solve the most common data manipulation challenges. For more information, see Chap-
ter 17 “Data Wrangling in the tidyverse” on page 265.
• ggplot2 is a system to create graphics with a philosophy: it adheres to a “Grammar of
Graphics” and is able to create really stunning results at a reasonable price (it is a notch
more abstract to use than the core-R functionality). For both reasons, we will talk more
about it in the sections about reporting: see Chapter 31 “A Grammar of Graphics with
ggplot2” on page 687.
• readr expands R’s standard5 functionality to read in rectangular6 data.
It is more robust, knows more data types and is faster than the core-R functionality. For
more information, see Chapter 17.1.2 “Importing Flat Files in the Tidyverse” on page 267
and its subsections.
• purrr is casually mentioned in the section about the OO model in R (see Chapter 6 on
page 87), and extensively used in Chapter 25.1 “Model Quality Measures” on page 476.
It is a rather complete and consistent set of tools for working with functions and vectors.
Using purrr, it should be possible to replace most loops with calls to purrr functions that
will work faster.
• tibble is a new take on the data frame of core-R. It provides a new base type: tibbles.
Tibbles are in essence data frames that do a little less (so there is less clutter on the screen and fewer unexpected things happen), but give more feedback (they show what went wrong instead of assuming that you have read all the manuals and remember everything). Tibbles are introduced in the next section.
• stringr expands the standard functions to work with strings and provides a nice, coherent set of functions that all start with str_.
The package is built on top of stringi, which uses the ICU library that is written in C, so it is fast too. For more information, see Chapter 17.5 “String Manipulation in the tidyverse” on page 299.
• forcats provides tools to address common problems when working with categorical variables.7
7 Categorical variables take values from a limited set; these values might not have a (strict) order relation. For example, “sex” (M or F) would not have an order, but salary brackets might have.
• Importing data: readxl for .xls and .xlsx files, and haven for SPSS, Stata, and SAS data.8
• Wrangling data: lubridate for dates and date-times, hms for time-of-day values, and blob for storing binary data. lubridate – for example – is discussed in Chapter 17.6 “Dates with lubridate” on page 314.
• Programming: purrr for iterating within R objects, magrittr which provides the famous pipe operator %>% plus some more specialised piping operators (such as %$% and %<>%), and glue which provides an enhancement to the paste() function.
• Modelling: this area is not really ready yet, though recipes and rsample are already operational and show the direction this is taking. The aim is to replace modelr.9 Note that there is also the package broom, which turns models into tidy data.
While the core tidyverse is stable, the packages that are not core still tend to change and improve. Check their online documentation when using them.
8 Of course, if you need something else, you will want to use the package that does exactly what you want. Here are some good ones that adhere largely to the tidyverse philosophy: jsonlite for JSON, xml2 for XML, httr for web APIs, rvest for web scraping, and DBI for relational databases — a good resource is http://db.rstudio.com.
9 The lack of coherent support for the modelling and reporting area makes clear that the tidyverse is not yet a candidate to service the whole development cycle of a company. Modelling departments might want to have a look at the tidymodels package.
7.3.1 Tibbles
Tibbles are in many aspects a special type of data frame. They do the same as data frames (i.e. store rectangular data), but they have some advantages.
Let us dive in and create a tibble. Imagine for example that we want to show the sum of the
sine and cosine functions. The output of the code below is in Figure 7.1 on this page.
x <- seq(from = 0, to = 2 * pi, length.out = 100)
s <- sin(x)
c <- cos(x)
z <- s + c
plot(x, z, type = "l",col="red", lwd=7)
lines(x, c, col = "blue", lwd = 1.5)
lines(x, s, col = "darkolivegreen", lwd = 1.5)
Imagine further that our purpose is not only to plot these functions, but to use them in other applications. Then it would make sense to put them in a data frame. The following code does exactly the same using a data frame.
x <- seq(from = 0, to = 2 * pi, length.out = 100)
df <- cbind(as.data.frame(x), cos(x), sin(x), cos(x) + sin(x))
# plot etc.
This is already more concise. With the tidyverse, it would look as follows (still without using
the piping):
library(tidyverse)
x <- seq(from = 0, to = 2 * pi, length.out = 100)
tb <- tibble(x, sin(x), cos(x), cos(x) + sin(x))
The code below first prints the tibble in the console and then plots the results in Figure 7.2 on
this page.
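A minimal sketch of that code (the plotting details are assumptions, not the author's exact code):

print(tb)                       # a tibble prints a compact summary of itself
plot(tb$x, tb$`cos(x) + sin(x)`, type = "l", col = "red", lwd = 7)
lines(tb$x, tb$`cos(x)`, col = "blue", lwd = 1.5)
lines(tb$x, tb$`sin(x)`, col = "darkolivegreen", lwd = 1.5)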
The code with a tibble is just a notch shorter, but that is not the point here. The main advantage
in using a tibble is that it will usually do things that make more sense for the modern R-user. For
example, consider how a tibble prints itself (compared to what a data frame does).
tb$`sin(x)`[1]
## [1] 0
This convention is not specific to tibbles; it is used throughout R (e.g. the same back-ticks are needed in ggplot2, tidyr, dplyr, etc.).
Hint
Be aware of the saying “They have to recognize that great responsibility is an inevitable consequence of great power.”a It is not because you can do something that you must do it. Indeed, you can use numeric column names in a tibble, and the following is valid code.
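A hypothetical illustration (not the author's example) of such a column name:

# Back-ticks make it possible to use a name that starts with a digit:
tb2 <- tibble(`2019` = 1:3, `2020` = 4:6)
tb2$`2019`
## [1] 1 2 3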
a Published in the French National Convention of 8 May 1793 (see con (1793) – page 72). Since then, many leaders and writers of comic books have used variants of this phrase.
1. It will do fewer things (such as changing strings into factors, creating row names, changing names of variables, or partial matching – but it gives a warning message when you try to access a column that does not exist, etc.).
2. A tibble will report more errors instead of doing something silently (data type conversions, import, etc.), so they are safer to use.
3. The specific print function for the tibble, print.tibble(), will not overrun your screen with thousands of lines: it reports only the first ten rows. If you need to see more, the traditional head(tibble) will still work, or you can tweak the behaviour of the print function via options() (see below).
4. The name of the class itself is not confusing. Whereas the function print.data.frame() could be the specific print method for a data.frame, it could in principle also be the specific print.data method for a frame object. The name of the class tibble does not use the dot and hence cannot be confusing.
# -- data frame --
df <- data.frame("value" = pi, "name" = "pi")
df$na # partial matching of column names
## [1] pi
## Levels: pi
df[,c("name", "value")]
## name value
## 1 pi 3.141593
# -- tibble --
df <- tibble("value" = pi, "name" = "pi")
df$name # column name
## [1] "pi"
This partial matching is one of the nicer features of R, and it certainly was an advantage for interactive use. However, when using R in batch mode, it can be dangerous. Partial matching is especially dangerous in a corporate environment: datasets can have hundreds of columns and many names look alike, e.g. BAL180801, BAL180802, and BAL180803. Up to a certain point it is safe to use partial matching, since it will only work when R is sure that it can identify the variable uniquely. But it is bound to happen that someone adds new columns and suddenly someone else’s code stops working (because now R gets confused).
options(
  tibble.print_max = n,  # if there are more than n rows,
  tibble.print_min = m,  # print only the m first
                         # (set n to Inf to show all rows)
  tibble.width     = l   # max. number of columns to print
                         # (set to Inf to show all columns)
)
Tibbles are also data frames, and most older functions – that are unaware of tibbles – will work just fine. However, it may happen that some function does not work. If that happens, it is possible to coerce the tibble back into a data frame with the function as.data.frame().
tb <- tibble(c("a", "b", "c"), c(1,2,3), 9L, 9)
is.data.frame(tb)
## [1] TRUE

# Only values of length one are recycled:
tibble(c("a", "b", "c"), c(1, 2))
## Error: Tibble columns must have consistent lengths, only values of length one are
## recycled:
## * Length 2: Column `c(1, 2)`
## * Length 3: Column `c("a", "b", "c")`
The function view(tibble) works as expected and is most useful when working with RStudio, where it will open the tibble in a special tab.
While on the surface a tibble does the same as a data.frame, tibbles have some crucial advantages and we warmly recommend using them.
This can also be written with the piping operator from magrittr.
What R does behind the scenes is feed the output of the expression left of the pipe operator as the main (first) input to the function right of the pipe operator. This means that the following are equivalent:
# 1. pipe:
a %>% f()
# 2. pipe with shortened function:
❦ a %>% f ❦
# 3. is equivalent with:
f(a)
a <- c(1:10)
a %>% mean()
## [1] 5.5
a %>% mean
## [1] 5.5
mean(a)
## [1] 5.5
It might be useful to pronounce the pipe operator %>% as “then” to understand what it does.
10 R’s piping operator is very similar to the piping command that you might know from most CLI shells of popular *nix systems, where commands like the following can go a long way: dmesg | grep "Bluetooth" – though differences will appear in more complicated commands.
Some functions will only calculate their arguments when they are really needed (“lazy evaluation”). There is of course a good reason why those functions use lazy evaluation, and the reader will not be surprised that they cannot be used in a pipe. There are many functions that use lazy evaluation, but the most notable are the error handlers: functions that try to do something, but when an error is thrown or a warning message is generated, they hand it over to the relevant handler. Examples are try, tryCatch, etc. We do not really discuss error handling in other parts of this book, so here is a quick primer.
# f1
# Dummy function of which only the error-throwing part
# is shown.
f1 <- function() {
# Here goes the long code that might be doing something risky
# (e.g. connecting to a database, uploading file, etc.)
# and finally, if it goes wrong:
stop("Early exit from f1!") # throw error
}
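A sketch of how such an error is typically intercepted (this is not the author's original example):

tryCatch(
  f1(),                                    # the risky call
  error   = function(e) cat("Caught:", conditionMessage(e), "\n"),
  finally = cat("This runs with or without an error.\n")
)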
As can be understood from the example above, the error handler should not be evaluated if f1 does not throw an error. That is why error handlers rely on lazy evaluation of their arguments. So the following will not work:
# f1
# Dummy function of which only the error-throwing part
# is shown.
f1 <- function() {
  # Here goes the long code that might be doing something risky
  # (e.g. connecting to a database, uploading a file, etc.)
  # and finally, if it goes wrong:
  stop("Early exit from f1!")  # throw error
}
There is a lot more to error catching than meets the eye here. We recommend reading the documentation of the relevant functions carefully. Another good place to start is “Advanced R,” page 163, Wickham (2014).
Another issue when using the pipe operator %>% occurs when functions explicitly use the current environment. With those functions, one will have to be explicit about which environment to use. More about environments and scoping can be found in Chapter 5 on page 81.
The aforementioned code fails. This is because R will not automatically add something like data = t and use the t as defined on the line before. The function lm() expects the formula as its first argument, whereas the pipe command would put the data in the first argument. Therefore, magrittr provides a special pipe operator that passes on the variables of the data frame from the line before, so that they can be addressed directly: the %$% operator.
# The Tidyverse only makes the %>% pipe available. So, to use the
# special pipes, we need to load magrittr
library(magrittr)
##
## Attaching package: ’magrittr’
## The following object is masked from ’package:purrr’:
##
## set_names
## The following object is masked from ’package:tidyr’:
##
## extract
11 The function lm() generates a linear model in R of the form $y = a_0 + \sum_{i=1}^{n} a_i x_i$. More information can be found in Section 21.1 “Linear Regression” on page 375. The functions summary() and coefficients() that are used on the following pages are also explained there.
Note how we can omit the brackets for functions that do not take any argument.
library(magrittr)
t <- tibble("x" = runif(100)) %>%
within(y <- 2 * x + 4 + rnorm(10, mean=0, sd=0.5)) %T>%
plot(col="red") # The function plot does not return anything
# so we used the %T>% pipe.
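With the data frame t in place, the %$% pipe can be used along the following lines (a sketch in the spirit of Figure 7.3, not necessarily the author's exact code):

t %$% lm(y ~ x) %>% summary() %>% coefficients()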
Figure 7.3: A linear model fit on generated data to illustrate the piping command.
# Show x:
x
## [1] 2
We recommend using this pipe operator only when no confusion is possible. We also argue that this pipe operator makes code less readable, while not really making the code shorter.
7.3.5 Conclusion
When you come from a background of compiled languages that provide fine-grained control over memory management (such as C or C++), you might not directly see the need for pipes. However, piping does reduce the amount of text that needs to be typed and makes the code more readable.
Indeed, the piping operator will not provide a speed increase nor a memory advantage, even if we would create a new variable at every line. R has pretty good memory management and it only copies columns when they are really modified. For example, have a careful look at the following:
library(pryr)
x <- runif(100)
object_size(x)
## 840 B
y <- x
y <- y * 2
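A possible continuation of this illustration (a sketch; exact sizes depend on your session):

y <- x
object_size(x, y)  # x and y still share the same vector: no extra memory
y <- y * 2         # modifying y forces a copy ...
object_size(x, y)  # ... so the combined size roughly doubles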
The piping operator can be confusing at first and is not really necessary (except to read code that uses it). However, it has the advantage of making code more readable – once one is used to it – and it also makes code shorter. Finally, it allows the reader of the code to focus more on what is going on (the actions instead of the data, since the data is passed on invisibly).
Pipes are like spices in the kitchen. Use them, but do so with moderation. A good rule of thumb is that five lines is enough, and simple one-line commands do not need to be broken down into more lines just in order to use a pipe.
♣ 8 ♣
Elements of Descriptive Statistics

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode.
The mean, median, and mode are all valid measures of central tendency, but under different conditions some measures of central tendency become more appropriate to use than others. In the following sections, we will look at the mean, mode, and median, and learn how to calculate them and under what conditions they are most appropriate to be used.
Probably the most used measure of central tendency is the “mean.” In this section we will start from the arithmetic mean, but also illustrate some other concepts that might be better suited in some situations.
$\bar{x} = \sum_{n=1}^{N} P(x_n)\, x_n$   (for discrete distributions)
$\bar{x} = \int_{-\infty}^{+\infty} x\, f(x)\, \mathrm{d}x$   (for continuous distributions)
Hint – Outliers
The mean is highly influenced by outliers. To mitigate this to some extent, the parameter trim allows removing the tails: it will sort all values and then remove the x% smallest and x% largest observations.
v <- c(1,2,3,4,5,6000)
mean(v)
## [1] 1002.5
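For example, trimming 20% of the observations on each side removes the extreme value before averaging:

mean(v, trim = 0.2)
## [1] 3.5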
Definition: f-mean
$\bar{x} = f^{-1}\!\left(\frac{1}{K}\sum_{k=1}^{K} f(x_k)\right)$
1 More information about the concept “dispatcher function” is in Chapter 6 “The Implementation of OO” on
page 87.
Special cases of the f-mean include:
• $f(x) = \frac{1}{x}$: the harmonic mean,
• $f(x) = x^m$: the power mean,
• $f(x) = \ln x$: the geometric mean, so $\bar{x} = \left(\prod_{k=1}^{K} x_k\right)^{1/K}$.

One particular generalized mean is the power mean or Hölder mean. It is defined for a set of K positive numbers $x_k$ by
$\bar{x}(m) = \left(\frac{1}{K}\sum_{k=1}^{K} x_k^m\right)^{1/m}$
By choosing particular values for m, one gets the quadratic, arithmetic, geometric, and harmonic means:
• $m \to +\infty$: maximum of $x_k$
• $m = 2$: quadratic mean
• $m = 1$: arithmetic mean
• $m \to 0$: geometric mean
• $m = -1$: harmonic mean
• $m \to -\infty$: minimum of $x_k$
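A small sketch (not from the book) that illustrates these special cases in R:

x <- c(1, 2, 4, 8, 16)
power_mean <- function(x, m) mean(x^m)^(1/m)
power_mean(x, 2)    # quadratic mean
power_mean(x, 1)    # arithmetic mean: identical to mean(x)
power_mean(x, -1)   # harmonic mean:   identical to 1 / mean(1/x)
exp(mean(log(x)))   # geometric mean (the limit m -> 0)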
What is the average return when you know that the share price had the following returns: −50%, +50%, −50%, +50%? Try the arithmetic mean and the mean of the log-returns.
# The returns from the example and an initial investment of 1:
returns <- c(-0.5, 0.5, -0.5, 0.5)
V_0     <- 1

# Arithmetic mean:
aritmean <- mean(returns)

# The ln-mean:
log_returns <- returns
for(k in 1:length(returns)) {
  log_returns[k] <- log(returns[k] + 1)
}
logmean <- mean(log_returns)
exp(logmean) - 1
## [1] -0.1339746

# Value of the investment based on the arithmetic mean of the returns:
V_0 * (aritmean + 1)
## [1] 1
x <- c(1:5,5e10,NA)
x
## [1] 1e+00 2e+00 3e+00 4e+00 5e+00 5e+10 NA
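The NA propagates into the mean unless we ask R to remove it; a sketch of the likely continuation:

mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 8333333336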
# my_mode
# Finds the first mode (only one)
# Arguments:
#    v -- numeric vector or factor
# Returns:
#    the first mode
my_mode <- function(v) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  uniqv[which.max(tabv)]
}
While this function works fine on the examples provided, it only returns the first mode
encountered. In general, however, the mode is not necessarily unique and it might make sense
to return them all. This can be done by modifying the code as follows:
# my_mode
# Finds the mode(s) of a vector v
# Arguments:
# v -- numeric vector or factor
# return.all -- boolean -- set to true to return all modes
# Returns:
# the modal elements
my_mode <- function(v, return.all = FALSE) {
  uniqv <- unique(v)
  tabv  <- tabulate(match(v, uniqv))
  if (return.all) {
    uniqv[tabv == max(tabv)]
  } else {
    uniqv[which.max(tabv)]
  }
}
# example:
x <- c(1,2,2,3,3,4,5)
my_mode(x)
## [1] 2
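Requesting all modes returns both values that occur twice (an illustration, not from the book):

my_mode(x, return.all = TRUE)
## [1] 2 3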
We were confident that it was fine to override the definition of the function my_mode. Indeed, if the function was already used in some older code, then one would expect to see only one mode appear. That behaviour is still the same, because we chose the default value for the optional parameter return.all to be FALSE. If the default had been TRUE, then older code would produce wrong results; and if we had not used a default value, then older code would fail to run.
8.2 Measures of Variation or Spread

Variation or spread measures how much observations differ from the mean or another central measure. If variation is small, one can expect observations to be close to each other.
Definition: Variance
$\mathrm{VAR}(X) = E\left[\left(X - \bar{X}\right)^2\right]$

Definition: Standard deviation
$\mathrm{SD}(X) = \sqrt{\frac{1}{N-1}\sum_{n=1}^{N}\left(X_n - \bar{X}\right)^2}$
t <- rnorm(100, mean=0, sd=20)
var(t)
## [1] 248.2647
sd(t)
## [1] 15.75642
sqrt(var(t))
## [1] 15.75642
Definition: mad (median absolute deviation)
$\mathrm{mad}(X) = 1.4826\;\mathrm{median}\left(\left|X - \mathrm{median}(X)\right|\right)$
The constant 1.4826 is the default scale factor in R; it can be changed via the argument constant.

mad(t)
## [1] 14.54922
mad(t, constant = 1)
## [1] 9.813314
This scale factor is chosen such that, for normally distributed data, the mad is a consistent estimator of the standard deviation: $E[\mathrm{mad}(X_1, \ldots, X_n)] = \sigma$.
When there is more than one variable, it is useful to understand what the interdependencies of the variables are. For example, when measuring the size of people’s hands and their height, one can expect that people with larger hands are on average taller than people with smaller hands. Hand size and height are positively correlated.

$\rho_{XY} = \frac{\mathrm{covar}(X, Y)}{\sigma_X\,\sigma_Y}, \qquad \text{with } \mathrm{covar}(X, Y) := E\big[(X - E[X])(Y - E[Y])\big]$
cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479
Of course, we also have functions that provide the covariance matrix, and functions that convert the one into the other.
d <- mtcars[, c("mpg", "wt", "hp")]   # the columns used below
cov2cor(cov(d))
##            mpg         wt         hp
## mpg  1.0000000 -0.8676594 -0.7761684
## wt  -0.8676594  1.0000000  0.6587479
## hp  -0.7761684  0.6587479  1.0000000
The correlation between x and x² is zero, and the correlation between x and exp(x) is a meagre 0.527173.
The Spearman correlation is the correlation applied to the ranks of the data. It is one if an increase in the variable X is always accompanied by an increase in the variable Y.
cor(rank(df$x), rank(df$x_exp))
## [1] 1
The Spearman correlation checks for a relationship that can be more general than only a linear one: it will be one if Y increases whenever X increases.
Question #10
Plot y as a function of x. What is their Pearson correlation? What is their Spearman correlation? How do you interpret that?
Not even the Spearman correlation will discover all types of dependencies. Consider the example above with x².
x <- c(-10:10)
cor(rank(x), rank(x^2))
## [1] 0
Chi-Square test in R
chisq.test(data)
where data is the data in the form of a table containing the count values of the variables.
For example, we can use the mtcars dataset, which is most probably loaded when R is initialised.
The chi-square test reports a p-value: the probability of observing an association at least as strong as the one in the data if the variables were actually independent. In practice, a p-value below 0.05 is usually taken as evidence of a significant association. In this example, the p-value is higher than 0.05, so there is no significant association.
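A sketch of such a test (the choice of variables is an assumption): do the number of cylinders and the type of transmission in mtcars depend on each other?

tbl <- table(mtcars$cyl, mtcars$am)  # counts per combination
chisq.test(tbl)                      # prints the X-squared statistic and p-value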
8.4 Distributions

R is a statistical language and most of the work in R will include statistics. Therefore, we introduce the reader to how statistical distributions are implemented in R and how they can be used.
The names of the functions related to statistical distributions in R are composed of two sections: the first letter refers to the type of function (d, p, q, or r – see below) and the remainder is the name of the distribution.
Distribution    R-name    Distribution         R-name
Normal          norm      Weibull              weibull
Exponential     exp       Binomial             binom
Log-normal      lnorm     Negative binomial    nbinom
Logistic        logis     χ²                   chisq
Geometric       geom      Uniform              unif
Poisson         pois      Gamma                gamma
t               t         Cauchy               cauchy
f               f         Hypergeometric       hyper
Beta            beta

Table 8.1: Common distributions and their names in R.
As all distributions work in a very similar way, we use the normal distribution to show how the logic works.
R has four built-in functions to work with the normal distribution. They are described below.
• dnorm(x, mean, sd): the probability density function
• pnorm(x, mean, sd): the cumulative distribution function (the probability of an observation being lower than x)
• qnorm(p, mean, sd): gives the number whose cumulative probability matches the given probability value p
• rnorm(n, mean, sd): generates n random numbers drawn from the normal distribution
with
• x: a vector of numbers
• p: a vector of probabilities
In the following example we generate data with the random generator function rnorm() and
then compare the histogram of that data with the ideal probability density function of the Normal
distribution. The output of the following code is Figure 8.1 on this page.
Figure 8.1: A comparison between a set of random numbers drawn from the normal distribution
(khaki) and the theoretical shape of the normal distribution in blue.
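A sketch of code that produces a plot in the spirit of Figure 8.1 (the mean, standard deviation, and styling are assumptions):

obs <- rnorm(1000, mean = 10, sd = 3)
hist(obs, col = "khaki3", freq = FALSE, border = "khaki3")
x <- seq(from = 0, to = 20, by = 0.001)
lines(x, dnorm(x, mean = 10, sd = 3), col = "blue", lwd = 2)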
In this simple illustration, we will compare the returns of the index S&P500 to the normal distribution. The output of the following code is Figure 8.2 on this page.
Figure 8.2: The same plot for the returns of the SP500 index seems acceptable, though there are
outliers (where the normal distribution converges fast to zero).
library(MASS)
hist(SP500,col="khaki3",freq=FALSE,border="khaki3")
x <- seq(from=-5,to=5,by=0.001)
lines(x, dnorm(x,mean(SP500),sd(SP500)),col="blue",lwd=2)
A better way to check for normality is to study the Q-Q plot. A Q-Q plot compares the sample quantiles with the quantiles of the distribution, and it makes very clear where deviations appear.
library(MASS)
qqnorm(SP500,col="red"); qqline(SP500,col="blue")
From the Q-Q plot in Figure 8.3 on page 153 (that is generated by the aforementioned code block), it is clear that the returns of the S&P 500 index are not normally distributed. Outliers far from the mean appear much more often than the normal distribution would predict. In other words: returns on stock exchanges have “fat tails.”
Figure 8.3: A Q-Q plot is a good way to judge if a set of observations is normally distributed or not.
As for all distributions, R has four built-in functions for the binomial distribution:
• dbinom(x, size, prob): the density function
• pbinom(x, size, prob): the cumulative probability of an event
• qbinom(p, size, prob): gives the number whose cumulative probability matches a given probability value
• rbinom(n, size, prob): generates random variables following the binomial distribution
The following parameters are used:
• x: a vector of numbers
• p: a vector of probabilities
Figure 8.4: The probability of getting at most x tails when flipping a fair coin, illustrated with the binomial distribution.
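A sketch of code that could produce a figure like Figure 8.4 (the styling is an assumption):

x <- 1:10                               # number of tails
y <- pbinom(x, size = 10, prob = 0.5)   # P(at most x tails in 10 tosses)
plot(x, y, type = "b",
     xlab = "Number of tails", ylab = "prob of maximum x tails")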
Similar to the normal distribution, random draws of the binomial distribution can be obtained via the function that starts with the letter “r”: rbinom().
# Find 20 random numbers of tails from an event of 10 tosses
# of a coin
rbinom(20, 10, .5)
## [1] 5 7 2 6 7 4 6 7 3 2 5 9 5 9 5 5 5 5 5 6
In Chapter 4 “The Basics of R” on page 21, we presented some of the basic functions of R, which – of course – include some of the most important functions to describe data (such as the mean and the standard deviation).
Mileage may vary, but in much research people want to document what they have done and will need to include some summary statistics in their paper or model documentation. The standard summary of the relevant object might be sufficient.
N <- 100
t <- data.frame(id = 1:N, result = rnorm(N))
summary(t)
## id result
## Min. : 1.00 Min. :-1.8278
## 1st Qu.: 25.75 1st Qu.:-0.5888
## Median : 50.50 Median :-0.0487
## Mean : 50.50 Mean :-0.0252
## 3rd Qu.: 75.25 3rd Qu.: 0.4902
## Max. :100.00 Max. : 2.3215
This already produces a neat summary that can directly be used in most reports.2
library(tidyverse) # not only for %>% but also for group_by, etc.
# In mtcars the type of the car is only in the column names,
# so we need to extract it to add it to the data
n <- rownames(mtcars)
# Now, add a column brand (use the first letters of the type)
t <- mtcars %>%
mutate(brand = str_sub(n, 1, 4)) # add column
To achieve this, the function group_by() from dplyr will be very handy. Note that this function does not change the dataset as such; it rather adds a layer of information about the grouping.
# First, we need to find out which are the most abundant brands
# in our dataset (set cutoff at 2: at least 2 cars in database)
top_brands <- count(t, brand) %>% filter(n >= 2)
2 In Chapter 32 “R Markdown” on page 699 and Chapter 33 “knitr and LaTeX” on page 703 it will be explained how these results from R can directly be used in reports without the need to copy and paste things.
top_brands
## # A tibble: 5 x 2
##   brand     n
##   <chr> <int>
## 1 Fiat      2
## 2 Horn      2
## 3 Mazd      2
## 4 Merc      7
## 5 Toyo      2
The sections on knitr and rmarkdown (respectively Chapter 33 on page 703 and Chapter 32 on page 699) will explain how to convert this output via the function kable() into Table 8.2.
There are a few things about group_by() and summarise() that should be noted in order to make working with them easier. For example, summarise() works opposite to group_by() and hence will peel back the last level of an existing grouping, it is possible to use expressions in group_by(), new groups will by default replace existing ones, etc. These aspects are illustrated in the following code.
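A minimal sketch of these three points (using the data frame t built above; group_vars() is only used here to show the resulting grouping, and this is not necessarily the author's original code):

# summarise() peels off the last level of the grouping:
t %>% group_by(brand, cyl) %>% summarise(avg_mpg = mean(mpg)) %>% group_vars()
## [1] "brand"

# Expressions can be used directly in group_by():
t %>% group_by(heavy = wt > 3.2) %>% summarise(avg_mpg = mean(mpg), n = n())

# A second group_by() replaces the existing grouping by default:
t %>% group_by(brand) %>% group_by(cyl) %>% group_vars()
## [1] "cyl"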
♣ 9 ♣
Visualisation Methods

This section demonstrates – in no particular order – some of the most useful plotting facilities in R.1
The most important function to plot anything is plot(). The OO implementation of R is function centric, and the plot function will recognize what object is fed to it and then dispatch to the relevant method. The effect is that you can provide a wide variety of objects to the function plot() and that R will “magically” present something that makes sense for that particular object.
We illustrate this with two very basic and simple objects, a vector and a data frame (the plots appear respectively in Figure 9.1 on page 160 and Figure 9.2 on page 160):
x <- c(1:20)^2
plot(x)
❦ ❦
df <- data.frame('a' = x, 'b' = 1/x, 'c' = log(x), 'd' = sqrt(x))
plot(df)
If the standard plot for your object is not what you need, you can choose one of the many
specific plots that we illustrate in the remainder of this chapter.
1 The modular structure of R allows building extensions that enhance the functionality of R. The plotting facility has many extensions, and arguably the most popular is ggplot2, which is described in Chapter 31 “A Grammar of Graphics with ggplot2” on page 687.
Figure 9.1: The plot-function will generate a scatter-plot for a vector. Note also that the legend is automatically adapted. The x-axis is the index of the number in the vector and the y-axis is the value of the corresponding number in the vector.
Figure 9.2: The plot-function will generate a scatter-plot of each column in function of each other column for a data frame. The main diagonal of the matrix has the names of the columns. Note also that the x and y axes are labelled only on one side and that each row shares the same y-axis, while each column shares one x-axis.
9.1 Scatterplots

Scatterplots are probably the first type of plot one can think of: they show points plotted on the Cartesian plane. Each point represents the combination of two variables. One variable is chosen for the horizontal axis and another for the vertical axis.

plot(x, y, main, xlab, ylab, xlim, ylim, axes, ...)
With the argument pch (that is short for “plot character”), it is possible to change the symbol that is displayed on the scatterplot. Integer values from 0 to 25 specify a symbol as shown in Figure 9.3. It is possible to change the colour via the argument col. pch values from 21 to 25 are filled symbols that allow you to specify a second colour bg for the background fill. Most other characters supplied to pch will plot themselves.
Would you like to see the code that generated the plot Figure 9.3 on page 162? Please refer
to Chapter D “Code Not Shown in the Body of the Book” on page 840.
To illustrate the scatterplot, we use the dataset mtcars, and try to gain some insight into how the fuel consumption depends on the horsepower of a car – the output is in Figure 9.4 on page 162.
# Import the data
library(MASS)

# mpg2l
# Converts miles per gallon into litres per 100 km
# Arguments:
#    mpg -- numeric -- fuel consumption in MPG
# Returns:
#    numeric -- fuel consumption in litres per 100 km
mpg2l <- function(mpg = 0) {
  100 * 3.785411784 / 1.609344 / mpg
}
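The plotting step itself could look as follows (a sketch; the axis labels follow Figure 9.4, the rest of the styling is an assumption):

plot(mtcars$hp, mpg2l(mtcars$mpg),
     xlab = "Horse Power", ylab = "L per 100km")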
Figure 9.3: Some plot characters. Most other characters will just plot themselves.
[Figure 9.4: a scatterplot of “L per 100km” (y-axis) against “Horse Power” (x-axis).]
plot(x, type, main, xlab, ylab, xlim, ylim, axes, sub, asp, ...)

A line-plot example
The following code first generates data and then plots it in a line chart. The output of the code is in Figure 9.5 on the next page.
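A sketch with made-up data (the book's exact example is not reproduced here; the axis labels follow Figure 9.5):

years <- 2000:2017
sales <- 2000 + 75 * (years - 2000) + rnorm(length(years), sd = 150)
plot(years, sales, type = "l", col = "red", lwd = 2,
     xlab = "Years", ylab = "Sales in USD")
text(years[which.max(sales)], max(sales), "top sales", pos = 1)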
[Figure 9.5: a line chart of “Sales in USD” (y-axis) against “Years” (x-axis), with the maximum annotated as “top sales”.]
A pie chart is a simple but effective way to represent data when the total is 100%. The total circumference of the circle is that 100%, and the angle of each section corresponds to the percentage of one of the categories.
To use this plot, we need a variable that is categorical (a “factor” in R) and a corresponding numerical value. For example, the sales (numerical) per country (the categorical variable).
Avoid this plot when it is not clear what the 100% actually is or when there are too many categories.
• radius: the radius of the circle of the chart (value between −1 and +1)
• clockwise: a logical value indicating if the slices are drawn clockwise or anti-clockwise
[Figure 9.6: a pie chart with the slices “good”, “average”, and “bad”.]
The following code segment produces the pie chart that is in Figure 9.6.
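A sketch of such code (the numbers are made up for the illustration):

results <- c(good = 40, average = 35, bad = 25)
pie(results, col = c("green3", "khaki3", "red3"))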
Bar charts visualise data via the length of rectangles. It is one of the best-suited plots when there is a numerical value for each of several (non-numerical) categories.
The function barplot() is the key function for bar plots, but it can do more than a simple bar chart. For example, it is also possible to stack the bars, as shown in Figure 9.8 on page 168. This figure is produced by the following code:
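A sketch of such a stacked bar chart (the sales matrix and its labels are made up for the illustration):

sales <- matrix(c(50, 55, 60, 65,    # License
                  40, 42, 44, 46,    # Maintenance
                  30, 40, 55, 75),   # Consulting
                nrow = 3, byrow = TRUE,
                dimnames = list(c("License", "Maintenance", "Consulting"),
                                c("Q1", "Q2", "Q3", "Q4")))
barplot(sales, main = "Sales 2016", ylab = "Sales in EUR",
        col = c("khaki4", "khaki3", "khaki1"),
        legend.text = rownames(sales))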
[Figure: a simple bar chart “Sales 2016” showing “Sales in EUR” (y-axis) per region.]
Figure 9.8: A bar-chart based on a matrix will produce stacked bars. Note how nicely this plot conveys the seasonal trend in the data.
The function barplot() has no option to simply force all bars to 100% (equal length), but we can achieve this with the function prop.table().2 This function takes an argument margin, which is the index, or vector of indices, over which the proportions are to be calculated.
The following piece of code illustrates how this can work (the output is in Figure 9.9):
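A sketch (reusing the sales matrix assumed above): rescaling each column to proportions makes every bar total 100%:

sales_prop <- prop.table(sales, margin = 2)
barplot(sales_prop, main = "Sales 2016", ylab = "Sales in EUR",
        col = c("khaki4", "khaki3", "khaki1"),
        legend.text = rownames(sales))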
Figure 9.9: A bar chart where the total of each bar equals 100%. Note how the seasonal trend is obscured in this plot, but how it now tells the story of how the consulting business is growing and the maintenance business, relatively, cannot keep up with that growth.
2 This function is a wrapper that does the same as sweep(x, margin, margin.table(x, margin), "/"); the only difference is that if the argument margin has length zero, then one gets x/sum(x).
This type of plot should only be used if the total indeed is 100%, or if that concept
really makes sense. In any case, make sure that the legend of your plot describes what is
going on.
9.5 Boxplots

Boxplots are a particular form of plot that do an excellent job in summarising data. They show quartiles, outliers, and the median all in one plot.
An example can be seen in Figure 9.10 on page 172. Each category (4, 6, and 8) has one boxplot. The centre is a box that spans from the first quartile to the third. The median is the horizontal line in the box. The bars that stick out are called the “whiskers.” The whiskers stretch from the lowest value at the bottom to the highest value on the top; however, the size of the whiskers is limited to 1.5 times the interquartile range3. This behaviour can be changed via the parameter range.
• varwidth: a logical value (set to true to draw width of the box proportionate to the
sample size)
• names: the group labels which will be printed under each boxplot.
• range: this number determines how far the plot whiskers can reach. If it is positive,
then the whiskers extend to the most extreme data point which is no more than
“range” times the interquartile range from the box. If range is set to 0, then the
whiskers extend to the data extremes.
To illustrate boxplots, we consider the dataset mtcars, and use the following code to generate Figure 9.10 on page 172:
library(MASS)
boxplot(mpg ~ cyl,data=mtcars,col="khaki3",
main="MPG by number of cylinders")
Figure 9.10: Boxplots show information about the central tendency (median) as well as the spread
of the data.
Boxplots are great, but have some limitations. For example, the shape of the distribution is only shown by five points. This implies that it might hide, for example, that the data is bimodal4.
As described by Hintze and Nelson (1998), violin plots marry the advantages of boxplots with the added information of an approximation of the distribution: a violin plot is a boxplot with a rotated kernel density plot on each side.
In R, making violin plots is made easy by the package vioplot. The function vioplot() only plots violin plots, so – unlike with boxplot() – we have to use the function with() to get more of them in one plot.
This is illustrated in the following code and the output is in Figure 9.11 on page 174:
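A sketch of what that code could look like (the grouping by cylinders mirrors Figure 9.11; the exact styling is an assumption):

library(vioplot)
with(mtcars, vioplot(mpg[cyl == 4], mpg[cyl == 6], mpg[cyl == 8],
                     names = c("4", "6", "8"), col = "khaki3"))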
Drawing violin plots is also possible with the package ggplot2⁵. Below we show one possibility, and in Chapter 22.2.3 “The AUC” on page 396 we use another variation of violin plots. The output of the following code is in Figure 9.12 on page 174 and Figure 9.13 on page 175.
# Library
library(ggplot2)
# Second type
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_violin(aes(fill = factor(cyl)))
4 Bimodal means that the probability density function shows two maxima. This can mean that there are two distinct populations in the data. For example, in mtcars the group of eight-cylinder cars seems to be trimodal (composed of three different groups of cars: SUVs, sports cars, and luxury limousines).
5 The package ggplot2 will be described in more detail in Chapter 31 “A Grammar of Graphics with ggplot2” on
page 687.
Figure 9.11: Violin plot as provided by the function vioplot from the package of the same name.
Figure 9.12: Violin plot as traced by geom_violin provided by the library ggplot2; with the colouring done according to the number of cylinders.
Figure 9.13: Violin plot as traced by geom_violin provided by the library ggplot2; with the colouring done according to the factoring of the number of cylinders.
Further information – ggplot2: see Chapter 31 “A Grammar of Graphics with ggplot2” on page 687.
Every person who works with data will know a histogram. Histograms do a good job in visualising the probability distribution of observations. In base-R, the function hist() can produce a histogram with just one command.
• breaks: one of several possible ways to specify the breakpoints of the histogram (see the documentation of hist())
To illustrate how this function works, we will use the dataset ships, which is provided in the package MASS. The code below shows two variants in Figure 9.14 on page 177 and Figure 9.15 on page 177.
library(MASS)
incidents <- ships$incidents
# figure 1: with a rug and fixed breaks
hist(incidents,
col=c("red","orange","yellow","green","blue","purple"))
rug(jitter(incidents)) # add the tick-marks
[Figure 9.14: “Histogram of incidents” with the default frequency y-axis and a rug of tick-marks along the x-axis.]
Figure 9.15: In this histogram, the breaks are changed and the y-axis is now calibrated as a probability. Note that leaving freq=TRUE would give the wrong impression that there are more observations in the wider brackets.
In the aforementioned code, we also used the functions rug() and jitter().
The function rug() produces a small vertical line per observation along the x-axis. However, when some observations have the same value, the lines would overlap and only one line would be visible. To provide some insight into the density, the function jitter() adds random noise to the position.
While the function plot() allows one to draw functions, there is a specific function, curve(), that will plot functions. The following code illustrates this function by creating Figure 9.16, and it also makes clear how to add mathematical expressions to the plot:
This also shows the standard capacity of R to include mathematical formulae in its plots and even format annotations in a LaTeX-like markup language.
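A sketch of such a plot (the function drawn in Figure 9.16 is an assumption):

curve(dnorm(x), from = -4, to = 4, col = "blue", lwd = 2, ylab = "f(x)")
text(2, 0.3, expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))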
In this section we show how to make a plot that visualises real values in a 2D projection. This could be a heat-map, a transition matrix for a Markov chain, an elevation map, etc. For street maps, we refer to Section 36.2.1 “HTML-widgets” on page 719.
The function image() from the package graphics is a generic function to plot coloured rectangles. The following uses the dataset volcano from the package datasets and is inspired by the examples in its documentation (see R Core Team (2018)). The output of the last part of the code is shown in Figure 9.17.
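A sketch close to the example in the image() documentation (assumption: the author's code is similar):

x <- 1:nrow(volcano)
y <- 1:ncol(volcano)
image(x, y, volcano, col = terrain.colors(100), xlab = "", ylab = "")
contour(x, y, volcano, add = TRUE)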
Figure 9.17: A colour mapping combined with a contour plot provides a nice image of the heights of
Auckland’s Maunga Whau Volcano.
Long gone are Galileo’s times, when data was scarce and had to be produced carefully. We live in a time where there is a lot of data around. Typically, commercial and scientific institutions have too much data to handle easily, or to see what is going on. Imagine, for example, having the data of a book of one million loans.
One way to get started is a heat-map. A heat-map is in essence a visualization of a matrix and is produced by the function heatmap(), as illustrated below. The output is in Figure 9.18.
d <- as.matrix(mtcars)
heatmap(d)
[Figure 9.18: a heatmap of the (unscaled) mtcars data, with row and column dendrograms.]
Note that this function by default will change the order of rows and columns in order to be able to reveal a pattern, and the heuristics are visualized by the dendrograms (top and left in Figure 9.18).
While this is interesting, the results for the low numbers are all red. This is because the numbers are not of the same nature (they have different scales), so we might want to rescale and reveal the hidden patterns per variable. This is done with the argument scale, with which we tell how we want the scaling to be done (rows or columns), as illustrated in the following code. The result is in Figure 9.19 on page 182.
heatmap(d,scale="column")
The function heatmap() has many more useful parameters. Below we describe the most relevant.
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:17pm Page 182
❦
Figure 9.19: Heatmap for the “mtcars” data with all columns rescaled
• scale: the default is “row”; it can be turned off by using “none” or switched to “column”;
• Rowv and Colv: determine if and how the row or column dendrogram should be computed and re-ordered. Use “NA” to suppress re-ordering;
• na.rm: a logical value that indicates how missing values have to be treated;
• labCol and labRow: these can be character vectors with row and column labels to use in the plot. Their default is rownames(x) or colnames(x) respectively (with x the matrix supplied to the function);
library("SnowballC")
library("RColorBrewer")
❦ library("wordcloud")
❦
This section will be developed around one example: the text version of this book.
This book is made in LaTeX, and hence it is easy to convert to a text file. The program detex comes with your LaTeX distribution (on Linux), and it is sufficient to use the following line on the Linux CLI:
detex r-book.tex > data/r-book.txt
And now we can walk through the workflow. The first step is, of course, reading in the text to be analysed: we first import a text file with the function readLines().
The function Corpus() from the package tm creates corpora. Corpora are collections of documents that contain human language. After these steps, the variable doc is an S3 object of class Corpus.
In this step, we remove all text that should not appear in our word-cloud. For example, if we would convert the PDF file to text (and not the LaTeX file), then the headers and footers of every page would also end up in the text. This means that, in our example, the name of the author would appear at least a thousand times and hence would become the dominant word in the word-cloud. We know who wrote this book, and hence in this step we would have to remove this name. We did start from the LaTeX file, however, so this is not necessary here.
The command inspect(doc) can display information about the corpus or any text document.
The tm_map() function is used to remove unnecessary white space, to convert the text to lower case, and to remove common stop-words like “the” and “we.”
The information value of “stop-words” is near zero due to the fact that they are so common in a language. Removing these kinds of words is useful before further analyses. For stop-words, the supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish. Language names are case sensitive and are provided without a capital (e.g. "english").
We will also show you how to make your own list of stop-words and how to remove them from the text. You can also remove numbers and punctuation with the “removeNumbers” and “removePunctuation” arguments; a sketch of these cleaning steps follows below.
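A sketch of those cleaning steps (the custom stop-word list is a hypothetical example):

doc <- tm_map(doc, stripWhitespace)                    # remove extra white space
doc <- tm_map(doc, content_transformer(tolower))       # convert to lower case
doc <- tm_map(doc, removeWords, stopwords("english"))  # standard stop-words
doc <- tm_map(doc, removeWords, c("will", "can"))      # your own stop-words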
Another important preprocessing step is “text stemming”: reducing words to their root form. This process removes suffixes from words in order to simplify them and to get to their common origin. For example, the words “functional” and “functions” would become “function.” The package SnowballC, which we loaded before, is able to do this stemming. It works as follows:
# Remove numbers
doc <- tm_map(doc, removeNumbers)
# Remove punctuations
doc <- tm_map(doc, removePunctuation)
# Text stemming
#doc <- tm_map(doc, stemDocument)
Note, however, that stemming is not perfect. For example, it replaces “probability density function” with “probabl densiti function,” “example” with “exampl,” “variable” with “variabl,” etc. That is why we chose not to use it and commented it out.
The downside of leaving it out in our example is that both the terms “function” and “functions” will be in the top-list. To avoid this, we need to apply some corrections. This is similar to replacing words according to a dictionary.
library(stringi)
Now we can visualize how many times certain words appear. The frequencies of the ten most frequent words are plotted in Figure 9.20 on page 187, which is produced by the following code fragment.
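Building the frequency table d requires a term-document matrix; a common construction (an assumption, not necessarily the author's exact code) is:

tdm <- TermDocumentMatrix(doc)
m   <- as.matrix(tdm)
v   <- sort(rowSums(m), decreasing = TRUE)
d   <- data.frame(word = names(v), freq = v)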
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
col ="khaki3", main ="Most frequent words",
ylab = "Word frequencies")
[The ten most frequent words shown in the chart: function, data, use, model, example, code, package, company, method, market.]
Figure 9.20: The frequency of the ten most occurring words in this text. Note that the name of the software, R, is only one letter, and hence not retained as a word. So R is not part of this list.
Step 4: Generate the Word-cloud
Finally, we can generate the word-cloud and produce Figure 9.21 on page 188.
set.seed(1879)
wordcloud(words = d$word, freq = d$freq, min.freq = 10,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Figure 9.21: A word-cloud for the text of this book. This single image gives a good idea of what this book is about.
• random.order: plot words in random order. If false, they will be plotted in decreasing frequency
• colors: colour words from least to most frequent. Use, for example, colors = 'black' for a single colour.
In a first step, we might want to identify the words that appear frequently. The function findFreqTerms() of the package tm finds frequent terms in a document-term or term-document matrix. In the example below we extract words that occur at least 150 times:
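A sketch of that call (assuming the term-document matrix tdm built above):

findFreqTerms(tdm, lowfreq = 150)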
Word Associations in R
One can analyse the association between frequent terms (i.e., terms which correlate) using the findAssocs() function of the same package: it identifies the terms that are most associated with a given word.
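A sketch of the call whose (partial) output is shown below; the term and the correlation limit of 0.15 are inferred from that output:

findAssocs(tdm, terms = "function", corlimit = 0.15)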
## 0.15 0.15
## collapsedeparsef converts
## 0.15 0.15
## deparse bquote
## 0.15 0.15
## messed nicer
## 0.15 0.15
## plotmath titles
## 0.15 0.15
These word associations show us in what context the word “function” is used.
9.12 Colours in R

As a data scientist, you might not worry so much about colours; however, a good selection of colours makes a plot easy to read and helps the message to stand out, and a limited, consistent choice of colours will make your document look more professional. Certainly, we do not want to worry too much about colours, and R does a good job of making colour management easy while still allowing full customization.
To start with, R has 657 named colours built in. A list of those names can be obtained with
the function colours().
While many programming languages force the user to use one particular flavour of the English language, R allows both “color” and “colour” everywhere (and even most package contributors respect this).
This list of colours can be used to search for a colour whose name contains a certain string.
R also allows defining colours in different ways – named colours, RGB colours, hexadecimal colours – and it allows converting the one into the other.
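A short sketch of both ideas (searching the list and converting between representations):

grep("khaki", colours(), value = TRUE)   # all colour names containing "khaki"
col2rgb("khaki3")                        # named colour -> RGB values
rgb(205, 198, 115, maxColorValue = 255)  # RGB -> hexadecimal notation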
To illustrate the way colours can be addressed, we show a few plots. They use a specific plotting library, ggplot2, that is introduced in Chapter 31 “A Grammar of Graphics with ggplot2” on page 687. For now, we suggest focusing on the colour definitions. The following code creates different plots and pastes them together into one set of three by two plots in one figure. That figure is shown in Figure 9.22 on page 192.
library(ggplot2)
library(gridExtra)
##
## Attaching package: ’gridExtra’
## The following object is masked from ’package:dplyr’:
##
## combine
Figure 9.22: An illustration of six predefined colour schemes in R. This figure uses the dataset mtcars and shows in each sub-plot the number of seconds the car needs to speed up to one fourth of a mile in function of its power (hp). The colouring of each observation (car type) is done in function of the fuel economy mpg. This presentation allows visualising more than one relationship in one plot.
# rainbow colours
p4 <- p + scale_color_gradientn(colours = rainbow(5)) +
ggtitle('Rainbow colours')
• scale_fill_xxx() for surfaces in box plots, bar plots, violin plots, histograms, etc.
The plot is in Figure 9.23 on page 194. In this plot, the number of the colour can be found with the following formula:
$\mathrm{nbr} = (y - 1) \cdot 9 + x$
with y the value on the y-axis and x the number on the x-axis. Once the formula is applied, the name of the colour is the colour with that number in R’s list. Here are a few examples:
colours()[(3 - 1) * 9 + 8]
## [1] "blue"
colours()[(50 - 1) * 9 + 1]
## [1] "lightsteelblue4"
Colour sets
The philosophy of R is that it is modular and easy to extend. There are many extensions that make
working with colours easier and produce really nice results. For example, we can use colour sets
as in the following example; the output of this code is in Figure 9.24 on page 195.
p <- ggplot(mtcars) +
geom_histogram(aes(cyl, fill=factor(cyl)), bins=3)
library(RColorBrewer)
p2 <- p + scale_fill_brewer() +
ggtitle('RColorBrewer')
p3 <- p + scale_fill_brewer(palette='Set2') +
ggtitle('RColorBrewer Set2')
p4 <- p + scale_fill_brewer(palette='Accent') +
ggtitle('RColorBrewer Accent')
Figure 9.23: A visualisation of all built-in colours in R. Note that the number of the colour can be determined by taking the y-value minus one, times nine, plus the x-value.
Figure 9.24: Examples of discrete colour sets. The name of the colour-set is the title of each plot.
The library RColorBrewer provides colour scales that cater for different needs: sequential, diverging, and qualitative colour schemes are available to underline the pattern in the data or the lack thereof. More information can be found on http://colorbrewer2.org.
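For example, an overview of the available palettes can be plotted as follows (a short sketch):
library(RColorBrewer)
display.brewer.all()                               # show all palettes
brewer.pal.info[brewer.pal.info$category == "qual", ] # list the qualitative ones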
We hope that with this section you have a solid toolbox to make plots and charts to illustrate your work. However, we did not cover everything, and maybe you want some more eye-candy? We suggest having a look at https://www.r-graph-gallery.com.
♣ 10 ♣
Time Series Analysis
Time series are lists of data points in which each data point is associated with a time-stamp. A simple example is the price of a stock on the stock market at different points in time on a given day. Another example is the amount of rainfall in a region in different months of the year. R provides many functions to create, manipulate, and plot time series data. The data for a time series is stored in an R object called a time series object; it is an R data object, just like a vector or a data frame. A time series object is created with the function ts(), with
• data: A vector or matrix containing the values used in the time series.
• start: The start time for the first observation in the time series.
• end: The end time for the last observation in the time series.
• frequency: The number of observations per unit of time – for example:
– frequency = 12: pegs the data points for every month of a year.
– frequency = 4: pegs the data points for every quarter of a year.
– frequency = 6: pegs the data points for every 10 minutes of an hour.
– frequency = 24 × 6: pegs the data points for every 10 minutes of a day.
Except for the parameter data, all other parameters are optional. To check whether an object is a time series, we can use the function is.ts(), and as.ts(x) will coerce the variable x into a time series object.
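A minimal sketch of creating such an object (the numbers are made up for the illustration):
x    <- c(5, 7, 6, 8, 9, 11, 10, 12, 13, 12, 14, 15)
x_ts <- ts(x, start = c(2020, 1), frequency = 12)  # monthly data from Jan 2020
is.ts(x_ts)   # TRUE
class(x_ts)   # "ts"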
To illustrate the concept of time series, we use an example from the stock exchange: the “S&P500.” The S&P500 is short for the Standard and Poor's stock exchange index of the 500 most important companies in the USA. It is the evolution of a virtual basket consisting of 500 shares. It is, of course, managed so that it is at any point a fair reflection of the 500 most significant companies. Since its creation in 1926, it has shown a year-over-year increase in 70% of the years. To a reasonable extent, it is a barometer for the economy of the USA.
It is possible to import this data in R, but the library MASS, which comes with your free installation of R at no cost, has a dataset that contains the returns of the S&P500 of the 1990s: the dataset SP500:
library(MASS)
# The SP500 is available as a numeric vector:
str(SP500)
## num [1:2780] -0.259 -0.865 -0.98 0.45 -1.186 ...
# We coerce this vector into a time series object with:
SP500_ts <- ts(SP500)
class(SP500_ts)
## [1] "ts"
The time series object (class ts) has its own S3 plotting functionality. All we need to do is use
the standard plot() function to produce the plot in Figure 10.1 on page 199:
plot(SP500_ts)
Figure 10.1: The standard plot for a time series object for the returns of the SP500 index in the 1990s.
For example, when plotting the ts object, we will see all its observations in stacked plots that share a common time-axis (see Figure 10.2).
Figure 10.2: The standard plot functionality for time series will keep the x-axis for both variables the same (it even uses one common axis).
10.2 Forecasting
Forecasting is the process of making predictions about the future, and it is one of the main reasons to do statistics. The idea is that the past holds valuable clues about the future and that by studying the past one can make reasonable suggestions about the future.
Every company needs to make forecasts in order to plan, every investor needs to forecast the market and company performance in order to make decisions, etc. For example, if the profit grew every year over the last four years between 15% and 20%, it is reasonable to forecast a hefty growth rate based on this endogenous data. Of course, this ignores exogenous data, such as an economic depression in the making. Indeed, global economic crises or pandemics do not warn us in advance. They are outside most models (“exogenous”).
No forecast is absolutely accurate, and the only thing that we know for sure is that the future
will hold surprises. It is of course possible to attach some degree of certainty to our forecast. This
is referred to as “confidence intervals,” and this is more valuable than a precise prediction.
There are a number of quantitative techniques that can be utilized to generate forecasts. While
some are straightforward, others might for example incorporate exogenous factors and become
necessarily more complex.
Our brain is in essence an efficient pattern-recognition machine that is so efficient that it will
even see patterns where there are none.1
Typically, one relies on the following key concepts:
• Autocorrelation: This refers to the phenomenon whereby values at time t are influenced by previous values. R provides an autocorrelation plot that helps to find the proper lag structure and the nature of autocorrelated values (see the short example after this list).
• Randomness: A time series that shows no specific pattern (e.g. a random walk or Brownian
motion). Note that there can still be a systematic shift (a long term trend).
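For example, the autocorrelation plot mentioned above can be obtained with the function acf() – a short sketch using the SP500_ts object from the previous section:
acf(SP500_ts, main = "Autocorrelation of the S&P500 returns")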
1 This refers to the bias “hot hand fallacy” – see e.g. De Brouwer (2012)
When it comes to macro-economic data, the World Bank is a class apart. Its website https://data.worldbank.org has thousands of indicators that can be downloaded and analysed. Their data catalogue is here: https://datacatalog.worldbank.org.
We have downloaded the GDP data of Poland and stored it in a csv-file on our hard-disk. We will use that data in the examples that explain the concepts in the following sections.
To start, we load the data stored in a csv file on our local hard-drive and plot it; the plot is in Figure 10.3.
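A sketch of how this can be done – the file name and the column names are assumptions that have to be adapted to your own download; the variable names follow those used further in this chapter:
g <- read.csv("gdp_poland.csv")                    # hypothetical file name
year <- g$year
GDP.per.capitia.in.current.USD <- g$GDP.per.capitia.in.current.USD
g.data <- ts(GDP.per.capitia.in.current.USD, start = c(1990))
plot(g.data, main = "GDP per capita of Poland", ylab = "income in current USD")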
The next step is creating the moving average forecast. To do this, we use the package forecast:
require(forecast)
Figure 10.3: A first plot to show the data before we start. This will allow us to select a suitable method
for forecasting.
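As an illustration of the mechanism, the package forecast provides – among others – the function ma() for a (centred) moving average; the exact model behind Figure 10.4 may differ:
g.movav <- ma(g.data, order = 3)  # centred moving average of order 3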
Now, we can plot the forecasted data together with the source data using the following code. The plot is in Figure 10.4 on page 202.
It is also easy to plot the forecast together with the observations and confidence intervals. This
is shown in the following code and the result is plotted in Figure 10.5 on page 203.
plot(g.movav.tst,col="blue",lw=4,
main="Forecast of GDP per capita of Poland",
ylab="Income in current USD")
lines(year, GDP.per.capitia.in.current.USD, col="red",type='b')
Figure 10.5: A backtest for our forecast.
In the forecast package, there is an automated forecasting function that will run through possible models and select the most appropriate model given the data. This could be an autoregressive model of the first order (AR(1)), an ARIMA model (autoregressive integrated moving average model) with the right values for p, d, and q, or even something else that is more appropriate. The following code uses those functions to plot a forecast in Figure 10.6 on page 204.
train = ts(g.data[1:20],start=c(1990))
test = ts(g.data[21:26],start=c(2010))
arma_fit <- auto.arima(train)
arma_forecast <- forecast(arma_fit, h = 6)
arma_fit_accuracy <- accuracy(arma_forecast, test)
arma_fit; arma_forecast; arma_fit_accuracy
## Series: train
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 515.5991
## s.e. 231.4786
##
## sigma^2 estimated as 1074618: log likelihood=-158.38
## AIC=320.75 AICc=321.5 BIC=322.64
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2010 12043.19 10714.69 13371.70 10011.419 14074.97
## 2011 12558.79 10680.00 14437.58 9685.431 15432.15
## 2012 13074.39 10773.35 15375.43 9555.257 16593.52
Note that ARIMA stands for autoregressive integrated moving average model. It is a generalization of the autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). ARIMA models are applied in some cases where the data shows evidence of non-stationarity; an initial differencing step (corresponding to the “integrated” part of the model) can then be applied one or more times to eliminate the non-stationarity.
The package forecast provides the function ses to execute this as follows:
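A minimal sketch of such a call – the exact arguments may differ – that creates the object g.exp used below:
g.exp <- ses(g.data, h = 5)  # simple exponential smoothing, 5 steps ahead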
Plotting the results is made easy by R’s function-based OO model: we can simply use the standard plot functionality – the results are in Figure 10.7:
plot(g.exp,col="blue",lw=4,
main="Forecast of GDP per capita of Poland",
ylab="income in current USD")
lines(year,GDP.per.capitia.in.current.USD,col="red",type='b')
The use of the function stl is straightforward and shown in the following code. In the last
line, we plot the data – the plot is in Figure 10.9.
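A minimal sketch of such a decomposition on the nottem data (used below), assuming a stable seasonal pattern:
fit.stl <- stl(nottem, s.window = "periodic")  # seasonal decomposition
plot(fit.stl)                                  # data, seasonal, trend and remainder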
Figure 10.9: Using the stl-function to decompose data in a seasonal part and a trend.
The four graphs are the original data, the seasonal component, the trend component, and the remainder; this shows the periodic seasonal pattern extracted from the original data and the trend that moves around between 47 and 51 degrees Fahrenheit. There is a bar at the right-hand side of each graph to allow a relative comparison of the magnitudes of each component. For this data, the change in trend is smaller than the variation due to the monthly seasonality.
Note that a series with multiplicative effects can often be transformed into a series with additive effects through a log transformation. For example, assume that we have a time series object exp_ts; then we can transform it into one with additive effects (add_ts) via:
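The transformation is simply the logarithm (exp_ts is assumed to exist):
add_ts <- log(exp_ts)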
Figure 10.10: The Holt-Winters model fits an exponential trend. Here we plot the double exponential
model.
Figure 10.11: Forecasts from HoltWinters.
Exponential Models
Both the HoltWinters() function from the package stats and the ets() function from the forecast package can be used to fit exponential models.
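For example, a double exponential (Holt) model on the GDP data – a sketch; the model object fit used below may have been created differently – can be fitted as:
fit <- HoltWinters(g.data, gamma = FALSE)  # double exponential: level and trend, no seasonality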
# Predictive accuracy
library(forecast)
accuracy(forecast(fit,5))
## ME RMSE MAE MPE
## Training set -69.84485 1051.488 711.7743 -2.775476
## MAPE MASE ACF1
## Training set 9.016881 0.8422587 0.008888197
Again, we can use the function forecast() to generate forecasts based on the model. We will plot the forecast together with the existing data (that we add via the function lines()). The result is in Figure 10.10 on page 208.
plot(forecast(fit, 5),col="blue",lw=4,
main="Forecast of GDP per capita of Poland",
ylab="income in current USD")
lines(year,GDP.per.capitia.in.current.USD,col="red",type='b')
Since the GDP data did not show a seasonal pattern, it was not possible to use the triple exponential model; we will therefore consider another example. Temperatures should certainly display a seasonal pattern. The dataset nottem – from the package datasets, which is loaded with base R – contains the average monthly temperatures in Nottingham between 1920 and 1939. In the example, we use the HoltWinters() function and plot the results in Figure 10.11 on page 208.
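A sketch of such a triple exponential fit (the exact call may differ):
fit.hw <- HoltWinters(nottem)             # level, trend and seasonal component
plot(forecast(fit.hw, h = 24),            # forecast two years ahead
     main = "Forecasts from HoltWinters")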
Question #11
Use the moving average method on the temperatures in Nottingham (nottem). Does it work? Which model would work better?
♣ 11 ♣
Further Reading
Hopefully you got a taste of R and you liked it so far. The rest of the book will use R to dive into
data (see Part III “Data Import” on page 213), data wrangling (see Part IV “Data Wrangling” on
page 259), modelling and machine learning (see Part V “Modelling” on page 373), companies and
corporate environment (see Part VI “Introduction to Companies” on page 565) and reporting (see
Part VII “Reporting” on page 685). If you want to learn more about the foundations of R then there
are many excellent books and online documentation available.
The information about R and its ecosystem is growing fast, and a search engine like DuckDuckGo or Qwant is probably an ideal starting point, rather than a book that is more static. I will maintain a web-page for this book at http://www.de-brouwer.com/r-book, where you can find all the code for this book as well as further references and examples.
PART III
Data Import
♣ 12 ♣
Before we dive into extracting data and modelling, it is worth spending a few minutes to understand how the data is stored. There are many data storage systems available and most are in use. Many companies started investing in computers in the late 1950s or in the 1960s. IBM was an important player in the market of electronic computers and software and developed a particularly useful computer language in the 1950s: FORTRAN. FORTRAN soon became the software of choice for many companies. Later developments allowed for structured programming (in a nutshell, this means using the goto lineNbr command less and using constructs such as if, for, etc. more).1
Many large companies have since been building on their code and adding functionality. This means that some code is about retirement age (60 years old).2
Fortran outlived the punched cards and the tape readers, and it also encountered the navigational database management systems (DBMS) in the 1960s. In the age of tapes, data was simply a list of things. It is with the advent of disk and drum readers that “random access”3 to data became possible and that the term “database” came into use. The technologies from the 1960s were a leap ahead and allowed more complex data and more relations between data to be stored and accessed.
As the number of computers, computer users, and applications grew faster and faster, it can of course be expected that the number of database systems also grew fast. Soon the need for a standard imposed itself, and the “Database Task Group” was formed within CODASYL (the same group that created the COBOL programming language).
That standard was designed for a simple sequential approach to data search. In that early standard, there was already the concept of a primary key (called “the CALC key”) as well as the concept of relations. However, all records had to be searched in sequential order. Soon there were many variations and implementations of the CODASYL standard. IBM had a slightly different system: IMS (Information Management System) – designed for the Apollo program on their System/360.
1 Maybe the reader will not know FORTRAN, but most probably you are familiar with “BASIC.” BASIC is derived from Fortran II. BASIC improved on Fortran by having improved logical structures.
2 One of the author’s previous employers – once the bank with the largest balance sheet in the world – still has such gems in their payment systems. Indeed, why would one replace code that works and is literally interlinked with thousands of other systems and software?
3 Random access means that each piece of data can be accessed when it is needed, as opposed to a tape that has
to be read sequentially. So if for example addresses are stored on a tape and you need the address of a person, then
there is no other way than reading name after name till we find the right match.
The 1970s saw the dawn of the relational database management system (RDBMS), or the relational database system, that dominates data storage till today. The RDBMS standard improved on the CODASYL standard by adding an efficient search facility that made the best use of random access storage. The focus on records stored in a hierarchical system, which was the core of the CODASYL system, was also replaced by the concept of a “table,” and this allowed the content of the database to evolve without too much overhead to rewrite links and pointers.
Also in theory and application the RDBMS was superior, as it allows both hierarchical and navigational models to be implemented. The design can be tuned for performance, storage optimization, or ease of updating data.4
Edgar Codd not only invented the RDBMS but also provided a set-oriented language based on tuple calculus (a field to which he contributed greatly). This language would later be transformed into SQL.
Soon the RDBMS was implemented in different systems. It is rare to encounter systems from the 1970s such as INGRES or System R (which has its own query language: QUEL). Later, IBM created SQL/DS based on their System R and introduced DB2 in the 1980s. In those days, the author worked in the IT department of an insurance company, which was one of the early adopters of the DB2 system. The new database system of IBM was extremely fast and powerful compared to the other mainframe solutions that were used till then. Many systems from those days are still in use today, and it is not uncommon to encounter a DB2 database server in the server rooms.
Oracle was founded in 1977, and its database solutions started to conquer the world. For a long time, the strategy of that company – now (2018) the world’s fourth largest software company – was based on one product: an RDBMS that had the same name as the company. Oracle also started to supply additions to SQL with their PL/SQL,5 which is Turing complete.
The 1980s also saw the advent of a personal computer (henceforth, PC) that was powerful and affordable enough to conquer the workfloor of any company. At the beginning of the 1980s, most employees would not have a computer on their desk, but when the 1980s came to their end, everyone (but the security guard) who had a desk would have had a computer on it. Once the 1990s ended, also the security guard got his/her PC.
This also meant that many databases would be held on a PC, and soon solutions like Lotus 1-2-3 and dBase conquered the market. However, today they are all replaced by their heirs MS Access and MS Excel (or possibly free alternatives such as LibreOffice Base and Calc).
After the release of C++ in 1985 by Bjarne Stroustrup, we had to wait till the 1990s to see object-oriented programming (henceforth, OO) really take off. Data storage had become more affordable and available, there was more data than ever, and software became more and more complex. Since in many applications data plays the principal role and the rest of the code only transforms data, it made sense to wrap code around data instead of around procedural logic. The OO concept also had many other advantages; for example, it made code easier to maintain.
Programmers preferred not to follow the strict “normal form” logic of building a relational database in its purest form. Rather, it made sense to build databases that followed the objects. While it is possible to implement this directly in an RDBMS, a cleaner mapping is possible. In fact, the relations between data become relations between objects and their attributes – as opposed to relations between individual fields. This gap with the relational data model was called “the object-relational impedance mismatch.” In fact, it is the inconvenience of translating between
programmed objects and database tables made explicit. This gave rise to “object databases” and “object-relational databases.” Other approaches involved software libraries that programmers could use to map objects to tables. These so-called object-relational mappings (ORMs) attempt to solve the same problem.
The amount of data available continued to grow exponentially: the Internet became more popular, and as soon as every household had its personal computer, we saw mobile phones conquering the world even faster and producing data at ever increasing rates. Recently, the Internet of Things (henceforth, IoT) came to further enhance this trend.
This made it impractical to bring data to the processing unit, and we needed to turn the concept around and bring the processors closer to the data: Big Data was born. Distributed computing such as Hadoop allows us to use cheap and redundant hardware, dispatch calculations to those CPUs that are close to the data needed for a certain calculation, and then bring the results together for the user. This revived the NoSQL concept. The concept actually dates from the 1950s, but was largely neglected till it became impossible to recalculate indexes because databases were so large and data was coming in faster than ever before.
NoSQL allows for horizontal scaling as well as for relational rules to be broken, so these systems do not require table schemes to be well defined, nor do they waste too much time on recalculating indices.6
A recent development is NewSQL. It is a new approach to the RDBMS that aims to provide the same scalable performance as NoSQL systems for transaction-processing (read–write) workloads, while still using SQL and maintaining the atomicity, consistency, isolation, and durability (ACID) guarantees of a traditional database system.7
When working with data in a corporate setting, it is common to encounter older systems (or “legacy systems”), RDBMS that understand (a dialect of) SQL, desktop databases (in the form of MS Access or MS Excel), and big data systems mixed together. However, the most popular are those systems that are an RDBMS and understand SQL.
Because of the importance of relational database systems, we will study them in more detail in the next chapter: Chapter 13 “RDBMS” on page 219.
6 As distributed databases grew in popularity the search for high partition tolerance was on. However, we found
that (due to the “CAP theorem”) it is impossible for a distributed database system to simultaneously provide (1)
consistency, (2) availability, and (3) partition tolerance guarantee. It seems that a distributed database system can
satisfy two of these guarantees but not all three at the same time.
7 A database system that is ACID compliant will respect the rules of atomicity, consistency, isolation, and durability. This means that the database system is intended to guarantee validity even when errors occur or power failures happen.
♣ 13 ♣
RDBMS
In a nutshell, a relational database management system (RDBMS) is a set of tables (rectangular data), where each piece of information is stored only once and where relations are used to find the information needed. Usually, they are accompanied by an intelligent indexing system that allows searching in logarithmic time.
Consider a simple example that will demonstrate the basics of a relational database system.
Imagine that we want to create a system that governs a library of books. There are multiple ways
to do this, but the following tables are a good choice to get started:
• Authors, with their name, first name, possibly year of birth and death (if applicable) – a table of authors: Table 13.1 on page 220;
• Books, with title, author, editor, ISBN, year, number of pages, subject code, etc. – a table of books: Table 13.2 on page 220;
• Subject codes, with a description – a table of genres: Table 13.3 on page 220.
The tables have been constructed so that they already comply with some principles that underpin relational database systems. For example, we see that certain fields from the Author table are used in the Books table. This allows us to store information related to the author (such as birth date) only once. The birth-date is necessarily the same for all books authored by this person, because the birth-date depends only on the person.
Indeed, this simple example, with only three tables, shows right away some interesting
aspects:
• The unit of interest is the “table”: rectangular data – a “data frame” in R’s terminology.
• Each table has fields (in R these would be referred to as “columns”).
• Each table has a primary key (identified by the letters PK in the table). The primary key is unique for each record (the record is the row); otherwise stated, each record can be uniquely identified by its primary key.
tbl_authors
id pen_name full_name birth death
PK
1 Marcel Proust Valentin Louis G. E. Marcel Proust 1871-07-10 1922-11-18
2 Miguel de Cervantes Miguel de Cervantes Saavedra 1547-09-29 1616-04-22
3 James Joyce James Augustine Aloysius Joyce 1882-02-02 1941-01-13
4 E.L. James Erika Leonard 1963-03-07
5 Isaac Newton Isaac Newton 1642-12-25 1726-03-20
7 Euclid Euclid of Alexandria Mid-4th C BC Mid-3rd C BC
11 Bernard Marr Bernard Marr
13 Bart Baesens Bart Baesens 1975-02-27
17 Philippe De Brouwer Philippe J.S. De Brouwer 1969-02-21
Table 13.1: The table of authors for our simple database system.
tbl_books
id author year title genre
PK FK FK
1 1 1896 Les plaisirs et les jour LITmod
2 1 1927 Albertine disparue LITmod
4 1 1954 Contre Sainte-Beuve LITmod
5 1 1871–1922 À la recherche du temps perdu LITmod
7 2 1605 and 1615 El Ingenioso Hidalgo Don Quijote de la Mancha LITmod
❦ 9 2 1613 Novelas ejemplares LITmod ❦
10 4 2011 Fifty Shades of Grey LITero
15 5 1687 Philosophiæ Naturalis Principia Mathematica SCIphy
16 7 300 BCE Elements SCImat
18 13 2014 Big Data World SCIdat
19 11 2016 Key Business Analytics SCIdat
20 17 2011 Maslowian Portfolio Theory FINinv
Table 13.2: The table of books for our simple database system.
tbl_genres
id (PK) type sub_type location
PK FK
LITmod literature modernism 001.45
LITero literature erotica 001.67
SCIphy science physics 200.43
SCImat science mathematics 100.53
SCIbio science biology 300.10
SCIdat science data science 205.13
FINinv financial investments 405.08
Table 13.3: A simple example of a relational database system or RDBMS for a simple system for a library. It shows
that each piece of information is only stored once and that tables are rectangular data.
• The primary key can be something meaningful (e.g. “LITmod” can be understood as the subsection “modernism” within the larger group of “literature”) or just a number.1 The nice thing about numbers is that most RDBMS will manage those for you (for example, when you create a new record, you do not have to know which numbers are taken and the system will automatically choose for you the smallest available number).
• There might be other fields that are unique besides the primary key. For example, our locations (in the genres) are unique, as are our dates and years in the tables of both books and authors. However, in this particular case, we should not define them as unique, because we might have more than one author that is born on the same date.
• Some tables have a foreign key (FK). A foreign key is a reference to the primary key of another table. That is called a relation.
• In many cases, the relationship PK–FK will be a “one-to-many” (one author can write many
books). However, thinking about it, we could have the situation where one book is written
by more than one person . . . so actually we should foresee a “many-to-many” relationship.
But that would not work exactly the way we intend it to work. To achieve that, we should
build an auxiliary table that sits between the authors and the books—but let us revisit this
idea later.
• This small example contains a few issues that you are bound to encounter in real life with real data:
– Some values are missing. While this can happen randomly, it also might have a par-
ticular meaning (such as no death-date for an author can mean that he or she is still
alive) or it can be simply a mistake.
– Some columns contain non-consistent data types (e.g. years and dates).
– Some values are self-explanatory, but some are cryptic (e.g. location in tbl_genres seems to consist of two numbers). We need a data dictionary2 to really understand the data.
– We need intimate and close understanding of what happened in order to understand
why things are the way they are.
– That intimate knowledge will help us understand what is wrong. For example, Marcel Proust’s famous work “À la recherche du temps perdu” is not just one book. It is a heptalogy, and the first volume was printed in 1871, the last in 1922.
– To solve this we have to go to the library and see what really happened there. The data
does not have all information. The data needs correction, the system was designed to
1 When we refer to an author in the table of books for example we do not use the name (the reason is that (a)
this might not be unique, but also (b) because this would lead to errors and confusion and (c) it would be slower to
index and more difficult to manage for the RDBMS). In our case we could be reasonably sure that there are no two
authors that share the same pen-name, real name and birth-date (though somewhere in distant future this might
happen), so we could define the combination of pen-name, real name and birth-date as unique and eventually use
this as primary key.
2 A data dictionary is the documentation of the data; in reality, you will find that many columns have cryptic names and that we need the data dictionary to find out what they mean.
hold one book in tbl_books and not a series of seven in one entry. If we have the
series of seven books, we need to find the details of all seven; if we have one big book
containing all, then it will have one year where it was printed.3
There seem to be many challenges with the data. Let us address them when we build our
database in the next section.
3 A similar issue occurs with the work of Miguel de Cervantes. His work is a diptych, one book published in two parts. Most probably we have a modern edition and there is just one book with one publishing date. It seems that the programmers intended to have publishing dates here and our librarian has put in the original first publishing dates.
♣ 14 ♣
SQL
Before starting to create tables, it is a good idea to reflect on what the tables should look like and what is optimal. Good database design starts with an entity-relationship diagram (ER diagram). The ER diagram is a structured way of representing data and relations, and the example of the library from Chapter 13 “RDBMS” on page 219 is represented in Figure 14.1 on page 224.
The ER diagram is designed to be intuitive and understandable with just a few words of
explanation. In our ER diagram, we notice that:
This ER diagram is the ideal instrument for the analyst to talk to the data owner or subject
matter expert (SME)1 . So the ER diagram is the tool to make sure the analyst understands all
dependencies and relations.
For example, it is critical to understand that one author has only one pen-name in our design.
Is this a good choice now and in the near future?
Once the ER diagram is agreed, we can start focussing on the database design. The first step of
the database design is a layout of the tables and relations. In our – very simple – case this follows
directly from the tables Table 13.1 on page 220, Table 13.2 on page 220, and Table 13.3 on page 220.
This database design is visualised in Figure 14.2 on page 224.
The next step is to agree which fields can be empty and to understand what type of searches will be common. This might result in altering this design and adding tables with indexes that allow the results of those common queries to be found really quickly.
1 Corporates generally like acronyms and the subject matter expert is also referred to as SME, which is not a great
acronym as it is also used for “small and medium enterprises.” A general rule of advice here is to limit the use of
acronyms as much as possible in daily operations. It leads to confusion, less efficient communication and exclusion
since other teams might have other acronyms for the same concept or even use the same acronym for something
else.
Figure 14.1: The entity relationship (ER) diagram for our example, the library of books.
Alternatively, the database could be tuned for fast inputs. In a payment system for a bank2 , for
example it is important that payments go through fast and that there is no delay. In such systems,
there will be no time for building indexes while the payments go through.
Now, we have sufficient information to build our database.
2 A typical bank will handle anything between ten and a hundred thousand payments per second.
The installation will ask you for a master-password for the database (“root password”). Make
sure to remember that password.4 We will refer to it as MySQLrootPassword. Whenever you
encounter this in our code, you will need to replace this word with your password of choice.
Alternatively, one can use MariaDB. This is a fork of MySQL that was created when Oracle took over MySQL.
sudo apt-get install mariadb-server
sudo apt-get install mariadb-client
# and optionally
sudo apt-get install libmariadbclient-dev libmariadbd-dev
sudo apt-get install phpmyadmin
The popular opinion that MariaDB works as a drop-in replacement for MySQL is not really true. For example, (in 2019) making a full SQL-dump in MySQL and restoring it
3 The name “MySQL” is a contraction of “My,” which was the name of the daughter of one of the co-founders,
in MariaDB leads to errors on technical tables that do not have the expected number of
fields. However, some things are still the same: you also use the command mysql to start
MariaDB, the same connectors from C++ or R that were created for MySQL still work,
etc. MariaDB and MySQL are now separate projects, and the people making decisions are
not the same, so expect them to diverge more in the future.
Now, the database server exists, and we can connect to it via the client (that was installed in
the line mysql-client or mariadb-client).
mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 5.7.23-0ubuntu0.16.04.1 (Ubuntu)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
Listing 14.2: Starting MySQL as root user. The first line is the command in the CLI, the last line is the MySQL prompt, indicating that we are now in the MySQL shell.
In modern installations of MySQL and MariaDB, only the root user of the server has
access to the root account of the database . . . and this without extra password. This can
be changed with the following approach. The following will get you in:
sudo mysql -u root
If it is appropriate that each user with root access to the computer can have root access to the database, then there is nothing further to do. If you want a separate password for the database server, then run these commands:
use mysql;
update user set plugin='' where User='root';
flush privileges;
exit
Now, back in the Linux command prompt run the following, set the root password and
answer the questions.
sudo systemctl restart mariadb.service
sudo mysql_secure_installation
Note that you only type the first line mysql -u root -p; then the line Enter password: invites you to type the password. Type it, followed by enter, and only then the rest of the screen appears. At this point, you have left the (bash) shell and you are in the MySQL environment.
Carefully read the text that appears. It explains some commands and also informs that Oracle
does not take any responsibility if something goes wrong.5
For example, it is a good idea to type \h and see the commands that are available at this level.
This will also inform you that each line has to be ended by a semicolon ;. This is noteworthy as
in R this is not necessary.
The MySQL monitor provides an environment that is very similar to the R terminal. It is
an interactive environment where the commands that you type will be executed immedi-
ately. However, R will try to execute the command when you type enter. MySQL will do
so when you type a semicolon ;. Then again . . . R also understands the semicolon and the
following is understood by R as two separate commands.
x <- 4; y <- x + 1
There are also other licence options possible. However, MySQL was owned and sponsored by the Swedish for-profit
company “MySQL AB” and this company has been bought by Oracle Corporation. This basically means that it is
free to use, however, if you want to pay (and potentially hold a private company liable for something) then you can
do that and you will get the same software, but with another start-up message.
From this listing, we can see that there are more databases than only the one that we created. Some might be there from earlier use, but some belong to MySQL itself and are managed by the database manager. For example, “information_schema” will hold information about the structure of all other databases, their tables, their relations, and their indexes. Unless you know really well what you do and why you do it: never ever touch them.
Note that a comment-line in SQL is preceded by two dashes: --. Longer comments can
be put between /* ... */
The last command is \q . This command will leave the MySQL terminal and return to the
Linux shell. Alternatively, one can use exit .
Now that we are logged in as libroot, we will start to create the tables in which later all data will reside:
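As an illustration only – the exact definitions may differ – the table of authors could be created along the following lines (note the AUTO_INCREMENT and the collation, which are discussed below):
CREATE TABLE tbl_authors (
  author_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  pen_name   VARCHAR(100) NOT NULL,
  full_name  VARCHAR(100),
  birth_date DATE,
  death_date DATE,
  PRIMARY KEY (author_id)
  ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;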
• auto_increment tells MySQL to manage the value of this field by itself: if the user does not provide a unique number, then MySQL will automatically allocate a number that is free when a record is created;
• We also provide a “collation.” A collation in MySQL is a set of rules that defines how to compare and sort character strings, and is somewhat comparable to the typical regional settings. For example, “utf8_unicode_ci” implements the standard Unicode Collation Algorithm; it supports expansions and ligatures, for example: the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near “ss,” the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near “OE,” etc. We opt for this collation because we expect to see many international names in this table.
The field author is a “foreign key” (FK). This field will refer to the author in the table tbl_authors. This is the way that an RDBMS will find the information about the author. From the book, we find the author-id and then we look this id up in the table of authors to access the rest of the information.
A foreign key (FK) is indicated with the keyword REFERENCES, followed by the table name and field name between round brackets. Then there is the statement ON DELETE RESTRICT. This means that as long as there is a book of an author in our library, we cannot delete the author. In order to delete the author, we first need to delete all the books of that author. The alternative is ON DELETE CASCADE, which means that if we delete an author, all his/her books will automatically be deleted.
Also here we made some hard choices. The field year is implemented as SMALLINT; this means that we will not be able to add a range of dates. However, since the keyword UNSIGNED is not provided, it is possible to use negative numbers to code books written before the common era.7 The difference between VARCHAR(n) and CHAR(n) is that the first will hold up to n characters, while the second will always encode n characters. For example, the word “it” in a CHAR(4) field will be stored as “it” padded with two spaces, while in a VARCHAR(4) field the extra spaces would not be part of the data in the table.8
Then there are two additional lines that start with CREATE INDEX. These commands create an additional small table that contains just this keyword (ordered) and a reference to the position in the table. This allows the database system to look up the index in logarithmic time9 and then, via the reference, find the record in the table. The database server expects many lookups on the primary key, so it will always index this field (therefore, we do not need to specify that).
When the data comes in very fast, when records hold a lot of information, or when the table becomes really long, it is not practically possible to rewrite the whole table so that all data sits ordered via the primary key. So, the relational database management system (RDBMS) will create a thin table that holds only the index and the place where the record is in the physical table.
Whenever a new record comes in, the index is re-written.
An index can hence be seen as a thin table that helps MySQL to find data fast. Not only primary
keys will have their index, but it is also possible to ask MySQL to index any other field so that
lookup operations on that field will become faster. The syntax is:
-- Create an index where the values must be unique
-- (except for the NULL values, which may appear multiple times):
ALTER TABLE tbl_name ADD UNIQUE index_name (column_list);
-- Create an index in which any value may appear more than once:
ALTER TABLE tbl_name ADD INDEX index_name (column_list);
data will take O(N). For longer tables, this becomes a massive difference in time. Note also that the searching will be done in a table with fewer columns (fields); this also contributes to the speed.
-- Drop an index:
ALTER TABLE table_name DROP INDEX index_name;
Note that MySQL allows us to create tbl_books and define the field genre as a foreign key that refers to tbl_genres before that table exists. This is useful when creating the tables; however, we will get problems if we try to put in data at this point.
It is possible to check our work with DESCRIBE. That command will show relevant details about the table and its fields:
We now have a minimal viable relational database, and we can start adding data into this structure: we can start adding our books. This will also make clear what the impact is of the choices that we made earlier.
Since we made the effort to create a username for updating the data, let us use it. In MySQL, type exit or \q followed by enter and then log in again from the command prompt.
mysql -u librarian -p
Listing 14.10: Logging in as user “librarian.”
To insert data, we use the query (or “sentence” in SQL) that starts with INSERT INTO. This statement generally has two forms: one for adding one record and one for adding many records. Let us start by adding one record:
INSERT INTO tbl_authors
VALUES (1, "Philppe J.S. De Brouwer", "Philppe J.S. De Brouwer", "1969-02-21"
, NULL);
Listing 14.11: Adding the first author to the database.
While MySQL was supposed to manage the author_id, it did not complain when it was coerced into using a certain value (given that the value is in range, of the right type, and of course still free). Without this lenience, uploading data or restoring an SQL-dump would not always be possible.
This form can only be used if we provide a value for each and every column (field). Usually, that is not the case, and then we have to tell MySQL which fields we provide:
-- First, remove the record that we just created:
DELETE FROM tbl_authors WHERE author_id = 1;
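A sketch of the second form, in which the fields are listed explicitly (MySQL then fills in author_id and death_date itself; the values follow Listing 14.11):
INSERT INTO tbl_authors (pen_name, full_name, birth_date)
VALUES ("Philppe J.S. De Brouwer", "Philppe J.S. De Brouwer", "1969-02-21");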
Note that we did not provide a deceased date since this author still lives, and that MySQL decided that this is anyhow NULL. NULL means “nothing,” not even an empty string. In R this concept is called “NA.” To see how MySQL represents this, run the following command.
mysql> SELECT * FROM tbl_authors;
+-----------+--------------...--+------------+
| author_id | pen_name ... | death_date |
+-----------+--------------...--+------------+
| 2 | Philppe J.S. ...1 | NULL |
+-----------+--------------...--+------------+
1 row in set (0,00 sec)
If you enter a wrong query (for example, forget the closing quotes of a string), then – just as in R – MySQL will not understand that your command is finished and will expect more input on a continuation line. In that case, you can press CTRL+c followed by ENTER.
Here we decided to leave dates earlier than 1000-01-01 blank (NULL). In fact, MySQL will accept some earlier dates, but since the documentation does not specify this, it is safer not to use them. Also note that negative years will not be accepted in a field of type DATE.
The result is that we might have missing dates for multiple reasons: the person did not die yet
or the date was before 1000-01-01 . . . or we do not know (the case of Mr. Marr).
While MySQL will give feedback about the success of the query, you might want to check whether the data is really there. We will discuss this type of query in the next section, but in the meanwhile you might want to use this simple code:
SELECT * FROM tbl_authors;
Since we defined the field genre in tbl_books as NOT NULL, we will have to add the data to the table with genres first, before we can add any books. This is because MySQL will guard the referential integrity of our data. This means that MySQL will make sure that the rules that we defined when the tables were created are respected at all times (e.g. the PK is unique, an FK refers to an existing record in another table, etc.). The following code adds the necessary records to the table tbl_genres.
INSERT INTO tbl_genres (genre_id, type, sub_type, location)
VALUES
("LITmod", "literature", "modernism", "001.45"),
("LITero", "literature", "erotica", "001.67"),
("SCIphy", "science", "physics", "200.43"),
("SCImat", "science", "mathematics", "100.53"),
("SCIbio", "science", "biology", "300.10"),
("SCIdat", "science", "data science", "205.13"),
❦ ("FINinv", "financial", "investments", "405.08") ❦
;
Listing 14.14: Add the data to the table tbl_genres.
For the books, we can leave the book_id up to MySQL or specify it ourselves.
INSERT INTO tbl_books (author, year, title, genre)
VALUES
(1, 1896, "Les plaisirs et les jour", "LITmod"),
(1, 1927, "Albertine disparue", "LITmod"),
(1, 1954, "Contre Sainte-Beuve", "LITmod"),
(1, 1922, "AÌĂ la recherche du temps perdu", "LITmod"),
(2, 1615, "El Ingenioso Hidalgo Don Quijote de la Mancha", "LITmod"),
(2, 1613, "Novelas ejemplares", "LITmod"),
(4, 2011, "Fifty Shades of Grey", "LITero"),
(5, 1687, "PhilosophiÃ˛e Naturalis Principia Mathematica", "SCIphy"),
(7, -300, "Elements (translated )", "SCImat"),
(13, 2014, "Big Data World", "SCIdat"),
(11, 2016, "Key Business Analytics", "SCIdat"),
(14, 2011, "Maslowian Portfolio Theory", "FINinv")
;
Listing 14.15: Add the data to the table tbl_books.
Note how it was possible to encode a year before common era – for the book – as a negative
number. Any software using this data will have to understand this particular implementation,
otherwise undesired results will occur. Again, we see how important it is to understand data. The careful reader will notice that actually “300 BCE” might better be implemented as “-299” because there was never a year “0”: the year before “1 CE” is the year “-1” (not zero). In our case, we did not do this, because we go for what seems most natural to the librarian. The point is that data in itself is meaningless and that guessing what it means is dangerous; we need to get the data together with the explanation (the data dictionary).
Now, the hard work is done: MySQL has our data, it knows our database design, and it will enforce referential integrity; this means that if a field is defined as a foreign key, it will not allow any entries that are not in the table that holds the primary key (PK).
To get data back from an RDBMS that uses SQL, it is sufficient to run a SELECT query. The SELECT-query is constructed as an intuitive sentence.
-- Include also the ones that have no birth date in the system
SELECT pen_name FROM tbl_authors
WHERE (
(birth_date > DATE("1900-01-01"))
OR
(ISNULL(birth_date))
) ;
Listing 14.16: Some example of SELECT-queries. Note that the output is not shown here, simply
because it would be too long.
So far, the queries that we have presented only pull information from one table, while the whole idea of a relational database is just to store information in the relations. Well, just as we can ask information from multiple fields by listing them separated by commas, we can do the same for tables. Adding tables after the FROM keyword will actually prompt SQL to compute the Cartesian product (i.e. all possible combinations).
Test this by running:
SELECT pen_name, title FROM tbl_authors, tbl_books;
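The Cartesian product can then be restricted to the matching author ids, for example (a sketch):
SELECT pen_name, title FROM tbl_authors, tbl_books
WHERE tbl_authors.author_id = tbl_books.author;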
This query will only show those combinations where the author number is equal in both tables. This is the information that we were looking for.
This type of filtering appears so often that it has its own name: “inner join.” An inner join will only show those rows where both fields are equal. This – just as in R – excludes all rows that would have a NULL in that field, because the system has no way to tell if the values match.
Figure 14.3: Different join types illustrated. Note that the Venn-diagrams are only for illustration
purposes, they do not truly show the complexity. Also note that we used a table “A” with field “a” and
a table “B” with field “b.”
Even with our very simple database design, the difference in syntax can be demonstrated:
SELECT pen_name, location FROM tbl_authors A, tbl_books B, tbl_genres G
WHERE
(A.author_id = B.author) AND
(B.genre = G.genre_id)
;
If you want to join on fields that carry the same name in both tables, then you can use the
keyword USING:
USING (author)
is equivalent to
ON A.author = B.author
While the keyword USING is a handy shorthand, in many cases, it makes the code less
transparent, and in some cases, this might not be entirely equivalent.
Is there a difference between a right and a left join when the order of the tables is inverted?
-- Left join:
SELECT cols_list FROM A LEFT JOIN B USING (x);
-- Right join:
SELECT cols_list FROM B RIGHT JOIN A USING (x);
In MySQL, these should return the same result; hence, there is no difference between the two aforementioned statements.
There is a lot more about SQL and queries and it is beyond the scope of this book to dig even
deeper. So, let us conclude with a simple example of finding values in table A that have no match
in table B (assuming that both tables A and B share a column with the name x).
-- Option 1 (slower):
SELECT A.* FROM A LEFT JOIN B USING (x)
WHERE B.x IS NULL;
-- Option 2:
SELECT A.* FROM A
WHERE A.x NOT IN (SELECT x FROM B);
Note that the CROSS JOIN does the same as the Cartesian product,10 so there is no performance improvement or any other gain in choosing one form over the other. Note that it is possible to omit the ON A.a = B.b statement.
As you will gradually make more and more complex queries, you will end up with many
tables and fields in your query. Here are some hints to keep it tidy:
2. tables can be named to be used in the query in a very similar way with the keyword
AS: long_table_name AS t allows us to address the table using the alias “t”;
3. many fields from many tables can have the same name; in case there can be doubt, one should make clear from which table the field comes with the dot-separator. For example, to make clear that the field “fld” is from table A and not from B, one types A.fld.
For example:
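A sketch combining these hints:
SELECT A.pen_name, B.title
FROM tbl_authors AS A, tbl_books AS B
WHERE A.author_id = B.author;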
Please note that the SQL implementation in MySQL is also Turing complete. It knows, for example, functions, can read from files, etc. Further information can be found at https://dev.mysql.com/doc.
10 The Cartesian product is the same as “all possible combinations.” So, if A has 100 rows and B has 200 rows, the result has 20 000 rows.
This is a great day. We finally received our copy of Hadley Wickham and Garrett Grolemund’s book “R for Data Science” and we want to add it to our library. However, we can enter only one reference to one author in our library. After a brainstorm meeting, we come up with the following solutions:
1. Pretend that this book did not arrive, send it back or make it disappear: adapt reality to the
limitations of our computer system.
2. Just put one of the two authors in the system and hope this specific issue does not occur
too often.11
3. Add a second author-field to the table tbl_books. That would solve the case for two
authors, but not for three or more.
4. Add 10 additional fields as described above. This would indeed solve most cases, but we
still would have to re-write all queries in a non-obvious way. Worse, most queries will just
run and we will only find out later that something was not as expected. Also, we feel that
this solution is not elegant at all.
5. Add a table that links the authors and the books. This solution would allow us to
record anything from zero to a huge number of authors. This would be a fundamentally dif-
ferent database design, and if the library software were already written12, this solution
might not pass the Pareto rule.
We will demonstrate how the last solution (solution 5) could work. We choose this one because
it will allow us to show how complicated such a seemingly simple thing can be for the database
admin. However, rewriting all applications that use the data would be a lot more work. This under-
lines the importance of a good database design up-front, and this in turn demonstrates how
important subject matter knowledge is.13
First, we need to update the database design. The database design is a framework that will
guide us on what to do. The solution is illustrated below in Figure 14.4 on page 245:
11 Indeed, that is not necessarily a bad or lazy approach. In business it is worth keeping the “Pareto rule” (or the
“80/20 rule”) in mind. It seems that software development does not escape this rule and that by solving the top 20%
of cases, 80% of the real-world problems are solved.
12 In Section 14.4 “Querying the Database” on page 239 we have shown how to query the database. Any software
that uses the tables will hence depend on the structure of the tables. Therefore, changing the database design is not
something that can be done lightly and will have a massive impact on all software using the database.
13 People that are subject matter experts are often referred to as “SMEs” – this is confusing, as the same abbreviation
is also commonly used for “small and medium-sized enterprises.”
Figure 14.4: The improved database scheme that allows multiple authors to co-author one book by
adding a table tbl_author_book between the table with books and the table with authors. Now,
only the pair author/book has to be unique.
Note that while many RDBMS are ACID compliant and hence do a good job in making sure
that no conflicts occur, in our specific case, we first have to delete the relationship PK/FK and then
rebuild it in another table. The type of operations that we are performing here (deleting tables,
redefining links, etc.) are not covered by ACID compliance. Redesigning the database will leave
the database vulnerable in the time in between. So, it would be wise to suspend all transactions
in the meanwhile in one way or another (e.g. allow only one user from a specific IP to access the
database, etc.).
However, we think that – for our purpose here – there is another, elegant workaround: first
build the new tables, insert the relevant data, then enforce referential integrity, and only then dis-
connect the old fields. We can do this during the lunch break and hope that all will be fine.
Further information
The solution proposed here will generally not be fine, because transactions might occur
while we are altering the tables and this will mess up the database. So, while we decided
not to lock tables in this example, it can be done as follows:
SET autocommit=0;
LOCK TABLES tbl_authors read, tbl_books write;
-- Alter design and build referential information.
-- So, here would come the next code block.
COMMIT;
UNLOCK TABLES;
Listing 14.18: This code first creates the table tbl_author_book and then inserts the information
that was already in the database into that table. Finally, it discards the old information.
It is, of course, also necessary to update all software that uses these tables. For example,
the field author can no longer be a simple text field. The user interface must now allow the user to
select more than one author, and the software must no longer push this information to the table
tbl_books, but rather to the table tbl_author_book.
Also retrieving information will be different: finding all the books of a given author, for example,
now requires a join over the table tbl_author_book.
that you might know from the CLI in Linux, Windows NT, or DOS. If you were working with MS
Access instead, you would be able to use the asterisk, because the particular dialect implemented in MS
Access does not follow the SQL standard.
tbl_author_book does not hold two PKs; rather, the two fields combined form one PK. It is
entirely possible to use this combined key as a FK in other tables. However, this makes
querying the database more difficult. That is the reason why we introduced the ab_id
field.
Now, it is finally possible to add the book of Hadley and Garrett. To do this, we will need to
add the book and the authors and then, in a second step, add two entries in tbl_author_book.
Note that the aforementioned code might not be entirely foolproof in a production environ-
ment where e.g. 50 000 payments per second are processed. The function LAST_INSERT_ID()
will keep the value of the last insert into an auto_increment column that has been fed with
NULL. So, even if some other process added a book but did not feed NULL into the field
book_id, this will keep its value. Since there is only one such register, we need to store the result in
variables.
Mileage may vary with the particular dialect of your SQL server; however, MySQL, MariaDB, Oracle,
PostgreSQL, etc. all have many more features. In fact, the SQL standard is Turing complete, which
means that any computable process can be expressed in SQL. In this section we will have a quick walk-
through and illustrate some features that come in handy.
Note that functions can also be used within queries to transform or update data:
-- Functions can also be used in queries:
DROP FUNCTION IF EXISTS tooSimple;
CREATE FUNCTION tooSimple (str1 CHAR(10), str2 CHAR(50))
  RETURNS CHAR(60) DETERMINISTIC
  RETURN CONCAT(str1, str2, '!');

-- A UNION query combines two separate queries (column names must match):
SELECT tooSimple('Hello, ', full_name) AS myMSG FROM tbl_authors
  WHERE ISNULL(death_date)
UNION
SELECT tooSimple('RIP, ', full_name) AS myMSG FROM tbl_authors
  WHERE NOT ISNULL(death_date);
-- Output:
+-----------------------------------------+
| myMSG |
+-----------------------------------------+
| Hello,Erika Leonard! |
| Hello,Euclid of Alexandria! |
| Hello,Bernard Marr! |
| Hello,Bart Baesens! |
| Hello,Philippe J.S. De Brouwer! |
| Hello,Hadley Wickham! |
| Hello,Garrett Grolemund! |
| RIP,Valentin Louis G. E. Marcel Proust! |
❦ | RIP,Miguel de Cervantes Saavedra! | ❦
| RIP,James Augustine Aloysius Joyce! |
| RIP,Isaac Newton! |
+-----------------------------------------+
Listing 14.22: Just a little taste of some additional features in SQL. We encourage you to learn more
about SQL. This piece of code introduces functions, variables, and the UNION query.
In SQL, it makes sense to change the delimiter before creating a function. That allows
you to use the delimiter ; inside the function body without triggering the server to execute the
command as soon as it encounters a ; .
With this section, we hope that we encouraged you to understand databases a little better;
that we were able to illustrate the complexity of setting up, maintaining, and using a professional
database system; but above all that it is now clear how to get data out of those systems with SQL.
However, it would be inefficient to have to go to the SQL server, select data, export it to a CSV
file, and then import this file in R. In Chapter 15 “Connecting R to an SQL Database” on page 253
we will see how to import data from MySQL directly into R, and further use our knowledge of data
manipulation on the database server and/or within R.
It is very time-consuming to rebuild a database, and making a backup is a little more tricky
than copying a file to another hard disk. Apart from solutions that involve backup servers
for redundancy, for the private user an “SQL-dump” is the way to go. An SQL-dump will
copy all tables – complete with their definitions and data – to a text file on your computer.
A complete SQL-dump will even include users and access rights. This backup comes in
handy when something goes wrong: restoring it is a matter of minutes instead of weeks.
The following code shows how this can be done for MySQL from the Linux command
prompt:
# Make the dump for your databases only:
mysqldump -u root -p --databases library another_DB > dump_file.sql
# Or for all databases:
mysqldump -u root -p --all-databases > dump_file.sql
The SQL-dump is a text file; we encourage you to open it and learn from it.
This section barely scratched the surface of databases and SQL. It is a good idea to start
your own simple project and learn as you go, or to find a book, online resource, or course
about SQL and discover many more features, such as ordering data with ORDER BY,
the grouping command GROUP BY with the additional HAVING clause (the WHERE
clause for grouped data), the ways to upload files with data and commands (the command
SOURCE), etc.
As mentioned earlier, the documentation of MySQL is a good place to start.
♣ 15 ♣
Connecting R to an SQL
Database
In this section we will study one example of an RDBMS: MySQL. There are good reasons for this
choice: it is widely used as a back-end of many websites, and it is fast, reliable, and free. However,
there are many databases available. In general, a good starting point is the package RODBC.
Hint – RODBC
Some databases have their own drivers for R, others do not. It might be a good idea to have
a look at the package RODBC. It is a package that allows you to connect to many databases via
the ODBC protocol.a
a ODBC stands for Open Database Connectivity and is the standard API for many DBMS.
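A minimal sketch of what working via RODBC could look like (the DSN name and the credentials are placeholders, and an ODBC driver must be configured on your system):
library(RODBC)
# Open a connection via a data source name (DSN) configured in the OS:
con <- odbcConnect("my_dsn", uid = "librarian", pwd = "librarianPWD")
# Send a query and get the result back as a data frame:
df <- sqlQuery(con, "SELECT pen_name, birth_date FROM tbl_authors;")
# Always close the connection when done:
odbcClose(con)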
In many cases, one will find specific drivers (or rather APIs) for the DBMS that is being used.
These interfaces come in the familiar form of packages. For MySQL or MariaDB, this is RMySQL.
Digression – MariaDB
As mentioned in the previous section, MariaDB is – almost – a drop-in replacement for
MySQL. There are some differences in the configuration tables, how data is displayed,
the command prompt (in MariaDB you will see the name of the database; CTRL+C is to
be avoided in the MariaDB client), etc. However, so far, most things that only rely
on APIs, such as phpmyadmin and RMySQL, work for both database engines.
With the package RMySQL, it is possible to connect to both MariaDB and MySQL in a conve-
nient way and copy the data to R for further analysis. The basics of the package are to create a
connection variable first and then use that connection to retrieve data.
# install.packages('RMySQL')
library(RMySQL)
# connect to the library
con <- dbConnect(MySQL(),
user = "librarian",
password = "librarianPWD",
dbname = "library",
host = "localhost"
)
Now, we have the connection stored in the object con and can use this to display data about
the connection, run queries, and retrieve data.
show(con)
summary(con, verbose = TRUE)
# dbGetInfo(con) # similar as above but in list format
dbListResults(con)
dbListTables(con) # check: this might generate too much output
# get data
df_books <- dbGetQuery(con, "SELECT COUNT(*) AS nbrBooks
FROM tbl_author_book GROUP BY author;")
There are a few other good reasons to wrap the database connection in functions:
1. This will make sure that the connection always gets closed.
2. We can design our own error messages with the functions try() and tryCatch().
3. We can make sure that we do not keep a connection open too long. This is important because
RDBMS are multi-user environments that can come under heavy load, and hence it is bad
practice to keep connections to SQL servers open longer than needed.
The code below does the same as the aforementioned code, but with our own custom functions
that wrap opening the connection, running the query, returning the data, and closing
the connection. We strongly recommend using this version.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:18pm Page 255
❦
# db_get_data
# Get data from a MySQL database
# Arguments:
# con_info -- MySQLConnection object -- the connection info to
# the MySQL database
# sSQL -- character string -- the SQL statement that
# selects the records
# Returns
# data.frame, containing the selected records
db_get_data <- function(con_info, sSQL){
con <- dbConnect(MySQL(),
user = con_info$user,
password = con_info$password,
dbname = con_info$dbname,
host = con_info$host
)
df <- dbGetQuery(con, sSQL)
dbDisconnect(con)
df
}
# db_run_sql
# Run a query that returns no data in an MySQL database
# Arguments:
# con_info -- MySQLConnection object -- open connection
# sSQL -- character string -- the SQL statement to run
db_run_sql <-function(con_info, sSQL)
{
con <- dbConnect(MySQL(),
user = con_info$user,
password = con_info$password,
❦ dbname = con_info$dbname, ❦
host = con_info$host
)
rs <- dbSendQuery(con,sSQL)
dbDisconnect(con)
}
Assume that we want to generate a histogram of how many books our authors have in our
library. The functions that we defined in the code segment above will help us to connect to the
database and get the data into R; finally, we can use the functionality of R to produce the histogram
as in Figure 15.1 on page 256 with the following code:
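A minimal sketch of such code, reusing the db_get_data() wrapper defined above (the connection details are the same placeholders as before):
con_info <- list(user     = "librarian",
                 password = "librarianPWD",
                 dbname   = "library",
                 host     = "localhost")
# Count the number of books per author and pull the result into R:
df <- db_get_data(con_info,
        "SELECT COUNT(*) AS nbrBooks FROM tbl_author_book GROUP BY author;")
# Plot the histogram with base R:
hist(df$nbrBooks)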
Figure 15.1: Histogram generated with data from the MySQL database.
Hint – Clearing the query cache
In case MySQL starts responding erratically and slows down, it might be useful to clear
the query cache; MySQL provides commands to do this for the whole server.
PART IV
Data Wrangling
♥
Data wrangling refers to the process of transforming a dataset into a form that is easier to under-
stand and easier to work with. Data wrangling skills will help you to work faster and more
efficiently. Good preparation of the data will also lead to better models.
We will use data wrangling in its narrow sense: transforming data. Some businesses will use
the word in a larger sense that also includes data visualization and modelling. This might be
useful if the “Data and Analytics Team” will work on data end-to-end (from collecting the data
from the database systems up to the final presentation). We will treat this in a separate section of
this book: Part VII “Reporting” on page 685.
In many companies, “data wrangling” is used as somehow equivalent to “building a data-
mart.” A data-mart is something like a supermarket for data, where the modeller can pick
up the data in a format that is ready to use. The data-mart can also be seen as the product of
data wrangling.
Data wrangling, just as modelling and writing code, is as much a form of art as it is a science.
Wrangling in particular cannot be done without knowing the steps before and the steps after: we
need to understand the whole story before we can do a good job. The main goal is transforming the
data that is obtained from the transaction system or data-warehouse into such a form that it becomes
directly useful for making a model.
For example, consider that we are making a credit scorecard in a bank. Then our data might
contain the balances of current accounts of the last 24 months. These will be just numbers with
the amount that people have on their account. While it is not impossible to feed this directly into a
neural network or even a linear regression, the results will probably not be robust, and the model
will probably be over-fit.1 It makes much more sense to create new variables that contain a certain
meaningful type of information. For example, we could create a variable was_overdrawn that is
1 when at least one of the balances was negative and 0 otherwise:
IF any_balance < 0
THEN was_overdrawn = 1
ELSE was_overdrawn = 0
END IF
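In R, this could look like the following minimal sketch (balance_24m is an illustrative vector holding the monthly balances):
# Illustrative monthly balances (the values are made up):
balance_24m   <- c(1200, 850, -30, 400)
# 1 if at least one balance was negative, 0 otherwise:
was_overdrawn <- as.integer(any(balance_24m < 0))
was_overdrawn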
Or maybe you want to create a variable that indicates whether a person was overdrawn last year
as opposed to this year (maybe the financial problems were something from the past); or maybe
you prefer to have a variable that shows the longest period that someone was overdrawn (which
will not impact people that just forget to pay and do pay when the reminder arrives).
The choices that we make here do not only depend on what data we have (upstream reality),
but also on what the exact business model is (downstream reality). For example, if the bank earns a
nice fee from someone that forgets to pay for a week, then it is worth not sanctioning this behaviour.
In any case, you will have to “wrangle” your data before it is ready for modelling.2
However, more often you will be faced with the issue that real data is not homogeneous. For
example, your bank might have very few credits for people older than 80 years. Imagine that
those were all produced by one agent who was active in retirement homes. First of all, that agent
1 In wide data (data that contains many variables) it is quite possible to find spurious correlations – correlations that
appear to be strong but are rather incidental and limited to that particular cut of data.
2 At this moment, the practice of fitting a logistic model for creditworthiness resulting in a “credit score” has been
the standard way of working for so long that there is even some kind of understanding of what standard variables would
be needed in such a case. This has led to the concept “data-mart” as a standard solution. It is a copy of the data-
warehouse of the company with data transformed in such a way that it is directly usable for credit modellers. In these
logistic regressions it is also important to “bin” data in a meaningful way.
now is gone, and second, this overly aggressive pushing of loans has led to worse credits. So, how should we
treat these customers?
“Data-binning” (also known as “discrete binning” or “bucketing”) can alleviate these issues.
It is a data pre-processing technique used to reduce the effects of minor observation errors. The
original values, which fall in a given small interval, a “bin,” are replaced by a value representative
of that interval, often the central value. It is a form of quantization of continuous data.
Statistical data binning is a way to group a number of more or less continuous values into a
smaller number of “bins.” For example, if you have data about a group of people, you might want
to arrange their ages into a smaller number of age intervals (for example, grouping every five years
together). It can also be used in multivariate statistics, binning in several dimensions at once.
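In R, such binning can be done with the function cut(); a minimal sketch with an illustrative vector of ages:
# Illustrative ages:
ages     <- c(23, 37, 41, 58, 62, 79)
# Bin them into five-year intervals:
age_bins <- cut(ages, breaks = seq(20, 80, by = 5), include.lowest = TRUE)
table(age_bins)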
For many statistical methods, you might want to normalize your data (have all variables rang-
ing between 0 and 1). This makes sure that the model is easy to understand and can in some
cases even help algorithms converge faster.
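A minimal sketch of such a min-max normalization in R (the helper function is illustrative):
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
normalize(c(10, 15, 20))   # 0.0 0.5 1.0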
All this involves changing all records or adding new ones, and all these examples are a form of “data
wrangling” or data manipulation.
Usually, you or someone else will have to get data out of an RDBMS. These systems sit on
very efficient servers and are optimized for speed. Just as R, they have their strengths and
weaknesses. Doing simple operations on a large amount of data might be faster there than in your
R application. So, use your knowledge from Chapter 14 “SQL” on page 223 to prepare as
much data as possible on these servers. This might make a significant difference in how
fast you will get the model ready.
In this part, we will briefly discuss all of those aspects, but we will start with a particular issue
that is most important for privacy: anonymous data.
♣ 16 ♣
Anonymous Data
Before we can start the real work of manipulating data in order to gain more information from
it, there might be the need to reduce the information content first and anonymise it. This step
should be done before anything else, otherwise copies of sensitive data can be lingering around
and will be found by people who should not have access to it. The golden rule to remember is:
only take data that you might need and get rid of confidential or personal data as soon as possible
in the process.
It is best to anonymise data even before importing it in R. Typically, it will sit in an SQL
database, and then it is easy to scramble the data before it even leaves this database:
-- Using AES 256 for example:
MariaDB [(none)]> SELECT AES_ENCRYPT("Hello World", "secret_key_string");
+-------------------------------------------------+
| AES_ENCRYPT("Hello World", "secret_key_string") |
+-------------------------------------------------+
| Ew*0W5% |
+-------------------------------------------------+
1 row in set (0.00 sec)

-- Example:
SELECT AES_ENCRYPT(name, "secret_key_string"),
       AES_ENCRYPT(phone_number, "secret_key_string"),
       number_purchases, satisfaction_rating, customer_since, etc.
  FROM tbl_customers;
Imagine that despite your request for anonymised data, you still got data that would
allow you to identify a person. In that case, R will also allow us to hash confidential data. This can
be the moment to get creative1 (the less conventional your encryption is, the more challenging
it might be to crack), or – as usual in R – to “stand on the shoulders of giants” and use, for example,
the package anonymizer. This package has all the tools to encrypt and decrypt to a very high
standard, and might be just what you need.
1 A nice example is the set of ideas of Jan Górecki, published here: https://jangorecki.github.io/blog/
2014-11-07/Data-Anonymization-in-R.html
There is also the package sodium, created by Jeroen Ooms. It is a wrapper around libsodium,
which is a standard library. So you will need to install this first on your operating system (OS).
• deb: libsodium-dev (Debian, Ubuntu, etc.)
This means – most probably – that you will first need to open a terminal and run the following
command in the CLI (command line interface) of your OS:
sudo apt-get install libsodium-dev
Then we can open R and install and load the sodium library for R.
# install.packages('sodium')
library(sodium)
key <- sha256(charToRaw("my secret passphrase"))  # example key derived from a passphrase
msg <- serialize("confidential data", NULL)       # the data as a raw vector
# Encrypt:
msg_encr <- data_encrypt(msg, key)
# Decrypt and check that we recover the original:
orig <- data_decrypt(msg_encr, key)
stopifnot(identical(msg, orig))
If you are a programmer or work on a large R-project and you need a secret space in a public
code library, then a good place to start looking is the package secure from Hadley Wickham.
When there is the need to encrypt entire files, then you might use rcrypt, which provides
high-standard symmetric encryption using GNU Privacy Guard (GPG).2 The default encryption
mechanism is AES256, but at the user's request it can also do “Camellia256,” “TWOFISH,” or
“AES128.” However, last time we checked, it did not work with the latest version of R, and since it is
only a wrapper around gpg, it makes sense to use the OS and use gpg directly.
This is only a tiny sliver of what encryption is and how to use it. Encryption can be a challeng-
ing subject and is a specialisation in itself. For most data scientists and modellers, it is far better
to make sure that we get anonymised data.
Also note that encryption is a vast territory that is in rapid expansion. This is a place to watch
on the Internet and make sure that you are up-to-date with the latest developments. Especially
now that we are on the verge of a revolution spurred by the dawn of quantum computers, one
might expect rapid developments in the field of encryption.
2 GPG or GnuPG (which stands for “GNU Privacy Guard”) is a complete and free implementation of the
OpenPGP standard as defined by RFC4880 (also known as PGP – which stands for “pretty good privacy”). GnuPG
allows you to encrypt and sign your data and communications; it features key management and access modules for all
kinds of public key directories.
It is beyond the scope of this book to provide a solid introduction to the subject of
cryptology. There are many good books about cryptography, and some provide great
overviews, such as Van Tilborg and Jajodia (2014). Don't forget to read Phil Zim-
mermann's paper “Why I wrote PGP” (see https://www.philzimmermann.com/EN/
essays/WhyIWrotePGP.html). But above all, I'm sure that you will enjoy reading “The
Joy of Cryptography” by Mike Rosulek – it is a free book, available online, and part
of the “open textbook initiative” of the Oregon State University – http://web.engr.
oregonstate.edu/~rosulekm/crypto.
♣ 17 ♣
An iconic effort to obtain a certain form of standardization is the work around the tidyverse.
The tidyverse defines “tidy data” and “tidy code” in a logical and compelling manner. Adhering to
those rules makes code more readable, easier to maintain, and more fun to build.
It is possible to program in R without the tidyverse – as we did in Chapter 4 “The Basics of R”
on page 21 – and it is important to know and be able to read and understand base-R code.
Knowing both, we invite you to make your own, informed choice.
In Chapter 7 “Tidy R with the Tidyverse” on page 121 we already introduced the tidyverse and
largely explained operators and ideas. In this chapter, we focus on obtaining “tidy data.”
# db_run_sql
# Run a query that returns no data in an MySQL database
# Arguments:
# con_info -- MySQLConnection object -- containing the connection
# info to the MySQL database
# sSQL -- character string -- the SQL statement to be run
db_run_sql <-function(con_info, sSQL)
{
con <- dbConnect(MySQL(),
user = con_info$user,
password = con_info$password,
dbname = con_info$dbname,
host = con_info$host
)
rs <- dbSendQuery(con,sSQL)
dbDisconnect(con)
}
We can now use those functions to connect to the database mentioned above. Since we want
to use the tidyverse and its functionalities, we load this first. At the end of this code, the user
will notice the function as_tibble(). That function coerces a two-dimensional dataset into a
tibble. A tibble is the equivalent of a data.frame in the tidyverse.
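A minimal sketch of such code, reusing the wrapper functions and connection details from Chapter 15 (the object names books and t_books are illustrative):
library(tidyverse)
library(RMySQL)

# Connection details for our library database (as in Chapter 15):
con_info <- list(user = "librarian", password = "librarianPWD",
                 dbname = "library", host = "localhost")

# Download the tables that we need (plain data frames):
authors <- db_get_data(con_info, "SELECT * FROM tbl_authors;")
books   <- db_get_data(con_info, "SELECT * FROM tbl_books;")

# Coerce to a tibble where we want to work in the tidyverse:
t_books <- as_tibble(books)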
str(authors)
## 'data.frame': 11 obs. of 5 variables:
##  $ author_id : num  1 2 3 4 5 7 11 13 14 15 ...
##  $ pen_name  : chr  "Marcel Proust" "Miguel de Cervantes" "James Joyce" "E. L. James" ...
##  $ full_name : chr  "Valentin Louis G. E. Marcel Proust" "Miguel de Cervantes Saavedra"
##                     "James Augustine Aloysius Joyce" "Erika Leonard" ...
##  $ birth_date: chr  "1871-07-10" "1547-09-29" "1882-02-02" "1963-03-07" ...
##  $ death_date: chr  "1922-11-18" "1616-04-22" "1941-01-13" NA ...
Compared to base R (or the functions provided by the package utils), the functions from readr
have the following distinct advantages.
• They are faster (which is important when we are reading in a larger dataset – for example,
high-frequency financial markets data with multiple millions of lines, or the four billion
payments that the larger banks push through in a day).
• When things take a while, the readr functions will show a progress bar (that is useful
because we see that something is progressing and will not be tempted to think that the computer
is stuck and interrupt the process).
• They are more “strong and independent”: they leave fewer choices to the operating system
and allow us to fix encodings, for example. That helps to make code reproducible. This is
especially important in regions such as Belgium, where many languages are spoken, or in
large organisations where a model reviewer or modeller might get a model from China –
prepared on Windows 10 – one day, and the next day one from France, prepared on Linux.
• They adhere to the tidyverse philosophy and hence produce tibbles. Doing so, they also
adhere to the philosophy of not naming rows1, not shortening column names, not coercing
strings to factors, etc.
In an industrial environment, most of those issues are important, and it did indeed make sense
to implement new functions. If you have the choice, and there is more new code than legacy code,
then it is worth moving towards the tidyverse and embracing its philosophy, using all its packages and
rules. It is even worth rewriting older code, so that ten years from now it is still possible to
maintain it.
The functions of readr adhere to the tidyverse philosophy, and hence work is passed on from
more general functions to more specialized functions. The functions that the typical reader will use
(from the read-family that read in a specific file type) all rely on the basic functions that convert
text to data types: the parser functions.
There is a series of parser functions that generate columns, which are then combined in a series
of read-functions. These read-functions are specialized for a certain type of file.
The parser functions will convert a string (from a flat file or any other input given to them)
into a more specialized type. This can be a double, integer, logical, string, etc.
1 The issue with naming rows is illustrated in Chapter 8.5 “Creating an Overview of Data Characteristics” on
page 155. In that section, we need the brand names of the cars in the mtcars database. This is data and hence should
be in one of the columns, but it is stored as the names of the rows.
parse_number(nbrs)
## [1] 10000.0 12.4
parse_date(s_dte)
## [1] "2018-05-03"
There is even a function to guess what type suits best: guess_parser(), which is used in the
function parse_guess():
parse_guess(v)
## [1] "1.0" "2.3" "2.7" "."
parse_guess(v, na = ".")
## [1] "1.0" "2.3" "2.7" NA
parse_guess(s_dte)
## [1] "2018-05-03"
guess_parser(v)
## [1] "character"
guess_parser(v[1:3])
## [1] "double"
guess_parser(s_dte)
## [1] "date"
guess_parser(nbrs)
## [1] "character"
There are also parsers for logical types, factors, dates, times and date-times. Each parse_*()
function will collaborate with a col_*() function to puzzle tibbles together from the pieces of
the flat file read in.
This is only the tip of the iceberg. There are a lot more functions catering for a lot
more complicated situations (such as different encodings, or local habits such as using a
point or a comma as decimal separator). A good place to start is the documentation of the tidyverse:
https://readr.tidyverse.org/articles/readr.html
These parser functions are used in the functions that can read in entire files. These functions
are all of the read-family. For example, the function to read in an entire csv-file is read_csv().
s_csv = "'a','b','c'\n001,2.34,.\n2,3.14,55\n3,.,43"
read_csv(s_csv)
## # A tibble: 3 x 3
## `'a'` `'b'` `'c'`
## <chr> <chr> <chr>
## 1 001 2.34 .
## 2 2 3.14 55
## 3 3 . 43
There is a function read_csv2() that will use the semicolon as separator. This is useful
for countries that use the comma as decimal separator.
If the guesses of readr are not what you want, then it is possible to over-ride
these guesses. A good workflow is to start by asking readr to report on the guesses and to modify
these only where necessary. Below we illustrate how this can be achieved.
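A minimal sketch, using the string s_csv defined above, is to ask readr for the column specification it would guess:
# Report the guessed column types without reading in the full data:
spec_csv(s_csv)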
Checking the data types before the data import makes a lot of sense for large databases,
where the import can take a very long time.
The output obtained via the function spec_csv() can serve as a template that can be fed into
the function read_csv() to over-ride the defaults guessed by readr. For example, we can coerce
column “c” to double.
read_csv(s_csv, na = '.',
col_names = TRUE,
cols(
`'a'` = col_character(),
`'b'` = col_double(),
`'c'` = col_double() # coerce to double
)
)
## # A tibble: 3 x 3
## `'a'` `'b'` `'c'`
## <chr> <dbl> <dbl>
## 1 001 2.34 NA
## 2 2 3.14 55
## 3 3 NA 43
The package readr comes with a particularly challenging csv-file as example. Import this
file to R without losing data and with no errors in the variable types.
Here is a hint on how to get started:
# Start with:
t <- read_csv(readr_example("challenge.csv"))
Fixed-width tables are in essence text files where each column has a fixed width.
They look a little like the output that MySQL gave us in Chapter 15 “Connecting R to an SQL
Database” on page 253.
The package utils provides read.fwf(), and in the tidyverse it is the package readr that
comes to the rescue with the function read_fwf(). This function is coherent with the philosophy
of the tidyverse, e.g. using an underscore in the name rather than a dot, allowing for piping, and
the output is a tibble.2
Importing data is a common task in any statistical modelling, and hence readr is part of the
core tidyverse and will be loaded with the command library(tidyverse). However, it is also
possible to load just readr:
# load readr
library(readr)
# Or load the tidyverse with library(tidyverse), it includes readr.
We also need a text file. read_fwf() can of course read in a file from any disk and you can
use any text editor to make the text-file with the following content:
book_id year title                                          genre
1       1896 Les plaisirs et les jour                       LITmod
2       1927 Albertine disparue                             LITmod
3       1954 Contre Sainte-Beuve                            LITmod
4       1922 A la recherche du temps perdu                  LITmod
5       1615 El Ingenioso Hidalgo Don Quijote de la Mancha  LITmod
6       1613 Novelas ejemplares                             LITmod
7       2011 Fifty Shades of Grey                           LITero
8       1687 Philosophiæ Naturalis Principia Mathematica    SCIphy
9       -300 Elements (translated)                          SCImat
10      2014 Big Data World                                 SCIdat
11      2016 Key Business Analytics                         SCIdat
12      2011 Maslowian Portfolio Theory                     FINinv
13      2016 R for Data Science                             SCIdat
It is also possible to define a string variable in R with the same content and, starting from this
string variable, create a text file that has the data in the fixed-width format.
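A minimal sketch of this step (the string books_str is illustrative and only shows the first rows):
# The fixed-width content as one string (remaining rows omitted here):
books_str <- paste(
  "book_id year title                         genre",
  "1       1896 Les plaisirs et les jour      LITmod",
  "2       1927 Albertine disparue            LITmod",
  sep = "\n")
# Write it to a text file in R's working directory:
writeLines(books_str, "book.txt")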
The previous code chunk has created the text file book.txt in the working path of R. Now,
we can read it back in to illustrate how the read_fwf() function works.
2 More about this issue is in Chapter 17.1.2 “Importing Flat Files in the Tidyverse” on page 267.
Note that this last method fails: it identifies a separate column for the word “Mathematica”,
while this is actually part of the column “title”:
print(t3)
## # A tibble: 9 x 5
## X1 X2 X3 X4 X5
## <dbl> <dbl> <chr> <chr> <chr>
## 1 1 1896 Les plaisirs et les jour <NA> LITm~
## 2 2 1927 Albertine disparue <NA> LITm~
## 3 3 1954 Contre Sainte-Beuve <NA> LITm~
## 4 8 1687 Philosophiæ Naturalis Prin~ Mathemati~ SCIp~
## 5 9 -300 Elements (translated ) <NA> SCIm~
## 6 10 2014 Big Data World <NA> SCId~
## 7 11 2016 Key Business Analytics <NA> SCId~
## 8 12 2011 Maslowian Portfolio Theory <NA> FINi~
## 9 13 2016 R for Data Science <NA> SCId~
It is even possible to read in compressed files and/or files from the Internet. Actually, files
ending in .gz, .bz2, .xz, or .zip will be automatically decompressed, while files start-
ing with http://, https://, ftp://, or ftps:// will be automatically downloaded.
Compressed files that are on the Internet will first be downloaded and then decompressed.
Now that we have seen how to import data from an RDBMS directly (Chapter 15 “Connecting
R to an SQL Database” on page 253) and how to import flat files (text files that contain data in a delimited
format or fixed width) in this section, it is time to make sure that the data that we have is tidy.
While we already introduced the tidyverse in Chapter 7 “Tidy R with the Tidyverse” on page 121,
in this section we will focus on the concept of “tidy data.”
The tidyverse is a collection of packages for R that all adhere to a certain set of rules. It is quite hard
to make functions, scripts, and visualizations work universally, and it is even harder to make code
readable for everyone. Most people will encode a dataset as columns. The header of a column
is the name of the variable, and each row has in that column a value (or NA). But why not store
variables in rows? Neither R nor any other programming language puts up barriers to doing so,
but it will make your code really hard to read, and functionalities such as plotting and matrix
multiplication might not deliver the results that you are seeking.
For example, in the dataset mtcars, one will notice that the names of the cars are not stored
in a column, but rather are the names of the rows.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1
## am gear carb l
## Mazda RX4 1 4 4 11.20069
## Mazda RX4 Wag 1 4 4 11.20069
## Datsun 710 1 4 1 10.31643
❦ ## Hornet 4 Drive 0 3 1 10.99134
❦
## Hornet Sportabout 0 3 2 12.57832
## Valiant 0 3 1 12.99528
Tidy data looks like the mtcars data frame, but will have the car names – which are
data – stored in a column. In summary, tidy data will always have
1. each variable in its own column,
2. each observation in its own row,
3. each type of observational unit in its own table, and
4. a value (or NA) in each cell – the intersection between row and column.
So, this is what tidy data looks like, but it is harder to describe what messy data looks like.
There are many ways to make data messy or “untidy”. Each data provider can have his/her own
way to produce untidy data.
A common mistake is, for example, putting data in the column names. Imagine a data
frame that looks as follows:
Unit    Q1   Q2   Q3   Q4
Europe  200  100  88   270
This dataset has columns that contain data: the moment of observation (quarter 1, etc.) in the
column names. Its tidy equivalent would look like:
Unit Quarter Sales
Europe Q1 200
Asia Q1 320
Europe Q2 100
Asia Q2 315
...
This is the type of data-frame that can be directly used in ggplot2 (that is part of the tidyverse
family). In Section 17.3.2 “Convert Headers to Data” on page 281 we will see how to make this
particular conversion with the function gather().
Another typical mistake is to store variable names in cells:
Unit    Dollar  Q1
Europe  sales   200
Europe  profit  55
Asia    sales   320
Asia    profit  120
...
This dataset has variable names stored in cells. In Chapter 17.3.3 “Spreading One Column Over Many” on
page 284, we will see that the tidyverse provides the function spread() to clean up such data.
There are many more ways to produce untidy data. So, we will revisit this subject and devote
a whole section – Chapter 17.3 “Tidying Up Data with tidyr” on page 277 – to tidying up data.
Now, we have some data in R, but long before we can start the cool work and build a neural
network or a decision tree, there is still a lot of “heavy lifting” to be done. At this point, the data
wrangling can really start, and it needs to precede the modelling. In the first place, we want our data
to be tidy. Tidy data will lead to neater code, the syntax will be easier to read for your future self
or someone else, and it will be more recognisable, which in turn will lead to more efficient
work and more satisfaction.
In defining tidy data we tie in with the ideas proposed by Hadley Wickham – Wickham et al.
(2014) – as explained in Chapter 17.2 “Tidy Data” on page 275.
Definition – Tidy data
Tidy data is data in which each variable is a column, each observation is a row, each type of
observational unit is a table, and each cell holds exactly one value (or NA).
In many cases, our data will not be very tidy. We often observe that:
• there are no columns that can be addressed (this means it is a “flat text file” or a fixed-width
table);
• a single table contains more than one observational unit (for example, we receive the inner
join of the database of customers, accounts, and balances);
• a single observational unit is spread out over multiple tables (e.g. the customer's address
details are in the table of current accounts and other details are in the table of credit cards);
• the same observational unit appears in more than one system (e.g. customers seen by the
credit card application, customers from the core-banking system, customers from the com-
plaints database);
• the column headers are values (and not variable names); for example, “sales Q1” and “sales
Q2” are actually different observations (rows: Q1 and Q2) of one variable “sales”;
• more than one variable is combined in one column (e.g. postcode and city name, first
name and last name, year-month, etc.);
• etc.
Notice that the above list is not in order of frequency of appearing in real life, but rather in
the order in which you would want to address the issues. Also, we did not mention missing data or wrong data,
for example. Those are not issues that can be solved by following the concept of tidy data.
The good news is that cleaning up data is not an activity that requires a lot of tools. The pack-
age tidyr provides a small set of functions to get the job done.
## --
## -- Load dplyr via tidyverse
library(tidyverse)
library(RMySQL)
# db_get_data
# Get data from a MySQL database
# Arguments:
# con_info -- MySQLConnection object -- the connection info
# to the MySQL database
# sSQL -- character string -- the SQL statement that
# selects the records
# Returns
# data.frame, containing the selected records
❦ db_get_data <- function(con_info, sSQL){ ❦
con <- dbConnect(MySQL(),
user = con_info$user,
password = con_info$password,
dbname = con_info$dbname,
host = con_info$host
)
df <- dbGetQuery(con, sSQL)
dbDisconnect(con)
df
}
# db_run_sql
# Run a query that returns no data in an MySQL database
# Arguments:
# con_info -- MySQLConnection object -- the connection info
# to the MySQL database
# sSQL -- character string -- the SQL statement
# to be run
db_run_sql <-function(con_info, sSQL)
{
con <- dbConnect(MySQL(),
user = con_info$user,
password = con_info$password,
dbname = con_info$dbname,
host = con_info$host
)
rs <- dbSendQuery(con,sSQL)
dbDisconnect(con)
}
3 The example of the library is developed in Chapter 14 “SQL” on page 223 and also used in Chapter 15 “Connecting
R to an SQL Database” on page 253.
It is not uncommon to get a dataset such as t_mix. In some cases, this might be the most useful
form. However, imagine that we get a mix of the customer table and the table of loans. This table
would only allow us to make a prediction based on the rows in it. So, we would actually look at
what customers declare and what the loan is. But it is of course more powerful to get a true view of
the customer by aggregating all loans and calculating how much the person pays in loan instalments
every month as compared to his/her income.
To achieve that, we need to split the table into two separate tables. Assuming that the data is
correct and that there are no inconsistencies, this is not too difficult. In real life, it becomes
more difficult. For example, to get a loan, a customer declares an income of x. Two years later,
he/she applies again for a loan and declares x + δ. Is he or she mistaken? Is the new income the
correct one? Maybe there is another postcode? Maybe we need to make customer snapshots at the
moment of loan application, but maybe these are simply mistakes.
Of course, it is even possible that the current address as reported by the core banking system
is different from the address as reported by the loan system. Data is important and it is impossible
to create a good model when the data has too much noise.
The problem is often in communication. The agent needs to know the importance of correct
data, but the modeller also needs to understand how the data is structured in the transactional
system. That is the only way to ask for a suitable set of data. In many cases, it is sufficient to ask for
two separate tables to avoid getting a mix of tables, although the author has met crisis situations
where the bank simply had convoluted data, did not have the technical capacity to go back,
and we had to work from that.
Maybe the best solution is to upload your data back into an RDBMS and build a logical structure
there before downloading it again to R. However, R can also do this, as follows:
1. Understand the data structure, if need be talk to the data owners, and understand what
the job at hand is. In this case, it is a mix of four tables: authors, a link-table to books, books,
and genres.
nbr_auth$n - nbr_auth2$n
## [1] 3 1 0 0 0 0 0 0 0 0 3 1 0 0
2. Learn from experiments till we find the right structure. In our case, “book” is not unique
for an “author,” so we try again.
# Try without book:
nbr_auth2 <- t_mix %>%
count(author_id, pen_name, full_name, birth_date, death_date)
3. This looks better. But note that this exact match is only possible because our data is clean
(because we took care and/or because we asked MySQL to help us guard referential
integrity). We still have to determine which table takes which fields.
4. Now, the heavy lifting is done and we can simply extract all data.
6. Check the data and see once more if it all makes sense. In our case, we will want to correct
some of the data that has been imported and coerce it to the right type.
auth <- tibble(
author_id = as.integer(my_authors$author_id),
pen_name = my_authors$pen_name,
full_name = my_authors$full_name,
birth_date = as.Date(my_authors$birth_date),
death_date = as.Date(my_authors$death_date)
) %>%
unique %>%
print
## # A tibble: 10 x 5
## author_id pen_name full_name birth_date death_date
## <int> <chr> <chr> <date> <date>
## 1 1 Marcel Pr~ Valentin Lo~ 1871-07-10 1922-11-18
## 2 2 Miguel de~ Miguel de C~ 1547-09-29 1616-04-22
## 3 4 E. L. Jam~ Erika Leona~ 1963-03-07 NA
## 4 5 Isaac New~ Isaac Newton 1642-12-25 1726-03-20
## 5 7 Euclid Euclid of A~ NA NA
## 6 11 Bernard M~ Bernard Marr NA NA
❦ ## 7 13 Bart Baes~ Bart Baesens 1975-02-27 NA ❦
## 8 14 Philippe ~ Philippe J.~ 1969-02-21 NA
## 9 15 Hadley Wi~ Hadley Wick~ NA NA
## 10 16 Garrett G~ Garrett Gro~ NA NA
In our particular case, it seems that we still should clean up the full names of the authors. There seem to be
some random quote signs as well as newline (\n) characters. However, the quotes are due to how
print() works on a tibble, so only the newlines should be eliminated. The tools to do this are
described in Chapter 17.5 “String Manipulation in the tidyverse” on page 299. So, if the following
is not immediately clear, we refer to that chapter.
auth$full_name <- str_replace(auth$full_name, "\n", "") %>%
print
## [1] "Valentin Louis G. E. Marcel Proust"
## [2] "Miguel de Cervantes Saavedra"
## [3] "Erika Leonard"
## [4] "Isaac Newton"
## [5] "Euclid of Alexandria"
## [6] "Bernard Marr"
## [7] "Bart Baesens"
## [8] "Philippe J.S. De Brouwer"
## [9] "Hadley Wickham"
## [10] "Garrett Grolemund"
In this example, the years are part of the headers (column names). The function gather()
from tidyr helps to correct such a situation by moving multiple columns into one long column.
Apart from the tibble itself, gather() expects:
• the label of the new column to be created (the headers of the gathered columns will go there),
• the label of the new column that will contain the values that were in the old tibble, and
• the columns to gather.
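A minimal sketch (the tibble and its column names are illustrative): sales per period stored with the dates in the column headers, gathered into one long column.
t1 <- tibble(unit         = c("Europe", "Asia"),
             `2020-01-01` = c(100, 120),
             `2020-02-01` = c(140, 115))
t2 <- gather(t1, key = "date", value = "sales", -unit)
t2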
In this case, the problem is not yet really solved: we still have the dates in a format that is not
usable. So, we still need to convert the dates to a date format. This will be explained further in
Chapter 17.6 “Dates with lubridate” on page 314, but the following is probably intuitive.
The following code uses lubridate to convert the dates to the correct format and then plots
the results in Figure 17.1:
library(lubridate)
##
## Attaching package: ’lubridate’
## The following object is masked from ’package:base’:
##
## date
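A minimal sketch of those two steps, continuing with the illustrative tibble t2 from above:
t2$date <- ymd(t2$date)      # coerce the character dates to Date objects
plot(t2$date, t2$sales)      # plot the gathered data against the date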
Figure 17.1: Finally, we are able to make a plot of the tibble in a way that makes sense and allows
us to see the trends in the data.
library(dplyr)
sales_info <- data.frame(
time = as.Date('2016-01-01') + 0:9 + rep(c(0,-1), times = 5),
type = rep(c("bought","sold"),5),
value = round(runif(10, min = 0, max = 10001))
)
sales_info
## time type value
## 1 2016-01-01 bought 5582
## 2 2016-01-01 sold 1998
## 3 2016-01-03 bought 7951
## 4 2016-01-03 sold 6388
## 5 2016-01-05 bought 4382
## 6 2016-01-05 sold 1836
## 7 2016-01-07 bought 3675
## 8 2016-01-07 sold 4028
## 9 2016-01-09 bought 974
## 10 2016-01-09 sold 8081
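With spread(), the column type can be spread over two new columns bought and sold; a minimal sketch (the exact values depend on the runif() call above):
sales_info %>% spread(key = type, value = value)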
The functions spread() and gather() are each other's opposite: when applied sequen-
tially on the same data and variables, they cancel each other out.
sales_info %>%
spread(type, value) %>%
gather(type, value, 2:3)
## time type value
## 1 2016-01-01 bought 5582
## 2 2016-01-03 bought 7951
## 3 2016-01-05 bought 4382
## 4 2016-01-07 bought 3675
## 5 2016-01-09 bought 974
## 6 2016-01-01 sold 1998
## 7 2016-01-03 sold 6388
## 8 2016-01-05 sold 1836
## 9 2016-01-07 sold 4028
## 10 2016-01-09 sold 8081
The tibble that is the output of spread() is indeed easier to work with in a sensible
way. Things like the average of “bought” and “sold” are now easier to calculate. We have indeed
achieved that each column is a variable and each row an observation, so that each cell in our table
can be one data point.
library(tidyr)
turnover <- data.frame(
what = paste(as.Date('2016-01-01') + 0:9 + rep(c(0,-1), times = 5),
rep(c("HSBC","JPM"),5), sep = "/"),
value = round(runif(10, min = 0, max = 50))
)
turnover
## what value
## 1 2016-01-01/HSBC 34
## 2 2016-01-01/JPM 23
## 3 2016-01-03/HSBC 31
## 4 2016-01-03/JPM 46
❦ ## 5 2016-01-05/HSBC 44 ❦
## 6 2016-01-05/JPM 30
## 7 2016-01-07/HSBC 11
## 8 2016-01-07/JPM 27
## 9 2016-01-09/HSBC 9
## 10 2016-01-09/JPM 1
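A sketch of how separate() could split such a combined column into two (the new column names are illustrative):
turnover %>%
  separate(what, into = c("date", "bank"), sep = "/")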
When the function separate() gets a number as separator (via the argument sep =
n), then it will split after n characters. The separator actually takes a regular expression
(see more about this in Chapter 17.5.2 “Pattern Matching with Regular Expressions” on
page 302). Note also that the separator itself will disappear!
Now that our data is “tidy,” we can start the real work and transform data into insight. This step
involves selecting, filtering, and changing variables, as well as combining different tables.
In this section, we will use the example of the simple library that was created in Chapter 14
“SQL” on page 223 and downloaded to R in Chapter 15 “Connecting R to an SQL Database” on
page 253. It was a simple model of a library, with a table for authors, books, genres, and a table
between authors and books.
This functionality is provided by the library dplyr of the tidyverse. So, we will load it here
and not repeat this in every sub-section.
library(dplyr)
We first describe how to merge tables and columns and how to split them, before we show
selecting and filtering. Usually, it is necessary to use all techniques simultaneously. For
example, merging tables can lead to very large data frames and hence it is best to immediately
select only the columns that we will need.
17.4.1 Selecting Columns
Just as in SQL, it is not necessary to print or keep the whole table, and it is possible to ask R to
keep only certain columns. The function select() from dplyr does this in a consistent manner.
The first argument of the function select() is always the data frame (or tibble); then we can
feed it one or more columns that we want to keep.
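For example, a minimal sketch with the authors data frame downloaded earlier:
# Keep only the pen name and the birth date of each author:
select(authors, pen_name, birth_date)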
Note that all the functions of dplyr will always return a tibble, so they can be chained
with the pipe operator. Piping in R (with the tidyverse) is explained in Section 7.3.2
“Piping with R” on page 132.
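The same logic applies to filtering rows with filter(); a minimal sketch (the cut-off date is illustrative, and birth_date is still stored as text here):
# Keep only the authors born after 1900:
authors %>% filter(birth_date > "1900-01-01")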
Just as in SQL, the result will not include any of the rows that have missing birth dates.
When we loaded the tidyverse, it informed us that the package stats also has a function
filter and that it will be “masked.” This means that when we use filter(...), the
function from dplyr will be used. The function filter from stats can be accessed as
follows: stats::filter().
Another thing that we might want to check is if our assumed primary keys are really unique.4
authors %>%
count(author_id) %>%
filter(n > 1) %>%
nrow()
## [1] 0
There are indeed zero rows that would have more than one occurrence for author_id.
Which authors do have more than one book in our database? The answer can be found by
using count() and filter().
author_book %>%
count(author) %>%
filter(n > 1)
## # A tibble: 2 x 2
## author n
## <dbl> <int>
## 1 1 4
## 2 2 2
Note that
filter(count(author_book, author), n > 1)
does the same as the code in the text, but does not use the piping operator.
4 Note that we use the piping operator that is explained in Section 7.3.2 “Piping with R” on page 132.
17.4.3 Joining
If we want to find out which books can be found where, we have to join two tables. While
this can be done with the WHERE clause or the multiple variants of the JOIN clause in SQL,
in R there are also functions in dplyr to do this. dplyr provides a complete list of joins that are
similar to the familiar clauses in SQL. A list of them is in the documentation and can be accessed
by typing ?join at the R prompt.
First, we will distinguish mutating joins, which output fields of both data frames, from filtering
joins, which only output the fields (columns) of the left data frame. Within those two categories
there is a series of possibilities that are fortunately very recognizable from the SQL environment.
In the tidyverse, dplyr provides a series of join functions that all share a similar syntax:
1. mutating joins output fields of both data frames. Contrary to what the name suggests,
they do not mutate the tibbles on which they operate. dplyr provides the following
“mutating joins”.
• inner_join() returns all the columns of x and y, but only those rows that have
matching values in the field(s)/column(s) mentioned in the by-clause. Note that it is
possible that some of the join fields are not unique and hence that there are multiple
matches for the same record; in that case all combinations are returned.
• left_join() returns all the columns of x and y, so that all rows of x are returned
at least once: with the matching row of y if it exists, otherwise with NA (NULL in
SQL vocabulary). Matches are defined by the by-clause. Note that it is possible that
some of the join fields are not unique and hence that there are multiple matches for
the same record of x; in that case all combinations are returned.
• right_join() is similar to the previous one, but the roles of x and y are inverted. Hence, it
returns all rows from y, and all columns from x and y. Rows of y with no match in x are
still returned, but with NA values in the columns that come from x.
• full_join() returns all rows and all columns from both data frames x and y. Where
there is no matching value, it returns NA for the missing side.
2. Filtering joins only output the fields (columns) of the left data frame.
• semi_join() returns all rows from x for which there is a matching value in y, while only
keeping the columns of x. Note that, unlike an inner join, a semi join will never
duplicate rows of x.
• anti_join() returns all rows from x that do not have a matching value in y, while keeping
only the columns of x.
Since we took care that our library database respects referential integrity, most “mutating
join” types will give similar results.
In SQL, we would use nested joins to find out which authors can be found in which location. In
dplyr it works similarly, and the functions are conveniently named after the SQL keywords. On
top of that, the piping operator makes the work-flow easy to read.
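The code itself falls outside this excerpt; a sketch of the kind of pipeline that could build the data frame a3 of pen names and locations used further below. The join columns (author_id/author and book_id/book) and the column location in books are assumptions based on our library schema.
a3 <- authors %>%
  inner_join(author_book, by = c("author_id" = "author")) %>%
  inner_join(books, by = c("book" = "book_id")) %>%
  select(pen_name, location) %>%
  arrange(location)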
The default sort order of the function arrange() is increasing; to make it decreasing,
use the function desc() as follows:
arrange(desc(location))
Note that if we want to tell R to keep only rows with unique location values, the function
unique() is not the best tool. For example, we might have a list of authors and locations
(for one location we will find many authors in this case), but we might want to keep just
one sample author per location. To achieve that, we can do:
a3[!duplicated(a3$location), ]
## pen_name location
## 1 Marcel Proust 001.45
## 7 E. L. James 001.67
## 8 Euclid 100.53
## 9 Isaac Newton 200.43
## 10 Bernard Marr 205.13
## 14 Philippe J.S. De Brouwer 405.08
Note that the construct with duplicated() keeps only one line per location. The function
duplicated() is from base R and hence always available.
Note that what we are doing here is similar to the GROUP BY statement of SQL. This might
make the reader think that the function group_by() in R does the same. However, that is
misleading: group_by() by itself only changes the appearance of the data frame (the way it is
printed, for example). What it really does is group the data, so that subsequent data
manipulations can be done per group.
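A minimal sketch of group_by() followed by summarise() (assuming the books tibble has a column genre):
books %>%
  group_by(genre) %>%
  summarise(n_books = n())   # one row per genre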
Note – Short-cuts
Both in SQL and in R there are many short-cuts available. Imagine that we have two tables A
and B, and that both have a field z that links them.a In that case, it is not necessary to spell out the by argument.
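A minimal sketch of the two equivalent forms (A, B, and the shared column z are the hypothetical tables of this note):
# explicit join field:
inner_join(A, B, by = "z")

# short-cut: dplyr joins on all columns with the same name
A %>% inner_join(B)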
However, the short forms are more vulnerable. If the argument by is not supplied, it defaults
to NULL, and for dplyr this is a sign to join on all columns that have
the same name. So if we later add more columns with matching names, the join will silently change.
a We have chosen to add _id after a primary key, and where the same field is used as a foreign key, we
used it without that suffix. So in our tables, foreign and primary keys never have the same names.
In fact, we were lucky that the particular thing that we wanted to do has its own specific function. In
general, however, it is wise to rely on the function mutate() from dplyr. This function provides
a framework to add columns that contain new information calculated from existing columns.
In the simplest of all cases, we can just copy an existing column under a new heading (maybe
to make joining tibbles easier?):
genres %>%
mutate(genre = genre_id) %>% # add column genre
inner_join(books) %>% # leave out the "by="
dplyr::select(c(title, location))
## Joining, by = "genre"
## title
## 1 Maslowian Portfolio Theory
## 2 Fifty Shades of Grey
## 3 Les plaisirs et les jour
## 4 Albertine disparue
## 5 Contre Sainte-Beuve
## 6 A? la recherche du temps perdu
## 7 El Ingenioso Hidalgo Don Quijote de la Mancha
## 8 Novelas ejemplares
## 9 Big Data World
## 10 Key Business Analytics
## 11 R for Data Science
## 12 Elements (translated )
## 13 Philosophi\xe6 Naturalis Principia Mathematica
## location
## 1 405.08
## 2 001.67
## 3 001.45
## 4 001.45
## 5 001.45
## 6 001.45
5 An example is mentioned earlier: using the weekly balances of current accounts to calculate one parameter that
## 7 001.45
## 8 001.45
## 9 205.13
## 10 205.13
## 11 205.13
## 12 100.53
## 13 200.43
The function mutate() allows, of course, for much more complexity. Below we present an
example that makes an additional column for a short author name (one simple and one more
complex version) and finds out whether someone is alive (or whether we cannot tell).
It is probably obvious what the functions if_else() and str_sub() do; if you
want more information, we refer to Section 17.5 “String Manipulation in the tidyverse” on page 299.
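The book’s own example falls on a page that is not shown here; a minimal sketch of the idea (the columns pen_name, birth_date, and death_date of authors are assumptions):
library(stringr)

authors %>%
  mutate(short_name = str_sub(pen_name, 1, 12),
         alive      = if_else(is.na(death_date), "maybe", "no")) %>%
  select(short_name, alive, birth_date)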
Of course, the function mutate() also does arithmetic, such as new_column = old_column
* 2. The function mutate() understands all basic operators, and especially the following
functions are useful in mutate statements (a short sketch follows the list):
• arithmetic, relational, and logical operators, such as +, -, *, /, >, & (see Section 4.4
“Operators” on page 57);
• any base function such as log(), exp(), etc.;
• lead() and lag(), which find the next or previous value (e.g. to calculate returns between dates)
– see also the section about quantmod;
• cumsum(), cummean(), cummin(), cummax(), cumany(), and cumall() – from base R and
dplyr together – which form a complete set of cumulative aggregation functions;
• if_else() (as used before), recode() (a vectorised version of the switch command), and
case_when() (an elegant way to construct a chain of “else if” statements).
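A minimal sketch that combines arithmetic, lag(), and case_when() inside mutate() (the tibble prices is hypothetical):
library(tidyverse)

prices <- tibble(day = 1:5, close = c(100, 102, 101, 105, 104))
prices %>%
  mutate(return   = close / lag(close) - 1,     # lag(): previous value
         movement = case_when(
           is.na(return) ~ "first day",
           return > 0    ~ "up",
           return < 0    ~ "down",
           TRUE          ~ "flat"))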
If, in your particular case, you prefer to keep only the newly created columns and drop the existing ones, then the function transmute() is the tool of choice.
Do you need even more granular control over the data transformations, such
as when and where the transformation is applied? Then consider looking at the
documentation of the scoped variants of mutate() – mutate_all(), mutate_if(), and
mutate_at() – as well as the sister functions of the aforementioned transmute():
transmute_all(), transmute_if(), and transmute_at().
Consider for example the situation where we want a list of authors and a list of dates that
are five days before their birthday (so we can put those dates in our calendar and send them a
postcard).
authors %>%
transmute(name = full_name, my_date = as.Date(birth_date) -5)
## name my_date
## 1 Valentin Louis G. E. Marcel Proust 1871-07-05
## 2 Miguel de Cervantes Saavedra 1547-09-24
## 3 James Augustine Aloysius Joyce 1882-01-28
## 4 Erika Leonard 1963-03-02
## 5 Isaac Newton 1642-12-20
## 6 Euclid of Alexandria <NA>
## 7 Bernard Marr <NA>
## 8 Bart Baesens 1975-02-22
## 9 Philippe J.S. De Brouwer 1969-02-16
## 10 Hadley Wickham <NA>
## 11 Garrett Grolemund <NA>
The list of birthday warnings in the previous example is not very useful, as it also contains warnings
for deceased people as well as entries with missing dates. The function filter() allows us to select only the relevant rows.
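The exact code falls on the next page of the book; a sketch that is consistent with the output below (the column death_date in authors is an assumption):
authors %>%
  filter(!is.na(birth_date) & is.na(death_date)) %>%
  transmute(name = full_name, my_date = as.Date(birth_date) - 5)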
## name my_date
## 1 Erika Leonard 1963-03-02
## 2 Bart Baesens 1975-02-22
## 3 Philippe J.S. De Brouwer 1969-02-16
It is important not to confuse the function filter() with the filtering joins
semi_join() and anti_join(). The first, semi_join(), keeps the rows of x that have a match
in y (comparable to SQL’s WHERE EXISTS); the second, anti_join(), keeps the rows of x that
have no match in y (comparable to WHERE NOT EXISTS).
The function documentation, accessed via ?anti_join, is kept together so that you have
all these functions in one place.
The similarity with set operators makes it easy to understand how these functions work. The
simple example below helps to make clear what they do.
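The definitions of A and B fall outside this excerpt; a reconstruction that is consistent with the union_all() output below (and therefore an assumption) is:
# dplyr/tibble assumed to be loaded; reconstructed from the union_all() output
A <- tibble(col1 = c(1L, 2L, 3L, 4L))
B <- tibble(col1 = c(4L, 4L, 5L))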
union(A,B)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] 4 4 5
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:19pm Page 297
❦
union_all(A,B)
## # A tibble: 7 x 1
## col1
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 4
## 6 4
## 7 5
setdiff(A,B)
## # A tibble: 4 x 1
## col1
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
setequal(A,B)
## [1] FALSE
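From here on, the outputs suggest that A and B have been redefined with a second column; again a reconstruction that is consistent with the union_all() output below (an assumption):
A <- tibble(col1 = 1:4,           col2 = c("a", "a", "b", "b"))
B <- tibble(col1 = c(4L, 4L, 5L), col2 = c("b", "b", "c"))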
union(A,B)
## [[1]]
## [1] 1 2 3 4
##
## [[2]]
## [1] "a" "a" "b" "b"
##
## [[3]]
## [1] 4 4 5
##
## [[4]]
## [1] "b" "b" "c"
union_all(A,B)
## # A tibble: 7 x 2
## col1 col2
## <int> <chr>
## 1 1 a
## 2 2 a
## 3 3 b
## 4 4 b
## 5 4 b
## 6 4 b
## 7 5 c
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:19pm Page 298
❦
setdiff(A,B)
## # A tibble: 4 x 2
## col1 col2
## <int> <chr>
## 1 1 a
## 2 2 a
## 3 3 b
## 4 4 b
setequal(A,B)
## [1] FALSE
To use these functions, both data frames must have the same column headings, but not
necessarily the same number of rows!
Base R provides solid tools to work with strings and for string manipulation – see Section 4.3.9
“Strings or the Character-type” on page 54. However, as with many things in free software, they
have been contributed by different people at different times and hence lack consistency.
The tidyverse includes stringr, which in its turn is based upon stringi. It provides a solid
and fast solution that is coherent in its naming conventions and that, on top of that, follows the tidyverse philosophy.
library(tidyverse)
library(stringr)
# define strings
s1 <- "Hello" # double quotes are fine
s2 <- 'world.' # single quotes are also fine
Did you notice that the string functions of stringr in the previous example all started with
str_? Well, the good news is that they all start with the same letters. So, if you work in an
environment such as RStudio, you can see a list of all string functions by typing str_.
Do not confuse the function str_c() of stringr with the method c_str() of std::string in C++. For example, the following R code
s <- 'World'
str_c('Hello, ', s, '.')
## [1] "Hello, World."
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:19pm Page 300
❦
#include <iostream>
#include <string>
using std::string;

int main ()
{
  string s ("World");
  std::cout << "Hello, " << s.c_str() << "." << std::endl;
  // We admit that c_str() is not necessary here :-)
  return 0;
}
While the names of the functions c_str() and str_c() look very similar, they do something
very different. Of course, the concept of a C-string makes little sense in a high-level
language such as R.
# Use pipes:
sVector[4] %>%
str_sub(1,4) %>%
str_to_upper()
## [1] "PHIL"
It is also possible to reverse the situation and modify a string in place by assigning to str_sub().
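A minimal sketch of this assignment form:
s <- "Hello, World."
str_sub(s, 1, 5) <- "Howdy"
s
## [1] "Howdy, World."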
There are a few more special purpose functions in stringr. Below we list some of the most
useful ones.
Duplicate Strings
One of the simplest string manipulations is duplicating a string to form a longer one. Here we
ask stringr to repeat the building block of a shade of grey.
str <- "F0"
str_dup(str, c(2, 3)) # duplicate a string
## [1] "F0F0" "F0F0F0"
This helps to format tables nicely on the screen, or even in printed documents, when it is both
unnecessary and unaesthetic to show complete long strings (in a table).
str_trim(str,"left")
## [1] "1 " "abc"
## [3] "Philippe De Brouwer "
library(stringr) # or library(tidyverse)
sV <- c("philosophy", "physiography", "phis",
"Hello world", "Philippe", "Philosophy",
"physics", "philology")
So far, the regex works as normal string matching: the algorithm looks for an exact match
of the letters “Phi.” However, there is a lot more that we can do with regular expressions. For
example, we notice that some words are capitalized. Of course, we could coerce everything to capitals and
then match, but that is not the point here. In regex we would use the syntax “one letter from
the set (‘p’, ‘P’)” and then exactly “hi.” The way to instruct a regex engine to select one of several
possibilities is to separate them with | .
str_extract(sV, "(p|P)hi")
## [1] "phi" NA "phi" NA "Phi" "Phi" NA "phi"
# Or do it this way:
str_extract(sV, "(phi|Phi)")
## [1] "phi" NA "phi" NA "Phi" "Phi" NA "phi"
# Other example:
str_extract(sV, "(p|P)h(i|y)")
## [1] "phi" "phy" "phi" NA "Phi" "Phi" "phy" "phi"
# Is equivalent to:
str_extract(sV, "(phi|Phi|phy|Phy)")
## [1] "phi" "phy" "phi" NA "Phi" "Phi" "phy" "phi"
Note that it now extracts both “Phi” and “phi,” but does not change how it was written.
Now, imagine that we want all matches as before, except those that precede an “l.” It is possible
to exclude characters by putting them in square brackets and preceding them with ^ :
str_extract(sV, "(p|P)h(i|y)[^lL]")
## [1] NA "phys" "phis" NA NA NA "phys"
## [8] NA
Regular expressions are a very rich subject, and the simple rules explained so far can already
match very complex patterns. For example, note that there is a shorthand to make the matching
case insensitive.
str_extract(sV, "(?i)Ph(i|y)[^(?i)L]")
## [1] NA "phys" "phis" NA NA NA "phys"
## [8] NA
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Note that – when supplying the pattern to R – we have to escape the escape character \ , because
R also uses the backslash ( \ ) as escape character and will remove one before passing the string to the function. So
matching an email address in R would look as follows:
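A minimal sketch (the vector email is hypothetical; stringr was loaded earlier):
email <- c("[email protected]", "not-an-email")
str_detect(email, "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$")
## [1]  TRUE FALSE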
The regex
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
reads as follows.
• ^([a-zA-Z0-9_\\-\\.]+) : at the start of the string (indicated by “ ^ ”) find a choice of numbers,
letters, underscores, hyphens, and/or points (made clear by “ [a-zA-Z0-9_\\-\\.] ”), and
make sure to find at least one of those. The concept “one or more” is expressed by “ (...)+ ”.
• Then find the @ symbol.
• Then find at least one more occurrence of a letter (lower- or uppercase), digit, underscore, hyphen, or
point – ([a-zA-Z0-9_\\-\\.]+).
• This is followed by a point – \. .
• Finally, this is followed by minimum two and maximum five letters (lower- or uppercase) –
([a-zA-Z]{2,5}).
• And that last match must be at the end of the string – $ .
It is probably already more or less clear by now how a regular expression works and how to read
it. Note in particular the following:
• Some characters need to be escaped because they have a special meaning in regular expressions
(for example the hyphen, - , indicates a range like [a-z] ; this expression matches all letters
whose ASCII values are between those of “a” and “z”).
• Some signs are anchors and indicate a position in the string (e.g. “^” is the start of the string and
“$” is the end of the string) – so putting the first at the start of the pattern and the latter at the
end forces the pattern to match the whole string. This prevents a match in which a valid email
address is merely surrounded by other text.
• There are different types of brackets used. In general, the round brackets group characters, the
square ones indicate ranges or sets, and the curly brackets indicate a range of match counts.
Note that in the remainder of this section, we leave out the double escape characters. If you
use this in R, then you will need to replace all \ with \\ wherever it occurs.
Now that we understand how a regex works, we can have a closer look at the syntax and provide
an overview of the different symbols and their special meaning. Below we show the main building
blocks.
Special characters
• \n newline
• \r carriage return
• \t tab
• \v vertical tab
• \f form feed
Anchors
• ^ begin of string (or line)
• $ end of string (or line)
• \< beginning of a word
• \> end of a word
Quantifiers
• ? 0 or 1 times
• * 0 or more times
• + 1 or more times
• {n} exactly n times
• {n,m} between n and m times
• {n,} n or more times
• {,m} m or less times
Character groups
• . any character, but \n
• [abc] accepted characters
• [a-z] character range
• (...) character group
Logic
• | “OR,” e.g. (a|b) matches a or b
• \1 content of group one, e.g. r(\w)g(\1)x matches "regex"
• \2 content of group two, e.g. r(\w)g(\1)x(\2)xpr matches "regexexpr"
• (?:...) non-capturing group = ignore that match in the string to return
• [^a-d] “not”: no character in the range a to d
Other
• \Qa\E treat a verbatim, e.g. \QC++?\E matches “C++?”
• \K drop the match so far, e.g. x\K\d applied to "x1" returns only "1"
Lookaround – requires perl = TRUE
• a(?!b) a not followed by b
• a(?=b) a if followed by b
• (?<=b)a a if preceded by b
• (?<!b)a a if not preceded by b
Line modifiers
• (?i) makes all matches case insensitive
• (?s) single line mode: . also matches \n
• (?m) multi line mode: ^ and $ become begin and end of line
Regular expressions take some effort to master, but making a regex is very useful, and it is a skill that will come in handy in about
every computer language of any value today.
R uses POSIX extended regular expressions by default. You can switch to PCRE regular expressions by
using perl = TRUE in base R, or by wrapping patterns with perl() for stringr.
All functions can be used for literal searches by using fixed = TRUE in base R or by wrapping patterns
with fixed() for stringr. Note also that base functions can be made case insensitive by
specifying ignore.case = TRUE.
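A minimal sketch of these three options with base-R functions:
x <- c("ab", "ac", "a+b")
gsub("a(?=b)", "X", x, perl = TRUE)   # PCRE lookahead: 'a' only when followed by 'b'
## [1] "Xb"  "ac"  "a+b"
grepl("a+", x, fixed = TRUE)          # literal match of "a+", not "one or more a"
## [1] FALSE FALSE  TRUE
grepl("AB", x, ignore.case = TRUE)    # case-insensitive match
## [1]  TRUE FALSE FALSE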
Showing all the possibilities and providing full documentation of the regex implementation in R
would be a book in itself. Hence, we showed here only the tip of the iceberg and refer, for example, to the
excellent “cheat sheets” published by RStudio: https://www.rstudio.com/resources/
cheatsheets.
Good sources about regex itself are https://www.regular-expressions.info and
https://www.rexegg.com.
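The package rex builds regular expressions from readable building blocks; attaching it produces the masking messages shown below. The call itself is not visible in this excerpt and is presumably just:
library(rex)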
##
## Attaching package: ’rex’
## The following object is masked from ’package:stringr’:
##
## regex
## The following object is masked from ’package:dplyr’:
##
## matches
## The following object is masked from ’package:tidyr’:
##
## matches
# Note: the first lines of this expression (defining valid_chars and opening the
# rex() call) fall on a page that is not shown here; the lines marked "assumed"
# follow the rex documentation and are an assumption.
valid_chars <- rex(except_some_of(".", "/", " ", "-"))   # assumed
re <- rex(                                               # assumed
  start,                                                 # assumed
  # host name
  group(zero_or_more(valid_chars,
                     zero_or_more('-')),
        one_or_more(valid_chars)),
  # domain name:
  zero_or_more('.',
               zero_or_more(valid_chars,
                            zero_or_more('-')),
               one_or_more(valid_chars)),
  end
)
17.5.2.2 Functions Using Regex
Both base R and stringr have a complete suite of string-manipulating functions based on regular
expressions. Now that we understand how a regex works, they do not need a lot of explanation. The
functions of base R are conveniently named similarly to the functions that one will see in a Unix or
Linux shell such as bash, and hence might be familiar. The functions of stringr follow a more logical
naming convention, are more consistent, and allow the use of the pipe operator, because the data is the
first argument of the function.
In the examples below, we mention both the functions of base R and those of stringr. The functions of
stringr can be recognized because they all start with str_. In all examples of this section, we use
the same example, in which we try to match a digit within a string.
The pattern, \d , is a shorthand notation that matches any digit. Only the second element of the
vector of strings does not contain a digit, so only there will be no match.
Detect a Match
These functions only report whether a match is found; no information about the starting position of the match
is given.
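The book’s own six-element vector (of which only the second element contains no digit) is defined on a page that is not shown here; the output further below, the indices 1, 3, 4, 5, 6, lists which of those six elements contain a digit. A self-contained sketch of the same idea with an illustrative vector of our own (named s_demo and p_demo so as not to overwrite the book’s string and pattern):
library(stringr)

s_demo <- c("file1.txt", "no digits here", "x5y", "a1b2c3")
p_demo <- "\\d"                  # any single digit

grepl(p_demo, s_demo)            # base R
## [1]  TRUE FALSE  TRUE  TRUE
str_detect(s_demo, p_demo)       # stringr, pipe-friendly (data first)
## [1]  TRUE FALSE  TRUE  TRUE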
## [1] 1 3 4 5 6
Locate
In many cases, it is not enough to know whether there is a match; we also want to know where the
match occurs in the string. That is what we call “locating” a match in a string.
# Locate the first match (the numbers are the position in the string):
regexpr (pattern, string)
## [1] 5 -1 2 2 1 1
## attr(,"match.length")
## [1] 1 -1 1 1 1 1
## attr(,"useBytes")
## [1] TRUE
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:19pm Page 310
❦
## [[6]]
## [1] 1
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
The locating functions make the “detect match” functions almost redundant: if no match is
found, the index returned by the locating function is −1. We can then use this number to check
whether a match was found.
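The stringr counterparts are str_locate() (first match) and str_locate_all() (all matches); a minimal self-contained sketch:
str_locate(c("abc1", "none"), "\\d")
##      start end
## [1,]     4   4
## [2,]    NA  NA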
Replace
Often we want to do more than just find where a match occurs: we want to replace it with
something else. This process is called “replacing” matches with strings.
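A self-contained sketch of the main replacing functions (base R’s sub()/gsub() and their stringr counterparts):
x <- c("pi = 3.14", "no digits here")
sub("\\d", "#", x)              # base R: replace the first match
## [1] "pi = #.14"      "no digits here"
gsub("\\d", "#", x)             # base R: replace all matches
## [1] "pi = #.##"      "no digits here"
str_replace(x, "\\d", "#")      # stringr equivalent of sub()
str_replace_all(x, "\\d", "#")  # stringr equivalent of gsub()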
Extract
If it is not our aim to replace the match, then we might want to extract it for further
use and manipulation in other sections or functions. The following functions allow us to extract the
matches of regular expressions from strings. The output of these functions can be quite verbose, just
like that of the functions that locate matches.
# regmatches() with regexpr() will extract only the first match:
regmatches(string, regexpr(pattern, string))
## [1] "1" "5" "1" "1" "6"
## [,1]
## [1,] "1"
## [2,] NA
## [3,] "5"
## [4,] "1"
## [5,] "1"
## [6,] "6"
## [3,] "3"
##
## [[6]]
## [,1]
## [1,] "6"
Not all regex implementations are created equal. When going through documentation, make
sure it is relevant for R and stringr. For example, Java has a slightly different implementation.
Also note that Perl has a rich implementation with more functionality than the
POSIX regex. This functionality is understood by many base-R functions when the additional
argument perl = TRUE is added. It allows, for example, for conditional constructs inside the pattern.
Working with dates and times looks deceptively easy. Sure, hours and minutes are not counted in a decimal
system, but that is something that we got used to as children. We easily forget that not all years
are 365 days, that not all days are 24 hours, and that not all years exist.
The Egyptians used a year of exactly 365 days. The Julian Calendar was introduced by Gaius Julius
Caesar, and it adds one day every four years. However, by 1582 the difference with the tropical year
had grown to about 10 days, and Pope Gregorius XIII introduced the Gregorian Calendar, skipping
those 10 days. This calendar was adopted immediately (in 1582) by the European Roman Catholic countries;
Greece, for example, only adopted it in 1923. The Gregorian Calendar, which uses the 24-hour
day and the 365-day year, needs adjustment every four years (February 29 is added in every year that is
divisible by 4, unless it is divisible by 100, in which case it also needs to be divisible by 400 to be
a leap year). So, 1900, 2100, and 2200 are not leap years, but 1600, 2000, and 2400 are leap years.
So, in a period of four centuries, the Gregorian calendar misses 3 of its 100 Julian leap years,
leaving 97. This brings the average year to 365 + 97/400 = 365.2425 days. This is close to, but not exactly the
same as, the astronomical year, which is 365.2422 days. There are propositions to leave out one leap
year every 450 years, or rather to skip one every 4000 years. We might start worrying about that in 2500
years. But even if we get that right, then in a few million years the rotation of the earth will have slowed
down significantly. Anyhow, let us hope that by then we will consider the rotation of the earth a
mundane detail not worth bothering about. In the meanwhile, all countries that converted to the Gregorian
calendar somehow had to adjust their count.
Also note that, for example, there is no year 0: the year before 1 CE is 1 BCE (or −1). For all those
reasons, even the simple task of calculating the difference between two dates can be daunting.
Would time be easier? Would 1 p.m. not always be the same? Well, first of all, there are the
time-zones. Then there are occasions where countries change time-zone. This can complicate matters if we
want to calculate a difference in hours.
There is also daylight saving time (DST). Canada used daylight saving time first in 1908,
and Germany introduced DST in 1916: the clocks in the German Empire, and in its ally Austria, were
turned ahead by one hour on 30 April 1916, in the middle of the First World War. The assumed logic
was to reduce the use of artificial lighting and save fuel. Within a few weeks, the idea was copied by
the United Kingdom, France, and many other countries. Most countries reverted to standard time after
World War I, and it was World War II that made DST dominant in Europe. Today, the European
Union realizes that only a fool would believe that one gets a longer blanket by cutting off a piece at the
bottom and sewing it on the top . . . and hence considers abandoning DST.
At least every day has 24 hours and hence 86 400 seconds, doesn’t it? Well, even that is not correct.
Now and then, we have to introduce a leap second, as the rotation of the earth slows down.
We now have atomic clocks that are so precise that the difference between atomic time and astronomical
time (earth rotation) becomes visible as the earth slows down. Since 1972, we have been adding leap
seconds whenever the difference reaches 0.9 seconds. The last leap second was on 31 December 2016;
the next one has yet to be announced by the IERS in Paris.6
Since we only consider time on the earth, we discard the complexity of interstellar travel and of special
and general relativity. But even so, it will be clear by now that we do not want to deal with all
those details if we only want to build, say, a neural network based on time differences of stock prices.
We need R to help us.
The package lubridate – announced in Grolemund and Wickham (2011) – is part of the
tidyverse and does an excellent job: it provides all the functionality that one would need, while being
compliant with the tidyverse philosophy.
6 The IERS is the International Earth Rotation and Reference Systems Service.
We will load the package here and show this part of the code only once. All sub-sections that follow
will use this package.
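The load itself is presumably just:
library(lubridate)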
The first key concept is that of a date and a date-time. For most practical purposes, a date is something
that can be stored as yyyy-mm-dd.
It can be noted that R follows the ISO 8601 notation, and so will we. Any person who believes in
inclusion, and not in imposing historical, nation-bound standards on the rest of the world, will embrace
the ISO standards; a fortiori, any programmer or modeller with an inclusive world-view will also use
the ISO 8601 standard. But there are many people who will not do this, and it is not uncommon to get
dates in other formats or to have to report dates in those formats. So, the format that we will use is the
ISO format, yyyy-mm-dd, but we will also show how to convert to and from other systems.
mdy("4/5/2018")
## [1] "2018-04-05"
mdy("04052018")
## [1] "2018-04-05"
as.numeric(dt)
## [1] 17656
ymd(17656)
yq("201802")
## [1] "2018-04-01"
All the functions above can also be used for dates combined with a time: not only with dates, but
also with date-times.7 For example:
ymd_hms("2018-11-01T13:59:00")
## [1] "2018-11-01 13:59:00 UTC"
dmy_hms("01-11-2018T13:59:00")
## [1] "2018-11-01 13:59:00 UTC"
ymd_hm("2018-11-01T13:59")
## [1] "2018-11-01 13:59:00 UTC"
ymd_h("2018-11-01T13")
## [1] "2018-11-01 13:00:00 UTC"
hms("13:14:15")
## [1] "13H 14M 15S"
hm("13:14")
## [1] "13H 14M 0S"
The functions above also support other date formats. This can be done by tuning the arguments
tz (time-zone) and locale. A locale is a set of local customs to represent dates, numbers, and
lists. In Linux, we can get a list of all installed locales via system('locale -a').
Essentially, a date is stored as the number of days since 1970-01-01, and a time is stored as the number
of seconds elapsed since 00:00:00.
7 There is one exception: the function that relates to quarters.
as_date("2018-11-12")
## [1] "2018-11-12"
as_date(0)
## [1] "1970-01-01"
as_date(-365)
## [1] "1969-01-01"
as_date(today()) - as_date("1969-02-21")
## Time difference of 18605 days
The package lubridate also provides alternatives to Sys.Date() and Sys.time() with the functions
today() and now().
today()
## [1] "2020-01-30"
now()
## [1] "2020-01-30 00:01:02 CET"
17.6.2 Time-zones
A date-time can be seen as “a moment in time”: it is a given moment within a given day. Its standard
format is yyyy-mm-dd hh:mm XXX, where XXX is the time-zone, like “UTC.” R knows about 600 time-zones.
For each time-zone, it keeps a full history, so it knows, for example, when Alaska changed
time-zone, and it takes that into account when showing time differences.
We mentioned before that R knows about 600 time-zones. That is many more than the time-zones
that actually exist. However, the simplicity of time-zones can be deceiving: for R to be able to calculate
time lapses and differences accurately, it needs a separate and unique entry not just per time-zone,
but per distinct history of time-zone changes and DST rules.
# Force time-zone:
as_datetime("2006-07-22T14:00 UTC")
## [1] "2020-06-07 22:14:00 UTC"
Many functions – even when not designed for the purpose of managing time-zones – take a
time-zone as input.
If you get years, months, and days in separate columns, then you can put them together with
make_datetime(). For example:
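The book’s own example is on a page that is not shown here; a minimal sketch of make_datetime():
make_datetime(year = 2020, month = 1, day = 30, hour = 13, min = 59, tz = "UTC")
## [1] "2020-01-30 13:59:00 UTC"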
## [1] 68
The function update() provides a flexible way to change date-times to other known values.
# We will use the date from previous example:
dt1
## [1] "1890-12-29 08:00:00 MST"
All functions of the tidyverse return a value that contains the requested modifications, but
they never change, as a side effect, the variables that were provided to the function.
Sure, the above example makes sense, but it is at least confusing. The difference between the two
date-times (simply used as numbers of seconds since 1970-01-01 00:00) gave us the real time that
elapsed between the two moments in time, even if humans make things unnecessarily complicated with
DST.
So, when calculating time differences, there are a few different concepts that have to be distinguished.
We define the following:
1. Durations: a duration is the physical amount of time that has elapsed between two events.
2. Periods: periods track changes in clock time (so they pretend that DST, leap seconds, and leap years do not
exist).
3. Intervals: stretches of time defined by a start and an end date-time (from which a duration or a period can be
extracted).
Those three definitions are connected. A time interval only makes sense if one reduces it to a
period or a duration. For example, a time interval that starts just before the hour that is skipped when a
daylight-saving time-zone springs forward might measure two hours as a period (clock time), while only
one hour of duration has actually elapsed. Depending on the application, one might need the one or the other.
17.6.4.1 Durations
Durations model cosmic time differences, regardless of human definitions
such as “9 a.m. in the CET time-zone.” Examples are natural events such as tides, eruptions of geysers,
the time it took to fill out a test, etc.
Durations can be created with a range of functions starting with d (for duration). The following
examples illustrate what these functions do and how they interact:
dweeks(x = 1)
## [1] "604800s (~1 weeks)"
ddays(x = 1)
## [1] "86400s (~1 days)"
dhours(x = 1)
## [1] "3600s (~1 hours)"
dminutes(x = 1)
## [1] "60s (~1 minutes)"
dseconds(x = 1)
## [1] "1s"
dmilliseconds(x = 1)
## [1] "0.001s"
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:19pm Page 321
❦
dmicroseconds(x = 1)
## [1] "1e-06s"
dnanoseconds(x = 1)
## [1] "1e-09s"
dpicoseconds(x = 1)
## [1] "1e-12s"
str(dur)
## Formal class 'Duration' [package "lubridate"] with 1 slot
## ..@ .Data: num 1e-09
print(dur)
## [1] "1e-09s"
If the duration is not given as one number, but, for example, with the units expressed in a string,
we can use the function duration(). There is also a series of functions that coerce to a duration
or check whether something is a duration:
as.duration(dur)
## [1] "315360000s (~9.99 years)"
is.duration(dur)
## [1] TRUE
is.difftime(dur)
## [1] FALSE
17.6.4.2 Periods
Periods model time intervals between events that happen at specific clock times. For example, the
opening of a stock exchange is always 9 a.m., regardless of DST or a leap second.
years(x = 1)
## [1] "1y 0m 0d 0H 0M 0S"
months(x = 1)
## [1] "1m 0d 0H 0M 0S"
weeks(x = 1)
## [1] "7d 0H 0M 0S"
days(x = 1)
## [1] "1d 0H 0M 0S"
hours(x = 1)
## [1] "1H 0M 0S"
minutes(x = 1)
## [1] "1M 0S"
seconds(x = 1)
## [1] "1S"
milliseconds(x = 1)
## [1] "0.001S"
microseconds(x = 1)
## [1] "1e-06S"
nanoseconds(x = 1)
## [1] "1e-09S"
picoseconds(x = 1)
## [1] "1e-12S"
str(per)
## Formal class 'Period' [package "lubridate"] with 6 slots
## ..@ .Data : num 0
## ..@ year : num 0
## ..@ month : num 0
## ..@ day : num 1
## ..@ hour : num 0
## ..@ minute: num 0
print(per)
## [1] "1d 0H 0M 0S"
# For automations:
period(5, unit = "years")
## [1] "5y 0m 0d 0H 0M 0S"
as.period(10)
## [1] "10S"
period_to_seconds(p)
## [1] 10
These functions return period objects, which have their own arithmetic defined. This allows us to
compose periods simply, as follows:
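A minimal sketch of that arithmetic (the dates are arbitrary examples):
years(1) + months(2) + days(3)
## [1] "1y 2m 3d 0H 0M 0S"
ymd("2020-01-31") + months(1)     # February 31 does not exist
## [1] NA
ymd("2020-01-31") %m+% months(1)  # %m+% rolls back to the last valid day
## [1] "2020-02-29"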
17.6.4.3 Intervals
Time intervals are periods or durations limited by two date-times. This means that in order to define
them we need two dates (or two date-times). Intervals can be seen as a start and an end moment in one
object; they are neither periods nor durations, but by dividing them by a period or a duration we can find
out how long they last.
The package lubridate provides a set of functions that allow us to check whether a date falls in an interval,
to shift the interval forward, etc.
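The interval ww2 used below was defined on a page that is not shown here; a definition consistent with the output would be something like the following (the exact time-zone is an assumption):
ww2 <- interval(ymd_hm("1939-09-01 09:00", tz = "Europe/Warsaw"),
                ymd_hm("1945-08-15 05:00", tz = "Europe/Warsaw"))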
int_shift(ww2, years(-1))
## [1] 1938-09-01 09:00:00 CET--1944-08-15 05:00:00 CEST
17.6.4.4 Rounding
Occasionally, it is useful to round dates. It might make sense to round moments to the closest start
of a month, instead of just dropping the day and coercing all occurrences to the first of that month.
Of course, this also works with years, days, and minutes, by using a different value for the unit
argument.
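A minimal sketch of the rounding functions (the date is arbitrary):
d <- ymd_hms("2020-01-30 17:45:00")
floor_date(d, unit = "month")
## [1] "2020-01-01 UTC"
round_date(d, unit = "month")
## [1] "2020-02-01 UTC"
ceiling_date(d, unit = "month")
## [1] "2020-02-01 UTC"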
Categorical data in R are called “factors”: they help to present discrete labels in a particular order that is
not necessarily alphabetical; they have been discussed in Section 4.3.7 “Factors” on page 45. In the past,
when computing time was in much more limited supply, they offered a gain in performance, and that
is why in base R they pop up even where they are not very helpful. For example, when importing files
or coercing data to data frames, strings become factors. This is one of the things that the tidyverse
will not allow: silently changing data types.
We have already discussed the base-R functions related to factors in Section 4.3.7 “Factors” on
page 45. Hence, the reader will be acquainted with the subject of factors itself, and therefore, in this
section, we can focus on the specifics of the package forcats, which is part of the core tidyverse.
“Forcats” is an anagram of “factors.”
In this section, we will focus on an example of a fictional survey (with fabricated data). Imag-
ine that we have the results of a survey that asked about the satisfaction with our service8 rated as
high/medium/low. First, we generate the results for that survey:
set.seed(1911)
s <- tibble(reply = runif(n = 1000, min = 0, max = 13))
hml <- function (x = 0) {
if (x < 0) return(NA)
if (x <= 4) return("L")
if (x <= 8) return("M")
if (x <= 12) return("H")
return(NA)
}
surv <- apply(s, 1, FUN = hml) # output is a vector
surv <- tibble(reply = surv) # coerce back to tibble
surv
## # A tibble: 1,000 x 1
## reply
## <chr>
## 1 H
## 2 M
## 3 L
## 4 M
## 5 L
## 6 H
## 7 L
## 8 H
## 9 <NA>
## 10 L
## # ... with 990 more rows
To put the labels in the right order, we have to make clear to R that they are factors and that
we have a specific order in mind for the levels. This can be done with the argument levels in the function
parse_factor().
8 We do not recommend to measure customer satisfaction as high, medium and low. This is not actionable. Good
KPIs are actionable. Best is to ask “how likely are you to recommend our service to a friend (on a scale from 1 to 10).”
This would allow us to find net-detractors, neutral users and net promoters. This is useful, but not needed for our
example to use factors.
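The call that creates the factor is on a page that is not shown here; a sketch that is consistent with the summary below (the object name survey is an assumption):
# parse_factor() comes from readr (part of the tidyverse, loaded above)
f_levels <- c("L", "M", "H")
survey   <- parse_factor(surv$reply, levels = f_levels)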
Finally, we have our survey data gathered and organised for use. To study the survey, we can ask for
its summary via the function summary() and plot the results. We can plot with the function plot(),
and R will recognise that the data is a factor object and hence produce the plot in Figure 17.2.
summary(survey)
## L M H <NA>
## 295 313 310 82
[Figure 17.2 about here: a bar chart titled “Customer Satisfaction” with counts (up to about 300) for the levels L, M, and H, and an unlabelled bar for the NA values.]
Figure 17.2: The standard plot function on a factored object with some values NA (last block without
label).
You will remember that the function factor() does exactly the same: converting to factors.
However, it will silently convert all strings that are not in the set of levels to NA. For example,
a typo “m” (instead of “M”) would become NA.
Therefore, we recommend using readr’s function parse_factor().
If you do not know the factor levels in advance (or need to find out), then you can use the
function unique() to find out which levels there are and feed that into parse_factor().
This works as follows:
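A sketch of that approach (reusing the character vector surv$reply from above; the name survey2 and the use of na.omit() to keep NA out of the level set are our additions):
f_levels2 <- unique(na.omit(surv$reply))   # the distinct non-missing replies
survey2   <- parse_factor(surv$reply, levels = f_levels2)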
The package forcats provides many functions to work with labels. First, let us count the occurrences
of the labels with fct_count().
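For example (assuming the factor survey from above; the counts follow from the summary shown earlier):
fct_count(survey)
## # A tibble: 4 x 2
##   f         n
##   <fct> <int>
## 1 L       295
## 2 M       313
## 3 H       310
## 4 <NA>     82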
Another useful thing is that labels can be remodelled after creation. The function fct_relabel()
provides a powerful engine to change labels and remodel the way the data is modelled and/or presented.
The following code illustrates how this is done and displays the result in Figure 17.3 on page 328.
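The book’s code is not visible in this excerpt; a sketch of the kind of relabelling that produces Figure 17.3 (the new labels are an assumption based on the figure, and we assume that fct_relabel() merges levels that map to the same new label):
to_two_groups <- function(x) ifelse(x == "L", "Low", "Medium/High")
survey %>%
  fct_relabel(to_two_groups) %>%
  plot()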
The open mechanism that allows us to pass a function as a parameter to another function is a very
flexible tool. For example, it allows us to use the power of regular expressions – see Section 17.5.2
“Pattern Matching with Regular Expressions” on page 302 – to achieve the same result.
[Figure 17.3 about here: a bar chart with two bars, “Low” and “Medium/High”, with counts up to about 600.]
Figure 17.3: Maybe you would prefer to show this plot to the board meeting? This plot takes the
two best categories together and creates the impression that more people are happy. Compare this to
the previous plot.
Besides changing labels, it might be necessary to change the order of the levels. This makes most sense
when more variables are present and when we want to show the interdependence structure between
two of those variables. Assume that we have a second variable: age. The following code provides this
analysis and plots the results in Figure 17.4 on page 330.
for (n in 1:num_obs) {
if (!is.na(srv$age[n])) {
srv$reply[n] <- hml(rnorm(n = 1, mean = srv$age[n] / 7, sd = 2))
}
else {
srv$reply[n] <- hml(runif(n = 1, min = 1, max = 12))
}
}
f_levels <- c("L", "M", "H")
srv$fct <- parse_factor(srv$reply, levels = f_levels)
# From least frequent to most frequent:
srv$fct %>%
  fct_infreq %>%
  fct_rev %>%
  levels
## [1] NA "L" "H" "M"
[Figure 17.4 about here: a plot of the satisfaction levels (L, M, H) against customer age, with the NA group highlighted.]
Figure 17.4: A visualisation of how the age of customers impacted the satisfaction in our made-up
example. The NA values have been highlighted in red.
In some sense, we were “lucky” that the order produced by the medians of the age variable somehow
makes sense for the satisfaction variable. In general, this will mix up the order of the factorized
variable. So, these techniques make most sense for a categorical variable on a “nominal scale,”9
such as, for example, countries, colours, outlets, cities, people, etc.
Finally, we mention that forcats provides a function to anonymise factors: fct_anon(). It changes both the order
and the labels of the levels. For example, consider our survey data (and more in particular the tibble
srv, which has the variable fct that is a factor).
srv %>%
mutate("fct_ano" = fct_anon(fct)) %>%
print
## # A tibble: 1,000 x 4
## reply age fct fct_ano
## <chr> <dbl> <fct> <fct>
## 1 L 59.5 L 4
## 2 L 30.7 L 4
## 3 M 48.9 M 1
## 4 L NA L 4
## 5 H 50.7 H 2
## 6 M 57.7 M 1
## 7 M 45.1 M 1
## 8 H 75.1 H 2
## 9 L 30.0 L 4
## 10 M 41.2 M 1
## # ... with 990 more rows
Question #13
The package forcats comes with a dataset gss_cat, containing survey data that links the
number of hours of television watched, political preference, marital status, income, etc. Many
of the variables can – and should – be considered as categorical variables and hence can be
converted to factors. Remake the previous analysis in a sensible way for that dataset.
9 A nominal scale is a measurement scale that has no strict order-relation defined – see Chapter B “Levels of
♣ 18 ♣
Most datasets will have missing values, and missing values make any modelling a lot more
complicated. This section will help you to deal with those missing values.
However, it is – as usual – better to avoid missing data than to solve the problem later on. In the
first place, we need to be careful when collecting data. The person who collects the data is usually
not the one reading this book. So, the data scientist reading this book can create awareness among
higher management, which can then start measuring data quality, improve awareness, or invest
in better systems.
We can also change software so that it helps to collect correct data. If, for example, the retail
staff systematically leaves the field “birth date” empty, then the software can refuse empty birth
dates, pop up a warning if the customer is over 100 years old, and simply not
accept birth dates that imply an age over 150. The management can also show numbers related to
losses due to loans that were accepted on wrong information. The procedures can be adapted too
– for example, by requiring a copy of the customer’s ID card – and the audit department can then check
compliance with this procedure, etc.
However, there will still be cases where some data is missing. Even if the data quality is generally
fine, and we still have thousands of observations left after leaving out the small percentage
of missing data, it is still essential to find out why the data is missing. If data is missing at
random, leaving out the missing observations will not harm the model, and we can expect the model
not to lose power when applied to real and new cases.1
1 Even if there is only a small percentage of data missing, it is still worth investigating why the data is
missing. For example, on a book of a million loans we notice that 0.1% of the fields “ID card number” are missing.
When we check the default rate, it appears that if that field is missing we have a default rate of 20%, while in the
rest of the population this is only 4%. This is a strong indication that not filling in that field is correlated with the
intention of fraud. We can, indeed, leave this small percentage out to build our model. However, we need to point
out what is happening and prevent further fraud.
If data is missing, there is usually an identifiable reason why this is so. Finding out what that
reason is, is most important, because it can indicate fraud, or be systematic, and hence influence
the model itself.
For example, missing data can be due to:
1. input and pre-processing (e.g. conversion of units; some dates were in American format,
others in UK format, so some dates got misunderstood and others rejected; a decimal comma
is not understood; data is copied to a system that does not recognize large numbers;
etc.);
2. unclear or incompletely formulated questions (e.g. asking “are you male or female?” while
having only “yes” and “no” as possible answers);
3. fraud or other intent (if a young male knows that young males pay more for car insurance, he
might leave the gender box empty);
4. random reasons (e.g. a question that was randomly skipped, an interruption of financial markets due to
an external reason, a mistake, typing one digit too many, etc.).
Besides missing data, it happens all too often that data is useless. For example, if a
school wants to gather feedback about a course, it might ask “How satisfied are you with
the printed materials?” Students can put a low mark if they did not get the materials, if they
thought that the printed materials were too elaborate, too short, or of low quality, or even because
they do not like printed materials at all and prefer an electronic version. Imagine now that the
feedback on the printed materials is low. What can we do with this information? Did we even
bother to ask whether printed materials are wanted, and if so, how important they are for the overall
satisfaction?
The typical mistake of people who formulate a questionnaire is that they are unable to imagine
that other people might have another set of beliefs and values, or even a very different
reality.
Often, data is missing because questionnaires are written carelessly and formulated
ambiguously. What to think about questions such as these:
4. is the DQP sufficient to monitor the IMRDP and in line with GADQP? yes/no
1. The reader will assume that this question is there because the teacher will be
assessed on the quality of the printed materials, and it also assumes that the student
cares – which is an assumption that should be tested first. What would you do with
this question if you did not want printed materials in the first place?
2. Whose study level are we asking here? That of the student or that of the parent?
3. Assume that the teacher is often late, then you can answer both “yes” or “no,”
assume the opposite and the same holds. So what to answer?
4. This question has two common problems. First, it uses acronyms that might not
be clear to everyone; second, it combines two separate issues – being sufficient and
compliance with some rules – into one yes/no question. What would you answer if
only one of the two is true?
5. Some people might not want to fill in this question because they feel it should be
irrelevant. Would you expect the propensity to leave this question open to depend
on the racial group to which one belongs?
It also might be that there is a specific reason why data is missing. This could be because the
provider of the data does not like to divulge the information. For example, when applying for a loan,
a young male might know that being young and being male works to his disadvantage. Maybe
he would like to identify as a lady for one day, or simply leave the question open? In this example,
it is possible that most of the missing gender information comes from young males. Leaving out
the missing values will lead to a model that discriminates less against young males, but it might lead to more
losses for the lender and inadvertently lure many young males into unsustainable debt.
Assume that there is no possible systematic reason why data is missing. Then we have some
more options to deal with the missing data points.
1. Leave out the rows with missing data. If there is no underlying reason why the data is missing,
then leaving out the missing data is most probably harmless. This will work fine if the
dataset is sufficiently large and of sufficient quality (even a dataset with a hundred thousand
rows but with 500 columns can lead to problems when we leave out all rows that miss one
data point).
2. Carefully leave out rows with missing data. Same as above, but first make up our mind
which variables will be in the model, and then only leave out the rows that miss data on
those variables.
3. Somehow fill in the missing data based on some rules or models. For example, we can
replace a missing value by the mean or the median of the observed values, or by a value
predicted from the other variables.
Whatever we do or do not do with missing data will influence the outcome of the model and
needs to be handled with the utmost care. Therefore, it is important to try to understand the data
and make an educated choice in handling missing data. As usual, R comes to the rescue with a
variety of libraries.
Example
For the remainder of this section, we will use a concrete example, based on the dataset
iris, which is provided by the package datasets and is usually loaded at the startup of R. This
dataset provides 150 observations of petal and sepal measurements, as
well as the type of iris (setosa, versicolor, or virginica). It is most useful for
training classification models. For the purpose of this example, we will use this dataset
and introduce some missing values.
set.seed(1890)
d1 <- d0 <- iris
i <- sample(1:nrow(d0), round(0.20 * nrow(d0)))
d1[i,1] <- NA
i <- sample(1:nrow(d0), round(0.30 * nrow(d0)))
d1[i,2] <- NA
head(d1, n=10L)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 NA 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 NA 1.4 0.3
## 8 5.0 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 NA 3.1 1.5 0.1
## Species
## 1 setosa
## 2 setosa
❦ ## 3 setosa ❦
## 4 setosa
## 5 setosa
## 6 setosa
## 7 setosa
## 8 setosa
## 9 setosa
## 10 setosa
For the remainder of this section, we will use the example of the iris flowers as defined above.
In order to draw conclusions about the different methods, it would make sense to run a
certain model – a neural network, a decision tree, or another one – then apply some methods to
impute the missing data, and then compare the results. However, since we deleted data in a certain
way – at random – we would expect to see exactly that reflected in the results, and so we would not really
be able to determine which method is best. In fact, there is no such thing as “the best
method,” since we do not know why the data is missing – and if we knew, it would not
be missing.
One of the oldest methods – see Little (1988) – is predictive mean matching (PMM). For each
observation that has a variable with a missing value, the method finds an observation (that has
no missing value for this variable) with the closest predictive mean for that variable. The observed
value from this observation is used as the imputed value. This means that it automatically preserves
many important characteristics, such as skew, boundedness (e.g. only positive data), base type
(e.g. integer values only), etc.
The PMM process is as follows:
1. Take all observations that have no missing values and fit a linear regression of the variable x –
that has the missing values – on one or more variables y, and produce a set of coefficients b.
2. Draw random coefficients b∗ from the posterior predictive distribution of b. Typically, this
would be a random draw from a multivariate normal distribution with mean b and the estimated
covariance matrix of b (with an additional random draw for the residual variance).
This step ensures that there is sufficient variability in the imputed values.
3. Using b∗, generate predicted values for x for all cases (both for those that have missing
values in x and for those that do not).
4. For each observation with missing x, identify a set of cases with observed x whose predicted
values are close to the predicted value for the observation with missing data.
5. From those observations, randomly choose one and assign its observed value as the value to be
imputed for the missing x.
6. Repeat steps 2–5 until all missing values of x have an imputation candidate.
Interestingly, and unlike in many other methods of imputation, the purpose of the linear regression is
not to find the values to be imputed. Rather, the regression is used to construct a metric for what
counts as a matching observation (whose observed value can be borrowed).
Many of the packages that we will describe below have an implementation of PMM.
18.3.1 mice
If data is missing at random (MAR) – which means that the probability that a value is
missing depends only on the observed values and can be predicted using those values – the package
mice2 is a great way to start. Multivariate imputation via chained equations (mice) imputes data
on a variable-by-variable basis by specifying an imputation model per variable. The following code loads
the package and uses its function md.pattern() to visualise the missing values. The output is
in Figure 18.1.
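A sketch of that call (the code itself falls outside this excerpt):
library(mice)
md.pattern(d1)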
[Figure 18.1 about here: the md.pattern() chart of the missing-data pattern: 87 complete rows, 33 rows missing only Sepal.Width, 18 missing only Sepal.Length, and 12 missing both; 30 values are missing in Sepal.Length and 45 in Sepal.Width, 75 in total.]
Figure 18.1: The visualization of missing data with the function md.pattern().
2 Mice is an acronym and stands for “Multivariate Imputation via Chained Equations.”
## Sepal.Width
## 87 1 0
## 33 0 1
## 18 1 1
## 12 0 2
## 45 75
The table shows that the dataset d1 has 87 complete cases, 33 observations where only
Sepal.Width is missing, 18 observations where only Sepal.Length is missing, and 12 cases where both are
missing.
Now that we have used the capabilities of mice to study and visualize the missing data, we can also
use it to replace the missing values with a guess. This is done by the function mice() as follows:
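A sketch of what that call presumably looks like (the object name imp and the settings shown are assumptions; PMM is mice’s default method for numeric variables):
imp <- mice(d1, m = 5, seed = 1890, printFlag = FALSE)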
This created five possible datasets and we can select one completed set as follows.
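For example (the name d_complete is an assumption):
d_complete <- complete(imp, 1)   # select the first of the five completed datasets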
18.3.2 missForest
missForest uses under the hood a random forest algorithm and hence is a non-parametric impu-
tation method. A non-parametric method does not make explicit assumptions about functional
form of the distribution. missForest will build a random forest model for each variable and uses
this model to predict missing values in the variable with the help of observed values.
It yields an out-of-bag (OOB) imputation error estimate and provides fine-grained control over the imputation process. It can even return the OOB error separately for each variable instead of aggregating it over the whole data matrix. This allows us to assess how accurately the model has chosen values for each variable.
Note also that the process works well on categorical variables.
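A minimal sketch of such a call (the object names and the seed are assumptions):
library(missForest)
set.seed(1890)                                   # assumed seed
imp_mf <- missForest(as.data.frame(d1), variablewise = TRUE)
imp_mf$OOBerror                                  # OOB error, here one value per variable
head(imp_mf$ximp)                                # the imputed dataset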
Hint – Fine-tuning
Errors can usually be reduced by fine-tuning the parameters of the function missForest:
• mtry – the number of variables randomly sampled at each split, and
• ntree – the number of trees grown in each forest.
18.3.3 Hmisc
Hmisc actually does a lot more than dealing with missing data. It provides functions for data analysis, plots, complex table making, model fitting, and diagnostics (for linear regression, logistic regression, and Cox regression) as well as functionality to deal with missing data.
For missing values, Hmisc provides the following functions:
• impute(): imputes missing values based on the median (default), mean, max, etc.
• aregImpute(): imputes using additive regression, bootstrapping, and predictive mean matching (PMM).
In the bootstrapping method, different bootstrap samples are used for each of the multiple imputations. An additive model (a non-parametric regression method) is then fitted on samples taken with replacement from the original data, and missing values are predicted using the non-missing values as independent variables. Finally, it uses PMM by default to impute the missing values.
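A sketch of the call that is consistent with the printed call in the output below (the seed is an assumption):
library(Hmisc)
set.seed(1890)                                   # assumed seed
aregImp <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length +
                        Petal.Width + Species,
                      data = d1, n.impute = 4)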
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
print(aregImp)
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width + Species, data = d1, n.impute = 4)
##
## n: 150 p: 5 Imputations: 4 nk: 3
##
## Number of NAs:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 30 45 0 0
## Species
## 0
##
## type d.f.
## Sepal.Length s 2
## Sepal.Width s 1
## Petal.Length s 2
## Petal.Width s 2
## Species c 2
##
## Transformation of Target Variables Forced to be Linear
❦ ## ❦
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Sepal.Length Sepal.Width
## 0.901 0.714
18.3.4 mi
The mi package provides multiple imputation with the PMM method – as explained earlier. You might want to consider the mi package for one of the following reasons:
• it detects certain issues with the data, such as (near-)collinearity of variables.
The subject of missing data is by no means closed with this short overview. It is a complex
issue that is hard to tackle. We find that the best and most powerful approach is to start at the
source: improve data quality. In any case, imputation methods have to be used with utmost care
to avoid influencing the model being fitted.
♣ 19 ♣
Data Binning
Histograms are well known, and they are an example of “data binning” (also known as “discrete binning,” “bucketing,” or simply “binning”). They are used to visualize the underlying distribution of data that has a limited number of observations.
Consider the following simple example where we start with data drawn from a known distri-
bution and plot the histogram (the output of this code is in Figure 19.1 on page 344):
set.seed(1890)
d <- rnorm(90)
par(mfrow=c(1,2))
hist(d, breaks=70, col="khaki3")
hist(d, breaks=12, col="khaki3")
Other examples of data binning are found in image and signal processing. When small shifts in the spectral dimension from mass spectrometry (MS) or nuclear magnetic resonance (NMR) could be falsely interpreted as representing different components, binning helps: it reduces the resolution of the spectrum just enough to ensure that a given peak remains in its bin despite small spectral shifts between analyses. Also, several digital camera systems use a pixel-binning function to improve image contrast and reduce noise.
Binning also reduces the effects of minor observation errors; especially when the observations are sparse, binning brings more stability. The original data values that fall in a given small interval – a bin – are replaced by a value representative of that interval, often the central value. It is a form of quantization. Statistical data binning is thus a way to group a number of more or less continuous values into a smaller number of “bins.”
More often, one will be faced with the issue that real data relates to a very diverse population. Consider for example a bank that has a million consumer loans1. That data will typically have more people in their twenties and thirties than in any other age group – largely because of the population curves, but also because those people have some creditworthiness and need to build up their lives. Even if the bank has thousands of customers above 80 years old, they will still form a minority.
1 A “consumer loan” is understood as a loan with no collateral. Examples are the overdraft on a current account, a credit card, a purpose loan (e.g. to buy a car, but without the car serving as collateral), etc.
Figure 19.1: Two histograms of the same dataset. The histogram with fewer bins (right) is easier to read and reveals more clearly the underlying model (Gaussian distribution). Using more bins (as in the left plot) can overfit the dataset and obscure the true distribution.
Most datasets will have a bias. For example, in databases with people, we might find biases towards age, postcode, country, nationality, marital status, etc. If the data is about companies, one can expect a bias with respect to postcode, sector, size, nationality, etc. If the data concerns returns on stock exchanges, one will – hopefully – have little data from years such as 2008 (and hence the data has a bias against dramatic stock market crashes).
The problem is that our observations might reflect a true trend (for example, mining will have more fatal accidents than banking) or might be something that only exists in this database (for example, creditworthiness and nationality). The problem with this last example is that nationality could indeed be predictive of creditworthiness in a given place. For example, many poor people emigrated from country X to country Y. These people display a higher default rate in country Y. However, in country X this might not be true. This touches yet another important subject: discrimination. With discrimination we refer to judgement based on belonging to certain groups (typically factors that are determined by birth and not by choice).
For the bank, it might seem sensible to assume that people from country X will not pay back loans. However, if people from country X are systematically denied loans, they cannot start a business and they will stay poor. This might lead to further social stigmatisation, social unrest, an unproductive labour force, increased criminality, etc. Thinking further, this also implies that country Y misses out on the GDP created by the entrepreneurial people from country X. This will have the most negative impact on the people of country X, but it will also impact the economy of Y, and hence, also our bank.
More than any other institution, banks can never be neutral. Whether they do something or whether they do not, they are partial. More than any other institution, banks carry a serious responsibility.
Imagine that the purpose is to calculate a creditworthiness for further loans to the same cus-
tomers. In such case, there are generally a few possibilities:
1. the dependency on age is linear and then a linear model will do fine
2. the dependency on age is of non-linear nature, but a simple model – that can be explained
– can be fitted
3. the dependency on age is more complex. Young people have, on average, a less stable situation and hence a worse credit score. Later, people get their lives in order and are able to fulfil the engagements they take on. However, when they retire, their income drops significantly; this new situation can again lead to an increased propensity for financial problems.
Sure, we need to make another assumption at this point: is the lender willing to discriminate by age? If age is included in the explanatory variables of the model, then the result will be that people of certain age groups will face more difficulty getting a loan.2
In the following example, we assume that we want to build a model that is based on the cus-
tomer profile, and tries to predict customer churn.3
We already encountered in the answer to Question 6 on page 48 how to perform binning in R. Using the example of data drawn from the normal distribution mentioned above, and building further on the histogram in Figure 19.1 on page 344, we are fully armed to make a binning for the data d. In the first place, we want to make sure that no bin has too few observations in it.
To achieve this in R, we will use the function cut() as our workhorse:
# This is not good, it will not make solid predictions for the last bin.
# Hence we need to use other bins:
c <- cut(d, breaks = c(-3, -0.5, 0.5, 3))
table(c)
## c
## (-3,-0.5] (-0.5,0.5] (0.5,3]
## 27 41 22
2 While one might think that “this is the right thing to do because the data tells me to,” it is in fact a delicate balancing act between risk (and cost to the lender) and social justice. Indeed, if we live in a country where social security for old people is weak, then old people will systematically be considered as worse risks, regardless of the true characteristics of the individual. Replace age with gender, social status, race, postcode, etc.: for most of the observable parameters, the same dilemma exists. More information is in the insert “Bias” on page 344.
3 “Customer churn” refers to customers that do not come back.
Executing a binning in R is simple enough. However, this is not the whole story: taking care that all bins have a reasonable size is conducive to a strong model. We do not want bins so small that they do not have a statistically significant sample size, but it is not essential that all bins have the same size. Much more important is how different the observations in the different bins are.
In the next section, Section 19.2 “Tuning the Binning Procedure” on page 347, we will elaborate
on this idea more and will aim to make bins so that they make the model more stable and give it
more predictive power.
19.2 Tuning the Binning Procedure
In this section, we will tune our binning so that it actually makes the model both robust and predictive. We will look for patterns in the data and make sure that these patterns are captured.
Let us start from the previous dataset generated by the normal distribution and make some further assumptions. The code below generates this data. Look at it carefully and you will see what bias is built in.
set.seed(1890)
age <- rlnorm(1000, meanlog = log(40), sdlog = log(1.3))
y <- rep(NA, length(age))
for(n in 1:length(age)) {
y[n] <- max(0,
dnorm(age[n], mean= 40, sd=10)
+ rnorm(1, mean = 0, sd = 10 * dnorm(age[n],
mean= 40, sd=15)) * 0.075)
}
y <- y / max(y)
dt <- data.frame(age = age, spending_ratio = y)  # assumed: collect the data in the frame dt used below
plot(age, y,
pch = 21, col = "blue", bg = "red",
xlab = "age",
ylab = "spending ratio"
)
This data could be from an Internet platform and describe how active customers are on our service (game, shop, exchange, etc.). Just by looking at the data, it seems that our service is particularly attractive and addictive to customers in their thirties and forties. Younger and older people tend to stop being active and get removed from this data after a few months. Of course, we cannot see from this data what the reason is (money, value proposition, presentation, etc.).
Assume now that we want to model this propensity to buy next month, based on the data of
this month and the month before. This could lead to a targeted marketing campaign to people
that are predicted to spend very little (e.g. offer them a discount maybe).
In this example, we started from fabricating data, so we know very well what patterns are in
the data. However, with real data, we need to discover those patterns. Below we show a few tools
that can make this work easier and will plot a loess estimation and the histogram.
# Leave out NAs (in this example redundant):
d1 <- dt[complete.cases(dt),]
# Fit a loess:
d1_loess <- loess(spending_ratio ~ age, d1)
# Add predictions:
d1_pred_loess <- predict(d1_loess)
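The plotting itself could be done along the following lines; this is a sketch that produces panels similar to Figure 19.3 (colours and line widths mirror the later plotting code):
par(mfrow = c(1, 2))
plot(d1$age, d1$spending_ratio, pch = 16,
     xlab = 'Age', ylab = 'Spending ratio')
idx <- order(d1$age)
lines(d1$age[idx], d1_pred_loess[idx], lwd = 7, col = 'dodgerblue4')
hist(d1$age, col = 'khaki3', xlab = 'Age')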
Figure 19.2: A plot of the fabricated dataset with the spending ratio as a function of the age of the customers. The spending ratio is defined as S_n / (S_{n-1} + S_n), where S_n is the spending in period n.
par(mfrow=c(1,1))
Later, in Chapter 31 “A Grammar of Graphics with ggplot2” on page 687, we will intro-
duce the plotting capabilities of the library ggplot2. The code that produces a similar
visualization with ggplot2 is here: Section D “Code Not Shown in the Body of the Book”
on page 839.
From the histogram and loess estimate in Figure 19.3 on page 349, we can see that:
• the spending ratio does not simply increase or decrease with age – the relation is non-linear;
• we have few young customers and few older ones (it even looks as if some of those have a definite reason to be inactive on our Internet shop).
So when choosing binning, we need to capture that relationship and make sure that we have
bins of the variable age where the spending ratio is high and other bins where the spending ratio
is low. Before we do so, we will fit a model without binning so that afterwards we can compare
the results.
Figure 19.3: A simple aid to select binning borders is plotting a non-parametric fit (left) and the
histogram (right). The information from both plots combined can be used to decide on binning.
To illustrate the effect of binning, we will use a logistic regression.4 First without binning, and then with binning. Fitting the logistic regression works as follows:
4 The logistic regression is explained in Section 22.1 “Logistic Regression” on page 388.
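A sketch of that first fit is shown below; the object name lReg1 is taken from the discussion that follows, and the quasibinomial family from the output of the second model further down:
lReg1 <- glm(spending_ratio ~ age, family = quasibinomial, data = dt)
summary(lReg1)

# mean square error of this first model (used for comparison below):
MSE1 <- mean((dt$spending_ratio - predict(lReg1, type = "response"))^2)
MSE1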
The function summary(lReg1) tells us what the estimated parameters are (for example, the intercept is −0.099 and the coefficient of the variable age is −0.021). Most importantly, it tells us how significant the variables are. For example, the intercept has no stars and the coefficient has two stars. The legend for the stars is explained below the numbers: two stars means that there is only a probability of less than 1% that the coefficient would be 0 in this model.
Inspired by Figure 19.3 on page 349, we can make an educated guess of what bins would make sense. We choose bins that capture the dynamics of our data and make sure to have a bin for values with high spending ratios and bins for values with low spending ratios.
Now, we will introduce a simple data binning, calculate the logistic model and show the results:
# Bin the variable age:
c <- cut(dt$age, breaks = c(15, 30, 55, 90))

# Check the binning:
table(c)
## c
## (15,30] (30,55] (55,90]
## 118 781 101
# We have one big bucket and two smaller ones (with the smallest
# still more than 10% of our dataset).
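A sketch of the second model follows; the dummy coding of the two outer age bins is an assumption, while the names is_L, is_H, and MSE2 appear in the output below:
dt$is_L <- ifelse(dt$age <= 30, 1, 0)   # low age bin:  (15, 30]
dt$is_H <- ifelse(dt$age >  55, 1, 0)   # high age bin: (55, 90]

lReg2 <- glm(spending_ratio ~ is_L + is_H,
             family = quasibinomial, data = dt)
MSE2 <- mean((dt$spending_ratio - predict(lReg2, type = "response"))^2)
summary(lReg2)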
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.88247 -0.31393 -0.03812 0.22173 1.50439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.74222 0.02791 -26.595 <2e-16 ***
## is_L -0.85871 0.09404 -9.132 <2e-16 ***
## is_H -2.20235 0.16876 -13.050 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.132909)
##
## Null deviance: 195.63 on 999 degrees of freedom
## Residual deviance: 144.92 on 997 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
MSE2
## [1] 0.02603179
We see that indeed the mean square error (MSE) is improved.5 That is great: our model will make better predictions. However, what is even more important: the significance of our coefficients is up. We now have three stars for each coefficient, and the significance of the intercept is up from zero to three stars. That means that model 2 is much more significant and hence more robust for predicting the future.
In our example the MSE is improved by binning. However, the main reason why binning is used is that it makes the model less prone to over-fitting and hence should improve the out-of-sample (future) performance. Further information about this logic and concept can be found in Section 25.4 “Cross-Validation” on page 483.
5 The concept of mean square error is introduced in Section 21.3.1 “Mean Square Error (MSE)” on page 384. It is
a measure that increases when the differences between the predictions and data are larger – so smaller is better.
19.3 More Complex Cases: Matrix Binning
Now, imagine that we have a dataset where male and female customers have similar averages on what we want to predict (e.g. spending ratio6, propensity to have a car accident for insurance claims, etc.), but where there is an underlying reason that causes a different relationship with the explained variable for males and for females. Matrix binning can be the answer to this issue.
To illustrate this, we will construct a dataset that has such built-in structure. In the following block of code this data is generated so that males and females have the same average spending ratio, but for males the ratio increases with age and for females it decreases. The code ends with plotting the data in a scatter-plot (see Figure 19.4 on page 353):
To illustrate this, we will construct a dataset that has such built-in structure. In the following
block of code this data is generated so that males and females have the same average spending
ratio, but for males the ratio increases with age and for females it decreases. The code ends with
plotting the data in a scatter-plot (see Figure 19.4 on page 353):
set.seed(1890)                          # assumed seed
N <- 500                                # assumed: number of observations per gender

# Ladies first:
# age will function as our x-value:
age_f <- rlnorm(N, meanlog = log(40), sdlog = log(1.3))
# x is a temporary variable that will become the propensity to buy:
x_f <- abs(age_f + rnorm(N, 0, 20))     # add noise & keep positive
x_f <- 1 - (x_f - min(x_f)) / max(x_f)  # scale between 0 and 1
x_f <- 0.5 * x_f / mean(x_f)            # coerce mean to 0.5
# This last step will produce some outliers above 1:
x_f[x_f > 1] <- 1                       # coerce those few that are too big to 1
p_f <- x_f                              # the spending probability for females

# The male block is an assumed mirror image of the female block,
# but with the propensity *increasing* with age:
age_m <- rlnorm(N, meanlog = log(40), sdlog = log(1.3))
x_m <- abs(age_m + rnorm(N, 0, 20))
x_m <- (x_m - min(x_m)) / max(x_m)      # no inversion: increases with age
x_m <- 0.5 * x_m / mean(x_m)
x_m[x_m > 1] <- 1
p_m <- x_m

# We want a double plot, so change plot params & save old values:
oldparams <- par(mfrow = c(1, 2))
plot(age_f, p_f,
     pch = 21, col = "blue", bg = "red",
     xlab = "Age",
     ylab = "Spending probability",
     main = "Females"
     )
plot(age_m, p_m,
     pch = 21, col = "blue", bg = "red",
     xlab = "Age",
     ylab = "Spending probability",
     main = "Males"
     )
par(oldparams)                          # restore the plotting parameters
6 The concept “spending ratio” is introduced in Section 19.2 “Tuning the Binning Procedure” on page 347. In this section, we construct a new dataset that is inspired by the one used in the aforementioned section.
Figure 19.4: The underlying relation between spending probability for females (left) and males
(right) in our fabricated example.
In Figure 19.4, we can see that the data indeed has the properties that we were looking for. Age is roughly normally distributed (though there are more young than old people), the propensity to buy (or spending probability) decreases with age for females, and for males it works the opposite way.
This first step was only to prepare the data and show what exactly is inside. In the next step, we will merge the data and assume that this merged dataset is what we have to work with. The following block of code does this and then plots the data for all observations (males and females combined) in Figure 19.5 on page 354:
# Now, we merge the data and consider this as our input-data:
library(tibble)   # provides tibble()
library(dplyr)    # provides full_join()
tf <- tibble("age" = age_f, "sex" = "F", "is_good" = p_f)
tm <- tibble("age" = age_m, "sex" = "M", "is_good" = p_m)
t  <- full_join(tf, tm, by = c("age", "sex", "is_good"))
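The remainder of that block could look as follows; the 0/1 coding of the dummy sexM is an assumption, but the variable itself is used in the code further down:
t$sexM <- ifelse(t$sex == "M", 1, 0)      # numeric dummy for gender
par(mfrow = c(1, 2))
plot(t$age,  t$is_good, pch = 21, col = "blue", bg = "red",
     xlab = "Age", ylab = "Spending probability")
plot(t$sexM, t$is_good, pch = 21, col = "blue", bg = "red",
     xlab = "Sex (0 = F, 1 = M)", ylab = "Spending probability")
par(mfrow = c(1, 1))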
Figure 19.5: The dataset “as received from the customer service department” does not show any clear
relationship between Age or Sex and the variable that we want to explain: the spending ratio.
Now, we are in a situation that is similar to the starting point of a real project: we have data and we need to investigate what is inside. To do this, we will follow the approach that we used earlier in this chapter, and in particular the simple methods proposed in Section 19.2 “Tuning the Binning Procedure” on page 347. The following code does this, and plots the results in Figure 19.6 on page 355:
d1 <- t[complete.cases(t),]

# Figure 19.6 has four panels (assumed layout):
par(mfrow = c(2, 2))

# Spending probability and histogram as a function of age:
d1 <- d1[order(d1$age),]
d1_age_loess      <- loess(is_good ~ age, d1)
d1_age_pred_loess <- predict(d1_age_loess)
plot(d1$age, d1$is_good, pch = 16,
     xlab = 'Age', ylab = 'Spending probability')
lines(d1$age, d1_age_pred_loess, lwd = 7, col = 'dodgerblue4')
hist(d1$age, col = 'khaki3', xlab = 'age')

# The same for gender (coded as the dummy sexM):
d1 <- d1[order(d1$sexM),]
d1_sex_loess      <- loess(is_good ~ sexM, d1)
d1_sex_pred_loess <- predict(d1_sex_loess)
plot(d1$sexM, d1$is_good, pch = 16,
     xlab = 'Gender', ylab = 'Spending probability')
lines(d1$sexM, d1_sex_pred_loess, lwd = 7, col = 'dodgerblue4')
hist(d1$sexM, col = 'khaki3', xlab = 'gender')
Figure 19.6: The data does not reveal many patterns for any of the variables (Gender and Age).
par(mfrow=c(1,1))
The data – as prepared in the aforementioned code – has a particular relation between the dependent variable and one of the explanatory variables. The spending probability for males and females is on average the same, and the averages for all age groups are comparable. On the surface of it, nothing significant happens. However, for males the spending probability increases with age and for females the opposite happens.
In the case of the logistic regression7, normalization matters mainly for the interpretability of the model. This is not our major concern here, so we will skip it and directly try to fit a first naive model.
# Note that we can feed "sex" into the model and it will create
# for us a variable "sexM" (meaning the same as ours)
# To avoid this confusion, we put in our own variable.
regr1 <- glm(formula = is_good ~ age + sexM,
family = quasibinomial,
data = t)
7 More information about the model itself is in Section 22.1 “Logistic Regression” on page 388.
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.990e-02 9.133e-02 -0.765 0.444
## age 1.684e-03 1.680e-03 1.002 0.316
## sexM 6.015e-05 3.730e-02 0.002 0.999
##
## (Dispersion parameter for quasibinomial family taken to be 0.08694359)
##
## Null deviance: 91.316 on 999 degrees of freedom
## Residual deviance: 91.229 on 997 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
As expected, this did not go well: none of the coefficients – not even the intercept – is significant, and the estimates are really small. The MSE is 0.022. This is not good for a variable that has a range roughly between 0 and 1.
Now, we will try a binning method that combines Age and Sex. These new variables should be able to reveal the interactions between the variables.
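A sketch of the four combined age/sex dummies follows; the exact age boundaries are assumptions, while the variable names appear later in the text:
t$is_LF <- ifelse(t$sex == "F" & t$age <= 30, 1, 0)  # young females
t$is_HF <- ifelse(t$sex == "F" & t$age >  55, 1, 0)  # older females
t$is_LM <- ifelse(t$sex == "M" & t$age <= 30, 1, 0)  # young males
t$is_HM <- ifelse(t$sex == "M" & t$age >  55, 1, 0)  # older males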
Now, we have created four variables that each group a combination of age and sex. There are two groups that we could have added too: the middle-aged men and women. However, since these are correlated to the existing categories, it is not wise to add them.
In the next step, we will fit a linear regression on the existing variables:
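A sketch of that second model and of the MSE objects printed below (a simple lm() is used here, as the text suggests; the exact specification is an assumption):
regr2 <- lm(is_good ~ is_LF + is_HF + is_LM + is_HM, data = t)
MSE1  <- mean((t$is_good - predict(regr1, type = "response"))^2)
MSE2  <- mean((t$is_good - predict(regr2))^2)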
MSE1
## [1] 0.02166756
MSE2
## [1] 0.01844601
In many cases, the dependent variable will be binary (0 or 1, “yes” or “no”). That means
that we are trying to model a yes/no decision. For example, 1 can be a customer that
defaulted on a loan, a customer to receive a special offer, etc.
Remake model 1 (logistic regression in function of age and sexM) and model 2 (logistic
regression for the variables is_LF, is_HF, is_LM, and is_HM). What does this change? Are
the conclusions different?
We tried to make our binning as good as possible: we made sure that no bucket is too small, that we are able to capture the important dynamics in the data, etc. This is essential, but it would be better if we could somehow quantify our success.
Weight of evidence (WOE) and information value (IV) do exactly this in a simple and intuitive way. The idea is to see whether the proportions of “good” outcomes and “bad” outcomes are really that different in the different bins.8 This results in a value for WOE or IV, and we can compare this value with the one obtained from another possible binning.
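In their most common form (stated here for reference), the weight of evidence of bin i and the information value of a binned variable are defined as:

    WOE_i = ln( (good_i / good_total) / (bad_i / bad_total) )

    IV = ∑_i (good_i / good_total − bad_i / bad_total) × WOE_i

where good_i and bad_i are the numbers of “good” and “bad” outcomes in bin i. The more the proportions differ across bins, the larger both measures become.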
IV            Predictability
< 0.02        Not predictive
0.02 – 0.1    Weak
0.1 – 0.3     Medium
0.3 – 0.5     Strong
> 0.5         Suspicious

Table 19.1: Different levels of information value and their commonly accepted interpretation – which works well in the context of credit data, for example.
Just for the sake of illustration and presentation purposes, we use knitr to present tables via its function kable(). More information is in Section 33 “knitr and LaTeX” on page 703.
We also use the package tidyverse, from which we use the pipe operator. More information is in Chapter 7 “Tidy R with the Tidyverse” on page 121.
In this example, we will use the dataset that we have prepared in Section 19.3 “More Complex
Cases: Matrix Binning” on page 352:
This dataset has the specific property that, on average, males and females have a similar propensity to spend. However, this propensity is decreasing for females and increasing for males. This situation is particularly difficult, since at first glance the variables Age and Sex will not appear predictive at all. We need to look at the interactions between the variables in order to find the underlying relations.
Now that we have data, we can load the package InformationValue, create a weight of evidence table, and calculate the information value for a given variable:
#install.packages("InformationValue")
library(InformationValue)
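A minimal sketch of such calls is shown below; the binarisation of the outcome (is_good > 0.5) is an assumption made for illustration, since WOETable() and IV() expect a binary outcome:
# Assumed binarisation of the outcome for illustration purposes:
y_bin <- as.integer(t$is_good > 0.5)

# Weight of evidence table and information value for the raw variable sex:
WOETable(X = factor(t$sex), Y = y_bin)
IV(X = factor(t$sex), Y = y_bin)

# The same for one of the combined age/sex dummies:
IV(X = factor(t$is_LF), Y = y_bin)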
Indeed, just dividing the data based on gender is not sufficient, and it does not work. Using our combined variables, such as is_LF (female from the lower age group), which merge the information of age and gender, should work better.
Question #17
Consider the dataset mtcars and investigate if the gearbox type (the variable am) is a good predictor for the layout of the motor (the variable vs, V-motor or not). Do this by using WOE and IV.
♣ 20 ♣
Factor analysis and principal component analysis are mathematically related: they both rely on calculating eigenvectors (of a correlation matrix, or of a covariance matrix of normalized data), both are data reduction techniques that help to reduce the dimensionality of the data, and their outputs will look very much the same. Despite all these similarities, they solve a different problem: principal component analysis (PCA) constructs linear combinations of the variables (such that the principal components (PCs) are orthogonal); factor analysis is a measurement model of a latent variable.
PCA is a data reduction technique that calculates new variables from the set of the measured
variables. These new variables are linear combinations (think of it as a weighted average) of those
measured variables (columns). The index-variables that result from this process are called “com-
ponents.”
The process relies on finding eigenvalues and their eigenvectors and – unless the covariance matrix is singular – this results in as many variables as the dataset has measured variables. However, a PCA also gives us the tools to decide how many of those components are really necessary. So we can find an optimal number of components, which implies an optimal choice of measured variables for each component, and their optimal weights in those principal components.
The result is that any regression or other model can be built on this limited number of components, so it will be less complex and more stable (since all variables are orthogonal), and one can even hope that it is less over-fit, since we left out the components that explain only a small portion of the variance in the dataset. As a rule of thumb – but this depends on your data and purpose – one tries to capture at least 85% of the variance.
The function princomp() of the package stats allows us to execute an unrotated principal component analysis (PCA).
The following code executes the PCA, shows how to access the details of the principal components, and produces the plots of Figure 20.1 on page 366 (see the line with plot(...)) and Figure 20.2 on page 367 (see the line with biplot(...)):
# PCA: extracting PCs from the correlation matrix:
fit <- princomp(mtcars, cor = TRUE)

summary(fit)               # proportion of variance explained per component
loadings(fit)              # the loadings of the PCs
plot(fit, type = "lines")  # the variances of the components (Figure 20.1)
biplot(fit)                # biplot in the plane of the first two PCs (Figure 20.2)
Figure 20.1: A visualization of the loadings of the principal components of the dataset mtcars.
In case we do not normalize (scale) our data, all variables will have a different range, and hence the PCA method will be biased towards the variables with the largest range. That is why we have set the option cor = TRUE.
If data is normalized (all ranges are similar or the same), then principal components can
be based on the covariance matrix and the option can be set to FALSE. Also note that in
case there are constant variables, the correlation matrix is not defined and cannot be used.
Further, the covmat-option can be used to enter a correlation or covariance matrix
directly. If you do so, be sure to specify the option n.obs.
The aforementioned code is only a few short lines, but a lot can be learned from its output. For example, note that:
• summary(fit) shows that the first two principal components explain about 85% of the variance. That is really high; adding a third PC brings this to almost 90%. This means that while our dataset has 11 columns, there are in fact only 3 truly independent underlying dimensions. To some extent, other dimensions also exist, but they are less relevant.
• biplot(fit) shows the structure of the dataset as projected in the plane spanned by the first two principal components. This 2D projection explains 85% of the variance – in our case – and hence can be expected to include most of the information of the dataset. Note for example that
Figure 20.2: The biplot of the dataset mtcars: all observations and dimensions projected in the plane spanned by the first two principal components.
– the cars group in four clusters: this means that there are four underlying car concepts;
– in the Northwest part of the plot, we see the cluster that represents “muscle cars,” and it is very well aligned with the dimensions of weight, displacement, cylinders, and horsepower;
– dimensions such as fuel consumption, in-line motor, and rear axle ratio seem to be opposite to the previous group of dimensions;
– indeed, the output of loadings(fit) confirms this finding.
Since a lot of the variance is explained by the first PCs, it is a good idea to fit any model (such as a logistic regression or any other model) not directly on mtcars, but rather on its principal components.
This will make the model more stable, and one can expect the model to perform better out of sample; the only cost is the loss of transparency of the model.a
a If the first principal components can be summarized as a certain concept, then there is little to no loss of transparency. However, usually the PCs are composed of too many variables and do not summarize as one concept.
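A minimal sketch of this idea, assuming we want to explain mpg with the other variables and keep three components (all object names are our own), could look as follows:
# Keep mpg apart and compute the PCs of the remaining variables:
X   <- mtcars[, setdiff(names(mtcars), "mpg")]
pca <- princomp(X, cor = TRUE)

# Use the scores of the first three components as regressors:
dat  <- cbind(mpg = mtcars$mpg, as.data.frame(pca$scores[, 1:3]))
m_pc <- lm(mpg ~ ., data = dat)
summary(m_pc)$r.squared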
A factor analysis also will lead to data reduction, but it answers a fundamentally different ques-
tion. Factor analysis is a model that tries to identify a “latent variable.” The assumption is that
this latent variable cannot be directly measured with a single variable. The latent variable is made
visible through the correlations that it causes in the observed variables.
For example, factor analysis can help to determine the number of dimensions of the human character. What really is an independent dimension in our character? How should we name it? In a commercial setting, it can try to measure something like customer satisfaction, which in a questionnaire is captured by many variables rather than by the single question “how satisfied are you?”
The idea is that one latent factor might exist that summarises some of our initial parameters. For example, in psychology, the dimension “agreeability” seems to be a fundamentally independent dimension that builds up our character. This underlying latent factor cannot be asked about directly. Rather, we ask respondents to rate, on a scale of one to five, statements such as “When someone tells me to clean my room, I’m inclined to do so right away” or “When someone of authority asks me to do something, this is motivating for me,” etc. The underlying, latent factor agreeability will cause the answers to those questions to be correlated. Factor analysis aims to identify that underlying factor.
Compared to PCA one could say that the assumed causality is opposite. Factor analysis
assumes that the latent factor is the cause, while in PCA, the component is a result of the
observed variables.
The function factanal() of the package stats can be used to execute a maximum likelihood factor analysis on a covariance matrix or on the raw data. This is illustrated in the next code segment, which produces the figure that follows the code.
# Maximum Likelihood Factor Analysis
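# The lines below are a sketch: the exact call is not shown here, and the
# choice of three factors and the varimax rotation are assumptions.
fa <- factanal(mtcars, factors = 3, rotation = "varimax")
print(fa, digits = 2, cutoff = 0.3, sort = TRUE)

# Plot factor 1 against factor 2 and label the variables
# (as in the figure below):
load <- fa$loadings[, 1:2]
plot(load, type = "n", xlab = "Factor1", ylab = "Factor2")
text(load, labels = rownames(load), cex = 0.9)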
(Figure: the loadings of the variables of mtcars plotted in the plane of Factor1 and Factor2.)
The options for the parameter rotation include “varimax,” “promax,” and “none.” The option scores accepts the values “regression” or “Bartlett” and determines the method to calculate the factor scores. The option covmat can be used to specify a correlation or covariance matrix directly. When entering a covariance matrix, do not forget to provide the option n.obs.
A crucial decision in exploratory factor analysis is how many factors to extract. The nFactors
package offers a suite of functions to aid in this decision. The code below will demonstrate the
use of this package and finally plot the Scree test in Figure 20.3 on page 370. First, we load the
package:
(Figure 20.3: the scree plot of the eigenvalues; the legend reports eigenvalues > mean: 2, parallel analysis: n = 2, optimal coordinates: n = 2, and acceleration factor: n = 1.)
Then we can perform the analysis, get the optimal number of factors, and plot a visualisation:
# Get the eigenvectors:
eigV <- eigen(cor(mtcars))
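# The remaining steps are sketched below; the argument values are assumptions.
library(nFactors)
ap <- parallel(subject = nrow(mtcars), var = ncol(mtcars),
               rep = 100, cent = 0.05)
nS <- nScree(x = eigV$values, aparallel = ap$eigen$qevpea)
plotnScree(nS)    # produces the scree plot of Figure 20.3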
We will not discuss those rules here, but – for your reference – the rules used are the Kaiser-Guttman rule, parallel analysis, and the scree test; the function nScree() combines them to find an optimal number of factors. The documentation of the nScree() function has all references.
The FactoMineR package offers a large number of additional functions for exploratory
factor analysis. This includes the use of both quantitative and qualitative variables, as well
as the inclusion of supplementary variables and observations. Here is an example of the
types of graphs that you can create with this package – note that the output is not shown.
# PCA will generate two plots as a side effect (not shown here):
library(FactoMineR)
result <- PCA(mtcars)
PART V
Modelling
♣ 21 ♣
Regression Models
Linear Regression
With a linear regression, we try to estimate an unknown variable y based on a known variable x and some constants (a and b). Its form is:

    y = ax + b
To illustrate the linear regression, we will use the dataset survey from the package MASS. This dataset contains some physical measures, such as the span of the writing hand (variable Wr.Hnd), the height of the person (variable Height), and the gender (variable Sex). Let us first illustrate the data by plotting the hand size as a function of the height (the result is in Figure 21.1 on page 376):
library(MASS)
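A sketch of that plot (axis labels are our own):
plot(survey$Height, survey$Wr.Hnd, col = "red",
     xlab = "Height (cm)", ylab = "Span of writing hand (cm)")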
The package stats, which is loaded at the start of R, has a function called lm() that can handle a linear regression. Its use is not difficult: we provide a data frame (argument data) and a formula (argument formula). The formula has the shape y ~ x.
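A sketch of the fit used in the remainder of this section (the object name lm1 is taken from the prediction code below):
lm1 <- lm(formula = Wr.Hnd ~ Height, data = survey)
summary(lm1)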
The linear model is of class lm, and the method-based OO system in R takes care of most of the class-specific issues. As seen above, the summary() method prints something meaningful for this class. Plotting the linear model is illustrated below; the output is in Figure 21.2 on page 377:
# predictions
h <- data.frame(Height = 150:200)
Wr.lm <- predict(lm1, h)
plot(survey$Height, survey$Wr.Hnd,col="red")
lines(t(h),Wr.lm,col="blue",lwd=3)
In the previous code, we visualized the model by adding the predictions of the linear model (connected with lines between them, so the result is just a line). The function abline() provides another elegant way to draw straight lines in plots. The function takes as arguments the intercept and the slope (or simply a fitted lm object). The following code illustrates its use, and the result is in Figure 21.3 on page 377.
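A sketch of that code (the titles are our own):
plot(survey$Height, survey$Wr.Hnd, col = "red",
     main = "Hand span as a function of height",
     xlab = "Height (cm)", ylab = "Span of writing hand (cm)")
abline(lm1, col = "blue", lwd = 3)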
Figure 21.2: A plot visualizing the linear regression model (the data in red and the regression in
blue).
Figure 21.3: Using the function abline() and cleaning up the titles.
Consider the data set mtcars from the library MASS. Make a linear regression of the fuel consumption as a function of the parameter that, according to you, has the most explanatory power. Study the residuals. What is your conclusion?
Multiple linear regression uses more than one known variable to predict a single unknown variable:

    y = b + a_1 x_1 + a_2 x_2 + · · · + a_n x_n
In R, the lm() function will handle this too: all we need to do is update the parameter formula. Note also that all coefficients and the intercept can be accessed via the function coef().
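A sketch of such a fit; the choice of explanatory variables (disp, hp, and wt) is inferred from the prediction example below, and the numbers quoted there may come from a different specification:
m2 <- lm(mpg ~ disp + hp + wt, data = mtcars)
coef(m2)

# Store the coefficients in separate variables:
a_disp <- coef(m2)["disp"]
a_hp   <- coef(m2)["hp"]
a_wt   <- coef(m2)["wt"]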
The variables that are defined in this way can be used to implement the model (for example in a production system). It works as follows:
# This allows us to manually predict the fuel consumption
# e.g. for the Mazda Rx4
2.23 + a_disp * 160 + a_hp * 110 + a_wt * 2.62
## disp
## -11.30548
Consider the data set mtcars from the library MASS. Make a linear regression that predicts the fuel consumption of a car. Make sure to include only significant variables, and remember that the significance of a variable depends on the other variables in the model.
The Poisson regression is particularly useful where the unknown variable (the response variable) can never be negative, for example in the case of predicting counts (e.g. numbers of events). Its form is:

    log(y) = b + a_1 x_1 + a_2 x_2 + · · · + a_n x_n

with y the response (count) variable and b, a_1, …, a_n the numeric coefficients.
The Poisson regression can be handled by the function glm() in R; its general form is as follows.
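Schematically (the names formula and data are placeholders):
glm(formula, data = data, family = "poisson")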
where:
• formula is the symbolic representation of the relationship between the variables,
• family is the R object that specifies the details of the model; for the Poisson regression its value is "poisson".
Note that the function glm() is also used for the logistic regression; see Chapter 22.1 on page 388. The function glm() has multiple types of models built in, for example: gaussian (linear), inverse gaussian, gamma, quasibinomial, and quasipoisson. To access its full documentation, type ?family in R.
Consider a simple example, where we want to check if we can estimate the number of cylinders of a car based on its horsepower and weight, using the dataset mtcars:
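A sketch of the fit whose summary is printed below (the object name is our own):
m_pois <- glm(cyl ~ hp + wt, family = poisson, data = mtcars)
summary(m_pois)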
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.59240 -0.31647 -0.00394 0.29820 0.68731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.064836 0.257317 4.138 3.5e-05 ***
## hp 0.002220 0.001264 1.756 0.079 .
## wt 0.124722 0.090127 1.384 0.166
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 16.5743 on 31 degrees of freedom
## Residual deviance: 4.1923 on 29 degrees of freedom
## AIC: 126.85
##
## Number of Fisher Scoring iterations: 4
Weight does not seem to be relevant, so we drop it and try again (only using horse power):
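A sketch of the reduced model (object name is our own):
m_pois2 <- glm(cyl ~ hp, family = poisson, data = mtcars)
summary(m_pois2)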
This works better, and the Poisson model seems to work fine for this dataset.
In least squares regression, we establish a regression model in which the sum of the squares of the vertical distances of the different points from the regression curve is minimized. We generally start with a defined model and assume some values for the coefficients. We then apply the nls() function of R to get more accurate values along with the confidence intervals.
The use of the function nls() is best clarified by a simple example. We will fabricate an example in the following code, and plot the data in Figure 21.4 on page 383.
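A sketch of such a fabricated example (all data, model form, and starting values are our own assumptions):
set.seed(1890)
x <- seq(0, 10, length.out = 100)
y <- 2.5 * exp(0.3 * x) + rnorm(100, sd = 2)   # fabricate noisy data
d_nl <- data.frame(x = x, y = y)

# Start from assumed coefficient values and let nls() refine them:
m_nls <- nls(y ~ a * exp(b * x), data = d_nl, start = list(a = 2, b = 0.2))
summary(m_nls)

plot(d_nl$x, d_nl$y, col = "red", xlab = "x", ylab = "y")
lines(d_nl$x, predict(m_nls), col = "blue", lwd = 3)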
The model seems to fit the data quite well. As usual, we can extract more information from the model object via the functions summary() and/or print().
Figure 21.4: The results of the non-linear regression with nls(). This plot indicates that there is one
outlier and you might want to rerun the model without this observation.
The choice of a model is a complicated task. One needs to consider:
• Simple predictive model quality: is the model able to make systematically better predictions than a random model (e.g. the height of the lift curve/AUC for classification models and the mean square error for regression models)?
• Generalization ability of the model: the difference in model quality between the training data (used for creation) and the testing data. In other words: can we be sure that the model is not over-fit?
• Explanatory power of the model: does the model “make sense” of the data? Can it explain something?
• Model stability: what is the confidence interval of the model on the lift curves?
In this section, we will focus on the first issue: the intrinsic predictive quality of the model
and present some tools that are frequently used.
21.3.1 Mean Square Error (MSE)
The mean square error is the average of the squared residuals. It is defined as:

    MSE(y, ŷ) = (1/N) ∑_{k=1}^{N} (y_k − ŷ_k)²
21.3.2 R-Squared
While MSE provides a reliable measure of the variation that is not explained by the model, it is sensitive to units (e.g. the use of millimetres or kilometres will result in a different MSE for the same model).
There is another, equally important measure, R² (pronounced “R-squared”), that does not have this problem. In some way it is a normalized version of the MSE.
Definition: R-squared
R-squared is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). We can calculate R-squared as:

    R² = 1 − ∑_{k=1}^{N} (y_k − ŷ_k)² / ∑_{k=1}^{N} (y_k − ȳ)²

with ŷ_k the estimate for observation y_k based on our model, and ȳ the mean of all observations y_k.
Let us consider a simple example of a linear model that predicts the fuel consumption of a car based on its weight, and use the dataset mtcars to do so. The R-squared is such an important measure that it shows up in the summary and is immediately available after fitting the linear model:
m <- lm(mpg ~ wt, data = mtcars)   # the linear model described above
summary(m)$r.squared
## [1] 0.7528328
Use the dataset mtcars (from the library MASS), and try to find the model that best
explains the consumption (mpg).
In cases where outliers matter less, or where outliers distort the results, one can rely on more robust methods that are based on the median. Many variations of these measures can be useful; we present a selection.
Definition: Mean average deviation (MAD)

    MAD(y, ŷ) := (1/N) ∑_{k=1}^{N} |y_k − ŷ_k|
♣ 22 ♣
Classification Models
Classification models do not try to predict a continuous variable, but rather try to predict whether an observation belongs to a certain discrete class. For example, we can predict if a customer would be creditworthy or not, or we can train a neural network to classify animals or plants on a picture. These models with a discrete outcome have an equally large domain of use and are essential to many business applications. The first one that we will discuss, the logistic regression, does only binary prediction but is very transparent and allows us to build strong models.
22.1 Logistic Regression
Logistic regression (aka logit regression) is a regression model where the unknown variable is categorical (it can have only a limited number of values): it can either be “0” or “1.” In reality these values can refer to any mutually exclusive concept, such as repay/default, pass/fail, win/lose, survive/die, or healthy/sick.
Cases where the dependent variable has more than two outcome categories may be analysed with multinomial logistic regression or, if the multiple categories are ordered, with ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model.
In its most general form, the logistic regression is defined as follows.
Definition: Generalised logistic regression

    y = P(Y = 1) = 1 / (1 + e^{−(b + f_1(x_1) + f_2(x_2) + f_3(x_3) + · · ·)})

This type of model can be used to predict the probability that Y = 1, or to study the f_n(X_n) and hence understand the dynamics of the problem.
The general, additive logistic regression can be solved by estimating the f_n() via a back-fitting algorithm within a Newton-Raphson procedure. The linear additive logistic regression is defined as follows.
Logistic Regression
Assuming a linear model for the f_n, the probability that Y = 1 is modelled as:

    y = 1 / (1 + e^{−(b + a_1 x_1 + a_2 x_2 + a_3 x_3 + · · ·)})
This regression can be fitted with the function glm(), which we encountered earlier.
# Consider the relation between the hours studied and passing
# an exam (1) or failing it (0):
hours <- c(0,0.50, 0.75, 1.00, 1.25, 1.50, 1.75,
1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25,
3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50)
pass <- c(0,0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 1, 1, 1, 1, 1)
d <- data.frame(cbind(hours,pass))
m <- glm(formula = pass ~ hours, family = binomial,
data = d)
The function glm() is also used for the Poisson regression; see Chapter 21.2.1 on page 379.
Plotting the observations and the logistic fit can be done with the following code. The plot is in Figure 22.1.
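A sketch of such a plot (plotting symbols and colours follow the figure caption):
plot(d$hours, d$pass, pch = 23, col = "red", bg = "grey",
     xlab = "Hours studied", ylab = "Probability to pass")
newx <- seq(0, 5.5, length.out = 200)
lines(newx, predict(m, newdata = data.frame(hours = newx), type = "response"),
      col = "blue", lwd = 3)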
Figure 22.1: The grey diamonds with red border are the data points (not passed is 0 and passed is 1), and the blue line represents the logistic regression model (the probability to pass the exam as a function of the hours studied).
Regression models fit a curve through a dataset and hence the performance of the model can be
judged by how close the observations are from the fit. Classification models aim to find a yes/no
prediction. For example, in the dataset of the Titanic disaster we can predict if a given passenger
would survive the disaster or not.
Pseudo-R² values have been developed for binary logistic regression. While there is value in testing for predictive discrimination, the pseudo-R² for logistic regressions has to be handled with care: there might be more than one reason for it to be high or low. A better approach is to present one of the available goodness-of-fit tests, for example the Hosmer–Lemeshow test. This test provides an overall calibration error test. However, it does not properly take over-fitting into account, nor the choice of bins, and it has other issues. Therefore, Hosmer et al. provide another test: the omnibus test of fit, which is implemented in the rms package and its residuals.lrm() function.
When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will always increase the amount of variance explained in the log odds (for example expressed as R²). However, adding too many variables to the model can result in over-fitting, which means that while the R² improves, the model loses predictive power on new data.
In the following sections we will use the dataset from the package titanic. This is data about the passengers of the RMS Titanic, which sank in 1912 in the North Atlantic Ocean after a collision with an iceberg.
❦ The data can be unlocked as follows: ❦
# if necessary: install.packages('titanic')
library(titanic)
Fitting the logistic model is done with the function glm(), where we provide the “binomial”
for the argument family, and the rest works the same as for other regressions.
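A sketch of such a fit, consistent with the output below (the data preparation in t2 is an assumption):
t2 <- titanic_train          # assumed: the training partition of the titanic data
m  <- glm(Survived ~ Pclass * Sex + Age + SibSp,
          family = binomial, data = t2)
summary(m)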
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.487528 0.996601 8.516 < 2e-16 ***
## Pclass -2.429192 0.330221 -7.356 1.89e-13 ***
## Sexmale -6.162294 0.929696 -6.628 3.40e-11 ***
## Age -0.046830 0.008603 -5.443 5.24e-08 ***
## SibSp -0.354855 0.120373 -2.948 0.0032 **
## Pclass:Sexmale 1.462084 0.349338 4.185 2.85e-05 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 614.22 on 708 degrees of freedom
## (177 observations deleted due to missingness)
## AIC: 626.22
##
## Number of Fisher Scoring iterations: 6
• Negative predictive value: the proportion of negative cases that were correctly identified.
• Sensitivity or recall: the proportion of actual positive cases which are correctly identified.
• Specificity: the proportion of actual negative cases which are correctly identified.
These definitions rely on the following counts:
– True positive (TP): the positive observations (y = 1) that are correctly classified as positive by the model;
– False positive (FP): the negative observations (y = 0) that are incorrectly classified as positive – a false alarm (Type I error);
– True negative (TN): the negative observations (y = 0) that are correctly classified as negative;
– False negative (FN): the positive observations (y = 1) that are incorrectly classified as negative – a miss (Type II error).
Using these definitions, we construct a matrix that summarizes that information: the confusion matrix. This confusion matrix is visualised in Table 22.1.

                 actual pos.   actual neg.
    pred. pos.   TP            FP
    pred. neg.   FN            TN

Table 22.1: The confusion matrix, where “pred.” refers to the predictions made by the model and stands for “predicted,” and the words “positive” and “negative” are shortened to three letters.
More than any metric, the confusion matrix is a good instrument to make clear to lay persons that any model has an inherent risk, and that there is no ideal solution for the cut-off value: one can quite easily observe what happens when the cut-off is changed.
A prediction object and the function table() are all we need in R to produce a confusion matrix.
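A sketch, using a 50% cut-off (the cut-off value is an assumption):
pred_prob  <- predict(m, type = "response")    # fitted probabilities
pred_class <- ifelse(pred_prob > 0.5, 1, 0)    # predicted class at cut-off 0.5
table(predicted = pred_class, observed = m$model$Survived)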
Building further on the idea of classification errors as a measure of suitability of the model, we
will explore in the following sections measures that focus on separation of the distribution of the
observations that had a 0 or 1 outcome and how well the model is able to discriminate between
them.
Unfortunately, there are multiple words that refer to the same concepts. Here we list a few
important ones.
• TPR = True Positive Rate = sensitivity = recall = hit rate = probability of detection

    TPR = TP / P = TP / (TP + FN) = 1 − FNR

• FPR = False Positive Rate = fall-out

    FPR = FP / N = FP / (FP + TN) = 1 − TNR

• TNR = True Negative Rate = specificity

    TNR = TN / N = TN / (FP + TN) = 1 − FPR

• FNR = False Negative Rate = miss rate

    FNR = FN / P = FN / (TP + FN) = 1 − TPR

• PPV = Positive Predictive Value = precision

    PPV = TP / (TP + FP)

• NPV = Negative Predictive Value

    NPV = TN / (TN + FN)

• ACC = accuracy

    ACC = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)

• F1 = F1 score (the harmonic mean of precision and recall)

    F1 = 2 (PPV × TPR) / (PPV + TPR) = 2 TP / (2 TP + FP + FN)
There are more measures used from time to time, though we believe these to be the most important.
22.2.2 ROC
Engineers working on improving the radar during World War II developed the ROC for detecting (or not detecting) enemy objects. They called it the “receiver operator characteristic” (ROC). Since then, it has become one of the quintessential tools for assessing the discriminative power of binary models.
The ROC curve is formed by plotting the true positive rate (TPR) against the false positive rate (FPR) at various cut-off levels.1 Formally, the ROC curve is the interpolated curve made of points whose coordinates are functions of the threshold θ ∈ ℝ, here θ ∈ [0, 1]:

    ROC_x(θ) = FPR(θ) = FP(θ) / (FP(θ) + TN(θ)) = FP(θ) / #N

    ROC_y(θ) = TPR(θ) = TP(θ) / (FN(θ) + TP(θ)) = TP(θ) / #P = 1 − FN(θ) / #P = 1 − FNR(θ)
An alternative way of obtaining the ROC curve is using the probability density function of the
true positives (TP, or “detection” as in the original radar problem) and false positives (FP, or “false
alarm” in the war terminology). The ROC is then obtained by plotting the cumulative distribution
1 In other words, the ROC curve is the sensitivity plotted as a function of fall-out.
function of the TP probability on the y-axis versus the cumulative distribution function of the FP probability on the x-axis.
Later we will see how the ROC curve can help to identify which models are (so far) optimal and which ones are inferior. Also, it can be related to the cost functions typically associated with TP and FP.
We visualize the ROC curve with the aid of the ROCR package. First, we need to add predictions to the model, and then we can use the function plot() on a performance object that uses those predictions. The plot is generated in the last lines of the following code, and is in Figure 22.2.
library(ROCR)
# Re-use the model m and the dataset t2:
pred <- prediction(predict(m, type = "response"), t2$Survived)
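# The remaining lines are a sketch of how the performance object is
# created and plotted (the exact plotting options used for Figure 22.2
# are assumptions):
perf <- performance(pred, "tpr", "fpr")
plot(perf, lwd = 2, col = 'blue')
abline(a = 0, b = 1, lty = 2)   # reference line of a random model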
The function performance() from the package ROCR generates an S4 object (see Section 6.3 "S4 Objects" on page 100). It typically gets slots with the x and y values and their names (as well as, in this case, slots for alpha). This object knows how it can be plotted. If necessary, it can be converted to a data frame as follows:
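A minimal sketch of such a conversion (using the performance object perf created above; the column names match those used in the ggplot2 code below):

df <- data.frame(attr(perf, "x.values")[[1]],
                 attr(perf, "y.values")[[1]])
names(df) <- c("False positive rate", "True positive rate")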
In a final report, it might be desirable to use the power of ggplot2 consistently. In the following code we illustrate how this ROC curve can be obtained in ggplot2.2 The plot is in Figure 22.3.
library(ggplot2)
p <- ggplot(data = df,
            aes(x = `False positive rate`, y = `True positive rate`)) +
     geom_line(lwd = 2, col = 'blue') +
     # The next lines add the shading:
     aes(x = `False positive rate`, ymin = 0,
         ymax = `True positive rate`) +
     geom_ribbon(alpha = .5)
p
The performance object can also provide the accuracy of the model, and this can be plotted
as follows – note that the plot is in Figure 22.4 on page 396.
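The code shown here is a sketch (the exact options used to produce Figure 22.4 are assumptions):

perf_acc <- performance(pred, "acc")    # accuracy for each cut-off value
plot(perf_acc, lwd = 2, col = 'navy')   # cf. Figure 22.4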
Figure 22.4: A plot of the accuracy as a function of the cut-off (threshold) level.
A first observation is that the ROC curve is monotonically non-decreasing:

$$\theta \leq \theta' \implies \mathrm{TPR}(\theta) \leq \mathrm{TPR}(\theta')$$

A good way to understand this inequality is by remembering that the TPR is also the fraction of observed survivors that is predicted by the model to survive (in "lending terms": the fraction of "goods accepted"). This fraction can only increase and never decrease.
One also observes that a random prediction would follow the identity line ROCy = ROCx, because if the model has no predictive power, then the TPR will increase equally with the FPR.
Therefore, a reasonable model's ROC is located above the identity line, as a point below it would imply a prediction performance worse than random. If that were to happen, it would be possible to simply invert the predictions.
All those features combined make it reasonable to summarize the ROC into a single value by calculating the area of the convex shape below the ROC curve – this is the area under the curve (AUC). The closer the ROC gets to the optimal point of perfect prediction, the closer the AUC gets to 1.
To illustrate this, we draw the ROC and label the areas in Figure 22.5 on page 397, which can
be obtained by executing the following code.
Figure 22.5: The area under the curve (AUC) is the area A plus the area C. In the next section we characterise the Gini coefficient, which equals area A divided by area B.
The AUC in R is provided by the performance() function of ROCR and stored in the performance object. It is an S4 object, and hence we can extract the information as follows.
AUC <- attr(performance(pred, "auc"), "y.values")[[1]]
AUC
## [1] 0.8615241
In R, extracting the Gini coefficient from the performance object is trivial, given the AUC that
we calculated before. In fact, we can use the AUC to obtain the Gini:
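A minimal sketch, using the relation Gini = 2 AUC − 1:

Gini <- 2 * AUC - 1
Gini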
The Kolmogorov–Smirnov (KS) test is another measure that aims to summarize the power of a model in one parameter. In general, the KS is the largest distance between two cumulative distribution functions: $\mathrm{KS} = \max_x |F_1(x) - F_2(x)|$.
The KS of a logistic regression model is the KS applied to the distributions of the "bads" (negatives) and the "goods" (positives) as a function of their score. The higher the KS, the better the model.3
Since these cumulative distribution functions are related to the TPR and the FPR, there are
two ways of understanding the KS: as the maximum distance between the ROC curve and the
bisector or as the maximum vertical distance between the cumulative distribution functions of
positives and negatives in our dataset.
A visualization of the KS measure, based on the definition of the cumulative distribution func-
tions can be found in Figure 22.6.
Figure 22.6: The KS as the maximum distance between the cumulative distributions of the positive and negative observations.
3 Note that "goods" and "bads" are jargon rather than proper English words; they refer respectively to the positive observations (y = 1) and the negative observations (y = 0).
The code for the plots in Figure 22.6 on page 398 is not included here, but you can find it
in the appendix: Chapter D “Code Not Shown in the Body of the Book” on page 841. That
code includes an interesting alternative to calculate the KS measure.
The package stats from base R provides the function ks.test() to calculate the KS.
As you can see in the aforementioned code, this does not work in some cases. Fortunately, it
is easy to construct an alternative:
perf <- performance(pred, "tpr", "fpr")
ks <- max(attr(perf,'y.values')[[1]] - attr(perf,'x.values')[[1]])
ks
## [1] 0.6243656
Now that we have the value of the KS, we can visualize it on the ROC curve in Figure 22.7 on page 400. This is done as follows:
pred <- prediction(predict(m, type = "response"), t2$Survived)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = paste0('KS is ', round(ks*100,1), '%'),
     lwd = 4, col = 'red')
lines(x = c(0,1), y = c(0,1), col = 'blue')
# The KS line:
diff <- attr(perf, 'y.values')[[1]] - attr(perf, 'x.values')[[1]]
xVal <- attr(perf, 'x.values')[[1]][diff == max(diff)]
yVal <- attr(perf, 'y.values')[[1]][diff == max(diff)]
lines(x = c(xVal, xVal), y = c(xVal, yVal),
      col = 'khaki4', lwd = 8)
Figure 22.7: The KS as the maximum distance between the model and a pure random model.
However, to make predictions we need to choose one cut-off level. In a first naive approach, one might want to select an optimal threshold by optimizing specificity and sensitivity.
We are now ready to present a simple function that optimizes sensitivity and specificity.
# get_best_cutoff
# Finds a cutoff for the score so that sensitivity and specificity
# are optimal.
# Arguments
#    fpr    -- numeric vector -- false positive rate
#    tpr    -- numeric vector -- true positive rate
#    cutoff -- numeric vector -- the associated cutoff values
# Returns:
#    the cutoff value (numeric)
get_best_cutoff <- function(fpr, tpr, cutoff){
  cst <- (fpr - 0)^2 + (tpr - 1)^2
  idx = which(cst == min(cst))
  c(sensitivity = tpr[[idx]],
    specificity = 1 - fpr[[idx]],
    cutoff = cutoff[[idx]])
  }
# opt_cut_off
# Wrapper for get_best_cutoff. Finds a cutoff for the score so that
# sensitivity and specificity are optimal.
# Arguments:
#    perf -- performance object (ROCR package)
#    pred -- prediction object (ROCR package)
# Returns:
#    The optimal cutoff value (numeric)
opt_cut_off = function(perf, pred){
  mapply(FUN = get_best_cutoff,
         [email protected],
         [email protected],
         pred@cutoffs)
  }
However, in general, this is a little naive for it assumes that sensitivity and specificity come at
the same cost. Indeed, in general, the cost of Type I error is not the same as the cost of a Type II
error. For example, in banking, it is hundreds of times more costly to have one customer that
defaults on a loan than rejecting a good customer unfairly. In medical image recognition, it would
be costly to start cancer treatment for someone who does not require it, but it would be a lot worse
to tell a patient to go home while in fact he or she needs treatment.
To take this asymmetry into account, one can introduce a cost function that captures the multiplier between the cost of a Type I and the cost of a Type II error:
# get_best_cutoff
# Finds a cutoff for the score so that sensitivity and specificity
# are optimal.
# Arguments
# fpr -- numeric vector -- false positive rate
# tpr -- numeric vector -- true positive rate
# cutoff -- numeric vector -- the associated cutoff values
# cost.fp -- numeric -- cost of false positive divided
# by the cost of a false negative
# (default = 1)
# Returns:
# the cutoff value (numeric)
get_best_cutoff <- function(fpr, tpr, cutoff, cost.fp = 1){
cst <- (cost.fp * fpr - 0)^2 + (tpr - 1)^2
idx = which(cst == min(cst))
c(sensitivity = tpr[[idx]],
specificity = 1 - fpr[[idx]],
cutoff = cutoff[[idx]])
}
# opt_cut_off
# Wrapper for get_best_cutoff. Finds a cutoff for the score so that
# sensitivity and specificity are optimal.
# Arguments:
#    perf    -- performance object (ROCR package)
#    pred    -- prediction object (ROCR package)
#    cost.fp -- numeric -- cost of false positive divided by the
#               cost of a false negative (default = 1)
# Returns:
#    The optimal cutoff value (numeric)
opt_cut_off = function(perf, pred, cost.fp = 1){
  mapply(FUN = get_best_cutoff,
         [email protected],
         [email protected],
         pred@cutoffs,
         cost.fp)
  }
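As a usage sketch (with a hypothetical cost ratio of 5, as in the bank example above):

perf <- performance(pred, "tpr", "fpr")
opt_cut_off(perf, pred, cost.fp = 5)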
When over-writing functions, it is always a good idea to make sure that new parameters have a default value that results in a behaviour that is exactly the same as in the previous version. That way we make sure that code that was written with the older version of the function in mind will still work.
While it is insightful to program this function ourselves, it is not really necessary. ROCR provides a more convenient way to obtain this. All we need to do is provide the cost.fp argument to the performance() function of ROCR beforehand. Then, all that remains is finding the cut-off that minimizes the cost.
# e.g. cost.fp = 1 x cost.fn
perf_cst1 <- performance(pred, "cost", cost.fp = 1)
str(perf_cst1) # the cost is in the y-values
## Formal class 'performance' [package "ROCR"] with 6 slots
## ..@ x.name : chr "Cutoff"
## ..@ y.name : chr "Explicit cost"
## ..@ alpha.name : chr "none"
## ..@ x.values :List of 1
## .. ..$ : Named num [1:410] Inf 0.996 0.995 0.995 0.995 ...
## .. .. ..- attr(*, "names")= chr [1:410] "" "298" "690" "854" ...
## ..@ y.values :List of 1
## .. ..$ : num [1:410] 0.406 0.408 0.406 0.402 0.401 ...
## ..@ alpha.values: list()
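The companion object perf_cst2 that is used in Figure 22.8 can be created analogously; a sketch (the search for the cheapest cut-off is an illustration, not the book's code):

# e.g. cost.fp = 5 x cost.fn
perf_cst2 <- performance(pred, "cost", cost.fp = 5)
# The cut-off with the lowest explicit cost:
attr(perf_cst1, "x.values")[[1]][which.min(attr(perf_cst1, "y.values")[[1]])]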
Figure 22.8: The cost functions compared for different cost structures. In plot (a), we plotted the cost function when the cost of a false positive is equal to the cost of a false negative. In plot (b), a false positive costs five times more than a false negative (a ratio valid for a loan in a bank).
Again, the performance object with cost information can be plotted fairly easily. The code below uses the function par() to force two plots into one frame, and then resets the frame to one plot. The result is in Figure 22.8 on page 403.
par(mfrow=c(2,1))
plot(perf_cst1, lwd=2, col='navy', main='(a) cost(FP) = cost(FN)')
plot(perf_cst2, lwd=2, col='navy', main='(b) cost(FP) = 5 x cost(FN)')
par(mfrow=c(1,1))
♣ 23 ♣
Learning Machines
Machine learning is neither a new nor a recent concept. The term "machine learning" (ML) was coined by Arthur Samuel in 1959 while he worked for IBM. A first book about machine learning for pattern recognition was "Learning Machines" – Nilsson (1965). Machine learning remained an active research topic, and the author was in the 1980s studying perceptron neural networks at university.
Already in 1965, the first framework for "multilayer perceptrons" was published by Alexey Ivakhnenko and Lapa. Since 1986, we call those algorithms "deep learning," thanks to Rina Dechter. However, it was only in 2015, when Google's DeepMind was able to beat a professional human Go player with its AlphaGo system, that the term "deep learning" became very popular and the old research field got a lot of new interest.
In the early days of machine learning (ML), one focussed on creating algorithms that could mimic human brain function and achieve a form of artificial intelligence (AI). In the 1990s, computers were fast enough to make practical applications possible, and the research became more focussed on wider applications, rather than on reproducing intelligence artificially.
Rule-based intelligence (as in the computer language Lisp) became less popular as a research subject, and in general the focus shifted away from symbolic approaches – inherited from AI – and it became more and more a specific discipline within statistics. This coincided with a period where massive amounts of data became available and computers passed a critical efficiency level. Today, we continue to see practical results of this machine learning in general and deep learning in particular.
What makes machine learning different as a discipline is that humans only set the elementary rules of learning and then let the machine learn from data, or even allow the machine to experiment and find out by itself what is a good or a bad result. Usually, one distinguishes the following forms of learning.
• Supervised learning: The algorithm learns from provided results (e.g. we have data of good and bad credit customers). In other words, there is a "teacher" that knows the exact answers.
• Unsupervised learning: The algorithm groups observations according to a given criterion (e.g. the algorithm classifies customers according to profitability without being told what good or bad is).
• Reinforced learning: The algorithm learns from outcomes: rather than being told what is good or bad, the system gets something like a cost function (e.g. the result of a treatment, the result of a chess game, or the relative return of a portfolio of investments in a competitive stock market). Another way of defining reinforced learning is that in this case the environment, rather than the teacher, provides the right outcomes.
If feedback is available, either in the form of correct outcomes (teacher) or deduced from the environment, then the task is to fit a function based on example input and output. We distinguish two forms of inductive learning: classification (where the output to be learned is a discrete class label) and regression (where the output is a continuous value).
These problems are equivalent to the problem of trying to approximate an unknown function f(). Since we do not know f(), we use a hypothesis h() that will approximate f(). A first and powerful class of approximations are linear functions and hence linear regressions.1
With a regression, we try to fit a curve (a line in the linear case) as closely as possible through all data points. However, if there are more observations than parameters, then the line will in general not go through all observations. In other words, given the hypothesis space HL of all linear functions, the problem is realizable if all observations are co-linear. Otherwise it is not realizable, and we have to try to find a "good fit."
Of course, we could choose our hypothesis space H as large as all Turing machines. However, this will still not be sufficient in case of conflicting observations. Therefore, we need a method that finds a hypothesis that is optimal in a certain sense. The OLS (ordinary least squares) method for linear regression is such an example. The line fits all observations optimally for the L2-norm, in the sense that the sum of the squared L2 distances between each point and the fitted line is minimal.
1 Indeed, we argue here that linear regressions are a form of machine learning.
A decision tree is another example of inductive machine learning. It is one of the most intuitive methods, and in fact many heuristics, from bird determination to medical treatment, are often summarized in decision trees when humans have to learn them. The way of thinking that follows from a decision tree comes naturally: if I'm hungry, then I check if I have cash on me; if so, then I buy a sandwich.
$$\hat{y} = \hat{f}(x) = \sum_{n=1}^{N} \alpha_n\, I\{x \in R_n\}$$
Figure 23.1: An example of a decision tree on fake data, represented in two ways: on the left the decision tree and on the right the regions Ri that can be identified in the (x1, x2)-plane.
From Figure 23.1, it will be clear that only a certain type of partitioning can be obtained.
For example, overlapping rectangles will never be possible with a decision tree, nor will
round shapes be obtained.
Unfortunately, finding the best split of the tree for the minimum sum of squares criterion is in general not practical. One way to solve this problem is working with a "greedy algorithm." Start with all data, consider a splitting variable j (in the previous example we had only two variables x1 and x2, so j would be 1 or 2) and a splitting value $x_j^s$, so that we can define the first region R1 as

$$R_1(j, x_j^s) := \{x \mid x_j < x_j^s\}$$
The best splitting variable is then the one that minimizes the sum of squares between estimates and observations:

$$\min_{j,\,x_j^s} \left[ \min_{\hat{y}_1} \sum_{x_i \in R_1(j,x_j^s)} (y_i - \hat{y}_1)^2 + \min_{\hat{y}_2} \sum_{x_i \in R_2(j,x_j^s)} (y_i - \hat{y}_2)^2 \right]$$

For any pair (j, $x_j^s$), we can solve the inner minimizations with the previously discussed average as estimator:

$$\hat{y}_1 = \operatorname{avg}\left(y_i \mid x_i \in R_1(j, x_j^s)\right)$$
The idea is to minimize the "cost of complexity function" for a given pruning parameter α. The cost function is defined as

$$C_\alpha(T) := \sum_{n=1}^{|E_T|} SE_n(T) + \alpha\,|T| \qquad (23.2)$$

This is the sum of squares in each end-node plus α times the size of the tree. |T| is the number of terminal nodes in the sub-tree T (T is a subtree of T0 if T has only nodes of T0), |E_T| is the number of end-nodes in the tree T, and SE_n(T) is the sum of squares in the end-node n for the tree T. The squared error in node n (or in region R_n) also equals:

$$SE_n(T) = N_n\, MSE_n(T) = N_n \frac{1}{N_n} \sum_{x_i \in R_n} (y_i - \hat{y}_n)^2 = \sum_{x_i \in R_n} (y_i - \hat{y}_n)^2$$
The class c that has the highest proportion $\hat{p}_{n,c}$, i.e. $c = \operatorname{argmax}_c(\hat{p}_{n,c})$, is the value that we will assign in that node. The node impurity can then be calculated by one of the following:

$$\text{Gini index} = \sum_{c \neq \tilde{c}} \hat{p}_{n,c}\,\hat{p}_{n,\tilde{c}} \qquad (23.4)$$
$$\phantom{\text{Gini index}} = \sum_{c=1}^{C} \hat{p}_{n,c}\,(1 - \hat{p}_{n,c}) \qquad (23.5)$$
$$\text{Cross-entropy or deviance} = -\sum_{c=1}^{C} \hat{p}_{n,c} \log_2(\hat{p}_{n,c}) \qquad (23.6)$$
$$\text{Misclassification error} = \frac{1}{N_n} \sum_{x_i \in R_n} I\{y_i \neq c\} \qquad (23.7)$$
$$\phantom{\text{Misclassification error}} = 1 - \hat{p}_{n,c} \qquad (23.8)$$

with C the total number of classes.
2 For more details about ordinal and nominal scales we refer to Chapter B on page 829.
Figure 23.2: Three alternatives for the impurity measure in the case of classification problems.
With R we can plot the different measures of impurity. Note that the entropy-based measure has a maximum of 1, while the others are limited to 0.5, so for the representation it has been divided by 2 for the plot. The plot produced by the following code is in Figure 23.2.
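A sketch of how such a plot can be produced in base R (for a two-class problem with probability p for the first class; the book's exact plotting code is not reproduced here):

p       <- seq(0.001, 0.999, length.out = 200)
gini    <- 2 * p * (1 - p)                             # Gini index
entropy <- -(p * log2(p) + (1 - p) * log2(1 - p)) / 2  # deviance, divided by 2
miscl   <- pmin(p, 1 - p)                              # misclassification error
plot(p, entropy, type = "l", lty = 1, ylim = c(0, 0.5),
     xlab = "p", ylab = "Impurity measure")
lines(p, gini,  lty = 2)
lines(p, miscl, lty = 3)
legend("topright", lty = 1:3,
       legend = c("Deviance or cross-entropy (/2)", "Gini index",
                  "Misclassification index"))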
The behaviour of the misclassification error around p = 0.5 is not differentiable and might
lead to abrupt changes in the tree for small differences in data (maybe just one point that is
forgotten). Because of that property of differentiability, cross-entropy and the Gini index are better suited for numerical optimisation. It also appears that both the Gini index and the deviance are more sensitive to changes in the probabilities in the nodes. For example, it is possible to find cases where the misclassification error would not prefer a more pure node.
Note also that the Gini index can be interpreted in some other useful ways. For example, if we code the values 1 for observations of class c̃, then the variance over the node of this 0–1 response variable is $\hat{p}_{n,\tilde{c}}(1 - \hat{p}_{n,\tilde{c}})$, and summing over all classes $\tilde{c}$ we find the Gini index.
Also, if one would not assign all observations in the node to class c, but rather assign them to class c with a probability equal to $\hat{p}_{n,c}$, then the Gini coefficient is the training error rate in node n: $\sum_{c \neq \tilde{c}} \hat{p}_{n,c}\,\hat{p}_{n,\tilde{c}}$.
For pruning, all three measures can be trusted. Usually, the misclassification error is not used because of the aforementioned mathematical properties. The procedure described in this chapter is known as the Classification and Regression Tree (CART) method.
The information content of a set whose elements belong to classes with proportions $p_1, \ldots, p_C$ is $I(p_1, \ldots, p_C) = -\sum_{c=1}^{C} p_c \log_2 p_c$, which in the case of two possible outcomes (G the number of "good" observations and B the number of "bad" observations) reduces to

$$I\left(\frac{G}{G+B}, \frac{B}{G+B}\right) = -\frac{G}{G+B}\log_2\frac{G}{G+B} - \frac{B}{G+B}\log_2\frac{B}{G+B}$$

The amount of information provided by each attribute can be found by estimating the amount of information that still has to be gathered after the attribute is applied. Assume an attribute A (for example, the age of the potential creditor), and assume that we use 4 bins for the attribute age. In that case, the attribute age will separate our set of observations into four sub-sets: S1, S2, S3, S4. Each subset Si then has Gi good observations and Bi bad observations. This means that if the attribute A is applied, we still need $I\left(\frac{G_i}{G_i+B_i}, \frac{B_i}{G_i+B_i}\right)$ bits of information to classify all observations correctly.
A random sample from our dataset will be in the i-th age group with a probability $P_i := \frac{B_i + G_i}{G + B}$, since $B_i + G_i$ equals the number of observations in the i-th age group. Hence, the remaining information after applying the attribute age is:

$$\mathrm{remainder}(A) = \sum_{i=1}^{4} P_i\; I\left(\frac{G_i}{G_i+B_i}, \frac{B_i}{G_i+B_i}\right)$$
The information gain of an attribute A is the information content of the whole set minus this remainder: $\mathrm{gain}(A) = I\left(\frac{G}{G+B}, \frac{B}{G+B}\right) - \mathrm{remainder}(A)$. So, it is sufficient to calculate the information gain for each attribute and split the population according to the attribute that has the highest information gain.
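The calculation is easy to sketch in R (the counts below are hypothetical):

# Information content of a set with G "goods" and B "bads":
info <- function(G, B) {
  p <- c(G, B) / (G + B)
  p <- p[p > 0]                       # avoid log2(0)
  -sum(p * log2(p))
}
# Hypothetical example: 4 age bins with these goods/bads per bin:
G <- c(100, 200, 250, 150); B <- c(150, 100, 40, 10)
remainder <- sum((G + B) / (sum(G) + sum(B)) * mapply(info, G, B))
gain <- info(sum(G), sum(B)) - remainder
gain                                   # information gain of the attribute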
1. Loss matrix: In reality, a misclassification in one or the other group does not come at the same cost. If we identify the tumour wrongly as cancer, the health insurance will lose some money for an unnecessary treatment, but if we misclassify the tumour as harmless, then the patient will die. That is an order of magnitude worse. A bank might wrongly reject a good customer and fail to earn an interest income of 1 000 on a loan of 10 000. However, if the bank accepts the wrong customer, then the bank can lose 10 000 or more in recovery costs. This can be mitigated with a loss matrix. Define a C × C loss matrix L, with $L_{kl}$ the loss incurred by misclassifying a class k as a class l. As a reference, one usually takes a correct classification as a zero cost: $L_{kk} = 0$. Now, it is sufficient to modify the Gini index as $\sum_{k \neq l}^{C} L_{kl}\,\hat{p}_{nk}\,\hat{p}_{nl}$. This works fine when C > 2 but, unfortunately, due to the symmetry of a two-class problem this has no effect, since the coefficient of $\hat{p}_{nk}\hat{p}_{nl}$ is $(L_{kl} + L_{lk})$. The workaround is weighting the observations of class k by $L_{kl}$.
2. Missing values: Missing values are a problem for all statistical methods. The classical approach of leaving such observations out, or trying to fill them in via some model, can lead to serious issues if the fact that data is missing has a specific reason. For example, males might not fill in the "gender" information if they know that it will increase the price of their car insurance. In this particular case, "no gender information" can indicate a worse insurance risk than "male", if only the males that already had issues learned this trick. Decision trees allow for another method: assign a specific value "missing" to that predictor value. Alternatively, one can use surrogate splits: first work only with the data that has no missing fields and then try to find alternative variables that provide the same split.
3. Linear combination splits: In our approach, each node has a simple decision rule of the form $x_j \leq x_j^s$. One can also consider decisions of the form $\sum_i \alpha_i x_i \leq x^s$. Depending on the particular case, this might considerably improve the predictability of the model; however, it will be more difficult to understand what happens. In the ideal case, these nodes would lead to some clear attributes such as "risk seeking behaviour" (where the model might create this concept out of a combination of "male," "age group 1," and "marital status unmarried") or affordability to pay ("marital status married," "has a job longer than two years"). In general, this will hardly happen and it becomes unclear what exactly is going on and why the model is refusing the loan. This in its turn makes it more difficult to assess if a model is over-fit or not.
4. Link with ANOVA: An alternative way to understand the ideal stopping point is using the ANOVA approach. The impurity in a node can be thought of as the MSE in that node:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

with $y_i$ the value of the i-th observation and $\bar{y}$ the average of all observations.
This node impurity can also be thought of as in an ANOVA analysis:

$$\frac{SS_{between}/(B-1)}{SS_{within}/(n-B)} \sim F_{n-B,\,B-1}$$

with

$$SS_{between} = \sum_{b=1}^{B} n_b\,(\bar{y}_b - \bar{y})^2$$
$$SS_{within} = \sum_{b=1}^{B} \sum_{i=1}^{n_b} (y_{bi} - \bar{y}_b)^2$$

with B the number of branches, $n_b$ the number of observations in branch b, and $y_{bi}$ the value of observation i in branch b.
Now, optimal stopping can be determined by using measures of fit and relevance as in a linear regression model. For example, one can rely on R2, MAD, etc.
5. Other tree building procedures: The method described so far is known as classification
and regression tree (CART). Other popular choices are ID3 (and its successors C4.5, C5.0)
and MARS.
23.1.2.2 Selected Issues
When working with decision trees, it is essential to be aware of the following issues and to have some plan to mitigate or minimize them.
1. Over-fitting: This is one of the most important issues with decision trees. A decision tree should never be used without appropriate validation methods, such as cross validation or a random forest approach, before an effort to prune the tree. See e.g. Hastie et al. (2009).
3. Instability: Small changes in the data can lead to dramatically different tree structures. This is because even a small change at the top of the tree will be cascaded down the tree. This works very differently in, for example, linear regression models, where one additional data point (unless it is an outlier) will only have a small influence on the parameters of the model. Methods such as random forests somewhat mitigate this instability, but bagging of data will also improve the stability.
4. Difficulties to capture additive relationships: A decision tree naturally fits decisions that are not additive. For example, if a person has the affordability to pay and he is honest, then he will pay the loan back. This would work fine in a decision tree. If, however, the fact
of the customer paying back the loan depends on many other factors that all have to be in place and can mitigate each other, then an additive relationship might work better.3
• subset: optional expression that indicates which section of the data should be
used.
As usual, more information is in the documentation of the function and the package.
3 An additive logic to predict payment capacity would be, for example: having a stable job, having a diploma, having a spouse that can step in, having savings, etc. Indeed, all those things can be added, and if the spouse is gone, the diploma will help to find the next job: these variables compensate. This is not the case for variables such as over-indebtedness, fraud (no intention to pay), an unstable job, etc.; in this case there is no compensation: one of those things going wrong will result in a customer that does not pay back his loan.
2. Fit the tree with rpart() and optionally control the size of the tree with control = rpart.control(minsplit = 20, cp = 0.01). This example sets the minimum number of observations in a node to 20 and requires that a split decreases the overall impurity of the fit by a factor of 0.01 (cost complexity factor) before being attempted.
More information about the rpart package can be found in the documentation of the project: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.
If the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue. We could also say that tree construction does not continue unless it would decrease the overall lack of fit by a factor of cp. The complexity parameter as implemented in the package rpart is not the same as in Equation 23.2, but follows from a similar build-up: $C_{cp}(T) := C(T) + \mathrm{cp}\,|T|\,C(T_0)$.
Figure 23.3: The plot of the complexity parameter (cp) via the function plotcp().
In this section we will develop an example based on the sinking of the RMS Titanic in 1912. The death toll of the disaster of the RMS Titanic was very high: only 500 of the 1309 passengers survived. This means that the "prior probability" of surviving is about 500/1309 ≈ 0.3819. So, this particular dataset has a roughly comparable amount of good and bad results. In this case, one might omit the following parameter:
The code below imports the data of the passengers on board of the RMS Titanic, fits a decision tree, prunes the tree, and visualises the results. Three plots are produced:
1. A visualisation of the complexity parameter and its impact on the errors in Figure 23.3.
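The import of the data is not repeated here; the fitting and pruning steps can be sketched as follows (assuming a data frame titanic with the 1309 passengers and at least the columns survived, pclass, sex, age, sibsp, and embarked; the object name t1 for the pruned tree is chosen here):

library(rpart)
t0 <- rpart(survived ~ pclass + sex + age + sibsp + embarked,
            data = titanic, method = "class")
plotcp(t0)                   # cf. Figure 23.3
t1 <- prune(t0, cp = 0.01)   # pruned tree, cf. Figure 23.5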
Figure 23.4: The decision tree, fitted by rpart. This figure helps to visualize what happens in the decision tree that predicts survival in the Titanic disaster.
Figure 23.5: The same tree as in Figure 23.4 but now pruned with a complexity parameter ρ = 0.01. Note that the tree is considerably simpler.
plot(t0)
text(t0)
When plotting the tree, we used the standard method that is supplied with the library rpart: the functions plot.rpart and text.rpart. The package rpart.plot replaces this functionality and adds useful information and visually pleasing effects. There are options specific for classification trees and others for regression trees.
The following code loads the library rpart.plot and uses its function prp() to produce the plot in Figure 23.6 on page 419.
Figure 23.6: The decision tree represented by the function prp() from the package rpart.plot. This plot not only looks more elegant, but it is also more informative and less simplified. For example, the top node "sex" now has two clear options in whose descriptions we can recognize the words male and female, and the words are on the branches, so there is no confusion possible about which is left and which is right.
The function prp() takes many more arguments and allows the user to write functions to
obtain exactly the desired result. More information is in the function documentation that is avail-
able in R itself by typing ?prp .
Figure 23.7: The plot of the complexity parameter (cp) via the function plotcp()
Figure 23.9: The same tree as in Figure 23.8 but now pruned with a complexity parameter ρ of 0.1. The regression tree is – in this example – too simple.
# Example of a regression tree with rpart on the dataset mtcars
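# (The call below is a sketch, reconstructed to be consistent with the
#  Call echoed in the summary() output further down.)
library(rpart)
t <- rpart(mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear,
           data = mtcars, method = "anova",
           control = rpart.control(minsplit = 10, minbucket = 20/3,
                                   cp = 0.01))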
plotcp(t)
summary(t)
## Call:
## rpart(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + am +
## gear, data = mtcars, na.action = na.rpart, method = "anova",
## control = rpart.control(minsplit = 10, minbucket = 20/3,
## cp = 0.01, maxcompete = 4, maxsurrogate = 5, usesurrogate = 2,
## xval = 7, surrogatestyle = 0, maxdepth = 30))
## n= 32
##
## CP nsplit rel error xerror xstd
## 1 0.65266121 0 1.0000000 1.0714029 0.2592763
## 2 0.19470235 1 0.3473388 0.7060794 0.1736361
## 3 0.03532965 2 0.1526364 0.4249901 0.1067328
## 4 0.01471297 3 0.1173068 0.4124873 0.1057872
## 5 0.01000000 4 0.1025938 0.4219970 0.1050699
##
## Variable importance
## wt disp hp drat cyl qsec
## 25 24 20 15 10 5
##
## Node number 1: 32 observations, complexity param=0.6526612
## mean=20.09062, MSE=35.18897
## left son=2 (26 obs) right son=3 (6 obs)
## Primary splits:
## wt < 2.26 to the right, improve=0.6526612, (0 missing)
## cyl < 5 to the right, improve=0.6431252, (0 missing)
##
## Node number 9: 7 observations
## mean=16.78571, MSE=2.369796
##
## Node number 10: 6 observations
## mean=19.75, MSE=2.1125
##
## Node number 11: 6 observations
## mean=22.1, MSE=2.146667
# plot(t) ; text(t) # This would produce the standard plot from rpart.
# Instead we use:
prp(t, type = 5, extra = 1, box.palette = "Blues", digits = 4,
shadow.col = 'darkgray', branch = 0.5)
A first and very simple approach to assess performance is to calculate the confusion matrix. This confusion matrix shows the correct classifications and the misclassifications.
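The predictions themselves can be obtained from the tree fitted earlier; a minimal sketch (predic holds the predicted class for each passenger):

predic <- predict(t0, type = "class")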
# The confusion matrix:
confusion_matrix <- table(predic, titanic$survived)
rownames(confusion_matrix) <- c("predicted_death",
"predicted_survival")
colnames(confusion_matrix) <- c("observed_death",
"observed_survival")
confusion_matrix
##
## predic observed_death observed_survival
## predicted_death 706 150
## predicted_survival 103 350
# As a percentage:
confusion_matrixPerc <- sweep(confusion_matrix, 2,
                              margin.table(confusion_matrix, 2), "/")
Above we show the code to produce a confusion matrix. It is also possible to use the function confusionMatrix() from the package caret. It will not only show the confusion matrix but also some useful statistics.
The package caret provides great visualization tools, as well as tools to split data, pre-process data, select features, tune models using resampling, and estimate variable importance.
Here is a great introduction: https://cran.r-project.org/web/packages/caret/vignettes/caret.html
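A minimal sketch of its use on the predictions above (assuming both vectors are coded as factors with the same levels):

library(caret)
confusionMatrix(factor(predic), factor(titanic$survived))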
The ROC curve can be obtained via the package ROCR as we have seen in Section 22.2.2 “ROC”
on page 393. To do this, we load the library, create predictions and show the ROC in Figure 23.10
on page 426 with the following code.
library(ROCR)
pred <- prediction(predict(t0, type = "prob")[,2],
titanic$survived)
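# A sketch of the remaining lines that draw the ROC curve of Figure 23.10
# (the exact plotting options are assumptions):
perf <- performance(pred, "tpr", "fpr")
plot(perf, lwd = 2, col = 'blue')
abline(a = 0, b = 1, lty = 2)   # random model as reference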
The accuracy of the model as a function of the cutoff value can be plotted via the acc attribute of the performance object. This is done by the following code, and the plot is in Figure 23.11.
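A sketch of such code (the plotting options are assumptions):

perf_acc <- performance(pred, "acc")
plot(perf_acc, lwd = 2, col = 'navy')   # cf. Figure 23.11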
Figure 23.11: The accuracy for the decision tree on the Titanic data.
# AUC:
AUC <- attr(performance(pred, "auc"), "y.values")[[1]]
AUC
## [1] 0.816288
# GINI:
2 * AUC - 1
## [1] 0.632576
# KS:
perf <- performance(pred, "tpr", "fpr")
max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
## [1] 0.5726823
Decision trees are a popular method for various machine learning tasks. Tree learning is invariant under scaling and various other transformations of feature values, is robust to the inclusion of irrelevant features, and produces models that are transparent and can be understood by humans. However, they are seldom robust. In particular, trees that are grown very deep tend to learn coincidental patterns: they "overfit" their training sets, i.e. have low bias, but very high variance.
Maybe the situation would be better if we had the option to fit multiple decision trees and somehow average the result? It appears that this is not a bad approach. Methods that fit many models and let each model "vote" are called ensemble methods. The random forest method, for example, does this.
Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance (by averaging the trees). This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance and stability of the final model.
Random forests use a combination of techniques to counteract over-fitting. First, many random samples of the data are selected, and second, variables are put in and out of the formula.
With this process, we obtain a measure of importance of each attribute (input variable), which in its turn can be used to select a model. This can be particularly useful when forward/backward stepwise selection is not appropriate and when working with an extremely high number of candidate variables that need to be reduced.
In R, we can rely on the library randomForest to fit a random forest model. First we install it and then load it:
library(randomForest)
The process of generating a random forest needs two steps that require a source of randomness, so it is a good idea to set the random seed in R before you begin. Doing so makes your results reproducible the next time you run the code. This is done with the function set.seed(n).
The code below demonstrates how the random forest can be fitted. The package randomForest is already loaded in the aforementioned code. So, all we have to do is fit the random forest with the function randomForest(), and the whole methodology will be executed. The rest of the code is about exploring the model and understanding it. After fitting the model, we show:
• how to get information about the model with the functions print(), plot() (result in Figure 23.12 on page 429), summary(), and getTree();
• how to study the importance of each independent variable via the function importance();
• how to plot a summary of this importance object via the function plot() (the result is in Figure 23.13 on page 429);
• and finally, how to automate the plotting of the partial dependence on each variable with the function partialPlot() – those plots are in Figure 23.14 on page 430, Figure 23.15 on page 430, and Figure 23.16 on page 433.
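The fitting step itself can be sketched as follows (the seed value and the exact formula are assumptions; the exploration calls follow further down):

set.seed(1865)                                                 # assumed seed
frm <- mpg ~ cyl + disp + hp + drat + wt + qsec + am + gear    # assumed formula
forestCars <- randomForest(frm, data = mtcars)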
Figure 23.12: The plot of a randomForest object shows how the model improves as a function of the number of trees used.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1
## am gear carb
## Mazda RX4 1 4 4
## Mazda RX4 Wag 1 4 4
## Datsun 710 1 4 1
## Hornet 4 Drive 0 3 1
## Hornet Sportabout 0 3 2
## Valiant 0 3 1
# Show an overview:
print(forestCars)
##
## Call:
## randomForest(formula = frm, data = mtcars)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 6.001878
## % Var explained: 82.94
The random forest method seems to indicate that in this case one might make a tree fitting on the number of cylinders, displacement, horse power, and weight.5 However, appearances can be deceiving. There is no guarantee that a random forest does not over-fit itself. While the technique of random forests seems to be a good tool against over-fitting, it is no guarantee against it.
The random forest technique (randomly leaving out variables and data) can also be used
with other underlying models. Though, most often, it is used with decision trees, and that
is also how the function randomForest() implements it.
5 Of course, in reality one has to be very careful: something can happen that disturbs the dependence between the parameters. For example, once electric or hybrid cars are introduced, this will look quite different.
The weight $w_{ij,kl}$ is the strength of the connection between neuron ij and neuron kl (two indices, since in our example we work from a two-dimensional representation). When the network is being trained, the weights are modified till an optimal fit is achieved. In turn, there is a choice to be made for g(α). Typically, one will take something like the logit, logistic, or hyperbolic tangent function. For example:

$$g(\alpha) = \operatorname{logistic}(\alpha) = \operatorname{logit}^{-1}(\alpha) = \frac{1}{1 + e^{-\alpha}}$$

or

$$g(\alpha) = \Phi\!\left(\sqrt{\frac{\pi}{8}}\,\alpha\right)$$

(the scaling of α is not necessary, but if it is applied, the derivative in 0 will be the same as that of the logistic function), or

$$g(\alpha) = \tanh(\alpha)$$

The parameter θ (threshold) is a number that should be chosen carefully.
6 Please note that a one-dimensional, one-layer neural net with a binary output parameter is similar to a logistic
regression.
While this approach already has some applications, neural nets become more useful and better models if we allow them to learn by themselves and allow them to create internal, hidden layers of neurons.
For example, in image recognition, it is possible to make a NN learn to identify images that contain gorillas by analysing example images that have been labelled as "has gorilla" or "has no gorilla." Training the NN on a set of sample images will make it capable of recognizing gorillas in new pictures. They do this without any a priori knowledge about gorillas (it is, for example, not necessary for the neural net to know that gorillas have four limbs, are usually black or grey, have two brown eyes, etc.). The NN will instead create in its hidden layers some abstract concept of a gorilla. The downside of neural networks is that it is very hard, or even impossible, for humans to understand what that concept is or how it can be interpreted.7
An ANN is best understood as a set of connected nodes (called artificial neurons, a simplified version of the biological neurons in an animal brain). Each connection (a simplified version of a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signals can process them and then signal the other artificial neurons connected to it.
Usually, the signal at a connection between artificial neurons is a real number, and the output
of each artificial neuron is calculated by a non-linear function of the sum of its inputs as explained
above. Artificial neurons and connections typically have a weight that adjusts as learning pro-
ceeds. The weight increases or decreases the strength of the signal at a connection. Artificial
neurons may have a threshold such that only if the aggregate signal crosses that threshold is the
signal sent. Typically, artificial neurons are organized in layers. Different layers may perform dif-
ferent kinds of transformations on their inputs. Signals travel from the first (input), to the last
(output) layer, possibly after traversing the layers multiple times.
So, in order to make an ANN learn, it is important to be able to calculate the derivative so that weights can be adjusted in each iteration.
Another way to see ANNs is as an extension of the logistic regression. Actually, the neurons inside the neural network have two possible states: active or inactive (hence 0 or 1). Every neuron in every layer of a neural network can be considered as a logistic regression. Figure 23.17 on page 436 shows the scheme for the logistic regression that symbolises this logic. A logistic regression is the same as a neural network with one neuron (and 0 hidden layers).
The interpretation of the neurons in internal layers is quite abstract. In fact, they do not necessarily correspond to a real-world aspect (such as age, colour, race, sex, weight, etc.). This means that it is quite hard to understand how the decision process of a neural network works. One can easily observe the weights of the neurons in layer zero, but that does not mean that these are representative for the way a network makes a decision (except in the case where there is only one layer with one neuron) … hence in the case where there is equivalence with a logistic regression.
7 This is why, for example, banks are very slow to adopt neural networks in credit analysis. They will rather rely on less powerful linear regression models. Regressions are extremely transparent: it is easy to explain to regulators or in court why a certain customer was denied a loan, and it is always possible to demonstrate that there has been no illegal discrimination (e.g. refusing the loan because a person belongs to a certain racial minority group). With neural networks this is not possible; it might be that the neural net has in its hidden layers something like a racial bias and that the machine derives that racial background via other parameters.
Figure 23.17: A logistic regression is actually a neural network with one neuron. Each variable contributes to a sigmoid function in one node, and if that one node gets loadings over a critical threshold, then we predict 1, otherwise 0. The intercept is the "1" in a circle. The numbers on the arrows are the loadings for each variable.
There is a domain of research that tries to extract knowledge from neural networks. This is
called “rule extraction,” whereby the logic of the neural net is approximated by a decision tree.
The process is more or less as follows:
• one will first try to understand the logic of the hidden variables with clustering analysis;
See for example Setiono et al. (2008), Hara and Hayashi (2012), Jacobsson (2005), Setiono
(1997), or the IEEE survey in Tickle et al. (1998).
As one could expect, there are several packages in R that help to fit neural networks and train
them properly. First, we will look at a naive example and then address the issue of over-fitting.
The function neuralnet() accepts many parameters; we refer to the documentation for more details. The most important are:
• linear.output: set to TRUE to fit a regression network, set to FALSE to fit a clas-
sification network,
• algorithm: a string that selects the algorithm for the NN – select from: backprop
for back-propagation, rprop+ and rprop- for resilient back-propagation, and sag
and slr to use the modified globally convergent algorithm (grprop).
The function neuralnet() is very universal and allows for a lot of customisation. However,
leaving most options to their default will in many cases lead to good results. Consider the dataset
mtcars to illustrate how this works.
#install.packages("neuralnet")  # Do only once.
library(neuralnet)
# Fit the ANN with 2 hidden layers that have resp. 3 and 2 neurons:
# (neuralnet does not accept a formula with a dot as in 'y ~ .')
nn1 <- neuralnet(mpg ~ wt + qsec + am + hp + disp + cyl + drat +
                       gear + carb,
                 data = mtcars, hidden = c(3,2),
                 linear.output = TRUE)
An interesting parameter for the function neuralnet is the parameter "hidden." This is a vector with the number of neurons in each layer; the number of layers corresponds to the length of the vector provided. For example, c(10,8,5) implies three hidden layers (the first has ten neurons, the second eight, and the last five).
In fact, it seems that neural networks are naturally good at approximating other functions, and already with one hidden layer an ANN is able to approximate any continuous function. For discontinuous functions, or concepts that cannot be expressed as functions, more layers may be required.
It appears that one best chooses a number of neurons that is not too small but does not exceed the number of input nodes.
Another important parameter is linear.output: set this to TRUE in order to estimate a continuous value (regression) and to FALSE in order to solve a classification problem.
As expected, the function plot() applied to an object of the class "nn" will produce a visual that makes sense for that class. This is illustrated in the following code, and the plot is in Figure 23.18.
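A minimal sketch of that code:

plot(nn1, rep = 'best')   # cf. Figure 23.18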
Figure 23.18: A simple neural net fitted to the dataset of mtcars, predicting the miles per gallon (mpg). In this example we predict the fuel consumption of a car based on some other values in the dataset mtcars.
If the parameter rep is set to 'best', then the repetition with the smallest error will be plotted. If not stated, all repetitions will be plotted, each in a separate window.8
The black lines in the plot of the ANN represent the connections between each neuron of the previous and the following layer, and the number is the weight of that connection. The blue lines show the bias term added in each step. The bias can be thought of as the intercept of a linear model.
8 While in interactive mode this might be confusing, in batch mode this will create problems. For example, when using an automated workflow with knitr and LaTeX, this will cause the plot to fail to render in the document, but not result in any error message.
With this data, we will now prepare the neural network step by step. If there were missing data, we would first have to address this issue, but the data is complete. The next step is to split the data into a training and a testing subset.
When there are missing values, we might want to remove them first. If you decide to do so, then first create a dataset that holds only the columns that you need and then use
d1 <- d[complete.cases(d),]
That way, we assure that we only delete data that we will not use, and we avoid deleting a row where an observation is missing in a variable that we will not use anyhow. More information on treating missing values is in Chapter 18 "Dealing with Missing Data" on page 333.
Neural networks have the tendency to over-fit. So, it is essential to apply some form of cross vali-
dation. More information about this subject is in Chapter 25.4 “Cross-Validation” on page 483. In
our example, we will only use simple cross validation.
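The examples below use the Boston housing data; a minimal sketch of loading it (the dataset ships with the package MASS, as the cross-validation code further below also assumes):

library(MASS)
d <- Boston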
set.seed(1877) # set the seed for the random generator
idx.train <- sample(1:nrow(d),round(0.75*nrow(d)))
d.train <- d[idx.train,]
d.test <- d[-idx.train,]
To be effective, a NN needs a certain complexity, but only a few layers already lead to so many parameters that it is almost impossible to understand what the NN does and why it does what it does. This makes them effectively black boxes. So, in order to use a neural network, we must be certain that the performance gain is worth the complexity and the loss of transparency. Therefore, we will use a challenger model.
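A sketch of such a challenger model, fitted on the training data:

lm.fit <- glm(medv ~ ., data = d.train)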
# Make predictions:
pr.lm <- predict(lm.fit,d.test)
The sample(x,size) function outputs a vector of the specified size of randomly selected
samples from the vector x. By default the sampling is without replacement, and the variable
idx.train is a vector of indices.
We now have a linear model that can be used as a challenger model.
Step 4: Rescale the Data and Split into Training and Testing Set
Now, we would like to start fitting the NN. However, before fitting a neural network, it is useful to normalize the data to the interval [0, 1] or [−1, 1]. This helps the optimisation procedure to converge faster and makes the results easier to understand. There are many methods available to normalize data (z-normalization, min-max scale, logistic scale, etc.). In this example, we will use the min-max method and scale the data to the interval [0, 1].
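A sketch of this scaling step (the object names train.scaled and test.scaled are chosen here; idx.train comes from the split above):

maxs <- apply(d, 2, max)
mins <- apply(d, 2, min)
d.scaled <- as.data.frame(scale(d, center = mins, scale = maxs - mins))
train.scaled <- d.scaled[idx.train, ]
test.scaled  <- d.scaled[-idx.train, ]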
Note that scale() returns a matrix and not a data frame, so we use the function as.data.frame() to coerce the results into a data frame.
library(neuralnet)
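A sketch of the fit (the formula construction mirrors the one used in the cross-validation code further below; the architecture of three hidden layers with 7, 5, and 5 neurons follows Figure 23.19):

nm  <- names(d)
frm <- as.formula(paste("medv ~", paste(nm[!nm %in% "medv"], collapse = " + ")))
nn  <- neuralnet(frm, data = train.scaled, hidden = c(7, 5, 5),
                 linear.output = TRUE)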
Visualising the NN is made simple by R’s S3 OO model (more in Chapter 6 “The Implementa-
tion of OO” on page 87): we can use the standard function plot(). The output is in Figure 23.19
on page 442.
Now, we can predict the values for the test dataset based on this model and then calculate
the MSE. Since the ANN was trained on scaled data, we need to scale it back in order to make a
meaningful comparison.
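A sketch of these steps (using the objects defined in the sketches above; pr.nn2 holds the back-scaled NN predictions used in the plots below):

pr.nn  <- compute(nn, test.scaled[, nm != "medv"])
pr.nn2 <- pr.nn$net.result * (maxs["medv"] - mins["medv"]) + mins["medv"]
MSE.nn <- sum((d.test$medv - pr.nn2)^2) / nrow(d.test)
MSE.lm <- sum((d.test$medv - pr.lm)^2)  / nrow(d.test)
c(MSE.nn = MSE.nn, MSE.lm = MSE.lm)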
Figure 23.19: A visualisation of the ANN. Note that we left out the weights, because there would be too many. With 13 variables, and three layers of respectively 7, 5, and 5 neurons, we have 13 × 7 + 7 × 5 + 5 × 5 + 5 + 7 + 5 + 5 + 1 = 174 parameters.
In terms of RMSE, the NN is doing a better job than the linear model at predicting medv. However, this result will depend on
1. our choices for the number of hidden layers and the numbers of neurons in each layer,
2. the selection of the training dataset.
The following code produces a plot to visualize the performance of the two models. We plot each observed house value on the x-axis and the predicted value on the y-axis. On the same plot we trace the unit line (y = x), which is the reference on which we would want the points to lie. The plot appears in Figure 23.20 on page 443.
par(mfrow=c(1,2))
plot(d.test$medv,pr.lm,col='blue',
main='Observed vs predicted lm',
pch=18, cex=0.7)
abline(0,1,lwd=2)
legend('bottomright', legend='LM', pch=18,col='blue', bty='n',
cex=.95)
We see that the predictions of the NN are more concentrated around the unit line than those of the linear model. However, again we stress the fact that this picture can be very different for other choices of hidden layers. Since in our dataset there are not too many observations, it is possible to plot both in one graph. The code below does this and puts the result in Figure 23.21 on page 443.
Figure 23.20: A visualisation of the performance of the ANN (left) compared to the linear regression
model (right).
Figure 23.21: A visualisation of the performance of the ANN compared to the linear regression model
with both models in one plot.
plot (d.test$medv,pr.nn2,col='red',
main='Observed vs predicted NN',
pch=18,cex=0.7)
points(d.test$medv,pr.lm,col='blue',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend=c('NN','LM'),pch=18,
col=c('red','blue'))
library(boot)
set.seed(1875)
lm.fit <- glm(medv ~ ., data = d)
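# A sketch of how the k-fold cross-validation error of the challenger model
# can be obtained with cv.glm() from boot (delta[1] is the raw CV estimate
# of the MSE):
cv.glm(d, lm.fit, K = 10)$delta[1]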
Now, we will fit the ANN. Note that we split the data so that we retain 90% of the data in the
training set and 10% in the test set. We will do this randomly, ten times.
It is possible to show a progress bar via plyr.
To see this progress bar, uncomment the relevant lines.
# Reminders:
d <- Boston
nm <- names(d)
frm <- as.formula(paste("medv ~", paste(nm[!nm %in% "medv"],
collapse = " + ")))
# Store the maxima and minima:
d.maxs <- apply(d, 2, max)
d.mins <- apply(d, 2, min)
# Set parameters:
set.seed(1873)
cv.error <- NULL # Initiate to append later
k <- 10 # The number of repetitions
# This code might be slow, so you can add a progress bar as follows:
#library(plyr)
#pbar <- create_progress_bar('text')
#pbar$init(k)
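# What follows is a sketch of what the cross-validation loop might look
# like (assumptions: the same min-max scaling as before and the 7-5-5
# architecture; the original formulation may differ, but the logic is the same):
d.scaled <- as.data.frame(scale(d, center = d.mins, scale = d.maxs - d.mins))
for (i in 1:k) {
  idx      <- sample(1:nrow(d.scaled), round(0.9 * nrow(d.scaled)))
  train.cv <- d.scaled[idx, ]
  test.cv  <- d.scaled[-idx, ]
  nn       <- neuralnet(frm, data = train.cv, hidden = c(7, 5, 5),
                        linear.output = TRUE)
  # predictions, rescaled back to the original units of medv:
  pr.nn  <- compute(nn, test.cv[, names(test.cv) != "medv"])$net.result
  pr.nn  <- pr.nn * (d.maxs["medv"] - d.mins["medv"]) + d.mins["medv"]
  medv.o <- test.cv$medv * (d.maxs["medv"] - d.mins["medv"]) + d.mins["medv"]
  cv.error[i] <- sum((medv.o - pr.nn)^2) / nrow(test.cv)
  # pbar$step()   # uncomment together with the plyr lines above
}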
While this is short and elegant, it is not exactly the same. This is not a k-fold validation,
since it does not ensure that each observation is used in the testing dataset just once.
Programming that is a lot longer. That is the reason why we had to use the tidyverse
here.
Running this code can take a while. Once it is finished, we calculate the average MSE and plot
the results as a boxplot in Figure 23.22 on page 446.
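A minimal sketch of that summary (the horizontal boxplot corresponds to Figure 23.22):
mean(cv.error)
boxplot(cv.error, horizontal = TRUE, xlab = "MSE",
        main = "CV error (MSE) for the ANN")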
Figure 23.22: A boxplot for the MSE of the cross validation for the ANN.
cv.error
## [1] 5.190725 11.595832 8.178917 10.288129 9.136706
## [6] 10.260873 8.326635 7.114097 20.080831 5.109996
As you can see, the average MSE for the neural network is lower than that of the linear
model, although there is a lot of variation in the MSEs of the cross validation. This may depend
on the splitting of the data or the random initialization of the weights in the ANN.
In the left plot of Figure 23.20 on page 443, you can see that there is one outlier. When that
observation was part of the test data, it also produced an outlier in the vector of MSEs of the
k-fold cross validation.
It is also important to realize that this result, in its turn, can be influenced by the choice of seed
for the random generator. By running the simulation several times with different seeds, one can
get an idea of the sensitivity of the MSE to this seed.
Support vector machines (SVMs) are a supervised learning technique that is – just as decision
trees and neural networks – capable of solving both classification and regression problems. In
practice, however, it is best suited for classification problems.
The idea behind a support vector machine (SVM) is to find the hyperplane that best separates the
data into the known classes, i.e. the hyperplane that maximises the distance between the groups.
The problem is in essence a linear set of equations to be solved, and it will fit a hyperplane,
which would be a straight line for two-dimensional data.
Obviously, if the separation is not linear, this method will not work well. The solution to this
issue is known as the “kernel trick.” We add a variable that is a suitable combination of the two
variables (for example, if one group appears to be centred at the origin of the 2D plane, then we could use
z = x² + y² as a third variable). Then we solve the SVM method as before (but with three variables
instead of two), and find a hyperplane (a flat surface) in the 3D space spanned by (x, y, z). This will allow
for a much better separation of the data in many cases.
Most parameters work very similarly to those of other models such as lm, glm, etc. For example,
data and formula do not need much explanation anymore. The variable type, however,
is an interesting one and it is quite specific for the SVM model:
1. C-classification: The standard classification machine;
2. nu-classification: Classification in which the parameter nu controls the number of support vectors;
3. one-classification: Allows one to detect outliers and can be used when only one
class is available (say only cars with four cylinders, and it allows one to detect “unusual
cars with four cylinders”);
4. eps-regression: The standard regression formulation (with an ε-insensitive loss);
5. nu-regression: The regression model that allows one to tune the number of support
vectors.
Another important parameter is kernel. This parameter allows us to select which kernel
should be used. The following options are possible:
1. Linear: t(u)*v
2. Polynomial: (gamma*t(u)*v + coef0)^degree
3. Radial basis: exp(-gamma*|u-v|^2)
4. Sigmoid: tanh(gamma*t(u)*v + coef0)
When used, the parameters gamma, coef0, and degree can be provided to the function
if one wants to override the defaults.
library(e1071)
svmCars1 <- svm(cyl ~ ., data = mtcars)
summary(svmCars1)
##
## Call:
## svm(formula = cyl ~ ., data = mtcars)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1
## epsilon: 0.1
##
##
## Number of Support Vectors: 17
The function svm has treated the number of cylinders as numerical and decided that a regression
was the best model. However, we are not really interested in fractional cylinders, and hence
we can either round the results or fit a new model and coerce svm() into a classification model.
Below we illustrate how a classification SVM model can be fitted:
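A minimal sketch of such a fit (the object names svmCars2 and x are assumptions, chosen to match the prediction step that follows):
svmCars2 <- svm(as.factor(cyl) ~ ., data = mtcars,
                type = "C-classification")
x <- subset(mtcars, select = -cyl)   # the predictors used by predict()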
# create predictions
pred <- predict(svmCars2, x)
The number of cylinders is highly predictable based on all the other data in mtcars, and the model
works almost out of the box. However, in most cases, the user will need to fine-tune some parameters. The
function svm allows for many parameters to be set, and it has multiple kernels that can be used.
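A sketch of such a parameter search with e1071's tune() function (the object name svmTune matches the print() call below; the ranges are illustrative assumptions):
svmTune <- tune(svm, cyl ~ ., data = mtcars,
                ranges = list(cost = 10^(-1:2), gamma = c(0.5, 1, 2)))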
print(svmTune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 0.5
##
## - best performance: 0.991777
After you have found the optimal parameters, you can run the model again and specify the
desired parameters and compare the performance (e.g. with the confusion matrix).
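A sketch of that last step: tune() stores the refitted model as best.model, and rounding its output to whole cylinders yields a confusion matrix.
svmBest <- svmTune$best.model
table(predicted = round(predict(svmBest, mtcars)), observed = mtcars$cyl)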
In order to evaluate which model works best, it is worth reading Chapter 25 “Model
Validation” on page 475.
There are so many possible models and machine learning algorithms that in this book we can
only scratch the surface and encourage further study. So far, most models were used to predict
the dependent variable. The starting point is usually a clear and well-defined segment of the larger
pool of possible populations. For example, when we want to predict if a customer is creditworthy,
we will make a separate model per credit type and further allocate the customers to groups, such
as customers with a current account or customers that are new to the bank.
Would it not be nice if we could ask the machine to tell us what grouping makes sense, or
which customers are alike? This type of question is answered by the branch of machine learning
where we do not tell the machine what a good or bad outcome looks like. Therefore, it is also
referred to as “unsupervised learning.”
Typical applications are
• customer segmentation: identify groups of customers with similar traits (so that different
segments can get different offers and a more targeted level of service);
• stock market clustering: group stocks so that the resulting groups have similar
behaviour under different market conditions;
• segmenting data so that for each group a specific and more specialized model can be built.
For example, in Chapter 27.7.6 “PCA (Gaia)” on page 553 we ask ggplot to tell us which
service centre locations are similar and show the groups on the plot in the space of the first two
principal components. The underlying algorithm is called “k-means.”
Clustering methods identify sets of similar objects – referred to as “clusters” – in a multivariate
data set. The most common types of clustering include
1. partitioning methods,
2. hierarchical clustering,
3. fuzzy clustering,
5. model-based clustering.
In the next sections, we will explain very briefly what those methods are, show how some of those
methods can be executed in R and how great visualizations can be obtained.
The k-means algorithm partitions the observations into k clusters so as to minimise the within-cluster sum of squares (with μ_i the centre of cluster C_i):
\[ \underset{C}{\arg\min} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
The standard algorithm starts by randomly taking k different observations as the initial centres
for the k clusters. Each observation is then assigned to the cluster whose centre is the “closest.”
The distance is usually expressed as the Euclidean distance between that observation and the
centroid of the cluster.
Then we calculate again the centre of each cluster9 and the process is repeated: each observation
is now allocated to the cluster that has the centroid closest to the observation. This step is
then repeated till there are no changes in the cluster allocations in consecutive steps.
The clustering problem is actually NP-hard, and the algorithm described (or any other
known to date) converges fast, but only to a local minimum. This means that it is
possible that a better clustering can be found with other initial conditions.
Figure 23.23: The cars in the dataset mtcars with fuel consumption plotted in function of weight
and coloured by the number of cylinders.
9 Note that this new centre does not necessarily coincide with an observed point.
library(ggplot2)
library(ggrepel) # provides geom_label_repel()
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
geom_point(size = 5) +
geom_label_repel(aes(label = rownames(mtcars)),
box.padding = 0.2,
point.padding = 0.25,
segment.color = 'grey60')
Compare the plot in Figure 23.23 with the result that we would get by adding the labels with
the standard geom_text() instead:
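A minimal sketch of that alternative (an assumption on our side, reusing the same aesthetics):
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 5) +
  geom_text(aes(label = rownames(mtcars)))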
(The resulting plot shows the same scatterplot, but with overlapping and partly unreadable labels.)
10 For more information on ggplot, see Chapter 31 “A Grammar of Graphics with ggplot2” on page 687.
It also works, but geom_label() and geom_label_repel() do a lot of heavy lifting:
putting a frame around the text, uncluttering the labels, and even adding a small line
between the box and the dot if the distance gets too big.
Plotting the cars in the (wt, mpg) plane, we notice a certain – almost linear – relation, and by
colouring the dots according to the number of cylinders, we might be able to imagine some possible
groups.
Before we can use k-means, however, we need to define a “distance” between cars. This is in
essence a process where imagination comes in handy and a deep understanding of the subject is
essential. It will also make sense to normalize the data, because otherwise any distance measure would be
dominated by the variables with the largest nominal values.
The Euclidean distance – or any other distance for that matter – assumes a certain equality
between dimensions. If we have no idea about what the most important criterion is, then
we can use data normalised to one. If, for example, we ask users to rate various aspects
of the user experience as well as their importance, then we have a proxy for that importance.
In this case, the ratings can be scaled according to importance: we can then
multiply the normalised variables with a given weight.
In the next code block, we will cluster cars using weight and fuel consumption.
# normalize weight and mpg
d <- data.frame(matrix(NA, nrow = nrow(mtcars), ncol = 1))
d <- d[,-1] # d is an empty data frame with 32 rows
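# A sketch of the remaining steps: the column names mpg_n and wt_n and the
# object carCluster are chosen to match the output below; the seed is an
# assumption.
d$mpg_n <- (mtcars$mpg - min(mtcars$mpg)) / (max(mtcars$mpg) - min(mtcars$mpg))
d$wt_n  <- (mtcars$wt  - min(mtcars$wt))  / (max(mtcars$wt)  - min(mtcars$wt))
set.seed(1890)
carCluster <- kmeans(d, centers = 3, nstart = 20)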
print(carCluster)
## K-means clustering with 3 clusters of sizes 7, 3, 22
##
## Cluster means:
## mpg_n wt_n
## 1 0.54951538 0.07814475
## 2 0.04228122 0.70550639
## 3 0.23518370 0.33595636
##
## Clustering vector:
## [1] 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 3 3 3 3 3 1 1
## [28] 1 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 0.09687975 0.01124221 0.29960626
## (between_SS / total_SS = 79.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss"
## [4] "withinss" "tot.withinss" "betweenss"
## [7] "size" "iter" "ifault"
First, we can investigate to what extent the number of cylinders is a good proxy for the clusters
as found with the k-means algorithm.
table(carCluster$cluster, mtcars$cyl)
##
## 4 6 8
## 1 7 0 0
## 2 0 0 3
## 3 4 7 11
# Note that the rows are the clusters (1, 2, 3) and the number of
# cylinders are the columns (4, 6, 8).
Figure 23.24: The result of k-means clustering with three clusters on the weight and fuel consump-
tion for the dataset mtcars.
Useful KPIs” on page 593 – is usually a much better solution that will dramatically increase profitability
for the bank and customer satisfaction (since customers will get products and services that
suit their needs, not their net worth).
We can also visualize the clustering in the same way as we did before: we only need to colour
the observations according to cluster and not according to number of cylinders. The ggplot2
visualisation of the next code block is in Figure 23.24 on page 454.
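A sketch of that code (an assumption, reusing the earlier plotting pattern):
ggplot(mtcars, aes(wt, mpg, color = factor(carCluster$cluster))) +
  geom_point(size = 5) +
  geom_label_repel(aes(label = rownames(mtcars)),
                   box.padding = 0.2,
                   point.padding = 0.25,
                   segment.color = 'grey60')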
Would you like to also plot the Voronoi cell borders? Then we recommend having a look
at the package ggvoronoi.
The PCA is executed in R via the function prcomp of the package stats, which is loaded
when R is started.
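A minimal sketch of that step (we assume d is the min-max normalised version of mtcars and pca1 the prcomp object that is used below):
d    <- as.data.frame(apply(mtcars, 2,
          function(x) (x - min(x)) / (max(x) - min(x))))
pca1 <- prcomp(d)
summary(pca1)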
# Note also:
class(pca1)
## [1] "prcomp"
We see that the first two components explain about 90% of the variance. This means that for
most applications only two principal components will be sufficient. This is great because the 2D
visualizations will be sufficiently clear. The function plot() on the PCA object (a prcomp object in
R) will visualize the relative importance of the different principal components (PCs) – see Figure
23.25 on page 457 – and the function biplot() projects all data onto the plane (PC1, PC2) and
hence should show maximum variance – see Figure 23.26 on page 457:
# Plot for the prcomp object shows the variance explained by each PC
plot(pca1, type = 'l')
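# The corresponding biplot (Figure 23.26):
biplot(pca1)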
The plot in Figure 23.26 on page 457 is cluttered and hardly readable. We can try to make a
better impression with ggplot2. To make clear what happens, we will do this in two steps. First,
we produce the plot (in the following code – plot in Figure 23.27 on page 458) and then we will
make sure that the labels are readable.
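A sketch of that first attempt, assuming the package ggfortify (whose autoplot() method for prcomp objects adds the explained variance to the axis labels and can overlay the loadings):
library(ggfortify)
autoplot(pca1, label = TRUE, loadings = TRUE, loadings.label = TRUE)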
Figure 23.25: The plot() function applied on a prcomp object visualises the relative importance of
the different principal components.
Figure 23.26: The function biplot() projects all data onto the plane that is spanned by the two
major PCs.
Figure 23.27: A projection in the plane of the two major principal components via ggplot2. It looks
good, but the labels are cluttered.
Both plots, Figure 23.26 on page 457 and Figure 23.27, are not very elegant. However, they
serve their purpose well: they help us to see that there are actually four types of cars in this
database. That means that there are four clusters, and now we can run the k-means algorithm
while requesting four clusters.
We can now run the clustering algorithm on the data as it is, or on the data in the orthogonal
base (and hence use all the PCs). In this case, we could even run the clustering on the first two
PCs only, because they explain so much of the variance that the clusters would be (almost) the same.
Finally, we are ready for the k-means analysis on the 11-dimensional data (not only two variables).
We choose to run the analysis directly on the normalised data frame, d. Of course, we decide on
four clusters. The last line of the following code also plots the results (with a different colour per
cluster) and adds the labels in a more readable manner; the result is in Figure 23.28 on page 459.
library(ggplot2)
library(ggrepel)
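# A sketch of the clustering step itself (4 clusters on the normalised
# 11-dimensional data frame d; the seed and nstart are assumptions -- the
# plotting is done much like the native ggplot2 version shown further below):
set.seed(1891)
carCluster <- kmeans(d, centers = 4, nstart = 25)
table(carCluster$cluster)   # the sizes of the four clusters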
From Figure 23.28 we see that the cars indeed seem to group into four clusters. Looking at
the “loadings” (the projections of the attributes) in the (PC1, PC2) space, we can understand the four
groups of cars.
1. red (West – cluster 1): the city cars. These cars are best for fuel consumption; they have a high
rear axle ratio and low displacement, horsepower and weight. They rather have a straight (in-line)
engine (the 1 in the vs variable) and tend to have a manual gearbox.
2. green (South – cluster 2): the light coupé cars (two-door sedans). These cars are able to take
four passengers and travel longer distances, but the motors are not very special (low number
of carburettors, low horsepower), the gearbox has a low number of gears, they are slow to
accelerate and typically have their cylinders in line. Therefore, they are reasonably fuel efficient.
3. cyan (East – cluster 3): the “muscle cars.” These are heavy cars with large motors that produce
a lot of horsepower; hence, they need a low rear axle ratio, are very fuel inefficient and
they tend towards automatic gears. Probably the SUVs would also fit here, but this database
predates the concept of an SUV.11
Figure 23.28: The projection of mtcars onto the plane formed by the first two principal components,
coloured by cluster number.
11 The dataset mtcars seems to have been compiled no later than the early 1980s.
4. violet (North – cluster 4): the sports cars. These cars have manual gears, are high in horsepower
and number of carburettors, and low in the number of seconds to travel a quarter mile; they are
neutral on displacement and fuel consumption (they are fast but light).
Cluster analysis is a method where the algorithm finds by itself which objects should
belong to one group. There is no training of the model needed, and hence it is considered a
method of unsupervised learning.
However, having read this far, it will be clear that the modeller has a major impact on the
outcome. First and foremost, the selection of variables is key. For example, we could leave out axle
ratio, weight, etc. and focus on the size of the factory, the country where the car is produced, the
number of colours available, etc. This could lead to a different clustering.
Another way to influence the results (in the absence of PCA) is adding related measures. For example,
adding a variable for the number of seats would be quite similar to the axle ratio, and hence we would
already have two variables that contribute to setting the sports cars apart.
autoplot is a layer of automation over ggplot2. It does a great job, but it is not so hard
to get nice results with native ggplot2. Note that in the following code most of the lines
relate to manipulating the colours and the title of the legend.
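The vector my_colors used below was defined in code not reproduced here; a plausible definition (an assumption) assigns one colour per car according to its cluster:
my_colors <- c("darkolivegreen3", "coral", "cyan3", "gray80")[carCluster$cluster]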
ggplot(pca1$x,
aes(x=PC1,y=PC2, fill = factor(carCluster$cluster))) +
geom_point(size = 5, alpha = 0.7, colour = my_colors)+
scale_fill_manual('Clusters',
values =c("darkolivegreen3","coral",
"cyan3", "gray80")) +
geom_label_repel(aes(label = rownames(mtcars)),
box.padding = 0.2,
point.padding = 0.25,
segment.color = 'grey60')
There is a much deeper relation between PCA and k-means clustering. The optimal
solution (or even a local minimum) of the k-means algorithm describes spherical clusters in their
Euclidean space. If we only have two clusters, then the line connecting the two centroids is the
one-dimensional projection direction that would separate the two clusters most clearly. This
is also what the direction of the first PC would indicate. The gravitational centre of that line
separates the clusters in the k-means approach.
This reasoning can readily be extended to more clusters. For example, if we consider
three clusters, then the two-dimensional plane spanned by the three cluster centroids is the
two-dimensional projection that shows the most variance in our data. This is similar to the plane
spanned by the first two principal components.
So, it appears that the solution of k-means clustering is given by principal component analysis
(PCA).12 The subspace spanned by the directions of the principal components is identical to the
cluster centroid subspace.
In the case where the real clusters are actually not spherical, but for example shaped like the
crescent of the moon, this reasoning does not work any more: the clusters are not well separated by
either k-means or PCA, and it is possible to construct counterexamples where the cluster centroid
subspace is not spanned by the principal directions.13
Figure 23.29: Two dimensional projections of the dependency structure of the data in the first prin-
cipal components. Note that in this plot, we see different 2D projections of 3D data.
While Figure 23.29 provides some insight into how the clusters operate, these are only 2D
projections of data that is actually 3D. Seeing the data in three dimensions is more natural and
provides so much more insight.
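The data frame PCs used in the code below holds the coordinates of the cars in the space of the first three principal components; a minimal sketch of how it might have been built, together with the pairs plot of Figure 23.29 (both are assumptions):
PCs <- as.data.frame(pca1$x[, 1:3])    # scores on the first three PCs
pairs(PCs, col = carCluster$cluster, pch = 16)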
There are some really great solutions in R to show 3D data. For example, there is the package
plot3D. The code below shows how to use it and the plot is in Figure 23.30.
library(plot3D)
scatter3D(x = PCs$PC1, y = PCs$PC2, z = PCs$PC3,
phi = 45, theta = 45,
pch = 16, cex = 1.5, bty = "f",
clab = "cluster",
colvar = as.integer(carCluster$cluster),
col = c("darkolivegreen3", "coral", "cyan3", "gray"),
colkey = list(at = c(1, 2, 3, 4),
addlines = TRUE, length = 0.5, width = 0.5,
labels = c("1", "2", "3", "4"))
)
text3D(x = PCs$PC1, y = PCs$PC2, z = PCs$PC3, labels = rownames(d),
add = TRUE, colkey = FALSE, cex = 1.2)
Figure 23.30: A three-dimensional plot of the cars with the first principal component on the x-axis,
the second on the y-axis, and the third on the z-axis.
Of course, there is also ggplot2, where stunning results can be obtained. We would prefer
to present the package gg3D, which works simply by adding a z-axis to ggplot2. Unfortunately, it
does not work with the latest versions of R.
A great alternative – for an interactive environment – is plotly.14 The plot of the following
code is Figure 23.31 on page 464.
14 We will also use plotly in Chapter 36.3.2 “A Dashboard with flexdashboard” on page 731.
library(plotly)
plot_ly(x = PCs$PC1, y = PCs$PC2, z = PCs$PC3,
type = "scatter3d", mode = "markers",
color = factor(carCluster$cluster))
Figure 23.31: plotly will produce a graph that is not only 3D but is interactive. It can be moved
with the mouse to get a better view on how the data is located. It is ideal for an interactive website or
RStudio.
(MCDA)” on page 511, and it is the field that tries to identify optimal solutions when there is more than one criterion
to optimize. This means that there is no unique mathematical solution, but that the best solution depends on
compromises.
One such algorithm is called “fuzzy clustering” – also referred to as soft clustering or soft
k-means. It works as follows:
2. Each observation has a coefficient wij (the degree of xi being in a cluster j ) for each clus-
ter — in the first step assign those coefficients randomly.
3. Compute the centroid of each cluster as the weighted mean of all observations, $c_j = \frac{\sum_i w_{ij}^m x_i}{\sum_i w_{ij}^m}$, where m is the parameter that controls how fuzzy the cluster will be. Higher values of m
result in fuzzier clusters. This parameter is also referred to as the “hyper-parameter.”
4. For each observation calculate again the weights with the updated centroids.
\[ w_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}}} \]
5. Repeat from step 3 until the coefficients change by no more than a given small value ε (the
sensitivity threshold).
The algorithm thus aims to minimise the objective function
\[ \underset{C}{\arg\min} \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^m \, \lVert x_i - c_j \rVert^2 \]
With ggplot2 and ggfortify it is easy to obtain nice results with little effort. Below we will
use the function fanny() from the package cluster to execute the fuzzy clustering and plot the
results in Figure 23.32 on page 466.
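A sketch of the clustering step and the start of the plot command (the argument values are assumptions; the layers in the code fragment below complete the command):
library(cluster)     # provides fanny()
library(ggfortify)   # provides autoplot() for clustering objects
carFanny <- fanny(d, k = 4)            # fuzzy clustering in four clusters
autoplot(carFanny, frame = TRUE) +     # projection on the first two PCs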
geom_label_repel(aes(label = rownames(mtcars)),
box.padding = 0.2,
point.padding = 0.25,
segment.color = 'grey40') +
theme_classic()
Figure 23.32: A plot with autoplot(), enhanced with ggrepel, of the fuzzy clustering for the
dataset mtcars.
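The dendrogram below comes from hierarchical clustering; a minimal sketch of how the object cars_hc might have been built (the call details are assumptions, based on the discussion of scale(), dist() and Ward's method that follows):
cars_hc <- hclust(dist(scale(mtcars)), method = "ward.D2")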
plot(cars_hc)
(The resulting plot is a dendrogram titled “Cluster Dendrogram”, grouping the 32 cars; the footer reads: hclust (*, "ward.D2").)
Note that unlike most of the other clustering algorithms used in this section, hclust() is provided
by the package stats and not by cluster. This results in minor differences in how it
functions. The function hclust() will not work on the dataset d as we did before: it
expects a “dissimilarity structure,” i.e. an object produced by the function dist(), which
calculates a dissimilarity matrix.
We choose to use the function scale() instead of our own simple scaling used before.
scale() allows us to centre the data and scale it at the same time, and it allows different choices
per column.
Hierarchical clustering does not require us to determine the number of clusters beforehand
and it results in the information-rich dendrogram. It does, however, require us to choose one of
the many possible algorithms. hclust() alone provides five methods, so it is worth studying the
functions and making a careful choice, as well as experimenting to see what works best for your goal
and your data.16
16 We count the ward.D and ward.D2 as just one method. The difference is only that the ward.D method expects the dissimilarities to be squared beforehand, while ward.D2 squares them itself as part of the algorithm.
One of the interesting clustering methods available in the function hclust() is “Ward's
method.”
More about Ward’s method can be found in Murtagh and Legendre (2011) and Szekely
and Rizzo (2005).
library(class)
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
One will notice that this function is designed to take a training dataset and a test dataset. This
makes it a little impractical for mtcars (which is a really small dataset). It also needs the parameter
cl, which holds the true classifications.
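A small sketch of how knn() could nevertheless be applied to mtcars (the choices below – predicting the number of cylinders, scaling the predictors, and using 25 cars for training – are assumptions):
set.seed(1880)
d_sc <- scale(mtcars[, -2])          # all variables except cyl (column 2)
idx  <- sample(1:nrow(d_sc), 25)     # 25 cars for training, 7 for testing
knn(train = d_sc[idx, ], test = d_sc[-idx, ],
    cl = mtcars$cyl[idx], k = 3)     # predicted cyl for the 7 test cars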
♣ 24 ♣
The package modelr provides a layer around R's base functions that not only allows us to work with
models using the pipe operator %>%, but also provides some functions that are more intuitive
to work with. modelr is not part of the core tidyverse, so we need to load it separately.
library(tidyverse)
library(modelr)
In the next sections, we will use the library modelr to create predictions, perform cross validations,
etc. While it is possible to learn it as you read through the next chapters, it is also useful
to have an overview of what modelr can do for you. Therefore, we briefly introduce the methods
that modelr provides and later use them in Chapter 25 “Model Validation” on page 475.
To present the functionality, we will focus on a simple model based on the well-known dataset
mtcars from the package datasets. As usual, we will model the miles per gallon of the different
car models.
d <- mtcars
lm1 <- lm(mpg ~ wt + cyl, data = d)
While we show the functionality on a linear model, you can use virtually any model and
modelr will take care of the rest.
Each model leads to predictions, and predictions can be used to test the quality of the model. So
adding predictions will be the work of every modeller for every model. Therefore, it is most important
to have a standardised way of doing this. modelr's function add_predictions() provides
this.
Function use for add_predictions()
Adds predictions to a dataset for a given model; the predictions are added in a column
whose name is given by the parameter var (by default "pred").
library(modelr)
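# A sketch of its use: add a column pred with the predictions of lm1
# (head() only shortens the output):
d %>% add_predictions(lm1) %>% head()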
The column pred is easy to understand as the prediction for mpg based on the other variables
using the linear model. We can now compare these values with the given mpg, calculate MSE, etc.
The second most used concept is the residual. It is the difference between the observed value and
its prediction. To calculate the MSE, for example, one first calculates the residuals, then squares
them and takes the average. Therefore, modelr also provides a function to do this: add_residuals().
Adds residuals to a given dataset for a given model. The new column is named by the
parameter var.
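A minimal sketch of its use (adding a column resid next to the predictions):
d %>% add_predictions(lm1) %>% add_residuals(lm1) %>% head()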
Again we notice the additional column resid, that contains the residuals.
As we will see in Chapter 25.4 “Cross-Validation” on page 483, bootstrapping is an essential ele-
ment of cross validation and hence an essential tool in assessing the validity of a model.
bootstrap(data, n, id = ".id")
Generates n bootstrap replicates (datasets built from random draws – with replacement –
of observations from the source data) of the dataset data.
The following code illustrates how bootstrapping can be used to generate a set of estimates for
relevant coefficients for a linear model. The histogram of the estimates is shown in Figure 24.1.
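A sketch of such an exercise (the use of purrr, broom and ggplot2 is an assumption: each replicate is refitted and the estimated coefficients are collected and plotted):
library(purrr)    # map(), map_df()
library(ggplot2)
boot   <- bootstrap(mtcars, 100)
models <- map(boot$strap, ~ lm(mpg ~ wt + cyl, data = .))
tidied <- map_df(models, broom::tidy, .id = "id")   # needs the broom package
ggplot(tidied, aes(x = estimate)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ term, scales = "free", nrow = 1)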
Figure 24.1: The results of the bootstrap exercise: a set of estimates for each coefficient.
There are many more functions available in modelr; however, they refer to concepts that appear
later, in the following chapter: Chapter 25 “Model Validation” on page 475. At least the following
will come in handy:
The library modelr also provides tools to apply formulae, interact with ggplot2, model
quality functions (see Chapter 25.1 “Model Quality Measures” on page 476), a wrapper around
model.matrix, tools to warn about missing data, re-sampling methods (Chapter 25.4.1 “Elementary
Cross Validation” on page 483), etc. Some of those will be discussed in the next chapter, which
deals with model validation: Chapter 25 “Model Validation” on page 475.
♣ 25 ♣
Model Validation
The modeller has a vast toolbox of measures to check the power of the model: lift, MSE, AUC,
Gini, KS and many more (for more inspiration, see Chapter 25.1 “Model Quality Measures” on
page 476). However, the intrinsic quality of the model is not the only thing to worry about. Even
if it fits the data that we have really well, how will it behave on new data?
First, we need to follow up on the power of the model. For example, we choose KS as the key
parameter together with a confusion matrix. This means that as new data comes in we build a
dashboard (see for example Chapter 36.3 “Dashboards” on page 725) that will read in new data
and new observations of good and bad outcomes, and as new data becomes available, we can and
should calculate these chosen parameters (KS and confusion matrix) on a regular basis.1
Secondly, we need an independent opinion. Much like a medical doctor will ask a peer for
a second opinion before a risky operation, we need another modeller to look at our model. In a
professional setting, this is another team that is specialized in scrutinizing the modelling work of
other people. Ideally, this team is rather independent and will be a central function, as opposed to
the modellers, who should rather be close to the business.
We do not want to downplay the importance of independent layers (aka “lines of defence”),
but for the remainder of this chapter, we will focus on the mathematical aspects of measuring the
quality of a model and making sure it is not over-fit.
1 In fact, we can and should also test the performance of the model on other things that were not so prevalent in
our model definition. For example, we work for a bank and made a model on house-loan approvals. We defined our
good definition on “not more than two months arrears after 3 years.” This means that we can and should monitor
things like: never paid anything back, one month arrears after one year, three months arrears any-time, etc.
modelr also provides a convenient set of functions to access the measures of quality of the model.
All functions mentioned in this section are from modelr. Consider a simple linear model that
predicts the miles per gallon in the dataset mtcars; the following risk measures are then right
away available:
# load modelr:
library(modelr)
# Fit a model:
lm1 <- lm(mpg ~ wt + qsec + am, data = mtcars)
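# A sketch of the quality measures that modelr makes directly available:
rmse(lm1, mtcars)      # root mean squared error
rsquare(lm1, mtcars)   # R-squared
mae(lm1, mtcars)       # mean absolute error
qae(lm1, mtcars)       # quantiles of the absolute error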
A simple cross validation is also easy and straightforward with modelr. We use the function
resample_partition():
set.seed(1871)
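# A sketch of such a split and its evaluation (the proportions are
# illustrative; resample_partition() returns a list of resample objects):
parts <- resample_partition(mtcars, c(train = 0.7, test = 0.3))
lm_cv <- lm(mpg ~ wt + qsec + am, data = parts$train)
rmse(lm_cv, parts$test)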
Bespoke functions of modelr allow us to add predictions and residuals to a dataset. Consider the
same linear model as in the previous section (predicting mpg in the dataset mtcars):
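A sketch of how the data frame df used below might have been built (assuming the pipe from the tidyverse is available):
df <- mtcars %>%
  add_predictions(lm1) %>%
  add_residuals(lm1)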
sum((df$resid)^2) / nrow(mtcars)
## [1] 5.290185
mse(lm1, mtcars)
## [1] 5.290185
It might be useful to visualize the model with a data grid with even spacing for the independent
variables via the function data_grid() – output in Figure 25.1 on page 477:
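A sketch of such a grid (data_grid() builds an evenly spaced grid of the predictors, filling in typical values for the others via .model; the object name and the plotting call are assumptions chosen to match the axis labels d$pred and d$wt in the figure):
d <- mtcars %>%
  data_grid(wt = seq_range(wt, 20), .model = lm1) %>%
  add_predictions(lm1)
plot(d$wt, d$pred)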
25.3 Bootstrapping
Bootstrapping is a technique to enrich the data that we have: it repeatedly calculates a metric
based on a different subset of the data, drawn randomly and with replacement.
It is part of the broader class of resampling methods, and it is possible to use bootstrapping to
estimate measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) of
the estimates.
With bootstrapping, one estimates properties of an estimator (for stochastic variables such
as conditional value at risk or variance) by measuring those properties from an approximated
distribution based on a sample (e.g. the empirical distribution of the sample data).
The reasons to do this include:
• testing the robustness of the model: we calibrate it on a subset of the data and see how it
performs on another set of data.
The sampling itself can be done with base R's function sample(), which – besides the vector x
and the size – also accepts the argument
• prob: a vector of probability weights for obtaining the elements of the vector being
sampled.
Consider the following example: data from the 500 largest companies on the US stock
exchanges, the Standard & Poor's 500 index (S&P 500). Let us simply take a sample and
compare its histogram with that of the complete dataset. This plot is in Figure 25.2 on page 480.
# Create the sample:
SP500_sample <- sample(SP500,size=100)
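# A sketch of the comparison plot of Figure 25.2 (the SP500 returns come
# with the package MASS; the 2 x 2 layout with two density histograms and
# two boxplots is an assumption):
par(mfrow = c(2, 2))
hist(SP500, freq = FALSE, xlab = "SP500", main = "")
hist(SP500_sample, freq = FALSE, xlab = "SP500_sample", main = "")
boxplot(SP500)
boxplot(SP500_sample)
par(mfrow = c(1, 1))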
Figure 25.2: Bootstrapping the returns of the S&P500 index.
In base R, the sample is a dataset itself and it can be addressed as any other dataset:
mean(SP500)
## [1] 0.04575267
mean(SP500_sample)
## [1] 0.08059197
sd(SP500)
## [1] 0.9477464
sd(SP500_sample)
## [1] 0.9380602
mean(as.data.frame(boot$strap[[3]])$mpg)
## [1] 20.66875
set.seed(1871)
library(purrr) # to use the function map()
boot <- bootstrap(mtcars, 150)
par(mfrow=c(1,1))
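# A sketch of how the coefficient histograms of Figure 25.3 might be
# produced (the model mpg ~ wt + hp + am:vs is the one used later in this
# chapter; the layout and colours are assumptions):
coefs <- map(boot$strap, ~ coef(lm(mpg ~ wt + hp + am:vs, data = .)))
coefs <- do.call(rbind, coefs)        # one row of estimates per replicate
par(mfrow = c(2, 2))
for (nm in c("wt", "hp", "am:vs", "(Intercept)")) {
  hist(coefs[, nm], main = nm, xlab = "estimate", col = "khaki3")
}
par(mfrow = c(1, 1))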
Figure 25.3: The histograms of the different coefficients of the linear regression model predicting the
mpg in the dataset mtcars. We show (a) Estimate for wt., (b) Estimate for hp., (c) Estimate for am:vs.,
and (d) Estimate for the intercept.
25.4 Cross-Validation
Making a model is actually simple; making a good model is really hard. Therefore, we tend to
spend a lot of time validating the model and testing the assumptions. In practice, one of the most
common issues is that the data is biased: spurious correlations lead us astray and might mask true
dependencies, or at least mislead us into thinking that the model is really strong, while in reality it
is based on coincidental correlations in that particular dataset. Later, when the model is put into
production, the results will be very different. This effect is called “over-fitting.”
1. The performance of the model on the testing dataset should not be too much lower than
that calculated on the training data.
2. The absolute values of the performance calculated on the testing dataset should match the
set criteria (not the performance calculated on the training dataset). For example, when we
want to compare this model with a challenger model, we use the performance calculated
on the testing dataset.
As usual, there are multiple ways of doing this. The function sample() from base R is actually
all we need:
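# A sketch of a split with base R only (the names d_train and d_test are
# assumptions, reused further below):
set.seed(1870)
idx     <- sample(1:nrow(mtcars), round(0.75 * nrow(mtcars)))
d_train <- mtcars[idx, ]
d_test  <- mtcars[-idx, ]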
If we work with the tidyverse, then it makes sense to use the function resample() of modelr,
not least because it allows us to use the piping command. That function generates a “resample
class object” that is simply a set of pointers to the original data. It can be turned into data by
coercing it to a data frame.
set.seed(1870)
sample_cars <- mtcars %>%
resample(sample(1:nrow(mtcars),5)) # random 5 cars
# or into a tibble
as_tibble(sample_cars)
## # A tibble: 5 x 11
## mpg cyl disp hp drat wt qsec vs am
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 8 472 205 2.93 5.25 18.0 0 0
## 2 21 6 160 110 3.9 2.62 16.5 0 1
## 3 10.4 8 460 215 3 5.42 17.8 0 0
## 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1
## 5 33.9 4 71.1 65 4.22 1.84 19.9 1 1
## # ... with 2 more variables: gear <dbl>, carb <dbl>
A further advantage of the resample family of functions is that there is a concise way to split
the data between a training and a testing dataset: the function resample_partition() of modelr.
library(modelr)
rs <- mtcars %>%
resample_partition(c(train = 0.6, test = 0.4))
# Check execution:
lapply(rs, nrow)
## $train
## [1] 19
##
## $test
## [1] 13
Now that we have a training and a test dataset, we have all the tools necessary. The standard
workflow now becomes simply the following:
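# A sketch of that workflow (the object names below are assumptions,
# chosen to match the code that follows):
# 1. Coerce the partitions to data frames and fit on the training data:
d_train <- as.data.frame(rs$train)
d_test  <- as.data.frame(rs$test)
lm1     <- lm(mpg ~ wt + hp + am:vs, data = d_train)
# 2. Calculate the performance on both datasets:
rmse_trn <- rmse(lm1, d_train)
rmse_tst <- rmse(lm1, d_test)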
print(rmse_tst)
## [1] 2.097978
We were using a performance measure that was readily available via the function rmse(),
but if we want to calculate another risk measure, we might need the residuals and/or the predictions
first. Below, we calculate the same risk measure without using the function rmse(). Note that
step one is the same as in the aforementioned code.
# 2. Add predictions and residuals:
x_trn <- add_predictions(d_train, model = lm1) %>%
add_residuals(model = lm1)
x_tst <- add_predictions(d_test, model = lm1) %>%
add_residuals(model = lm1)
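# 3. Calculate the RMSE from the residuals (a sketch):
RMSE_tst <- sqrt(mean(x_tst$resid^2))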
print(RMSE_tst)
## [1] 2.097978
In general, risk metrics calculated on the test dataset will be worse than those calculated on
the training dataset. The difference gives us an indication of the robustness of the model.
This simple test might be misleading on small datasets such as mtcars: depending on the split,
the results vary from a better result on the test set than on the training set to significantly worse
results. We will discuss solutions in the next sections. If the results are much worse on the testing
dataset, then this is an indication that the model is over-fit and should not be trusted for new data.
For most practical purposes, the MSE calculated on the test dataset will be more relevant than
the one calculated on the training set.
In the examples, we selected randomly. For the datasets mtcars and titanic, this makes
sense because all observations have the same time-stamp. If we have an insurance or loan
portfolio for example where all customers came in over the last ten years or so, we might
consider other options. Assume we have ten years history, then we might test with this
method if a model calibrated on the first eight years will still be a good fit for the customers
entering the last two years.
We split our data in a training and validation set, then calibrate the model on the train-
ing data and finally check its validity on the test dataset. This method validates – if the
results on the test dataset are satisfactory – the choice of model (linear regression, logistic
regression, etc.) as well as the parameters that go in the model (such as wt and hp for the
mtcars data). Along the way we also found an intercept and coefficients.
But which model should we use in production? Should we use the coefficients as produced
by the training model, or should we refit and use all our data?
There is some controversy on the subject; however, from a statistician's point of view it
would be hard to defend not using all available data.a So, we should indeed make a last effort
and recalibrate the model on all relevant data before we put it in production.
However, doesn't doing so take away the ability to validate the final model again? But then, wasn't this done already?
a There might be a good reason to do so: for example if we assume that old exchange rates or older
customers might not be representative for the customers coming in now. In that case, we might choose
to calibrate a model on the last five years of data for example and do this every year or month again. Of
course, the choice of intervals can then be subject to another cross validation.
The rule of thumb is that both the training and the validation dataset need enough observations
to draw some statistically significant conclusions. Typically, the training dataset is larger
than the test dataset. This choice is of course related to the answer to the aforementioned
question (in the info-box “to refit or not to refit”). If the split is only used for validation purposes,
then it does not matter too much. If, however, one wants to use the
model fitted on the training data only afterwards, then this set of data must be chosen as large as
possible.
In practice – and for larger databases – the test dataset is anything between 15% and
30% of the data.
For example, suppose we choose to use 10% of our data as test data. The first repetition might then
yield rows number 2, 7, 15, and 19. The next run, it might be 8, 10, 15 and 18. Since the partitions
are done independently for each run, the same point can appear in the test set multiple times.2
While it is not much work to repeat the aforementioned code in a for-loop, modelr provides
one convenient function to create the desired number of selections in one step. The following
code uses modelr's crossv_mc() function and explores the result.
# Monte Carlo cross validation
cv_mc <- crossv_mc(data = mtcars, # the dataset to split
n = 50, # n random partitions train and test
test = 0.25, # validation set is 25%
id = ".id") # unique identifier for each model
# Example of use:
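# (a sketch) the object holds the resample partitions in list-columns;
# e.g. the first test partition:
cv_mc$test[[1]]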
To illustrate how the Monte Carlo cross validation works in practice, we reuse the example
of Chapter 25.3 “Bootstrapping” on page 479: fitting a simple linear model on the data of mtcars
and predicting mpg. We use modelr and purrr, and plot the histogram of Figure 25.4 on page 488
in the following code.
set.seed(1868)
library(modelr) # sample functions
library(purrr) # to use the function map()
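# A sketch of the Monte Carlo cross-validation pipeline (an assumption:
# the partitions are re-created after setting the seed, the model is the
# same as before, and rmse() scores each fit on its matching test set):
cv_mc  <- crossv_mc(mtcars, n = 50, test = 0.25)
models <- map(cv_mc$train, ~ lm(mpg ~ wt + hp + am:vs, data = .))
RMSE   <- map2_dbl(models, cv_mc$test, rmse)
hist(RMSE, col = "khaki3", main = "Histogram of RMSE", xlab = "RMSE")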
Note that the result of the simple cross validation of the previous section (the RMSE was
2.098) is in the range found by the Monte Carlo cross validation (see Figure 25.4 on
page 488). This is normal. The simple cross validation is only one experiment, and the
Monte Carlo cross validation repeats, in some sense, this elementary experiment multiple
times.
While for the novice a for-loop might be easier to understand, the aforementioned tidy code is concise and surprisingly readable. The way of coding by using the piping
2 This is the key difference with the method described in the following section.
Figure 25.4: The histogram of the RMSE for a Monte Carlo cross validation on the dataset mtcars.
command has the advantage of allowing the reader to focus on the logic of what happens instead
of on the intermediary variables in which data is stored. We discuss this programming style
and the details of how to use the piping operators in Section 7.3.2 “Piping with R”
on page 132.
For a k-fold cross validation we divide the data into k subsets (“folds”). For example, if we have 100 observations and
k = 10, then we might assign rows 1–10 to fold #1, 11–20 to fold #2, . . . , rows 91–100 to fold #10.
Now, we consider one fold as the test dataset and the nine others as the training set. This
process is then repeated 10 times, until each fold has had its turn as test dataset.
The function crossv_kfold of modelr will prepare the selections for each run as follows.
library(modelr)
# k-fold cross validation
cv_k <- crossv_kfold(data = mtcars,
k = 5, # number of folds
id = ".id") # unique identifier for each
Each observation of the 32 will now appear once in one test dataset:
cv_k$test
## $`1`
## <resample [7 x 11]> 4, 10, 11, 14, 17, 18, 25
##
## $`2`
## <resample [7 x 11]> 5, 6, 9, 12, 28, 29, 32
##
## $`3`
## <resample [6 x 11]> 1, 3, 7, 24, 30, 31
##
## $`4`
## <resample [6 x 11]> 2, 15, 20, 21, 22, 26
##
## $`5`
## <resample [6 x 11]> 8, 13, 16, 19, 23, 27
The previous example – used in Chapter 25.4.2 “Monte Carlo Cross Validation” on page 486 –
now becomes, with a 5-fold cross validation:
set.seed(1868)
library(modelr)
library(magrittr) # to access the %T>% pipe
crossv <- mtcars %>%
crossv_kfold(k = 5)
RMSE <- crossv %$%
map(train, ~ lm(mpg ~ wt + hp + am:vs, data = .)) %>%
map2_dbl(crossv$test, rmse) %T>%
hist(col = "khaki3", main ="Histogram of RMSE",
xlab = "RMSE")
Figure 25.5: Histogram of the RMSE based on a 5-fold cross validation. The histogram indeed shows
that there were 5 observations. Note the significant spread of RMSE: the largest one is about four times
the smallest.
If we use cross validation for inferential purposes – to statistically compare two possible algorithms
– then averaging the results of a k-fold cross validation run offers a nearly unbiased estimate
of the algorithm's performance. This is a strong argument in favour of the k-fold cross
validation. Note, though, that the variance might be rather high, because there are only k observations.
This – of course – becomes more important as the data set gets smaller. Therefore, k-fold
validation is mainly useful for larger datasets.
With a Monte Carlo cross validation we will sooner have more observations to average, and
hence the result will be less variable. However, the estimator is more biased.
Taking these observations into account, it is of course possible to design multiple methods
that combine advantages of both Monte Carlo and k-fold cross validation. For example,
the “5 × 2 cross validation” (five iterations of a two-fold cross validation), “McNemar’s
cross validation,” etc. (see Dietterich (1998) for a comparison of some methods as well as
more details in Bengio and Grandvalet (2004)).
A cross validation is a good idea to check that the model is not over-fit, but it is not a silver bullet.
All cross validation methods (even simply splitting the data in two sets) are, for example, vulnerable if
the data contains only a limited number of observations.
Also a cross validation method cannot solve problems with the data itself. For example a bank
that is making a model to approve credit requests will only have data of customers that were
approved in the past. The data does not contain a single customer that was not approved in the
past.3
In model types that do not directly allow for weighting rare but important observations4 it
might be advisable to force enough such observations to appear in both training and testing set
or alternatively multiply the rare observations so that they get more “weight.”
Usually we have more observations of one type (e.g. customers that pay back the loans as
compared to those defaulting), and usually we have more observations in our training dataset
than in our testing dataset. This can lead to the situation that the split in good/bad is very different
in our testing dataset as compared to the overall population. This is not a good situation to start
from and we need to enrich data or take another sample.
3 In this particular example the bank can alleviate this issue to some extent. In most countries there will be a
credit bureau that collects the credit history of more people than just the customers of this one bank. So, we would also
have information about people that got a loan somewhere else. Still, we do not have information about people that got
refused in every bank.
4 For example, the decision tree allows quite directly for such implementation of a weight, the logistic regression
In the previous sections, we have focussed on validation methods that help us to assess how good
a model is and compare models; via cross validation, we can even make an educated guess of how
well it will perform when new data comes in. Is this enough? Or is there more that can and should
be done?
Model validation that investigates the power of a model and cross validation alone would be
like a medical doctor that only takes your pulse and concludes that you are fine to run the race
while he forgets to check if you have legs at all. The model is just one section of the wheel of
continuous improvement.
The complexity of our world is increasing fast and more and more data is available. Making a
model and just basking in its splendour is not an option. Models are more and more an integrated
part of a business cycle and essential to a company's sustainable results. Figure 25.6 shows a
simple version of a model cycle.
1. Formulate a question: It all starts with a question. For example: can we provide consolidation
loans?5 That question leads directly to the type of data we need. In this case, we
could use data from existing products such as credit cards, overdrafts, cash loans, purpose
loans, etc.
(Diagram: formulate question → get data → wrangle data → make & fit model → validate model → use model → gather data → and back to the question.)
Figure 25.6: The life cycle of a model: a model is an integrated part of business and focus of contin-
uous improvement. Note how using a model will collect more data and lead to improvement of the
model itself.
5 These are loans that replace a set of loans that the customer already has. People that have too much loans on
short term might struggle to pay the high interests and it might make sense to group all the loans together into one
new loan that has a longer horizon. However, this customer already got himself or herself into a difficult situation.
Will he or she manage his or her finances better this time?
2. Get data: Once we know what data is needed, we can get it from the data-mart or data-warehouse
in our company. This is typically managed by a team that resides under the chief data officer.
This is the subject of Part III “Data Import” on page 213.
3. Wrangle data: Once we have the data, it will most probably need some tweaking before
we can use it to fit a model. We might have too much and too sparse information, and we might
consider consolidating variables (e.g. the list of 100 account balances at given dates can
be reduced to the maximal overdraft and the maximum time in red). We might also have missing
data and need to decide what to do with the missing variables, and we need to come
up with a binning that makes sense and provides a good weight of evidence or information
value. This is Part IV “Data Wrangling” on page 259.
4. Make and fit the model: First, we need to decide which model to use, and all models come
with advantages and disadvantages. It might be worth making a few; once we have a
model, a new model can be considered as a “challenger model.” The new model would
then need to have significant advantages in terms of performance, stability (a measure of how
much it is not over-fit), simplicity and/or transparency in order to be preferred. For example, we
can choose a linear regression, make an appropriate binning of our data, and fit the model.
This is the subject of Part V “Modelling” on page 373.
5. Validate the model: This is the step where we (or preferably an independent person) scrutinize
the model. This consists of (i) the mathematical part of fitness and over-fitness of a
model (this section: Chapter 25 “Model Validation” on page 475) and (ii) an audit of the
whole modelling cycle (this is what we are explaining now).
6. Use the model: After passing its final exam – the validation – the model can be used. Now,
the model will be used for business decisions (e.g. to which customers do we propose our
new product, the consolidation loan).
7. Gather more data: As the model is being used, it will generate new data. In our example we
can already see if customers do at least start to pay back our loan, after a year we can model
the customers that are three months in arrears, etc. This data together with the performance
monitoring of the model (not visualized in Figure 25.6 on page 492) will open the possibility
to ask new questions . . . and the cycle can start again.
Therefore, it is not possible to validate a model by only considering how fit and how over-fit
it is. The whole cycle needs to be considered. For example, the validator will want to have a look
at the following aspects.
1. Question: A model answers a question, but is this the right question to ask? Is there a better
question to ask?
2. Get data: When pulling data from systems and loading it into other systems, it is important
to check if conversions make sense (for example, an integer can be encoded in 4 bits in one system
and in only 2 in the other, which means that large numbers will be truncated). More
importantly, did we use the right data? Is there data that could be more appropriate?
3. Wrangle data: Data wrangling is already preparing directly for the model. Different models
will need different types of data and will be more sensitive to certain assumptions. Is there
missing data? How much? What did we do with it? Is the selection of the target variable a good
one? How can it be improved?
4. Make and fit the model: The modelling as such is also an art and needs to be in perfect
harmony with the data wrangling. Is the model the best that is possible? Is the choice of
model good? Does the random forest give enough added value to mitigate its inherent lack
of transparency? What are the weaknesses of data and model? What are the implications
for the business? Can we trust that the underlying assumptions that the model makes are
valid?
5. The model maintenance cycle: Can we trust that the performance, assumptions, data quality,
etc. are monitored and revisited at reasonable intervals? Finally, it is essential to have
a thorough look at the totality of the business model, question, data, model and the use
of the model. Does it all make sense? Is the model used within the field of its purpose or
are we using the model for something that it is not designed to do? What are the business
risks, and how do those business risks fit into the risk of the whole company (and eventually
society)?
This whole process should result in knowledge of the weak points of the model, and it should
provide confidence that the model is capable of its designated task: is it fit for purpose? Finally,
this process will result in a document that describes all weaknesses and makes suggestions on
how to mitigate risks, enhance the model and make it more reliable. This will ensure that the
users of the model understand its limitations. In rare cases, the whole business model, product
line, or business logic needs to be challenged.
It is true that it is in everyone’s interest to do this and that a good modeller is to be trusted,
but it is even more true that an independent investigation will find things that the modeller did
not think of.
♣ 26 ♣
Labs
The quantmod package for R is directly helpful for financial modelling. It is “designed to assist
the quantitative trader in the development, testing, and deployment of statistically based trading
models” (see https://www.quantmod.com). It allows the user to build financial models, has
a simple interface to get data, and has a suite of plots that not only look professional but also
provide insight for the trader.
Its website is https://www.quantmod.com.
Also note that this section uses concepts of financial markets. It is best to understand at least
the difference between bonds and stocks before going through this section. If these concepts are
new, we recommend Chapter 30 “Asset Valuation Basics” on page 597.
# Install quantmod (only if it is not yet installed):
if (!any(grepl("quantmod", installed.packages()))) {
  install.packages("quantmod")
}
Now, we are ready to use quantmod. For example, we can start downloading some data with
the function getSymbols():
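The call itself is not reproduced in this extract; a minimal example (the ticker HSBC is the one used throughout this chapter) would be:

getSymbols("HSBC")  # downloads daily OHLC data and assigns it to the object HSBC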
quantmod also allows the user to specify lookup parameters, save them for future use, and
download more than one symbol in one line of code:
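The corresponding code chunk is not shown in this extract; a sketch of such a lookup (the second ticker and the file name are purely illustrative) could be:

setSymbolLookup(HSBC = 'yahoo', GS = 'yahoo')
saveSymbolLookup(file = "mySymbols.rda")  # save the lookup settings for later sessions
getSymbols(c("HSBC", "GS"))               # download more than one symbol in one line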
If one would like to download all symbols in one data-frame, then one can use the package
BatchGetSymbols, or simply collect the data, for example with cbind(), in a data frame.
For other services it is best to refer to the website of the data provider, e.g. https://fred.stlouisfed.org. Alternatively, one can use a service such as https://www.quandl.com/data/FRED-Federal-Reserve-Economic-Data to locate data from a variety of sources. Macro-economic data is available on the website of the World Bank, etc.
For FX data, www.oanda.com is a good starting point and the function getFX() will do most of the work.
getFX("EUR/PLN", from = "2019-01-01")
## [1] "EUR/PLN"
Data about metals can be downloaded with the function getMetals(). More
information about all possibilities can be found in the documentation of quantmod:
https://cran.r-project.org/web/packages/quantmod/quantmod.pdf.
This created a data object with the name “HSBC” and it can be used directly in the traditional
functions and arithmetic, but quantmod also provides some specific functions that make working
with financial data easier.
For example, quantmod offers great and visually attractive plotting capabilities. Below we will
plot the standard bar chart (Figure 26.1), line plot (Figure 26.2 on page 498), and candle chart
(Figure 26.3 on page 498).
# Note: the lineChart is also the default and yields the same
# result as plot(HSBC)
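# The plotting calls themselves are not part of this extract; presumably they
# are along these lines (the subset for the candle chart is an assumption
# based on the date range shown in Figure 26.3):
barChart(HSBC)                              # Figure 26.1
lineChart(HSBC)                             # Figure 26.2
candleChart(HSBC, subset = 'last 1 month')  # Figure 26.3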
[Figure 26.1: bar chart of HSBC (2007-01-03 to 2020-01-29) with the volume pane below.]
[Figure 26.2: line plot of HSBC (2007-01-03 to 2020-01-29) with the volume pane below.]
[Figure 26.3: candle chart of HSBC (2020-01-02 to 2020-01-29) with the volume pane below.]
So, there is no need to find a data source, download a portable format (such as CSV), load it
in a data-frame, name the columns, etc. The getSymbols() function downloads the daily data
going 10 years back in the past (if available of course). The plot functions such as lineChart(),
barChart(), and candleChart() display the data in a professional and clean fashion. The looks
can be customized with a theme-based parameter that uses intuitive conventions.
There is much more and we encourage you to explore for yourself. To give you a taste: let us
display a stock chart with many indicators and Bollinger bands with three short lines of code (the
plot is in Figure 26.4):
getSymbols(c("HSBC"))
## [1] "HSBC"
where n is the sampling size of the moving average (of type “maType” – SMA is the simple moving
average) and sd is the multiplier for the standard deviation to be used as band.
[Figure 26.4: chart of HSBC (2019-10-01 to 2020-01-29) with Bollinger Bands (20, 2) and the volume pane below.]
Bollinger bands aim to introduce a relative indication of how high a stock is trading
relative to its past. Therefore, one calculates a moving average and volatility over a given
period (typically n = 20) and plots bands at plus and minus the volatility multiplied by a
multiplier (typically sd = 2). If the stock trades below the lower band, then it is to be considered
as low.
The lines drawn correspond to ma − sd·σ, ma and ma + sd·σ, with σ the standard deviation
and where ma and sd take the values as defined above.
Some traders will then buy it because it is considered “a buy opportunity,” others might
consider this as a break in trend and rather short it (or sell off what they had, as it is not living up
to its expectations).
The date format follows the ANSI standard (CCYY-MM-DD HH:MM:SS), and ranges are specified
via the “::” operator. The main advantage is that these functions are robust regardless of the
underlying data: this data can be a quote every minute, every day, or every month and the functions
will still work.
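For instance, with xts-style date subsetting (a small illustration, not taken from the book's code):

HSBC['2019']               # all observations of 2019
HSBC['2019-06::2019-08']   # June through August 2019
HSBC['::2010-01-15']       # everything up to 15 January 2010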
In order to rerun a particular model one will typically need the data of the last few weeks,
months or years. Also here xts comes to the rescue with handy shortcuts:
periodicity(HSBC)
unclass(periodicity(HSBC))
to.weekly(HSBC)
to.monthly(HSBC)
periodicity(to.monthly(HSBC))
ndays(HSBC); nweeks(HSBC); nyears(HSBC)
As these functions depend on the upstream package xts and are not specific to quantmod, they
can also be used on non-OHLC data:
getFX("USD/EUR")
## [1] "USD/EUR"
periodicity(USDEUR)
## Daily periodicity from 2019-08-04 to 2020-01-28
to.monthly(USDEUR)
## USDEUR.Open USDEUR.High USDEUR.Low USDEUR.Close
## Aug 2019 0.900219 0.909910 0.892214 0.909910
## Sep 2019 0.909908 0.916008 0.902656 0.916008
## Oct 2019 0.916670 0.916670 0.895232 0.896330
## Nov 2019 0.896046 0.908504 0.895620 0.907536
## Dec 2019 0.907528 0.907528 0.891614 0.891614
## Jan 2020 0.891860 0.907806 0.891860 0.907806
periodicity(to.monthly(USDEUR))
## Monthly periodicity from Aug 2019 to Jan 2020
endpoints(HSBC, on = "years")
## [1] 0 251 504 756 1008 1260 1510 1762 2014 2266
## [11] 2518 2769 3020 3272 3291
market closure.
The functions Hi(), Lo(), Cl(), Vo() and Ad() extract the high, low, close, volume and adjusted-close columns; has.Hi(), has.Cl(), etc. test whether such a column is present; and seriesHi() locates the overall high of the series.

seriesHi(HSBC)
## HSBC.Open HSBC.High HSBC.Low HSBC.Close
## 2007-10-31 98.92 99.52 98.05 99.52
## HSBC.Volume HSBC.Adjusted
## 2007-10-31 1457900 52.73684

has.Cl(HSBC)
## [1] TRUE

tail(Cl(HSBC))
## HSBC.Close
## 2020-01-22 38.07
## 2020-01-23 37.72
## 2020-01-24 37.52
## 2020-01-27 36.51
## 2020-01-28 36.73
## 2020-01-29 36.70
There are even functions that will calculate differences, for example:
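The example code is not included in this extract; quantmod's difference helpers include calls such as the following (OpCl() is also used a few lines further down):

OpCl(HSBC)   # one-period open-to-close change
ClCl(HSBC)   # one-period close-to-close change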
These functions rely on the following functions, which are also available to use:
• Lag(): gets the previous value in the series

Lag(Cl(HSBC))
Lag(Cl(HSBC), c(1, 5, 10)) # One, five and ten period lags
Next(OpCl(HSBC))
There are many more wrappers and functions, such as period.min(), period.sum(),
period.prod(), and period.max().
More often than not it is the return that we are interested in, and there is a set of functions
that makes this easy and straightforward. There is of course the master-function
periodReturn(), which takes a parameter “period” to indicate which periods are desired. Then
there is also a suite of derived functions that carry the name of the relevant period.
The convention used is that the first observation of the period is the first trading time of
that period, and the last observation is the last trading time of the period, on the last
day of the period. xts has adopted the last observation of a given period as the date to record for
the larger period. This can be changed via the indexAt argument; we refer to the documentation
for more details.
dailyReturn(HSBC)
weeklyReturn(HSBC)
monthlyReturn(HSBC)
quarterlyReturn(HSBC)
yearlyReturn(HSBC)
allReturns(HSBC) # all previous returns
setSymbolLookup(SPY = 'yahoo',
VXN = list(name = '^VIX', src = 'yahoo'))
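The call that builds the quantmod object qmModel is on a page not included in this extract; based on the description that follows, it presumably looks like:

qmModel <- specifyModel(Next(OpCl(SPY)) ~ OpCl(SPY) + Cl(VIX))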
The object qmModel is now a quantmod object holding the model formula and data structure,
implying that the next (Next) period's open to close of the S&P 500 ETF (OpCl(SPY)) is modelled as
a function of the current period's open to close and the current close of the VIX (Cl(VIX)).4
The call to modelData() extracts the relevant data set. A more direct function to accomplish
the same end is buildData().
lineChart(HSBC)
The line-chart shows that the behaviour of the stock is very different in the period after the
crisis. Therefore, we decide to consider only data after 2010.
HSBC.tmp <- HSBC["2010/"] #see: subsetting for xts objects
4 The VIX, the CBOE Volatility Index (known by its ticker symbol VIX), is a popular measure of the stock market's
expectation of volatility implied by S&P 500 index options, calculated and published by the Chicago Board Options
Exchange.
Figure 26.5: The evolution of the HSBC share for the last ten years.
The next step is to divide our data into a training dataset and a test dataset. The training set is the
set that we will use to calibrate the model, and then we will see how it performs on the test data.
This process will give us a good idea about the robustness of the model.
# use 70% of the data to train the model:
n <- floor(nrow(HSBC.tmp) * 0.7)
HSBC.train <- HSBC.tmp[1:n]                  # training data
HSBC.test  <- HSBC.tmp[(n+1):nrow(HSBC.tmp)] # test data
# head(HSBC.train)
Till now we used the functionality of quantmod to pull in data, but the function
specifyModel() allows us to prepare the data for modelling automatically: it will align the next
opening price with the explaining variables. Further, modelData() allows us to make sure the data
is up-to-date.
# making sure that whenever we re-run this the latest data
# is pulled in:
m.qm.tr <- specifyModel(Next(Op(HSBC.train)) ~ Ad(HSBC.train)
+ Hi(HSBC.train) - Lo(HSBC.train) + Vo(HSBC.train))
D <- modelData(m.qm.tr)
We decide to create an additional variable that is the difference between the high and low
prices of the previous day.
# Add the additional column:
D$diff.HSBC <- D$Hi.HSBC.train - D$Lo.HSBC.train
The column names of the data inherit the full name of the dataset. This is not practical, since
the names will be different in the training set and in the test set. So we rename them before making
the model.
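The renaming and the first model are on pages not reproduced in this extract; a sketch of what they could look like (the short column names and the response name Next.Op are assumptions – check names(D) first):

# Hypothetical short names -- verify the order against names(D):
colnames(D) <- c("Next.Op", "Ad", "Hi", "Lo", "Vo", "Diff")
m1 <- lm(D$Next.Op ~ D$Ad + D$Diff + D$Vo)
summary(m1)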
The volume of trading in the stock does not seem to play a significant role, so we leave it out.
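A second model without the volume term (again a sketch under the same assumptions) would then be:

m2 <- lm(D$Next.Op ~ D$Ad + D$Diff)
summary(m2)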
From the output of the command summary(m2) we learn that all the variables are significant
now. The R² is slightly down, but in return, one has a much more stable model that is not over-fitted
(or at least less over-fitted).
Some more tests can be done. We should also make a Q-Q plot to make sure the residuals are
normally distributed. This is done with the function qqnorm().
qqnorm(m2$residuals)
qqline(m2$residuals, col = 'blue', lwd = 2)
Figure 26.6: The Q-Q plot of our naive model to forecast the next opening price of the HSBC stock.
The results seem to be reasonable.
Figure 26.6 shows that the model captures the tail-behaviour of the forecasted variable reasonably
well. However, the predictive power is not great.
First, we prepare the test data in the same way as the training data:
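The preparation code is not included in this extract; mirroring the steps used for the training data, it presumably looks roughly like this (column names assumed as above):

m.qm.tst <- specifyModel(Next(Op(HSBC.test)) ~ Ad(HSBC.test)
             + Hi(HSBC.test) - Lo(HSBC.test) + Vo(HSBC.test))
D.tst <- modelData(m.qm.tst)
D.tst$diff.HSBC <- D.tst$Hi.HSBC.test - D.tst$Lo.HSBC.test
colnames(D.tst) <- c("Next.Op", "Ad", "Hi", "Lo", "Vo", "Diff")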
We could of course use the function predict() to find the predictions of the model, but here
we illustrate how coefficients can be extracted from the model object and used to obtain these
predictions. For the ease of reference we will name the coefficients.
a <- coef(m2)['(Intercept)']
bAd <- coef(m2)['D$Ad']
bD <- coef(m2)['D$Diff']
est <- a + bAd * D.tst$Ad + bD * D.tst$Diff
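A quick way to inspect the prediction error would then be, for example (Next.Op is the assumed name of the realised next opening price):

summary(as.numeric(est - D.tst$Next.Op))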
These values give us an estimate of the error that can be expected when using this simple model.
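The code that fits the model m3 on the test data is not shown in this extract; given the coefficient names in the output below, it is presumably of the form:

m3 <- lm(D.tst$Next.Op ~ D.tst$Ad + D.tst$Diff)   # response name is an assumption
summary(m3)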
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.85764 0.82818 -4.658 3.43e-06 ***
## D.tst$Ad 1.45544 0.02341 62.179 < 2e-16 ***
## D.tst$Diff 6.54810 0.32930 19.885 < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.013 on 1769 degrees of freedom
## Multiple R-squared: 0.6872,Adjusted R-squared: 0.6868
## F-statistic: 1943 on 2 and 1769 DF, p-value: < 2.2e-16
One will notice that the estimates for the coefficients are close to the values found in model
m2. Since the last model, m3, includes the most recent data it is probably best to use that one and
even update it regularly with new data.
Finally, one could compare the models fitted on the training data and on the test data and
consider on what time horizon the model should be calibrated before use. One can consider the
whole dataset, the last five years, the training dataset, etc. The choice will depend on the reality
of the environment rather than on naive mathematics, although one machine-learning approach
would consist of using all possible data-horizons and finding the optimal one.
♣ 27 ♣
Multi Criteria Decision Analysis (MCDA)
All models discussed previously use one common assumption: there is just one variable to model.
While it might be complicated to find an optimal value for that variable, or very complicated to
predict that one variable, it does not prepare us well for the situation where there is more than
one function to optimize.
Think for example about the situation where you want to buy something – anything actually:
then you certainly want to purchase it cheap but still of a good quality. Unfortunately, goods of
high quality will have high prices, so there is no such thing as an option that is optimal for both
variables. Fortunately our brain is quite well suited to solve such problems: we are able to buy
something, get married, decide what to eat, etc. The issue is that there is no clear bad choice but
also no clear good choice. People hence tend to decide with gut-feeling. This is great because it
helps people to do something: fight or flight in a dangerous situation, start a business, explore
a new path, get married, vote, etc.1 The world would be very different if our brain was unable
to solve such questions really efficiently: such decisions come naturally to us, but they are hard to
explain to someone else.
However, in a corporate setting this becomes quite problematic. We will need to be able to
explain why we recommend a certain solution. That is where Multi Criteria Decision Analysis
(MCDA) comes in: it is merely a structured method to think about multi-criteria problems and come
to some ranking of alternatives, but above all it provides insight into why one action or alternative
is preferable over another.
Actually, most business problems are MCDA problems. Consider for example the levels
of decision making proposed by Monsen and Downs (1965). Roughly translated to today's big
corporates, this boils down to:
1. Super-strategic: Mission statement (typically the founders, supervisory board and/or owners
have decided this) — this should not be up to discussion, so nothing to decide here
1 Imagine that people were bad at naturally solving multi criteria problems. In that case, simply buying something
in a shop would be a nightmare, as we would get stuck in a failing optimization over price and quality. Indeed,
the cheapest option is seldom the one that has the highest quality, so some trade-off is needed.
(but note that the company most probably started from a biased vision and a bold move on what
was actually a multi-criteria problem).
3. Operational Control / tactical: Typical middle management — some multi criteria problems,
but most probably other methods – described in Part V “Modelling” on page 373 – are
a better fit.2
So, the multi criteria problems that we will study in this chapter typically are more relevant
for the higher management.
2 Anyhow, if the problem can be reduced to a uni-criterion problem, then the other methods are always preferable.
When more than one function needs to be optimized simultaneously there is usually no unique
best solution and things start to get messy. As with all things in life, when the situation is messy,
a structured approach is helpful.
Actually, all types of modelling profit from a structured approach. For other models – such
as the ones discussed in Part V “Modelling” on page 373 – the question to be answered is usually
clear, but this is not really the case in multi-criteria problems. More often people mix things that
are of a very different nature and rather need a strategic choice. For example, if business is good but
the company has liquidity issues because of the delay between paying for raw materials and selling
the finished products, then one might consider being bought by a larger company, getting an overdraft
facility, issuing more shares, producing lower quality, saving costs, etc.
In the following, we present a – personal – view on how such a structured approach could look.
Make sure that the problem is well understood, that all ideas are on the table, that the environ-
ment is taken into account, and that we view the issue at hand through different angles and from
different points of view. Use for example exploratory techniques such as:
• SWOT analysis,
• 7Ps of Marketing,
• Business Model Canvas,
• make sure that the problem is within one level of decision (strategic / managerial / opera-
tional) — see p. 233.
Make sure that the question is well formulated and that it is the right question to ask at this
moment in these circumstances.
• Brainstorming techniques or focus groups to
This step makes the problem quantifiable. At the end of this step, we will have numbers for all
criteria for all alternative solutions.
If we miss data, we can sometimes mitigate this by adding a best estimate for that variable,
and then using “risk” as an extra parameter.
The work-flow can be summarised as follows:
1. Define how to measure all solutions for all criteria — make sure we have an ordinal scale
for all criteria – see Chapter B “Levels of Measurement” on page 829.
2. Collect all data so that you can calculate all criteria for all solutions.
3. Put these numbers in a “decision matrix” – see Section 27.4 “Step 3: the Decision Matrix” on
page 518.
4. Make sure that the decision matrix is as small as possible: can some criteria be combined
into one? For example, it might be useful to fit criteria such as the presence of tram, bus,
parking, etc. into one “commuting convenience” criterion.
2. The lowest alternative for each criterion has a value 0 and the highest equals 1.
1. Leave out all alternatives that do not satisfy the minimal criteria – possibly rethink the
minimal criteria.
2. Drop the non-optimal solutions (the “dominated” ones) – see Chapter 27.5.2 “Dominance
– Inefficient Alternatives” on page 521.
If the problem cannot be reduced to a mono criterion problem then we will necessarily have to
make some trade-off when selecting a solution. A – very subjective – top-list of multi criteria
decision methods (MCDMs) is the following.
1. Weighted sum method — Chapter 27.7.2 “The Weighted Sum Method (WSM)” on page 527
4. PCA analysis (aka “Gaia” in this context) — Chapter 27.7.6 “PCA (Gaia)” on page 553
In practice, we never make a model or analysis just out of interest, there is always a goal when we
do something. Doing something with our work is the reason why we do it in the first place. The
data scientist needs to help the management to make good decisions. Therefore it is necessary to
write good reports, make presentations, discuss the results and make sure that the decision maker
has understood your assumptions and has a fair idea of what the model does.
This step could also be called “do something with the results.” This step is not different from
any other model report and hence will be superseded by Part VII “Reporting” on page 685.
However, make sure to take the following into account:
• Connect back to the company, its purpose and strategic goals (steps 1 and 2)
• Conclude
• Make an initial plan (assuming an Agile approach).
• The set of all alternatives is A (in what follows we assume all alternatives to be
discrete, and A is finite and hence countable – we assume A possible alternatives
that are worth considering) — as opposed to continuous.a
• An alternative that cannot be rejected (is not dominated by, nor preferred under,
another alternative) is a solution.
a So, we consider in this chapter problems of choice and not problems of design.
The first step is to get the big picture and explore possible alternatives. There is no conclusive
method, no model, no statistical learning possible without a good understanding of the prob-
lem, the business, goals and strategy. Multiple methods have been proposed and used in most
corporations.
Methods such as SWOT analysis, for example, can be used to investigate ideas, identify new
ideas, focus ideas, etc. These methods explore ideas but are no decision models in themselves.
• Opportunities: Elements that the business or project could exploit to its advantage.
• Threats: Elements in the environment that could cause trouble for the business or
project.
While the words “strengths” and “opportunities” are very close to each other (just as
“weakness” and “threat”), the SWOT method uses the convention that the first two are
internal to the team/company and the last two are external. So, while a talented and motivated
workforce could be seen as an opportunity – in plain English – we will put it in the
box of “strengths,” and the fact that the competition is unable to attract talent should be
considered as an “opportunity.”
Which method to use really depends on the problem. Assume for now that we are working in
the analytics department of a large multinational that has a lot of data to take care of and analyse,
for example a large multinational bank. Let us call the bank “R-bank”, to honour the statistical
software that brings this book together. The company has large subsidiaries in many countries
and already started to group services in large “service centres” in Asia.
R-bank is UK based and till now it has 10 000 people working in five large service centres
in Asia and South America. These centres are in Bangalore, Delhi, Manilla, Hyderabad and São
Paulo. These cities also happen to be top destinations for Shared Service Centres (SSC) and Busi-
ness Process Outsourcing (BPO) – as presented by the Tholons index (see http://www.tholons.
com).3
The bank wants to create a central analytics function to support its modelling, and in one
go it will start building one central data warehouse with data scientists to make sense of it for
commercial and internal reasons (e.g. risk management).
3 BPO – Business Process Outsourcing – refers to services delivered by an external (not fully owned) specialized
company for third-party companies. SSC – Shared Service Centre – refers to a company that delivers services to
companies that belong to the same capital group. In our example, the SSC in Delhi can provide services to
multiple banks in multiple countries, and since both these customers (part of the R-bank group) and the SSC
of R-bank in Delhi are fully owned by R-bank Holding in the UK, we speak of an “SSC”.
As the bank already has SSC centres in low-cost countries, it is not a big step to imagine that
the centre of gravity of the new risk and analytics centre should also be in a shared service centre.
The question that arises is rather: in which SSC? Do we use an existing location or go to a new
city? The management of R-bank realizes that this is different from their other services, where the
processes are well structured, stable, mature and do not require massive communication between
the SSC and the headquarters.
This narrative defines the problem at hand and provides understanding of the business question,
limitations, strategy and goals. The next step is to identify the criteria to make our selection.
For possible destinations we retain the top ten of Tholons: Bangalore, Mumbai, Delhi, Manilla,
Hyderabad, São Paulo, Dublin, Kraków, Chennai, and Buenos Aires.
We use brainstorm sessions and focus groups to evaluate the existing SSCs, and have come up with
the following list of relevant criteria:
1. Talent: Availability of talent and skills (good universities and enough students)
5. Travel: Cost and convenience of travelling to the centre (important since we expect lots of
interaction between the headquarters and the SSC Risk and Analytics)
1. Talent: Use Tholons’ “talent, skill and quality” 2017 index – see http://www.tholons.
com
2. Stability: the 2017 political stability index of the World Bank – see http://info.
worldbank.org/governance/WGI
4. Cost inflation: “Annualized average growth rate in per capita real survey mean consumption
or income, total population (%)” from https://data.worldbank.org
5. Travel: Cost and convenience of travelling to the centre (important since we expect lots of
interaction between the headquarters and the SSC Risk and Analytics) – our assessment of
airline ticket price between R-bank’s headquarters, the travel time, etc.
6. Time-zone: Whether there is a big time-zone difference – this is roughly one point if in the
same time-zone as R-bank's headquarters, zero if more than 6 hours difference.
7. Infrastructure: Use Tholons’ “infrastructure” 2017 index – see http://www.tholons.com
8. Life quality: Use Tholons’ “risk and quality of life” 2017 index – see http://www.tholons.
com
9. International airport in close proximity: Not retained as a criterion, because all cities in
the Tholons top-10 have international airports.
These criteria, research, and data result in the raw decision matrix. This matrix is presented
in Table 27.1 on page 519.
All criteria will at least be evaluated on an ordinal scale4. While some are provided by reliable
sources, the science behind them is not as exact as measuring a distance. These indexes are usually to
be understood as qualitative KPIs. Even when measurable, these quantities will be defined with
lots of assumptions underpinning their value.
At this stage we notice that the correlation between travel cost and time-zone is too high to
keep the variables separate. So, we need to merge them into one indicator (we will refer to it as
“travel”).
The following code characterises this decision matrix.
M0 <- matrix(c(
1.6 , -0.83 , 1.4 , 4.7 , 1 , 0.9 , 1.1 ,
1.8 , -0.83 , 1.0 , 4.7 , 1 , 0.9 , 0.8 ,
1.8 , -0.83 , 1.2 , 4.7 , 1 , 0.9 , 0.6 ,
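  1.6 , -1.24 , 1.4 , 2.8 , 1 , 0.9 , 0.8 ,
  0.9 , -0.83 , 1.4 , 4.7 , 1 , 0.7 , 0.8 ,
  0.9 , -0.83 , 0.8 , 4.7 , 1 , 0.7 , 0.6 ,
  0.7 ,  1.02 , 0.2 , 2.0 , 3 , 1.1 , 1.3 ,
  1.1 ,  0.52 , 1.0 , 1.3 , 3 , 0.6 , 0.9 ,
  1.2 , -0.83 , 1.3 , 4.7 , 1 , 0.8 , 0.5 ,
  0.9 ,  0.18 , 0.9 , 7.3 , 1 , 0.8 , 0.6),
  byrow = TRUE, ncol = 7)
# Note: the original definition continues on a page that is not part of this
# extract; the rows and dimension names above and below are reconstructed
# from the printed M0 output shown further down.
colnames(M0) <- c("tlnt", "stab", "cost", "infl", "trvl", "infr", "life")
rownames(M0) <- c("BLR", "BOM", "DEL", "MNL", "HYD",
                  "GRU", "DUB", "KRK", "MAA", "EZE")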
Table 27.1: The decision matrix summarises the information that we have gathered. In this stage
the matrix will mix variables in different units, and even qualitative appreciations (e.g. high and
low).
M0
## tlnt stab cost infl trvl infr life
## BLR 1.6 -0.83 1.4 4.7 1 0.9 1.1
## BOM 1.8 -0.83 1.0 4.7 1 0.9 0.8
## DEL 1.8 -0.83 1.2 4.7 1 0.9 0.6
## MNL 1.6 -1.24 1.4 2.8 1 0.9 0.8
## HYD 0.9 -0.83 1.4 4.7 1 0.7 0.8
## GRU 0.9 -0.83 0.8 4.7 1 0.7 0.6
## DUB 0.7 1.02 0.2 2.0 3 1.1 1.3
## KRK 1.1 0.52 1.0 1.3 3 0.6 0.9
## MAA 1.2 -0.83 1.3 4.7 1 0.8 0.5
## EZE 0.9 0.18 0.9 7.3 1 0.8 0.6
One will notice that all data is already numeric. That was not so from the start. In some cases,
we need to transform ordinal labels into a numeric scale. As explained in Chapter B “Levels of
Measurement” on page 829, there is no correct or incorrect way of doing this. The best way is
to imagine that there is some quantity such as “utility” or “preference” and scale according to this
quantity.
# mcda_rescale_dm
# Rescales a decision matrix M
# Arguments:
# M -- decision matrix
# criteria in columns and higher numbers are better.
# Returns
# M -- normalised decision matrix
mcda_rescale_dm <- function (M) {
  colMaxs <- function(M) apply(M, 2, max, na.rm = TRUE)
  colMins <- function(M) apply(M, 2, min, na.rm = TRUE)
  M <- sweep(M, 2, colMins(M), FUN="-")
  M <- sweep(M, 2, colMaxs(M) - colMins(M), FUN="/")
  M
}
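The call that actually applies this function is not visible in this extract; the normalised matrix M that is used below is presumably created as:

M <- mcda_rescale_dm(M0)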
# mcda_get_dominated -- the original start of this function is on a missing
# page; the signature and loops here are a reconstruction (a sketch).
mcda_get_dominated <- function (M) {
  Dom <- matrix(data = 0, nrow = nrow(M), ncol = nrow(M))
  for (i in 1:nrow(M)) {
    for (j in 1:nrow(M)) {
      # Dom[i, j] == 1 when alternative i is dominated by alternative j:
      if (i != j && all(M[i, ] <= M[j, ]) && any(M[i, ] < M[j, ])) Dom[i, j] <- 1
    }
  }
  colnames(Dom) <- rownames(Dom) <- rownames(M)
  class(Dom) <- "prefM"
  Dom
}
The output of this function is an A × A matrix (with A the number of alternatives). It has zeros
and ones. A one in position (i, j) means ai ≺ aj (or “alternative i is dominated by alternative j”).
For the rest of this chapter we will look at preference relationships, and it might make more sense
to flip this around by transposing the matrix.
# mcda_get_dominants
# Finds the alternatives that dominate others
# Arguments:
# M -- normalized decision matrix with alternatives in rows,
# criteria in columns and higher numbers are better.
# Returns
# Dom -- prefM -- a preference matrix with 1 in position ij
# if alternative i dominates alternative j.
mcda_get_dominants <- function (M) t(mcda_get_dominated(M))
We see that
• Hyderabad (HYD) is dominated by Bangalore: it has a worse talent pool and lower quality
of life, while it scores the same for all other criteria.
This should not come as a massive surprise since our list of alternatives is the Tholons top-10
of 2017 (in that order). This helps to leave out some alternatives. In general, it does not make
sense to keep alternatives that are worse or equal to another. No decision method would prefer
Hyderabad, São Paulo or Chennai over Bangalore.
An easy way to leave out the dominated alternatives could be with the following function.
# mcda_del_dominated
# Removes the dominated alternatives from a decision matrix
# Arguments:
# M -- normalized decision matrix with alternatives in rows,
# criteria in columns and higher numbers are better.
# Returns
# A decision matrix without the dominated alternatives
mcda_del_dominated <- function(M) {
Dom <- mcda_get_dominated(M)
M[rowSums(Dom) == 0,]
}
This function allows us to reduce the decision matrix M to M1, which only contains alternatives
that are not dominated.
M1 <- mcda_del_dominated(M)
round(M1,2)
## tlnt stab cost infl trvl infr life
## BLR 0.82 0.18 1.00 0.12 0 0.6 0.75
## BOM 1.00 0.18 0.67 0.12 0 0.6 0.38
## DEL 1.00 0.18 0.83 0.12 0 0.6 0.12
## MNL 0.82 0.00 1.00 0.35 0 0.6 0.38
## DUB 0.00 1.00 0.00 0.57 1 1.0 1.00
## KRK 0.36 0.78 0.67 1.00 1 0.0 0.50
## EZE 0.18 0.63 0.58 0.00 0 0.4 0.12
It also makes clear that dominance is an approach that will leave a lot of alternatives
unranked. At this point, we will have to revert to strategies that are on the borderline between
science and art. However, using one of the MCDA methods that follow allows us to make very clear
how we came to the decision, and it creates a way to talk about it.
Should we rescale the decision matrix now that we have left out the dominated solutions?
In our example the place with the lowest score for “quality of life” has been left out. This
means that the range of this criterion is no longer between 0 and 1. If we do not rescale,
the results will be hard to interpret. So rescaling is a good idea.
At this point, it makes sense to rescale the decision matrix, since the lowest element of some
criteria might have been dropped. We can reuse the function mcda_rescale_dm() that we have
defined previously.
M1 <- mcda_rescale_dm(M1)
Before we explore some MCDA methods it is worth figuring out how to visualize preference
relationships. We have a simple example in the matrix Dom that shows which alternatives are
dominated (and hence have others that are preferred over them).
It appears that R has a package called diagram that provides a function, plotmat(), that is actually
designed to visualize a transition matrix, but it is also an ideal fit for our purpose here.
In the following code segment, we create a function plot.prefM() that will plot an object of
the class prefM – as defined in Chapter 27.5.2 “Dominance – Inefficient Alternatives” on page 521.
This function is based on the plotmat() function of the library diagram.
library(diagram)  # provides plotmat(), which is used below

# plot.prefM
# Specific function to handle objects of class prefM for the
# generic function plot()
# Arguments:
# PM -- prefM -- preference matrix
# ... -- additional arguments passed to plotmat()
# of the package diagram.
plot.prefM <- function(PM, ...)
{
  X <- t(PM) # We want arrows to mean '... is better than ...'
# plotmat uses the opposite convention because it expects flows.
plotmat(X,
box.size = 0.1,
cex.txt = 0,
lwd = 5 * X, # lwd proportional to preference
self.lwd = 3,
lcol = 'blue',
self.shiftx = c(0.06, -0.06, -0.06, 0.06),
box.lcol = 'blue',
box.col = 'khaki3',
box.lwd = 2,
relsize = 0.9,
box.prop = 0.5,
endhead = FALSE,
main = "",
...)
}
Note – Dots
The three dots allow the function plot.prefM() to accept additional arguments and pass
them on to the function plotmat() of the package diagram. This allows us to fine-tune the
plot later through plot.prefM().
Now, it becomes clear why the function mcda_get_dominated() changes the class attribute of
the S3 matrix object before returning it. By naming our function plot.prefM() we tell R that
whenever the dispatcher function plot() is used and the class of the object is “prefM”, we
expect R to plot it as below and not as a scatter plot (which is the default for a matrix).5 Hence, plotting
the dominance matrix is just one line of code (the output is in Figure 27.1).
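That line of code is not reproduced in this extract; presumably it amounts to:

Dom <- mcda_get_dominated(M)
plot(Dom)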
[Figure 27.1: the dominance relations between the ten candidate cities (BLR, BOM, DEL, MNL, HYD, GRU, DUB, KRK, MAA, EZE), plotted with plot.prefM().]
5 More information about how the object model is implemented in R is in Chapter 6 “The Implementation of OO”
on page 87 and more in particular Chapter 6.2 “S3 Objects” on page 91.
At this point, we have left out all dominated solutions and hence are by definition left with a set
of “efficient solutions.”
Definition: Efficient Solutions
An alternative is called efficient if there is no other alternative that dominates it.
The matrix M1 is the decision matrix for all our efficient alternatives. Actually, “dominance”
can be considered as the first non-compensatory MCDA method. We recognize two classes of
MCDA methods: compensatory methods that allow one strong point to compensate for another
weak point, and non-compensatory methods that do not allow for any compensation.
As we have seen, dominance only eliminated a few alternatives from our list, and most of them
could not really be ordered with dominance. That is typical for non-compensatory methods.
Compensatory methods allow weaknesses to be fully or partially compensated by strong points.
They will typically lead to a richer ranking. In the remainder of this chapter, we will study some
of those methods.
2. select the solution that has the highest weak attribute (0 in a normalized decision matrix)
• when the “a chain is as weak as its weakest link” reasoning makes sense.
• when one knows that the best of the best in one attribute is most important.
In our example, the dimensions (units) are anything but comparable. Also, if they were all to
be expressed in dollar terms, then why not aggregate them and reduce our multi criteria problem
to a mono-criterion optimization?
While these methods are of limited practical use, they are important food for thought.
$$\max_{a \in A} \{ N(a) \}$$
where M is the decision matrix in which each element is transformed according to a certain function.
The key thing – especially for the mathematician – is to understand that weights are assigned
to “differences in preference” and not to the criteria as such. Since everything is then expressed
in the unit “preference,” we are indeed allowed to add the scores.
The main advantage of this method is that it is as clear as a logistic regression. Anyone who
can count will understand what is happening. So, the WSM method is an ideal tool to stimulate
discussion.
In R this can be obtained as follows.
# mcda_wsm
# Calculates the weighted sum score for each alternative
# Arguments:
# M -- normalized decision matrix with alternatives in rows,
# criteria in columns and higher numbers are better.
# w -- numeric vector of weights for the criteria
# Returns
# a vector with a score for each alternative
mcda_wsm <- function(M, w) {
X <- M %*% w
colnames(X) <- 'pref'
X
}
At this point, we need to assign a “weight” to each criterion. This needs to be done in close
collaboration with the decision makers. For example, we might argue that the infrastructure is
less important since our SSC will be small compared to the existing ones, etc.
Taking into account that the SSC will not be very large, that we cannot expect employees
just to be ready (so we will do a lot of training ourselves and work with universities to fine-tune
curricula, etc.), we need a long time to set up such centre of expertise and hence need stability,
etc. we came up with the following weights.
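The vector of weights itself is on a page that is not part of this extract; a plausible choice (the numbers below are purely illustrative – one weight per criterion, with infrastructure weighted lowest) could be:

w <- c(0.20, 0.20, 0.15, 0.10, 0.15, 0.05, 0.15)   # hypothetical weights
names(w) <- colnames(M1)  # tlnt, stab, cost, infl, trvl, infr, life
mcda_wsm(M1, w)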
The WSM almost always produces a complete ranking, but it is of course sensitive to the weights,
which – actually – are arbitrary. So, with this method it is not so difficult to make sure that the
winning solution is the one that we preferred from the start anyhow.
The complete ranking can be represented with plotmat(), but it might make more sense to use
ggplot2. To do so neatly, we take a step back, re-write the function mcda_wsm() and make it
return a “matrix of scores” (“scoreM” for short).
# mcda_wsm_score
# Returns the scores for each of the alternative for each of
# the criteria weighted by their weights.
# Arguments:
# M -- normalized decision matrix with alternatives in rows,
# criteria in columns and higher numbers are better.
# w -- numeric vector of weights for the criteria
❦ # Returns ❦
# a score-matrix of class scoreM
mcda_wsm_score <- function(M, w) {
  X <- sweep(M, MARGIN = 2, w, `*`)
class(X) <- 'scoreM'
X
}
# plot.scoreM
# Specific function for an object of class scoreM for the
# generic function plot().
# Arguments:
# M -- scoreM -- score matrix
# Returns:
# plot
plot.scoreM <- function (M) {
# 1. order the rows according to rowSums
M <- M[order(rowSums(M), decreasing = T),]
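  # 2. The remainder of this function is on a page that is not part of this
  #    extract; the completion below is a sketch (not the author's code) that
  #    stacks the weighted scores per alternative, similar to Figure 27.2:
  library(ggplot2)
  df <- data.frame(
    alternative = factor(rep(rownames(M), ncol(M)), levels = rev(rownames(M))),
    criterion   = rep(colnames(M), each = nrow(M)),
    score       = as.vector(unclass(M))
    )
  ggplot(df, aes(x = alternative, y = score, fill = criterion)) +
    geom_col() +
    coord_flip() +
    ylab("Score")
}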
This function is now ready to be called automatically via the dispatcher function plot(). The
code is below and the result is in Figure 27.2.
sM <- mcda_wsm_score(M1, w)
plot(sM)
[Figure 27.2: stacked weighted scores per alternative; each bar is composed of the criteria tlnt, stab, cost, infl, trvl, infr and life, and the horizontal axis shows the total score (0.0–0.7).]
According to our weighting, which attached reasonably similar weights to all criteria, Krakow
seems to be a good solution. However, note that this city scores lowest on “infrastructure.” Before
presenting the solution to the board, we need of course to understand what this really means.
Looking deeper into what Tholons is doing, this is due to the fact that the city does not have a subway
system, public transport relies on trams and buses that both need investment, the ring-road is not
finished yet (but there are plans to add three levels of ring-roads, etc.), it misses large office spaces,
etc. For our intent, which is building up an SSC with highly skilled workers, this is not a major
concern. By all standards R-bank's investment in Krakow would be modest and even medium
office spaces should suffice.
The next concern is that it scores low on “talent.” Again we need to investigate what this really
means. Krakow – with its 765 000 inhabitants – is a lot smaller than for example Bangalore that
hosts 23.3 million people. At this point, it only employs about 65 000 employees in service centres,
but it has great universities and 44 000 students graduate every year. While this might be an issue
for a large operation of 10 000 people, it should not be a major concern for our purpose.
While these considerations are rather particular and not of general interest, it is important
to realize that MCDA more than any other method needs a lot of common sense and
interpretation.
$$P(a_i) = \prod_{j=1}^{n} (m_{ij})^{w_j}$$
This form of the WPM is often called dimensionless analysis because its mathematical structure
eliminates any units of measure. Note however, that it requires a ratio scale.
It is less easy to understand what the coefficients of WPM do compared to the WSM. It is also
possible to combine both. For example, build a score “human factors” as the product of safety,
housing quality, air quality, etc. This “human factor” can then be fed into an additive model such
as WSM. This approach can also be useful for scorecards.
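No code for the WPM is given at this point in the extract; a minimal sketch of the idea in R (note that it needs strictly positive, ratio-scaled scores, so the 0–1 normalised matrix M1 is not directly suitable) could be:

# Weighted product: P(a_i) = prod_j m_ij ^ w_j
mcda_wpm <- function(M, w) {
  apply(sweep(M, MARGIN = 2, w, FUN = `^`), MARGIN = 1, FUN = prod)
}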
27.7.4 ELECTRE
If it is impossible to find, express, and calculate a meaningful common variable (such as “utility”), then
we try to find at least a preference structure that can be applied to all criteria.7 If such a preference
structure π exists, and is additive, then we can calculate an inflow and outflow of preference for
each alternative.
If the decision matrix M has elements mik, then we prefer the alternative ai over the alternative
aj for criterion k if mik > mjk. In other words, we prefer alternative i over alternative j for
7 If it is possible to define “Utility,” then the multi-criteria problem reduces to a mono-criterion problem that is
easily solved. In reality however, it is doubtful if “utility” exists and even if it would, it will be different for each
decision maker and it might not even be possible to find its expression (if that exists).
criterion k if its score is higher for that criterion. The amount of preference can be captured by a
function Π().
In ELECTRE the preference function is supposed to be a step-function.
Definition: Preference of one solution over another

$$\pi^+(a_i, a_j) := \sum_{k=1}^{K} \pi_k(m_{ik} - m_{jk}) \, w_k$$

$$\pi^-(a_i, a_j) := \sum_{k=1}^{K} \pi_k(m_{jk} - m_{ik}) \, w_k$$

We note that:

$$\begin{aligned}
\pi^+(a_i, a_j) &= \sum_{k=1}^{K} \pi_k(m_{ik} - m_{jk}) \, w_k \\
&= -\sum_{k=1}^{K} \pi_k(m_{jk} - m_{ik}) \, w_k \\
&= -\pi^+(a_j, a_i) \\
&= -\pi^-(a_i, a_j) \\
&= \pi^-(a_j, a_i)
\end{aligned}$$
Even with a preference function π() that is a strictly increasing function of the difference
in score, it might be that some solutions have the same score for some criteria and hence are
incomparable for these criteria. So, it makes sense to define a degree of “indifference.”
$$\pi^0(a_i, a_j) := 1 - \pi^+(a_i, a_j) - \pi^-(a_i, a_j)$$
These weights can be the same as in the Weighted Sum Method. However, in general there
is no reason why they would be the same.
This way of working makes a lot of sense. Referring to our previous example, we see
that Dublin outperforms Krakow for political stability, infrastructure and life quality. We also see
that Krakow does better for talent pool, cost, and wage inflation and hence we prefer Krakow for
these criteria. They are in the same time-zone and travel costs for almost all relevant stakeholders
will be the same. For travel cost, one is indifferent between Krakow and Dublin.
27.7.4.1 ELECTRE I
We now have three preference matrices: Π+, Π−, and Π0. Based on these matrices we can devise
multiple preference structures. The first method is “ELECTRE I” and requires the calculation of a
comparability index.
There are two particularly useful possibilities for this index of comparability. We will call them
C1 and C2.
Definition: Index of comparability of Type 1

$$C_1(a, b) = \frac{\Pi^+(a, b) + \Pi^0(a, b)}{\Pi^+(a, b) + \Pi^0(a, b) + \Pi^-(a, b)}$$
Note that C1 (a, b) = 1 ⇔ aDb. This, however, should not be the case in our example as we already
left out all dominated solutions.
Definition: Index of comparability of Type 2
$$C_2(a, b) = \frac{\Pi^+(a, b)}{\Pi^-(a, b)}$$
Further, the decision maker chooses:
• for the comparability index a cut-off level, and considers the alternatives as equally interesting
if Ci < Λi;
• for each criterion a maximal discrepancy in the “wrong” direction if a preference would be
stated: rk, k ∈ {1 . . . K}. This avoids that a solution a is preferred over b while it is
much worse than b for at least one criterion.
With all those definitions we can define the preference structure as follows:

• for C1:
$$\left.\begin{array}{l} \Pi^+(a,b) > \Pi^-(a,b) \\ C_1(a,b) \ge \Lambda_1 \\ \forall j: d_j(a,b) \le r_j \end{array}\right\} \Rightarrow a \succ b$$

• for C2:
$$\left.\begin{array}{l} \Pi^+(a,b) > \Pi^-(a,b) \\ C_2(a,b) \ge \Lambda_2 \\ \forall j: d_j(a,b) \le r_j \end{array}\right\} \Rightarrow a \succ b$$
In a last step, one can present the results graphically and present the kernel (the best solutions) to
the decision makers. The kernel consists of all alternatives that are “efficient” (there is no other
alternative that is preferred over them):

$$K = \{a \in A \mid \nexists b \in A : b \succ a\}$$
ELECTRE I in R
Below is one way to program the ELECTRE I algorithm in R. One of the major choices that we
made was to create a function with a side effect. This is not the best solution if we want others to
use our code (e.g. if we would like to wrap the functions in a package). The alternative would be
to create a list of matrices that could then be returned by the function.
Since we only call the following function from within another function, this is not toxic and
suits our purpose well.
# mcda_electre Type 2
# Pushes the preference matrices PI.plus, PI.min and
# PI.indif into the environment that calls this function.
# Arguments:
# M -- normalized decision matrix with alternatives in rows,
# criteria in columns and higher numbers are better.
❦ # w -- numeric vector of weights for the criteria ❦
# Returns nothing but leaves as side effect:
# PI.plus -- the matrix of preference
# PI.min -- the matrix of non-preference
# PI.indif -- the indifference matrix
mcda_electre <- function(M, w) {
# initializations
PI.plus <<- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
PI.min <<- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
PI.indif <<- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
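  # The rest of this function body falls on a page that is not part of this
  # extract. A minimal reconstruction, consistent with the step-function
  # preferences defined above (a sketch, not the author's exact code):
  for (i in 1:nrow(M)) {
    for (j in 1:nrow(M)) {
      for (k in 1:ncol(M)) {
        if (M[i, k] > M[j, k]) {
          PI.plus[i, j] <<- PI.plus[i, j] + w[k]
        } else if (M[i, k] < M[j, k]) {
          PI.min[i, j] <<- PI.min[i, j] + w[k]
        } else if (i != j) {
          PI.indif[i, j] <<- PI.indif[i, j] + w[k]
        }
      }
    }
  }
}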
This function can now be called in an encapsulating function which calculates the ELECTRE
preference matrix.
# mcda_electre1
# Calculates the preference matrix for the ELECTRE method
# Arguments:
# M -- decision matrix (colnames are criteria, rownames are alternatives)
# w -- vector of weights
# Lambda -- the cutoff for the levels of preference
# r -- the vector of maximum inverse preferences allowed
# index -- one of ['C1', 'C2']
# Returns:
# object of class prefM (preference matrix)
mcda_electre1 <- function(M, w, Lambda, r, index='C1') {
# get PI.plus, PI.min and PI.indif
mcda_electre(M,w)
# initializations
CM <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
PM <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
colnames(PM) <- rownames(PM) <- rownames(M)
Then, in the next function, mcda_electre1(), we should not address the PI.* matrices directly
as PI.x[i, j], but rather as follows:
The function mcda_electre1() is now ready for use. We need to provide the decision matrix,
the weights, the cut-off value, and a vector of maximum inverse preferences. The code below
does this, prints the preference relations as a matrix, and finally plots them with our custom method
plot.prefM() in Figure 27.3 on page 536.
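The call itself is not reproduced in this extract; it is presumably of the following form (the values for Lambda and r are illustrative, not the author's):

eM <- mcda_electre1(M1, w, Lambda = 0.6, r = rep(0.5, ncol(M1)), index = 'C1')
print(eM)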
plot(eM)
Each arrow in Figure 27.3 on page 536 means “. . . is preferred over . . . .” We also notice that the
preference relationship in ELECTRE I is transitive, so we might leave out arrows that span over
another alternative. Some fiddling would allow us to get rid of those in the preference matrix and
we could use argument pos for the function plotmat in order to fine tune how the cities have to
be plotted.
Our plotting function shows each preference relation. However, in ELECTRE these preference
relations are transitive. That means that if KRK is preferred over DUB, and DUB is preferred
over EZE, then necessarily KRK is also preferred over EZE. That means that we could reduce the
number of arrows. That idea is presented in Figure 27.4 on page 536.
Figure 27.3: The preference structure as found by the ELECTRE I method given all parameters in
the code.
Besides the plot that we obtain automatically via our function plot.prefM(), it is also possible
to create a plot that uses the transitivity to make the image lighter and easier to read. This is
presented in Figure 27.4.
Figure 27.4: Another representation of Figure 27.3. It is clear that Krakow and Bangalore are quite
different places for an SSC. Therefore, they are not ranked relative to each other, and choosing between
them means making compromises.
All the findings from a method such as ELECTRE I (or II) should be interpreted with the
utmost care! All conclusions are based on many parameters, and sometimes moving
them a little will yield a slightly different picture. So it is worth spending a while tweaking
so that (a) most alternatives are ranked but (b) alternatives that are too dissimilar are not ranked
relative to each other. This allows us to better understand what is going on in this particular
multi criteria problem.
We can try the same with the C2 comparability index. This is done in the following code, and
the results are visualised in the last line – the output is in Figure 27.5 on page 537.
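The code is not part of this extract; it presumably differs from the previous call only in the index argument (again, Lambda and r are illustrative values):

eM2 <- mcda_electre1(M1, w, Lambda = 0.6, r = rep(0.5, ncol(M1)), index = 'C2')
plot(eM2)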
Figure 27.5: The results of ELECTRE I with comparability index C2 and parameters as in the aforementioned code.
Besides the plot that we obtain automatically via our function plot.prefM(), it is also possible to create a plot that uses the transitivity to make the image lighter and easier to read. This is presented in Figure 27.6.
Figure 27.6: The results for ELECTRE I with comparability index C2. An arrow A → B means "A is better than B." In this approach and with this visualisation, we find three top locations: Krakow, Bangalore and Manila.
The figures and calculations above seem to indicate that, given this set of alternatives and our choice of criteria and weights thereof,
1. Krakow and Bangalore are good choices, but for very different reasons;
3. The Asian cities and the European cities form two distinct groups, so that within each group a ranking is possible, but it is harder to rank alternatives across groups.
27.7.4.2 ELECTRE II
The strength of ELECTRE I is that it does not force a ranking onto alternatives that are too dif-
ferent. But that is of course its weakness too. In a boardroom it is easier to cope with a simple
ranking than with a more complex structure.
Hence, the idea of ELECTRE II was born: force a complete ranking by
• gradually lowering the cut-off level Λ, and
• gradually relaxing the maximum inverse preferences r.
In our example, r needs to be equal to the unit vector and Λ can be zero in order to obtain a full ranking. The code below uses these values and plots the preference relations in Figure 27.7 on page 539.
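A minimal sketch of that call (using the unit vector for r and Λ = 0, as described above):

r  <- rep(1, ncol(M1))   # unit vector: maximum inverse preference allowed everywhere
eM <- mcda_electre1(M1, w, Lambda = 0, r = r)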
plot(eM)
Note – Alternative
Note also that in a production environment we should build in further tests (e.g. the number of columns of M1 should correspond to the length of w).
The plot now shows an arrow between each pair of alternative solutions, so all alternatives are comparable. Since the preference relationship is transitive, we can summarise this complex plot as in Figure 27.8.
Figure 27.7: The preference structure as found by the ELECTRE II method given all parameters in
the code.
Figure 27.8: The results for ELECTRE II, summarised by leaving out the arrows that are implied by transitivity.
Note – ELECTRE II
It is not necessary to put r equal to the unit vector and Λ at zero. We could program a goal-seek algorithm using the fact that the preference matrix associated with a full ranking is an upper triangular matrix of all ones (though the rows might not be in order). So, one will find that such a matrix satisfies the following equation.
sum(rowSums(prefM)) == A * A - A
# with prefM the preference matrix and
# A the number of alternatives.
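A minimal sketch of such a goal-seek loop; the stopping test (every off-diagonal pair comparable in at least one direction) and the step size are our assumptions, not the book's code:

Lambda <- 1
repeat {
  eM <- mcda_electre1(M1, w, Lambda = Lambda, r = r)   # r: as defined earlier
  # complete ranking: every pair (i, j) has a preference in at least one direction
  comparable <- (unclass(eM) + t(unclass(eM)))[upper.tri(eM)] > 0
  if (all(comparable) || Lambda <= 0) break
  Lambda <- Lambda - 0.05
}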
Disadvantages
• There is still an "abstract" concept "preference," which has little meaning and no direct interpretation.
• So to some extent it is still the case that concepts that are expressed in different units are compared in a naive way.
27.7.5 PROMethEE
PROMethEE is short for "Preference Ranking Organization METHod for Enrichment of Evaluations".8 In some way, it is a generalisation of ELECTRE. In ELECTRE the preference relation was binary (0 or 1). So, the obvious generalisation is allowing the preference to be a number between 0 and 1. This number can then be calculated in a variety of different ways: that is what PROMethEE does.
• This 0-or-1-relation (black or white) can be replaced by a more gradual solution with dif-
ferent shades of grey.
• This preference function will be called πk (a, b) and it can be different for each criterion.
The idea is that the preference for alternative a_i over a_j can be expressed as a function of the weighted sum of the differences of their scores m_{ik} in the decision matrix.
\pi(a_i, a_j) = \sum_{k=1}^{K} P_k(m_{ik} - m_{jk})\, w_k \qquad (27.1)
             = \sum_{k=1}^{K} P_k\big(d_k(a_i, a_j)\big)\, w_k \qquad (27.2)
8 PROMethEE is accompanied by a complementary method for "geometrical analysis for interactive aid" – better known as Gaia.
The key assumption is that it is at all possible to define a reasonable preference function for each criterion k and that these preferences are additive.
We notice that d_k(a, b) ∈ [−1, +1] and hence for each criterion we need to find a function that over the domain [−1, +1] yields a preference that we can also coerce into [0, +1] without loss of generality. Note that these preferences should indeed not become negative, so they cannot compensate.
The preference function π(a, b) is typically considered to be zero on [−1, 0] (so on the part where b has a better score than a). On the other part it will increase to a maximum of 1 as a function of m_{ak} − m_{bk}.
Some possible smooth scaling functions are shown in Figure 27.9. For the use of PROMethEE, we should avoid the preference function becoming negative (otherwise we would allow full compensation of weaknesses). So, in practice and for PROMethEE we will use only preference functions that are positive on one side. Some examples of that type of function are in Figure 27.10 on page 542.
Figure 27.9: Examples of smooth transition schemes for preference functions π(d). The “d” is to be
understood as the difference in score for a given criterion.
The code for the last plots is not included here, but you can find it in the appendix: Chap-
ter D “Code Not Shown in the Body of the Book” on page 841. It includes some useful aspects
such as fiddling with legends in ggplot2, using LaTeX markup in plots, and vectorizing
functions.
Figure 27.10 on page 542 shows a few possibilities for the preference functions, but we can of course invent many more. Further, we note – similar to WSM – that not all criteria are
(Figure 27.10 consists of six panels; the panel titles show the preference functions plotted, e.g. ifelse(x < 0, 0, -3/2 * x^5 + 5/2 * x^3) and ifelse(x < 0, 0, sin(pi * x/2)).)
Figure 27.10: Examples of practically applicable preference functions P(d). The "x" is to be understood as the difference in score for a given criterion. Note that all scaling factors are optimized for a normalized decision matrix. The function gaus() refers to exp(−(x − 0)²/0.5).
equally important in the decision matrix and hence need to be weighted. So for each criterion k ∈ {1, …, K} we could multiply the preference function π() by a weight w_k.
Note further that the preference function for each criterion k can be different, so in general we can define the degree of preference as follows.
It is important to understand that there is no exact science for choosing a preference function,
nor is there any limit on what one could consider. Here are some more ideas.
Examples:
• \pi(d) = \tanh(d)
• \pi(d) = \operatorname{erf}\!\left(\frac{\sqrt{\pi}}{2}\, d\right)
• \pi(d) = \frac{d}{\sqrt{1 + d^2}}
• Gaussian: \pi(d) = \begin{cases} 0 & \text{for } d < 0 \\ 1 - \exp\left(-\frac{(d - d_0)^2}{2 s^2}\right) & \text{for } d \ge 0 \end{cases}
• ...
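In R, such preference functions are simply one-argument functions that can later be supplied via the piFUNs argument of our PROMethEE implementation. A small sketch (the particular choices and names are ours, and the negative side is clipped to zero, as required for PROMethEE):

p_tanh  <- function(d) max(0, tanh(d))                    # clipped hyperbolic tangent
p_sine  <- function(d) ifelse(d < 0, 0, sin(pi * d / 2))  # sine-based example from Fig. 27.10
p_gauss <- function(d) ifelse(d < 0, 0, 1 - exp(-d^2 / 0.5))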
27.7.5.1 PROMethEE I
Note that in PROMethEE I the preference π(a_i, a_j) does not become negative. Otherwise – when added – we would lose information (i.e. differences in opposite directions would be compensated).
This preference function allows us to define a flow of how much each alternative is preferred, Φ⁺(a_i), as well as a measure of how much other alternatives are preferred over this one: Φ⁻(a_i). The process is as follows.
2. They should only depend on the difference between the scores of each alternative as summarized in the decision matrix: d_k(a_i, a_j) = m_{ik} − m_{jk}.
3. Define a preference index: \Pi(a_i, a_j) = \sum_{k=1}^{K} w_k \pi_k(a_i, a_j).
4. Then sum all those flows for each solution – alternative – to obtain
(a) a positive flow: \Phi^+(a_i) = \frac{1}{K-1} \sum_{a_j \in A} \Pi(a_i, a_j) = \frac{1}{K-1} \sum_{k=1}^{K} \sum_{j=1}^{A} w_k \pi_k(a_i, a_j)
(b) a negative flow: \Phi^-(a_i) = \frac{1}{K-1} \sum_{a_j \in A} \Pi(a_j, a_i) = \frac{1}{K-1} \sum_{k=1}^{K} \sum_{j=1}^{A} w_k \pi_k(a_j, a_i)
Based on these flows, we can define the preference relations for PROMethEE I as follows:
a \succ b \;\Leftrightarrow\; \big(\Phi^+(a) \ge \Phi^+(b) \wedge \Phi^-(a) < \Phi^-(b)\big) \;\text{or}\; \big(\Phi^+(a) > \Phi^+(b) \wedge \Phi^-(a) \le \Phi^-(b)\big)
PROMethEE I in R
We will first define a base function that calculates the flows Φ and pushes the results into the environment one level higher (similar to the approach for the ELECTRE method).
# mcda_promethee
# delivers the preference flow matrices for the Promethee method
# Arguments:
# M -- decision matrix
# w -- weights
# piFUNs -- a list of preference functions,
# if not provided min(1,max(0,d)) is assumed.
# Returns (as side effect)
# phi_plus <<- rowSums(PI.plus)
# phi_min <<- rowSums(PI.min)
# phi_ <<- phi_plus - phi_min
#
mcda_promethee <- function(M, w, piFUNs='x')
{
  if (identical(piFUNs, 'x')) {
    # create a factory function for the default preference function max(0, d):
    makeFUN <- function(x) {x; function(x) max(0,x) }
    P <- list()
    for (k in 1:ncol(M)) P[[k]] <- makeFUN(k)
  } else {
    P <- piFUNs  # in all other cases we assume a list of preference functions
  }
# initializations
PI.plus <<- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
PI.min <<- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
# calculate the preference matrix
for (i in 1:nrow(M)){
for (j in 1:nrow(M)) {
for (k in 1:ncol(M)) {
if (M[i,k] > M[j,k]) {
PI.plus[i,j] = PI.plus[i,j] + w[k] * P[[k]](M[i,k] - M[j,k])
}
if (M[j,k] > M[i,k]) {
PI.min[i,j] = PI.min[i,j] + w[k] * P[[k]](M[j,k] - M[i,k])
}
}
}
}
# note the <<- which pushes the results to the upwards environment
phi_plus <<- rowSums(PI.plus)
phi_min <<- rowSums(PI.min)
phi_ <<- phi_plus - phi_min
}
If you want to avoid using side effects to pass on the three objects, how would you rewrite this code so that they are returned to the environment that calls the function? Before you look up the answer, you might want to get some inspiration from the previous section about the ELECTRE method: read the "digression box" in that section.
In some literature, the preference functions are defined as being symmetric under rotation around the y-axis. This can be quite confusing and is best avoided; however, the aforementioned code will work with such functions too.
Now, we can define a function mcda_promethee1() that calls the function mcda_promethee() to define the preference flows.
# mcda_promethee1
# Calculates the preference matrix for the Promethee1 method
# Arguments:
# M -- decision matrix
# w -- weights
# piFUNs -- a list of preference functions,
# if not provided min(1,max(0,d)) is assumed.
# Returns:
# prefM object -- the preference matrix
#
mcda_promethee1 <- function(M, w, piFUNs='x') {
# mcda_promethee adds phi_min, phi_plus & phi_ to this environment:
  mcda_promethee(M, w, piFUNs)   # forward the preference functions (if any)
  pref <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
  for (i in 1:nrow(M)){
    for (j in 1:nrow(M)) {
      if (((phi_plus[i] >= phi_plus[j]) & (phi_min[i] <  phi_min[j])) |
          ((phi_plus[i] >  phi_plus[j]) & (phi_min[i] <= phi_min[j]))) {
        pref[i,j] = 1     # i is preferred over j (PROMethEE I relation)
      } else if (((phi_plus[j] >= phi_plus[i]) & (phi_min[j] <  phi_min[i])) |
                 ((phi_plus[j] >  phi_plus[i]) & (phi_min[j] <= phi_min[i]))) {
        pref[i,j] = 0     # j is preferred over i
      } else {
        pref[i,j] = NA    # indifferent or incomparable
}
}
}
rownames(pref) <- colnames(pref) <- rownames(M)
class(pref) <- 'prefM'
pref
}
All that is left now is to execute the function that we created in the previous code segment. The object m is then the preference matrix of class prefM, and we can plot it as usual – result in Figure 27.11.
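A minimal sketch of that call (reusing the decision matrix M1 and the weights w from before):

m <- mcda_promethee1(M1, w)   # default preference function max(0, d)
plot(m)                       # result as in Figure 27.11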
Again, it is possible to simplify the scheme of Figure 27.11 by leaving out the arrows that are implied by transitivity. This scheme is in Figure 27.12 on page 547.
The function that we have created can also take a list of preference functions via its piFUNs
argument. Below, we illustrate how this can work and we plot the results in Figure 27.13 on
page 547.
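A sketch of how a list of preference functions could be passed (the particular functions are illustrative, and this assumes that mcda_promethee1() forwards piFUNs to mcda_promethee() as above):

# one preference function per criterion, collected in a list
myPiFUNs <- lapply(1:ncol(M1), function(k) function(d) ifelse(d < 0, 0, sin(pi * d / 2)))
m2 <- mcda_promethee1(M1, w, piFUNs = myPiFUNs)
plot(m2)   # produces a plot such as Figure 27.13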
Figure 27.12: The preference relations resulting from PROMethEE I. For example, this shows that the least suitable city would be Buenos Aires (EZE). It also shows that both Krakow (KRK) and Bangalore (BLR) would be good options, but PROMethEE I is unable to tell us which of the two is best; they cannot be ranked based on this method.
Figure 27.13: The result for PROMethEE I with different preference functions provided.
Note that besides the plot that we obtain automatically via our function plot.prefM(), it is
also possible to create a plot that uses the transitivity to make the image lighter and easier to read.
This is presented in Figure 27.14.
Figure 27.14: The results for PROMethEE I method with the custom preference functions.
Interestingly, the functions that we have provided do change the preference structure as found by PROMethEE I, and even the main conclusions differ. The main changes are that KRK became comparable to BLR, and MNL to DEL.
PROMethEE I does not yield a complete ranking (it does not define a total order relation), but along the way we gathered good insight into the problem. Below we list some of the most important
advantages and disadvantages of this method.
Advantages:
• It is easier and makes more sense to define a preference function than to choose the parameters Λ and r in ELECTRE.
• It seems to be stable for the addition and deletion of alternatives (ELECTRE and WPM have been proven to be inconsistent in this respect).
Disadvantages:
• It does not readily give much insight into why a solution is preferred.
• There are a lot of arbitrary choices to be made, and those choices can influence the result.
27.7.5.2 PROMethEE II
The idea of PROMethEE II is to allow "full compensation" between positive and negative flows (similar to ELECTRE II) and hence to reduce the preference flow calculations to just one flow:

\Phi(a, b) = \sum_{j=1}^{k} \pi_j\big(f_j(a), f_j(b)\big)

where here the π_j(a, b) can be negative and are symmetric under mirroring around the axis y = −x. They can also be considered as the concatenation of Φ⁺ and Φ⁻ as follows: Φ = max(Φ⁺, Φ⁻).
The result of this adjusted mechanism will be that we lose some of the insight9, but gain a total ordering of the alternatives.
9 If, for example, city A is much better on three criteria than city B, but for three other criteria this relation is opposite, we might want to stop and study the matter more deeply. PROMethEE II, however, will not stop, and the small difference on the seventh criterion could determine the ranking.
We can condense this information further for each alternative:
\Phi(a) = \sum_{x \in A} \sum_{j=1}^{k} \pi_j\big(f_j(a), f_j(x)\big) = \sum_{x \in A} \pi(a, x)
This results in a preference relation that will show a difference in almost all cases (in a small number of cases there is indifference, but all alternatives are comparable – there is no "no preference").
A key element for PROMethEE is that the preference function can be defined by the user.
In Figure 27.15 on page 551 we present some functions that can be used to derive a preference
from a distance between values of a criterion. While it might be hard to prove the need for other
functions, your imagination is the real limit.
We can reuse the function mcda_promethee() – which provides the preference flows in the environment that calls it – and build upon it the function mcda_promethee2().
# mcda_promethee2
# Calculates the Promethee2 preference matrix
# Arguments:
# M -- decision matrix
# w -- weights
# piFUNs -- a list of preference functions,
# if not provided min(1,max(0,d)) is assumed.
# Returns:
# prefM object -- the preference matrix
#
mcda_promethee2 <- function(M, w, piFUNs='x')
{ # promethee II
  mcda_promethee(M, w, piFUNs)   # forward the preference functions (if any)
pref <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
for (i in 1:nrow(M)){
for (j in 1:nrow(M)) {
pref[i,j] <- max(phi_[i] - phi_[j],0)
}
}
rownames(pref) <- colnames(pref) <- rownames(M)
class(pref) <- 'prefM'
pref
}
We can reuse the decision matrix M1 with weights w; we can define our own preference functions or use the standard one that is built in, max(0, d).
m <- mcda_promethee2(M1, w)
plot(m)
The output of this code is the plot in Figure 27.16 on page 552. The preference matrix is now no longer binary; the numbers are the relative flows for each alternative, and the preference map will have a preference relation between each pair of alternatives.
Because of this total relation it is also possible to represent a Promethee II plot as a barchart
that represents “preference.” The following code does this and the output of this code is in Fig-
ure 27.17 on page 552.
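One way to obtain such a stacked bar chart from the preference matrix m (a sketch, not necessarily the book's exact plotting code):

# one horizontal bar per alternative, stacked by its preference over the others
barplot(t(unclass(m)), horiz = TRUE, las = 1, xlab = "Score")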
Advantages
Disadvantages
Figure 27.15: Promethee II can also be seen as using a richer preference structure that can become negative. Here are some examples of practically applicable preference functions P(d). The function gaus() refers to exp(−(x − 0)²/0.5).
Figure 27.16: The hierarchy between alternatives as found by PROMethEE II. The thickness of the
lines corresponds to the strength of the preference.
Figure 27.17: PROMethEE II provides a full ranking. Here we show how much each alternative is preferred over its competitors. The size of the blocks is proportional to the amount of preference over the other alternative.
27.7.6 PCA (Gaia)
(Figure: scree plot of pca1, showing the variances of the principal components.)
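The object pca1 is a prcomp object; a minimal sketch of how it could have been created (whether the decision matrix was rescaled first is an assumption):

pca1 <- prcomp(M1)   # principal component analysis of the (normalised) decision matrix
summary(pca1)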
# plot for the prcomp object shows the variance explained by each PC
plot(pca1, type = 'l')
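A projection such as the one in Figure 27.19 can be obtained with base R's biplot() function (a sketch):

biplot(pca1)   # alternatives and criteria in the plane of the first two PCs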
Figure 27.19: A projection of the space of alternatives in the 2D-plane formed by the two most dom-
inating principal components.
As mentioned earlier, with ggplot2 and ggfortify it is also easy to obtain professional results with little effort. The code below does this and shows two versions: first, with the labels coloured according to cost (in Figure 27.20 on page 555), and second, with the visualisation of two clusters in Figure 27.21 on page 555.
library(ggplot2)
library(ggfortify)
library(cluster)
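A sketch of what the two autoplot() calls could look like (the argument choices below are assumptions based on the figures; only the column name 'cost' and the use of pam() with two clusters are taken from the surrounding text):

df <- as.data.frame(M1)
# version 1: labels coloured according to cost, with the criteria shown as loadings
autoplot(pca1, data = df, colour = 'cost', label = TRUE, shape = FALSE,
         loadings = TRUE, loadings.label = TRUE)
# version 2: two clusters found by pam(), drawn with ellipsoid frames
autoplot(pam(df, 2), frame = TRUE, frame.type = 'norm', label = TRUE)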
These visualizations already show a lot of information, but we can still add the "decision vector" (the vector of weights projected into the (PC1, PC2) plane). This shows us where the main decision weight is located, and it shows us the direction of an ideal solution in the projection. This can be done by adding an arrow to the plot with the function annotate().
# Calculate coordinates
dv1 <- sum( w * pca1$rotation[,1]) # decision vector PC1 component
dv2 <- sum( w * pca1$rotation[,2]) # decision vector PC2 component
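A sketch of the corresponding annotate() call (p stands for the ggplot object returned by autoplot() above and is an assumption):

p + annotate("segment", x = 0, y = 0, xend = dv1, yend = dv2, colour = "brown",
             arrow = grid::arrow(length = grid::unit(0.2, "cm")))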
The plot of Figure 27.22 on page 557 is an orthogonal projection onto the (PC1, PC2) plane – the plane of the two most important principal components. In it, we find the following information:
1. The name of the alternatives appears centred around the place where they are mapped. The
projection coincides with the alternatives being spread out as much as possible.
2. Two clusters are obtained by the function pam(): the first cluster has a red ellipsoid around
it and the second one generates the error message “Too few points to calculate an ellipse”
since there are only two observations in the cluster (KRK and DUB).
3. Each criterion is projected in the same plane. This shows that, for example, DUB offers great life quality, KRK an optimal location and low wage inflation, whereas the group around DEL and MNL has low costs and a big talent pool, etc.
4. A “decision vector,” which is the projection of the vector formed by using the weights as
coefficients in the base of criteria. This shows the direction of an ideal solution.
When we experiment with the number of clusters and try three clusters, then we see that KRK
breaks apart from DUB. Thus we learn that – while both in Europe – Krakow and Dublin are very
different places.
This plot shows us how the alternatives are different and what the selection of weights implies.
In our example we notice the following.
Figure 27.22: Clustering with ellipsoid borders, labels of the alternatives, projections of the criteria and a "decision vector" – the projection of the weights – constitute a "Gaia-plot."
• The cities in Asia are clustered together. These cities offer a deep talent pool with hundreds of thousands of already specialized people and are – still – cheap locations: these locations are ideal for large operations where the cost advantage is multiplied.
• Dublin offers the best life quality and a stable environment. The fact that it has great infrastructure is not so clear in this plot; also note that we left out factors such as "digital enabled," for which Dublin again scores great. Ireland also has a stable low-tax regime. However, we notice that it is opposite to the dimensions "tlnt" and "cost": it is a location with high costs and a really small talent pool. This means that it would be the ideal location for a headquarters.
• Krakow is – just as Dublin – a class apart. Poland has a stable political environment thanks to the European Union, is close to R-bank's headquarters, and further offers reasonable costs and best-in-class wage inflation. However, we note that it sits (almost) opposite to the dimension infrastructure. Krakow is indeed the ideal location for a medium-sized operation, where specialization is more important than a talent pool of millions of people. It is also the ideal place for long-term plans (it has low wage inflation and a stable political situation), but it still has to invest in its infrastructure. A reality check teaches us that this is happening, and hence it would be a safe solution to recommend.
• Direct Ranking: A solution a is preferred over b if a does better on more criteria than b
• Inverse Ranking: A solution a is preferred over b if there are more alternatives that do better
than b than there are alternatives that do better than a
Actually, outranking methods can be seen as a special case of ELECTRE II – with a step-function
as preference function. So, all the code can be reused, though it is also simple to program it directly
as the following code fragment shows.
### Outrank
# M is the decision matrix (formulated for a maximum problem)
# w the weights to be used for each rank
outrank <- function (M, w)
{
order <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
order.inv <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
order.pref <- matrix(data=0, nrow=nrow(M), ncol=nrow(M))
for (i in 1:nrow(M)){
for (j in 1:nrow(M)) {
for (k in 1:ncol(M)) {
if (M[i,k] > M[j,k]) { order[i,j] = order[i,j] + w[k] }
if (M[j,k] > M[i,k]) { order.inv[i,j] = order.inv[i,j] + w[k] }
}
}
}
for (i in 1:nrow(M)){
for (j in 1:nrow(M)) {
      if (order[i,j] > order[j,i]){
order.pref[i,j] = 1
order.pref[j,i] = 0
}
else if (order[i,j] < order[j,i]) {
order.pref[i,j] = 0
order.pref[j,i] = 1
}
else {
order.pref[i,j] = 0
order.pref[j,i] = 0
}
}
}
class(order.pref) <- 'prefM'
order.pref
}
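Using it is analogous to the previous methods; a minimal sketch (the row/column names are added here because outrank() does not set them itself):

o <- outrank(M1, w)
rownames(o) <- colnames(o) <- rownames(M1)  # label the alternatives
plot(o)                                     # plots the prefM object with plot.prefM()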
Goal programming can – besides identifying the best compromise solution (MCDA) – also
• determine to what degree the goals can be attained with the resources that we have.
\text{with} \quad
\begin{cases}
f_1(x) + y_1 = M_1 \\
f_2(x) + y_2 = M_2 \\
\;\;\vdots \\
f_j(x) + y_j = M_j \\
\;\;\vdots \\
f_k(x) + y_k = M_k
\end{cases}
• This forces us to convert them first to the same unit: e.g. introduce factors r_j that eliminate the dimensions, and then minimize \sum_{j=1}^{k} r_j y_j.
It should be clear that the r_j play the same role as the f_j(x) in the Weighted Sum Method. This means that the main argument against the Weighted Sum Method (adding things that are expressed in different units) remains valid here.10
The target unit that is used will typically be "a unit-less number between zero and one" or "points" (marks) … as it indeed loses all possible interpretation. To challenge the management, it is worth trying in the first place to present "Euro" or "Dollar" as a common unit. This forces a strict reference frame.
Another way of formulating this is using a target point.
• define a “distance” to the target point: ||F − x||, with F = (f1 (x), f2 (x), . . . , fk (x))′
(defined as in the Weighted Sum Method, so reducing all variables to the same units). For
the distance measure, be inspired by:
– the Manhattan Norm: L_1(x, y) = \sum_{j=1}^{k} |x_j - y_j|
– the Euler (Euclidean) Norm: L_2(x, y) = \left(\sum_{j=1}^{k} (x_j - y_j)^2\right)^{1/2}
– the general p-Norm: L_p(x, y) = \left(\sum_{j=1}^{k} |x_j - y_j|^p\right)^{1/p}
– the Rawls Norm: L_\infty(x, y) = \max_{j=1 \ldots k} |x_j - y_j|
The problem was introduced on page 559 with the Manhattan norm, but we can consider other norms too.
10 Indeed, we decided to convert differences to preference in order to make this mathematically correct, but it
Advantages
• Reasonably intuitive.
Disadvantages
• One has to add variables in different units, or at least reduce all different variables to unit-
less variables via an arbitrary preference function.
It is important to remember that MCDA is not a way to calculate which alternative is best; rather, it is a way to understand how alternatives differ and what type of compromise might work best. It is also an ideal way to structure a discussion in a board meeting.
Using MCDA to decide on a multi-criteria problem is more an art than a science. All those methods have similar shortcomings:
• MCDA methods used for solving multi-dimensional problems (for which different units of measurement are used to describe the alternatives) are not always accurate in single-dimensional problems (e.g. when everything is in Dollar value).
• When one alternative is replaced by a worse one, the ranking of the others can change. This has been proven for both ELECTRE and WPM. However, WSM and PROMethEE (most probably) are not subject to this paradox – see e.g. Triantaphyllou (2000).
• The methods are also sensitive to all parameters and functions needed to calculate results
(e.g. weights and preference functions), and as we have shown, they can also impact the
result.
All this means that we should treat them with the utmost care, and rather use them to gain insight. May we suggest that the method that yields the most insight is the one that does not make a ranking at all: PCA (or Gaia) – see Chapter 20 "Factoring Analysis and Principle Components" on page 363 and Section 27.7.6 "PCA (Gaia)" on page 553.
• the “Multiple Criteria Decision Aid Bibliography” pages of the “Université Paris
Dauphine”: http://www.lamsade.dauphine.fr/mcda/biblio
Digression – Step 6
You might have noticed that in the chapter about MCDA, we discussed all steps in detail, except the last one: "step 6 – recommend a solution". Recommending a solution is part of reporting and communication, which is the subject of Part VII "Reporting" on page 685.
PART VI
Introduction to Companies
A company or a business is an entity that will employ capital with the aim of producing profit.
This commercial activity necessarily produces a variety of financial flows. Accounting is probably the process that is most intimately related to all those financial flows: it includes the systematic and comprehensive recording of transactions. The concept "accounting" also refers to the process
of summarizing, analysing and reporting these transactions to the management, regulators, and
tax offices.
Accounting is one of the quintessential functions in any business of any size. The accounting department will also vary with the complexity of the business and the fiscal environment. States want to ensure a comfortable inflow of money, and the accounting process is used to make sure that private enterprise pays for the state organization. While taxation is still a matter that is very much linked to the relevant country, the European Union has led to harmonisation in legislation. In most countries and companies, accountants use a set of rules that is called "generally accepted accounting principles" (GAAP). GAAP is a set of standards that describe how the assets can be valued and summarized in the balance sheet, how problems like outstanding debt have to be treated, etc. It is based on double-entry accounting, a method in which each expense or incoming revenue is always entered, mirrored, in two places on the balance sheet.
♣ 28 ♣
Financial Accounting
While financial accounting is not a subject that is part of most programs for mathematicians or data scientists, it does help us to understand how companies work. Accounting is a vast field and the details tend to depend on the country of residence and operation. In this chapter, we present only a very brief and general introduction to the subject that will be valid in most countries, and that will help us to understand asset classes and their pricing more easily.
The process of accounting is essentially registering what happens in the company so that we are able to understand how well the company is doing and, to some extent, what the future looks like. This necessarily implies dealing with both flow variables (cash entering or leaving our books, for example) and stock variables (e.g. the reserve of cash, the pile of raw materials, etc.). In order to make sense of this continuously changing reality, one usually employs a standardized approach that consists of updating the books daily and taking regular snapshots. These snapshots are called "statements of accounts" and consist of
• the income statement,
• the balance sheet.
The Income Statement is all cash income minus all cash expenses.
The income statement informs about cash flow and is therefore backwards-looking only; it focuses only on a narrow part of the reality: the cash. Its importance lies, of course, in cash management.
The P&L includes all income or loss over a certain period. This defines our tax obligations and determines how much dividend we can pay. However, this might include extraordinary income or loss. For example, if the business is buying and selling cars, then including the one-off sale of the building will not give a true image of how the core business is doing.
The P&L is already a little more forward-looking, but still focuses mainly on the flow variables. It clarifies changes that occur over a certain period.
In order to get that image of how profitable the core business is, we can use the following.
Definition: NOPAT
NOPAT = Net Operating Income After Taxes (this is EAT minus extraordinary income)
This NOPAT is essential when we want to understand how well a company is doing on its core business, and that is important to understand how well it will do in the future, because extraordinary profit or loss is unlikely to be repeated. Instead of defining NOPAT by what it does not contain, it can also be written in terms of its constituents.
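The standard textbook identity (stated here as a reminder; the book's exact notation may differ) is:

\text{NOPAT} = \text{EBIT} \times (1 - \tau)

with τ the (effective) tax rate.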
28.1.3 Balance Sheet
The P&L shows the profitability of the company. While this is a good indication of how things are going in the short term, it does not say much about how this short-term movement will impact the totality of the company. That is where the balance sheet comes in: the balance sheet highlights the stock variables. The stock of everything the company owes and owns is called the "balance sheet."
The balance sheet is a summary of all that the company owns and all that it owes to someone else. This is split into two logical parts: assets and liabilities. The assets are all that the company owns and consist of the following.
Detailed breakdown of assets:
• Current Assets
The liabilities are all the positions that reflect an obligation of the company towards a third party.
Detailed Breakdown of Liabilities
• Accounts payable
• Provisions for warranties or court decisions (contingent liabilities that are both probable
and measurable)
• Financial liabilities (excluding provisions and accounts payables), such as promissory notes
and corporate bonds
• Tax liabilities
• Deferred tax liabilities and deferred tax assets
• Unearned revenue for services paid for by customers but not yet provided
• Shares
The balance sheet of a company is the most comprehensive reflection of the company: it shows all the company's possessions and obligations. The P&L shows how the company is using these assets, and the Income Statement shows how this use of assets is reflected in cash flows.
The ultimate goal of a company is to create value and pay back that value to its shareholders. In
order to create this value the company can use its assets to generate income, and after paying for
costs, the profit will accumulate in the value of the company.
Value Creation
Companies are created to generate profit. The investor will put initial assets in the company (in return the investor gets shares); those assets are used to generate income. Along the way the company also incurs costs, and hence we need to deduct those before we can see what the profit is. The value of the company then becomes the discounted sum of all potential profits. This is expressed in Figure 28.1.
Figure 28.1: The elements of wealth creation in a company. The company acquires assets and uses them to generate income; we take into account the costs that were incurred and obtain the profit. The potential to generate and accumulate profit constitutes the value of a company.
Note that a company does not need to show profit in order to be valuable. A fast-growing company will have to buy more raw materials than it has sales income and hence can show losses – even if each sale in itself is profitable.1 The value of the company then comes from the "prospect of growing profit" (once the situation stabilizes).
In order to generate profit, a manager is hired in the company. That person will need to monitor a more detailed value chain in order to create value. The profit will not just happen; a conscious effort is needed to produce, sell, keep the company safe, etc. This idea is summarised in Figure 28.2.
Figure 28.2: KPIs of the Value Chain that can be used by a manager who wants to increase the value of a company. TA stands for "total assets", EBIT stands for "earnings before interest and taxes", and ROI is "return on investments" – we clarify these concepts further in this chapter.
What the shareholder really should try to obtain is long-term sustainable growth of the share price (and hence the market cap2). The idea is that by linking incentives of senior management to (annual) sales, EBITDA, growth in share value, etc., this chain is activated and supported in as many places as possible.
1 These companies are known as "growth companies".
2 Market cap is slang for "market capitalisation" and refers to the total amount of stock floated on the stock exchange. We use this concept here as a synonym for the total value of the company.
There are a few important points of view. The first is that if one wants to buy a company, it is essential to have a good idea of its value. However, once it is bought, the second point of view matters: managing the value. The owner should make sure that the interest of the management is aligned with the interest of the owners of the company: maximize the growth of company value (given the dividend policy).
In the following chapter, we will present management accounting as the tool for the owner to align the interest of the management with that of the owners, and as a tool for the management to execute this strategy.
When people interact a lot and have to talk about specific subjects they tend to use specific words
more than others, develop specific words or live by acronyms. It is not our aim to list all acronyms
or wording used, but we would like to mention a few that will be helpful later on.
Definition: Loans
loans := debt = a sum of money borrowed with the obligation to pay back at pre-agreed
terms and conditions.
For the purpose of this section, we will refer to loans (or debt) as the “outstanding amount
of debt.” So, if for example the initial loan was $10 000 000, but the company already paid
back $9 000 000, then we will have only one million of debt.
Definition: Equity
share capital := equity = the value of the shares issued by the company = assets minus liabilities
Example: Equity
A company that has only one asset of $100 000 and a loan against that asset (with an outstanding amount of $40 000) has $60 000 equity.
Definition: CapEx
A Capital Expenditure (CapEx) is an expense made by a company for which the benefit to the company continues over a long period (multiple accounting cycles), rather than being used and exhausted in a short period (shorter than one accounting cycle). Such expenditure is assumed to be of a non-recurring nature and results in the acquisition of durable assets.
CapEx is also referred to as "Capital Expense"; both are synonyms.
In accounting, one will not book CapEx as a cost but rather add it to capital and then depreciate it. This allows a regular and durable reduction of taxes. For example, a rail-transport company would depreciate a train over 10 years, because it can typically be used longer.
An Operational Expenditure (OpEx) is an ongoing and/or recurring cost to run a business/system/product/asset.
In accounting, OpEx is booked as "costs" and will reduce the taxable income of that year (except when local rules force it to be re-added for tax purposes).
For example, the diesel to run a train, its maintenance, the salary costs of the driver, oil for the motor, etc. are all OpEx for the train (while the train itself would be booked as CapEx).
In many cases, it makes sense to compare the performance of one company to its peers. For example, a potential buyer might be interested to find the most valuable company, or the management wants to compare its performance to peers. When comparing companies, it makes little sense to compare just the numbers (such as EAT). Indeed, the bigger company will always have larger numbers. In order to be able to compare the performance of companies, we should convert numbers from dollars to percentages. This is done by dividing two quantities that have the same dimensions, and we call the result a "ratio". Note that this procedure makes the ratio a dimensionless quantity.
Below we list a few common ratios that can be helpful to understand the strengths and weaknesses of a company relative to another. The management should be interested in monitoring all these margins and making sure they evolve favourably.
Profit Margin

PM = EBIT / Sales

The profit margin (PM) indicates how much profit the company is able to generate compared to its sales. This is a measure of the power of the sales and can be indicative of strengths or weaknesses in the relationship with the customers. If two companies are similar, but company A has a much higher PM, then this indicates that company A is able to gain more with a similar effort.
Gross Margin

GM = Gross Profit / Sales

The gross margin (GM) is related to the profit margin; however, one will notice that while the denominator is the same, the numerator is different. Here, we compare the gross profit with sales. For example, if the PM is too low, we should first check if the GM is high enough. If that is so, we can drill down and find out what is reducing the profit. For example, we might find that the cost of sales is too high.
If, on the other hand, we find that even the GM is too low, then the problem is – predominantly – elsewhere.
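As a small illustration in R (all figures below are made-up numbers):

sales        <- 1000   # total sales
ebit         <- 120    # earnings before interest and taxes
gross_profit <- 380    # sales minus cost of goods sold
PM <- ebit / sales          # profit margin
GM <- gross_profit / sales  # gross margin
c(PM = PM, GM = GM)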
Asset Utilisation

AU = Sales / Net Total Assets = Sales / TCstE

The asset utilisation rate (AU) is hence the quotient of sales divided by the net total assets, here taken to be the total cost of equity (TCstE). This indicates how much the company is able to generate relative to the cost of raising capital.
Liquid Ratio

LR = Liquid Assets / Liquid Liabilities = (Current Assets − Stock) / Current Liabilities = LA / LL

where LA is short for liquid assets and LL stands for liquid liabilities.
The liquidity ratio shows us how much liquid assets we have available to cover our liquid liabilities. If that ratio is small, this means that payments will be due before cash can be raised. If this ratio is high, then the company will soon accumulate excess cash.
This ratio can be used to forecast liquidity squeezes. If it is indeed so that we will soon need more cash to pay for liabilities than we will have income from the realisation of assets, then we should think about an overdraft facility, for example.
Sometimes, the LR is also referred to as the current ratio (CR):
Definition: Current Ratio (CR)

CR = Current Assets / Current Liabilities = CA / CL = LR
Note – CR or LR
The words "current" and "liquid" are both used as synonyms for "short term" in this context, so CR and LR are different names for the same concept.
The problem with these ratios is that the current liabilities are certain (the invoice is in our
accounting department and we know when we have to pay), but the current assets are composed
of cash, cash equivalents, securities, and the stock of finished goods. These goods might or might
not be sold. This means that it is uncertain if we will be able to use those assets to cover those
liabilities. To alleviate this issue, we can use the quick ratio.
Quick Ratio
The quick ratio (QR) compares the total of cash and cash equivalents, marketable securities, and accounts receivable to the current liabilities. The QR excludes some current assets, such as the inventory of goods. It is hence something like an "acid test": it only takes into account the assets that can quickly be converted to cash.
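Written as a formula (a standard formulation consistent with the description above; the book's exact notation may differ):

\text{QR} = \frac{\text{Cash} + \text{Cash Equivalents} + \text{Marketable Securities} + \text{Accounts Receivable}}{\text{Current Liabilities}}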
Note, however, that the QR does include accounts receivable. That means we are expecting payment, since we have issued an invoice. In reality, it is not a certainty that we will get the money (in time).
Only the laws of physics are "certain"; the laws of men and the customs of business not only change over time, but cannot be taken for granted.
Operating Assets

OA = total assets − financial assets = TA − FA = TA − cash (usually)

The OA are those assets that the company can use for its core business. For most companies (so, excluding banks and investment funds) the operating assets are all the assets minus the financial assets, such as shares of other companies.3 For example, the factory, the machines and tools
would qualify as operating assets for a car manufacturer, but not the strategic stake that is held in the competing brand.
3 We exclude financial service companies because their financial assets are their operational assets, in regular
Operating Liabilities
The operating liabilities are those liabilities that the company should settle in order to pursue
its regular business model.
Digression – non-operating assets
A non-operating asset is an asset that is not essential to the ongoing/usual operations of a business but may still generate income (and hence contributes to the return on investment [ROI]).
Typically, these assets are not listed separately in the balance sheet; one will have to gather the information from the management or from on-site analysis in order to separate them.
A suitable acronym for non-operating assets could be NOA, not to be confused with the net operating assets (for which we reserved the acronym NOA in this book).
Working Capital
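The defining formula (the standard definition, consistent with the digression below that compares WC with the LR):

\text{WC} = \text{Current Assets} - \text{Current Liabilities} = \text{CA} - \text{CL}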
Hence, WC represents the operating liquidity that is available to the business. A positive WC means that there is enough cash to cover the short-term liabilities. If WC is negative, this means that the company will need to use more cash to cover short-term liabilities than it is getting in via its current assets.
Digression – WC and LR
Compare the WC to the definition of the Liquid Ratio on page 576. While both definitions use CL and CA, they do something different and have different units (WC has a monetary unit and LR is dimensionless).
To determine the current assets and current liabilities one needs to use the following three accounts:
• accounts receivable (→ current assets),
• inventory (→ current assets),
• accounts payable (→ current liabilities) — e.g. short-term debts such as bank loans and lines of credit.
If we observe that the working capital increases, then this means that the current assets grew faster than the current liabilities (for example, it is possible that the company has increased its receivables or other current assets, or has decreased current liabilities — or has paid off some short-term creditors, or a combination of both).
The total capital employed (TCE) or Total Capital (TC) informs about how much capital the company uses to run the business and invest in future projects. Capital can be raised via equity, bonds or loans, and by accumulating reserves of previous years.
Note that we use TCE and TC interchangeably: TCE := TC.
The company can use both equity and debt to finance its business growth. The management should aim to minimize the cost of capital. To some extent the cost of capital is also a measure of the perceived risk of the company.
The weighted average cost of capital (WACC) is used to evaluate new projects of a company: it is the minimum rate of return that a new project should bring. WACC is also the minimum return that investors expect in return for
capital they allocate to the company (hereby setting a benchmark that a new project has to meet – so both are equivalent).

\text{WACC} = \frac{\sum_{i=1}^{N} R_i V_i}{\sum_{i=1}^{N} V_i} = \frac{D}{D+E} K_d + \frac{E}{D+E} K_e \quad \text{(if only funded by equity and debt)}

With V_i the market value of asset i, R_i the return of asset i, E the total equity (expressed in currency), D the total debt (expressed in currency), K_e the cost of equity, and K_d the cost of debt.
For our purposes we will have to include tax effects. Assuming a tax rate of τ we get:

\text{WACC} = \frac{D}{D+E} K_d (1 - \tau) + \frac{E}{D+E} K_e

There are factors that make it difficult to calculate the formula for determining WACC (e.g. determining the market value of debt – here one can usually use the book value in case of a healthy company – and of equity (this can be circular), finding a good average tax rate, etc.). Therefore, different stakeholders will make different assumptions and end up with different numbers. This is the genesis of a healthy market where no one has perfect information and the different players have different price calculations. This will then lead to turnover (buying and selling), and is essential for a healthy economy.
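As a small R helper implementing the last formula (the function name and the example numbers are ours):

wacc <- function(D, E, Kd, Ke, tau = 0) {
  # weighted average cost of capital, with the tax shield applied to the debt part
  D / (D + E) * Kd * (1 - tau) + E / (D + E) * Ke
}
wacc(D = 4e6, E = 6e6, Kd = 0.05, Ke = 0.12, tau = 0.19)   # illustrative numbers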
Reinvestment Rate

RIR = Reinvestment Rate = g / ROIC

where g is the growth rate and ROIC is the return on invested capital.
The RIR measures how much of the profits the company keeps to grow its business. Indeed, when the company has paid suppliers, lenders, and authorities, it can decide what to do with the profit that is left. The ordinary shareholders' meeting will convene and decide how much of the earnings can remain in the company (this money is called a "reserve"; later it can be used for growth purposes g), or how much dividend will be paid (D).
A company that has a lot of confidence from its shareholders will be allowed to keep more in the company and hence will have a higher growth rate (given the same EAT).
Coverage Ratio

CoverageR = Operating Income / Financial Expenses
with Depr. meaning “depreciations”, Int. “interest”, and Oth. meaning “other”.
Gearing

GR = Loans / TCE = Loans / (Shareholders' Equity + Reserves + Loans)

The gearing ratio expresses to what extent loans are used to gather capital. A company that is creditworthy will have easier access to the bond market or to lending via banks and hence can work with a higher GR. This implies that its shareholders will see their stake less diluted. Bond holders or lenders not only earn a fixed percentage, but also get their money first. Only if something is left after paying suppliers, taxes, and debt do equity holders earn a dividend.
Debt-to-equity ratio
DE = Loans / Equity
The DE is related to the previous ratio and it compares the amount of loans used with the
amount of equity.
♣ 29 ♣
Management Accounting
29.1 Introduction
The owner(s) of a company should make sure that the interest of the management is aligned with theirs.1 In the company form where the owners are most remote from the management (the share company), the owners will have an Ordinary General Shareholders Meeting at least once a year. It is in that meeting that the supervisory board is chosen by a majority of votes – votes that are allocated in proportion to the number of shares owned.
In that step, the owners will choose supervisory board members that they can trust to align the executive management's priorities with those of the owners. This supervisory board will typically set goals for the executive management in the form of KPIs (Key Performance Indicators), and the variable pay of the executive managers depends on the results of these KPIs.
For example, the executive management might be pushed to increase share value, market share, and profit. The executive management in its turn will then be able to set more concrete goals for team leaders, who in their turn will set goals for the executing workers. This cascade of goals and their management is "management accounting."
Management Accounting (MA) is the section of the company that supports the management in making better informed decisions and in planning. To do so, it will use data to monitor
1 Of course, if they are themselves the management, then there is no potential conflict of interest and this step
becomes trivial.
finances, processes, and people to prepare a decision, and after the decision it will help to follow up on the impact.
Definition: Management Information — MI
The concept may include transaction processing systems, decision support systems, expert systems, and executive information systems. The term MIS is often used in business schools. Some of the MIS contents overlap with other areas such as information systems, information technology, informatics, e-commerce, and computer science. Therefore, the term MIS can sometimes be used interchangeably in the above areas.
Management accounting needs to support the management in their decision process. In essence, it has to be forward-looking, but it also needs to give an accurate image of what is going on in the company. One of the most important additions to the financial accounting is usually a view on what consumes money and what produces money in the company.
Imagine, for example, that there are two business lines: we produce chairs and tables. Sometimes we sell a set, but most often customers buy only chairs or only the table from us. This means that the prices of the chair and the table should be determined with care. For example, it is possible that the profit margin on the chairs is negative, but on the tables positive. Since we used to sell sets, we did not notice, but now a big contract for chairs comes in.
If we do not have a good mechanism in place to determine what the real cost of the chair
production is, then we will make losses on this contract. Such contracts might be decisive for the
company.
Cost accounting is an accounting process that measures and analyses the costs associated
with products, production, and projects so that correct amounts are reported on financial
statements.
• Standard cost accounting (SCA): Standard cost accounting uses ratios called efficiencies that compare the labour and materials actually used to produce a good with those that the same goods would have required under "standard" conditions – this works well if labour is the main cost driver (as was the case in the 1920s when it was introduced).
• Activity-based costing (ABC): Activity-based costing is a costing methodology that identifies activities in an organization and assigns the cost of each activity with resources to all products and services according to the actual consumption by each. This model assigns more indirect costs (overhead) into direct costs compared to conventional costing and also allows for the use of activity-based drivers – see for example van der Merwe and Clinton (2006).
• Lean accounting: Lean accounting is introduced to support the lean enterprise as a business
strategy (the company that strives to follow the principles of Lean Production — see Liker
and Convis (2011)). The idea is to promote a system that measures and motivates best busi-
ness practices in the lean enterprise by measuring those things that matter for the customer
and the company.
• Resource consumption accounting (RCA): Resource consumption accounting is a management theory describing a dynamic, fully integrated, principle-based, and comprehensive management accounting approach that provides managers with decision support information for enterprise optimization. RCA is a relatively new, flexible, comprehensive management accounting approach based largely on the German management accounting approach Grenzplankostenrechnung (GPK).2
• Throughput accounting (TA): Throughput accounting3 is a principle-based and simplified management accounting approach that aims to maximize throughput4 (sales reduced by total variable costs). It is not a cost accounting approach as it does not try to allocate all costs (only variable costs) and is only cash focused. Hence, TA tries to maximize the throughput T = S − TVC (with S the sales and TVC the total variable costs). This throughput is typically expressed as a Throughput Accounting Ratio (TAR), defined as follows: TAR = return per factory hour / cost per factory hour.
• Life-cycle costing (LCCA): Life-cycle cost analysis (LCCA) is a tool to determine the most cost-effective option among different competing alternatives to purchase, own, operate, maintain, and finally dispose of an object or process, when each is equally appropriate to be implemented on technical grounds. Hence, LCCA is ideal to decide what to use and how to do it. For example, it can be used to decide which type of rails to use, which machine to use to lay the rails, how to finance the machine, etc.5
• Environmental accounting: Environmental accounting incorporates both economic and
environmental information. It can be conducted at the corporate level, national level
or international level (through the System of Integrated Environmental and Economic
Accounting, a satellite system to the National Accounts of Countries (those that produce
the estimates of Gross Domestic Product (GDP))). Environmental accounting is a field
2 The GPK methodology has become the standard for cost accounting in Germany as a “result of the modern,
If the scope becomes too large the tool may become impractical to use and of limited ability to help in decision-
making and consideration of alternatives; if the scope is too small then the results may be skewed by the choice
of factors considered such that the output becomes unreliable or partisan. Usually, the LCCA term implies that
environmental costs are not included, whereas the similar Whole-Life Costing, or just Life Cycle Analysis (LCA),
generally has a broader scope, including environmental costs.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 587
❦
The field of cost accounting is vast and complicated. It is beyond the scope of this book to
detail out the subject. Fortunately there are many good books on the subject. For example,
a good overview can be found in: Drury (2013) and Killough and Leininger (1977).
Direct Costs
A Direct Cost is a cost used to produce a good or service and that can be identified as
directly used for the good or service.
For example: Direct Cost can be (raw) materials, labour, expenses, marketing and distribution
costs if they can be traced to a product, department or project.
Marginal Cost
The Marginal Cost is the expense to produce one more unit of product. This can also be
∂P
defined as CM = ∂Q (with P the price, A the quantity produced and CM the Marginal
Cost.
6 In the traditional cost-plus pricing method, materials, labour, and overhead costs are measured. Further, a
desired profit is added to determine the selling price. Target costing works the opposite way around. One starts
from the price that can be obtained on the market and then works back what the costs can be and tries to reduce
costs as necessary.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 588
❦
Indirect Cost
An Indirect Cost is an expense that is not directly related to producing a good or service,
and/or cannot be easily traced to a product, department, activity or project that would be
directly related to the good or service considered.
An assembly facility will easily allocate all components and workers to the end-product
(e.g. a specific mobile phone or tablet). However, the cost to rent the facility, the electricity
and the management are not easily allocated to one type of product, so they can be treated
as Indirect Costs.
Fixed Cost
A fixed cost is an expense that does not vary with the number of goods or services pro-
duced (at least in medium or short term).
❦ Variable Cost
❦
A Variable Cost is an expense that changes directly with the level of production output.
The lease of the facility will be a fixed cost: it will not vary in function of the number
of phone and tablets produced. However, the electricity and salaries might be Variable
Costs.
Overhead Cost
Overhead expenses can be defined as all costs on the income statement except for direct
labour, direct materials, and direct expenses. Overhead expenses may include accounting fees,
advertising, insurance, interest, legal fees, labour burden, rent, repairs, supplies, taxes, telephone
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 589
❦
bills, travel expenditures, and utilities. Note that Overhead Cost can be Variable Overhead (e.g.
office supplies, electricity) or Fixed Overhead (lease of a building).
There are of course many other types of costs. Some were already mentioned in the section
of financial accounting (e.g. operating costs, cost of goods sold), and others are quite specific (for
example “sunk costs” are those costs that have been made on a given project and that will not be
recuperated even when the project is stopped).
❦ ❦
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 590
❦
While the details of cost accounting are a little beyond the scope of this book, their use is right in
the attention scope. Indeed, doing something with the numbers produced and transforming that
data in to meaningful insight is the work of the data scientist. Companies gradually discover the
importance and value of all the data they harvest and invest more and more in data and analytics.
Data scientists will need to make sense of the vast amounts of data available. Besides making
models – see Part V “Modelling” on page 373 – presenting data is an essential part of the role of
the data scientist. This presenting data has a technical part (e.g. choosing the right visualization
– see Chapter 9 “Visualisation Methods” on page 159, Chapter 31 “A Grammar of Graphics with
ggplot2” on page 687 and Chapter 36.3 “Dashboards” on page 725 – but also has an important
content part. That content part is the subject of this section.
BSC The Balanced Scorecard (BSC) is a structured report that helps managers to keep track of
balanced scorecard the execution of activities, issues, and relevant measures. The critical characteristics that
define a balanced scorecard are:
The third-generation version was developed in the late 1990s to address design problems inher-
ent to earlier generations – see Lawrie and Cobbold (2004). Rather than just a card to measure
performance, it tries to link into the strategic long-term goals; therefore, it should be composed
of the following parts:
• A destination statement. This is a one or two page description of the organisation at a defined
point in the future, typically three to five years away, assuming the current strategy has
been successfully implemented. The descriptions of the successful future are segmented
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 591
❦
into perspectives for example financial & stakeholder expectations, customer & external
relationships, processes & activities, organisation & culture.
• A strategic linkage model. This is a version of the traditional “strategy map” that typically
contains 12-24 strategic objectives segmented into two perspectives, activities and
outcomes, analogous to the logical framework. Linkages indicate hypothesised causal
relations between strategic objectives.
• A set of definitions for each of the measures selected to monitor each of the strategic objectives,
including targets.
A good overview of the third generation balanced scorecard can be found in: Kaplan and
Norton (2001a,b), and Norreklit (2000).
Also essential is that the design process is driven by the management team that will use the bal-
anced scorecard. The managers themselves, not external experts, make all decisions about the
balanced scorecard content. The process starts – logically – with the development of the “desti-
nation statement” to build management consensus on longer term strategic goals. This result is
then used to create the “strategic linkage model” that presents the shorter term management pri-
orities and how they will help to achieve the longer term goals. Then all “strategic objectives” are
❦ assigned at least one “owner” in the management team. This owner defines the objective itself, ❦
plus the measures and targets associated with the objective. The main difference with the previous
generations of BSCs is that the third generation really tries to link in with the strategic objectives,
hence improving relevance, buy-in, and comfort that more areas are covered.
Definition: KPI
KPI
A Key Performance Indicator (KPI) is a measure used to bring about behavioural change key performance
and improve performance. indicator
Our management thinks that customer engagement is key and identifies NPSa as a KPI
(or even as the KPI). Doing so it engages all employees to provide customers with a bet-
ter experience, better product, sharp price, after sales service, etc. Almost everything the
company does will somehow contribute to this KPI.
a NPS is formally defined in Section 29.3.2.3 on page 592. It is an indicator that measures customer
satisfaction.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 592
❦
A Lagging Indicator is an “output” indicator, it is the result of something but and by the
time it is measured it is too late for management to intervene. It explains why we have
today a given profit.
Leading KPIs
7 Of course it is possible to argue that the impact on this period will be rather adverse if we pay fuel for the cars
that are used to visit prospective customers or if we have to pay per hit on the Internet. The point, is however, that
these indicators will lead towards income in next period.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 593
❦
The textbook example would be that when your strategic goal is to live longer and be
physically fitter, then most probably your level one KPI is weight loss. But that is a Lagging
Indicator: when you are on the scale and you read out the weight, it is too late to do
something about it. Leading indicators that feed into this lagging indicator are for example
food intake, hours of workout, etc.
Typically, Leading Indicators feed into Lagging Indicators. The higher up the organization
chart, the more KPIs become lagging. One of the finer arts of management is to turn these
lagging KPIs into actionable strategies, leading KPIs, leading actions, etc.
• Expected: also referred to as Customer Lifetime Value or Lifetime Customer Value this is
CLV
the present value of the total value that be expected to be derived from this customer. customer lifetime value
LCV
• Potential: the maximal obtainable customer value.
lifetime customer value
The answer is: “it depends”. If past income is predictable for the future income (if no cross
selling or up-selling is possible), then it is important. In most cases, however, past income
on a customer is not the most essential: the future income is the real goal. Banks will
for example provide free accounts to youngsters and loss making loans to students in the
hope that most of those customers will stay many years and become profitable.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 594
❦
It might seem obvious that a customer value metric is a good idea, but it is also deceivingly
difficult to get an estimate that is good enough in the sense that matters. Usually, there are the
following hurdles and issues to be considered in CVM calculations.
• Use Income or Gross Profit in stead of Net Income.
• CVM is an output model (not an input model). If model inputs change then the CVM will
change (e.g. better customer service will reduce churn).
• Correlation between the CVM of different segments can increase risk.
Subjective concepts such as customer satisfaction are best measured on some simple scale – such
as “bad,” “good”, and “excellent.” To our experience the best is to use a scale of five options: seven
is too much to keep in mind, three is too simple in order to present fine gradation. A “scale from
one to five” is almost naturally understood by everyone.8
Usually, the scale is more or less as follows:
❦ ❦
1. really bad/dissatisfied,
2. acceptable but not good/satisfied,
3. neutral,
4. good/satisfied,
5. really good/extremely satisfied.
It is now possible to define the NPS as follows:
NPS
#Promoters − #Detractors
NPS :=
Net Promoter Score Total #Customers
Where Promoters are the people that score highest possible (e.g. 5 out of 5) and Detractors
are people that score lowest (1 out of 5 — though it might make sensea to make this range
wider such as scores 1 and 2). The middle class that is not used in the nominator is called
the “Passives”: these are the people that do not promote us, nor
a In the original work one argues that “customers that give you a 6 or below (on a scale from 1 to 10)
8 Note that this “net promoter score” in R would best be described as a factor-object – see Chapter 4.3.7 “Factors”
on page 45 – and that it is and ordinal scale only – see Chapter B.2 “Ordinal Scale” on page 830.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 595
❦
Promoters are believed to support the brand and detractors are believed to discourage peers
to use the brand, while one can expect that the group in between will not actively promote or
discourage others. Therefore, it makes logically sense to compare only the two active groups. This
does make the ratio “harsh” (in that sense that it can be expected to be low), but it does make the
ratio more predictable for what will happen to our customer base.
Digression NPS
The NPS was introduced by Reichheld (2003) and Baine & Co (and still is a registered
trademark of Fred Reichheld, Bain & Company and Satmetrix).
If customer satisfaction is rated on a scale from 1 to 10, then the promoter score (PS) is the
percentage of users that score 9 or 10, and the brand Detractors are the percentage of clients that
score 1 or 2.
An NPS can be between −1 (everybody is a detractor) or as high as +1 (everybody is a pro-
moter). So, what is a good and what is a bad NPS? There is of course no rigorous answer, but
usually one considers a positive NPS as “good,” and an NPS of +0.5 or more as “excellent.”
It is possible to think of another definition – such as the sum of people that score 1/5 and 2/5
compared to the people that score 4/5 plus those that score 5/5. However, taking less enthusiast
people in that promoter or detractor score is not a good idea if we want to measure “who is going
to recommend our service or product.” This last suggestions seems to be a suitable to measure
who is likely to use our product or service again.
We argue that for other uses, it makes sense to use a “Net Satisfaction Score” that can be
defined as follows.
❦ ❦
Definition: Net Satisfaction Score (NSS)
For processes that have no immediate competition (for example a self evaluation of a board,
the quality of projects run by the project team, etc.) this approach makes sense. However, when
customers have a free choice, the NPS is probably a better forward looking measure.
Whatever scale or exact definition of NPS we use, the importance of NPS is that it reduces a
complex situation of five or more different values to one number. One number is easier to follow
up, present as a KPI and eventually visualize its history over time.
Why is it never a good idea to follow up the average of the satisfaction score (as subjective
rating on a scale from 1 to 5)?
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 597
❦
♣ 30 ♣
In previous chapters – Chapter 28 “Financial Accounting (FA)” on page 567 and Chapter 29
“Management Accounting” on page 583 – we discussed how a company creates snapshots of the
reality with financial accounting and how management accounting is used to gain actionable
insight for driving the value of the company. But what is the value of a company? It is easy to
reply that the value of a company equals its market capitalization. At least for quoted companies
this is a practical definition, but still then someone needs to be able to calculate a fundamental
value in order to assess if the market is right or if there is an arbitrage opportunity.
We will build up this chapter towards some ways to calculate the value of a company and
along the way introduce some other financial instruments that might be of interest such as cash,
bonds, equities, and some derivatives.
❦ ❦
The Big R-Book: From Data Science to Learning Machines and Big Data, First Edition. Philippe J.S. De Brouwer.
© 2021 John Wiley & Sons, Inc. Published 2021 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/De Brouwer/The Big R-Book
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 598
❦
Financial instruments typically deliver future cash-flows for the owner, hence before we can start
we need some idea how to calculate the value of a future cash flow today.
Question #24
If the interest rate over one year is ry , then how much interest is due over one
month?(While in general that will depend on how many days the month has, for this
1 th
exercise work with 12 ).
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 599
❦
Consider now the following example. A loan-shark asks a 5% interest rate for a loan of one
month. What is the APR? Well the APR is the annual equivalent interest rate, so we have to see
what a 5% interest per week would cost us over a full year.
APR = ry = (1 + rw )52 − 1
If a loan shark decides to add an “administration fee” (to be paid immediately after getting
the loan) of $5 on a loan of $100 for one month. What is the APR?
The nominal interest rate (in ) is the rate of interest (as shown or calculated) with no
❦ adjustment for inflation. ❦
The real interest rate (ir ) is the growth in real value (purchase power) plus interest cor- inflation rate
rected for inflation (p).
Assume that the inflation p is 10% and you borrow $100 for one year and the lender asks
you to pay back $110 after one year. In that case, you pay back the same amount in real
terms as the amount that you have borrowed, so the real interest rate is 0%, while the
nominal interest rate is 10%.
(1 + in ) = (1 + ir ) (1 + p)
and hence
1 + in
ir = −1
1+p
So, the relation in = ir + p is only and approximation of the first order (the proof is left as
exercise).
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 600
❦
30.1.3 Discounting
NPV The Net Present Value (NPV) is the Future value discounted to today:
Net Present Value
FV
PV = (30.1)
(1 + r)N
Cash Flow Hence, the Net Present Value of a series of cash flows (CF) equals:
N
CFt
NPV = (30.2)
(1 + r)t
t=0
Example
interest rate The interest rate is “flat” (this means that it has the same value regardless the time hori-
zon) and equals 10%. We have a project that has no risk and today we need to invest £100,
and then it pays £100 in year 5 and 7. Is this a good project?
r <- 0.1
CFs <- c(-100, 100, 100)
t <- c( 0, 5, 7)
NPV <- sum(CFs / (1 + r)^t)
print(round(NPV, 2))
## [1] 13.41
The net present value of the project is £13.41. Since the value is higher than the risk free
❦ interest rate (10%), the project is a worthwhile investment and the rational investor should ❦
be willing to pay £100 today in order to receive two deferred cash flows of £100 in year 5
and 7.
The fact that R will treat all operations element per element does make our code really
short and neat.a
a However, be sure to understand vector recycling and read see Chapter 4.3.2 “Vectors” on page 29 if
necessary.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 601
❦
30.2 Cash
The most simple asset class is cash. This refers to money that is readily available (paper money,
current accounts, and asset managers might even use the term for short term government bonds).
Definition: Cash
The strict definition of Cash is money in the physical form of a currency, such as ban-
knotes and coins.
In bookkeeping and finance, cash refers to current assets comprising currency or currency
equivalents that can be converted to cash (almost) immediately. Cash is seen either as a
reserve for payments, in case of a structural or incidental negative cash flow, or as a way
to avoid a downturn on financial markets.
Example: Cash
For example, typically one considers current accounts, savings accounts, short term Trea-
sury notes, etc. also as “cash.”
For an asset manager, “cash” is a practical term, used to describe all assets that share similar
returns, safety, and liquidity with cash as defined previously. In this wide definition cash can refer
to cash held on current accounts, treasury notes of very short duration, etc.
❦ ❦
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 602
❦
30.3 Bonds
Definition: Bond
Thus, a bond is a form of loan: the holder of the bond is the lender (creditor), the issuer of the
bond is the borrower (debtor), and the coupon is the interest. Bonds provide the borrower with
external funds to finance long-term investments, or, in the case of government bonds, to finance
current expenditure. Certificates of deposit (CDs) or short term commercial paper are considered
to be money market instruments and not bonds for investment purposes: the main difference is
in the length of the term of the instrument.
Bonds and stocks are both securities, but the major difference between the two is that (cap-
ital) stockholders have an equity stake in the company (i.e. they are investors, they own part of
the company), whereas bondholders have a creditor stake in the company (i.e. they are lenders).
Being a creditor, bondholders have priority over stockholders. This means they will be repaid
in advance of stockholders, but will rank behind secured creditors in the event of bankruptcy.
❦ Another difference is that bonds usually have a defined term, or maturity, after which the bond is ❦
redeemed, whereas stocks are typically outstanding indefinitely. An exception is an irredeemable
bond (perpetual bond), ie. a bond with no maturity.
Nominal, principal, par, or face amount is the amount on which the issuer pays interest,
and which – usually – has to be repaid at the end of the term. Some structured bonds can
have a redemption amount which is different from the face amount and can be linked to
performance of particular assets.
Definition: Maturity
The issuer has to repay the nominal amount on the maturity date. As long as all due
payments have been made, the issuer has no further obligations to the bond holders after
the maturity date. The length of time until the maturity date is often referred to as the term
or tenor or maturity of a bond. The maturity can be any length of time. Most bonds have
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 603
❦
In the market for United States Treasury securities, there are three categories of bond
maturities:
• Short term (bills): Maturities between one to five year; (instruments with maturities less
than one year are called Money Market Instruments).
Definition: Coupon
The coupon is the interest rate that the issuer pays to the holder. Usually, this rate is fixed
throughout the life of the bond. It can also vary with a money market index, such as
LIBOR.
The name “coupon” arose because in the past, paper bond certificates were issued which had
coupons attached to them, one for each interest payment. On the due dates the bondholder would
hand in the coupon to a bank in exchange for the interest payment. Interest can be paid at different
frequencies: generally semi-annual, i.e. every six months or annual.
Definition: Yield
❦ ❦
The yield is the rate of return received from investing in the bond. It usually refers either to
• the current yield, or running yield, which is simply the annual interest payment
divided by the current market price of the bond (often the clean price), or to
• the yield to maturity or redemption yield, which is a more useful measure of the
return of the bond, taking into account the current market price, and the amount
and timing of all remaining coupon payments and of the repayment due on matu-
rity. It is equivalent to the internal rate of return of a bond.
The quality of the issue refers to the probability that the bondholders will receive the
amounts promised at the due dates. This will depend on a wide range of factors. High-
yield bonds are bonds that are rated below investment grade by the credit rating agencies.
As these bonds are more risky than investment grade bonds, investors expect to earn a
higher yield. These bonds are also called junk bonds.
The market price of a trade-able bond will be influenced amongst other things by the
amounts, currency and timing of the interest payments and capital repayment due, the
quality of the bond, and the available redemption yield of other comparable bonds which
can be traded in the markets.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 604
❦
On bond markets, the price can be quoted as “clean” or “dirty”. (a “dirty” price includes the
present value of all future cash flows including accrued interest for the actual period. In Europe,
the dirty price is most commonly used. The “clean” price does not include accrued interest, this
price is most often used in the U.S.A.)
The issue price – at which investors buy the bonds when they are first issued – will in many
cases be approximately equal to the nominal amount. This is because the interest rate should be
chosen so that it matches the risk profile of the debtor. This means that the issuer will receive the
issue price, minus all costs related to issuing the bonds. From that moment, the market price of
the bond will change: it may trade “at a premium” (this is a price higher than the issue price, also
referred to as “above par”, usually because market interest rates have fallen since issue), or at a
discount (lower than the issue price, also referred to as “below par”, usually due to higher interest
rates or a deterioration of the credit risk of the issuer).
for a bond that pays annual coupon. where Pbond is the price of the bond (this can also be
❦ understood as the actual price, or in other words, the net present value – see Chapter 30.1.3 ❦
“Discounting” on page 600, and more in particular Equation 30.2 on page 600). CFt is the cash
flow at moment t, and ri the relevant interest rate for an investment of time t. Note that in the
second line, we assume that each moment t correspond exactly with one year. This allows to
simplify the notations.
Digression – Required interest rate
In this formula, we use an interest rate, r. For now, we will assume that this is the risk
free interest rate for the relevant investment horizon for the relevant currency. This has
the – mathematical – advantage that it is a known number (we can find it as the interest
rate on the government bonds). In fact, we should include a risk premium:
r = RRF + RP
where rRF is the risk free interest rate and RP is the risk premium. This is further clarified
in Chapter 30.4 “The Capital Asset Pricing Model (CAPM)” on page 610.
In the aforementioned formulae, we make the simplifying assumption that we are at the
beginning of the year. In reality the price of a bond needs to be adapted as interest is
accrued and when interest rates are changed. This means that the price of a bond will
differ every minute of the day.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 605
❦
The nominal is also called face value, it is the amount that will be paid back at the end.
Consider the following example. A bond with nominal value of $100 pays annually a
coupon of $5 (with the first payment in exactly one year from now), during 4 years and
in the fifth year the debtor will pay the last $5 and the nominal value of $100. What is its
fair price given that the risk free interest rate is 3%?
# bond_value
# Calculates the fair value of a bond
# Arguments:
# time_to_mat -- time to maturity in years
# coupon -- annual coupon in $
# disc_rate -- discount rate (risk free + risk premium)
# nominal -- face value of the bond in $
# Returns:
# the value of the bond in $
bond_value <- function(time_to_mat, coupon, disc_rate, nominal){
value <- 0
# 1/ all coupons
for (t in 1:time_to_mat) {
value <- value + coupon * (1 + disc_rate)^(-t)
}
# 2/ end payment of face value
value <- value + nominal * (1 + disc_rate)^(-time_to_mat)
value
}
❦ # We assume that the required interest rate is the
❦
# risk free interest rate of 3%.
Assume now that the interest rates increase to 3.5%. What is the value now?
As one could expect, the value of the bond is decreased as the interest rates went up. Most
bonds are reasonably safe investments. The emitter has the legal obligation to pay back
the bonds. In case of adverse economic climate that company will not pay dividend but
still will have to pay its bonds.
Of course, companies and governments can default on their debt. When the income is not
sufficient to pay back the dividends and/or nominal values, then the bond issuer defaults on its
bonds. In this section we assume that there is no default risk.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 606
❦
Assume a bond that has pays for the next five years each year one coupon of 5% while the
interest rate is 5% (and the first coupon is due in exactly one year). What is the value of
a bond emission of PLN 1 000? This means that buyer of the bond will see the following
cash flows:
You have just bought the bond and the interest rate drops to 3%. How much do you loose
or win that day?
❦ ❦
You have just bought the bond and the interest rate goes up to 7% in stead of going down.
How much do you loose or win that day?
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 607
❦
Using the same example as aforementioned: a bond with nominal value of $100, an annual
coupon of $5 (with the first payment in exactly one year from now), with a maturity of five years
so that in the fifth year the debtor will pay the last $5 and the nominal value of $100. What is its
Macaulay Duration given that the risk free interest rate is 3%?
In many calculations one will rather use yield to maturity (y) to calculate the PVi , with PVi =
N
i=1 CFi exp {−y ti },and hence:
1
N
MacD = ti CFi exp {−y ti }
V
i=1
The yield to maturity is the internal rate of return (IRR) of the bond, earned by an investor
❦ who buys the bond now at market price and keeps the bond till maturity. Yield to maturity ❦
is the same as the IRR; hence it is the discount rate at which the sum of all future cash
flows from the bond (coupons and principal) is equal to the current price of the bond.
The y is often given in terms of Annual Percentage Rate (APR), but more often market
convention is followed. In a number of major markets (such as gilts) the convention is to
quote annualized yields with semi-annual compounding (see compound interest)
T nominal
y := −1
PV
∂V N N
= CFi exp(−y ti ) = − ti .CFi exp(−y ti ) = −MacD V
∂y
i=1 i=1
Inserting this in the definition of the modified duration shows that M acD = M odD.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 608
❦
However, in most financial markets interest rates are not presented as continuously com-
pounded interest rates but rather as a periodically compounded interest rate (usually annually
compounded). If we write k as the compounding frequency (1 for annual, 2 for semi-annual, 12
for monthly, etc.) and yk the yield to maturity expressed as periodically compounded, then it can
be shown that y
MacD = 1 + k ModD
k
To see this express the value of a bond in function of the periodically compounded interest
rates.
N N
CFi
V (yk ) = PVi = k.t
yk i
i=1 i=1 1+ k
Its Macaulay duration becomes then
N
ti CFi
V (yk ) 1 + ykk
i=1
1 MacD V (yk )
=− (30.7)
V (yk ) 1 + ykk
❦ MacD ❦
= yk (30.8)
1+ k
This calculation shows that the Macaulay duration and the modified duration numerically
will be reasonably close to each other and you will understand why there is sometimes confusion
between the two concepts.
Since the modified duration is a derivative it provides us with a first order estimate of the
price change when the yield changes a small amount. Hence
∆V
≈ −ModD V
∆y
So for a bond with a modified duration of 4% and given a 0.5% interest rate increase, we
can estimate that the bond price will decrease with 2%.
Digression – DV01
In professional markets it is common to use the concept “dollar duration” (DV01). Some-
times, it is also referred to as the “Bloomberg Risk”. It is defined as negative of the deriva-
tive of the value with respect to yield:
∂V
D$ = DV01 =
∂y
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 609
❦
ModD
D$ = DV01 = V
100
which is makes clear that it is expressed in Dollar per percentage point change in yield.
Alternatively, it is not divided by 100, but rather by 10 000 and to express the change
per “base point” (sometimes called “bips”, the base point is simply one hundredth of a
percentage point).
The DV01 is analogous to the delta in derivative pricing (see Chapter 30.7.8 “The Greeks”
on page 664). The DV01 is the ratio of a price change in output (dollars) to unit change in
input (a basis point of yield).
Note that it is the change in price in Dollars, not in percentage. Usually, it is measured
per 1 basis pointa Sometimes, the DV01 is also referred to as the BPV (basis point value)
or “Bloomberg Risk”
BPV = DV01
a That explains its name. The DV01 is short for “dollar value of a 01 percentage point change”.
❦ ❦
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 610
❦
CAPM
30.4 The Capital Asset Pricing Model (CAPM)
Capital Asset Pricing Model
In our naive presentation of the valuation of a bond, we mentioned that the lender will ask an
interest rate that compensates for the risk that the borrower will fail to pay back. At that point, we
did not elaborate on how to calculate that “risk premium”. In fact the interest rate that the lender
will ask is the interest rate of the risk free investment (government bonds) plus a risk premium
adequate for that particular borrower.
A simple and useful framework to estimate a risk premium – as defined in Equation 30.14 on
page 616 – is the Capital Asset Pricing Model (CAPM). It uses the “the market” as reference and
relies on some simplifications that create a workable framework.
• RRF is the risk free rate of interest, such as interest arising from government bonds.
• βk (the beta coefficient) is the sensitivity of the asset returns to market returns, or also
Cov(Rk ,RM )
βk = VAR(R )M
.
• E[RM ] is the expected return of the market.
VAR
• E[RM ] − RRF the market premium or risk premium.
variance
• VAR(RM ) is the variance of the market return.
The CAPM is a model for pricing an individual security or portfolio. For individual securities,
SML we make use of the security market line (SML) and its relation to expected return and systemic
security market line
risk (β ), in order to show how the market must price individual securities in relation to their
security risk class. The SML enables us to calculate the reward-to-risk ratio for any security in
1 All these authors were building on the earlier work of Harry Markowitz on diversification and his Mean
Variance Theory – see Markowitz (1952). Sharpe received the Nobel Memorial Prize in Economics (jointly with
Markowitz and Merton Miller) for this contribution to the field of financial economics.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 611
❦
relation to the reward-to-risk ration of the overall market. Therefore, when the expected rate of
return for any security is deflated by its beta coefficient, the reward-to-risk ratio for any individual
security in the market is equal to the market reward-to-risk ratio. For any security k:
Restated in terms of risk premium:
which states that the individual risk premium equals the market premium times beta.
The CAPM provides a framework that can be used to calculate the value of a company. The
caveat is that the framework is self recursive. The riskiness of the company determines the beta,
that in its turn influences the required rate of return, that is then used to calculate the value of
the company . . . using the beta.
Below we provide some examples, to illustrate how the CAPM can be used to calculate the
value of a company.
Example: Company A
The company “A Plc.” has a β of 1.25, the market return is 10% and the risk free return is
2%.
What is the expected return for that company?
Since beta and RRF are given, we can use the CAPM to calculate the required rate of return
❦ RA as follows: ❦
E[RA ] = RRF + βA (E[RM ] − RRF )
Example: Company B
The company “B” has a β of 0.75 and all other parameters are the same (the market return
is 10% and the risk free return is 2%).
What is the expected return for that company?
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 612
❦
2. Unsystematic risk, idiosyncratic risk or diversifiable risk: the risk of individual assets. Unsys-
tematic risk can be reduced by diversifying the portfolio (specific risks “average out”).
• A rational investor should not take on any diversifiable risks: — therefore the required return
on an asset (i.e. the return that compensates for risk taken), must be linked to its riskiness
in a portfolio context – i.e. its contribution to the portfolio’s overall riskiness — as opposed
to its “stand-alone riskiness.”
• In CAPM, portfolio risk is represented by variance. — therefore the beta of the portfolio is
the defining factor in rewarding the systematic exposure taken by an investor.
❦ ❦
• The CAPM assumes that the volatility-return profile of a portfolio can be optimized as in
Mean Variance Theory.
• Because the unsystematic risk is diversifiable, the total risk of a portfolio can be viewed as
beta.
2. have a stable utility function (does not depend on the level of wealth),
5. do not care about other live goals apart from money (investments are a life goal in their
own right and do not serve to cover other liabilities or goals),
7. are able to lend and borrow under the risk free rate of interest with no limitations,
2 The original paper of Artzner et al. (1997) is quite good, but we can also refer to De Brouwer (2012) for a complete
and slower paced introduction to risk measures and coherent risk measures in particular.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 613
❦
10. deal with securities that are all highly divisible into small units, and
11. assume all information is at the same time available to all investors.
The CAPM methods also holds important conclusions for the construction of investment
portfolios and there is a lot more to be said about the limitations. For a more complete
treatment we refer to De Brouwer (2012).
❦ ❦
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 614
❦
30.5 Equities
Equity (or share) refers to a title of ownership in a company. The equity holder (owner) typically
gets a fair share in the dividend and voting rights in the shareholders meeting. Shares represent
ownership in a company.
30.5.1 Definition
Definition: Stock, shares and equity
There are some different classes of shares that are quite different.
1. Common stock usually entitles the owner to vote at shareholders’ meetings and to receive
dividends.
2. Preferred stock generally does not have voting rights, but has a higher claim on assets
and earnings than the common shares. For example, owners of preferred stock receive div-
idends before common shareholders and have priority in the event that a company goes
bankrupt and is liquidated.
❦ ❦
Digression – Local use of definitions
In some jurisdictions such as the United Kingdom, Republic of Ireland, South Africa, and
Australia, stock can also refer to other financial instruments such as government bonds.
• Roman Republic, the state outsourced many of its services to private companies. These
government contractors were called publicani, or societas publicanorum (as individual com-
pany). These companies issued shares called partes (for large cooperatives) and particulae
for the smaller ones.3
• ca. 1250: 96 shares of the Société des Moulins du Bazacle were traded (with varying price)
in Toulouse
3 Sources: Polybius (ca. 200—118 BC) mentions that “almost every citizen” participated in the government leases.
Marcus Tullius Cicero (03/01/-106 — 07/12/-43) mentions “partes illo tempore carissimae” (this translates to “share
that had a very high price at that time,” and these words provide early evidence for price fluctuations)
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 615
❦
• 31/12/1600: the East India Company was granted the Royal Charter by Elizabeth I and
became the earliest recognized joint-stock company in modern times.4
• 1602: saw the birth of the stock exchange: the “Vereenigde Oostindische Compagnie”
issued shares that were traded on the Amsterdam Stock Exchange. The invention of the
joint stock company and the stock exchanges allowed to pool risk and resources much
more efficiently and capital could be gathered faster for more expensive enterprises. The
trade with the Indies could really start and bring wealth to the most successful countries.
Soon England and Holland would become superpowers as the sea.
– stock futures,
– stock options,
– short selling,
– credit to purchase stock (margin trading or “trading on a margin”),
– . . . and famously gave rise to the first market crash “the Tulipomania” in 1637 –
Mackay (1841)
1. Absolute value models try to predict future cash flows and then discount them back to the
present value. The Dividend Discount Model – see Chapter 30.5.4.1 “Dividend Discount
Model (DDM)” on page 616 – and the Free Cash Flow Method – see Chapter 30.5.4.2 “Free
Cash Flow (FCF)” on page 620 are examples of absolute value models.
2. Relative value models rely on the collective wisdom of the financial markets and determine
the value based on the observation of market prices of similar assets. Relative value models
are discussed in Chapter 30.5.5 “Relative Value Models” on page 625
• market value: The value that “the market” is willing to pay (the price on the stock exchange).
• fair value: If there is no market price for this asset, but we can determine its value based on
other market prices then we call this the fair value.
all trade in the East Indies. This allowed it to acquire auxiliary governmental and military functions and virtually
rule the East Indies.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 616
❦
For instance, when an analyst believes a stock’s intrinsic value is greater (less) than its market
price, an analyst makes a “buy” (“sell”) recommendation. Moreover, an asset’s intrinsic value
may be subject to personal opinion and vary among analysts.
Since a share will only yield dividend, it stands to reason that the value of a share today is the
discounted value of all those dividends. Hence, a simple model for the price of a share can be just
this: the present value of all cash-flows.
N
CFt
Pequity = (30.12)
(1 + r)t
t=0
∞
Dt
= (30.13)
(1 + r)t
t=0
with D = dividend.
This model is the Dividend Discount Model, and we will come back to it later, because
although the model is concise and logical, it has two variables that will be hard to quantify.
RP 1. The discount rate should also include the risk. It is easy to write that the discount rate in
risk premium
Equation 30.13 should be the risk free interest rate plus a risk premium
r = RRF + RP (30.14)
RRF with RRF the risk free interest rate and RP the risk premium. The real problem is only
risk free interest rate deferred: finding the risk premium that is appropriate for the assets being valued.
❦ 2. The dividends themselves are unknown. In fact, we should estimate an infinite series of ❦
future dividends.
In order to make sense of this simple model, we need first to find methods to find good estimates
for the risk premium and the dividends.
DDM Theorem 30.5.1 (DDM). The value of a stock is given by the discounted stream of dividends:
dividend discount model
∞
Dt
V0 =
(1 + r)t
t=1
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 617
❦
Capital gains appear as expected sales value and are derived from expected dividend income.
r is the capitalization rate and is the same as E[Rk ] in the CAPM, see Equation 30.10 on page 610
In this section, we will develop this idea further. We start from the most simple case, where
we assume that the dividend will increase at a constant and given rate g . This model is called the
constant growth DDM.
CGDDM
Constant Growth DDM (CGDDM) constant growth dividend
discount model
A particularly simple yet powerful idea is to assume that the dividends grow at a constant rate. It
might not be the most sophisticated method, but is a good zero-hypothesis and it leads to elegant
results that can be used as a rule of thumb when a quick price estimate is needed or when one
tries to assess if the result of a more complex model makes sense.
If every year the dividend increases with the same percentage, 100g% (with g the growth rate),
then each dividend can be written in function of the previous one.
❦ D1 = D0 (1 + g) ❦
D2 = D1 (1 + g) = D0 (1 + g)2
...
Dn = Dn−1 (1 + g) = D0 (1 + g)n
Theorem 30.5.2 (constant-growth DDM). Assume that ∀t : Dt = D0 (1 + g)t , then the DDM
collapses to
D0 (1 + g) D1
V0 = =
r−g r−g
This model is very simple and the following examples illustrate how to calculate a reasonable
approximation of a the value of a private company. In those examples, we use a fictional company
that has the name “ABCD.”
Note, that in all those examples, we need to calculate first the required rate of return
for that given company (denoted RABCD ). To approximate this, we use the CAPM (see
Chapter 30.4 “The Capital Asset Pricing Model (CAPM)” on page 610). This, requires of
course the assumptions that the beta is endogenous (which is not entirely correct, but a
good approximation in many cases).
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 618
❦
Consider the company, ABCD. It pays now a dividend of = C10 and we believe that the
dividend will grow at 0% per year. The risk free rate (on any horizon) is 1%, and the market
risk premium is 5% and the β is 1. What is the intrinsic value of the company?
D0 (1+g)
Using the CGDDM, V0 = RABCD −g , and the CAPM RABCD = RRF + β.RPM we get:
The company value that assumes a zero growth rate is called the “no-growth-value.” Unless
when the outlook is really bleak, this is seldom a valid assumption. The concept of investing in a
company and hence accepting its increased risk, depends on the fact that it should – on average
– be a better investment than investing in bonds or safer interest bearing products.
The difference in value compared to the previous example is called the PVGO (present value
of growth opportunities). So,
Since the company is more risky and all other things remained the same, the price must
be lower. For the price to be the same, investors would expect higher returns.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 619
❦
Simply adding that growth rate in the formula leads to impossible results:
A companies equity can never become negative, because the equity holder is only liable
up to the invested amount. Therefore this result is not possible: it indicates a situation
where the DDM fails (the growth rate cannot be larger than the required rate of return).
This example illustrates that the DDM is only valid for dividend growth rates smaller than the
required rate of return. A company that would grow faster, would be deemed to be more risky and
hence would have a higher beta, leading to a higher discount rate. The model states that anything
above that is unsustainable and will lead to a correction.
When buying a company, it is not realistic to expect it to grow eternaly at the same rate. The
growth rate can be assumed to follow certain patterns to match economic cycles or the investor
might even assume zero growth after 20 years – simply as a rule of thumb to avoid overpaying.
❦ To better understand how the growth rate related to certain accounting values, we introduce the ❦
following concepts.
Definition: earnings
E
E := net income
E
ROE
ROE := P
Note
Note that all definitions work as well per share as for the company as a whole! We will
use all concepts per share unless otherwise stated.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 620
❦
The growth rate of the dividend is the amount of ROE that is not paid out as dividend, hence
This is because if the company retains x% earnings, then the next dividend will be x% higher.
More generally:
reinvested earnings reinvested earnings TE
g= =
BV TE BV
BV where BV stands for “book value”, and TE is the “total earnings” (note that this is similar to the
book value concept of total earnings in the context of personal income).
Hence, – as mentioned earlier –
• it only makes assumptions about the outcome (dividend) and not the thousands of variables
that influence this variable.
• to find a good discount rate (which is complex and actually circular), and hence
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 621
❦
• add again depreciations and amortizations because these are no cash outflows,5
• reduce by changes in working capital (if the working capital increased, this means that the
company needed more cash to operate and this will reduce the owner earnings), and
• reduce by changes in capital expenses, because these costs really reduce liquidity (these are
of course linked to the amortizations and depreciations).
There are multiple ways to calculate the FCF, depending on the data available one could choose
for example the following format.
where NOPAT stands for “net operational profit after tax”, PAT for “profit after tax”, WC for NOPAT
“working capital”, CapEx for “capital expenses” (expenses that get booked as capital), and τ is the WC
tax rate. CapEx
For dividend estimations, it is useful to use the concept “Net free cash Flow”. This is the
FCF available for the company to maintain operations without making more debt, its definition
also allows for cash available to pay off the company’s short term debt and should also take into
account any dividends.
Net Free Cash Flow = Operation Cash flow Net Free Cash Flow
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 622
❦
Here, the Capex definition should not include additional investment on new equipment.
However, maintenance cost can be added. Further, we should consider the following.
• Dividends: This will be base dividend that the company intends to distribute to its share
holders.
• Current portion of LTD: This will be minimum debt that the company needs to pay in order
to not default.
• Depreciation: This should be taken out since this will account for future investment for
replacing the current PPE.
Net Free Cash Flow is a useful measure for the management of a company but we will not need
it for company valuation.
Discounted cash flow (DCF) is a method of valuing a company, project, or any other asset
by discounting future cash flow to today’s value (in other words: using the time value of
money) and then summing them.
In practice, future cash flows are first estimated and then discounted by using the relevant cost
of capital to give their present values (PVs). The sum of all future cash flows, is the then called
NPV, which is taken as the value or price of the cash flows in question. We remind the definition
of NPV
NPV
Definition: NPV
The Net Present Value (NPV) is then the sum of all present values and represents today’s
value of the asset.
The DCF model in company valuation is simply calculating the NPV of the companies FCF.
Essentially, the DCF model is the sum of all future cash flows discounted for each moment t,
and can hence be written as follows:
CFt
PV := t
(1 + rt )
In that model we will substitute both cash flow CFt and required rate of return rt as follows.
• CF by FCF, because that is the relevant cash flow for the potential buyer of the company.
• r by WACC, because the company should at least make good for compensating its capital.
needs
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 623
❦
The DCF method for company valuation has the following distinct advantages.
• Easy to understand.
• DCF allows for the most detailed view on the company’s business model.
❦ ❦
• It can be used to model synergies and/or influence on the company’s strategy.
• One needs to forecast an infinite amount of FCFs (and therefore, one needs to model the
whole balance sheet).
• Therefore, one needs many assumptions (costs, inflation, labour costs, sales, etc.)
• One needs to find a good discount rate (which is complex and actually circular).
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 624
❦
This method only considers the assets and liabilities of the business. At a minimum, a solvent
company could shut down operations, sell off the assets, and pay the creditors. The money that is
left can then be distributed to the shareholders and hence can be considered as the value of the
company.
Of course, companies are supposed to grow and create value, hence this method is rather
a good floor value for the company. In general, the discounted cash flows of a well-performing
company exceeds this floor value.
Zombie companies that needs subsidiaries to survive for example that own many tangible
assets might be worth more when liquidated than when operations are continued.
This method is probably a good alternative for valuing non-profit organisations, because gen-
erating profit (cash flow) is not the main purpose of these companies.
❦ Further, it is essential to consider the purpose of this type of valuation. If it is really the idea to ❦
liquidation cost stop trading and liquidate the company, in that case one will have to add the liquidation cost. For
example, selling an asset might involve costs to market it, have it valued, maintain it till it sold,
store it (e.g. keep a boat in a harbour), etc.
The time scale becomes also relevant here. If one needs urgently the cash then the expected
price will be lower, but also the cost of storing and maintaining the asset might be lower.
Depending on the purpose of the valuation one will also have to choose what value to consider:
book value or market value? For example, a used car might be worth 0 in the books but still can
be sold for good money.
Investment Funds
Investment funds are a very specific type of companies that are created with the sole purpose to
invest in other financial assets.6 The most common type of investment funds will invest in liquid
assets and not try to influence the management of the company.
Investment funds can invest in all other financial assets that are explained in this chapter, and
can do much more. For example, an investment fund can invest in real estate, labour ground, or
eventually actively play a role in infrastructure works.
UCITS The investment funds that are the most relevant for the investor that wants to save money for
Undertaking for Collective
retirement – for example – are the liquid investment funds that are investing in the assets that
Investments in Transferable
Securities are explained in this chapter. In Europe they are known as UCITS (Undertaking for Collective
Investments in Transferable Securities) and regulated by the UCITS IV regulations.
6 In that sense investment funds are comparable to holding companies, which have as sole purpose to invest in
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 625
❦
UCITS will never invest in their own shares. They have a variable capital and buying shares is
considered as “redeeming” shares. This means that in stead of buying and holding its own shares
these shares stop to exist.
In the same way the fund can create new shares when more people want to buy the fund.
Typically, there will be a market maker to facilitate this process. Market Maker
While for investment funds and bankrupted companies the NAV method is the only method nec-
essary, this method will not give an appropriate picture of normal operating companies.
Again, there are some advantages that stand out.
• Easy to understand.
• The NAV is only the lower limit of the real value (and hence merely a reality check) and
for normal companies it misses the point of a valuation.
The intrinsic value is the true fair value of a company. However, this is not always equal to what
someone else is willing to pay for it. The amount that others are willing to pay is the market value.
Definition: Price or market value
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 626
❦
Definition: Value
A short-cut: the price is the consensus of the market about the value.
Definition: Market capitalization
The market capitalisation (often shortened to “market cap”) is the total value of all out-
standing stocks at the market price. It is the value of the company as fixed by the market.
If you have ever bought or sold a house or an apartment, you will know how this process
works. Typically, when one plans to buy a property, one will scan the market for suitable prop-
erties. Suitable properties are those that are in a certain location, have the required number of
rooms, etc.
Comparing the price per square meter makes a lot of sense, but it is not the whole story. The
neighbourhood, quality of the property, age, willingness to sell, etc. will all play an important role.
Similarly, we can compare companies via the price earnings ratio, but many other elements
will need to be taken into account.
Using this definition and rearranging Equation 30.19 on page 620 shows that:
\begin{align*}
\frac{V_0}{E_1} &= \frac{1}{r} + \frac{PVGO}{E_1} && (30.22)\\
                &= \frac{DPR}{r - ROE \times PBR} && (30.23)\\
                &= \frac{DPR}{r - g} && (30.24)\\
                &= \frac{1 - PBR}{r - g} && (30.25)
\end{align*}
where we recall that DPR is the dividend payout ratio, PBR the plow-back ratio, PVGO the present value of growth opportunities, r the required discount rate, and g the growth rate.
These formulae make clear that the PE ratio is lower for riskier firms (a higher required return r pushes the PE down) and higher for firms with a higher ROE (for a given plow-back ratio, a higher ROE increases the growth rate g and hence the PE).
If $PVGO = 0$, then Equation 30.22 shows that $V_0 = \frac{E_1}{r}$: the stock is then valued as a perpetual bond with coupon $E_1$, and the PE ratio is then $\frac{1}{r}$. Further, one will remark that if $g = 0$, then $E_1 = E_0$ and hence $PE = V_0/E_0$.
• Accounting details: The earnings are sourced from the accounting system, and it is worth investigating how this particular company applies the various accounting rules and which rules apply.
• Earnings management: The management has some freedom within the guidelines of the accounting rules to show more or less profit in the short term.
• Economic cycles: Economic cycles might influence the earnings of companies in various ways, and patterns can change when the economic environment changes (e.g. company A can do relatively better than B in economic downturns, while otherwise B does better in the same market).
• Estimation of the first future earnings: The formula tells us to use $E_1$, but that value is not known yet. In practice, one uses the earnings of the previous accounting year, $E_{-1}$.
• Circular reasoning in value and riskiness: The PE ratio includes the future growth potential and the riskiness in one measure; hence it is extremely important to compare only with companies that share similar potential and risk (e.g. companies of the same sector in the same country).
• Short-term fluctuations for a long-term estimate: For the same reason – that PE ratios include the future growth potential and the riskiness in one measure – they will jump up in the short term when the economic cycle is at its low.
The book value (BV) is the value of the company as per accounting standards – in other words, the size of the balance sheet. The advantage of using the book value is of course its stability; the downside is that accounting rules are designed to collect taxes. That makes accounting rules inherently backward looking, while company valuation is essentially forward looking.
The list of possible ratios to use is only limited by your imagination and can differ between companies. For example, a steel producer can build up a stock of completed products; in a bank that works differently: a bank needs assets to compensate for risk.
Definition: Price-to-cash-flow ratio (PTCF)

Sales are important for any company and easy to isolate in the financial statements.

Definition: Price-to-sales ratio (PTS)
These ratios are necessarily a small sample of what is used in practice; it is not really possible to provide an exhaustive list. For each sector, country or special situation other measures might be useful. Actually, all measures that we have presented in Chapter 28 "Financial Accounting (FA)" on page 567 can also be used to calculate the value of a company. It is even possible to define your own ratio that makes most sense to you and the special situation that you are dealing with.
Already in Section 28.4 "Selected Financial Ratios" on page 575, we discussed some ratios that help the managers of the company to manage value. They are different from the ratios discussed in Section 30.5.5.2 "The Price Earnings Ratio (PE)" on page 626, because they cannot directly be used to calculate the value of a company. However, they are indirectly linked to company value and can be used to amend the value of a company or to gain more insight into how the company is doing compared to its competitors.
As a data scientist in a commercial company, you are very likely to encounter these ratios, or you might be in a position to include them in a report.
We remind that CA stands for current assets and CL for current liabilities.
The ROIC only includes the equity capital employed. So, while for the investors that is the most important, the management can also raise capital via debt. Therefore, it is customary to also use ROCE. ROCE uses the total capital employed by the company (this is the sum of debt and equity). Another difference is that ROCE is a pre-tax measure, whereas ROIC is an after-tax measure. So ROCE measures the effectiveness of a company as the profit exceeding the cost of capital.
$$ROE = \frac{\text{Net Income}_t}{\text{equity}} = \frac{\text{Net Income}}{S} \times \frac{S}{TA} \times \frac{TA}{\text{equity}} \quad \text{(DuPont Formula)} \quad = \frac{NOPAT}{\text{equity}}$$
ROE shows how profitable a business is for the investor/shareholder/owner, because the
denominator is simply shareholders’ equity. ROIC and ROCE show the overall profitability of
the business (and for the business) because the denominator includes debt in addition to equity
(which is also capital employed, but not necessarily provided by the owner).
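To make the DuPont decomposition concrete, here is a minimal R sketch with illustrative numbers (not taken from the text):

# DuPont decomposition of ROE (illustrative figures)
net_income   <- 10
sales        <- 200
total_assets <- 150
equity       <- 50
profit_margin  <- net_income / sales       # profitability
asset_turnover <- sales / total_assets     # efficiency
leverage       <- total_assets / equity    # financial leverage
ROE_dupont <- profit_margin * asset_turnover * leverage
ROE_direct <- net_income / equity
c(ROE_dupont = ROE_dupont, ROE_direct = ROE_direct)   # both equal 0.2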
ROE and ROCE will differ widely in businesses that employ a lot of leverage. Banks, for example, earn a very low return on assets because they earn only a small spread (e.g. borrow at 0.5%, lend at 3.5%). Regular savings banks have the majority of their capital structure in depositors' money (i.e. low-interest-bearing debt), and this leverage magnifies their returns compared to equity. It is typical for banks to have a low ROCE but a high ROE.
An important footnote is that in the denominator of ROE one will find the book value of the equity (of course, one might make the calculation with the market value). However, that is not necessarily the most important reference for the investor. The investor might have his or her own book value, purchase price, or another price as a reference.
Economic Value Added (EVA) is an estimate of the company’s economic profit (the value
created in excess of the required return of the company’s shareholders). In other words,
EVA is the net profit less the opportunity cost for the firm’s capital.
Market value added (MVA) is the difference between the company's current market value and the capital contributed by investors:
$$MVA = V_{market} - K$$
with $V_{market}$ the market value and $K$ the capital paid by investors.
If a company has a positive MVA, this means that it has created value (in case of a negative MVA it has destroyed value). However, to determine whether the company has been a good investment, one has to compare the return on the invested capital with the return of the market ($r_M$), adjusted for the relative risk of that company (its $\beta$).
The MVA is the present value of the series of EVA values. MVA is economically equivalent to the traditional NPV measure of worth for evaluating an after-tax cash flow profile of a company if the cost of capital is used for discounting.
$$MVA = \sum_{t=0}^{\infty} \frac{EVA_t}{(1 + WACC)^t}$$
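A minimal sketch of this relationship in R, for a finite horizon and made-up EVA figures (a truncated approximation of the infinite sum above):

# MVA as the discounted sum of (projected) EVA values
EVA  <- c(10, 12, 13, 14, 15)     # illustrative EVA forecast
WACC <- 0.08
t    <- seq_along(EVA) - 1        # periods 0, 1, 2, ...
MVA  <- sum(EVA / (1 + WACC)^t)
MVA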
There is no best measure, and all measures have to be used with care. It is important to consider what one wants to obtain before making a choice. It is very different, for example, to make a comparative analysis within one sector or to compare different sectors.
1. Is it my purpose to buy the company and stop its activities, did it already stop trading, or is it an investment fund? – If yes, use the NAV method (in all other cases the NAV should be the lower limit). If not, then continue to the next question.
2. Will you be an important shareholder and can you make a business plan? – If yes, try to use DCF; if not, continue.
3. Do you have the option not to invest? – If yes, use DDM; otherwise continue.
4. If you ended up here, this means that you have to invest anyhow in similar stocks (e.g. you are an equity fund manager and need to follow your benchmark). – In this case, you might want to use a relative value method.
• Short history: It might be easier to forecast mature companies with a long, stable history. Buying a company with only a few years of history is a leap of faith.
• Management differences: Will you attribute cash differently? Is the salary that the owner did (not) take relevant for your case? Etc.
There are many other factors to consider. For example: when identifying the non-operating
assets in the accountancy of a company one needs to be careful to avoid double counting in valu-
ation.
In order to gain some insight into how robust a certain valuation result is, or what bad cases can be expected, a simple stress test can answer that question.
A simple example could be: allow the price of certain raw materials to fluctuate (simply test a few possibilities), then do the same with labour prices, allow for the effect of a strike, an earthquake, fluctuations in exchange rates, one of the lenders getting into problems, having to halt digging because we stumbled upon a site of historic importance, etc.
Soon, one of the problems with stress testing becomes obvious: it becomes bewildering how many possibilities there are, it is impossible to say which is more probable than the other, etc.
The answer to that shortcoming is simply to restrict stress testing to what it does best: exploring extreme risks – without knowing how likely they are. So, for example, assume that we are building an airport and an earthquake destroys a lot of the half-built site, kills a few people, causes a strike of our crew, and creates a negative climate that makes the currency plunge, which in turn pushes the domestic bank in the syndicate into problems, etc. Then we have just one scenario, something that we can calculate with a spreadsheet and that gives us a "worst case scenario."
The relevance for each investor is that he should ask the question "can I afford to lose that much?" If the answer is "no," then the investor should seek another partner in the syndicate in order to diversify risks. Failing to do so is planning for disaster.
In order to do that in practice, a spreadsheet might still be sufficient; however, it might be advisable to follow a few simple rules to keep it organized. For example:
• Use different tabs (sheets) for (i) assumptions, (ii) costs, (iii) income, (iv) P&L (profit and loss), and (v) ratios.
• Make sure that each sheet has the same columns (they are your time axis).
• Use different colours to make the different function of each cell clear: for example, pale yellow for an input cell, no background for the result of a calculation, etc.
• Avoid – where possible – obscure formulae that are difficult to read for humans.
• Do use the underlying programming language (Visual Basic, for example) as much as possible, and never ever use macros (macros are very difficult to read for other humans, not reusable, slow, and confusing).
• Keep different versions, have frequent team meetings when working on one file, and agree who will modify what.
Following these simple rules will help you to make rather complex models in the simple spreadsheet that a modern computer offers. If you find that the spreadsheet becomes difficult to read or slow, we suggest having a look at the alternatives presented on page 634.
With "something sensible" we mean that we know something about the likelihood of something happening. We might not know the exact distribution, but at least some probability. For example, we might expect an earthquake of force 4 to happen once in 1 000 years. This simple number is far less than knowing the probability density function, but it can already work.
In that case, we would have a 0.000 083 probability each month that such an earthquake would occur. However, if it occurs, then the knock-on effects will be significant for the project: damage, delays, other problems in the region needing attention, etc. It is here that the limitations of a spreadsheet become all too clear. It becomes impossible to model correctly the effect of such events, not only because of the interdependence with other parameters, but also in time. If such an event occurred, is it then more or less likely to happen again? Some effects will be immediate (for example, if the currency drops 20% with respect to the currency that we use to pay for a certain material or service, then that service or material is immediately more expensive). This can still be modelled in a spreadsheet, but in the realistic case with the earthquake one must take into account a whole different scenario for the rest of the project, and that becomes almost impossible and at least very convoluted.
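To make this concrete, the monthly probability quoted above can at least be simulated directly; the following minimal sketch only counts occurrences over a hypothetical ten-year project and ignores all the knock-on effects discussed:

# Simulate monthly earthquake occurrence over a 10-year project
set.seed(1890)
p_month  <- 1 / (1000 * 12)    # once in 1 000 years, so ca. 0.000083 per month
n_months <- 120
n_sims   <- 10000
quakes   <- matrix(rbinom(n_months * n_sims, size = 1, prob = p_month),
                   nrow = n_sims)
mean(rowSums(quakes) > 0)      # fraction of simulated projects hit (ca. 1%)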
The alternative is to use a programming language that allows us to model anything. Best suited for large projects are languages that allow for some object oriented code. We can use the features of an object oriented programming language to represent actors and input in our project. For example, the engineering company can be one "object," and it will decide to hedge currency risk if the exchange rate hits a certain barrier, etc.
This allows us to model dependencies such as in our example with the earthquake. If the earthquake happened, then other objects can "see" that and react accordingly: the exchange rate (also an object) will switch regime (i.e. draw its result from a different distribution), the workers can see the impact on the safety conditions and consider a strike with a given probability, etc. This way of working is not so far removed from the way modern computer games work.
Good examples of programming languages that allow vast amounts of complex calculations are C++ and R. The high level of abstraction offered by object oriented programming languages allows the programmer to create objects that can interact with each other and their environment. For example, the Engineering Company can be such an object. That object can be instructed to employ more workers when a delay threatens to happen, but only up to the limit where the extra costs are offset by the potential penalties. As the simulation then runs, market parameters change and events happen according to their probability of occurrence, and each object will then interact in a pre-programmed or stochastic way.
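To give a flavour of this, the following is a minimal sketch in R – with hypothetical class and field names, not code from an actual project – using Reference Classes so that objects carry a mutable state that other objects can react to:

# A hypothetical exchange-rate object that switches regime after a shock
ExchangeRate <- setRefClass("ExchangeRate",
  fields  = list(rate = "numeric", crisis = "logical"),
  methods = list(
    initialize = function(...) {
      rate   <<- 1.0
      crisis <<- FALSE
      callSuper(...)
    },
    shock = function() { crisis <<- TRUE },   # e.g. triggered by the earthquake object
    step  = function() {
      # draw the monthly move from a regime-dependent distribution
      sigma <- if (crisis) 0.10 else 0.02
      rate  <<- rate * exp(rnorm(1, mean = 0, sd = sigma))
      invisible(rate)
    }
  )
)

fx <- ExchangeRate$new()
fx$shock()    # another object in the simulation flips the regime
fx$step()     # subsequent moves are drawn from the crisis distribution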
This allows very complex behaviour and dependencies to be modelled, yet everything will be in a logical place and any other programmer can read it as a book. On top of that, there are good free solutions to create a professional documentation set with little effort. For example, Doxygen (see http://www.doxygen.org) is free and able to create both an interactive website as well as a LaTeX7 book for the documentation that details each class, function, handle, property, etc. Code written in such a way and documented properly is not only easy to maintain, but also straightforward to audit.
7 LaTeX is a high-quality typesetting system; it includes a large set of features designed for the production of technical and scientific documentation. LaTeX is the de facto standard for the communication and publication of scientific documents. LaTeX is available as free software in the repositories of your distribution and at http://www.latex-project.org. Information on how to link it with R is in Chapter 33 "knitr and LaTeX" on page 703.
Now that we have a good idea of what the distribution of the results will look like, we can use this distribution to calculate the relevant risk parameters. In many cases, the "historic" distribution that we got from our Monte Carlo simulation will be usable; however, for large and complex projects the distribution might not be very smooth. If we believe that this is a sign of the limited number of simulations, then we can try to apply a kernel estimation in order to obtain a smoother result that yields more robust risk parameters.
The technique of kernel density estimation (KDE) can be helpful for all distributions that are estimated from a histogram. As an alternative to parametric estimation, where one infers a certain distribution, it avoids the strong assumption that the data indeed follows that given distribution. Note that a KDE can also be used for any input parameter whose distribution is based on observations.
Of course, one can choose a standard distribution if we have reasons to assume that this would be a good approximation. However, choosing a non-parametric KDE has the advantage of avoiding any assumptions about the distribution, and on top of that:
• It is well documented in the case of expected shortfall – see e.g. Scaillet (2004), Chen (2008),
Scaillet (2005), and Bertsimas et al. (2004).
• There is research on its sensitivity with respect to the portfolio composition, w – see e.g.
Scaillet (2004), Fermanian and Scaillet (2005).
Using a non-parametric KDE, however, requires one arbitrary parameter: "the bandwidth." The bandwidth is a parameter that is related to the degree to which the data sample is representative of the real underlying distribution. If one chooses this parameter too small, one forces the estimated distribution function, $f_{est}$, to stick too closely to the data, and there is too little of a smoothing effect. If, on the other hand, the parameter is insufficiently restrictive, then $f_{est}$ will be smeared out over an area that is too large.8 More information on bandwidth selection can be found in Jones et al. (1996b).
Of course, one can ask if it is necessary at all to use a kernel estimation instead of working with the histogram obtained from the data. Using the histogram as pdf has a few disadvantages:
• It is not smooth (this observation tells us that the use of histograms is similar to noticing that the dataset is imperfect and not doing anything about it).
• It depends on the end points of the bins that are used (changing the end points can dramatically change the shape of the histogram).
• It depends on the width of the bins (this parameter can also change the shape of the histogram).
• It introduces two arbitrary parameters: the start point of the first bin and the width of the bins.
An answer to the first two points (and half of the last point) is to use a kernel density estimation (KDE). In that procedure, a certain function is centred around each data point (for example, an indicator function, a Gaussian distribution, the top of a cosine, etc.); these functions are then
8 Note that we do not use the usual notation for the estimated distribution density function, $\hat{f}$, because we have
summed to form the estimator of the density function. The KDE is currently the most popular method for non-parametric density estimation – see e.g. the following books: Scott (2015), Wand and Jones (1994), and Simonoff (2012).
This method consists in estimating the real (but unknown) density function $f(x)$ with
$$f_{est}(x; h) = \frac{1}{N}\sum_{n=1}^{N} K_h(x - x_n) = \frac{1}{Nh}\sum_{n=1}^{N} K\left(\frac{x - x_n}{h}\right) \qquad (30.29)$$
Definition: Kernel

If $K$ is a kernel, then $K^*(u) := \frac{1}{h} K\left(\frac{u}{h}\right)$ (with $h > 0$) is also a kernel. This introduces an elegant way to use $h$ as a smoothing parameter, often called "the bandwidth."
This method was hinted at by Rosenblatt et al. (1956) and further developed into its present form by Parzen (1962). The method is thus also known as the "Parzen–Rosenblatt window method."
The Epanechnikov kernel (see Epanechnikov (1969)) is optimal in a minimum variance sense. However, it has been shown by Wand and Jones (1994) that the loss of efficiency is minimal for the Gaussian, triangular, biweight, triweight, and uniform kernels.
Two of those kernels are illustrated in Figure 30.1 on page 636.
If an underlying pdf exists, kernel density estimations have some distinct advantages over histograms: they can offer a smooth density function for an appropriate kernel and bandwidth, and the end points of the bins are no longer an arbitrary parameter (hence we have one arbitrary parameter less, but the bandwidth remains an arbitrary parameter).
We also note that Scott (1979) proves the statistical inferiority of histograms compared to a
Gaussian kernel with the aid of Monte Carlo simulations. This inferiority of histograms is mea-
sured in the L2 norm, usually referred to as the “mean integrated squared error” (MISE), which
is defined as follows.
$$MISE(h) = \mathbb{E}\left[\int_{-\infty}^{+\infty} \left\{f_{est}(x; h) - f(x)\right\}^2 \,\mathrm{d}x\right] \qquad (30.30)$$
A variant of this, the AMISE (asymptotic version), can also be defined, and this allows us to
write an explicit form of the optimal bandwidth, h. Both measures have their relevance in testing
a specific bandwidth selection method. However, for our purpose these formulae cannot be used
since they contain the unknown density function f (x). Many alternatives have been proposed
and many comparative studies have been carried out. A first heuristic was called “cross validation
selectors” – see Rudemo (1982), Bowman (1984), and Hall et al. (1992). Sheather and Jones (1991)
developed “plug-in selectors” and showed their theoretical and practical advantages over existing
methods, as well as their reliable performance. A good overview is in Jones et al. (1996a).
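All of this is readily available in R through the density() function, which supports – among others – the Epanechnikov kernel and the Sheather–Jones plug-in bandwidth. A minimal sketch on simulated data (the returns are made up here):

# Kernel density estimation with an Epanechnikov kernel and
# the Sheather-Jones plug-in bandwidth
set.seed(1890)
x <- rnorm(1000, mean = 0.04, sd = 0.15)    # stand-in for annual returns
d <- density(x, kernel = "epanechnikov", bw = "SJ")
plot(d, main = "Epanechnikov KDE, Sheather-Jones bandwidth")
rug(x)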
In Figure 30.2 on page 636, we show how the histogram and the Epanechnikov KDE differ.
Figure 30.1: The Epanechnikov kernel (left), $K_h^E(u) = \frac{3}{4h}\left(1 - \left(\frac{u}{h}\right)^2\right)\mathbf{1}_{\{|u/h| \le 1\}}$, for $h = 1$; and the Gaussian kernel (right), $K_h^G(u) = \frac{1}{\sqrt{2\pi}\,h}\, e^{-\frac{u^2}{h^2}}$, for $h = 0.5$.
Figure 30.2: As an illustration of how the Epanechnikov kernel estimation works, the upper graph presents the histogram of the annual inflation-corrected returns of standard asset classes. The lower graph offers a view of what a non-parametric kernel density estimation on those data can do.
Conclusion
Kernel estimation is a widely accepted and used method that has many advantages. However, it introduces the arbitrary choice of the bandwidth and of the type of kernel – although we note a novel method that automates this selection without the use of arbitrary normal reference rules, see Botev et al. (2010). The method is also blind to specific aspects, such as the boundedness of the domain of values (e.g. prices cannot become negative). Therefore, it has to be used with care, and preferably on non-bounded data (e.g. log-returns).
Forwards and futures give exposure to the underlying asset without actually buying it right now. Both forwards and futures are agreements to sell or buy at a future date at a price that is agreed upon today. They are defined as follows.

Definition: Future

A future is an agreement to sell or buy an asset at a future date at a pre-agreed price, where the agreement is quoted on a regulated stock exchange.

Definition: Forward

A forward is an agreement to sell or buy an asset at a future date at a pre-agreed price, where the agreement is an OTC agreement.

The difference between a forward and a future is only the form of the agreement.
The value of a forward or future at maturity is the difference between the delivery price ($K$) and the spot price of the underlying at maturity ($S_T$): for a long position this payoff is $S_T - K$.
Reasoning that there should be no rational difference between buying an asset today and keeping it, compared to buying it later and investing the money at the risk-free interest rate until we do so, this relationship becomes – for assets that yield no income:
$$F_0 = S_0 e^{rT}$$
with $T$ the time to maturity and $r$ the continuously compounded risk-free interest rate.
This formula can be modified to include income. If an asset pays income, then this advantage is for the holder of the asset, and hence we subtract it for the cash portfolio where one does not have the asset. For example, if the income is known to be a discrete series of payments $I_t$, then the formula becomes
$$F_0 = \left(S_0 - \sum_{t=1}^{N} PV(I_t)\right) e^{rT}$$
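A minimal sketch of this formula in R, with an assumed income schedule (the numbers are purely illustrative):

# Forward price of an asset that pays discrete income I at times t (in years)
S0 <- 100                    # spot price
r  <- 0.03                   # continuously compounded risk-free rate
T  <- 1                      # maturity of the forward (years)
I  <- c(2, 2)                # income payments
t  <- c(0.25, 0.75)          # payment times (years)
PV_income <- sum(I * exp(-r * t))
F0 <- (S0 - PV_income) * exp(r * T)
F0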
Note – Commodities
For commodities such as gold or silver, there is no income from the asset, but rather a storage cost. This cost can be modelled in the same way as the income, but of course with the opposite sign.
This means that the value of a future will evolve quite similarly to the value of the underlying asset. For example, to gain exposure to the S&P 500 one would have to buy its 500 constituent shares in exact proportions, which is complicated, costly, and only possible for large portfolios; but it is possible to gain that exposure with just one future.
It is possible to manage a portfolio by only using futures, so that it looks as if we have the underlying asset in the portfolio. When the future is close to maturity, it is sufficient to "roll the contract": sell the old one and buy a new contract.
Forwards and futures play an important role on the stock exchange; they are efficient for exposure or hedging purposes and are usually very liquid.
One particular forward is the Forward Rate Agreement (FRA), which fixes an interest rate for a future transaction such as borrowing or lending. FRAs are used to hedge against interest rate changes.
More specifically, the FRA is a cash settlement that compensates the counterparty for promising a transaction at a given interest rate in the future.
30.7 Options
30.7.1 Definitions
Options as financial derivative instruments are very much what one would expect from the
English word “option.” It is an agreement that allows the owner of the contract to do or not do
something at his/her own discretion. Usually, it is something along the lines of “the bearer of
this agreement has the right to purchase N shares of company ABC for the price of X at a given
date T from the writer of this option, DEF”.
There are two base types of options:
A Call Option is the right to buy the underlying asset at a given price (the Strike) at some
point in the future (the maturity date).
A Put Option is the right to sell the underlying asset at a given price (the Strike) at some
point in the future (the maturity date).
The option market is a very specialised market and has developed an elaborate vocabulary to
communicate about the options. For example,
• The strike or execution price is the price at which an option can be executed (e.g. for a call, the price at which the underlying can be bought when exercising the option). The strike price is denoted as $X$.
So, the owner of the option can, at the maturity date, execute the right that the particular option provides. This act is called "exercising an option."
For example, you have bought a call option on HSBC Holdings plc with a strike price of £600, and the price today is £625. This means that you have the right to buy the share from the writer of the contract for £600, but you can sell it right away on the market for £625. This transaction leaves you with £25 profit.
Actually, option traders will seldom use the words "option buyer"; they rather say that you have a long position. The option buyer has the long position: he or she has the right to buy or sell at the pre-agreed price.
The world is balanced, and for each person or group that has a right, there is another person or group that has an obligation to match that right. For each long position, there is a short position. The option writer has the obligation to sell or buy at the pre-agreed price; he/she has a short position.
The words long and short are also used in relation to futures and shares. For example, having a short position in Citybank means that you have sold shares of Citybank without having them in portfolio. This means that at a later and pre-defined moment in time you will need to buy them at market price to deliver them. This means that you earn money on this transaction if the price goes down.
A long position in shares – or any other symmetric asset – means that you have the contract in portfolio. As in "I have two shares of HSBC Holdings Plc."
The example shows that having a long position in a call option, for example, gives you the right to buy a certain underlying asset. This is all nice, but it means that you should not forget to use the right: buy the underlying, then sell it, etc. Maybe there is a better option: ask for "cash settlement." So, in general, options provide two ways of closing the contract:
• Delivery of the underlying: providing or accepting the underlying from the option buyer who exercises his/her option.
• Cash settlement: the option writer will pay out the profit of the option to the buyer in cash instead of delivering the asset.
Still, we are not finished listing the slang of an option trader. Today’s price of the underlying
asset is usually referred to as the Spot Price. This language use is in line with its use in the
forwards and futures market. The spot price is traditionally denoted as S .
So, at maturity the value of the option equals the profit that the long position will provide. For a call, that profit will equal the difference between the spot price at that moment and the strike price. Even before maturity one can observe whether the spot price is higher or lower than the strike price. This difference is called the Intrinsic Value. The Intrinsic Value is the payoff that the option would yield if the spot price would remain unchanged till maturity (not discounted, just the nominal value). For example,
• $IV_{call} = \max(S - X, 0)$
• ITM: An option is in-the-money if its Intrinsic Value is positive. So, if the price of the underlying were the same at the maturity date, the option buyer would get some payoff.
• OTM: An option is out-of-the-money if the spot price is not equal to the strike and the intrinsic value of the option is zero. For a call, this means that $S < X$. This means that if at maturity the spot price were the same as now, then the buyer would get no payoff.
While the concept "marked-to-market" is not specific to the option market, it is worth noting that MTM or Marked-to-Market is the value of a financial instrument that market participants would pay for it (regardless of whether you believe this price to be fair or correct).
These concepts are visualised in Figure 30.3 on page 642.
Figure 30.3: Some concepts illustrated on the example of a call option, with on the x-axis the spot price $S$ and on the y-axis the payoff of the structure. The length of the blue arrows illustrates the concept, while the grey arrows indicate the position of the concept on the x-axis.
Finally, there are two types of these basic options that are worth distinguishing.
• European Option: A European option is an option that can be executed by the buyer at the maturity date and only at the maturity date.
• American Option: An American option is an option that can be executed by the buyer from the moment it is bought until the maturity date.
Even the price of an option has its specific name: the option premium. The premium obviously includes the present value of the intrinsic value, but it will usually be higher. There is indeed an extra value hidden in the optionality of the contract: the right of being allowed to do something – but not having the obligation – as such has a positive value. We come back to this idea in Section 30.7.5 "The Black and Scholes Model" on page 649.
Definition: OTC
The main differences between OTC and exchange-traded options are:
1. The counterparty risk: on the exchange those risks are covered by the clearing house;
2. Settlement and clearing have to be specified in the OTC agreement, while on the stock exchange the clearing house will do this; and
Options are available on most stock exchanges. Already during the Tulipomania in 1637 there were options on tulip bulbs available both on the London Stock Exchange (LSE) and the Amsterdam exchange.
While the amount of options traded on regulated exchanges is huge, the amounts traded OTC are much higher and typically amount to hundreds of trillions of dollars per year. This is because behind those OTC trades there are usually professional counterparties such as investment funds or pension funds that collect the savings of thousands of investors and then build a structure with capital protection based on options.
1. Supposedly, the first option buyer in the world was the ancient Greek mathematician and philosopher Thales of Miletus (ca. 624 – ca. 546 BCE). On a certain occasion, it was predicted that the season's olive harvest would be larger than usual, and during the off-season he acquired the right to use a number of olive presses the following spring. When spring came and the olive harvest was larger than expected, he exercised his options and then rented the presses out at a much higher price than he had paid for his "option" – see Kraut (2002).
2. Tulipomania (March 1637): On February 24, 1637, the self-regulating guild of Dutch florists, in a decision that was later ratified by the Dutch Parliament, announced that all futures contracts written after November 30, 1636 and before the re-opening of the cash market in the early spring were to be interpreted as option contracts. See for example: Mackay (1841).
3. In London, puts and “refusals” (calls) first became well-known trading instruments in the
1690s during the reign of William and Mary. See: Smith (2004)
4. Privileges were options sold OTC in nineteenth century America, with both puts and calls
on shares offered by specialized dealers. Their exercise price was fixed at a rounded-off mar-
ket price on the day or week that the option was bought, and the expiry date was generally
three months after purchase. They were not traded in secondary markets.
5. In the real estate market, call options have long been used to assemble large parcels of land from separate owners; e.g. a developer pays for the right to buy several adjacent plots, but is not obligated to buy these plots and might not do so unless he can buy all the plots in the entire parcel.
6. Film or theatrical producers often buy the right – but not the obligation – to dramatize
a specific book or script.
7. Lines of credit give the potential borrower the right – but not the obligation – to borrow
within a specified time period and up to a certain amount.
8. Many choices, or embedded options, have traditionally been included in bond contracts.
For example, many bonds are convertible into common stock at the buyer’s discretion, or
may be called (bought back) at specified prices at the issuer’s option.
9. Mortgage borrowers have long had the option to repay the loan early, which corresponds
to a callable bond option.
As you will have noticed from this list, options are not limited to financial markets. For example, a lease contract for a car usually has an option to buy the car at the end of the contract.
$$C = \max(0, S_T - X)$$
Figure 30.4: The intrinsic value of a long call illustrated with its payoff and profit. The profit is lower,
since it takes into account that the option buyer has paid a fixed premium for the option.
Figure 30.5: The intrinsic value of a short call illustrated with its payoff and profit. The profit is higher, since this is the position of the option writer, so this party received the premium at the start of the contract. Note that the loss is unlimited.
FS <- seq(80, 120, by = 0.5)   # future spot prices (example grid, possibly defined earlier)
X  <- 100                      # strike price (example value)
P  <- 5                        # option premium received (example value)
T  <- 3                        # time to maturity
r  <- 0.03                     # discount rate
payoff <- - mapply(max, FS - X, 0)   # payoff of the short call
profit <- P * (1 + r)^T + payoff     # the premium is capitalized at r
While the option buyer (the long position) can lose at most the premium paid, the option writer can lose an unlimited amount, while his profit will at most be equal to the option premium.
buyer needs to own the underlying asset. This implies that this party needs to buy this asset first and hold it until the option expires.9 The cost to hold the asset is known as the "cost of carry."
The code below calculates the intrinsic values of both a long and a short position and then plots the results in Figure 30.6.
par(mfrow=c(1,2))
plot(FS, payoff,
col='red', lwd=3, type='l',
main='LONG PUT at maturity',
xlab='Future spot price',
ylab='$',
ylim=c(-20,20)
)
lines(FS, profit,
col='blue', lwd=2)
text(110,1, 'Payoff', col='red')
text(110,-4, 'Profit', col='blue')
9 If the option ends favourably, but the option buyer does not have the underlying asset, then he needs to buy it at that moment at the market price in order to sell it to the option writer. This does not make sense, as it cancels out the potential profit on the option.
plot(FS, payoff,
col='red', lwd=3, type='l',
main='SHORT PUT at maturity',
xlab='Future spot price',
ylab='',
ylim=c(-20,20)
)
lines(FS, profit,
col='blue', lwd=2)
text(110,1, 'Payoff', col='red')
text(110,6, 'Profit', col='blue')
First, consider a long call and a short put position. This will provide us – at maturity – with a profit if the future spot price $S_T$ is higher than the strike price $X$, but it will also expose us to any price decrease because of the short put position. The result is that the payoff of the structure is $S(T) - X$. This means that at maturity the following holds:
$$C(T) - P(T) = S(T) - X$$
If those two portfolios are equal at maturity, then their market values must also be the same today. The present value of the left-hand side is the long call and short put position today, $C(0) - P(0)$, and the second portfolio is the underlying as it is today, $S(0)$, minus the present value of the strike price. If we denote the discounting operation as $D(.)$, then this becomes:
$$C - P = D(F - X)$$
with $F$ the forward price of the underlying,
the right-hand side is the same as buying a forward contract on the underlying with the strike as delivery price. So, a portfolio that is long a call and short a put is the same as being long a forward.
The put-call parity can be rewritten as
$$C + D \times X = P + S$$
In this case, the left-hand side is a fiduciary call, which is long a call and enough cash (or bonds) to pay the strike price if the call is exercised, while the right-hand side is a protective put, which is long a put and the asset, so the asset can be sold for the strike price if the spot is below the strike at expiry. Both sides have payoff $\max(S(T), X)$ at expiry (i.e. at least the strike price, or the value of the asset if that is more), which gives another way of proving or interpreting the put-call parity.
2. The returns of one period are statistically independent of the returns in other periods.
• Interest rates are continuous, so that $e^{rt} = (1 + i)^t$, implying that the continuously compounded rate can be calculated from the annual compound interest rate: $r = \log(1 + i)$.
• Also, returns can be split infinitesimally and be expressed as a continuous rate.
These assumptions allow us to calculate the fair value for a call option as in Black and Scholes (1973) and, via the put-call parity, to derive the value of the put option — see Section 30.7.4.4 "The Put-Call Parity" on page 648. The result is the following (in the notation used in the code below):
$$C = S\,N(d_1) - X e^{-r\tau} N(d_2), \qquad d_1 = \frac{\ln(S/X) + (r + \sigma^2/2)\,\tau}{\sigma\sqrt{\tau}}, \quad d_2 = d_1 - \sigma\sqrt{\tau}$$
with, for example:
• $S = 100$
• $\sigma = 20\%$
• $r = 2\%$
• $\tau = 1$ year
We will convert the Black and Scholes formula into functions, so we can easily reuse the code later. First, we also make functions for the intrinsic value. In these functions, we leave out the time value of the premium, because we want to use them later to compare portfolios at the moment of the purchase of the option. The code below defines these functions.
# call_intrinsicVal
# Calculates the intrinsic value for a call option
# Arguments:
# Spot -- numeric -- spot price
# Strike -- numeric -- the strike price of the option
# Returns
# numeric -- intrinsic value of the call option.
call_intrinsicVal <- function(Spot, Strike) {max(Spot - Strike, 0)}
# put_intrinsicVal
# Calculates the intrinsic value for a put option
# Arguments:
# Spot -- numeric -- spot price
# Strike -- numeric -- the strike price of the option
# Returns
# numeric -- intrinsic value of the put option.
put_intrinsicVal <- function(Spot, Strike) {max(-Spot + Strike, 0)}
# call_price
# The B&S price of a call option before maturity
# Arguments:
# Spot -- numeric -- spot price in $ or %
# Strike -- numeric -- the strike price of the option in $ or %
# T -- numeric -- time to maturity in years
# r -- numeric -- interest rates (e.g. 0.02 = 2%)
# vol -- numeric -- standard deviation of underlying in $ or %
# Returns
# numeric -- value of the call option in $ or %
#
call_price <- function (Spot, Strike, T, r, vol)
{
d1 <- (log(Spot / Strike) + (r + vol ^ 2/2) * T) / (vol * sqrt(T))
d2 <- (log(Spot / Strike) + (r - vol ^ 2/2) * T) / (vol * sqrt(T))
pnorm(d1) * Spot - pnorm(d2) * Strike * exp(-r * T)
}
# put_price
# The B&S price of a put option before maturity
# Arguments:
# Spot -- numeric -- spot price in $ or %
# Strike -- numeric -- the strike price of the option in $ or %
# T -- numeric -- time to maturity in years
# r -- numeric -- interest rates (e.g. 0.02 = 2%)
# vol -- numeric -- standard deviation of underlying in $ or %
# Returns
# numeric -- value of the put option in $ or %
#
put_price <- function(Spot, Strike, T, r, vol)
{
Strike * exp(-r * T) - Spot + call_price(Spot, Strike, T, r, vol)
}
# Examples:
call_price (Spot = 100, Strike = 100, T = 1, r = 0.02, vol = 0.2)
## [1] 8.916037
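Since put_price() is constructed via the put-call parity, these functions can be used for a quick numeric consistency check of that parity, $C + X e^{-rT} = P + S$, with the same parameters as above:

C <- call_price(Spot = 100, Strike = 100, T = 1, r = 0.02, vol = 0.2)
P <- put_price (Spot = 100, Strike = 100, T = 1, r = 0.02, vol = 0.2)
C + 100 * exp(-0.02 * 1)    # fiduciary call
P + 100                     # protective put -- same value (ca. 106.94)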
It is even possible to use our functions to plot the market value of different options. First we
plot an example of the market value of a long call and compare this to its intrinsic value. The
results are in Figure 30.7 on page 652.
Figure 30.7: The price of a long call compared to its intrinsic value. The market value is always
positive.
# Long call
spot <- seq(50,150, length.out=150)
intrinsic_value_call <- apply(as.data.frame(spot),
MARGIN=1,
FUN=call_intrinsicVal,
Strike=100)
market_value_call <- call_price(Spot = spot, Strike = 100,
T = 3, r = 0.03, vol = 0.2)
plot(spot, market_value_call,
type = 'l', col= 'red', lwd = 4,
main = 'European Call option',
xlab = 'Spot price',
ylab = 'Option value')
text(115, 40, 'Market value', col='red')
lines(spot, intrinsic_value_call,
col= 'forestgreen', lwd = 4)
text(130,15, 'Intrinsic value', col='forestgreen')
Let us try the same for a long put position. The code below does this and plots the results in
Figure 30.8.
# Long put
spot <- seq(50,150, length.out=150)
intrinsic_value_put <- apply(as.data.frame(spot),
MARGIN=1,
FUN=put_intrinsicVal,
Strike=100)
market_value_put <- put_price(Spot = spot, Strike = 100,
                              T = 3, r = 0.03, vol = 0.2)  # parameters assumed to mirror the call example
Figure 30.8: The price of a long put compared to its intrinsic value. Note that the market price of a
put can be lower than its intrinsic value. This is because of the cost of carry.
It is noteworthy that the time value of a put (i.e. the market value minus the intrinsic value) can be negative. This is because it is somehow costly to exercise a put: in order to exercise a put, we first need to have the asset. So, the equivalent portfolio is borrowing money in order to buy the asset, so that we would be able to sell the asset at maturity. Borrowing costs money, and hence, when the probability that the put will end in the money is high, this cost will push down the value of the put. We call this the cost of carry.
The BS model needs a few strong assumptions that do not really reflect the reality of financial markets. For example, in order to derive the model, we need to assume that:
• markets are efficient (prices follow a Wiener process, we can always buy and sell, etc.),
• the underlying does not pay any dividend (and if it does, this is a continuous flow),
Step One
Calculate the price of a long ATM European call option; using one step in the binomial
model and the following assumptions: S0 = $100, p = 0.70, u = 1.03 (≡ 3% increase), d =
0.95 (≡ 5% decrease), r = 0.1. Assume zero interest rates.
In the next step, we will split each node again, as illustrated in Figure 30.10 on page 656. Each ending node of the second step will be split in step 3, and so on. The total number of end-nodes is hence $2^{K-1}$, with $K$ the number of steps.
This approach is called the binomial model, since in each node we consider two scenarios.10 The key issue is to find good parameters so that the probabilities and the up and down moves together create a realistic picture. The first approach is the risk-neutral approach.
1. Choose $(u, d, p)$ consistent with some other theory or observation. For example, the Cox–Ross–Rubinstein model:
• $u = e^{\sigma\sqrt{\delta t}}$
• $d = e^{-\sigma\sqrt{\delta t}}$
• $p = \frac{e^{R_{RF}\,\delta t} - d}{u - d}$
10 The reason why we only consider two branches for each node is mainly that it is the simplest system that works. Adding more branches makes everything a lot more complex and does not really add any precision. More precision can rather be obtained by taking smaller steps and hence a larger tree.
To illustrate how this can work, we provide here an implementation of the Cox–Ross–Rubinstein (CRR) model:
# CRR_price
# Calculates the CRR binomial model for an option
#
# Arguments:
# S0 -- numeric -- spot price today (start value)
# SX -- numeric -- strike, e.g. 100
# sigma -- numeric -- the volatility over the maturity period,
# e.g. ca. 0.2 for shares on 1 yr
# Rrf -- numeric -- the risk free interest rate (log-return)
# optionType -- character -- 'lookback' for lookback option,
# otherwise vanilla call is assumed
# maxIter -- numeric -- number of iterations
# Returns:
# numeric -- the value of the option given parameters above
CRR_price <- function(S0, SX, sigma, Rrf, optionType, maxIter)
{
Svals <- mat.or.vec(2^(maxIter), maxIter+1)
probs <- mat.or.vec(2^(maxIter), maxIter+1)
Smax <- mat.or.vec(2^(maxIter), maxIter+1)
Svals[1,1] <- S0
probs[1,1] <- 1
Smax[1,1] <- S0
dt <- 1 / maxIter
u <- exp(sigma * sqrt(dt))
d <- exp(-sigma * sqrt(dt))
p = (exp(Rrf * dt) - d) / (u - d)
for (n in 1:(maxIter))
{
for (m in 1:2^(n-1))
{
Svals[2*m-1,n+1] <- Svals[m,n] * u
Svals[2*m,n+1] <- Svals[m,n] * d
probs[2*m-1,n+1] <- probs[m,n] * p
probs[2*m,n+1] <- probs[m,n] * (1 - p)
Smax[2*m-1,n+1] <- max(Smax[m,n], Svals[2*m-1,n+1])
Smax[2*m,n+1] <- max(Smax[m,n], Svals[2*m,n+1])
}
}
if (optionType == 'lookback')
{
exp.payoff <- (Smax - SX)[,maxIter + 1] * probs[,maxIter + 1]
} # lookback call option
else
{
optVal <- sapply(Svals[,maxIter + 1] - SX,max,0)
exp.payoff <- optVal * probs[,maxIter + 1]
} # vanilla call option
sum(exp.payoff) / (1 + Rrf)
}
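As a quick usage example (with assumed parameters, not values prescribed by the text), a one-year at-the-money vanilla call can be priced with, say, ten steps:

# Vanilla call via the CRR binomial model, 10 steps (example parameters)
CRR_price(S0 = 100, SX = 100, sigma = 0.2, Rrf = 0.03,
          optionType = 'call', maxIter = 10)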
Now, we still add another function that can be used as a wrapper function for the previous
one in order to visualize the results:
# plot_CRR
# This function will call the CRR function iteratively for
# number of iterations increasing from 1 to maxIter and
# plot the results on screen (or if desired uncomment the
# relevant lines to save to disk).
# Arguments:
# optionType -- character -- 'lookback' for lookback option,
# otherwise vanilla call is assumed
# maxIter -- numeric -- maximal number of iterations
# saveFile -- boolean -- TRUE to save the plot as pdf
# Returns:
#    ggplot2 plot (and optionally a pdf file on disk)
# NB: the body below is a reconstruction (a minimal sketch); the
#     option parameters S0, SX, sigma and Rrf are assumed defaults.
plot_CRR <- function(optionType, maxIter, saveFile = FALSE)
{
  S0 <- 100; SX <- 100; sigma <- 0.2; Rrf <- 0.03
  nIter <- 1:maxIter
  price <- sapply(nIter, function(K)
              CRR_price(S0, SX, sigma, Rrf, optionType, K))
  d <- data.frame(iterations = nIter, price = price)
  p <- ggplot(d, aes(x = iterations, y = price)) +
       geom_line() + geom_point() +
       xlab('Number of steps') + ylab('Option value')
  if(saveFile) {
    ggsave(paste('img/binomial_CRR_', optionType, '.pdf', sep = ''),
           plot = p)
  }
  # Return the plot:
  p
}
This code allows us to plot the CRR model and show how it converges. We will plot it for two options; the first, a plain vanilla call, is shown in Figure 30.11 on page 658.
library(ggplot2)
# Plot the convergence of the CRR algorithm for a call option.
plot_CRR("Call", maxIter = 20)
Figure 30.11: The Cox–Ross–Rubinstein model for the binomial model applied to a call option. Note how the process converges smoothly and quickly.
To illustrate this further, we will do the same for an unlimited lookback option ("Russian" option). This option allows the buyer to set the strike price at the lowest quoted price during the lifetime of the option. The code of that example is below, and the plot that results from it is in Figure 30.12 on page 659.
Note that the binomial model has an exponentially growing number of nodes in every step. This can be seen in the line for (m in 1:2^(n-1)). This might become a limiting factor to see the result. It is best to experiment with small numbers and, once the code is debugged, check what your computer can do. Changing the number of steps can be achieved by fine-tuning maxIter in the aforementioned code.
Figure 30.12: The Cox–Ross–Rubinstein model for the binomial model applied to an unlimited lookback option (aka Russian option). For the Russian option, the convergence process is not obvious. In fact, the more steps we allow, the more expensive the option becomes, because there are more moments at which a lower price can occur.
C = δS − B
If both portfolios have the same price now, then
C = δS − B
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:22pm Page 660
❦
We now have two equations with two unknowns ($\delta$ and $B$), and hence easily find that:
$$\delta = \frac{C_u - C_d}{S_u - S_d}, \qquad B = \frac{\delta S_u - C_u}{1+r} = \frac{\delta S_d - C_d}{1+r}$$
1. $\delta = \frac{C_u - C_d}{S_u - S_d}$
2. $B = \frac{\delta S_u - C_u}{1+r}$
3. $C = \delta S - B$
What is the value of a call option, assuming that the strike price $X$ is \$100, the spot price $S$ is \$100, $S_d = \$98$, $S_u = \$105$, and the interest rate is 2%?
$$\delta = \frac{\$5 - \$0}{\$105 - \$98} = 0.714 \;\Rightarrow\; B = \frac{0.714 \times \$105 - \$5}{1.02} = \$68.60 \;\Rightarrow\; C = 0.714 \times \$100 - \$68.60 = \$2.80$$
– Is easier to calculate
– Does not use the economic probability of the stock going up or down
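The replication argument from the worked example above can be verified in a few lines of R (using the same numbers):

# One-step binomial replication of a call (numbers from the example above)
Su <- 105; Sd <- 98; S <- 100; X <- 100; r <- 0.02
Cu <- max(Su - X, 0)                  # option value in the up-state   (5)
Cd <- max(Sd - X, 0)                  # option value in the down-state (0)
delta <- (Cu - Cd) / (Su - Sd)        # number of shares to hold
B     <- (delta * Su - Cu) / (1 + r)  # amount borrowed (present value)
C     <- delta * S - B                # option value today (ca. 2.80)
c(delta = delta, B = B, C = C)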
changes as a function of the different parameters, such as time to maturity, volatility, interest rates, etc. So, we will use these as a starting point to create the illustrations.
To visualise the results, we will use the library ggplot2, and hence we need to load it first:
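library(ggplot2)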
Then we create a generic function that we will use to plot the price dependencies. This function will take an argument varName, which is the name of the variable that we will study. The function even takes the option name as an argument.
# plot_price_evol
# Plots the evolution of Call price in function of a given variable
# Arguments:
# var -- numeric -- vector of values of the variable
# varName -- character -- name of the variable to be studied
# price -- numeric -- vector of prices of the option
# priceName -- character -- the name of the option
# reverseX -- boolean -- TRUE to plot x-axis from high to low
# Returns
# ggplot2 plot
#
plot_price_evol <- function(var, varName, price, priceName,
reverseX = FALSE)
{
d <- data.frame(var, price)
colnames(d) <- c('x', 'y')
p <- qplot(x, y, data = d, geom = "line", size = I(2) )
p <- p + geom_line()
if (reverseX) {p <- p + xlim(max(var), min(var))} # reverse axis
p <- p + xlab(varName ) + ylab(priceName)
p # return the plot
}
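The plots below vary one parameter at a time while keeping the others at default values. Those defaults are not shown at this point in the text; judging from the analogous put example further on, they can be assumed to be:

# Assumed default values (mirroring the put example below)
t      <- 1              # time to maturity (years)
Spot   <- 100            # spot price
Strike <- 100            # strike price
r      <- log(1 + 0.03)  # interest rate as log-return
vol    <- 0.2            # annual volatility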
## ... time
T <- seq(5, 0.0001, -0.01)
Call <- c(call_price (Spot, Strike, T, r, vol))
p1 <- plot_price_evol(T, "Time to maturity (years)", Call, "Call",
TRUE)
## ... interest
R <- seq(0.001, 0.3, 0.001)
Call <- c(call_price (Spot, Strike, t, R, vol))
p2 <- plot_price_evol(R, "Interest rate", Call, "Call")
## ... volatility
vol <- seq(0.00, 0.2, 0.001)
Call <- c(call_price (Spot, Strike, t, r, vol))
p3 <- plot_price_evol(vol, "Volatility", Call, "Call")
## ... strike
vol <- 0.2   # reset vol to its default after the volatility sweep above
X <- seq(0, 200, 1)
Call <- c(call_price (Spot, X, t, r, vol))
p4 <- plot_price_evol(X, "Strike", Call, "Call")
## ... Spot
spot <- seq(0, 200, 1)
Call <- c(call_price (spot, Strike, t, r, vol))
p5 <- plot_price_evol(spot, "Spot price", Call, "Call")
Figure 30.13: The value of a call option depends on many variables. Some are illustrated in these
plots.
Figure 30.14: The value of a put option depends on many variables. Some are illustrated in these
plots.
# Define the default values:
t <- 1
Spot <- 100
Strike <- 100
r <- log(1 + 0.03)
vol <- 0.2
## ... time
T <- seq(5, 0.0001, -0.01)
Call <- c(put_price (Spot, Strike, T, r, vol))
p1 <- plot_price_evol(T, "Time to maturity (years)",
Call, "Call", TRUE)
## ... interest
R <- seq(0.001, 0.3, 0.001)
Call <- c(put_price (Spot, Strike, t, R, vol))
p2 <- plot_price_evol(R, "Interest rate", Call, "Call")
## ... volatility
vol <- seq(0.00, 0.2, 0.001)
Call <- c(put_price (Spot, Strike, t, r, vol))
p3 <- plot_price_evol(vol, "Volatility", Call, "Call")
## ... strike
vol <- 0.2   # reset vol to its default after the volatility sweep above
X <- seq(0, 200, 1)
Call <- c(put_price (Spot, X, t, r, vol))
p4 <- plot_price_evol(X, "Strike", Call, "Call")
## ... Spot
spot <- seq(0, 200, 1)
Call <- c(put_price (spot, Strike, t, r, vol))   # completed to mirror the call example
p5 <- plot_price_evol(spot, "Spot price", Call, "Call")
Table 30.1: An overview of the price dependency for call and put options. A plus sign indicates that
if the variable goes up, then the option premium goes up. The minus sign indicates that in the same
case the option premium goes down.
vega:   $\frac{\partial C}{\partial \sigma}$    call: $S N'(d_1)\sqrt{\tau}$    put: $S N'(d_1)\sqrt{\tau}$
theta:  $\frac{\partial C}{\partial t}$    call: $-\frac{S N'(d_1)\sigma}{2\sqrt{\tau}} - r K e^{-r\tau} N(d_2)$    put: $-\frac{S N'(d_1)\sigma}{2\sqrt{\tau}} + r K e^{-r\tau} N(-d_2)$
rho:    $\frac{\partial C}{\partial r}$    call: $K \tau e^{-r\tau} N(d_2)$    put: $-K \tau e^{-r\tau} N(-d_2)$
Table 30.2: An overview of “the Greeks:” the most relevant derivatives of the option price.
complicated by the fact that transactions come at a cost. This means that one will have to make a trade-off between constantly trading to keep risks low and incurring too many transaction costs.
So, the Greeks are very important to the option trader. Therefore, we will illustrate how they, in their turn, depend on the spot price of the underlying asset.
We can now visualize the value of the delta of a call and a put as follows – plot in Figure 30.15
on page 666:
# call_delta
# Calculates the delta of a call option
# Arguments:
# S -- numeric -- spot price
# Strike -- numeric -- strike price
# T -- numeric -- time to maturity
# r -- numeric -- interest rate
# vol -- numeric -- standard deviation of underlying
call_delta <- function (S, Strike, T, r, vol)
{
d1 <- (log (S / Strike)+(r + vol ^2 / 2) * T) / (vol * sqrt(T))
pnorm(d1)
}
# put_delta
# Calculates the delta of a put option
# Arguments:
# S -- numeric -- spot price
# Strike -- numeric -- strike price
# T -- numeric -- time to maturity
# r -- numeric -- interest rate
# vol -- numeric -- standard deviation of underlying
put_delta <- function (S, Strike, T, r, vol)
{
d1 <- (log (S / Strike)+(r + vol ^2 / 2) * T) / (vol * sqrt(T))
pnorm(d1) - 1
}
## DELTA CALL
spot <- seq(0,200, 1)
delta <- c(call_delta(spot, Strike, t, r, vol))
p1 <- plot_price_evol(spot, "Spot price", delta, "Call delta")
## DELTA PUT
spot <- seq(0,200, 1)
delta <- c(put_delta(spot, Strike, t, r, vol))
p2 <- plot_price_evol(spot, "Spot price", delta, "Put delta")
Figure 30.15: An illustration of how the delta of a call and put compare in function of the spot price.
Note the difference in scale on the y -axis.
So, if we have a short position in a call, for example, and the price of the underlying increases, then we might face more losses because it becomes more probable that the option will end in the money. So, if that happens, we need to buy more of the underlying asset.
The first derivative of the option price is a linear approximation of how the option price will move with the underlying asset. It appears that this thinking works: always adjusting our position in the underlying to the delta of the option would result in a risk-less strategy, although it is of course not cost-less, as it requires many transactions.
Of course, we need to adjust our strategy to the type of option that is in the portfolio. This is the essence of "delta hedging." Summarized, delta hedging means holding at all times a position in the underlying equal to the delta of the option position, and rebalancing as that delta changes.
Note that the delta of a call is positive and that of a put is negative.
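As a small numerical illustration (a sketch, not code from the book; it uses the call_delta() function defined above, the default parameter values from the beginning of this section, and a hypothetical position size of 1 000 options):
# The writer of 1000 at-the-money calls holds delta * 1000 shares today:
nbr_options <- 1000
delta <- call_delta(S = 100, Strike = 100, T = 1, r = log(1 + 0.03), vol = 0.2)
nbr_options * delta   # roughly 600 shares of the underlying with these values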
Assume an option with strike = $100, time to maturity = 5 years, σ = 0.2, and r = 2% (as a continuous rate: 1.98%).
In the previous example – where we adapt our position only once a year – the option writer has to
buy and sell during the hedge. Note that:
• the payoff of the option is $15; that is what the option writer pays the option buyer;
• our option writer has spent $107 (non-discounted) and can sell this portfolio for $115 (the
difference is [non-discounted] $8); this is $7 short for paying his customer, but he did get the
premium;
Table 30.3: Delta hedging of a hypothetical example where we only hedge our position once per year.
• the difference at the end is $37.78 (additional shares to buy); this is a big risk and results
from leaving the position open for one year.
To safely hedge a portfolio of options, it is not sufficient that the position remains delta
neutral. We would, for example, not be covered if the volatility would suddenly increase.
Hence, we need to manage all Greeks (all derivatives and even some second derivatives)
and keep them all zero.
The code for the function portfolio_plot() is below. This function produces the plots of
intrinsic value and market value. It is quite long because it will have to work for different option
types.
# portfolio_plot
# Produces a plot of a portfolio of the value in function of the
# spot price of the underlying asset.
# Arguments:
# portf - data.frame - composition of the portfolio
# with one row per option, structured as follows:
# - ['long', 'short'] - position
# - ['call', 'put'] - option type
# - numeric - strike
# - numeric - gearing (1 = 100%)
# structureName="" - character - label of the portfolio
# T = 1 - numeric - time to maturity (in years)
# r = log(1 + 0.02) - numeric - interest rate (per year)
# as log-return
# vol = 0.2 - numeric - annual volatility of underlying
# spot.min = NULL - NULL for automatic scaling x-axis, value for min
# spot.max = NULL - NULL for automatic scaling x-axis, value for max
# legendPos=c(.25,0.6) - numeric vector - set to 'none' to turn off
# yLims = NULL - numeric vector - limits y-axis, e.g. c(80, 120)
# fileName = NULL - character - filename, NULL for no saving
# xlab = "default" - character - x axis label, NULL to turn off
# ylab = "default" - character - y axis label, NULL to turn off
# Returns (as side effect)
# ggplot plot
# pdf file of this plot (in subdirectory ./img/)
}
# return the plot:
p
}
The function above caters for various visual effects. While not too complicated, this tends to
be quite verbose.
# long call
portfolio <- rbind(c('long','call',100,1))
p1 <- portfolio_plot(portfolio, 'Long call',
legendPos="none", xlab = NULL)
# short call
portfolio <- rbind(c('short','call',100,1))
p2 <- portfolio_plot(portfolio, 'Short call', legendPos="none",
xlab = NULL, ylab = NULL)
# long put
portfolio <- rbind(c('long','put',100,1))
p3 <- portfolio_plot(portfolio, 'Long put', legendPos="none",
xlab=NULL)
# short put
portfolio <- rbind(c('short','put',100,1))
p4 <- portfolio_plot(portfolio, 'Short put', legendPos="none",
xlab = NULL, ylab = NULL)
# -- call
portfolio <- rbind(c('long','call',100,1))
p6 <- portfolio_plot(portfolio, 'Call', legendPos="none",
xlab = NULL, ylab = NULL)
# -- put
portfolio <- rbind(c('long','put',100,1))
p7 <- portfolio_plot(portfolio, 'Put', legendPos="none")
# -- callput
portfolio <- rbind(c('short','put',100,1))
portfolio <- rbind(portfolio, c('long','call',100,1))
p8 <- portfolio_plot(portfolio, 'Call + Put', legendPos="none",
ylab = NULL)
(Figure panels: the value of each basic strategy plotted against the spot price.)
Figure 30.16: Linear option strategies illustrated. The red line is the intrinsic value and the green
line is the value of today if the spot price would move away from 100. Part 1 (basic strategies).
# -- callspread
portfolio <- rbind(c('short','call',120,1))
portfolio <- rbind(portfolio, c('long','call',100,1))
p1 <- portfolio_plot(portfolio, 'CallSpread',
legendPos="none", xlab = NULL)
# -- short callspread
portfolio <- rbind(c('long','call',120,1))
portfolio <- rbind(portfolio, c('short','call',100,1))
p2 <- portfolio_plot(portfolio, 'Short callSpread',
legendPos="none", xlab = NULL, ylab = NULL)
# -- callspread differently
portfolio <- rbind(c('short','put',120,1))
portfolio <- rbind(portfolio, c('long','put',100,1))
p3 <- portfolio_plot(portfolio, 'Short putSpread',
legendPos="none", xlab = NULL)
# -- putspread
portfolio <- rbind(c('short','put',80,1))
portfolio <- rbind(portfolio, c('long','put',100,1))
p4 <- portfolio_plot(portfolio, 'PutSpread',
legendPos="none", xlab = NULL, ylab = NULL)
# -- straddle
portfolio <- rbind(c('long','call',100,1))
portfolio <- rbind(portfolio, c('long','put',100,1))
p5 <- portfolio_plot(portfolio, 'Straddle', spot.min = 50,
spot.max = 150,legendPos="none", xlab = NULL)
# Note that our default choices for x-axis range are not suitable
# for this structure. Hence, we add spot.min and spot.max
# -- short straddle
portfolio <- rbind(c('short','call',100,1))
portfolio <- rbind(portfolio, c('short','put',100,1))
p6 <- portfolio_plot(portfolio, 'Short straddle',spot.min = 50,
spot.max = 150, legendPos="none",
xlab = NULL, ylab = NULL)
# -- strangle
portfolio <- rbind(c('long','call',110,1))
portfolio <- rbind(portfolio, c('long','put',90,1))
p7 <- portfolio_plot(portfolio, 'Strangle',
spot.min = 50, spot.max = 150,
legendPos="none", xlab = NULL)
# -- butterfly
portfolio <- rbind(c('long','call',120,1))
portfolio <- rbind(portfolio, c('short','call',100,1))
portfolio <- rbind(portfolio, c('long','put',80,1))
portfolio <- rbind(portfolio, c('short','put',100,1))
p8 <- portfolio_plot(portfolio, 'Butterfly',
spot.min = 50, spot.max = 150,
legendPos="none", xlab = NULL, ylab = NULL)
The output of this second batch of linear option strategies is in Figure 30.17 on page 673.
(Figure panels: the value of each composite structure, including the strangle and the butterfly, plotted against the spot price.)
Figure 30.17: Linear option strategies illustrated. Part 2 (basic composite structures).
What stands out in Figure 30.17 is that a callspread can be equivalent to a putspread. The
payoffs will be the same in every market scenario and hence the market price must also
be the same.
Finally, we complete our list of popular structures and add one that is maybe not very practical,
but illustrates the fact that options can be used as building blocks to build any strategy imaginable
– this structure is called "a complex structure" in the code and it is the last portfolio defined in
the following code block. The plot is in Figure 30.18 on page 675.
# -- condor
portfolio <- rbind(c('long','call',140,1))
portfolio <- rbind(portfolio, c('short','call',120,1))
portfolio <- rbind(portfolio, c('long','put',60,1))
portfolio <- rbind(portfolio, c('short','put',80,1))
p1 <- portfolio_plot(portfolio, 'Condor',spot.min = 40,
spot.max = 160, legendPos="none",
xlab = NULL)
# -- short condor
portfolio <- rbind(c('short','call',140,1))
portfolio <- rbind(portfolio, c('long','call',120,1))
portfolio <- rbind(portfolio, c('short','put',60,1))
portfolio <- rbind(portfolio, c('long','put',80,1))
p2 <- portfolio_plot(portfolio, 'Short Condor',spot.min = 40,
spot.max = 160, legendPos="none",
xlab = NULL, ylab = NULL)
# -- geared call
portfolio <- rbind(c('long','call',100.0,2))
p3 <- portfolio_plot(portfolio,
structureName="Call with a gearing of 2",
legendPos="none", xlab = NULL)
# -- a complex structure:
portfolio <- rbind(c('long','call',110,1))
portfolio <- rbind(portfolio, c('short','call',105,1))
portfolio <- rbind(portfolio, c('short','put',95,1))
portfolio <- rbind(portfolio, c('long','put',90,1))
portfolio <- rbind(portfolio, c('long','put',80,1))
portfolio <- rbind(portfolio, c('long','call',120,1))
portfolio <- rbind(portfolio, c('short','call',125,1))
portfolio <- rbind(portfolio, c('short','put',70,10))
portfolio <- rbind(portfolio, c('short','put',75,1))
portfolio <- rbind(portfolio, c('short','call',130,10))
portfolio <- rbind(portfolio, c('long','call',99,10))
portfolio <- rbind(portfolio, c('short','call',100,10))
portfolio <- rbind(portfolio, c('short','put',100,10))
portfolio <- rbind(portfolio, c('long','put',101,10))
p5 <- portfolio_plot(portfolio, 'Fun',legendPos='none',
spot.min=60, spot.max=140,
yLims=c(-0,25))
By now it should be clear that options can be used as building blocks to create payoff structures
as desired for portfolio management.
Note also that each structure has its counterpart as a short position. To visualise this "short"
structure, it is sufficient to switch "long" with "short" in each definition.
(Figure panels: the value of each structure, including the "Fun" structure, plotted against the spot price.)
Figure 30.18: Linear option strategies illustrated. Part 3 (some more complex structures and extra
one structure that is entirely made up).
(Figure: profit at maturity versus the value of the underlying for the underlying, the short call, and the portfolio.)
Figure 30.19: A covered call is a short call where the losses are protected by having the underlying
asset in the same portfolio.
LegendPos = c(.8,0.2)
Spot <- seq(Spot.min,Spot.max,len=nbrObs)
val.end.put <- sapply(Spot, put_intrinsicVal, Strike = 100)
put.value <- - put_price(the.S, the.Strike, the.T, the.r, the.vol)
'portfolio',
1.1)
colnames(d.underlying) <- c('Spot', 'value', 'Legend','size')
colnames(d.shortput) <- c('Spot', 'value', 'Legend','size')
colnames(d.portfolio) <- c('Spot', 'value', 'Legend','size')
dd <- rbind(d.underlying,d.shortput,d.portfolio)
p <- qplot(Spot, value, data = dd, color = Legend, geom = "line",
size = size )
p <- p + xlab('Value of the underlying' ) + ylab('Profit at maturity')
p <- p + theme(legend.position = LegendPos)
p <- p + scale_size(guide = 'none')
print(p)
(Figure: profit at maturity versus the value of the underlying for the underlying, the long put, and the portfolio.)
Figure 30.20: A married put is a put option combined with the underlying asset.
# Using the same default values as for the previous code block:
LegendPos = c(.6,0.25)
Spot <- seq(Spot.min,Spot.max,len=nbrObs)
val.end.call <- - sapply(Spot, call_intrinsicVal, Strike = 110)
val.end.put <- + sapply(Spot, put_intrinsicVal, Strike = 95)
call.value <- call_price(the.S, the.Strike, the.T, the.r, the.vol)
put.value <- put_price(the.S, the.Strike, the.T, the.r, the.vol)
d.underlying <-
data.frame(Spot, Spot - 100, 'Underlying', 1)
d.shortcall <-
data.frame(Spot, val.end.call, 'Short call', 1)
d.longput <-
data.frame(Spot, val.end.put, 'Long put', 1)
d.portfolio <-
data.frame(Spot, Spot + val.end.call + call.value +
val.end.put - put.value - 100, 'portfolio',1.1)
colnames(d.underlying) <- c('Spot', 'value', 'Legend','size')
colnames(d.shortcall) <- c('Spot', 'value', 'Legend','size')
colnames(d.longput) <- c('Spot', 'value', 'Legend','size')
colnames(d.portfolio) <- c('Spot', 'value', 'Legend','size')
dd <- rbind(d.underlying,d.shortcall,d.longput,d.portfolio)
p <- qplot(Spot, value, data=dd, color = Legend, geom = "line",
size=size )
p <- p + xlab('Value of the underlying' ) + ylab('Profit at maturity')
p <- p + theme(legend.position = LegendPos)
p <- p + scale_size(guide = 'none')
print(p)
(Figure: profit at maturity versus the value of the underlying for the underlying, the short call, the long put, and the portfolio.)
Figure 30.21: A collar is a structure that protects us from strong downwards movements at the cost
of a limited upside potential.
the package as explained in the following section: Chapter 30.7.13 "Capital Protected Structures"
on page 680.
Before listing the more complex types, it is probably worth understanding some of the basic
building blocks for exotic options.
• Knock-in: This option only becomes active when a certain level (up or down) is reached.
• Knock-out: This option will have a fixed (zero or more) return when a certain level (up or
down) is reached; if the level is not reached, it remains an option.
In the context of exotic options, the basic options such as call and put are referred to as "(plain)
vanilla options." The vanilla options and the aforementioned advanced options are used as
building blocks to construct even more customised option types. It almost looks as if it is done on
purpose, but those options come in different groups that are roughly named after countries or
mountain ranges.
• Asian options: The strike or spot is determined by the average price of the underlying taken
at different moments.
• Full Asianing: The average of the Asian feature is taken over some moments over the whole
lifetime of the option.
• Look-back options: The spot is determined as the best price at different moments.
• Russian Look-back options: The look-back feature is unlimited (the best price over the whole
lifetime of the option).
• Callable or Israeli options: The writer has the opportunity to cancel the option, but must
pay the payoff at that point plus a penalty fee.
• Himalayan: Payoff based on the performance of the best asset in the portfolio.
• Atlas: The best and worst-performing securities are removed from the basket prior
to execution of the option.
• Altiplano: A vanilla option is combined with a compensatory coupon payment if the underlying
security never reaches its strike price during a given period.
• Bermuda Option: An option where the buyer has the right to exercise at a set (always
discretely spaced) number of times, so this structure is between an American and a European
option.
• Canary Option: Can be exercised at quarterly dates, but not before a set time period has
elapsed.
• Verde Option: Can be exercised at incremental dates (typically annually), but not before a
set time period has elapsed.
Using our standard example parameters (see page 650), and assuming that we want to protect
a nominal of €1 000 over a term of five years at an interest rate of 2%, what gearing can we
give to a capital protected structure that only uses a fixed term deposit (or custom zero bond)
and a vanilla call option?
We need to invest €1 000/(1 + 0.02)^5 = €905.73 in the fixed term deposit in order to make
it increase to €1 000 in five years.
This leaves us €94.27 for buying an option. A call option on 5 years costs us €95.35 on a
nominal of €1 000, so we can make a structure with a gearing of ca. 99%.
11 The aspect of being loss averse as opposed to being risk averse is for example described in De Brouwer (2012),
Thaler (2016), or Kahneman (2011). In traditional economic theory actors on markets are “rational” and always
“risk-averse,” however, people tend to be rather loss-averse. This implies that for profits people are risk-averse, but
for losses they are risk-seeking.
The previous example refers to a structure that has no costs. That is of course not realistic. We
will need to procure the services of an investment manager, transfer agent, deposit bank, and we
need to pay costs related to the prospectus of the fund. Assume now that we have a cost of (on
average) 1% per year.
In R, this example can be programmed as follows (note that we use our function
call_price() – which is defined in Chapter 30.7.5.2 “Apply the Black and Scholes Formula” on
page 650):
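A sketch of the computation (for brevity it plugs in the option premium of €95.35 quoted above instead of recomputing it with call_price(); the book's actual code may differ):
nominal     <- 1000
deposit     <- nominal / (1 + 0.02)^5     # EUR 905.73, grows back to 1 000 in 5 years
option_cost <- 95.35                      # quoted premium of the 5 year call per 1 000 nominal
fees        <- 0.01 * nominal * 5         # 1% per year, kept aside and not invested
(nominal - deposit) / option_cost         # gearing without costs: ca. 0.99
(nominal - deposit - fees) / option_cost  # gearing with costs: ca. 0.4643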
So, when adding 1% of annual costs to this structure (and not investing the provisions for
these costs), our gearing decreases to 46.43%. That is roughly half of the original gearing.
PART VII
Reporting
In this book, we have already studied data and the building of mathematical models on that data. The next
logical step is to communicate clearly and concisely, bring our opinion across, and point the
management to potential opportunities and risks. The first step is generally building a good
presentation.
Again, R can be our workhorse and be the cornerstone of a free work-flow that produces
professional results. Certain applications might have an edge in environments where data
is not measured in gigabytes but in terabytes, or where the inflow of information is extremely
fast and one has to make sense of the flow of data rather than of a given snapshot. The reality,
however, is that in almost all applications in almost all organisations – from the sole trader to the giant that
employs three hundred thousand people and processes a hundred thousand payments per second –
one will find that a free set of tools based around R, C++, MySQL and presented in a pdf or html5
web-page is more than enough.
The difference is in the mindset and in the price tag. The large corporate can save millions
of dollars by switching to a free workflow. However, that entails taking responsibility, as there is
no company to blame in the unexpected case that something goes wrong. Should it also be
mentioned that the relevant manager's bonus is linked to tangible micro-results – which are easier
to obtain with the support of a commercial company – rather than to the money that is not spent
(or to overall company profit)?
In this part we provide the elements that will allow the reader to build very professional
reports, dashboards, and other documentation, using only free and open source software.
For example, this book is written in the LATEX markup language and the R code is compiled with
the knitr library.
First, we will have a look at the library ggplot2, which we already encountered a few times in
this book.
♣ 31 ♣
We already used ggplot2 in Chapter 9.6 "Violin Plots" on page 173, Chapter 22.2.3 "The AUC" on
page 396 and in Chapter 30.7.7 "Dependencies of the Option Price" on page 660 (and following).
We were confident in doing so, because ggplot2 is rather intuitive and it is not so difficult to
understand what a given code segment does. In this section, we will use few words, but allow the
code to speak for itself and provide tactical examples that teach how to use ggplot2 in practice.
With ggplot2 comes the notion of a "grammar of graphics." Just as the grammar for the
English language, it is a set of rules that allow different words to interact and produce something
that makes sense. That something is, in the case of ggplot2, a professional and clear chart.
Further information – Extensions
To explore ggplot2, we will use the dataset mtcars from the library datasets. It is usually
loaded at the start of R, and it is already known from other sections in this book.
# install once: install.packages('ggplot2')
library(ggplot2)
Most of the functionality is obtained by producing a plot object. This is done by the function
ggplot(). It takes first a data-frame.1 Then, it will still need an aesthetics parameter and a geometry
one.
A plot object that shows only dots is created in the following code. The plot is visualised in Figure 31.1.
p <- ggplot(mtcars, aes(x=wt, y=mpg))
# So far printing p would result in an empty plot.
# We need to add a geom to tell ggplot how to plot.
p <- p + geom_point()
This is a basic plot. From here, there is an endless series of possibilities. For example, we
notice that there is indeed some relationship between the fuel consumption mpg and the weight
(wt) of a car. It seems as if we are on to something.
With the following code segment, we investigate this assumption with a Loess estimate and
improve the layout of the plot (add axis names, change font face and size, and add a clear title).
The result is shown in Figure 31.2 on page 689.
1 This is in line with the tidyverse philosophy; however, ggplot2 itself does not work with the pipe operator.
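A sketch of such a code segment (the exact labels, title, and font settings behind the figure in the book may differ):
p <- p + geom_smooth(method = "loess") +                 # Loess fit with its confidence band
     xlab("Weight (x 1000 lbs)") +                       # clearer axis labels
     ylab("Miles per gallon") +
     ggtitle("Fuel consumption in function of weight") + # a clear title
     theme(text = element_text(size = 14, face = "bold"))
p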
Figure 31.2: The same plot as in previous figure, but now enhanced with Loess estimates complete
with their confidence interval, custom title, axis labels, and improved font properties.
The shaded area is by default the 95% confidence interval. This confidence interval
level can be changed by passing the argument level to the function geom_smooth(). The
type of estimation can also be changed. For example, it is possible to replace the Loess
estimation with a linear model by passing "lm" to the method argument. For example:
geom_smooth(method = "lm", level = 0.90) .
Hint – Themes
ggplot2 allows the user to specify a “theme,” which is a pre-defined list of options that
give the plot a certain look and feel. Here are some options:
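For example, the following themes ship with ggplot2 (a small selection; the book's own list may differ):
p + theme_bw()        # white background with grey grid lines
p + theme_minimal()   # minimal theme without a surrounding box
p + theme_classic()   # classic look, no grid lines
p + theme_dark()      # dark background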
Would you like even more eye-candy? Do you want to make a book according to the
famous Tufte style? Or rather have a layout as in the Economist? Then you might want
to have a look at the package ggthemes.
It is now easy to add layers and information to the plot. For example, we can investigate the
effect of the gearbox (in the parameter am: this is zero for an automatic gearbox and one for a
manual transmission). Further, we will change the size of the dot as a function of the acceleration
of the car (in the variable qsec), and show the new plot in Figure 31.3.
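A sketch of such a code segment (the book's exact code may differ):
p <- ggplot(mtcars, aes(x = wt, y = mpg,
                        colour = factor(am),   # gearbox as a factor: discrete colours
                        size   = qsec)) +      # dot size: the 1/4 mile time
     geom_point()
p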
Figure 31.3: The same plot as in the previous figure, but now enhanced with different colours that
depend on the gearbox (the parameter am), and the size of the dot corresponds to the time the car
needs to cover a quarter mile (ca. 402 m) from standstill (the variable qsec).
Note that it is not necessary to factorize the variable am. However, the result will be different.
If the variable is not factorised, ggplot will scale the colour between two colours in a linear way
(in other words, it will generate a smooth transition, which does not make sense for a binary
variable).
Another powerful feature of ggplot2 is its ability to split plots and produce a "facet grid." A
facet grid is made as a function of one or more variables. This can be passed on to ggplot2 in
the familiar way of passing on an expression (note the ~ in the following code). The output is in
Figure 31.4 on page 691, and we also show how to use the pipe operator.
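A sketch of such a code segment, passing the data with the pipe operator (this assumes the magrittr pipe is available; the book's exact code may differ):
library(magrittr)
mtcars %>%
  ggplot(aes(x = wt, y = mpg, colour = factor(am))) +
  geom_point() +
  facet_grid(. ~ cyl)   # one sub-plot per number of cylinders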
Figure 31.4: A facet plot will create sub-plots per discrete value of one or more variables. In this case
we used the number of cylinders, so each sub-plot is for the number of cylinders that is in its title.
The package ggplot2 comes with a dataset mpg that has more rows and different infor-
mation. Try to find out more about the cars in the dataset by using ggplot2. Better still
go to https://www.fueleconomy.gov and download a recent dataset.
31.2 Over-plotting
The dataset mtcars has only 32 rows, and the aforementioned plots are not cluttered and clear.
Generally, we can expect more than 32 rows in a dataset and soon plots will be so busy that it
is not possible to spot any trend or pattern. For example, a loan portfolio can easily have a few
million customers, high frequency market data can have multiple thousands of observations per
second, etc.
For a larger dataset, a simple scatterplot will soon become hard to read. To illustrate this,
we will generate our own dataset with the following code and plot the result in Figure 31.5. In
this example it might be useful to think of the variables to be defined as: "LTI" meaning "loans to
installments" (the amount of loan repayments due every month divided by the income over that
period), and "DPD" as "days past due" (the number of days that the payments were due but not
paid).
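A sketch of how such a dataset could be generated (the distributions and parameters below are purely illustrative; the book's actual simulation may differ):
set.seed(1890)                                         # hypothetical seed, for reproducibility
N   <- 5000
LTI <- abs(rnorm(N, mean = 0.3, sd = 0.15))            # loan repayments / income
DPD <- abs(rnorm(N, mean = 100 * LTI, sd = 50 * LTI))  # days past due, noisier for high LTI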
Figure 31.5: The standard functionality for scatterplots is not optimal for large datasets. In this case,
it is not clear what the relation is between LTI and DPD. ggplot2 will provide us some more tools to
handle this situation.
# Plot the newly generated data and try to make it not cluttered:
plot(LTI, DPD,
pch=19, # small dot
ylim =c(0, 100)) # not show outliers, hence zooming in
We can think of some tools to solve this problem in base R: use a plot character that is small
(like pch = 1, 3, or 5) or use boxplots. However, ggplot2 has a larger set of tools to tackle this
situation.
ggplot2 has all the tools of the traditional plot() function, but it has a few more possibilities.
We can choose one or more of the following:
• use violin plots: see Chapter 9.6 “Violin Plots” on page 173;
• it is possible to summarise the density of points at each location and display that in some
way, using geom_count(), geom_hex(), geom_bin2d() or geom_density2d(). The latter will
first make a 2D kernel density estimation using MASS::kde2d() and display the
results with contours.
Note that ggplot2 will warn us about the data that is not shown due to the cut-off on the y-axis – via the function ylim(0, 100) – as in the sketch below.
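One possibility, sketched with stat_density_2d() (the exact settings behind the book's Figure 31.6 may differ):
ggplot(data.frame(LTI, DPD), aes(x = LTI, y = DPD)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +  # 2D density
  ylim(0, 100)   # zoom in: outliers above 100 are not shown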
(Figure: a contour/density plot of DPD versus LTI, shaded by density.)
Figure 31.6: The contour plot is able to show where the density of points is highest by a visually
attractive gradient in colour.
Another approach could be based on a scatter plot to which we add transparency for the dots
and (if the dataset is not too large2 ) a Loess fit via geom_smooth() – the result is shown in
Figure 31.7 on page 695:
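A sketch of that approach (the alpha level, and whether the data was first sampled down, may differ from the book):
ggplot(data.frame(LTI, DPD), aes(x = LTI, y = DPD)) +
  geom_point(alpha = 0.2) +         # transparent dots reduce over-plotting
  geom_smooth(method = "loess") +   # Loess fit with confidence band
  ylim(0, 100)                      # outliers above 100 are not shown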
Any method passed to geom_smooth() – such as lm, gam, and loess – will only take into
account the observations that are not excluded from being plotted. In our case we have
a cut-off on the y-axis with ylim(0, 100) , so any outliers that have a DPD higher than
100 do not contribute to the model being fitted.
2 For a Loess estimation the time to calculate and the memory use are quadratic in the number of observations, and
hence with current computers a dataset of about a thousand observations is the maximum. For larger datasets, one can
then resort to taking a random sample of the data or using another model fit; for example, a linear model can be
used with the option method='lm'.
(Figure: a scatter plot of DPD versus LTI with a Loess fit.)
Figure 31.7: Adding a Loess estimate is a good idea to visualize the general trend in the data. The
algorithm, however, needs a time that increases as the square of the number of observations and will
be too slow for larger datasets.
The geom “smooth” takes an argument method. This refers to the model that is fitted.
For example we can use lm, glm, gam, loess, or rlm. The default is method = "auto".
In that case, R will select the smoothing method is chosen based on the size of the largest
group (across all panels) that is being plotted. It will select “loess” for sample sizes up to
1 000 observations and in all other cases it will use “gam” (Generalized Additive Model)
with the formula formula = y ~ s(x, bs = "cs") . The reason for that process is
that usually “loess” is preferable because it makes less assumptions and hence can reveal
other trends than only linear. However, Loess estimation will need O(n2 ) memory, which
can become fast insurmountable. This is the reason why we in our example have reduced
the number of data-points.
To summarize the section about ggplot2, we will show one more example on another dataset.
We choose the dataset “diamonds.” This dataset is provided by the package ggplot2 and is ideally
suited to show the excellent capabilities of ggplot2.
(Figure: facetted density plots of price versus x for the diamonds dataset.)
Figure 31.8: This plot shows a facet plot of a contour plot with customised colour scheme.
The following code segment will generate the plot in Figure 31.8.
library(ggplot2)
library(viridisLite)
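# What follows is a sketch of one way to obtain such a plot (the book's exact
# code may differ): a 2D density of price versus x, facetted per cut, with a
# viridis colour scale.
ggplot(diamonds, aes(x = x, y = price)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  scale_fill_gradientn(colours = viridis(10)) +
  facet_grid(. ~ cut) +
  xlim(3, 9)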
While ggplot is not really compatible with the piping operator ( %>% ),a it is possible to
use different notation styles. In this section we changed from creating an object p and
adding to this, over moving the + to the end of the line so that R expects further input, to
omitting the object and allowing the function ggplot() to send the output directly.
a While we can pass data to ggplot2 with the pipe operator, later on we cannot use it to add a geom, for
example. The workaround is to use the package wrapr and its %.>% operator.
Given the data in the example above, assume that LTI stands for "loans to income" (or
rather the loan instalments to income ratio) and DPD stands for "days past due" (the arrears
on the payments of the loans). Further, assume that you would be requested to build a
model that predicts the DPD in function of LTI. How would you proceed and what model
would you build? Do you think this data is realistic in such a case? What does it mean?
♣ 32 ♣
R Markdown
Once an analysis is finished and conclusions can be drawn, it is important to write up a report.
Whether that report will look like a slideshow, a scientific paper, or a book, some tasks will be quite
laborious. For example, saving all plots, then adding them to the document, and – even worse –
doing the whole cycle again when something changes (new data, a great idea, etc.).
R Markdown is a great solution. R Markdown is a variant of markdown (that is – despite its
name – a "markup language") that is designed to mix text, presentation, and R-code.
Hint – RStudio
It is entirely possible to do everything from the command prompt, but working with
RStudio will be most efficient.a
a It is worth to note that rmarkdown is produced and maintained by the RStudio team, yet it is made
freely available for everyone. The website of RStudio is http://www.rstudio.com.
RStudio shortens the learning curve by facilitating the creation of a new document
(see Figure 32.1 on page 700) and inserting some essential example text in that new document.
Most of the content is self-explanatory and hence one can soon enjoy the great results instead of
spending time learning how R Markdown works.
For example, when creating a new file in RStudio we can select that we want PDF slides – and
a popup message tells us to make sure we have LATEX installed on our system. The new file that
we create this way will not be empty, but contains a basic structure for PDF slides.
An example content of such R-markdown file is shown below.
---
title: "R Markdown"
author: "Philippe De Brouwer"
date: "January 1, 2020"
Figure 32.1: Selecting File → New File → R Markdown... in RStudio will open this window
that allows us to select which R Markdown file type we would like to start with.
output: beamer_presentation
---
## R Markdown
- Bullet 1
- Bullet 2
- Bullet 3
```{r pressure}
plot(pressure)
```
Some aspects that might seem cosmetic are actually very important. For example, the
backticks, pound-symbols, and hyphens have to be at the beginning of the line. Put them
somewhere else and R will not recognize them any more.
The first section – between the --- signs – contains the YAML headers. They tell what type of file
we want to create. Do not alter these --- signs, nor move them; change the content between
the quotes as desired. The last entry in the YAML headers ( output: beamer_presentation ) is
the one that will be responsible for creating a PDF slideshow (based on the package "beamer" for
LATEX) – hence the need for a working version of LATEX.
RStudio will now automatically show a Knit button. This button will run all the R code, add
the output to the R-Markdown file, convert the R-Markdown to LATEX, and finally compile this to
PDF.
Then, following the R-Markdown content, a new header slide will be created for each level-one
title.
Executable code is placed between the three-backtick markers ( ``` ); we tell knitr that it is R-code by
putting the letter "r" between curly brackets. That is also the place where we can name the code
chunk and optionally override the default options knitr::opts_chunk$set(echo = FALSE) .
For example, the following code will create a histogram on the slide with title “50 random
numbers”.
## 50 random numbers
```{r showPlot}
hist(runif(50))
```
Digression – R Bookdown
There is a flavour of R Markdown that is especially designed to facilitate the creation of
books. It is called “R Bookdown” and its homepage is here: https://bookdown.org. As
a bonus, you will find some free books on that page.
♣ 33 ♣
knitr and LATEX
It is hard to beat R-Markdown when the document needs limited customization and when
R-output is omnipresent. When the balance is more toward the text, it might be a good option
to do the opposite: instead of working from R-Markdown and R and calling LATEX to create the
finished product, it might make more sense to work in LATEX and call R from there.
This gives us the unrivalled power of LATEX to typeset articles, presentations, and books and
neatly include R-code and/or output generated from R. This book is compiled this way.
LATEX is a markup language (just as R-Markdown itself for that matter) that is extremely
versatile – and Turing complete. It is often said that LATEX produces neat and professional looking
output without any effort but that producing a confusing document requires effort in LATEX. This
is quite the opposite of a regular WYSIWYG text editor.
This is only one of the many advantages of LATEX. There are many more advantages, but probably
the most compelling ones are that it is the de facto standard for scientific writing, allows
for high automation and – since it is Turing complete – will always be more capable than any text
editor.1 For example, in this book all references to titles, figures, and tables are done automatically;
and we can really customize what the reference looks like. That experience is closer to
a programming language than to a document editor.
Explaining the finer details of working with LATEX is a book in itself and beyond the scope of
this book. In the remainder of this chapter we want to give a flavour to the reader of how easy
LATEX can be and how straightforward it is to make professional documents that include R-code
and the results of that code. We will build a minimal viable example. To get started you will need
a working version of R and a LATEX compiler.
Already in Chapter 4 "The Basics of R" on page 21 we explained how to install R. LATEX
is installed on the CLI as follows.2
sudo apt-get install texlive-full
On Windows you want to download and install MiKTeX – see https://miktex.org and follow
the instructions there.
LATEX uses normal text files that follow a specific syntax and grammar. So, it is not necessary to
use a specific text editor. However, a text editor that understands the mark-up language is easier
to work with, as it will help by auto-completing commands and highlighting keywords. Similar to
what RStudio is to R, there are also IDEs available for LATEX. For example, you might want to use
Texmaker or Kile.
1 While these objective arguments in favour of LATEX are very convincing, for the author the most important
one is that it is intellectually satisfying, while working with a WYSIWYG text editor is intellectually frustrating and
condescending.
2 We assume here that you know to update all your packages before installing anything on your system.
Now, create a text-file (by using one of the IDEs or your favourite text editor) with the
following content and name it latex_article.Rnw. The name is of course not important, but if
you use another name you will have to substitute this in the following code.
\documentclass[a4paper,12pt]{article}
\usepackage[utf8]{inputenc}
\begin{document}
<<echo=FALSE,include=FALSE>>=
library(knitr) # load knitr
opts_chunk$set(echo=TRUE,
warning=TRUE,
message=TRUE,
out.width='0.5\\textwidth',
fig.align='center',
fig.pos='h'
)
@
\maketitle
\begin{abstract}
A story about FOSS that resulted in software that rivals any --even very
expensive-- commercial software \ldots and has the best user support
possible: a community of people that help each other.
\end{abstract}
If not already done, save this in a text-file and give it the name latex_article.Rnw. Now,
we are ready to compile this article. It is sufficient to execute two lines in the Linux CLI:
R -e 'library(knitr); knit("latex_article.Rnw")'
pdflatex latex_article.tex
Figure 33.1: The LATEX article looks like this. Note that this is a cropped image so it would be readable
in this book. The result will be centred on an A4 page.
The first line calls the R software and executes two commands: it first loads the knitr library
and then “knits” the document. This means that it takes the Rnw file, and extracts all expressions
that are between <<>>= (eventually with optional arguments between the double brackets) and
the @ as well as all the code wrapped in the macro \Sexpr{} and executes that in R, then “knits”
all this code, its output and plots together with the remainder of the LATEX code. The result of this
knitr process is then stored in the file latex_article.tex.
The second line uses the command pdflatex to compile the tex file into a pdf file.
Note that knitr provides two methods to include R-code in the LATEX file.
1. Code chunks are placed between a <<...>>= marker (possibly with optional arguments between the double angle brackets) and a closing @ ; they are executed and their output is woven into the document.
2. Short code that appears in the normal text can be wrapped in the function \Sexpr{} .
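For example, both methods can be combined in an Rnw file as follows (a small sketch; the chunk name is arbitrary):
<<sampleChunk, echo=TRUE>>=
x <- rnorm(100)   # some data generated in R
mean(x)
@

The mean of our sample is \Sexpr{round(mean(x), 2)}.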
The result will be a pdf file that is called latex_article.pdf. It will look as in Figure 33.1
on page 705.
Digression – RStudio
While it is easy enough to work in the CLI and often preferable to use our most loved IDE
or text editor, it is also possible to use RStudio. With RStudio it is not necessary to use any
command line and a click of the button will compile the R-markdown to LATEX code, then
compile that and present the result in a separate screen. Maybe you prefer this workflow?
♣ 34 ♣
An Automated Development
Cycle
R allows for a completely automated development cycle, from data extraction over
data-wrangling, data exploration, modelling, and reporting. There are many possible
ways to automate a model development cycle, and there is no clear best solution. When building
an automated modelling cycle it is, however, worth considering the following hints.
• R has a function source() that allows us to read in a file of commands – for example the file
that loads in the new dataset or provides some frequently used functions.
• R can be invoked from the command line, and from the command line we can ask R to
execute some commands. We demonstrated this in the previous chapter (Chapter 33 "knitr and
LATEX" on page 703). This implies that, together with the source() function, it is possible to
automate about everything.
• Further, adding the governing commands to the crontab file of your computer will cause
them to be executed at any regular or irregular time interval that you might choose. For
example, we can run every week a file that extracts data from the servers and updates our
data-mart, or we can run a weekly dashboard that checks the performance of our models.
• While database servers have their own variations of the source() function and can usually
also be controlled from the command line, it is also possible to access those database servers
from R – as demonstrated in Chapter 15 "Connecting R to an SQL Database" on page 253.
For example, a scheduled job can combine the crontab and source() ideas – see the sketch after this list.
• The object-oriented structure in R allows us to build objects that hold everything together.
For example, we can build an object that knows where to find data, how to wrangle that
data, what to model and how to present it; we can even add to this nice package the
documentation of the model and contact details. The objects can, for example, have methods
that update data, run analysis, check model performance, and feed that information to a
model-dashboard – see Chapter 36.3 "Dashboards" on page 725. More information about
the OO model is in Chapter 6 "The Implementation of OO" on page 87 – we recommend
especially to have a look at the RC objects for this purpose.
• It is also not too difficult to write one’s own packages for R. For example, we could build a
package for our company that facilitates connection to our servers, provides functions that
can be used in shiny to use the look and feel of our company, etc. This package can hold
whatever is relevant to our model and provides a neat way to store the logic, the data, and
the documentation in one place. More about building packages is in Chapter A "Create
your own R Package" on page 821.
• It is possible that the model will be implemented in a specific production system, but in
many cases the goal is to present the analysis. R has great tools to produce professional
documents with a gentle learning curve – see for example Chapter 32 “R Markdown” on
page 699, Chapter 36.3 “Dashboards” on page 725, etc. This allows us even to publish results
on a corporate server, which eliminates the need to send spreadsheets around and will be
more useful to the recipient as he/she can interact with the data, while the report owner
can follow up who is using the report.
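As an illustration of the crontab and source() ideas combined, a scheduled script could look as follows (a sketch: the file names, paths, and schedule are hypothetical):
# weekly_refresh.R -- run e.g. by a crontab entry such as:
#   0 6 * * 1  Rscript /home/analyst/weekly_refresh.R
source("/home/analyst/lib/helpers.R")             # frequently used functions
source("/home/analyst/etl/update_datamart.R")     # refresh the data-mart
rmarkdown::render("/home/analyst/dashboard.Rmd")  # rebuild the weekly dashboard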
This toolbox is very powerful and allows a corporate of almost any size to increase efficiency,
reduce risks, and produce more sustainable results . . . while spending nothing on expensive software.
Not only can the aforementioned software be used free of charge – even for commercial use
– but also the stack on which it is built (such as the operating system, windows manager, etc.) can
be completely free.
♣ 35 ♣
Writing up conclusions is often an essential part of the work of the data scientist. Indeed, the
best analysis is of no use if it does not convince decision makers. In order to be effective, the
data-scientist needs also to be a skilled communicator.
The exact style and the attention to things such as abstract, summary, conclusions, executive
summary, and many more aspects will depend on the tradition in your company. However, some
things are bound to be universal. For example, we live in an age where information is abundant
and it is not abnormal to communicate on different social platforms and still receive hundreds
of mails every day, while the normal pattern is meetings back-to-back. The average manager will
hence have little time to read your document. While you want of course the document to be
readable (avoid slang and acronyms, for example), you also will want to keep the message as short and
clear as possible. Indeed, in this day and age of information overload, concise communication is
of paramount importance.
A first thing to do is re-read your text and make sure it is logical, clear, and that every word
really has to be there. If something can be written shorter, then that is usually the way to go.
However, even before we start writing it is worth checking some basic rules. When asked to
look into a problem, make sure to consider the following points.
1. Identify the main question or task at hand. Usually, there is a situation and a complication
in some form that leads directly to one main question. Focus on that single question.
4. Then for each of the small problems follow the logic of this book: get the relevant data,
wrangle and explore, model, gather evidence, and draw conclusions.
Consider the situation where you have been tasked to investigate customer profitability.
You have set up a team that for one year has been building a model to calculate a customer
value metric (CVM); you have not only made good proxies of actual costs and income
streams but also used machine learning to forecast the expected customer lifetime value
(CLTV) for each customer. During that year your team has grown from yourself only to 15
people; it was hard work, but you managed to keep the staff motivated and are very proud
of the hard work.
While the initial idea was to “do something extra for very profitable customers,” your
conclusions are that
1. there are also a lot of non-profitable customers that only produce losses for us;
2. the actual customer segmentation – based on wealth (for a bank) – does not make
sense.
Therefore, you believe it is best to review actual customer segmentation, and find
solutions for the loss-producing customers as well as focussing more on profitable
customers.
The analysis has revealed that there are more problems that need addressing. What is now
the best way forward?
Suppose that we are faced with a situation as in the aforementioned example. How would we
best communicate, and to whom? Given the request, there is certainly an interest in profitability
and most probably people are open to discussing what drives customer profitability. Challenging
the customer segmentation is an entirely different task. This would involve restructuring teams,
re-allocating customers and most probably will shift P&L from one senior manager to another.
This is not something that can be handled overnight and it will probably need board sign-off.
Therefore, we might want to focus – for each existing customer segment – on making the
customer portfolio more profitable. We might want to propose to offboard some customers, cross-
sell to others and provide a better service to a third category that is already profitable.
When compiling the presentation it is a good idea to follow certain simple guidelines. The
whole presentation can best be compared with a pyramid where at each level we start with the
conclusions and provide strong evidence. At each level we find similar things (arguments,
actions, etc.). For example, we could structure our presentation as follows.
1. We need a simple matrix of three actions for each customer segment. This is about
segment A, and we need to get rid of some customers, cross-sell to others, and make sure
we keep the third category.
2. Introduction: Remind the reader what he/she already knows and hence what the key ques-
tion is to be answered. Then for each segment elaborate the following.
(a) There is a group of customers that are unprofitable, and never will be profitable. It is
unfair towards other customers to keep them.
• demonstrate how unprofitable these customers are
• show how even cross selling will not help
• show how much we can save by focussing on profitable customers
(b) The second group of customers can be made more profitable by cross-selling (and we
have suggestions on how to do that)
• show the sub-optimal nature of these customer accounts
• show how suitable cross-selling helps
• estimate of the gains
(c) The last group of customers are the ones that are very profitable, we can help you
identify them and we suggest to do all you can to keep them as customers.
• show how profitable these customers are
• show why they are profitable
• forecast their contribution to future profits
3. Conclusions: repeat the three actions per client segment and propose first steps.
4. Appendix.
The point that stands out most in the presentation structure above is probably that what matters
most to us (how smart the team was, how hard it worked, what concepts it developed, etc.) is
banished to the appendix and is only used in case someone asks about it.
Hint – Think from the audience’s point of view
When making a presentation, always start thinking from the point of view of the audience: start
from their needs and goals, ask yourself what makes them tick, what is acceptable for them and
what is not . . . and what is important to convince them (seldom will the overtime that
you have made, the genius breakthrough, or a cute mathematical formula fit that category).
There are a few golden rules to keep in mind about the structure of a presentation.
1. The ideas – at any level in the structure – are summaries of the ideas below – in other words,
start with the conclusion on the title slide, the title of the next slide is the summary of the
things on the slide, etc. On a smaller scale, an enumeration groups things of the same type
(e.g. does not mix actions and findings) and the title of that slide will be a summary of that
list.
2. Ideas in each grouping should be of the same kind – never mix things like arguments and
conclusions, observations and deductions, etc.
3. Ideas in each grouping should be logically ordered – ask someone else to challenge your
presentation before sending it to the boss.1
1 If you fail to do this step, your boss will play the role as challenger, but that is not best use of his/her time, nor
Making slides is an art, and it really depends on what you want to achieve, for what audience,
what size of audience, etc. Are you making a presentation for a thousand people, or
are you having a one-to-one discussion where you will leave the slides with the decision
maker? It is obvious that, while the structure can be largely the same, the layout will
vary dramatically. In the first case, we might focus on a few visuals and eliminate text
altogether; in the second case text is of paramount importance and titles can even be multiple
lines long. For each level in the presentation, the mantra "the ideas at any level in the
structure should be summaries of the ideas below" holds; this means that you should put
summaries of the ideas on the slide in the titles.
Finally, one will want to pay attention to the layout of the slides. When working with R-Markdown
(see Chapter 32 "R Markdown" on page 699) or with the Beamer class in LATEX, good
layout is the standard and it will be hard to make an unprofessional layout. Working with WYSIWYG
slide-producing software it is easy to have a different font, font-size and colour on each page.
That will not look professional, nor will it be conducive for the reader to focus on the content. Most
companies will have a style-guide and templates that will always look professional.
♣ 36 ♣
Interactive Apps
A static, well written report, a good book, or a scientific paper have their use and value, but some
things look better on screen. More importantly, the screen also allows the user to interact with the data.
For example, putting the data on a website can, with the right tools, allow the user to visualize
different cuts of the data, zoom in or drill down to issues of interest, or move around 3D plots on
the screen.
In the larger company, too many people spend their time manually processing the same
data in an electronic spreadsheet over and over again, and then mailing it to too many people. Typically,
no one reads those weekly reports, but when asked if the report is still needed, the answer is usually
a confirming "yes." Be warned, though: this asking is dangerous. It usually triggers someone to
open the file and send out a mail with requests: add a plot here, a summary there, and of course
some other breakdown of the data.
Putting the dashboard online not only allows people to "play" with the data and get more
insight from the same data, but it also allows the publisher to follow up on the use of the dashboard.
It is also much faster: it can be updated as the data flows in and the user is not bound to weekly
or monthly updates. It will take a little more time to produce it the first time, but then it will not
take any human time at all to update it every second.
Unfortunately, companies tend to use old technology (such as spreadsheets and slides) and
then make the jump to super-powerful engines that are able to make sense of ultra-fast changing
data flows that flow into Hadoop infrastructure. Usually, that is overkill, not needed, and way too
expensive.
Knowledge is power and that is even more true in this age of digitization: “data is the new
oil.” Many companies have data that can be monetized (for example by enhancing cross-sales).
Business intelligence tools hold the promise to unlock the value of that data.
BI stands for "business intelligence," and it is a powerful buzzword. So powerful that all the
large IT companies have their solution: SAP, IBM Cognos, Oracle BI, Dundas BI, and Microsoft
Power BI. Also some specialized players offer masterpieces of software: QlikView, Tableau,
PowerBI and Looker, just to name a few. These things can do wonders: they all offer great
data visualization libraries, OLAP1 capacity, analytical tools to make simple or more advanced
models, offer support for different document formats as well as online interactive display, allow
thresholds to be defined, and can send alarm emails when a threshold is breached; they offer
great integration with almost all database systems and big data platforms.
1 OLAP is the buzzword for "online analytical processing;" it means that data can come in raw format and that
only at the moment the user requests the plot is a snapshot taken to calculate the plot.
The Big R-Book: From Data Science to Learning Machines and Big Data, First Edition. Philippe J.S. De Brouwer.
© 2021 John Wiley & Sons, Inc. Published 2021 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/De Brouwer/The Big R-Book
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 714
❦
These systems are truly great solutions, but did we not just describe R? Indeed, R, with some
libraries R can do all of that.2 Maybe it looks a little less flashy for an executive demo, and there
SLA is a friendly online community to help you out instead of a helpline with and SLA3 based on a
service level agreement contract.
Using R it is also of crucial importance to manage the updates and the dependencies of pack-
ages carefully. It can happen that you use a package X and that after updating R this package
will not load anymore. In this case, you might want to wait to update R before the update of the
package X is available or alternatively update the package yourself. Because R is FOSS, there are
so many packages and it is not realistic that they are all tested and updated as a new release of
R comes available. Further, it might be that package X depends on package Y and it might be
that Y is not yet updated. Commercial software usually has better backwards compatibility and
the management of dependencies is less crucial. However, most R-code will usually continue to
work fine for many years.
However, the difference that stands out is the price tag. The commercial tools typically will set
you back millions of dollars and R is free. Not only free, but also open source so you know exactly
what you get, and if you need to change something, then you do that and recompile instead of
adding your request to a feature list.
The reader might also appreciate the fact that free and open source is the best warranty against
back-doors in the software or even data-leakages. If you are in doubt, it is possible to check the
source code, compile that code and only this version.
❦ ❦
2 R is not the only free and open source solution for data manipulation, model building and visualization. There
response time or metrics such as not more than ten percent of the cases can take more than one business day to
solve.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 715
❦
36.1 Shiny
The father (or mother?) of R’s fantastic online capacity is the library shiny. It is made available by
the company around RStudio. Shiny provides a straightforward interface that facilitates building
interactive web apps directly from R. Apps can be hosted as standalone apps on a web-page or
embedded in R Markdown documents. Because it is built for online processing and visualization
it is ideal for dashboards. It will even generate HTML 4 and use Bootstrap style CSS styling. So,
the website or app can be extended and modified with CSS themes,4 htmlwidgets, and JavaScript
interaction.
In order to display the app online, it has to be hosted on a webserver. It will work on the
usual LAMP stack (Linux, Apache, MySQL, and PHP/Python/Perl), however, it is designed for
the newer gdebi. In theory, it can co-exist with Apache, though it is bound to raise issues. So,
if you already have other webservers it might be wise to use a dedicated server for Shiny apps.
There are free and commercial versions available. More information can be found on https:
❦ //www.rstudio.com ❦
In order to show Shiny apps on your webserver, it is also possible to publish them on
http://www.shinyapps.io and then include them in the html code of your page via
an <iframe> tag and the app will seamlessly integrate with the page that is calling it.
Using Shiny is easy enough thanks to the excellent help that is available online. For example,
RStudio provides an excellent tutorial – available at http://shiny.rstudio.com/tutorial –
that will get you started in no time.
The first thing to do is – as usual – installing the package then load it. Then we can immedi-
ately look at an example.
library(shiny)
runExample("01_hello")
Pressing enter after the line runExample("01_hello") will produce the output
Listening on http://127.0.0.1:7352 and open a web-page with the content in Fig-
ure 36.1 on page 716. In the meanwhile, the R-terminal is not accepting any further commands.
[CTRL]+C (or [ESC] in RStudio) will allow you to stop the webserver and make the command
prompt responsive again.
4 The standard layout that Shiny provides looks very professional and nicely integrates with Twitter’s Bootstrap.
In order to access more of the Bootstrap widgets directly, use the package shinyBS.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 716
❦
Figure 36.1: The output of one of the examples supplied by the Shiny package. Note the slider bar
on the left and the code on the right.
❦ The examples provided show the code and are so well documented that it is hardly necessary ❦
to explain here too many details. Rather we will explain the main work-flow, concepts and focus
on a few caveats.
The command runExample() will list all examples provided. Look at their code and start
modifying the one that is closest to your needs.
There are a few things to be noted, though. First, is that there are two main ways to code an
app. First, – the older method – is to provide two files: server.R and ui.R and the – newer –
method to fit all code into one file app.R
Here is a simple example of the second method that has all code in one file:
# The name of the server function must match the argument in the
# shinyApp function also it must take the arguments input and output.
server <- function(input, output) {
output$distPlot <- renderPlot({
# any plot must be passed through renderPlot()
hist(rnorm(input$nbr_obs), col = 'khaki3', border = 'white',
breaks = input$breaks,
main = input$title,
xlab = "random observations")
})
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 717
❦
# The name of the ui object must match the argument in the shinyApp
# function and we must provide an object ui (that holds the html
# code for our page).
ui <- fluidPage(
titlePanel("Our random simulator"),
sidebarLayout(
sidebarPanel(
sliderInput("nbr_obs", "Number of observations:",
min = 10, max = 500, value = 100),
sliderInput("breaks", "Number of bins:",
min = 4, max = 50, value = 10),
textInput("title", "Title of the plot",
value = "title goes here")
),
mainPanel(plotOutput("distPlot"),
h4("Conclusion"),
p("Small sample sizes combined with a high number of
bins might provide a visual image of the
distribution that does not resemble the underlying
dynamics."),
"Note that we can provide text, but not
<b>html code</b> directly.",
textOutput("nbr_txt") # object name in quotes
)
)
)
❦ ❦
# finally we call the shinyApp function
shinyApp(ui = ui, server = server)
This code can be executed in an R terminal, and the function shinyApp() will start a web-
server on the IP loopback address of your computer (usually 127.0.0.1) and broadcast to port 7352
the webpage. Then, it opens a browser and directs it to that ip address and port. The browser will
then show an image as in Figure 36.2 on page 718.
Feel free to test the app in the browser. The app is interactive: changing an input will modify
the plot and text below it immediately. What you’re looking at now, is a simple shiny app, but
you probably get an idea about the possibilities of this amazing platform.
It is of course possible to include the app on a web page, so that it it is accessible from any-
where on the Internet. RStudio provides a free version of “ShinyServer” that can be put on your
(dedicated) webserver of choice. There are also non-free versions available.
They also provide a website ShinyApps.io on which you can put your apps for free and then
include them on your web pages via an iframe. The process is simple, and after signing in on
that website the user is guided through the process.
It tells us to install a package rsconnect, then authorize the account with a function provided
by the package. Finally, the app can be deployed as follows:
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 718
❦
The great thing is that when people visit your website, they will only see the app that sits on
one of your web pages. This means that your visitors will not even notice that the app itself and
R behind it are actually hosted on another server.
These sections should get you started, and your path might be very personal, but we
believe anyone can benefit of the shiny tutorial that is here: https://shiny.rstudio.
com/tutorial
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 719
❦
36.2.1 HTML-widgets
Adding more interactivity is made easy with many free libraries that are called “HTML-widgets.”
The concept “HTML-widgets” is a little misleading in that sense that HTML is a passive markup
language to present content on a s screen. The interactivity usually comes from Javascript, but it
is enough to have some idea how HTML works in order to use those widgets. javascript
Here is a selection of some personal favourites of the author:
• ggvis: Interactive plots – see Chapter 36.2.3 “Interactive Data Visualisation with ggvis” on
page 721;
Need even more eye-candy? An ever growing list of widgets can be found online at http:
//www.htmlwidgets.org.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 720
❦
library(leaflet)
content <- paste(sep = "<br/>",
"<b><a href='http://www.de-brouwer.com/honcon/'> Honorary Consulate of Belgium</a></b>",
"ul. Marii Grzegorzewskiej 33A",
"30-394 Krakow"
)
map <- leaflet() %>%
addProviderTiles(providers$OpenStreetMap) %>%
addMarkers(lng = 19.870188, lat = 50.009159) %>%
addPopups(lng = 19.870188, lat = 50.009159, content,
options = popupOptions(closeButton = TRUE)) %>%
setView(lat = 50.009159,lng = 19.870188, zoom = 12)
map
When the aforementioned code is run, it will display an interactive map in a browser that
looks like the figure in Figure 36.3.
❦ ❦
Figure 36.3: A map created by leaflet based on the famous OpenStreetMap maps. The ability
to zoom the map is standard, the marker and popup (with clicking-enabled link) are added in the
code.
Note that leaflet can also be integrated with shiny, in order to create interactive maps
that use your own data and display it for the user according to his or her preferences and
interests.
5 To experience the interactivity it must be included in an interactive application, for example based on the shiny
framework.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 721
❦
This simple example should help you to get started with leaflet. We refer to its website
for deeper information, more customization and other examples: https://rstudio.
github.io/leaflet
“Leaflet is one of the most popular open-source JavaScript libraries for inter-
active maps. It’s used by websites ranging from The New York Times and
The Washington Post to GitHub and Flickr, as well as GIS specialists like
OpenStreetMap, Mapbox, and CartoDB.”
a The homepage of leaflet is https://rstudio.github.io/leaflet.
Interactive apps are not only a good way for high ranking managers to consume their
custom MI and allow them to zoom in on the relevant information for them. It is also a
great tool that allows the data scientist to get used to new data before making a model or
analysis.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 722
❦
titanic_train$Age %>%
as_tibble %>%
na.omit %>%
ggvis(x = ~value) %>%
layer_densities(
adjust = input_slider(.1, 2, value = 1, step = .1,
label = "Bandwidth"),
kernel = input_select(
c("Gaussian" = "gaussian",
"Epanechnikov" = "epanechnikov",
"Rectangular" = "rectangular",
"Triangular" = "triangular",
"Biweight" = "biweight",
"Cosine" = "cosine",
"Optcosine" = "optcosine"),
label = "Kernel")
)
Executing this code fragment will open a browser with the content shown in Figure 36.4.
❦ ❦
Figure 36.4: A useful tool to explore new data and/or get an intuitive understanding of what the
different kernels and bandwidth actually do for a kernel density estimation.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 723
❦
Next, we need the ui.R-file that provides the code regulating how the items are placed on the
screen:
library(ggvis)
fluidPage(sidebarLayout(
sidebarPanel(
❦ # Explicit code for a slider-bar:
❦
sliderInput("n", "Number of points", min = 1, max = nrow(mtcars),
value = 10, step = 1),
# No code needed for the smoothing span, ggvis does this:
uiOutput("plot_ui_div") # produces a <div> with id corresponding
# to argument in bind_shiny
),
mainPanel(
# Place the plot "plot1" here:
ggvisOutput("plot1"), # matches argument to bind_shiny()
# Under this the table of selected card models:
tableOutput("carsData") # parses the result of renderTable()
)
))
The application can now be placed on a server – such as ShinyApps.io for example – or
opened in a web-browser directly via the following command.
shiny::runApp("/path/to/my/app")
Executing this line will open a browser with the content shown in Figure 36.5 on page 724.
This allows us to inspect how the app works before publishing it.
36.2.4 googleVis
If we are thinking about a professional looking dashboard or a graphical library that can make
fancy plots but also gauges, maps, organization charts, sankey charts on top of all the normal plots
such as scatter-plots, line charts, and bar charts that is designed to be interactive then GooglerVis
imposes itself the natural choice.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 724
❦
Figure 36.5: An interactive app with ggvis. This particular example uses the data from the dataset
mtcars, to experiment with two parameters: the size of the sample and the span of the Loess smooth-
ing. This is most useful to get a better intuitive understanding of the impact of sample size and span.
❦ ❦
The googleVis package provides an interface between R and the Google Charts API. As all
other packages, it is not only free but also open source. Its code is maintained on github, just as
googleVis all others packages. Its github home is: https://github.com/mages/googleVis.
Note that the interactivity and animations for googleVis will generally require Flash. This
means that in order to see the special effects, your browser will have to support Flash and have it
allowed in its settings. Especially plots with film-like animation will require Flash.
In order to get an idea of the possibilities, please execute the following code.
library(googleVis)
demo(package='googleVis')
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 725
❦
36.3 Dashboards
Dashboards merit a special place in this book and in the heart of every data scientist. They are the
ideal way to visualize data and call for appropriate action and are in many aspects similar to the
function of a periscope in a submarine. Most companies spend massive amounts of working days
on producing “reports”: static visualizations and tables produced, that are usually produced by
Microsoft Excel, but that always require a lot of manual work to maintain. Usually, there is good
interaction with the management for whom the MI (“management information”) is intended MI
Management Information
when the report is created. Then, the report is produced every week, eating up time that could be
used wiser, and unfortunately the report will typically see an ever decreasing amount of people
interested in it.
The answer is – partly – to have dynamic dashboards, that allow the user to change selections,
drill down in the data, and find answers to questions that were never asked before. Unfortunately,
companies are spending every increasing amounts to visualize data on expensive systems, while
R can do this for free . . . and with quite great results.
It is possible to use Shiny and build a dashboard with Shiny alone. However, good dash-
boards will do some the same things over and over again. For example, most dashboards will use
colours to visualize importance of the numbers (for example a simple gauge or a RAG6 status;
the famous red-amber-green can already be very helpful). So, it does make sense to build that
flexdashboard
common infrastructure once and do it thorough. This is what the packages flexdashboard and shinydashboard
shinydashboard do for you.
If it is your task to make a dashboard and you want to use R, then we recommend to have a
❦ thorough look at those two options. The two packages compare as in Table 36.1. ❦
Both solutions will provide you with a flexible and powerful environment that is easily cus-
tomized to the style guide of your company, that allows certain central management of things that
are always the same (such the company logo). Both solutions have an active online community
to provide advice and examples.
Maybe you want to choose shinydashboard because your company website anyhow
uses Bootstrap and your team has also experience with Shiny? Maybe you want to choose
flexdashboard because making a dashboard is just an ancillary activity and you do not have
large teams specializing only in dashboards? Either way you will end up with a flexible and
6 RAG is corporate slang for “red–amber–green”, and it usually means that an item market green does not require
attention, and item market amber should be monitored as it could go wrong soon, and an item market red is requires
immediate attention.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 726
❦
professional solution that can rival the most expensive systems available, while incurring no
direct software cost.
For any business, for any company, for any manager, for any team, and for any individual
employee having a “scorecard” or “a set of KPIs” this is a powerful idea.7 In the first place, it
helps to create clarity about what is to be done, but most importantly the author’s adagio “what
you measure is what you get” always seem to hold. Measuring KPIs and discussing them aligns
minds and creates a common focus. If the manager measure sales per quarter, then your staff
will pursue sales at any cost and converge towards a short time perspective. If the manager has a
place on the scorecard for “feedback from the customer,” then employees will converge towards
entirely different values and value sustainable, long-term results.
A scorecard should be simple (say each employee should focus on maximum seven personal
KPIs), but also offer enough detail to find out when things go wrong and ideally offer enough
detail to get us started when investigating mitigating measures.
A scorecard that has SMART goals (goals that are specific, measurable, attainable but real-
istic and limited in time) also creates a moment to celebrate success. This is another powerful
motivator for any team!
The underlying mechanics of dashboards also apply to any other aspect of any business:
from showing what happens in a production line, over what customers are profitable to what
competitors are doing. Using a scorecard will create clarity about what is done and people will
strive for it.
In the remainder of this chapter we will show how to build a simple dashboard. To focus the
ideas we will create a dashboard that is about diversity of employees.
of this citation in his works. The earliest reference appears to be from Leon C. Megginson: June 1963, Southwestern
Social Science Quarterly, Volume 44, Number 1, “Lessons from Europe for American Business by Leon C. Meg-
ginson”, (Presidential address delivered at the Southwestern Social Science Association convention in San Antonio,
Texas, 12 April 1963), Published jointly by The Southwestern Social Science Association and the University of Texas
Press.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 727
❦
financial performance, and they find that “companies with more diverse top teams were also
top financial performers” — see Barta et al. (2012). However, these kind of studies fail to prove
any causal relationship. The paper phrases it as follows: “We acknowledge that these findings,
though consistent, aren’t proof of a direct relationship between diversity and financial success.”
Indeed, most of those studies only study correlation. We argue that it would be possible to
study causation by for example taking a few years lag between board composition and relative
financial results and carefully filter industry effect and use more control variables. A rare example
is found in Badal and Harter (2014) where one controlling variable “employee engagement” is
used. They find that both gender diversity and this employee engagement independently are able
to “explain” the financial success of a company.
In summary those studies generally prove that financial performance and diversity can be
found together, but fail to answer what is the cause. Therefore, we prefer – at this point – our log-
ical argument based on biological evolution as previously presented. More importantly, striving
to equal chances and more inclusiveness is also the right thing to do. It seems also to go well with
the evolutionary argument of capturing the diversity of ideas.
Now, that we have convinced ourselves that a more diverse and inclusive company is the way
to go, we still need a simple quantification of “diversity” that is universal enough so it can be
applied on multiple dimensions.
What could be more natural measure for diversity than Boltzmann’s definition of entropy? In
1877, he defined entropy to be proportional to the natural logarithm of the number of micro-states
a system could potentially occupy. While this definition was proposed to describe a probabilistic
system such as a gas and aims to measure the entropy of an ensemble of ideal gas particles, it is
remarkably universal and perfectly suited to quantify diversity.
Under the assumption that each microstate is improbable in itself but possible, the entropy
❦ ❦
S is the natural logarithm of the number of microstates Ω, multiplied by the Boltzmann
constant kB :
S = kB log Ω
When those states are not equally probable, the definition becomes:
N
S = −kB pi log pi
i
Where there are N possible and mutually exclusive states i. This definition shows that the entropy
is a logarithmic measure of the number of states and their probability of being occupied.9
If we choose the constant kB so that equal probabilities yield a maximal entropy of 1 (or in
1
other words kB := log(N )
) then we can program in R a simple function.
# diversity
# Calculates the entropy of a system with equiprobable states.
# Arguments:
# x -- numeric vector -- observed probabilities of classes
# Returns:
# numeric -- the entropy / diversity measure
diversity <- function(x) {
f <- function(x) x * log(x)
x1 <- mapply(FUN = f, x)
- sum(x1) / log(length(x))
}
9 Note also how similar entropy is to “information” as defined in Chapter 23.1.1.5 “Binary Classification Trees”
on page 411.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 728
❦
In the context of diversity, entropy works fine as a measure for diversity. Many authors use a
similar definition.10 In this section we will take a practical approach.
If there is a relevant prior probability (e.g. we know that the working population in our area
consist of 20% Hispanic people and 80% Caucasian people) then we might want to show the max-
imum diversity for that prior probability (e.g. 20% Hispanic people and not 50%).
In such case, it makes sense to rescale the diversity function so that a maximum is attained at
the natural levels (the expected proportions in a random draw). This can be done by scaling s(x)
1
so that the scaled prior probability of each sub-group becomes N . So, we want for each group i to
find a scaling so that ⎧
⎨ s(0)
⎪ = 0
1
s(Pi ) = N
⎪
= 1
⎩
s(1)
with Pi the prior probability of sub-group i. For example, we could fit a quadratic function through
these three data-points. A broken line would also work, but the quadratic function will be smooth
in Pi and has a continuous derivative.
Solving the simple set of aforementioned equations, we find that s(x) can be written as:
s(x) = ax2 + bx + c,
where
1− N1P
⎧
⎨a
⎪
⎪ = 1−Pi
i
1−N Pi2
⎪ b = N Pi (1−Pi )
⎪
⎩
c = 0
❦ with N the number of sub-groups and Pi the prior probability of the group i. ❦
To add this as a possibility but not make it obligatory to suppy these prior probabilities, we
re-write the function diversity() so that it takes an optional argument of prior probabilities;
and if that argument is not given, it will use the probabilities as they are.
# diversity
# Calculates the entropy of a system with discrete states.
# Arguments:
# x -- numeric vector -- observed probabilities of classes
# prior -- numeric vector -- prior probabilities of the classes
# Returns:
# numeric -- the entropy / diversity measure
diversity <- function(x, prior = NULL) {
if (min(x) <= 0) {return(0);} # the log will fail for 0
# If the numbers are higher than 1, then not probabilities but
# populations are given, so we rescale to probabilities:
if (sum(x) != 1) {x <- x / sum(x)}
N <- length(x)
if(!is.null(prior)) {
for (i in (1:N)) {
a <- (1 - 1 / (N * prior[i])) / (1 - prior[i])
b <- (1 - N * prior[i]^2) / (N * prior[i] * (1 - prior[i]))
x[i] <- a * x[i]^2 + b * x[i]
}
}
f <- function(x) x * log(x)
x1 <- mapply(FUN = f, x)
- sum(x1) / log(N)
}
10 See for example Jost (2006), Keylock (2005), Botta-Dukát (2005), or Kumar Nayak (1985).
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 729
❦
For example, if we have prior probabilities of three subgroups of 10%, 50% and 40% then we
consider our population as optimally diverse when these probabilities are obtained:
# Consider the following prior probabilities:
pri <- c(0.1,0.5,0.4)
We can also visualize what this function does. For example, assume prior population of men
and women equal and consider gender as binary, then we can visualize the evolution of our index
as follows – the plot is in Figure 36.6 on page 730:
❦ females <- seq(from = 0,to = 1, length.out = 100) ❦
div <- numeric(0)
for (i in (1:length(females))) {
div[i] <- diversity (c(females[i], 1 - females[i]))
}
Note the back-ticks around the variable names. That is necessary because the variable
names contain spaces. Alternatively, it is possible to refer to them using the function
get(’percentage females’). In either case, ggplot2 will insist in adding the func-
tion get() or back-ticks to the labels, so it is best to re-define them with the functions
xlab() and ylab().
Now, that we have some concept of what we can use as a diversity index, we still need some
data. This should of course come from your HR system, but for the purpose of this book, we
will produce data. The advantage of this approach is that you can do this on your own computer
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 730
❦
Diversity Index
1.00
0.75
diversity index
0.50
0.25
0.00
Figure 36.6: The evolution of the gender-diversity-index in function of one of the representation of
one of the genders in our population.
❦ and play with the sample size and learn about the impact of sample size on diversity. Another ❦
advantage is that we can build in a bias into the data, and that known bias will make the concepts
more clear.
library(tidyverse)
N <- 200
set.seed(1866)
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 731
❦
# Now we clean up age and fill in grade, salary and lastPromoted without
# any bias for gender, origin -- but with a bias for age.
d1 <- d0 %>%
mutate(age = ifelse((age < 18), age + 10, age)) %>%
mutate(grade = ifelse(runif(N) * age < 20, 0,
ifelse(runif(N) * age < 25, 1,
ifelse(runif(N) * age < 30, 2, 3)))) %>%
mutate(salary = round(exp(0.75*grade)*4000 +
rnorm(N,0,1500))) %>%
mutate(lastPromoted = round(exp(0.05*(3-grade))*1 +
abs(rnorm(N,0,5))) -1)
While we are usually very comfortably using R in the CLI, it is – at this point – a good idea to
use RStudio. RStudio will make working with dynamic content a lot easier – just as working with
Chapter 36.1 “Shiny” on page 715.
Then in RStudio, chose “new file ” in the file menu and select then “R Markdown,” then “from
template” and finally select “Flex Dashboard” from the list. The screen will look like Figure 36.7
on page 732, and the document will now look similar the following code. Although, note that this
is only the framework (sections without the actual code).
---
title: "Untitled"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
---
```{r setup, include=FALSE}
library(flexdashboard)
```
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 732
❦
Column {data-width=650}
----------------------------------------------------
### Chart A
```{r}
```
Column {data-width=350}
-----------------------------------------------------
### Chart B
```{r}
```
### Chart C
```{r}
```
This prepares the framework for a dashboard that contains a wide column on the left, and
a narrower one at the right (the {data-width=xx} arguments make the columns). Similar to
R Markdown, sections are created by the third-level titles (the words following three pound signs
###).
❦ ❦
Figure 36.7: Creating a flexdashboard from the template provides a useful base-structure for three
plots. It can readily be extended to add one’s own content.
The dashboard that we intend to make will have one front-page that provides an overview of
the diversity in our population based on the indices as defined above.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 733
❦
Row
-------------------------------------
Gender
======================================
❦ Row {.tabset} ❦
-------------------------------------
Age
======================================
Row {.tabset}
-------------------------------------
### first tab
### second tab
Roots
======================================
Row {.tabset}
-------------------------------------
Dependants
======================================
Row {.tabset}
-------------------------------------
Here we present only the structure of the dashboard. The complete working demo-
dashboard can be seen at the website of this book, and the code can be downloaded from
that website (click on the </> Source Code button in the right upper corner), and of
course the code is also in the code-document that goes with this book in the code-box
named “flexdash.”
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 734
❦
The code listing is too long to print here, but it is available in the code that goes with this
book: www.de-brouwer.com/publications/r-book.
❦ ❦
Figure 36.8: The welcome page of the dashboard provides the overview (menu) at the top and in the
body an overview of some diversity indices.
The structure of the dashboard starts with the header of the rmarkdown document. In the
header we tell R what the title is, what theme to use, whether to include buttons to share on
social media, whether to show the source code but also how the layout will look like. We have
chosen “orientation: rows.” This means that sections – characterized by three pound signs: ###
– will be will be shown as rows.
The rmarkdown document then consists of the first level titles – underlined by equal signs –
and second level titles – underlined by at least three dash signs. The first level titles get a separate
page (they appear in the blue header of all pages and eventually collapse on smaller screens).
That brings us to the level of a page. Each page can still contain more than one plot or more
than one row. For example, the first page – the overview – contains four columns in one row and
then two columns in the next row.
The code for the gauges works as follows:
Overview
========
Row
-------------------------------------
```{r}
# here goes the R-code to get data and calculate diversity indices.
```
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 735
❦
### Gender
```{r genderGauge}
# ranges:
rGreen <- c(0.900001, 1)
rAmber <- c(0.800001, 0.9)
rRed <- c(0, 0.8)
iGender <- round(diversity(table(d1$gender)),3)
gauge(iGender, min = 0, max = 1, gaugeSectors(
success = rGreen, warning = rAmber, danger = rRed
))
kable(table(d1$gender))
```
### This is only the first gauge, the next one can be described below.
The widget “gauge” is provided by flexdashboard and takes the obvious parameters such as
minimum, maximum, etc. After the gauge, we simply output a table that sums up the data. This knitr
table is then formatted with the function kable() from knitr. kable()
Also gvis provides a long list of widgets such as gauges for example:
Of course, the package plot_ly has also out-of-the box the possiblity to produce amazing
gauges.
It is of course possible to use a similar layout on every page (main titles) that simply relies
on columns and rows. However, it is also possible to make more interesting structures such as
“tabsets” and “storyboards.” A tabset will layout the page in different sections that are hidden
one after the other and that can be revealed by clicking on the relevant tab.
The storyboard is a great tool to present conclusions of a particular bespoke analysis. Each
“tab” is now a little text-field that is big enough to contain an observation, finding, or conclusion.
Clicking on this information reveals then the data-visualization that leads to that conclusion or
observation.
For example, the page “Gender” has a structure with tabs. That is made clear to R by adding
{.tabset}
Gender
======================================
Row {.tabset}
-------------------------------------
### Composition
```{r}
p2 <- ggplot(data = d1, aes(x=gender, fill=gender)) +
geom_bar(stat="count", width=0.7) +
facet_grid(rows=d1$grade) +
ggtitle('workforce composition i.f.o. salary grade')
ggplotly(p2)
```
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 736
❦
### Tab2
```{r RCodeForTab2}
# etc.
```
❦ ❦
Figure 36.9: The page “Gender” of the dashboard provides multiple views of the data that are each
made visible by clicking on the relevant tab. The effect of ggplotly() is visible by the toolbar that
appears just right above the plot and the mouse-over effect on the boxplot for the salary grade 2/male
population.
Note that we use the function ggplotly(), from the package plotly instead of printing
the plot as we are used to in ggplot2. This function will print the plot but on top of that
add interactive functionality to show underlying data, zoom, download, etc. Some of that
interaction and the interface that it creates is visible in Figure 36.9.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 737
❦
1. Add runtime: shiny to the header of the file (in the YAML front matter).
2. Add the input controls. For example, add the {.sidebar} attribute to the first column of
and it will create a box for the Shiny input fields.
3. Add Shiny outputs on the dashboard and when including plots that need to react to the
input controls, make sure to to wrap them renderPlot().
# user interface ui
ui <- dashboardPage(
dashboardHeader(title = "ShinyDashBoard"),
dashboardSidebar(
title = "Choose ...",
numericInput('N', 'Number of data points:', N),
numericInput('my_seed', 'seed:', my_seed),
sliderInput("bins", "Number of bins:",
min = 1, max = 50, value = 30)
),
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 738
❦
dashboardBody(
❦ fluidRow( ❦
valueBoxOutput("box1"),
valueBoxOutput("box2")
),
plotOutput("p1", height = 250)
)
)
# server function
server <- function(input, output) {
d <- reactive({
set.seed(input$my_seed)
rnorm(input$N)
})
output$p1 <- renderPlot({
x <- d()
bins <- seq(min(x), max(x), length.out = input$bins + 1)
hist(x, breaks = bins, col = 'deepskyblue3', border = 'gray')
})
output$box1 <- renderValueBox({
valueBox(
value = formatC(mean(d()), digits = 2, format = "f"),
subtitle = "mean", icon = icon("globe"),
color = "light-blue")
})
output$box2 <- renderValueBox({
valueBox(
value = formatC(sd(d()), digits = 2, format = "f"),
subtitle = "standard deviation", icon = icon("table"),
color = "light-blue")
})
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 739
❦
Note how in the server function, the data is dynamically determined and stored in a vari-
able d . This is useful because the variable is used in more than one function later on.
However, to make clear that this is dynamic and will use methods of the input object,
it has to be wrapped in the function reactive(). This will create a method to find the reactive()
value of d rather than a dataset. Therefore, any future references can only be done in a
reactive function and by referring to it as d() .
Also the diversity-dashboard will look good with shinydashboard and – if one can work
with Shiny in general – will be fast and elegant. It can look for example as in Figure 36.11.11
❦ ❦
Figure 36.11: Another take on the diversity dashboard, rendered with the help of shinydashboard
and sporting dynamic content with Shiny.
Also here we can choose from an abundance of free html-widgets to spice up our content,
and make the presentation more interactive, useful, attractive, and insightful. See Chapter 36.2.1
“HTML-widgets” on page 719 for more information and suggestions.
11 The code for this dashboard will be shared completely on the website of this book: http://www.de-brouwer.
com/r-book.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:13pm Page 741
❦
PART VIII
♥
❦ ❦
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 743
❦
♣ 37 ♣
Parallel Computing
It may happen that running your code takes an impractical1 amount of time. If the calculations
do not depend on each other’s outcome then the calculations can be run in parallel.
In its standard mode, R is single threaded and all operations go through one processor. The
package parallel allows us to split the runtime of a loop over different cores available on our
system. Note that the library parallel is part of core-R and hence does not need to be installed,
so we can load it immediately. parallel
So, the computer that was used to compile this book has 12 cores at its disposal. Larger
computers have vast arrays of CPUs to which calculations can be dispatched and the library
parallel provides exactly the tools to use that multitude of cores to our advantage. The function
mclapply() that is an alternative to lapply() that can use multiple cores.
The following code illustrates how simple we can use multiple cores with the function mclapply()
mcapply(), and we will also time the gains with the function system.time(). lapply()
1 An impractical amount of time is somehow a personal concept. For the author it is fine if code runs less than
a few minutes or longer than an hour up to a few days. In the first case, one can wait for the result to appear and
in the second case we can do something else while the code is running. Anything in between is annoying enough
to do an effort to speed it up. Anything that runs longer than a good night rest is probably also a good candidate to
speed up.
The Big R-Book: From Data Science to Learning Machines and Big Data, First Edition. Philippe J.S. De Brouwer.
© 2021 John Wiley & Sons, Inc. Published 2021 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/De Brouwer/The Big R-Book
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 744
❦
# 2. With parallel::mclapply
system.time(results <- mclapply(starts, f, mc.cores = numCores))
## user system elapsed
## 1.124 0.536 0.287
The total elapsed time is about five times shorter when we use all cores.2 This comes at some
overhead cost (the System time is longer) and the gain is not exactly the number of cores. The
gain of using more cores is significant. Problems that can be programmed as independent blocks
on different cores should use this way of parallel programming.
Our system had 12 cores and the gain was roughly fivefold. That is not bad, and can make a
serious difference. Remember, however, that this book is compiled on a laptop that is designed
to read email and play an occasional game. Large computers that are equipped with an array of
CPUs are called “Supercomputers.” On such computers, the gain can be much, much more. Note
also that modern supercomputers not only have arrays of CPUs, but that each of those CPUs can
supercomputer be equipped with a GPU.3
Computers that are designed for massive parallel computing are still quite big, and usually
are the sole tenants of a large building. A list of supercomputers can be found at https://www.
❦ ❦
top500.org. The rankings are updated each six months and in June 2019 the fastest computer
is IBM’s “Summit.” Summit has 2,414,592 cores available for calculations. This is a lot more than
our laptop and the performance gain will be accordingly.
Indeed, it is that simple: we can just install R on a supercomputer and use its massive amount
of cores. Almost all supercomputers run Linux,4 and hence, can run R very much in the same way
as your computer can do that. The command line interface can be accessed via remote login (ssh)
or eventually we can install RStudio server on the supercomputer to provide a more user-friendly
experience for the end-user.
2 The Elapsed Time is the time charged to the CPU(s) for the expression, the User Time is the time that you as
user experience, and System Time is the time spent by the kernel on behalf of the process. While usually elapsed
time and system time are quite close. If the elapsed time is longer than the user time this means that the CPU
was spending tiime on other operations (maybe external to our process). Especially in the case of mcapplyl() the
elapsed time is a lot less than the user time. This means means that our computer has multiple cores and is using
them.
3 The CPU is the central processing unit, and the GPU is the graphical processing unit. These additions became
popular on the back of the popularity of computer games, but the modern GPU has thousands of processors on
boards and has its own RAM. This makes them an ideal platform for speeding up calculations. More about GPU
programming is in Chapter 37.3 “Using the GPU” on page 752.
4 At the time of the top-list at https://www.top500.org, 100% of the 500 fastest supercomputers run Linux.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 745
❦
If you rather work with for-loops then the library foreach will help you to extend the for-loop
to the foreach loop. This construct combines the power of the for-loop with the power of the
lapply() function. The difference with lapply is, that it evaluates an expression and not a func-
tion and hence returns a value (and avoids side effects). foreach()
Here is a short illustration on how it works:
# 1. The regular for loop:
for (n in 1:4) print(gamma(n))
## [1] 1
## [1] 1
## [1] 2
## [1] 6
Note – foreach
class(foreach (n = 1:4))
## [1] "foreach"
The operation is best understood by noticing that the structure of the foreach loop is based
on the definition of the operator %do%. This operator, %do%, acts on a foreach object (created by
the function foreach()). This operator controls how the expression that follows is evaluated.
While the operator %do% will sequentially run each iteration over the same core, %dopar% will
utilise the power of parallel processing by using more cores. The operator can be initialised with
the library doParallel. Note that this package most probably needs to be installed first via the
command
install.packages('doParallel') . doParallel
# Register doParallel:
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 746
❦
registerDoParallel(numCores)
Hint – Expressions
The expression – that is passed to foreach – can span multiple lines provided that it is
encapsulated by curly brackets. This is similar to how other functions such as if, for,
etc. work in R.
Note also that the function foreach allows to simplify the result (i.e. collapse the result to a
simpler data type). To achieve this, we need to provide the creator function to such simple data
c() type. For example, we can supply c(), to simplify the result to a vector.
In the following code we provide a few examples to that collapse the results to a more elemen-
❦ tary data type with the .combine argument. ❦
# Collapse to vector:
foreach (n = 1:4, .combine = c) %dopar% print(gamma(n))
## [1] 1 1 2 6
We can rewrite our example of calculating 50 times k-means with 100 starting points with the
foreach syntax as follows.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 747
❦
As can be seen, the time is similar to mclapply. The difference is in the formalism that can be
more intuitive for some people. This might be up to personal taste, but as many other open-source
solutions, R always offers choice.
❦ ❦
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 748
❦
In order use more cores it is not always necessary to buy time on a supercomputer. It is also
possible to use the workstations that are already connected to the network to dispatch calcula-
snow tions so that they can be executed in parallel. The library snow (an acronym for Simple Network
Of Workstations) provides such framework that allows computers to collaborate over a network
Master while using R. The model is that of Master/Slave communication in which one node that has the
slave Master role controls and steers the calculations on the slave nodes.
To achieve this, snow implements an application programming interface (API) to different
mechanisms that allow the Slave and Master to connect and exchange information and com-
API mands. The following protocols can be used:
• Socket
In bootstrapping and many other applications, it is important that the random numbers
that are used in all nodes are independent. This can be assured via the rlecuyer or the
rsprng library.
❦ ❦
First, we install the library snow and when that is done we can load it.
library(snow)
##
## Attaching package: ’snow’
## The following objects are masked from ’package:parallel’:
##
## clusterApply, clusterApplyLB, clusterCall,
## clusterEvalQ, clusterExport, clusterMap,
## clusterSplit, makeCluster, parApply, parCapply,
## parLapply, parRapply, parSapply, splitIndices,
## stopCluster
We had loaded parallel in previous section, and the output shows how many of the
functions overlap in both packages. This means two things. First, they should not be used
together, they do the same but in different ways. Second, the similarity between both will
shorten the total learning time.
The better approach is to unload parallel first:
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 749
❦
Now, the slave process needs to be started on each slave node and finally the cluster can be
started on the Master node. The socket method allows to connect a node via the ssh protocol. ssh
This protocol is always available – on *nix systems – and easy to set up. *nix
It allows also to create a simulated slave on the same machine by referring to “localhost.”5 localhost
This will create a cluster on the same machine.
In this particular case, it is possible to use the function’s defaults and makeCluster(2) will
have the same effect. However, remember that this functionality only makes sense when you use
other machines than “localhost.”
Now, that the cluster is defined, we can run operations over the cluster. We will use the same
example as we used in previous section.
In summary, one defines a cluster object, and this allows the Master node to dispatch part of parLapply()
parSapply()
parallel calculations to the slave nodes by using functions such as parLapply() as an alternative
to lapply() or parSapply() as an alternative to sapply(), etc. Those functions will then take
the cluster as the first argument and work for the rest similar to their base R counterparts and are
shown in Table 37.1 on page 750.
5 Creating a cluster on your own computer will of course not be faster than the solutions described earlier in this
chapter (e.g. the library parallel). The reason why we do this here, is that the whole book is using life connections
and calculations and therefore this is the most efficient way to show how it works.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 750
❦
Base R snow
lapply parLapply
sapply parSapply
vapply NA
apply (row-wise) parRapply or parApply(,1)
apply (column-wise) parCapply or parApply(,2)
Snow is the ideal solution to use existing hardware before buying an expensive super-
cluster or renting time on an existing supercomputer. It allows all computers to collabo-
rate and use computing power that sits idle.
Some applications will not suit the apply family of functions. The function clusterCall()
clusterCall() will call a given function with identical arguments on each slave in the cluster.
f <- function (x, y) x + y + rnorm(1)
clusterCall(cl, function (x, y) {x + y + rnorm(1)}, 0, pi)
## [[1]]
## [1] 3.046134
##
❦ ## [[2]] ❦
## [1] 4.187535
clusterCall(cl, f, 0, pi)
## [[1]]
## [1] 1.608131
##
## [[2]]
## [1] 5.024353
# However, note that in the last example the random numbers are exactly the same
# on both clusters.
The cluster version of the function evalq() is clusterCallEvalQ() and it is called as fol-
clusterCallEvalQ() lows.
# Note that ...
clusterEvalQ(cl, rnorm(1))
## [[1]]
## [1] -0.1861496
##
## [[2]]
## [1] 1.014375
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 751
❦
# ... but that the random numbers on both slaves nodes are the same
Note that the evalq function in base R is equivalent to eval(quote(expr), ...). This eval()
evalq()
function, eval(), evaluates its first argument in the current scope before passing it to the evalua-
tor, whereas evalq avoids evaluating the argument first and hence passes it on before evaluating.
When not all arguments of the functions should be the same, the function clusterApply()
allows to run a function with a list of provided arguments, so that each cluster will run the
same function but with its unique given argument. The function will therefore, take a cluster,
a sequence of arguments (in the form of a vector or a list), and a function, and calls the function
with the first element of the list on the first node, with the second element of the list on the second
node, etc. Obviously, the list of arguments must have at most as many elements as there are nodes
in the cluster. If it is shorter, then it will be recycled.
The package snow allows for automatic load-balancing with the function clusterApplyLB()
clusterApplyLB(). The function definition is:
It takes the same arguments and behaves the same as clusterApply(). It will dispatch a
balanced work load to slave nodes when the length of seq is greater than the number of
cluster nodes. Note that it does not work on a cluster that is defined with Type=Socket
as we did.
Note also that there is a function clusterSplit(cl, seq) that allows the sequence seq clusterSplit()
to be split over the different clusters so that each cluster gets an equal amount of data. There are
also specific functions to dispatch working with matrices over the cluster: parMM(cl, A, B) will
multiply the matrices A and B over the cluster cl.
Finally, after all our code if finished running, we terminate the cluster and clean up all con-
nections between master and slave nodes with the function stopCluster(): stopCluster()
stopCluster(cl)
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 752
❦
GPU In all previous sections of this book we used the CPU (the “central processing unit”) to handle all
our requests. Modern computers often have a dedicated a graphics card on board. That card usu-
ally has many cores that can work simultaneously to compute and render complex landscapes,
shadows and artefacts in a computer game. The processors on such graphics card (GPU) are usu-
ally not as fast as the CPU, but there are many of them on one card, and even have their own
memory. While the GPU is designed to improve experience for playing computer games, it is also
a superb source for parallel computing. Indeed, a high-end graphics card has multiple thousands
processors on board. Compare this to the CPU that nowadays has around ten cores.
Unfortunately, programming a GPU is a rather advanced topic and the interfaces depend on
C the brand and make of the card itself. Most graphic cards producers provide libraries that can be
C++ used. While the interfaces provided are typically geared towards C, C++, or Fortran, the R-user
Fortran can rely, on R-packages that simplify the interface to using functions in R.
Example: Nvidia
One of the major players on the graphics cards market is Nvidia. This company provides a
CUDA
programming platform for accessing the multiple processors on their cards: CUDA (Com-
pute Unified Device Architecture). The interface that accesses the processors is a C or
rpud C++ interface, but the library rpud – for example – takes much of the complexity away
for the R-user. This means that in R we can simply access the cores of the GPU.
❦ To give you an idea: today, a decent graphics card has thousandsa cores that run at a ❦
clock-speed that is typically one third to half of a good CPU. This means that when the
code benefits of parallelism, that speed gains of many hundreds of times are very realistic.
a For example, one of the later models of Nvidia is the Titan. It offers 3840 CUDA cores that have a
To take this concept a step further, it is also possible to use the GPUs in a cluster of computers.
Some supercomputers have GPU cards for each of the CPU cards that constitute the cluster. If the
PU array has 1 000 CPUs, and each CPU has 1 000 GPU cores on board, then the total number of
processing unit processing units (PUs) exceeds a million.
Using the GPU depends critically on the card in your computer. So, a first thing to do, is to
determine what graphics card you have on board. If you have invested in a good GPU, you will
know what model you have, otherwise, you can use the command below (not in the R-console,
but in the CLI of the OS).
*-display
description: VGA compatible controller
product: Crystal Well Integrated Graphics Controller [8086:D26]
vendor: Intel Corporation [8086]
physical id: 2
bus info: pci@0000:00:02.0
version: 08
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 753
❦
width: 64 bits
clock: 33MHz
capabilities: msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0
resources: irq:30 memory:f7800000-f7bfffff memory:e0000000-efffffff
ioport:f000(size=64) memory:c0000-dffff
This result means that there is actually no GPU other than the one that Intel has added to the
CPU. This means that for common rendering tasks the CPU will be able offload some tasks to
this card, however, we cannot expect much gain. No dedicated GPU means that on this computer
it is not an alternative to program the GPU.
If you have a dedicated GPU, the output will provide information about the GPU:
*-display
description: VGA compatible controller
product: NVIDIA Corporation [10DE:2191]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:154 memory:b3000000-b3ffffff memory:a0000000-afffffff memory:
b0000000-b1ffffff ioport:4000(size=128) memory:c0000-dffff
❦ If you have a dedicated GPU, then chances are that it is an AMD or an Nvidia device, these ❦
are the two market leaders. The AMD GPU will understand the open source OpenCL framework,
and a modern Nvidia GPU will use the proprietary CUDA framework (but will also understand
OpenCL)
Both systems have strong and weak points. Usually, Nvidia will give a better performance, the
CUDA platform is readily integrated in C, C++, and Fortran; and there are a handful of libraries
for R available that take care of the complexity of programming the GPU.6
Hint – Nvidia
If you have an Nvidia GPU that understands the CUDA framework, then you might
want to have a look at the packages gputools, cudaBayesreg, HiPLARM, HiPLARb, and
gmatrix that are specifically designed with the CUDA framework in mind.
6 CUDA also supports programming frameworks such as OpenACC and OpenCL, which are a little more user
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:24pm Page 754
❦
Be sure to read the documentation of the package of your choice. Most packages will not
use the memory of the GPU, and this increases transfer times. This means that copying
variables will not be so time and memory friendly as in base R.
Once installed, the packages are usually very user friendly. For example, the following code
is enough to use the GPU to calculate a matrix product.
# install.packages("gpuR")  # do this only once
library(gpuR) # load gpuR
## Number of platforms: 1
## - platform: NVIDIA Corporation: OpenCL 1.2 CUDA 10.1.0
## - context device index: 0
❦ ## - GeForce GTX 1660 Ti ❦
## checked all devices
## completed initialization
## gpuR 2.0.3
## Attaching package: 'gpuR'
## The following objects are masked from 'package:base':
## colnames, pmax, pmin, svd
# Prepare an example:
N <- 516
A <- matrix(rnorm(N^2), nrow = N, ncol = N)
gpuA <- gpuMatrix(A) # prepare the matrix to be used on GPU
7 See: http://viennacl.sourceforge.net.
##
## Slot ".platform":
## [1] "NVIDIA CUDA"
##
## Slot ".device_index":
## [1] 1
##
## Slot ".device":
## [1] "GeForce GTX 1660 Ti"
Unlike other sections in this book, do not expect output to be exactly the same when you
execute the code on your computer. This chapter tests the hardware and not numeric
logic, and hence the results will vary depending on the hardware that you’re using.
We refer to Chapter 40.1 "Benchmarking" on page 794 for a proper way to compare the speed difference. Here we will simply use system.time() to compare the performance.
# base R:
system.time(B <- A %*% A)
## user system elapsed
## 0.052 0.000 0.051
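The GPU counterpart is timed in exactly the same way. The line below is a minimal sketch (its output depends entirely on your hardware and is therefore omitted; the full experiment follows):

system.time(gpuB <- gpuA %*% gpuA)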
The time gain when using the GPU depends of course on the size of the matrix. This is illustrated with the following code. We run an experiment,8 where we test different matrix sizes and then plot the times. The results are in Figure 37.1 on page 756.
set.seed(1890)
NN <- seq(from = 500, to = 4500, by = 1000)
t <- data.frame(N = numeric(), CPU = numeric(), GPU = numeric())
i <- 1
for(k in NN) {
A <- matrix(rnorm(k^2), nrow = k, ncol = k)
gpuA <- gpuMatrix(A)
t[i,1] <- k
t[i,2] <- system.time(B <- A %*% A)[3]
t[i,3] <- system.time(gpuB <- gpuA %*% gpuA)[3]
i <- i + 1
}
# Print the results
t
## N CPU GPU
## 1 500 0.047 0.006
## 2 1500 1.597 0.057
## 3 2500 8.634 0.285
## 4 3500 22.995 0.686
## 5 4500 48.545 1.094
8 Please refer to the section Chapter 40.1 “Benchmarking” on page 794 for more details on how to run more
reliable tests. For now, it is sufficient to realize that the same code can run at different times. This depends for
example on other processes running. Therefore, our experiment can show minor deviations.
Figure 37.1: The runtimes for matrix multiplication compared on the CPU versus the GPU.
Hint – Cleaning up
After such experiment, we have not only large matrices in the memory, but also clut-
tered memory in the RAM of the GPU (at least this is the case with gpuR version 2.0.3).
Therefore, it is a good idea to clean up:
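A minimal sketch of such a cleanup (assuming the objects created in the experiment above):

rm(A, B, gpuA, gpuB)
gc()   # trigger garbage collection so that the memory is actually released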
The usability of a GPU programming library depends on the calculations that one can spread over the cores of the graphics card. The library gpuR also provides methods for basic arithmetic and calculus such as: %*%, +, -, *, /, t, crossprod, tcrossprod, colMeans, colSums, rowMeans, rowSums, sin, asin, sinh, cos, acos, cosh, tan, atan, tanh, exp, log, abs, max, min, cov, and eigen. There are many other operations available, such as the Euclidean distance, etc.
With all those functions, gpuR is one of the most complete, but also one of the most universal, GPU programming options available to the R programmer.
Most of the packages and interfaces focus on matrix algebra or at least will include this. That makes sense, since it is close to the primordial task of GPUs. The power of a GPU can be extended to many more functions such as deep learning, adiabatic quantum annealing, clusters, distances, optimisation problems, cross validations, and many more. Nvidia, for example, is now also offering frameworks for pre-fitted artificial intelligence solutions (such as for interpolating pixels). To learn more, we refer to their website: https://www.nvidia.com/en-us/gpu-cloud.
These simple experiments show that working with a GPU is not too difficult in R, but more
importantly that significant performance gains can be obtained in some cases.
require(gpuR)
require(tidyr)
require(ggplot2)
set.seed(1890)
NN <- seq(from = 500, to = 5500, by = 1000)
t <- data.frame(N = numeric(), `CPU/CPU` = numeric(),
`CPU/GPU` = numeric(), `GPU/GPU` = numeric())
i <- 1
Figure 37.2: The runtimes for matrix multiplication for the storage/calculation pairs CPU/CPU, CPU/GPU, and GPU/GPU. Note that the scale of the y-axis is logarithmic.
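# The body of the loop below is a sketch (an assumption): it stores the elapsed
# times for the three storage/calculation pairs, using gpuR's gpuMatrix (data in
# RAM, computation on the GPU) and vclMatrix (data in the GPU's own memory).
for (k in NN) {
  A    <- matrix(rnorm(k^2), nrow = k, ncol = k)
  gpuA <- gpuMatrix(A)
  vclA <- vclMatrix(A)
  t[i, 1] <- k
  t[i, 2] <- system.time(B    <- A    %*% A)[3]     # CPU/CPU
  t[i, 3] <- system.time(gpuB <- gpuA %*% gpuA)[3]  # CPU/GPU
  t[i, 4] <- system.time(vclB <- vclA %*% vclA)[3]  # GPU/GPU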
i <- i + 1
}
# unload gpuR:
detach('package:gpuR', unload = TRUE)
The performance gain in this simple example is massive. For a multiplication of two matrices of 5 500 by 5 500, the vclMatrix is about 50 times faster than the gpuMatrix. If we compare the same vclMatrix with the CPU alternative, then it is about 30 000 times faster. Taking into account the low effort that is needed to learn the package gpuR, we must conclude that it really pays off to use the GPU.
In this section, we have only presented the option to use an R package in order to program the GPU: knowledge of R is sufficient. The package gpuR has a lot of functions available, but if you need something that is not there, you might want to use the library OpenCL and create your own functions. A good introduction is here: https://cnugteren.github.io/tutorial. Note, however, that knowledge of C or C++ is required.
A good GPU offers thousands of PUs, but did you know that it is possible to have more than one GPU in your desktop computer? For example, two identical video cards can be bridged together. Nvidia calls their solution SLI, while the AMD version is called Crossfire.
If you are using an Nvidia GPU, then you should read https://devblogs.nvidia.com/accelerate-r-applications-cuda. This tutorial shows how to accelerate R with the CUDA libraries, how to call bespoke parallel algorithms written in CUDA C/C++ or CUDA Fortran from R, and how to profile GPU-accelerated R applications via the CUDA Profiler.
♣ 38 ♣
In the rest of this book, we implicitly assume that the data can be loaded in memory. However, imagine that we have a dataset of all the data that you have transmitted over your mobile phone, including messages and images, complete with the timestamps and locations that result from being connected. Imagine now that we have this dataset for 50 million customers. This dataset will be so large that it becomes impractical, or even impossible, to store on one computer. In that case, it is fair to speak of "big data."
Usually, the academic definition of big data implies that the data has to be big in terms of volume, velocity, and variety. Commercial institutions will add "value" as a fourth word that starts with the letter "v." While this definition of "big data" has its merits, we rather opt for a very practical approach. We consider our data to be "big" if it is no longer practically possible to store all data on one machine and/or use all processing units (PUs)1 of that one machine to do all calculations (such as calculating a mean or fitting a neural network).
memory limit
Note – Memory limits in R
R is in the first place limited by the amount of memory that the OS allocates to it. A 32-bit system has a limit of something like 3 GB; a 64-bit system can allocate a lot more. However – by default – it is not really possible to manage data for which the number of rows multiplied by the number of columns exceeds 2^31 − 1 = 2 147 483 647.
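This ceiling is simply the largest 32-bit integer that R uses for indexing by default, and you can inspect it yourself:

.Machine$integer.max
## [1] 2147483647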
When our data is too big to be handled elegantly in the RAM of our computer, we can take the following steps to get the analysis done.
1 With "all PUs", we mean "all the cores of the CPU(s) and all the cores on the GPU(s) connected to that CPU array". In this sentence, it is equivalent to saying that you have tried to make your computer as powerful as possible and pushed the limits of what you can do on one machine. The point is that if the data is too large, no array of processors will be sufficient; that is what we intend to call "big data."
1. Can the data be reduced at read-in? Maybe you do not need to copy the whole table, but only some columns. To obtain this, do as much data wrangling as possible on the RDBMS (or NoSQL server). If this does not solve the problem, go to the next point — see Part III "Data Import" on page 213.
2. Is it possible to take a sub-set of the data, do the modelling and then check if the model can
be generalized to the whole database? Take a sample of the data that can be handled, make
the model and then calculate the model performance on the whole table. If the results are
not acceptable, move to the next step — see e.g. Chapter 21.3 “Performance of Regression
Models” on page 384 and Chapter 25 “Model Validation” on page 475.
3. Would the data fit on a more powerful computer? This can be the computer of a friend, a server at the workplace, or a supercomputer. Chapter 38.1 "Use a Powerful Server" on page 763 discusses this solution.
4. Force R to work a little differently: do not try to load all data in memory, but rather work with a moving window on the data and leave long tables on the hard drive. This will – usually – slow down the calculations, but it might just make it possible to get the calculations done. This can be done with packages like ff, biglm, RODBC, snow, etc. We will discuss this in Chapter 38.2 "Using more Memory than we have RAM" on page 765. This is typically a solution for a few gigabytes of data. If your data is rather measured in petabytes, then move on to the next step.
5. Use parallelism in data storage and calculations, and distribute data and calculations over different nodes with an appropriate ecosystem to support this massively parallel processing and a distributed, resilient file system. This will be explained in Chapter 39 "Parallelism for Big Data" on page 767.
Hint – RStudio
If you do not like the command line too much, then you might want to check "RStudio Server." It provides a browser-based interface to a version of R running on a remote Linux server. So, it will look as if you use the RStudio IDE locally, but actually you are using the power and storage capacity of the server.
2. Using the library dplyr to generate the SQL code for you and execute the query only when
needed. This is the subject of Chapter 17 “Data Wrangling in the tidyverse” on page 265.
Since the “how to” part is explained earlier in this book, we will not elaborate on the subject
further in this chapter.
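As a reminder of the lazy-evaluation idea, here is a minimal sketch (assuming an in-memory SQLite database via DBI, RSQLite, and dbplyr; any DBI back-end works in the same way):

library(dplyr)
library(dbplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)                      # stand-in for a real database table

q <- tbl(con, "mtcars") %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(q)   # so far dplyr has only generated SQL ...
collect(q)      # ... the query is executed only now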
Some databases have their own drivers for R, others do not. It might be a good idea to have a look at the package RODBC. It is a package that allows one to connect to many databases via the ODBC protocol.a
a ODBC stands for Open Database Connectivity and is the standard API for many DBMS.
Did you know that dplyr is made available by RStudio and that they have further great documentation on the subject? More information is on their website: https://www.rstudio.com/resources/webinars/working-with-big-data-in-r.
If the problem is that the data is too large to keep in memory, but calculation times are (or would be) still acceptable, then we can choose one of the libraries that change the default behaviour of R so that the data is not – by default – all loaded into RAM.2
The library ff provides data structures that are stored on disk rather than in RAM. When the data is needed, it is loaded chunk after chunk.
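A minimal sketch of the idea (the sizes chosen here are arbitrary):

library(ff)
# A vector of ten million doubles backed by a file on disk instead of RAM:
x_ff <- ff(vmode = "double", length = 1e7)
x_ff[1:5] <- rnorm(5)     # chunks are pulled into RAM only when accessed
mean(x_ff[1:1000])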
There are also a few other packages that do the same, while targeting specific uses. For the purpose of this book, we will not discuss those packages further, but look into the next step: what if your hard disk is no longer sufficient to store the data in the first place?
2 Competing software such as the commercial SAS engine has by default the opposite behaviour. Hence, SAS is
out of the box capable of handling really large data that would not fit in the RAM.
♣ 39 ♣
Since the 1990s, providers such as Teradata have specialized in solutions that store data and then allow one to operate in parallel on massive amounts of data. CERN,1 on the other hand, is a textbook example of an institution that handles big datasets, but they rather rely on supercomputers (high performance computers) instead of resilient distributed computing. In general – and for most applications – there are two solutions: one is resilient distributed computing, where data and processing units (PUs) are commodity hardware and redundant, and the second is high performance computing with racks of high-grade CPUs.2
One particularly successful solution is breaking up the data in parts and storing each part (with some redundancy built in) on a computer with its own CPU and data storage. This allows us to bring the calculations to the multiple parts of the data instead of bringing all the data to a single CPU.
In 2004, Google published the seminal paper in which they described the process "MapReduce," a parallel processing model to process huge amounts of data on parallel nodes. MapReduce splits queries and distributes them over the parallel nodes, where they are processed simultaneously (in parallel). This phase is called the "Map step." Once a node has obtained its results, it will pass them on to the layer above. This is the "Reduce step," where the results are stitched together for the user.
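To make the idea concrete, here is a toy illustration in plain R (no cluster involved; the split into four chunks stands in for four nodes):

# Toy MapReduce: compute a global mean over data that is split into chunks.
chunks <- split(rnorm(1e5), rep(1:4, length.out = 1e5))

# Map step: every "node" reduces its own chunk to a small partial result.
mapped <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))

# Reduce step: stitch the partial results together for the user.
reduced <- Reduce(`+`, mapped)
reduced["sum"] / reduced["n"]   # the overall mean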
This approach has many advantages:
• it is able to handle massive amounts of data at incredible speeds,
• therefore, it is also possible to use cheaper hardware in the nodes (if one node is busy or not online, then the task is dispatched to an alternative node that also holds the same part of the data), and it thereby also becomes a cost-efficient solution.
The downside is that parallelism comes with an overhead cost of complexity. So, as long as the other methods work, big data solutions based on massively redundant parallel systems might not be the best choice. However, when data reaches sizes that are best expressed in petabytes, then this becomes the only viable solution that is fault tolerant, resilient, fast enough, and affordable. At that point, you will need to bring specialist knowledge into the company to build and maintain the big data solution.
1 CERN is the European Organization for Nuclear Research; its website is https://home.cern.
It is worth noticing that R offers an intermediate solution with snow.3 However, at some point that approach will not be sufficient to provide the performance boost and storage capacity needed to finish the calculations in a practical time. Then it is time to think about massive resilient datasets – and why not try the free and open source ecosystem of "Hadoop"?
3 For a quick introduction to snow and references see Chapter 37.2 on page 748.
The world of parallelism and big data quickly becomes very complex. Fortunately, there is the Apache Software Foundation. In some way, Apache is for the big-data community what the RStudio community is for R-users. The structure is different, but both provide a lot of tools and software that can be used for free, and in some way both entities produce tools that are so useful and efficient that they become unavoidable in their ecosystem.
The Apache Software Foundation is probably best known and loved for their HTTP web server "Apache." Today, the Apache Software Foundation is – with more than 350 active software projects – arguably the largest open software foundation. More information about the Apache Software Foundation is here: https://www.apache.org.
Implementing MapReduce in a high-volume, high-speed environment requires re-inventing how a computer works: all components (that in a traditional operating system (OS) – such as Windows or Linux – are all on one computer) should now be distributed over many systems. Therefore, parallelism quickly becomes very complex, and the need for a standard imposes itself. Under the umbrella "Hadoop," the Apache Software Foundation has a host of software solutions that allow one to build and operate a solution for manipulating and storing massive amounts of data. The core of Hadoop is known as "Hadoop Common" and consists of:
• HDFS: the Hadoop Distributed File System is the lowest level of the ecosystem: it manages the physical storage of the distributed files in a redundant way.
The family of software that populates the Hadoop ecosystem is much larger than Hadoop Common. Here are some of the solutions that are usually found together.
• Apache Storm: the computation framework for distributed stream processing.
• Oozie: the workflow scheduling system that manages different Hadoop jobs for the client.
• Query and data intelligence interfaces that are designed to allow for massively parallel processing:
– Apache Pig: allows one to create programs that run on Hadoop via the high-level language "Pig Latin."
In some way, this ecosystem is the equivalent of a computer that allows one to store data, query it, and run calculations on it. However, the data is not stored on one disk but distributed over a massive cluster of commodity hardware; the data can come in at high speeds; and the amounts of data gathered are massive. Hadoop does all of that and is surprisingly fast, scalable, and reliable.
While R allows us to interface with different components of the Hadoop ecosystem, we will focus on Apache Spark in the next section because it offers – for R-users – probably the ideal combination of similar concepts and the right level of API.
Apache Spark also runs a lot faster than Hadoop MapReduce.4 Spark can be run standalone or on a variety of platforms including Hadoop, Cassandra, and HBase. It has APIs for R, but also for other languages such as Python, SQL, and Scala.
In the remainder of this section we will illustrate how one can harvest the power of Spark via R. To test this, one will first need to install Spark. Of course, you will argue that it is of limited use to install Spark just on your own laptop computer. You're right, but we will do this here so that we have a self-contained solution that can be tested by everyone; as a bonus, when you need to connect to a real Hadoop cluster, you will find that this works in a very similar way.
Spark needs a Java runtime, so first check whether Java is installed:
java -version
If this produces some output, then Java is installed. If you do not have Java installed yet, install it; for example, on a Debian system, as follows:
sudo apt install default-jdk
It appears that the installation of Spark is still changing, so it might make sense to refer
to the website of Spark before going ahead.
For now, Spark is not part of the packaging systems of most distributions, so we need to download it first. It is best to refer to https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz and choose a mirror that will work best for you. For example, we can use wget to download the files:
4 The Apache website reports a speed-up of about 100 times.
mkdir ~/tmp
cd ~/tmp
wget https://www-eu.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
This will result in the file spark-2.4.3-bin-hadoop2.7.tgz in the directory ~/tmp. This is a compressed file, also known as a "tarball." It can be decompressed as follows:
tar -xvf spark-2.4.3-bin-hadoop2.7.tgz
If you prefer not to use the command line interface (CLI), then you might want to refer to the website of Spark and download it from there: https://spark.apache.org/downloads.html is your place to start then. Once the file is on your computer, you can access it via a file manager (such as Dolphin or Thunar), double-click on it, and you will be presented with a window that is a graphical user interface to decompress the file.
Now, you should have a subdirectory with the code that will be able to run Spark – in our case the directory is called spark-2.4.3-bin-hadoop2.7, and it is a sub-directory of the place where we have extracted the tarball (our ~/tmp). This directory needs to be moved to the /opt directory. Note that in order to do that you will need administrator rights. Chances are that you do not have the directory /opt on your system yet, and in this case you might want to opt to create it or put the binaries elsewhere (maybe in /bin)5.
# Create the /opt directory:
sudo mkdir /opt
# Move the extracted directory there (assuming the paths from the steps above):
sudo mv ~/tmp/spark-2.4.3-bin-hadoop2.7 /opt/spark
Now, you should have the binaries of Spark in the /opt/spark directory. Test this as follows.
$ ls /opt/spark
bin data jars LICENSE NOTICE R
RELEASE yarn conf examples kubernetes licenses
python README.md sbin
Take some time to go through the README.md file. As the extension suggests, this is a MarkDown file. While it is readable as a text file, it is better viewed via a markdown reader such as Grip, ReText, etc. For example:
retext /opt/spark/README.md
will open ReText6, but it still will not look good. Activate the live-preview by pressing the keys CTRL + L.
Spark will still not work: some manual configuration is still necessary. We need to add the
directory of Spark to the system path (the directories where the system will look for executable
5 Keeping Spark in a separate directory has the advantage that it will be easy to identify, remove, and update it if
necessary. The documentation on the Apache website recommends to use a directory called /opt.
6 If necessary, install it first. On a Debian OS, this is done by the usual sudo apt install retext in the CLI.
files). This can be done by editing the ~/.bashrc file. Any text editor will do; we like vi, but you can use any other text editor.
vi ~/.bashrc
Working in vi requires some practice (it pre-dates GUI interfaces), so we describe here how to get things done. At the very end of the file we need to add the two following lines. To do this: copy the text below, then press the following keys: [CTRL]+[END], A, [ENTER], [CTRL]+[V], [ESC], :wq and finally [ENTER].
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and quit (in vi this would be ESC, :wq, ENTER). Activate the changes by opening a new
terminal or executing
source ~/.bashrc
After starting the Spark master with the command start-master.sh, the system will reply with the location of the log file and return to the command prompt. It might seem that not too much happened, but Spark is now really active on the system and ready to reply to instructions. We can use the CLI to interrogate the system status, or use a web browser and point it to http://localhost:8080. The browser interface will look similar to the one in Figure 39.1 on page 774.
7 A daemon is software that runs on a computer but is rather invisible to the normal user. Daemons provide services in the background.
Figure 39.1: The status of Spark can be controlled via a regular web-browser, that is directed to
http://localhost:8080.
There is – of course – no need to use a web browser or even have a window manager active: the status of Spark can also be checked via the CLI. This can be done via the command ss.
$ ss -tunelp | grep 8080
tcp LISTEN 0 1 :::8080 :::*
users:(("java",pid=26295,fd=351)) uid:1000 ino:5982022 sk:1d v6only:0
<->
Now, we are sure that the Spark-master is up and running and can be used on our system. So,
we can start a Spark-slave via the shell command start-slave.sh or via the Spark-shell. This
shell can be invoked via the command spark-shell and will look as follows.
$ spark-shell
19/07/14 21:16:43 WARN Utils: Your hostname, philippe-W740SU resolves to a
loopback address: 127.0.1.1; using 192.168.100.120 instead (on interface
enp0s25)
19/07/14 21:16:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another
address
19/07/14 21:16:44 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://192.168.100.120:4040
Spark context available as 'sc' (master = local[*], app id = local
-1563131811133).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
scala>
Read this welcome message, because it is packed with useful information. For example, it refers you to a web user interface (in our case at http://192.168.100.120:4040). Note that when you work on the computer that has the Spark master installed and running, this is always equivalent to http://localhost:4040. The screen of this page is shown in Figure 39.2.
Figure 39.2: Each user can check information about his own Spark connection via a web-browser
directed to http://localhost:4040. Here we show the “jobs” page after running some short jobs
in the Spark environment.
Digression – Scala
Note that the Spark command prompt is scala>. Scala is a high-level, general purpose language. It saw first light in 2004 as an alternative to Java, and it addressed some of the criticism of Java. Scala code can be compiled to Java bytecode and hence can run in the Java Virtual Machine, so the same code can be executed on most modern operating systems.
In order to test this shell environment, we can execute the “Hello world” program as follows.
scala> println("Hello World!")
Hello World!
scala>
To stop the Spark slave and master respectively, use the following commands:
stop-slave.sh
stop-master.sh
Finally, note that we can also start Spark from the R-prompt. The following command will
start the daemon for Spark and then return the R-command prompt to normal status ready for
further input.
system('start-master.sh')
39.2.3 SparkR
As usual in R, there is more than one way to get something done. To address Apache Spark, there are two main options: SparkR from the Apache Software Foundation and sparklyr from RStudio. For a comparison between those two solutions, we refer to Chapter 39.2.5 "SparkR or sparklyr" on page 791. In this first section, we will describe the basics of SparkR.
SparkR provides a new data type: the DataFrame. Like the data.frame in base R, it holds rectangular data, but it is designed so that the data frame resides on a resilient and redundant cluster of computers rather than being read into memory at once.
The Spark data frame is, however, also designed to be user friendly, and most functions of dplyr have an equivalent in SparkR. This means that what we have learned in Chapter 17 "Data Wrangling in the tidyverse" on page 265 can be used with minor modifications, and all the additional complexity will be handled in the background.
To start, load all libraries that we will use, connect to the Spark master, and define our first SparkDataFrame with the function as.DataFrame().

library(tidyverse)
library(dplyr)
library(SparkR)
# Note that loading SparkR will generate many warning messages,
# because it overrides many functions such as summary, first,
# last, corr, ceil, rbind, expr, cov, sd and many more.

# Connect to the Spark master (the same call is repeated later in this section):
sc <- sparkR.session(master = "local", appName = 'first test',
                     sparkConfig = list(spark.driver.memory = '2g'))
DF <- as.DataFrame(mtcars)
# The DataFrame is for big data, so the attempt to print all data
# might surprise us a little:
DF
## SparkDataFrame[mpg:double, cyl:double, disp:double, hp:double, drat:double, wt:double,
## qsec:double, vs:double, am:double, gear:double, carb:double]
# R assumes that the data-frame is big data and does not even
# start printing all data.
The central data structure in Spark is called the RDD ("Resilient Distributed Dataset"). The resilience is obtained by the fact that the dataset can at any point be restored to a previous state via the "RDD lineage information": additional information that the RDD stores about the origin of each dataset and each number.
This lineage of an RDD can grow fast and can become problematic if we use the same data multiple times (e.g. in a loop). In that case, it is useful to add "checkpoints." The function checkpoint() (from SparkR) will clear all the lineage information and just keep the data.
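A minimal sketch of its use (the checkpoint directory is a hypothetical path):

setCheckpointDir("/tmp/spark-checkpoints")  # where checkpointed data is materialised
DF <- checkpoint(DF)                        # truncates the lineage, keeps the data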
To develop an example, we will use the well-known titanic dataset. This data was introduced in Chapter 22.2 "Performance of Binary Classification Models" on page 390. In the following code, we create the DataFrame and show some of its properties.

library(titanic)
library(tidyverse)
T <- as.DataFrame(titanic_train)  # create the SparkDataFrame
str(T)
## 'SparkDataFrame': 12 variables:
## $ PassengerId: int 1 2 3 4 5 6
## $ Survived : int 0 1 1 1 0 0
## $ Pclass : int 3 1 3 1 3 3
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley
## (Florence Briggs Thayer)" "Heikkinen, Miss. La
## $ Sex : chr "male" "female" "female" "female" "male" "male"
## $ Age : num 22 38 26 35 35 NA
## $ SibSp : int 1 1 0 1 0 0
## $ Parch : int 0 0 0 0 0 0
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803"
## "373450" "330877"
## $ Fare : num 7.25 71.2833 7.925 53.1 8.05 8.4583
## $ Cabin : chr "" "C85" "" "C123" "" ""
## $ Embarked : chr "S" "C" "S" "S" "S" "Q"
summary(T)
## SparkDataFrame[summary:string, PassengerId:string, Survived:string,
## Pclass:string, Name:string, Sex:string, Age:string, SibSp:string, Parch:string,
## Ticket:string, Fare:string, Cabin:string, Embarked:string]
class(T)
## [1] "SparkDataFrame"
## attr(,"package")
## [1] "SparkR"
The data manipulation capacities of SparkR are modelled on dplyr, and SparkR overrides most of dplyr's functions so that they will work on a SparkDataFrame. This means that, for the R-user, all functions from dplyr work as expected.8
For example, selecting a column can be done in a very similar way as in base R. The following lines of code all do the same: select one column of the Titanic data and then show the first values.
X <- T %>% SparkR::select(T$Age) %>% head
Y <- T %>% SparkR::select(column('Age')) %>% head
Z <- T %>% SparkR::select(expr('Age')) %>% head
cbind(X, Y, Z)
## Age Age Age
## 1 22 22 22
## 2 38 38 38
## 3 26 26 26
## 4 35 35 35
## 5 35 35 35
## 6 NA NA NA
The package dplyr offers SQL-like functions to manipulate and select data, and this functionality also works on a DataFrame using SparkR. In many cases, this will allow us to reduce the big data problem to a small data problem that can be used for analysis.
For example, we can select all young males that survived the Titanic disaster as follows:
T %>%
SparkR::filter("Age < 20 AND Sex == 'male' AND Survived == 1") %>%
SparkR::select(expr('PassengerId'), expr('Pclass'), expr('Age'),
expr('Survived'), expr('Embarked')) %>%
head
## PassengerId Pclass Age Survived Embarked
## 1 79 2 0.83 1 S
## 2 126 3 12.00 1 C
## 3 166 3 9.00 1 S
## 4 184 2 1.00 1 S
## 5 194 2 3.00 1 S
## 6 205 3 18.00 1 S
8 The package dplyr is introduced in Chapter 17 “Data Wrangling in the tidyverse” on page 265.
SparkR has its functions modelled on dplyr, and most functions will work as expected. For example, grouping, summarizing, changing, and arranging data works as in dplyr.
# Extract the survival percentage per class for each gender:
TMP <- T %>%
SparkR::group_by(expr('Pclass'), expr('Sex')) %>%
summarize(countS = sum(expr('Survived')), count = n(expr('PassengerId')))
N <- nrow(T)
TMP %>%
mutate(PctAll = expr('count') / N * 100) %>%
mutate(PctGroup = expr('countS') / expr('count') * 100) %>%
arrange('Pclass', 'Sex') %>%
SparkR::collect()
## Pclass Sex countS count PctAll PctGroup
## 1 1 female 91 94 10.549944 96.80851
## 2 1 male 45 122 13.692480 36.88525
## 3 2 female 70 76 8.529742 92.10526
## 4 2 male 17 108 12.121212 15.74074
## 5 3 female 72 144 16.161616 50.00000
## 6 3 male 47 347 38.945006 13.54467
Note – Statements that are not repeated further
In the later code fragments in this section we assume that the following statements are
run and we will not repeat them.
library(tidyverse)
library(SparkR)
library(titanic)
sc <- sparkR.session(master = "local", appName = 'first test',
sparkConfig = list(spark.driver.memory = '2g'))
T <- as.DataFrame(titanic_train)
dapply
# The data:
T <- as.DataFrame(titanic_train)
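# A sketch of a dapply() call that produces the output below (the exact code
# and the age threshold of 30 years are assumptions; dapply() needs a schema):
schema <- structType(structField("age", "double"),
                     structField("ageGroup", "string"))
T2 <- dapply(T, function(x) {
        data.frame(age = x$Age,
                   ageGroup = ifelse(x$Age < 30, "youth", "mature"),
                   stringsAsFactors = FALSE)
      }, schema)
head(collect(T2))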
## age ageGroup
## 1 22 youth
## 2 38 mature
## 3 26 youth
## 4 35 mature
## 5 35 mature
## 6 NA <NA>
The schema can be extracted via the function schema(). Below is a trivial example.
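For instance, applied to the Titanic SparkDataFrame defined above (output omitted here):

schema(T)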
The function dapply has a sister function dapplyCollect that does basically the same plus "collecting" the data back to a data.frame on the client computer (whereas dapply will leave the results as a resilient distributed dataset (SparkDataFrame) on the Hadoop cluster).
dapplyCollect
The "collect" version, dapplyCollect(), will apply the provided user-defined function to each partition of the SparkDataFrame and collect the results back into a familiar R data.frame. Note also that the schema does not have to be provided to the function dapplyCollect(), because it returns a data.frame and not a DataFrame.
# The data:
T <- as.DataFrame(titanic_train)
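# A sketch of the dapplyCollect() variant: no schema is needed and the result
# is a plain R data.frame on the client (the helper function is an assumption):
df2 <- dapplyCollect(T, function(x) {
         data.frame(age = x$Age,
                    ageGroup = ifelse(x$Age < 30, "youth", "mature"),
                    stringsAsFactors = FALSE)
       })
head(df2)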
In many cases, we need something different than just working with the partitions of the data as they are on the distributed file system. Maybe – referring to the previous example – we want to know whether the age had a different dynamic in the different classes of passengers on the Titanic. For example, we want to understand whether richer people (in class one) were on average also older than those in class three. This will not work with dapply: we need to be able to specify the grouping parameter ourselves, so that we can calculate the average age per group (class).
The functions gapply() and gapplyCollect() will connect to Spark and instruct it to run a user-defined function on a given dataset. They both work similarly, with the major difference that gapplyCollect() will "collect" the data (i.e. coerce the result into a non-distributed data.frame).
The function that is applied to the Spark data frame needs to comply with the following rules. It should have only two parameters: a grouping key and a standard data.frame as we are used to in R. The function must also return an R data.frame.
gapply() takes as arguments the SparkDataFrame, the grouping key (as a string), the user-defined function, and the schema of the output. The schema specifies the format of each row of the resulting SparkDataFrame.
# Define the function to be used:
f <- function (key, x) {
data.frame(key, min(x$Age, na.rm = TRUE),
mean(x$Age, na.rm = TRUE),
max(x$Age, na.rm = TRUE))
}
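The call to gapply() itself is not reproduced above; a minimal sketch (the schema field names and types are assumptions) could look as follows:

schema <- structType(structField("Pclass",  "integer"),
                     structField("minAge",  "double"),
                     structField("meanAge", "double"),
                     structField("maxAge",  "double"))
R1 <- gapply(T, "Pclass", f, schema)
head(collect(R1))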
gapplyCollect
The function gapply has a variant that "collects" the result into an R data.frame. The main difference in usage is that the schema does not need to be supplied.
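A sketch of this collecting variant, reusing the function f() defined above:

gapplyCollect(T, "Pclass", f)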
spark.lapply
The equivalent of lapply from R is in SparkR the function spark.lapply(). This function allows one to run a function over a list of elements and execute the work over the cluster and its distributed dataset.9 Generally, it is necessary that the results of the calculations fit on one machine; however, workarounds via lists of data.frames are possible. A trivial illustration of the mechanics is shown below.
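# A trivial sketch: square a list of numbers on the worker nodes.
spark.lapply(1:10, function(z) z^2)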
The power of spark.lapply resides of course not in calculating a list of squares, but rather in executing more complex model fitting over large distributed datasets. Below we show how it can be used to fit models.
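The fitting code itself is not reproduced above; a sketch of the idea (the formulas, the use of the titanic_train data on the workers, and returning only the coefficients are assumptions) could be:

# Fit one logistic regression per formula, in parallel on the workers:
formulas <- list(Survived ~ Pclass,
                 Survived ~ Pclass + Sex,
                 Survived ~ Pclass + Sex + SibSp)
models <- spark.lapply(formulas, function(f) {
            m <- glm(f, data = titanic::titanic_train, family = binomial)
            coef(m)   # return something small that fits on one machine
          })
models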
# df is a data.frame (R-data.frame)
# DF is a DataFrame (distributed Spark data-frame)
# From R to Spark:
DF <- createDataFrame(df)
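The opposite direction works with collect(); a sketch (it pulls the distributed data into the memory of the client, so only do this when the result is small enough):

# From Spark back to R:
df <- collect(DF)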
Data can reside in SQL or NoSQL databases, but sometimes it is necessary to read in plain text files such as CSV files. This can be done as follows.
loadDF(fileName,
source = "csv",
header = "true",
sep = ",")
Apart from the functions mutate() and select(), there are many more functions from dplyr re-engineered to work on a SparkDataFrame. For example, we can change and add columns via the function withColumn().
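# A sketch of such a withColumn() call (the new column name and the division
# by 10 are assumptions, based on the output fragment below):
T2 <- withColumn(T, "AgeDecades", T$Age / lit(10))
head(SparkR::select(T2, expr('PassengerId'), expr('Age'), expr('AgeDecades')))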
## 3 3 26 2.6
## 4 4 35 3.5
## 5 5 35 3.5
## 6 6 NA NA
Note that in the code fragment above, the function lit() returns the literal value of 10.
Aggregation
In addition to the group_by() function used before, SparkR also provides the aggregation function agg() as well as the OLAP cube operator cube().
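A minimal sketch of both (the choice of aggregating the average age per class, and cubing over class and gender, is an assumption):

# Aggregate with agg() on grouped data:
collect(agg(group_by(T, T$Pclass), avg(T$Age)))

# The OLAP cube operator also produces the sub-totals and the grand total:
collect(agg(cube(T, "Pclass", "Sex"), avg(T$Age)))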
1. Classification
2. Regression
3. Tree
4. Clustering
• spark.kmeans: K-Means
• spark.bisectingKmeans: Bisecting k-means
• spark.gaussianMixture: Gaussian Mixture Model (GMM)
• spark.lda: Latent Dirichlet Allocation (LDA)
5. Collaborative Filtering
6. Frequent Pattern Mining
• spark.fpGrowth: FP-growth
7. Statistics
Behind the scenes, SparkR will dispatch the fitting of the model to MLlib. That does not mean that we need to change old R-habits too much. SparkR supports a subset of the available R formula operators for model fitting (for example, we can use the familiar operators ~, ., :, + and -).
SparkR makes it possible to store the results of models and retrieve them later via the functions write.ml() and read.ml().
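A sketch of that workflow (the model specification is an assumption, loosely based on the summary output shown below; the path is hypothetical):

M <- spark.glm(T, Survived ~ Pclass + Sex, family = "binomial")
write.ml(M, "/tmp/models/titanic_glm")
M2 <- read.ml("/tmp/models/titanic_glm")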
# Do something with M2
summary(M2)
##
## Saved-loaded model does not support output 'Deviance Residuals'.
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.30997 0.33860 9.7755 0.0000e+00
## Pclass -0.98022 0.11989 -8.1758 4.4409e-16
## Sex_male -2.62236 0.20723 -12.6542 0.0000e+00
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 932.42 on 704 degrees of freedom
## Residual deviance: 654.45 on 702 degrees of freedom
## AIC: 660.5
##
## Number of Fisher Scoring iterations: 5
Spark and SparkR are cutting-edge technologies and are still changing. While the library of functions is already impressive, good documentation is a little sparse. Maybe you find Ott Toomet's introduction useful? It is here: https://otoomet.github.io/sparkr_notes.html. Also, do not forget that the Apache Foundation offers a website with the official documentation, which is well written and easy to read. See https://spark.apache.org/docs/latest/index.html.
39.2.4 sparklyr
Another way to connect R to Spark is the package sparklyr. It is provided by RStudio and hence – as you can expect – it makes sense to use the IDE RStudio; it encourages the tidyverse philosophy and has a gentle learning curve if you already know the other RStudio tools such as dplyr.
sparklyr provides a complete dplyr back-end. So, via this library, it is a breeze to filter and use an RDD in Spark and then bring the results into R for what R does best: analysis, modelling, and visualization. Of course, it is possible to use the MLlib machine learning library that comes with Spark, all from your comfortable environment in R. The advanced user will also appreciate the possibility to create extensions that use the Spark API and provide interfaces to Spark packages.
Note that in this section, we merely demonstrate how easy it is to use Spark from the R
with sparklyr and we do not try to provide a complete overview of the package.
Before we start, we will make sure that we work with a clean environment. First, we could stop the Spark master, but that is not really necessary.10 It makes sense, however, to unload SparkR before loading sparklyr.11 We will also detach the tidyverse libraries and then reconnect them in the desired order.
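A sketch of that cleanup (assuming SparkR is currently attached):

# Unload SparkR so that its masked functions do not interfere with sparklyr:
detach('package:SparkR', unload = TRUE)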
install.packages('sparklyr')
After this, the package is downloaded, compiled and kept on our hard disk, ready to be loaded
in each session where we need to use it and connect to Spark.
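The connection code is not repeated in what follows; a sketch (the object Titanic_tbl and the table name titanic_train are taken from the code below, the local master is an assumption):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
Titanic_tbl <- copy_to(sc, titanic::titanic_train, name = "titanic_train")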
# system('start-master.sh')
The functionality that we are familiar with from dplyr can also be applied on the Spark table
via sparklyr. The following code demonstrates this by summarizing some aspects of the data.
# Alternatively:
Titanic_tbl %>% spark_dataframe() %>% invoke("count")
## [1] 891
Titanic_tbl %>%
dplyr::group_by(Sex, Embarked) %>%
summarise(count = n(), AgeMean = mean(Age)) %>%
collect
## # A tibble: 7 x 4
## Sex Embarked count AgeMean
## <chr> <chr> <dbl> <dbl>
## 1 male C 95 33.0
## 2 male S 441 30.3
## 3 female C 73 28.3
## 4 female "" 2 50
## 5 female S 203 27.8
## 6 male Q 41 30.9
## 7 female Q 36 24.3
It is also possible to use the power of SQL directly on the Spark data via the library DBI.
library(DBI)
sSQL <- "SELECT Name, Age, Sex, Embarked FROM titanic_train
WHERE Embarked = 'Q' LIMIT 10"
dbGetQuery(sc, sSQL)
## Name Age Sex Embarked
## 1 Moran, Mr. James NaN male Q
## 2 Rice, Master. Eugene 2 male Q
## 3 McGowan, Miss. Anna "Annie" 15 female Q
## 4 O'Dwyer, Miss. Ellen "Nellie" NaN female Q
## 5 Glynn, Miss. Mary Agatha NaN female Q
Note that dbGetQuery() does not like a semicolon at the end of the SQL statement.
spark_apply
Of course, sparklyr also has a workhorse function to apply user-defined functions over the distributed, resilient dataset: the function spark_apply(). This allows us to write our own functions – that might not be available in Apache Spark – and run them fast on the distributed data.
Typically, one uses spark_apply() to apply an R function to a Spark DataFrame (or, more generally, a "Spark object"). Spark objects are by default partitioned so they can be distributed over a cluster of worker nodes. To some extent spark_apply() is a combination of different functions in SparkR: it can be used over the default partitions, or it can be run over a chosen partitioning. The latter is obtained by simply specifying the group_by argument.
Note that the R function supplied to spark_apply() receives each partition as an ordinary R data.frame and must return an R data.frame. As expected, spark_apply() applies the given function to each partition and then combines the results into a single Spark DataFrame.
# sdf_len creates a DataFrame of a given length (5 in the example):
x <- sdf_len(sc, 5, repartition = 1) %>%
spark_apply(function(x) pi * x^2)
print(x)
## # Source: spark<?> [?? x 1]
## id
## <dbl>
## 1 3.14
## 2 12.6
## 3 28.3
## 4 50.3
## 5 78.5
Machine Learning
The library sparklyr comes with a wide range of machine learning functions as well as transformation functions.
• feature transformers are the functions that create buckets, binary values, element-wise products, binned categorical values based on quantiles, etc. The name of these functions
starts with ft_. Note especially the function sql_transformer() – an exception to the naming convention – which allows one to transform data based on an SQL statement.
• machine learning algorithms are the functions that perform linear regression, PCA, logistic regression, decision trees, random forests, etc. The name of these functions starts with ml_.
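As a minimal sketch of how a model could be fitted before the prediction step below (the split proportions, the seed, and the formula are assumptions):

# Split the Spark table and fit a logistic regression via MLlib:
splits  <- sdf_random_split(Titanic_tbl, training = 0.7, test = 0.3, seed = 1890)
t_train <- splits$training
t_test  <- splits$test
M <- ml_logistic_regression(t_train, Survived ~ Pclass + Sex)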
# Add predictions:
pred <- sdf_predict(t_test, M)
As one has come to expect from RStudio, all their contributions to the R community come with great documentation on their website. The pages of sparklyr are here: https://spark.rstudio.com.
Both tools
• should feel reasonably familiar (knowing both base R and the tidyverse),
• allow the results to be brought back to R for further analysis, visualization and reporting.
So both libraries achieve roughly the same thing; however, we believe that it is fair to say that
• SparkR has a Python equivalent, so if you use both languages you will appreciate the shortened learning curve;
• SparkR follows the Scala Spark API more closely, so there are learning synergies here too.
Those are strong arguments if you are a data scientist who also works with Python or directly on Spark. However, if you have more of an analyst profile and usually work with R, then sparklyr has some noteworthy advantages.
There is no conclusive winner. It is, however, worth spending some time considering which interface to use and then sticking with that one.
♣ 40 ♣
In the previous sections, we focused on the issue where data becomes too large to be read into memory or causes the code to run too slowly. The first aspect – large data – is covered in the previous sections; however, the reason why our code is slow is not always that the data is too big. It might be related to the algorithm, the programming style, or we might have hit the limits of what R naturally can do. In this chapter, we will have a look at how to optimize code for speed – and assume that data size is not the main blocking factor.
We will show how to evaluate and recognize efficient code and study various ways to speed up code. Most of those ways to reduce runtime are related to how R is implemented and what type of language it is. R is a high-level interpreted language: it provides complex data types with loads of functionality and hides away much complexity. Using simpler data types, pre-allocating memory, and compiling code will therefore be part of our basic toolbox.
40.1 Benchmarking
Before we start to look into the details of optimizing for speed, we need an objective way to tell whether code is really faster or not. R has built-in tools to measure how long a function or code block runs via the function system.time() of base R, which we used before (for example in Chapter 37.3 "Using the GPU" on page 752). We supply an expression to be timed to the function system.time() as its argument.
x <- 1:1e4
system.time(mean(x))
The code block above immediately illustrates some issues with timing code. The function mean() is too fast to be timed reliably, and hence we need to repeat it a few times to get a meaningful measure.
N <- 2500
system.time(for (i in 1:N) mean(x))
This result is meaningful. However, when we execute the function once more, the result is usually different.
We realize that the timing is a stochastic result and know how to measure it, but as long as we have no alternative algorithm, this measurement is not so useful. So, let us come up with a challenger model and calculate the mean as the sum divided by the length of the vector. This will only work on vectors, but since our x is a vector this is a valid alternative.
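The timing of that challenger (a sketch; the output is omitted — compare it with the timing of mean() above):

system.time(for (i in 1:N) sum(x) / length(x))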
The difference in runtime is clear: our alternative to the function mean() is faster; it takes about half of the time to run. However, the difference will not always be so clear, and run-times are not always exactly the same each time we run the code. Measuring the same block of code on the
same computer a few times will typically yield different results. So, in essence, the run-time is a stochastic variable. To compare stochastic variables, we need to get some insight into the shape of the distribution and study the mean, median, minimum, quartiles, maximum, etc. This can be done manually or by using the library microbenchmark.
This lightweight package provides us with all the tools that we need. It even provides automated visualization of the results. After installing the package, we can load and use it.
N <- 1500
# Load microbenchmark:
library(microbenchmark)
# Benchmark both implementations of the mean, N times each:
comp <- microbenchmark(mean(x), {sum(x) / length(x)}, times = N)
The object comp is an object of class microbenchmark and – because of R's useful implementation of the S3 OO system – we can print it or ask for a summary with the known functions print() and plot(), for example.
summary(comp)
## expr min lq mean
## 1 mean(x) 12.849 13.1315 13.95813
## 2 { sum(x)/length(x) } 10.018 10.1490 10.55872
## median uq max neval
## 1 13.2295 13.374 66.145 1500
## 2 10.2060 10.267 45.464 1500
Figure 40.1: The package microbenchmark also defines a suitable visualisation via autoplot. This shows us the violin plots of the different competing methods. It also makes clear how heavy the right tails of the distributions of run-times are.
microbenchmark allows us to confirm with more certainty that our own code is faster than the native function mean(). This is because the function mean() is a dispatcher function that will first find out what kind of object x is and then dispatch it to the relevant function, where some logic about handling NAs, error handling, and more functionality still slows down the calculation.
The microbenchmark object holds information about the different times that were recorded when the code was run, and it has a suitable method for visualizing the distributions via autoplot() – a wrapper around ggplot2. The function autoplot() is a dispatcher function for wrappers around ggplot2 – see Chapter 31 "A Grammar of Graphics with ggplot2" on page 687. The plot method that microbenchmark provides is a violin plot: see Chapter 9.6 on page 173. This is demonstrated via the following code, which creates the plot in Figure 40.1 on page 795.
# Load ggplot2:
library(ggplot2)
# Use autoplot():
autoplot(comp)
The place to start looking for improvement is always the programming style. Good code is logically
structured, easy to maintain, readable and fast. However, usually some compromise is needed: we
need to make a choice and optimize for readability or speed.
Below we look at some heuristics that can be used to optimize code for speed. This list might
not be complete, but in our experience the selected subjects make a significant difference in how
efficient the code is.
x <- 1:1e+4
y <- 0
y1 <- pi
y2 <- 2.718282
y3 <- 9.869604
y4 <- 1.772454
N <- 1e5
}
}
system.time(f1())
## user system elapsed
## 0.031 0.005 0.036
The for-loops in the aforementioned code, and in most of the rest of this section, only repeat the same action in order to create a runtime that is long enough to be measurable by system.time(). The for-loops themselves do indeed create some overhead, but it is the same in each of the methods, and hence the differences are to be attributed to the methods discussed.
The direct sum (f1()) is slower than first creating a vector and then summing it (f2()). The overhead seems to come from accessing the vector rather than using the scalars directly (f3() and f4()). However, considering that option 2 is not only the fastest option but also the most readable, we strongly recommend using it.
While our experiment shows a clear winner (method two), it is not a bad idea to use microbenchmark and make sure that we make the right choice. The results of this approach are in Figure 40.2 on page 799, which is generated by the code below.
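That code is not reproduced here; a sketch (the number of repetitions is an assumption):

comp2 <- microbenchmark(f1(), f2(), f3(), f4(), times = 100)
autoplot(comp2)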
Figure 40.2: This figure shows that working with vectors is fastest (f2()).
Using the function append() is not very efficient compared to the other methods, because the function is designed to cater for much more than what we are doing.
Methods 2 and 3 are very close and yield good results in other programming languages. However, in R they need – just as method 1 – quadratic time (O(N²)), and are therefore bound to fail for large enough N. Apart from the obvious inefficiency in method 2 (actually related to the first principle discussed), these methods require that R allocates more memory for the list object in each iteration. Only method 4, where the size of the list is defined only once, is efficient.
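The code of that experiment is not shown above; a minimal sketch of the contrast that matters most — growing an object versus pre-allocating it — could look as follows (the exact four methods of the text are assumptions):

N2 <- 1e4
# Growing a list with append() forces R to re-allocate memory in every iteration:
system.time({l <- list(); for (i in 1:N2) l <- append(l, i)})
# Pre-allocating the full length once is efficient:
system.time({l <- vector("list", N2); for (i in 1:N2) l[[i]] <- i})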
Whenever you have the choice between a complex data structure (RC object, list, etc.) and a simpler data structure that can be used in the particular code, then choose the simplest one. When an object provides more flexibility (e.g. a data frame can also hold character types, where a matrix can only have elements of the same type), then this additional flexibility comes at a cost.
For example, let us investigate the difference between using a matrix and a data frame. Below we will compare both for simple arithmetic.
N <- 500
# simple operations on a matrix
M <- matrix(1:36, nrow = 6)
system.time({for (i in 1:N) {x1 <- t(M); x2 <- M + pi}})
## user system elapsed
## 0.009 0.000 0.009
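The data-frame counterpart of the same experiment (a sketch; the output is omitted):

# The same simple operations on a data.frame:
D <- as.data.frame(M)
system.time({for (i in 1:N) {x1 <- t(D); x2 <- D + pi}})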
Obviously, the aforementioned code does nothing useful: it only has unnecessary repetitions, but it illustrates that the performance gain of a matrix over a data frame is massive (the matrix is more than 30 times faster in this example).
There is, however, a caveat. That caveat is in the words “same functionality.” Base functions
such as mean() do a lot more than just sum(x)/length(x). These base-functions are dispatcher
functions: they will check the data-type of x and then dispatch to the correct function to calculate
the mean.
We refer to the example from the beginning of this chapter:
x <- 1:1e4
N <- 1000
system.time({for (i in 1:N) mean(x)})
## user system elapsed
## 0.017 0.000 0.017
This example makes clear that the library xts is much faster than the zoo library. Similar differences can be found among many other libraries. For almost any task, R offers a choice of ways to achieve a certain goal. It is up to you to make a wise choice that is intuitive, fast, and covers your needs.
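The benchmark behind this statement is not reproduced above; a sketch of such a comparison (the operation chosen here, lagging a daily series, is an assumption) could be:

library(zoo)
library(xts)
idx   <- as.Date("2020-01-01") + 0:9999
z_zoo <- zoo(rnorm(10000), order.by = idx)
x_xts <- xts(rnorm(10000), order.by = idx)
microbenchmark(lag(z_zoo), lag(x_xts), times = 100)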
Till now, the differences in performance were both obvious and significant. For the remainder of the chapter, we study some effects that can be smaller and more subtle. Therefore, we have chosen to run the calculations and performance tests separately from compiling the book. In other words, while in the rest of the book – almost everywhere – the code that you see directly generates the output and plots, below the code is run separately and the results were added manually. The reason is that while generating this book, the code is wrapped in other functions such as knitr(), tryCatch(), etc., and this has a less predictable impact.
You can recognize the static code by the response of R: the output lines – starting with ## – appear in bold green and not in black text.
# standard function:
f1 <- function(n, x = pi) for(i in 1:n) x = 1 / (1+x)
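# The functions f2() to f6() below are hypothetical stand-ins (assumptions):
# they only vary the type and the number of brackets around the same
# operation, consistent with the discussion of curly and round brackets below.
f2 <- function(n, x = pi) for(i in 1:n) x = {1 / {1+x}}
f3 <- function(n, x = pi) for(i in 1:n) x = (1 / (1+x))
f4 <- function(n, x = pi) for(i in 1:n) x = {(1 / (1+x))}
f5 <- function(n, x = pi) for(i in 1:n) x = 1 / ((((1+x))))
f6 <- function(n, x = pi) for(i in 1:n) x = 1 / {{{{1+x}}}}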
N <- 1e6
library(microbenchmark)
comp <- microbenchmark(f1(N), f2(N), f3(N), f4(N), f5(N), f6(N),
times = 150)
comp
## Unit: milliseconds
## expr min lq mean median uq max ...
## f1(N) 37.37476 37.49228 37.76950 37.57212 37.79876 39.99120 ...
## f2(N) 37.29297 37.50435 37.79612 37.63191 37.81497 41.09414 ...
## f3(N) 37.96886 38.18751 38.59619 38.28713 38.68162 47.66612 ...
The plot of these results is in Figure 40.3 and shows that the first two methods are the most efficient. It seems that while the curly brackets perform a little more consistently, there is little to no reason to prefer either of them based on speed. However, not overusing brackets and using the simplest operation does pay off.
Figure 40.3: Different ways to calculate 1/(1 + x). The definition of the functions fi() is in the aforementioned code.
• Curly brackets are typically a little faster than the round ones, but the difference is really small, and (in this case) much smaller than the variation between different runs.1
1 The differences are so small that the statistics do not give a univocal answer. For example, the difference between the functions f1() and f2() is really small: the median of f2() is better, but the mean of f1() shows a slight advantage in the opposite direction.
library(compiler)
N <- as.double(1:1e7)
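The function being timed is a simple numeric computation on the vector N; a minimal sketch of the pattern – the body of f() shown here is an assumption for illustration – and of its byte-compiled counterpart could be:
f <- function(x) {
  s <- 0
  for (v in x) s <- s + v   # a deliberately loop-heavy implementation
  s
}
cmp_f <- cmpfun(f)          # the byte-compiled version of the same function

library(microbenchmark)
library(ggplot2)
comp <- microbenchmark(f(N), cmp_f(N), times = 50)
autoplot(comp)              # produces a violin plot such as Figure 40.4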
The plot generated by the aforementioned code is in Figure 40.4 on page 805. The compiled function is just a little faster, but shows a more consistent result.
It seems that in this particular case (given the determining factors such as the function itself, the implementation of R, the operating system, the processor, etc.), we gain little to nothing by compiling the function. In general, one can expect little gain if there are a lot of base-R functions in your code, or conversions between different data types. Compiling functions typically pays off more when there are more numeric calculations, selections, etc.
Figure 40.4: Autoplot generates nice violin plots to compare the speed of the function defined in the
aforementioned code (f) versus its compiled version (cmp_f).
Note that there is more than one way to compile code. cmpfun() will go a long way, but – for example – when creating packages, we can force the code to be compiled at installation by adding ByteCompile: true in the DESCRIPTION file of the package.2
Actually, when using install.packages() and the programmer did not specify the ByteCompile field, the package is not compiled. It is possible to force R to byte-compile every package that gets installed by setting the environment variable R_COMPILE_PKGS to a positive integer value, for example by executing:
Sys.setenv(R_COMPILE_PKGS = 1)
A final option is to use just-in-time (JIT) compilation. This means that the overhead of compiling a function only comes when needed, and this might – depending on how you are using R – deliver a better experience. The function enableJIT() – from the package compiler – sets the level of JIT compilation for the rest of the session: it disables JIT compilation if the argument is 0, and for the arguments 1, 2, or 3 it implements increasing levels of optimisation.
JIT compilation can also be controlled by setting the environment variable R_ENABLE_JIT to one of the values mentioned above before starting R. For example, from the command line:
R_ENABLE_JIT=0 R    # start R with JIT compilation disabled
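For completeness, a minimal usage sketch of enableJIT() within a session (the chosen level 3 is just an example):
library(compiler)
old_level <- enableJIT(3)   # request the highest JIT level; returns the previous level
# ... run the code that should benefit from JIT compilation ...
enableJIT(old_level)        # restore the previous setting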
2 See Chapter A “Create your own R package” on page 821 to learn more about creating R packages.
library(microbenchmark)
N <- 30
comp <- microbenchmark(Fib_R(N), Fib_R_cmp(N),
Fib_Cpp(N), times = 25)
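The three functions being compared follow the naive recursive definition of the Fibonacci numbers; a minimal sketch of how they can be set up (the exact bodies used for these timings are assumptions) is:
library(Rcpp)
library(compiler)

# naive recursion in R:
Fib_R <- function(n) if (n < 2) n else Fib_R(n - 1) + Fib_R(n - 2)

# the same function, byte-compiled with cmpfun():
Fib_R_cmp <- cmpfun(Fib_R)

# the same recursion in C++, compiled via Rcpp::cppFunction():
cppFunction('
  int Fib_Cpp(int n) {
    if (n < 2) return n;
    return Fib_Cpp(n - 1) + Fib_Cpp(n - 2);
  }')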
comp
## Unit: milliseconds
## expr min lq mean median uq
## Fib_R(N) 1449.755022 1453.320560 1474.679341 1456.202559 1472.447928
## Fib_R_cmp(N) 1444.145773 1454.127022 1489.742750 1459.170600 1554.450501
## Fib_Cpp(N) 2.678766 2.694425 2.729571 2.711567 2.749208
## max neval cld
## 1596.226483 25 b
## 1569.764246 25 b
## 2.858784 25 a
library(ggplot2)
autoplot(comp)
The last line of the aforementioned code generates the plot that is shown in Figure 40.5.
Figure 40.5: A violin plot visualizing the advantage of using C++ code in R via Rcpp. This picture
makes very clear that the “compilation” (via the function cmpfun()) has only a minor advantage
over a native R-function. Using C++ and compiling via a regular compiler (in the background
cppFunction() uses g++), however, is really a game changer.
The performance advantage of compiling native C++ code via Rcpp is massive. The C++ function is about 500 times faster than the R function, while the compiled function in R seems to have almost no advantage.
The reader will by now probably have noticed that the code can still be improved further. Better results can be obtained by realizing that, the way this is programmed, we do the same calculations twice. In other words, we can still tune the algorithm.
comp
## Unit: microseconds
## expr min lq mean median ...
## Fib_R(N) 1453850.637 1460021.5865 1.495407e+06 1471455.852...
## Fib_R2(N) 2.057 2.4185 1.508404e+02 4.792...
## Fib_Cpp(N) 2677.347 2691.5255 2.757781e+03 2697.519...
## Fib_Cpp2(N) 1.067 1.4405 5.209175e+01 2.622...
## max neval cld
## 1603991.462 20 b
## 2925.070 20 a
## 3322.077 20 a
## 964.378 20 a
library(ggplot2)
autoplot(comp)
The violin plots generated by the last line of this code are in Figure 40.6 on page 809 and show the supremacy of C++, only to be rivalled by the intelligence of the programmer. Finally, this version of the function Fib_Cpp2() is almost a million times faster than the native and naive approach in R (Fib_R()).
When writing code, it is really worthwhile to
1. write code not only with readability in mind, but also write intelligently with speed in mind;
2. and – if you are fluent in C++ – use it to your advantage via the package Rcpp.
Summarising, we can state that Rcpp is an amazing tool, but it cannot rival smart coding.
Using C++ code in R not only requires knowledge of both C++ and R, it also requires you to deal with different data definitions in the two environments – but it is really worth learning. We recommend starting with Hadley Wickham's introduction, http://adv-r.had.co.nz/Rcpp.html, and reading Eddelbuettel and Balamuta (2018).
Figure 40.6: A violin plot visualizing the advantage of using C++ code in R via Rcpp. Note that the
scale of the x-axis defaulted to a logarithmic one. This image makes very clear that using native C++
via Rcpp is a good choice, but that it is equally important to be smart.
Digression – Efficient R
Notice how close the performance of the functions Fib_R2() and Fib_Cpp2() is. R has come a long way, and in simple routines (so, e.g. no complex data structures, no overhead for calling functions, etc.) R approximates the speed of C++ in the smart implementation of this example.
Further, the functions that need to be exported to R should be preceded by exactly the following comment (nothing more and nothing less on the line preceding the function – not even an empty line):
// [[Rcpp::export]]
Note also that it is possible to embed R code in special C++ comment blocks. This is really convenient if you want to run some test code, or want to keep functions logically in the same place (regardless of the programming language used):
/*** R
# This is R code
*/
The R code is run with source(echo = TRUE), so you don't need to explicitly print output.
Once this source file is created and saved on the hard disk, we can compile it in R via sourceCpp().
For further reference, we will assume that the file is saved under the name /path/cppSource.cpp.
sourceCpp("/path/cppSource.cpp")
The function sourceCpp() will create all the selected functions and make them available to R in the current environment.
In the following example, we use the example from the previous section: the faster implementations of the Fibonacci series in both C++ and R. To import these functions from C++ source, we need to create a file with the following content:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int Fib_Cpp2(int n) {
  int x = 1, x_prev = 1, x_new, i;
  for (i = 2; i <= n; i++) {
    x_new  = x + x_prev;  // the next Fibonacci number
    x_prev = x;           // remember the previous one
    x      = x_new;
  }
  return x;
}

/*** R
Fib_R2 <- function (n) {
  x <- 1
  x_prev <- 1
  for (i in 2:n) {
    x_new  <- x + x_prev   # the next Fibonacci number
    x_prev <- x            # remember the previous one
    x      <- x_new
  }
  x
}
library(microbenchmark)
N <- 30
comp <- microbenchmark(Fib_R2(N), Fib_Cpp2(N), times = 20)
comp
*/
This source file will not only make the C++ function available, but also run some tests, so that we can immediately assess if all works fine. When this file is imported via the function sourceCpp(), the functions Fib_Cpp2() and Fib_R2() will both be available to be used in R.
To understand this subject better, we recommend Hadley Wickham's blog “High performance functions with Rcpp”: http://adv-r.had.co.nz/Rcpp.html.
The code above will most probably work for you just like that, but in case you do not have the Rcpp.h header, it is possible to download it from https://github.com/RcppCore/Rcpp. It will also be part of the packages of your Linux distribution. For example, on Debian and its derivatives one can install it via:
sudo apt install r-cran-rcpp  # prepares the use of the package Rcpp
sudo apt install r-base-dev   # more general and most probably what you need for this section
More information about calling C routines in R can be found, for example, here: http://adv-r.had.co.nz/C-interface.html
Digression – R in C++
It is also possible to call R functions from within a C++ program. Apart from solid knowledge of C++, it also requires that R on your computer is compiled to allow linking, and you must have a shared or static R library. We consider this beyond the scope of this book, but wanted to make sure that you know it exists and refer you to the “RInside project”: https://github.com/eddelbuettel/rinside and its documentation: http://dirk.eddelbuettel.com/code/rinside.html.
We already discussed in Chapter 40.1 “Benchmarking” on page 794 how benchmarking helps to provide a good insight into how fast a function really is. This is very useful; however, in a typical work-flow, one will first write a larger piece of code that “does the job.” Once we have a piece of code that works, we can start to optimize it and make sure that it will run as fast as possible. We could recursively work our way through the many lines and functions and, via benchmarking, keep track of which function is fastest, yet this process is neither efficient nor does it provide insight into which parts of the code are called most often. The process of finding out what part of the code takes most time – and hence is the low-hanging fruit for optimization – is called “profiling.”
While in the rest of the book the code shown directly generates the output and plots that appear below or near it, the code in this section is run separately and the results were manually collated. The reason is that while generating this book, the code is wrapped in other functions such as knitr(), tryCatch(), etc., and this has a less predictable impact and – most importantly – those functions might also appear in the output that we present here, which would be confusing for the reader.
The package utils, which is part of base R, provides the function Rprof(), the profiling tool in R. The function is called with the name of the file where the results should be stored when the logging should start. Then follows the code to be profiled, and finally we call Rprof() again to stop the logging.
So, the general work-flow is as follows.
Rprof("/path/to/my/logfile")
# ... the code to be profiled goes here ...
Rprof(NULL)
As a first example, we will consider a simple construction of functions that mainly call each other, in such a way that some parts should display a predictable performance difference (added complexity or a double amount of repetitions).
In order to do the profiling, we have to mark the starting point of the profiling operation by calling the function Rprof(). This indicates to R that it should start monitoring. The function takes one argument: the file in which to keep the results. The profiling process will register information till the process is stopped via another call to Rprof() with the argument NULL.
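The exact content of the nested functions does not matter for the mechanics of Rprof(); the definitions below are hypothetical stand-ins that merely mimic the structure described above, followed by the call that starts the logging:
# hypothetical stand-ins for f0() ... f4():
f0 <- function(n) for (i in 1:n) mean(rnorm(100))   # the basic workload
f1 <- function(n) f0(n)                             # delegates to f0()
f2 <- function(n) { f1(n); f0(n) }                  # roughly doubles the work
f3 <- function(n) { f2(n); f0(2 * n) }              # adds complexity on top of f2()
f4 <- function(n) { f3(n); f1(n) }                  # the top-level function to profile

Rprof("prof_f4.txt")   # start logging; stop later with Rprof(NULL)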
N <- 500
f4(N)
Now, the file prof_f4.txt is created and can be read and analysed. The function summaryRprof() from the package utils is a first port of call and usually provides very good insight.
# show the summary:
summaryRprof("prof_f4.txt")
## $by.self
## self.time self.pct total.time total.pct
## "f0" 0.18 37.50 0.18 37.50
## "f3" 0.16 33.33 0.34 70.83
## "f1" 0.08 16.67 0.08 16.67
## "f2" 0.06 12.50 0.14 29.17
##
## $by.total
## total.time total.pct self.time self.pct
## "f4" 0.48 100.00 0.00 0.00
## "f3" 0.34 70.83 0.16 33.33
## "f0" 0.18 37.50 0.18 37.50
## "f2" 0.14 29.17 0.06 12.50
## "f1" 0.08 16.67 0.08 16.67
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 0.48
• by.self: the time spent in a function itself, excluding the time spent in the functions that it calls.
• by.total: the time spent in a function, including all the functions that this function calls.
An alternative view is provided by the package profr (written by Hadley Wickham): its function profr() runs the profiler over an expression and returns a data frame of the stack traces that can be plotted directly, as the following code illustrates.
N <- 1000
require(profr)
pr <- profr({f4(N)}, 0.01)
plot(pr)
The resulting plot shows the call stack over time: each active function (f4(), force(), doTryCatch(), etc.) is drawn at its level in the stack against the time axis.
The hot paths in the profiled code can be listed via the function hotPaths() – illustrated in the following code. The function flameGraph() produces a colourful flame graph for the call tree in the profile stack trace as produced by Rprof() – illustrated in Figure 40.8 on page 815. A related view, the “callee tree map,” can be produced via the function calleeTreeMap() in order to get another appealing visual effect – illustrated in Figure 40.9 on page 815.
library(proftools)
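A minimal sketch of these calls, reading the file produced by Rprof() in the previous section (the argument values shown are assumptions):
pd <- readProfileData("prof_f4.txt")   # read the profile data written by Rprof()
hotPaths(pd, total.pct = 10.0)         # print the hot paths in the call tree
flameGraph(pd)                         # the flame graph of Figure 40.8
calleeTreeMap(pd)                      # the callee tree map of Figure 40.9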
Figure 40.8: A flame-plot produced by the function flameGraph() from the package proftools.
Figure 40.9: A callee tree map produced by the function calleeTreeMap() from the package proftools. This visualization shows boxes with surfaces that are relative to the time that the function takes.
In order to delete the log-file via R, you can use the unlink() function.
unlink("prof_f4.txt")
This is useful when automating, since Rprof() expects a non-existing file. However, note that Rprof() also allows you to append to an existing file via the parameter append = TRUE.
Look back at Figure 40.4 on page 805 and notice how all the performance distributions have heavy tails to the right. This means that the code usually takes a given time to run, but in some rare cases it takes a lot more time. This is mainly due to other – higher priority – code running on the same computer.
To compile this book, we use a general purpose Linux distribution, and such systems by default do not allocate the highest priority to user processes; and if they do, they will make sure that, for example, music playing in the background does not get interrupted when R starts calculating.
One way to improve this aspect is to allocate a higher priority to your R process. In Linux, each process gets a “niceness” value when it is started. Niceness ranges from −20 to 19, with −20 the “meanest” process that will not be “nice” to other processes and not share processor time, and hence can be expected to be a little faster.
Warning – Be nice
2. in the cases where it does make a difference, it will interrupt other processes and the computer will not be responsive to user input (e.g. switching to another window).
Much more speed gain is to be expected from a clean programming style, optimised code and – where relevant – C++ code.
To get some insight into what is going on, use the command top:
top
top - 16:18:34 up 5 days, 1:36, 4 users, load average: 1,71, 1,41, 1,13
Tasks: 261 total, 2 running, 169 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12,1 us, 1,1 sy, 0,0 ni, 86,7 id, 0,1 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 8086288 total, 1551740 free, 3683512 used, 2851036 buff/cache
KiB Swap: 17406972 total, 17406972 free, 0 used. 3590084 avail Mem
As you can see, the user processes are started with a niceness (“NI”) of 0 by default, and R is not the only process running and claiming processor time. To change the priority you need the command nice.3
It is possible to change the niceness of a process while it is running with the command renice, but probably the most straightforward is to set the niceness at launch as follows.
# Launch R with niceness of +10
nice -n 10 R
However, this will run R with a positive niceness, which is probably not what you need. Setting a
negative niceness needs administrator rights. So, you can use:
# Launch R with niceness of -15
sudo nice -n -15 R
Now, the R process will be less willing to share processor time with other processes and this will
force R to run at a higher priority:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11975 root 5 -15 549320 243072 22964 R 99,7 3,0 0:26.59 R
19962 philippe 20 0 2359076 197480 46964 S 49,5 2,4 594:28.54 thunderbird
19710 philippe 20 0 3328052 388672 119256 S 7,3 4,8 45:32.56 Web Content
19654 philippe 20 0 3973576 521712 177524 S 4,0 6,5 67:34.24 firefox
[...]
Note that setting negative niceness factors requires root privileges – hence the sudo command. There is a good reason why this is needed: “it is not nice to do so.” Other processes such as communication, user interaction, streaming, etc. do need priority. Expect your computer to behave less nicely if you do this.
A second way to improve performance via software is trying different distributions, such as “Scientific Linux,” “Arch,” “MX Linux” or – with a little more patience but more control – “Gentoo” or “LFS.” Have a look at www.distrowatch.com and explore some of the possible free options.
Also the hardware can be tuned, and these days modern motherboards make it easy to over-clock the CPU. This is, however, not without risk: it is important to monitor the temperature of the CPU and other elements, and if needed to improve the cooling. In any case, this can physically damage the computer.
We really add this section only for completeness, and mainly to allow us to make the point that all the other methods mentioned in earlier sections are more helpful and most likely the way to go. Especially read Chapters 37 to 39 and the first part of this one (40).
3 You will also notice that there is a column with “PR” (for priority); for non-real-time processes, niceness and priority are related as PR = 20 + NI.
PART IX
Appendices
♣A♣
Create your own R package
While this book focuses on the work-flow of the data scientist and is not a programming book per se, it still makes a lot of sense to introduce you to building your own package. A package is an ideal solution to keep together the functions and data that you need regularly for specific tasks. Of course, it is possible to keep these functions in a file that can be read in via the function source(), but having it all in a package has the advantage of being more standardized, structured, and easier to document; it also becomes more portable, so that others can use your code too. Eventually, if you have built something great and unique, it might be worth sharing it via the “Comprehensive R Archive Network” (CRAN).1
Before we can get started, we need to install two packages first: devtools – which provides the essentials to build the package – and roxygen2 – which facilitates the documentation. This is done as follows:
install.packages("devtools")
install.packages("roxygen2")
devtools is the workhorse to develop packages: it is built and designed to make the process of building packages easier. For readers that use C++, it might suffice to say that roxygen2 is the R equivalent of Doxygen.2 This means that roxygen2 will be able to create documentation for R source code if that code is documented according to a specific syntax.
1 The link to submit a package is: https://cran.r-project.org/submit.html.
2 Doxygen is the de facto standard tool to generate documentation directly from C++ source code. This is achieved by annotating the code in a specific way. Doxygen also supports other languages such as C, C#, PHP, Java, Python, IDL, Fortran, etc. Its website is here: http://www.doxygen.nl.
There are multiple work-flows possible to create a package. One can start from an existing .R file, or first create the empty skeleton of a package and then fill in the code. In any case, the two packages are needed and hence the first step is loading them.
library(devtools)
library(roxygen2)
We could select all the functions used in this book to create a package and name it after the book, but it makes more sense to use a narrower scope, such as the functionality around asset pricing (see Chapter 30 “Asset Valuation Basics” on page 597), diversity (see Chapter 36.3.1 “The Business Case: a Diversity Dashboard” on page 726) or multi criteria decision analysis (see Chapter 27 “Multi Criteria Decision Analysis (MCDA)” on page 511). The functions for multi criteria decision analysis in this book follow a neat convention where each function name starts with mcda_; this would be a first step of good housekeeping that is necessary for a great package in R.
We choose the functions around the diversity dashboard to illustrate how a package in R is built. The code below walks through the different steps. First, we have the function setwd(), which sets the working directory on the file system of the computer to the place where we want to create the package. Then, in step one, it defines the function – as it was done before, not using the roxygen commenting style for now.
Now, we are working from the correct working directory and we have the function diversity() in the working memory of R. This is the moment where we can create the package with the function package.skeleton() (from the package utils).
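A minimal sketch of these steps – the working directory and the body of diversity() below are assumptions for illustration only – could look as follows:
setwd("~/R/packages")   # the directory in which the package source will be created

# an illustrative stand-in for the diversity() function discussed in the text:
diversity <- function(x) {
  p <- table(x) / length(x)
  exp(-sum(p * log(p)))   # the exponential of the Shannon entropy
}

package.skeleton(name = "div", list = c("diversity"))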
The parameter list is a vector with the names of all functions that we want to be included in the package. The argument name supplied to the function package.skeleton() is the name of the package, and it will become a subdirectory in the current directory of our file-system. In that directory you will now find the following files and directories:
If you find these files, it means that you have created your first package. Those files are the
package and can be used as any other package.
• man contains the documentation for the functions and the package.
• .Rbuildignore lists files that we need to have around but that should not be
included when building the R package from this source.
• foofactors.Rproj is the file that makes this directory an RStudio Project. Even if
you do not use RStudio, this file is harmless. Or you can suppress its creation with
create(..., rstudio = FALSE).
Anyone who will use your package needs a short description to understand what the package does and why he or she would need it. Adding this documentation is done by editing the ./div/DESCRIPTION file. For example, replace its content by:
Package: div
Type: Package
Title: Provides functionality for reporting about diversity
Version: 0.8
Date: 2019-07-10
Author: Philippe J.S. De Brouwer
Maintainer: Who to complain to <[email protected]>
Description: This package provides functions to calculate diversity of discrete
observations (e.g. males and females in a team) and functions that allow
a reporting to see if there is no discrimination going on towards some
of those categories (e.g. by showing salary distributions for the groups of
observations per salary band).
License: LGPL
RoxygenNote: 6.1.1
The last line is used by R to document the functions, so it is essential to leave it exactly as it was.
Now that the package has a high-level explanation, it is time to document each function of the package.
You want, of course, the functions to be neatly documented in the usual way that R recognizes, so that all the usual ways of asking for documentation also work for your package. For example, the following line of code should lead to the documentation of the function diversity() and display a standard help-file formatted in the usual way.
?diversity
Each function will have its own file with documentation. R will look for these files in the man directory. As you will notice, these files follow a syntax that is reminiscent of LaTeX. If your functions are documented via the standards of roxygen2, these files will be created automatically, but they remain editable.
Open the file R/diversity.R and insert, right before the function, comments such as the following.
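The exact wording is up to you; a minimal sketch of such a roxygen2 comment block, to be placed immediately above the existing definition of diversity() (the tags are standard roxygen2, the descriptions are illustrative):
#' Compute the diversity of a set of observations
#'
#' @param x A vector with categorical observations (e.g. the gender of each team member).
#' @return The diversity of the observations in x.
#' @examples
#' diversity(c("M", "F", "F", "M", "F"))
#' @export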
To process this documentation and make it available, use the function document(). At this point, the document() function is likely to complain about the files that were already generated by package.skeleton().
If you prefer to use roxygen2, it is safe to delete the ./div/man/diversity.Rd as well as the ./div/NAMESPACE files, and then execute the document() command again. This command will now recreate both files and from now on overwrite them whenever we run the document() command again.
Alternatively, you can use the parameter roclets of the function document() to list what exactly R should do when building your package.
The package can be loaded directly from the source code that resides on our hard-disk.
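A minimal sketch, assuming that the package source sits in the sub-directory ./div of the current working directory:
library(devtools)
install("./div")   # build and install the package from its source directory
library(div)       # from now on, load it as any other package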
The package is now loaded3 and the functions are available, as is the documentation. The code below illustrates the use of the package.
?diversity
This will invoke the help file of that particular function in the usual way for your system setup (a man-like environment in the CLI, a window in RStudio, or even a page in a web-browser).
3 Note that the output of the install() function is quite verbose and is therefore suppressed in the book.
If your package does not contain confidential information or proprietary knowledge, and you have all rights to it, then you might want to upload it to GitHub so it can be shared with the Internet community. devtools even provides a function that installs a package straight from GitHub, so that others can then get your package with:
install_github("your_github_username/div")
And of course – in agreement with the philosophy of agile programming – you will continue to add functions, improve them, improve documentation, etc. From time to time, update the version number, document, and upload again.
Also consider reading the Read-and-delete-me file and grab Hadley Wickham's book about R packages. Thanks to his thought leadership, packages have indeed become the easiest way to share R code.
The guide by MIT is very helpful to take a few more steps, and it also includes more information about other platforms such as Windows. It is here: http://web.mit.edu/insong/www/pdf/rpackage_instructions.pdf.
Another great reference is Hadley Wickham's 2015 book “R packages: organize, test, document, and share your code.” It is also freely available on the Internet: http://r-pkgs.had.co.nz.
♣B♣
Some things in this world can be ordered in a meaningful way. Numbers, such as sales, turnover, etc., are such examples. Other things, such as colours, people, countries, etc., might be ordered in some way or another, but do not have an inherent dominant order.
For example, when making a model for credit applications, we usually have information about the category of job. People that apply for a loan choose the one of the following that applies best: law enforcement, military, other government, blue collar worker, white collar worker, teacher, employee, self-employed, student, retired. This makes sense, because some professions are more risky and hence this will impact the quality of the credit. However, is there an order? This is an example of a nominal scale: we have only names, but no order imposes itself on these labels.
The nominal scale is the simplest form of classification. It contains labels that do not assume an order. Examples include asset classes, first names, countries, days of the month, weekdays, etc. It is not possible to use statistics such as the average or the median, and the only thing that can be measured is which label occurs the most (the “modus” or “mode”).
This scale can be characterised as represented in Table B.1.
Note that it is possible to use numbers as labels, but that this is very misleading. When using a nominal scale, none of the traditional metrics (such as averages) can be used.
Feedback is these days typically given in stars: usually a user can rate a service, driver, product or seller with 1 to 5 stars. This is, of course, equivalent to “Rate our product with a number between 1 and 5.” The numbers are a little misleading here. Numbers as such can be added, subtracted, multiplied, etc. However, is this still meaningful in this example?
It does not really work. Is 2 stars twice as good as one star? Are 2 customers that provide 2 stars together as happy as one that provides 4 stars? It seems that for most people there is little difference between 3 and 4 stars, but there is a wider gap between 4 and 5 stars. So, while there is a clear order, this scale is not equivalent to using numbers. An average, for example, has little meaning. However, calculating it and monitoring how it evolves can provide some insight into how customers appreciate recent changes.
This scale type assumes a certain order. An example is a set of labels such as very safe, moderate, risky, and very risky. Bond ratings such as AAA, BB+, etc. are also ordinal scales: they indicate a certain order, but there is no way to determine if the distance between them is the same or different. For example, it is not really clear if the difference in probability of default between AAA and AA is similar to the distance between BBB and BB.
For such variables, it may make sense to talk about a median, but it does not make any sense to calculate an average (as is sometimes done in the industry and even in regulations).
These characteristics are summarised in Table B.2.
Ordinal labels can be replaced by others if the strict order is conserved (by a strictly increasing or decreasing function). For example, AAA, AA-, and BBB+ can be replaced by 1, 2, and 3, or even by -501, -500, and 500 000. The information content is the same, and the average will have no meaningful interpretation.
In some cases, we are able to be sure about the order as well as the distance between the different units. This type of scale will not only have a meaningful order but also meaningful differences.
This scale can be used for many quantifiable variables, for example temperature in degrees Celsius. In this case, the difference between 1 and 2 degrees is the same as the difference between 100 and 101 degrees, and the average has a meaningful interpretation. Note that the zero point has only an arbitrary meaning, just like using a number for an ordinal scale: it can be used as a name, but it is only a name.
Rescaling is possible and remains meaningful. For example, a conversion from Celsius to Fahrenheit is possible via the formula $T_f = \frac{9}{5} T_c + 32$, with $T_c$ the temperature in Celsius and $T_f$ the temperature in Fahrenheit.
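As a small illustration (not part of the original code base), this conversion is a one-line function in R:
celsius_to_fahrenheit <- function(t_c) 9 / 5 * t_c + 32
celsius_to_fahrenheit(c(0, 37, 100))   # 32.0  98.6 212.0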
An affine transformation is a transformation of the form y = A·x + b. In Euclidean space, an affine transformation preserves collinearity (so that points that lie on a line remain on a line) and ratios of distances along a line (for distinct collinear points p1, p2, p3, the ratio ||p2 − p1||/||p3 − p2|| is preserved).
In general, an affine transformation is composed of linear transformations (rotation, scaling and/or shear) and a translation (or “shift”). Affine transformations are closed under composition: several of them can be combined into a single transformation.
Using the Kelvin scale for temperature gives us a ratio scale: here not only the distances between the degrees but also the zero point is meaningful. Among the many examples are profit, loss, value, price, etc. A coherent risk measure is also measured on a ratio scale, because the property of translational invariance implies the existence of a true zero point.
♣C♣
Trademark Notices
We do not claim any ownership of trademarks or registered names; we want to thank all those people and companies that spend money and time to produce great software and make it available for free, allowing others to stand on the shoulders of giants. We also used names of well-known companies and some of their commercial products. These words are only used as citations and their rightful owner remains the source.
• Oracle and Java are registered trademarks of Oracle and/or its affiliates.
All other references to names, people, software and companies are made without any claim of ownership. For example: IBM, FORTRAN, BASIC, CODASYL, COBOL, CALC, SQL, INGRES, DS, DB, CPUs, noSQL, CAP, NewSQL, Nvidia, AMD, Intel, PHP, Python, Perl, etc. all are companies or trademarks registered by other companies. This holds also for all other names that we might have omitted in this list.
15. forecast: Hyndman et al. (2019) and Hyndman and Khandakar (2008)
67. tm: Feinerer and Hornik (2018) and Feinerer et al. (2008)
♣D♣
On rare occasions, we did not show all code. This was done out of concern that the code would get in the way of understanding the subject discussed. However, we do not want to deprive you of this code, and we reproduce it here for all the occurrences where certain code was not shown.
In the section in question, we introduced a simple method to visualize some aspects of the data that are helpful for deciding how to bin the data, and we used base R. It might be useful to know that in ggplot2 these results are equally easy to obtain and might suit your needs better. The output of this code is in Figure D.1 on page 840.
Figure D.1: A visual aid to select binning borders is plotting a non-parametric fit and the histogram.
This code fragment produces a plot that visualizes all possible pch arguments for the plot function
as shown in Figure 9.3 on page 162.
# This sets up an empty plotting field:
plot(x = c(0, 4.5),
y = c(0, 5),
main = "Some pch arguments",
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
cex.main = 2.6,
col = "white"
)
The following code produces the plot that is presented in Figure 22.6 on page 398. It uses ggplot2 and requires a fair amount of fiddling to get the line in the correct place and the words next to the line.
The aforementioned code includes a way to calculate KS. Since it is a little long, there are functions available to help us with that, as explained in Section 22.2.5 “Kolmogorov-Smirnov (KS) for Logistic Regression” on page 398.
The following code produces the plot in Figure 27.15 on page 551. Note especially the following:
• the definition of the error function and related functions: they are not readily available in R, but are closely related to qnorm() – see for example Section 29 in Burington et al. (1973);
• the library latex2exp, which provides the function TeX(): it allows text to be typeset by a LaTeX-like engine (note that it is not LaTeX; it converts LaTeX syntax to R's plotmath functions);
library(ggplot2)
library(latex2exp)
d <- seq(from = -3, to = +3, length.out = 100)
## Gudermannian function
gd <- function(x) asin(tanh(x))
## error function (not in base R; expressed via the normal distribution functions):
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
f1 <- function(x) erf( sqrt(pi) / 2 * x)
f2 <- function(x) tanh(x)
f3 <- function(x) 2 / pi * gd(pi / 2 * x)
f4 <- function(x) x / sqrt(1 + x^2)
f5 <- function(x) 2 / pi * atan(pi / 2 * x)
f6 <- function(x) x / (1 + abs(x))
fn <- ""
fn[1] <- "erf \\left(\\frac{\\sqrt{\\pi} d}{2}\\right)"
fn[2] <- "tanh(x)"
fn[3] <- "\\frac{2}{\\pi} gd\\left( \\frac{\\pi d}{2} \\right)"
fn[4] <- "\\frac{d}{1 + d^2}"
fn[5] <- "\\frac{2}{\\pi} atan\\left(\\frac{\\pi d}{2}\\right)"
fn[6] <- "\\frac{x}{1+ |x|}"
The following code produces the plot in Figure 27.9 on page 541. Pay attention to:
• the function deparse(), which converts objects to vectors of strings – in our case the functions have only one line, so we can capture the whole function in its second element – alternatively, one could use paste(deparse(f), collapse = " ") to capture the text of functions that span more than one line;
• for most functions we would not need to vectorize; however, for example min() and max() would fail, so we vectorize all of them – this creates a version of the function that can work on vectors;
• note also that functions such as the min, max, and gaus versions require additional parameters – here some choices have been made.
Vectorize(f)
## function (x)
## {
## args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
## names <- if (is.null(names(args)))
## character(length(args))
## else names(args)
## dovec <- names %in% vectorize.args
## do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
## SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
## }
## <environment: 0x56342f0a9c80>
par(mfrow=c(3,2))
f_curve(f1)
f_curve(f2)
f_curve(f3)
f_curve(f4)
f_curve(f5)
f_curve(f6)
par(mfrow=c(1,1))
Note that the part that is commented out with #-1- produces nicer expressions in all titles except for the one using the functions min() and max(). These functions work, unfortunately, differently in plotmath and R itself, causing the typesetting to be messed up. Using the function TeX() as in the previous plot would solve the issue. However, we wanted to point out the possibilities of the functions expression() and bquote() in this plot.
♣E♣
Question 1 on page 14
A suggestion could be the distribution of IQ (as in intelligence quotient). This seems to follow
more or less a normal distribution with a standard deviation of 15 points.
Question 2 on page 31
All operations in R are by default done on objects element by element. Hence, the following code is sufficient to create an object nottemC that contains all temperatures in degrees Celsius.
nottemC <- (nottem - 32) * 5/9
While this concept makes such calculations very easy, it must be noted that this will – by default – not produce a matrix or vector product as you might expect.
Question 3 on page 34
# Matrix multiplication with nested loops (the opening lines of this answer
# are a reconstruction; M1 and M2 are chosen to match the result printed below):
M1 <- matrix(1:6, nrow = 3)
M2 <- matrix(5:8, nrow = 2)
mat_mult <- function(m1, m2) {
  m <- matrix(nrow = nrow(m1), ncol = ncol(m2))
  for (k in 1:nrow(m1)) {
    for (l in 1:ncol(m2)) {
      m[k,l] <- 0
      for (n in 1:ncol(m1)) {
        m[k,l] <- m[k,l] + m1[k,n] * m2[n,l]
      }
    }
  }
  return(m)
}
# Compare with
M1 %*% M2
## [,1] [,2]
## [1,] 29 39
## [2,] 40 54
## [3,] 51 69
Question 4 on page 48
This question allows a lot of freedom, and there are many ways to tackle it. Here is one of the possible solutions, using histograms, correlations, and scatter plots.
# mtcars is in the package datasets (loaded by default); we also load MASS:
library(MASS) # However, this is probably already loaded
The code produces – among other output – a histogram of mtcars$gear on a density scale.
Note the function jitter(), which adds random noise to each observation so that overlapping points get just a little separated. This makes clear that in some places the density of dots is higher.
Question 5 on page 48
# Create the factors, taking care that we know what factor will
# be manual and what will be automatic.
f <- factor(mtcars$am,
levels = c(0,1),
labels = c("automatic", "manual")
)
Question 6 on page 48
The first part of this question overlaps with the previous one, so we will just show the histogram. This step is always important to get a good understanding of the data. Is there a car with 24 forward gears or 1000 horsepower? Then – unless it is a serious truck – this is probably wrong. Not only do we want to understand outliers, we also want to get a feel for the data.
In this case, the reader will notice that the distribution of horsepower is skewed to the right. If we are now asked to provide the labels low, medium, and high, we have to understand why we are doing this. Do we want an equal number of cars in each group? Or maybe it makes sense to have a very small group “high”?
The resulting bar plot, “Horsepower as a factor,” shows the number of cars in each of the bins (L, M, and H).
Question 7 on page 54
M <- matrix(c(1:9),nrow=3);
D <- data.frame(M);
rownames(D) <- c("2016","2017","2018");
colnames(D) <- c("Belgium", "France", "Poland");
cbind(D,rowSums(D));
## Belgium France Poland rowSums(D)
## 2016 1 4 7 12
## 2017 2 5 8 15
## 2018 3 6 9 18
D <- D[,-2]
## # A tibble: 1,000 x 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 1001 y 1/0/T/F/TRU~ 2015-0~ '/home/philippe/R/x86~
## 2 1002 y 1/0/T/F/TRU~ 2018-0~ '/home/philippe/R/x86~
## 3 1003 y 1/0/T/F/TRU~ 2015-0~ '/home/philippe/R/x86~
## 4 1004 y 1/0/T/F/TRU~ 2012-1~ '/home/philippe/R/x86~
## 5 1005 y 1/0/T/F/TRU~ 2020-0~ '/home/philippe/R/x86~
## 6 1006 y 1/0/T/F/TRU~ 2016-0~ '/home/philippe/R/x86~
## 7 1007 y 1/0/T/F/TRU~ 2011-0~ '/home/philippe/R/x86~
## 8 1008 y 1/0/T/F/TRU~ 2020-0~ '/home/philippe/R/x86~
## 9 1009 y 1/0/T/F/TRU~ 2011-0~ '/home/philippe/R/x86~
## 10 1010 y 1/0/T/F/TRU~ 2010-0~ '/home/philippe/R/x86~
## # ... with 990 more rows
# first step: check the column specification with
spec_csv(readr_example("challenge.csv"), guess_max = 1001)
# second step:
t <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
# Let us see:
head(t)
## # A tibble: 6 x 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
tail(t)
## # A tibble: 6 x 2
## x y
## <dbl> <date>
## 1 0.805 2019-11-21
## 2 0.164 2018-03-29
## 3 0.472 2014-08-04
## 4 0.718 2015-08-16
## 5 0.270 2020-02-04
## 6 0.608 2019-01-06
The case where the dependent variable is binary is very important, because it models any yes/no decision. In a corporate setting, many decisions will have that form, for example: go/no-go, yes/no, send/not_send, hire/not_hire, give_loan/not_give_loan, etc.
###########
# Model 1 #
###########
regr1 <- glm(formula = is_good ~ age + sexM,
family = binomial,
data = t)
##
## Call:
## glm(formula = is_good ~ age + sexM, family = binomial, data = t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.233 -1.159 -1.113 1.158 1.271
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.185179 0.262594 0.705 0.481
## age -0.002542 0.005968 -0.426 0.670
## sexM -0.169578 0.126664 -1.339 0.181
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1386.3 on 999 degrees of freedom
## Residual deviance: 1384.3 on 997 degrees of freedom
## AIC: 1390.3
##
## Number of Fisher Scoring iterations: 3
###########
# Model 2 #
###########
# make the same cut
t <- as_tibble(t) %>%
mutate(is_LF = if_else((age <= 35) & (sex == "F"), 1L, 0L)) %>%
mutate(is_HF = if_else((age > 50) & (sex == "F"), 1L, 0L)) %>%
mutate(is_LM = if_else((age <= 35) & (sex == "M"), 1L, 0L)) %>%
mutate(is_HM = if_else((age > 50) & (sex == "M"), 1L, 0L))
MSE2
## [1] 0.2249014
2. The model with matrix-bins performs better regardless of whether a binary or a continuous parameter is to be forecasted. Note in particular that
• model 1 has no stars any more (a general deterioration, as explained previously),
• model 2 still has three stars for each variable, with only one exception where we still have two stars.
Note also that, because we now have values of our dependent variable that are strictly equal to zero and others strictly equal to one, we can use family = binomial rather than quasibinomial.
One way to approach such a situation could be to split the population in two parts and make two models: one for each part of the population.
The reason why we try to predict a variable that is between 0 and 1 (and, linked to this, why we use a logistic regression) is related to the nature of the bins that we create. Since they combine two (or more) units,1 we cannot reasonably use only a binary variable as independent variable.
1 With units we mean things like meters, hours, minutes, dollars, number of people, produced units, etc. In our example we mix sex and age. In such a case, the only thing that can be meaningful is a binary value: belongs or does not belong to this bin.
This, in its turn, makes logistic regressions theoretically more compelling and practically easier to understand.
Note, though, that instead of using a logistic regression it is also possible to use a linear regression and then cap the results. This is not elegant and no longer has the link with the underlying odds. While the model might still make good predictions, performance measures like AUC will be more appropriate than, for example, the MSE.
library(InformationValue)
IV(X = factor(mtcars$vs), Y = factor(mtcars$am))
## [1] 0.1178631
## attr(,"howgood")
## [1] "Highly Predictive"
We can conclude that the engine shape (vs) is a good predictor for the gearbox type (automatic or manual) in the dataset mtcars.
One possible way of obtaining this is to create one object that holds the three matrices: for example, create an S3 object and return it.
# mcda_promethee_list
# delivers the preference flow matrices for the Promethee method
# Arguments:
# M -- decision matrix
# w -- weights
# piFUNs -- a list of preference functions,
# if not provided min(1,max(0,d)) is assumed.
# Returns (as side effect)
# phi_plus <<- rowSums(PI.plus)
# phi_min <<- rowSums(PI.min)
# phi_ <<- phi_plus - phi_min
#
mcda_promethee_list <- function(M, w, piFUNs='x')
{
if (piFUNs == 'x') {
# create a factory function:
makeFUN <- function(x) {x; function(x) max(0,x) }
P <- list()
for (k in 1:ncol(M)) P[[k]] <- makeFUN(k)
} # else, we assume a vector of functions is provided
# initializations
set.seed(1492)
M <- matrix (runif(9), nrow = 3)
w <- c(runif(3))
L <- mcda_promethee_list(M, w)
A response to a request to rate a service or product on a scale from 1 to 5 is on an ordinal scale – see Chapter B.2 “Ordinal Scale” on page 830. For an average to make sense, one needs an interval scale – see Chapter B.3 “Interval Scale” on page 831.
$(1 + r_y) = (1 + r_m)^{12}$
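Here r_y is the yearly and r_m the monthly interest rate. As a small illustration (not part of the original answer), the conversion in both directions can be written in R as:
yearly_from_monthly <- function(r_m) (1 + r_m)^12 - 1
monthly_from_yearly <- function(r_y) (1 + r_y)^(1/12) - 1
yearly_from_monthly(0.01)   # 0.1268..., i.e. about 12.68% per year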
Bibliography
(1793, May). Collection générale des décrets rendus par la convention nationale, date: May 8, 1793 (du 8 mai 1793). Available in Google Books Full View.
Adler, D. and S. T. Kelly (2018). vioplot: violin plot. R package version 0.3.0.
Andersen, E., N. Jensen, and N. Kousgaard (1987). Statistics for economics, business administra-
tion, and the social sciences. Springer-Verlag.
Artzner, P., F. Delbaen, J.-M. Eber, and D. Heath (1997). Thinking coherently. Risk 10(11), 68–71.
Artzner, P., F. Delbaen, J.-M. Eber, and D. Heath (1999). Coherent measures of risk. Mathemat-
ical finance 9(3), 203–228.
Auguie, B. (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version
2.3.
Badal, S. and J. K. Harter (2014). Gender diversity, business-unit engagement, and performance.
Journal of Leadership & Organizational Studies 21(4), 354–365.
Baesens, B. (2014). Analytics in a big data world: The essential guide to data science and its appli-
cations. John Wiley & Sons.
Barta, T., M. Kleiner, and T. Neumann (2012). Is there a payoff from top-team diversity. McKinsey
Quarterly 12, 65–66.
Bengio, Y. and Y. Grandvalet (2004). No unbiased estimator of the variance of k-fold cross-
validation. Journal of machine learning research 5(Sep), 1089–1105.
Berger, S., N. Graham, and A. Zeileis (2017, July). Various versatile variances: An object-oriented
implementation of clustered covariances in R. Working Paper 2017-12, Working Papers in Eco-
nomics and Statistics, Research Platform Empirical and Experimental Economics, Universität
Innsbruck.
Bernoulli, D. (1738). Specimen theoriae novae de mensura sortis. Comentarii Academiae Scien-
tiarum Imperialis Petropolitanae Tomus V, 175–192.
Bertsimas, D., G. Lauprete, and A. Samarov (2004). Shortfall as a risk measure: properties, opti-
mization and applications. Journal of Economic Dynamics and Control 28(7), 1353–1381.
Black, F. and M. Scholes (1973). The pricing of options and corporate liabilities. The journal of
political economy, 637–654.
Botev, Z. I., J. F. Grotowski, D. P. Kroese, et al. (2010). Kernel density estimation via diffusion.
The Annals of Statistics 38(5), 2916–2957.
Burington, R. S. et al. (1973). Handbook of mathematical tables and formulas. McGraw-Hill New
York.
Chang, W. and B. Borges Ribeiro (2018). shinydashboard: Create Dashboards with ‘Shiny’. R
package version 0.7.1.
Chang, W., J. Cheng, J. Allaire, Y. Xie, and J. McPherson (2019). shiny: Web Application Frame-
work for R. R package version 1.3.2.
Chang, W. and H. Wickham (2018). ggvis: Interactive Grammar of Graphics. R package version
0.4.4.
Chen, S. (2008). Nonparametric estimation of expected shortfall. Journal of Financial Econometrics 6(1), 87.
Cohen, M. B., S. Elder, C. Musco, C. Musco, and M. Persu (2015). Dimensionality reduction for
k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM
symposium on Theory of computing, pp. 163–172. ACM.
Cyganowski, S., P. Kloeden, and J. Ombach (2001). From elementary probability to stochastic differential equations with MAPLE®. Springer Science & Business Media.
Ding, C. and X. He (2004). K-means clustering via principal component analysis. In Proceedings
of the twenty-first international conference on Machine learning, pp. 29. ACM.
Eddelbuettel, D. and J. J. Balamuta (2017, Aug). Extending R with C++: A Brief Introduction to Rcpp. PeerJ Preprints 5, e3188v1.
Eddelbuettel, D. and J. J. Balamuta (2018). Extending R with C++: A brief introduction to Rcpp. The American Statistician 72(1), 28–36.
Feinerer, I. and K. Hornik (2018). tm: Text Mining Package. R package version 0.7-6.
Feinerer, I., K. Hornik, and D. Meyer (2008, March). Text mining infrastructure in R. Journal of Statistical Software 25(5), 1–54.
Fermanian, J. and O. Scaillet (2005). Sensitivity analysis of var and expected shortfall for portfo-
lios under netting agreements. Journal of Banking & Finance 29(4), 927–958.
Fox, J., S. Weisberg, and B. Price (2018). carData: Companion to Applied Regression Data Sets. R
package version 3.0-2.
Fritsch, S., F. Guenther, and M. N. Wright (2019). neuralnet: Training of Neural Networks. R
package version 1.44.2.
Garnier, S. (2018). viridisLite: Default Color Maps from ‘matplotlib’ (Lite Version). R package
version 0.3.0.
Gelman, A. and J. Hill (2011). Opening windows to the black box. Journal of Statistical Software 40.
Gesmann, M. and D. de Castillo (2011, December). googleVis: Interface between R and the Google Visualisation API. The R Journal 3(2), 40–44.
Goldratt, E. M., J. Cox, and D. Whitford (1992). The goal: a process of ongoing improvement,
Volume 2. North River Press Great Barrington, MA.
Grolemund, G. (2014). Hands-On Programming with R: Write Your Own Functions and Simula-
tions. “ O’Reilly Media, Inc.”.
Grolemund, G. and H. Wickham (2011). Dates and times made easy with lubridate. Journal of
Statistical Software 40(3), 1–25.
Hall, P., J. Marron, and B. U. Park (1992). Smoothed cross-validation. Probability Theory and
Related Fields 92(1), 1–20.
Hara, A. and Y. Hayashi (2012). Ensemble neural network rule extraction using re-rx algorithm.
In Neural Networks (IJCNN), The 2012 International Joint Conference on, pp. 1–6. IEEE.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 862
❦
862 Bibliography
Harrell Jr, F. E., with contributions from Charles Dupont, and many others. (2019).Hmisc: Har-
rell Miscellaneous. R package version 4.2-0.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The elements of statistical learning. Springer.
Hendricks, P. (2015). titanic: Titanic Passenger Survival Data Set. R package version 0.1.0.
Hintze, J. L. and R. D. Nelson (1998). Violin plots: a box plot-density trace synergism.The Amer-
ican Statistician 52(2), 181–184.
Hyndman, R. J. and Y. Khandakar (2008). Automatic time series forecasting: the forecast pack-
age for R. Journal of Statistical Software 26(3), 1–22.
Iannone, R., J. Allaire, and B. Borges (2018). flexdashboard: R Markdown Format for Flexible
Dashboards. R package version 0.5.1.1.
Izrailev, S. (2015). binr: Cut Numeric Values into Evenly Distributed Groups. R package
version 1.1.
Jackson, C. H. (2011). Multi-state models for panel data: The msm package for R. Journal of
Statistical Software 38(8), 1–29.
❦ ❦
Jacobsson, H. (2005). Rule extraction from recurrent neural networks: Ataxonomy and review.
Neural Computation 17(6), 1223–1263.
James, G., D. Witten, T. Hastie, and R. Tibshirani (2013). An introduction to statistical learning,
Volume 112. Springer.
Jones, C., J. Marron, and S. Sheather (1996a). Progress in data-based bandwidth selection for
kernel density estimation. Computational Statistics (11), 337–381.
Jones, M. C., J. S. Marron, and S. J. Sheather (1996b). A brief survey of bandwidth selection for
density estimation. Journal of the American Statistical Association 91(433), 401–407.
Kaplan, R. S. and D. P. Norton (2001a). Transforming the balanced scorecard from performance
measurement to strategic management: Part i. Accounting horizons 15(1), 87–104.
Kaplan, R. S. and D. P. Norton (2001b). Transforming the balanced scorecard from performance
measurement to strategic management: Part ii. Accounting Horizons 15(2), 147–160.
Karimi, K., N. G. Dickson, and F. Hamze (2010). A performance comparison of cuda and opencl.
arXiv preprint arXiv:1005.2581.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 863
❦
Bibliography 863
Keylock, C. (2005). Simpson diversity and the shannon–wiener index as special cases of a gen-
eralized entropy. Oikos 109(1), 203–207.
Killough, L. N. and W. E. Leininger (1977). Cost Accounting for Managerial Decision Making.
Dickenson Publishing Company.
Kondratieff, N. and W. Stolper (1935). The long waves in economic life. The Review of Economics
and Statistics 17(6), 105–115.
Kondratieff, N. D. (1979). The long waves in economic life. Review (Fernand Braudel Center),
519–562.
Kowarik, A. and M. Templ (2016). Imputation with the R package VIM. Journal of Statistical
Software 74(7), 1–16.
Lamport, L. (1994). LATEX: a document preparation system: user’s guide and reference manual.
Addison-wesley.
Lang, D. T. and the CRAN team (2019). RCurl: General Network (HTTP/FTP/...) Client Interface
for R. R package version 1.95-4.12.
❦ ❦
Lang, D. T. and the CRAN Team (2019). XML: Tools for Parsing and Generating XML Within R
and S-Plus. R package version 3.98-1.19.
Lê, S., J. Josse, and F. Husson (2008). FactoMineR: A package for multivariate analysis. Journal
of Statistical Software 25(1), 1–18.
Liaw, A. and M. Wiener (2002). Classification and regression by randomforest. R News 2(3),
18–22.
Liker, J. and G. L. Convis (2011). The Toyota way to lean leadership: Achieving and sustaining
excellence through leadership development. McGraw Hill Professional.
Little, R. J. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic
Statistics 6(3), 287–296.
Luraschi, J., K. Kuo, K. Ushey, J. Allaire, and The Apache Software Foundation (2019). sparklyr:
R Interface to Apache Spark. R package version 1.0.2.
Mackay, C. (1841). Memoirs of extraordinary Popular Delusions and the Madness of Crowds (First
ed.). New Burlington Street, London, UK: Richard Bentley.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 864
❦
864 Bibliography
Maechler, M., P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik (2019). cluster: Cluster Analysis
Basics and Extensions. R package version 2.0.8 — For new features, see the ‘Changelog’ file (in
the package source).
Marr, B. (2016). Key Business Analytics: The 60+ Business Analysis Tools Every Manager Needs
To Know. Pearson UK.
Meschiari, S. (2015). latex2exp: Use LaTeX Expressions in Plots. R package version 0.4.0.
Meyer, D., E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch (2019). e1071: Misc Functions
of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package
version 1.7-0.1.
Mikosch, T. (1998). Elementary stochastic calculus with finance in view. World Scientific Pub Co
Inc.
Milborrow, S. (2019). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R pack-
age version 3.0.7.
Monsen, R. J. and A. Downs (1965). A theory of large managerial firms. The Journal of Political
❦ Economy, 221–236. ❦
Mossin, J. (1968). Optimal multiperiod portfolio policies. Journal of Business 41, 205–225.
Murtagh, F. and P. Legendre (2011). Ward’s hierarchical clustering method: Clustering criterion
and agglomerative algorithm. arXiv preprint arXiv:1111.6285.
Neter, J., W. Wasserman, and G. Whitmore (1988). Applied statistics. New York: Allyn and Bacon.
Norreklit, H. (2000). The balance on the balanced scorecard a critical analysis of some of its
assumptions. Management accounting research 11(1), 65–88.
Ooms, J. (2017). sodium: A Modern and Easy-to-Use Crypto Library. R package version 1.1.
Ooms, J., D. James, S. DebRoy, H. Wickham, and J. Horner (2018). RMySQL: Database Interface
and ‘MySQL’ Driver for R. R package version 0.10.15.
Parzen, E. (1962).On estimation of a probability density function and mode.The annals of math-
ematical statistics, 1065–1076.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 865
❦
Bibliography 865
Provost, F. and T. Fawcett (2013). Data Science for Business: What you need to know about data
mining and data-analytic thinking. “ O’Reilly Media, Inc.”.
R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria:
R Foundation for Statistical Computing.
R Special Interest Group on Databases (R-SIG-DB), H. Wickham, and K. Müller (2018). DBI:
R Database Interface. R package version 1.0.0.
Raiche, G. (2010). an R package for parallel analysis and non graphical solutions to the Cattell
scree test. R package version 2.3.3.
Reichheld, F. F. (2003).The one number you need to grow.Harvard business review 81(12), 46–55.
Revelle, W. (2018). psych: Procedures for Psychological, Psychometric, and Personality Research.
Evanston, Illinois: Northwestern University. R package version 1.8.12.
Robin, X., N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Müller (2011).
proc: an open-source package for r and s+ to analyze and compare roc curves. BMC Bioinfor-
matics 12, 77.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian
Journal of Statistics, 65–78.
❦ Rupp, K., P. Tillet, F. Rudolf, J. Weinbub, T. Grasser, and A. JÃijngel (2016-10-27). Viennacl- ❦
linear algebra library for multi- and many-core architectures. SIAM Journal on Scientific Com-
puting.
Russell, S. J. and P. Norvig (2016). Artificial intelligence: a modern approach. Malaysia; Pearson
Education Limited.
Ryan, J. A. and J. M. Ulrich (2018). xts: eXtensible Time Series. R package version 0.11-2.
Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley
& Sons.
Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden-unit splitting.
Neural Computation 9(1), 205–225.
Setiono, R., B. Baesens, and C. Mues (2008). Recursive neural network rule extraction for data
with mixed attributes. IEEE Transactions on Neural Networks 19(2), 299–307.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 866
❦
866 Bibliography
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of
risk. Journal of Finance 19(3), 425–442.
Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for
kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), 683–
690.
Simonoff, J. S. (2012). Smoothing methods in statistics. Springer Science & Business Media.
Sing, T., O. Sander, N. Beerenwinkel, and T. Lengauer (2005). Rocr: visualizing classifier perfor-
mance in r. Bioinformatics 21(20), 7881.
Smith, B. M. (2004). A history of the global stock market: from ancient Rome to Silicon Valley.
University of Chicago press.
Soetaert, K. (2017a). diagram: Functions for Visualising Simple Graphs (Networks), Plotting Flow
Diagrams. R package version 1.6.4.
Szekely, G. J. and M. L. Rizzo (2005). Hierarchical clustering via joint between-within distances:
Extending ward’s minimum variance method. Journal of classification 22(2), 151–183.
Tang, Y., M. Horikoshi, and W. Li (2016). ggfortify: Unified interface to visualize statistical result
of popular r packages. The R Journal 8.
Thaler, R. H. (2016). Behavioral economics: Past, present and future. Present and Future (May
27, 2016).
Therneau, T. and B. Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R pack-
age version 4.1-15.
Tickle, A. B., R. Andrews, M. Golea, and J. Diederich (1998). The truth will come to light:
Directions and challenges in extracting the knowledge embedded within trained artificial neural
networks. IEEE Transactions on Neural Networks 9(6), 1057–1068.
Tierney, L. and R. Jarjour (2016).proftools: Profile Output Processing Tools for R.R package version
0.99-2.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 867
❦
Bibliography 867
Treynor, J. L. (1962). Toward a theory of market value of risky assets. A final version was pub-
lished in 1999, in Asset Pricing and Portfolio Performance: Models, Strategy and Performance
Metrics. Robert A. Korajczyk (editor) London: Risk Books, pp. 15–22.
Ushey, K., J. Hester, and R. Krzyzanowski (2017). rex: Friendly Regular Expressions. R package
version 1.1.2.
Van Tilborg, H. C. and S. Jajodia (2014). Encyclopedia of cryptography and security. Springer
Science & Business Media.
Venables, W. N. and B. D. Ripley (2002a). Modern Applied Statistics with S (Fourth ed.). New
York: Springer. ISBN 0-387-95457-0.
Venables, W. N. and B. D. Ripley (2002b). Modern Applied Statistics with S (Fourth ed.). New
York: Springer. ISBN 0-387-95457-0.
Venkataraman, S., X. Meng, F. Cheung, and The Apache Software Foundation (2019). SparkR:
R Front End for ‘Apache Spark’. R package version 2.4.3.
Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical
Software 40(1), 1–29.
Wickham, H. (2015). R packages: organize, test, document, and share your code. “ O’Reilly Media,
Inc.”.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wickham, H. (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1.
Wickham, H. (2018a). profr: An Alternative Display for Profiling Information. R package version
0.3.3.
Wickham, H. (2018b). pryr: Tools for Computing on the Language. R package version 0.1.4.
Wickham, H. (2019).stringr: Simple, Consistent Wrappers for Common String Operations.R pack-
age version 1.4.0.
Wickham, H. et al. (2014). Tidy data. Journal of Statistical Software 59(10), 1–23.
Wickham, H., P. Danenberg, and M. Eugster (2018). roxygen2: In-Line Documentation for R. R
package version 6.1.1.
❦
Trim Size: 8.5in x 10.87in DeBrouwer - 08/26/2020 1:26pm Page 868
❦
868 Bibliography
Wickham, H. and G. Grolemund (2016). R for data science: import, tidy, transform, visualize, and
model data. “ O’Reilly Media, Inc.”.
Wickham, H., J. Hester, and W. Chang (2019). devtools: Tools to Make Developing R Packages
Easier. R package version 2.1.0.
Wolfgang, P. and J. Baschnagel (1999). Stochastic Processes: From Physics to Finance. Springer.
Xie, Y. (2015). Dynamic Documents with R and knitr (2nd ed.). Boca Raton, Florida: Chapman
and Hall/CRC. ISBN 978-1498716963.
Xie, Y. (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package
version 1.20.
Zeileis, A. (2004). Econometric computing with HC and HAC covariance matrix estimators.
Journal of Statistical Software 11(10), 1–17.
Zeileis, A. and G. Grothendieck (2005). zoo: S3 infrastructure for regular and irregular time
series. Journal of Statistical Software 14(6), 1–27.
Nomenclature
$DPR := \frac{D}{E}$, dividend payout ratio, page 619.
$\Phi(a)$ the preference flow between an alternative a and all others over all criteria; it equals $\Phi(a) = \sum_{x \in A} \sum_{j=1}^{k} \pi_j\left(f_j(a), f_j(x)\right)$, page 550.
$\Phi^-(a)$ the preference flow that indicates how much the other alternatives are preferred over alternative a; it equals $\Phi^-(a) = \frac{1}{k-1} \sum_{x \in A} \pi(x, a)$ in PROMethEE, page 544.
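To make the flow calculations concrete, here is a minimal R sketch, assuming the pairwise preferences have already been aggregated over the criteria into a matrix pi_mat with pi_mat[a, x] = π(a, x) and a zero diagonal; the function name promethee_flows and the toy matrix are illustrative only and not code from the book.

# Minimal sketch: positive, negative, and net preference flows from a
# pairwise preference matrix pi_mat (pi_mat[a, x] = pi(a, x)), assuming
# the diagonal is zero; the scaling mirrors the 1/(k-1) factor above.
promethee_flows <- function(pi_mat) {
  k <- nrow(pi_mat)                       # number of alternatives
  phi_plus  <- rowSums(pi_mat) / (k - 1)  # how much a is preferred over the others
  phi_minus <- colSums(pi_mat) / (k - 1)  # how much the others are preferred over a
  data.frame(phi_plus  = phi_plus,
             phi_minus = phi_minus,
             phi_net   = phi_plus - phi_minus)
}

# toy example with three alternatives
pi_mat <- matrix(c(0.0, 0.6, 0.3,
                   0.2, 0.0, 0.5,
                   0.4, 0.1, 0.0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
promethee_flows(pi_mat)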
$R_{RF}$ the risk-free return (in calculations one generally uses an average of past risk-free returns and not the actual risk-free return), page 610.
$ROE := \frac{E}{P}$, return on equity, page 619.
$C_\alpha(T)$ cost of complexity function for the tree T and pruning parameter $\alpha$, page 409.
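As an illustration of how this trade-off is used in practice, the following minimal R sketch prunes a classification tree with the rpart package; rpart's complexity parameter cp is a rescaled version of the pruning parameter $\alpha$, and the data set and cp value below are chosen only for illustration.

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                     # cp table with cross-validated error per tree size
pruned <- prune(fit, cp = 0.10)  # keep only the splits that are worth their complexity cost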
$d_2 := \frac{\log\left(\frac{S}{X}\right) + \left(r - \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}} = d_1 - \sigma\sqrt{\tau}$, page 650.
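A minimal R sketch of these quantities, with argument names chosen here for readability (S = spot, X = strike, r = risk-free rate, sigma = volatility, tau = time to maturity in years):

d1 <- function(S, X, r, sigma, tau) {
  (log(S / X) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
}
d2 <- function(S, X, r, sigma, tau) {
  d1(S, X, r, sigma, tau) - sigma * sqrt(tau)  # d2 = d1 - sigma * sqrt(tau)
}

d2(S = 100, X = 95, r = 0.02, sigma = 0.25, tau = 0.5)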
$f_{est}(x)$ the estimator for the probability density function $f(x)$, page 635.
$f_{est}(x; h)$ the estimator for the probability density function for a kernel density estimation with bandwidth h, page 635.
$K_h$ the kernel (of a kernel density estimation) with bandwidth h, page 635.
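A minimal R sketch of a kernel density estimate with an explicit bandwidth, using base R's density() with a Gaussian kernel; the simulated sample and the value of h are illustrative only:

set.seed(1879)
x <- rnorm(500)                       # toy sample; in practice, your data
h <- 0.3                              # the bandwidth h of the entries above
f_est <- density(x, bw = h, kernel = "gaussian")
plot(f_est, main = "Kernel density estimate with bandwidth h = 0.3")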
$M_j$ the ideal value for criterion j in the Goal Programming Method, page 560.
$N(x)$ the result vector of the WSM for alternative x, page 527.
$P(a)$ the “total score” of alternative a as used in the WPM, page 530.
$P(a, b)$ the “total score” of alternative a as used in the dimensionless WPM, page 530.
r the capitalization rate, $E[R_k]$, estimated via the CAPM, page 617.
$r_j$ a conversion factor that removes the unit and scales it, page 560.
$V_0$ the value of an asset at time 0 (now) = the present value = PV, page 598.
$w_j$ the scaling factor in the WSM; it removes the unit and scales (weights) the j-th criterion, page 527.
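A minimal R sketch of the WSM and WPM scores, assuming a decision matrix M (alternatives in rows, criteria in columns) that has already been made unit-free and weights w that sum to one; the numbers are purely illustrative:

M <- matrix(c(0.8, 0.6, 0.9,
              0.5, 0.9, 0.7,
              0.7, 0.7, 0.6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a1", "a2", "a3"), c("crit1", "crit2", "crit3")))
w <- c(0.5, 0.3, 0.2)                          # weights per criterion, summing to one
wsm <- as.vector(M %*% w)                      # weighted sum score, N(x)
wpm <- apply(M, 1, function(row) prod(row^w))  # weighted product score, P(a)
data.frame(alternative = rownames(M), wsm = wsm, wpm = wpm)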
Altiplano option combined with a coupon, which is only paid if the underlying security never reaches its strike price during a given period, page 679.
American an American option can be executed from the moment it is bought till its
maturity date, page 640.
Annapurna option where the payoff is paid only if all securities increase by a given amount, page 678.
Asian the strike or spot is determined by the average price of the underlying taken
at different moments, page 678.
Atlas option in which the best and worst performing securities are removed before
maturity, page 678.
ATM an option is at the money if its intrinsic value is zero, page 639.
Barrier Option generic term for knock-in and knock-out options, page 678.
Bermuda Option an option where the buyer has the right to exercise at a set (always discretely
spaced) number of times, page 679.
Call Option the right to buy an underlying asset at a pre-agreed price, page 638.
Canary Option can be exercised at quarterly dates, but not before a set time period has
elapsed, page 679.
cash settlement pay out the profit to the option buyer instead of delivering the underlying, page 639.
deliver provide the underlying to, or accept it from, the option buyer, page 639.
EBITDA earnings before interest, taxes, depreciation, and amortization, page 568.
European a European option can only be executed at its maturity date, page 640.
Everest option based on the worst-performing securities in the basket, page 678.
Himalayan option based on the performance of the best asset in the portfolio, page 674.
IERS International Earth Rotation and Reference Systems Service, page 314.
ITM an option is in the money if its intrinsic value is positive, page 641.
Knock-In this option only becomes active when a certain level (up or down) is reached, page 679.
long position the buyer of an option is said to have a long position on his books, page 640.
look-back the spot is determined as the best price over a number of moments, page 679.
maturity or “maturity date,” the expiry date of an option, that is, the last moment in time at which it can change value because of movements of the underlying, page 640.
OTM an option is out of the money if its intrinsic value is negative, page 641.
$PTCF = \frac{P}{CF}$, the price-to-cash-flow ratio, with CF = (free) cash flow, page 628.
$PTS = \frac{P}{S}$, the price-to-sales ratio, with S = sales, page 628.
Put the right to sell an underlying asset at a pre-agreed price, page 640.
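To tie the call/put and ATM/ITM/OTM entries together, a minimal R sketch follows; the function names are chosen here for illustration and are not from the book:

payoff <- function(spot, strike, type = c("call", "put")) {
  type <- match.arg(type)
  if (type == "call") max(spot - strike, 0) else max(strike - spot, 0)
}
moneyness <- function(spot, strike, type = c("call", "put")) {
  type <- match.arg(type)
  iv <- if (type == "call") spot - strike else strike - spot  # signed intrinsic value
  if (iv > 0) "ITM" else if (iv < 0) "OTM" else "ATM"
}

payoff(spot = 105, strike = 100, type = "call")    # 5
moneyness(spot = 95, strike = 100, type = "call")  # "OTM"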
Russian a look-back over the whole lifetime of the option, page 679.
short position the seller of an option is said to have a short position on his books, page 640.
spot price the current value of the underlying asset, that is, the price to be paid to buy the asset today and have it today, page 641.
strike the “execution price,” the price at which an option can be executed (e.g. for a call, the price at which the holder can buy the underlying), page 640.
Verde Option option that can be exercised at set dates, but not before a set time period has
elapsed, page 680.
Index
matrix(), 32 multiplication, 57
max(), 69 mutate(), 294
MCDA, 511 mutate_all(), 295
mclapply(), 743 mutate_at(), 295
mdy(), 315 mutate_if(), 295
mean, 139, 140 mutating join, 290
arithmetic, 139 MySQL, 79, 226, 253
generalized, 140
geometric, 141 Nasdaq, 496
harmonic, 141 National Association of Securities Dealers
holder, 141 Automated Quotations, 496
power, 141 NAV, 624
quadratic, 141 nchar(), 56
mean(), 69, 140 ndays(), 501
mean average deviation, 386 negative binomial distribution, 150
mean square error, 384 net asset value, 624
measure Net Free Cash Flow, 621
central tendency, 139 net operating assets, 578
measures of spread, 145 Net Operating Income After Taxes, 569
median, 142 net operating profit, 629
median absolute deviation, 145 Net Present Value, 600
memory limit, 761 Net Promoter Score, 594
net satisfaction score, 595
merge(), 53
neural network, 434
Methods(), 111
new(), 101, 108
methods(), 93, 98
New York Stock Exchange, 496
MI, 725
NewSQL, 217
mice(), 340
Next(), 503
mice, 339
niceness, 817
microbenchmark(), 795
nlevels(), 47
microbenchmark, 795
nls(), 382
minute(), 319
NN, 434
MIS, 584
NOA, 578, 623
missing at random, 339
nominal interest rate, 599
ML, 405 nominal scale, 829
MLlib, 785 non-linear regression, 381
mode, 143 NOP, 629
mode(), 143 NOPAT, 569
model normal distribution, 150
exponential, 209 NoSQL, 217
log-linear, 379 not, 59
Poisson, 379 not equal, 58
model performance, 384 NPS, 594
model validation, 492 NPV, 600, 622
modelData(), 504, 505 nScree(), 370
modelr, 126, 469, 483 NSS, 595
Monte Carlo Simulation, 632 nweeks(), 501
month(), 318 nyears(), 501
monthlyReturn(), 504 NYSE, 496
moving average, 200
MRAN, 23 OA, 577
MSE, 384 object databases, 217
mtcars, 48 object oriented, 87
multi criteria decision analysis, 511 object oriented programming, 113
multiple linear regression, 379 object-relational mappings, 216