Undergraduate Topics in Computer Science
Series Editor
Ian Mackie, University of Sussex, Brighton, UK
Advisory Editors
Samson Abramsky, Department of Computer Science, University of Oxford, Oxford, UK
Chris Hankin, Department of Computing, Imperial College London, London, UK
Mike Hinchey, Lero—The Irish Software Research Centre, University of Limerick, Limerick, Ireland
Dexter C. Kozen, Department of Computer Science, Cornell University, Ithaca, NY, USA
Hanne Riis Nielson, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
Steven S. Skiena, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Iain Stewart, Department of Computer Science, Durham University, Durham, UK
Joseph Migga Kizza, Engineering and Computer Science, University of Tennessee at
Chattanooga, Chattanooga, TN, USA
Roy Crole, School of Computing and Mathematics Sciences, University of Leicester,
Leicester, UK
Elizabeth Scott, Department of Computer Science, Royal Holloway University of
London, Egham, UK
‘Undergraduate Topics in Computer Science’ (UTiCS) delivers high-quality
instructional content for undergraduates studying in all areas of computing and
information science. From core foundational and theoretical material to final-year
topics and applications, UTiCS books take a fresh, concise, and modern approach
and are ideal for self-study or for a one- or two-semester course. The texts
are authored by established experts in their fields, reviewed by an international
advisory board, and contain numerous examples and problems, many of which
include fully worked solutions.
The UTiCS concept centers on high-quality, ideally and generally quite concise
books in softback format. For advanced undergraduate textbooks that are likely
to be longer and more expository, Springer continues to offer the highly regarded
Texts in Computer Science series, to which we refer potential authors.
Laura Igual · Santi Seguí
Introduction to Data Science
A Python Approach to Concepts,
Techniques and Applications
Second Edition
Laura Igual
Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain

Santi Seguí
Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain

With Contribution by

Jordi Vitrià
Universitat de Barcelona, Barcelona, Spain

Eloi Puertas
Universitat de Barcelona, Barcelona, Spain
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

In this era, where a huge amount of information from different fields is gathered
and stored, its analysis and the extraction of value have become one of the most
attractive tasks for companies and society in general. The design of solutions for
the new questions that have emerged from data requires multidisciplinary teams.
Computer scientists, statisticians, mathematicians, biologists, journalists, and soci-
ologists, as well as many others, are now working together to extract knowledge
from data. This new interdisciplinary field is called data science. The pipeline of
any data science project goes through asking the right questions, gathering data,
cleaning data, generating hypotheses, making inferences, visualizing data, assessing
solutions, etc.
Target Audiences
Parts of the presented materials have been used in the postgraduate course on Data
Science and Big Data at the Universitat de Barcelona. All contributing authors are
involved in this course.
This book can be used in any introductory data science course. The problem-based
approach adopted to introduce new concepts can be useful for beginners. The
implemented code solutions for different problems are a good set of exercises for
students. Moreover, this code can serve as a baseline when students face bigger
projects.
Supplemental Resources
This book is accompanied by a set of IPython Notebooks containing all the code
necessary to solve the practical cases in the book. The Notebooks can be found in
the following GitHub repository: https://github.com/DataScienceUB/introduction-datascience-python-book.
Acknowledgments
Contents
4 Statistical Inference
  4.1 Introduction
  4.2 Statistical Inference: The Frequentist Approach
  4.3 Measuring the Variability in Estimates
    4.3.1 Point Estimates
    4.3.2 Confidence Intervals
  4.4 Hypothesis Testing
    4.4.1 Testing Hypotheses Using Confidence Intervals
    4.4.2 Testing Hypotheses Using p-Values
  4.5 But, Is the Effect E Real?
  4.6 Conclusions
  References
5 Supervised Learning
  5.1 Introduction
  5.2 The Problem
  5.3 First Steps
  5.4 What Is Learning?
  5.5 Learning Curves
  5.6 Training, Validation and Test
  5.7 Two Learning Models
    5.7.1 Generalities Concerning Learning Models
    5.7.2 Support Vector Machines
    5.7.3 Random Forest
  5.8 Ending the Learning Process
  5.9 A Toy Business Case
  5.10 Conclusion
  References
6 Regression Analysis
  6.1 Introduction
  6.2 Linear Regression
    6.2.1 Simple Linear Regression Model
    6.2.2 Model Evaluation
    6.2.3 Practical Case 1: Sea Ice Data and Climate Change
    6.2.4 Polynomial Regression Model
    6.2.5 Regularization and Sparse Models
    6.2.6 Practical Case 2: Boston Housing Data and Price Prediction
  6.3 Logistic Regression
    6.3.1 Practical Case 3: Winning or Losing Football Team
  6.4 Conclusions
  References
Index
Authors and Contributors
Contributors
Dr. Jordi Vitrià is a Full Professor at the Department of Mathematics and Computer
Science at the Universitat de Barcelona. He received his Ph.D. degree from
the Universitat Autònoma de Barcelona in 1990. Dr. Jordi Vitrià has published
more than 100 papers in SCI-indexed journals and has more than 30 years of experience
working on Computer Vision, Machine Learning, Causal Inference, and
Artificial Intelligence and their applications to several fields. He is now the leader
of the “Data Science Group at Universitat de Barcelona”, a multidisciplinary technology
transfer unit that conveys results from scientific and technological research
to the market and society in general.
Dr. Eloi Puertas is an Assistant Professor in the Department of Mathematics and
Computer Science at the Universitat de Barcelona. He received his Computer Science
Engineering degree from the Universitat Autònoma de Barcelona (Spain) in 2002,
and his Ph.D. degree from the Universitat de Barcelona (Spain) in 2014.
His areas of interest include artificial intelligence, software engineering, and data
science.
Dr. Petia Radeva is a Full Professor at the Universitat de Barcelona. She graduated
in Applied Mathematics and Computer Science in 1989 at the University of Sofia,
Bulgaria, and received her Ph.D. degree on Computer Vision for Medical Imag-
ing in 1998 from the Universitat Autònoma de Barcelona, Spain. She has been
an ICREA Academia Researcher since 2015, head of the Consolidated Research
Group “Artificial Intelligence and Biomedical Applications”. Her present research
interests are on the development of learning-based approaches for computer vision,
deep learning, data-centric data analysis, food data analysis, egocentric vision, and
data science.
Dr. Oriol Pujol is a Full Professor at the Department of Mathematics and Com-
puter Science at the Universitat de Barcelona. He received his Ph.D. degree from
the Universitat Autònoma de Barcelona (Spain) in 2004 for his work in machine
learning and computer vision. His areas of interest include machine learning,
computer vision, and data science.
Dr. Sergio Escalera is a Full Professor at the Department of Mathematics and
Computer Science at the Universitat de Barcelona. He received his Computer Science
Engineering degree from the Universitat Autònoma de Barcelona (Spain) in 2003,
and his Ph.D. degree from the Universitat Autònoma de Barcelona (Spain)
in 2008. His research interests include, among others, statistical pattern recognition
and visual object recognition, with special interest in behavior analysis from
multi-modal data.
Francesc Dantí is an adjunct professor and system administrator in the Department
of Mathematics and Computer Science at the Universitat de Barcelona. He received
his Computer Science Engineering degree from the Universitat Oberta de Catalunya (Spain). His
particular areas of interest are HPC and grid computing, parallel computing, and
cybersecurity. Francesc Dantí is coauthor of Chap. 2.
1 Introduction to Data Science
You have, no doubt, already experienced data science in several forms. When you are
looking for information on the web by using a search engine or asking your mobile
phone for directions, you are interacting with data science products. Data science
has been behind resolving some of our most common daily tasks for several years.
Data science involves the application of scientific methods, algorithms, and sys-
tems to extract insights and knowledge from large volumes of data. It encompasses
various disciplines such as mathematics, statistics, computer science, and domain
expertise to analyze, interpret, and make informed decisions based on data.
Most of the scientific methods that power data science are not new and they have
been out there, waiting for applications to be developed, for a long time. Statistics is
an old science that stands on the shoulders of eighteenth-century giants such as Pierre
Simon Laplace (1749–1827) and Thomas Bayes (1701–1761). Machine learning has
made significant progress and can be considered a well-established discipline. While
relatively young compared with other branches of science and technology, machine
learning has gained prominence and achieved remarkable advancements in recent
years. Computer science changed our lives several decades ago and continues to do
so; but it cannot be considered new.
While data science itself encompasses scientific knowledge and methodologies,
its novelty and impact on society are indeed rooted in a disruptive change brought
about by the evolution of technology, specifically the concept of datification. Datification
is the process of rendering into data aspects of the world that have never
been quantified before. At the personal level, the list of datified concepts is very long
and still growing: business networks, the lists of books we are reading, the films we
enjoy, the food we eat, our physical activity, our purchases, our driving behavior, and
so on. Even our thoughts are datified when we publish them on our favorite social
network. At the business level, companies are datifying semi-structured data that
were previously discarded: web activity logs, computer network activity, machinery
signals, etc. Nonstructured data, such as written reports, e-mails or voice recordings,
are now being stored not only for archive purposes but also to be analyzed.

© Springer Nature Switzerland AG 2024. L. Igual and S. Seguí, Introduction to Data Science, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-031-48956-3_1
The rise of datification and the vast availability of data offer significant benefits
and opportunities. However, it is crucial to acknowledge and address the potential
dangers and challenges that accompany this phenomenon. Some of the concerns
include privacy and security risks, data bias and discrimination, lack of transparency
and accountability, and the perpetuation of social and economic disparities.
However, datification is not the only ingredient of the data science revolution.
The other ingredient is the democratization of data analysis. Large companies such
as Google, IBM, or SAS were the only players in this field when data science had
no name. At the beginning of the century, the huge computational resources of
those companies allowed them to take advantage of datification by using analytical
techniques to develop innovative products and even to take decisions about their own
business.
Currently, there is a decreasing analytical divide between companies utilizing
advanced data analytics and the rest of the world, including both companies and
individuals. This trend can be attributed to the widespread adoption of open-source
tools. However, it is important to note that generative AI, a field that involves creating
AI models capable of generating content such as text, images, or videos, raises
concerns about democratization. The access and use of generative AI technologies
may not be equally distributed, potentially leading to further disparities and unequal
opportunities in the application of this technology.
Data science is commonly defined as a methodology by which actionable insights
can be inferred from data. This is a subtle but important difference with respect to
previous approaches to data analysis, such as business intelligence or exploratory
statistics. Performing data science is a task with an ambitious objective: the produc-
tion of beliefs informed by data and to be used as the basis of decision-making. In
the absence of data, beliefs are uninformed and decisions, in the best of cases, are
based on best practices or intuition. The representation of complex environments by
rich data opens up the possibility of applying all the scientific knowledge we have
regarding how to infer knowledge from data.
In general, data science allows us to adopt four different strategies to explore the
world using data:
Data science is definitely a cool and trendy discipline that routinely appears in the
headlines of very important newspapers and on TV stations. Data scientists are
presented in those forums as a scarce and expensive resource. As a result of this
situation, data science can be perceived as a complex and scary discipline that is
only accessible to a reduced set of geniuses working for major companies. The main
purpose of this book is to demystify data science by describing a set of tools and
techniques that allows a person with basic skills in computer science, mathematics,
and statistics to perform the tasks commonly associated with data science.
To this end, this book has been written under the following assumptions:
• Data science is a complex, multifaceted field that can be approached from sev-
eral points of view: ethics, methodology, business models, how to deal with big
data, data engineering, data governance, etc. Each point of view deserves a long
and interesting discussion, but the approach adopted in this book focuses on an-
alytical techniques, because such techniques constitute the core toolbox of every
data scientist and because they are the key ingredient in predicting future events,
discovering useful patterns, and probing the world.
• You have some experience with Python programming. For this reason, we do not
offer an introduction to the language. But even if you are new to Python, this should
not be a problem. Before reading this book you should start with any online Python
course. Mastering Python is not easy, but acquiring the basics is a manageable task
for anyone in a short period of time.
• Data science is about evidence-based storytelling, and this kind of process requires
appropriate tools. The Python data science toolbox is one of the most developed
environments for doing data science, though not the only one. You can easily install all you
need by using Anaconda1: a free product that includes a programming language
(Python), an interactive environment to develop and present data science projects
(Jupyter notebooks), and most of the toolboxes necessary to perform data analysis.
• Learning by doing is the best approach to learn data science. For this reason all the
code examples and data in this book are available to download at https://github.com/DataScienceUB/introduction-datascience-python-book.
• Data science deals with solving real-world problems. So all the chapters in the
book include and discuss practical cases using real data.
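As a quick check that the environment these assumptions rely on is in place, the following sketch (assuming the standard scientific stack described above is installed) imports the main libraries and reports their versions:

```python
# Minimal environment check for the data science toolbox used in this book.
# Assumes NumPy, pandas, Matplotlib, and scikit-learn are installed
# (all are included in the Anaconda distribution).
import sys

import matplotlib
import numpy as np
import pandas as pd
import sklearn

# Report the interpreter and library versions in use.
print("Python      :", sys.version.split()[0])
print("NumPy       :", np.__version__)
print("pandas      :", pd.__version__)
print("Matplotlib  :", matplotlib.__version__)
print("scikit-learn:", sklearn.__version__)
```

If any import fails, that library is missing from the environment and should be installed before running the book's notebooks.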
This book includes three different kinds of chapters. The first kind is about Python
extensions. Python was originally designed to have a minimum number of data
objects (int, float, string, etc.); but when dealing with data, it is necessary to extend the
native set to more complex objects such as (numpy) numerical arrays or (pandas)
data frames. The second kind of chapter includes techniques and modules to perform
statistical analysis and machine learning. Finally, some chapters describe several
applications of data science, such as building recommenders or performing sentiment
analysis. The composition of these chapters was chosen to offer a panoramic
view of the data science field, but we encourage the reader to delve deeper into these
topics and to explore those topics that have not been covered: big data analytics and
more advanced mathematical and statistical methods (e.g., Bayesian statistics).
1 https://www.anaconda.com/.
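The gap between Python's native objects and these extended ones can be illustrated with a tiny comparison; a minimal sketch, with invented numbers, assuming NumPy and Pandas are installed:

```python
import numpy as np
import pandas as pd

# A native list: '*' repeats the sequence.
print([1, 2, 3] * 2)            # [1, 2, 3, 1, 2, 3]

# A NumPy array: '*' multiplies element-wise.
print(np.array([1, 2, 3]) * 2)  # [2 4 6]

# A pandas DataFrame: labeled, column-oriented data with built-in statistics.
df = pd.DataFrame({"year": [2022, 2023], "sales": [10, 15]})
print(df["sales"].mean())       # 12.5
```

The same operator means different things on each object, which is exactly why the extended types are worth learning.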
2 Data Science Tools
2.1 Introduction
In this chapter, first we introduce some of the cornerstone tools that data scientists
use. The toolbox of any data scientist, as for any kind of programmer, is an essential
ingredient for success and enhanced performance. Choosing the right tools can save
a lot of time and thereby allow us to focus on data analysis.
The most basic tool to decide on is which programming language we will use.
Many people use only one programming language in their entire life: the first and
only one they learn. For many, learning a new language is an enormous task that, if
at all possible, should be undertaken only once. The problem is that some languages
are intended for developing high-performance or production code, such as C, C++,
or Java, while others are more focused on prototyping code, among these the best
known are the so-called scripting languages: Ruby, Perl, and Python. So, depending
on the first language you learned, certain tasks will, at the very least, be rather tedious.
The main problem of being stuck with a single language is that many basic tools
simply will not be available in it, and eventually you will have either to reimplement
them or to create a bridge to use some other language just for a specific task.
In conclusion, you either have to be ready to change to the best language for each
task and then glue the results together, or choose a very flexible language with a rich
ecosystem (e.g., third-party open-source libraries). In this book we have selected
Python as the programming language.
Python1 is a mature programming language that also has excellent properties for
newbie programmers, making it ideal for people who have never programmed before.
Some of its most remarkable properties are easy-to-read code, suppression
of non-mandatory delimiters, dynamic typing, and dynamic memory usage. Python
is an interpreted language, so the code is executed immediately in the Python console
without needing the compilation step to machine language. Besides the Python
console (which comes included with any Python installation) you can find other interactive
consoles, such as IPython,2 which give you a richer environment in which to
execute your Python code.
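Dynamic typing, one of the properties just mentioned, can be seen directly in the console: the same name may be rebound to values of different types, and types are checked only at run time. A minimal sketch:

```python
# Dynamic typing: a name is not declared with a type;
# it can be rebound to values of different types at run time.
x = 42
print(type(x).__name__)   # int

x = "forty-two"
print(type(x).__name__)   # str

x = [4.0, 2.0]
print(type(x).__name__)   # list
```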
Currently, Python is one of the most flexible programming languages. One of its
main characteristics that makes it so flexible is that it can be seen as a multiparadigm
language. This is especially useful for people who already know how to program with
other languages, as they can rapidly start programming with Python in the same way.
For example, Java programmers will feel comfortable using Python as it supports
the object-oriented paradigm, or C programmers could mix Python and C code using
cython. Furthermore, for anyone who is used to programming in functional languages
such as Haskell or Lisp, Python also has basic statements for functional programming
in its own core library.
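As a small illustration of this multiparadigm nature, using only the core library, the same kind of task can be written in an object-oriented or a functional style (the class and values below are invented for illustration):

```python
from functools import reduce

# Object-oriented style: state and behavior bundled in a class.
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self  # returning self allows method chaining

print(Counter().add(2).add(3).total)        # 5

# Functional style: first-class functions from the core library.
squares = list(map(lambda n: n * n, [1, 2, 3, 4]))
print(squares)                              # [1, 4, 9, 16]
print(reduce(lambda a, b: a + b, squares))  # 30
```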
In summary, in this book we have decided to use the Python language because it is a
mature programming language, easy for newbies, and can be used as a specific
platform for data scientists, thanks to its large ecosystem of scientific libraries and
its active and vibrant community. Furthermore, a lot of documentation and courses
about Python already exist. Besides, Python is one of the languages that modern AI
programming assistants, like Copilot,3 best support. Other popular alternatives
to Python for data scientists are R and MATLAB/Octave.
The Python community is one of the most active programming communities with a
huge number of developed toolboxes. The most popular basic Python toolboxes for
data scientists are NumPy, SciPy, Pandas, Matplotlib, and Scikit-Learn.
1 https://www.python.org/downloads/.
2 http://ipython.org/install.html.
3 https://github.com/features/copilot.
2.3 Fundamental Python Libraries for Data Scientists
NumPy4 is the cornerstone toolbox for scientific computing with Python. NumPy
provides, among other things, support for multidimensional arrays with basic oper-
ations on them and useful linear algebra functions. Many toolboxes use the NumPy
array representations as an efficient basic data structure. Meanwhile, SciPy provides
a collection of numerical algorithms and domain-specific toolboxes, including sig-
nal processing, optimization, statistics, and much more. Another core toolbox is the
plotting library Matplotlib. This toolbox has many tools for data visualization.
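A minimal sketch of what NumPy provides, from vectorized operations on multidimensional arrays to the linear algebra functions mentioned above (the numbers are invented for illustration):

```python
import numpy as np

# A 2-D array with basic vectorized operations.
a = np.arange(6).reshape(2, 3)      # [[0 1 2], [3 4 5]]
print(a.mean(axis=0))               # column means: [1.5 2.5 3.5]

# Solve the linear system A x = b with NumPy's linear algebra module.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)                            # [2. 3.]
print(np.allclose(A @ x, b))        # True
```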
Scikit-learn5 is a machine learning library built on NumPy, SciPy, and Matplotlib.
Scikit-learn offers simple and efficient tools for common tasks in data analysis such
as classification, regression, clustering, dimensionality reduction, model selection,
and preprocessing. For deep learning tasks, other specific libraries exist, such as
TensorFlow,6 one of the first open-source deep learning frameworks, used in many
applications; Keras,7 a high-level neural network application programming interface;
or PyTorch,8 another deep learning framework, more flexible than the others.
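The typical Scikit-learn workflow (split the data, fit an estimator, evaluate it on held-out data) can be sketched as follows; the dataset and classifier chosen here are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a classifier and evaluate it on unseen data.
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

The same fit/predict pattern applies to every estimator in the library, which is what makes swapping models so easy.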
Pandas9 provides high-performance data structures and data analysis tools. The key
feature of Pandas is a fast and efficient DataFrame object for data manipulation with
integrated indexing. The DataFrame structure can be seen as a spreadsheet which
offers very flexible ways of working with it. You can easily transform any dataset
in the way you want, by reshaping it and adding or removing columns or rows.
It also provides high-performance functions for aggregating, merging, and joining
datasets. Pandas also has tools for importing and exporting data from different for-
mats: comma-separated value (CSV), text files, Microsoft Excel, SQL databases,
and the fast HDF5 format. In many situations, the data you have in such formats will
not be complete or totally structured. For such cases, Pandas offers handling of miss-
ing data and intelligent data alignment. Furthermore, Pandas provides a convenient
Matplotlib interface.
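A few of these Pandas features (the DataFrame, missing-data handling, and merging) in a minimal sketch; the tables and column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# A DataFrame with a missing value.
sales = pd.DataFrame({"city": ["Barcelona", "Madrid", "Sevilla"],
                      "units": [120, np.nan, 80]})

# Handle missing data: fill the gap with the column mean.
sales["units"] = sales["units"].fillna(sales["units"].mean())
print(sales)

# Merge with a second table on a shared key, as in a database join,
# then aggregate per group.
regions = pd.DataFrame({"city": ["Barcelona", "Madrid", "Sevilla"],
                        "region": ["Catalonia", "Madrid", "Andalusia"]})
merged = sales.merge(regions, on="city")
print(merged.groupby("region")["units"].sum())
```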
4 http://www.scipy.org/scipylib/download.html.
5 https://scikit-learn.org/.
6 https://www.tensorflow.org/.
7 https://keras.io/.
8 https://pytorch.org/.
9 http://pandas.pydata.org/getpandas.html.
Before we can get started on solving our own data-oriented problems, we will need to
set up our programming environment. The first question we need to answer concerns
the Python language itself: Python is evolving continuously, and it is important to
stay up to date with the latest version. Once we have installed the latest Python version, the next
thing to decide is whether we want to install the data science Python ecosystem on
our local system, or to use it directly from the cloud. For newbies, the second option
is recommended.
However, if a standalone installation is chosen, the Anaconda Python distribu-
tion10 is then a good option. The Anaconda distribution provides integration of
all the Python toolboxes and applications needed for data scientists into a single
directory without mixing it with other Python toolboxes installed on the machine.
It contains, of course, the core toolboxes and applications such as NumPy, Pandas,
SciPy, Matplotlib, Scikit-learn, IPython, Spyder, etc., but also more specific tools
for other related tasks such as data visualization, code optimization, and big data
processing.
For any programmer, and by extension, for any data scientist, the integrated devel-
opment environment (IDE) is an essential tool. IDEs are designed to maximize
programmer productivity. Thus, over the years this software has evolved in order to
make the coding task less complicated. Choosing the right IDE for each person is
crucial and, unfortunately, there is no “one-size-fits-all” programming environment.
The best solution is to try the most popular IDEs among the community and keep
whichever fits better in each case.
In general, the basic pieces of any IDE are three: the editor, the compiler (or
interpreter), and the debugger. Some IDEs can be used with multiple programming
languages, via language-specific plugins, such as Netbeans,11 Eclipse,12 or
Visual Studio Code.13 Others are specific to one language or even to a particular
programming task. In the case of Python, there are a large number of specific IDEs:
PyCharm,14 WingIDE,15 Spyder,16 and Eric.17
10 http://continuum.io/downloads.
11 https://netbeans.org/downloads/.
12 https://eclipse.org/downloads/.
13 https://code.visualstudio.com/.
14 https://www.jetbrains.com/pycharm/.
15 https://wingware.com/.
16 https://github.com/spyder-ide/spyder.
17 Eric https://eric-ide.python-projects.org/.
2.6 Useful Resources for Data Scientists
With the advent of web applications, a new generation of IDEs for interactive
languages such as Python has been developed. Starting in the academic and
e-learning communities, web-based IDEs were developed so that not only your code
but also your whole environment and execution history can be stored on a server.
Nowadays, such sessions are called notebooks, and they are used not only in
classrooms but also to show results in presentations or on business dashboards.
The recent spread of such notebooks is mainly due to IPython. Since December
2011, IPython has offered a browser-based version of its interactive console,
called the IPython notebook, which displays Python execution results clearly and
concisely by means of cells. Cells can contain content other than code. For example,
markdown (a wiki text language) cells can be added to introduce algorithms and text.
It is also possible to insert Matplotlib graphics to illustrate examples, or even web
pages. Recently, some scientific journals have started to accept notebooks as a way
to present experimental results, complete with their code and data sources. In this
way, experiments become fully replicable.
As the project grew, the IPython notebook was separated from the IPython software
and became part of a larger project: Jupyter.18 Jupyter (for Julia, Python, and R)
aims to reuse the same web IDE for all these interpreted languages, not just Python.
Old IPython notebooks are automatically imported when they are opened with the
Jupyter platform; but once they are converted to the new version, they cannot be
used again in old IPython notebook versions.
Nowadays, Jupyter notebooks are supported by the major cloud computing providers.
For example, Google released Colaboratory, also known as Colab.19 With Colab you
can open any Jupyter notebook from your Google Drive and execute it on remote
Google servers. It can be used for free, and free GPU instances are even available;
paid plans exist if more powerful machines are needed. In this book, all the
examples use the Jupyter notebook style.
Besides libraries and frameworks, there are other resources that can be very useful
for data scientists in their everyday tasks:
. Data Sources: Access to high-quality data is crucial. Relevant data sets can be
found in research data repositories, in open data sources such as governmental data
portals, or through private companies’ APIs (Application Programming Interfaces) to access
18 http://jupyter.readthedocs.org/en/latest/install.html.
19 https://colab.research.google.com/.
their data. Websites like Zenodo,20 Eurostat,21 and academic data repositories
such as the UCI22 data sets are excellent starting points.
. Version Control Systems: Tools like Git23 and platforms like GitHub24 are crucial
for source code version control and collaboration on data science projects. Not
only code needs version control; data scientists also need to version experiments,
results, data, and even models. For this purpose, data and experiment version
control tools such as DVC25 exist.
. Online Courses and Tutorials: Websites like Coursera, edX, and Udacity offer
courses on data science topics, including machine learning, data analysis, and
more.
. GitHub Repositories: GitHub hosts numerous open-source data science projects
and repositories where you can find code, datasets, and useful resources.
. Data Science Competitions: Platforms like Kaggle host data science competitions
where you can practice your skills, learn from others, and even win prizes.
Throughout this book, we will come across many practical examples. In this chapter,
we will see a very basic example to help get started with a data science ecosystem
from scratch. To execute our examples, we will use Jupyter notebook, although any
other console or IDE can be used.
Once we have set up our programming environment, we can launch the Jupyter
notebook platform by clicking on the Jupyter Notebook icon installed by Anaconda
in the start menu or on the desktop.
The browser will immediately be launched, displaying the Jupyter notebook
homepage, whose URL is http://localhost:8888/tree. Note that a special port is used;
by default it is 8888. As can be seen in Fig. 2.1, this initial page displays a tree view
of a directory. The root directory is the current user directory. If you prefer to use
a cloud computing provider like Google Colab, just log into the service and start
using it right away.
20 https://zenodo.org/.
21 https://ec.europa.eu/eurostat.
22 https://archive.ics.uci.edu/.
23 https://git-scm.com/.
24 https://github.com/.
25 https://dvc.org/.
2.7 Get Started with Python and Pandas
Fig. 2.1 IPython notebook home page, displaying a home tree directory
Now, to start a new notebook, we only need to press the New → Python 3 button
at the top right of the home page.
As can be seen in Fig. 2.2, a blank notebook is created called Untitled.
First of all, we are going to change the name of the notebook to something
more appropriate. To do this, just click on the notebook name and rename it:
DataScience-GetStartedExample.
Let us begin by importing the toolboxes we will need for our program. In the
first cell we put the code to import the Pandas library as pd. This is for convenience:
every time we need some functionality from the Pandas library, we will write
pd instead of pandas. We will also import the two core libraries mentioned above:
the NumPy library as np and the Matplotlib library as plt.
In []:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
To execute just one cell, we press the run button, click on Cell → Run, or press
Ctrl + Enter. While execution is underway, the header of the cell shows the *
mark:
In [*]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
While a cell is being executed, no other cell runs. If you try to execute
another cell, its execution will not start until the first cell has finished.
Once the execution is finished, the header of the cell is replaced by its execution
number. Since this is the first cell executed, the number shown will
be 1. If the libraries are imported correctly, no output cell is produced.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
For simplicity, the remaining chapters of this book omit these imports.
The key data structure in Pandas is the DataFrame object. A DataFrame is basically
a tabular data structure, with rows and columns. Rows have a specific index to access
them, which can be any name or value. In Pandas, the columns are called Series,
a special type of data consisting of a list of values, where each value has an
index. Therefore, the DataFrame data structure can be seen as a spreadsheet, but
it is much more flexible. To understand how it works, let us see how to create a
DataFrame from a common Python dictionary of lists. First, we will create a new
cell by clicking Insert → Insert Cell Below or pressing Ctrl + M, B. Then, we
write in the following code:
In [2]:
data = {’year’: [
2010, 2011, 2012,
2010, 2011, 2012,
2010, 2011, 2012
],
’team’: [
’FCBarcelona’, ’FCBarcelona’,
’FCBarcelona’, ’RMadrid’,
’RMadrid’, ’RMadrid’,
’ValenciaCF’, ’ValenciaCF’,
’ValenciaCF’
],
’wins’: [30, 28, 32, 29, 32, 26, 21, 17, 19],
’draws’: [6, 7, 4, 5, 4, 7, 8, 10, 8],
’losses’: [2, 3, 2, 4, 2, 5, 9, 11, 11]
}
pd.DataFrame(data, columns = [
’year’, ’team’, ’wins’, ’draws’, ’losses’
]
)
In this example, we use the Pandas DataFrame constructor with a dictionary
of lists as its argument. The key of each entry in the dictionary is the name of a
column, and the list is its values.
The DataFrame columns can be arranged at construction time by entering the
columns keyword with a list of the column names in the desired order. If the
columns keyword is not present in the constructor, the columns will be arranged in
alphabetical order. Now, if we execute this cell, the result will be a table like this:
Out[2]: year team wins draws losses
0 2010 FCBarcelona 30 6 2
1 2011 FCBarcelona 28 7 3
2 2012 FCBarcelona 32 4 2
3 2010 RMadrid 29 5 4
4 2011 RMadrid 32 4 2
5 2012 RMadrid 26 7 5
6 2010 ValenciaCF 21 8 9
7 2011 ValenciaCF 17 10 11
8 2012 ValenciaCF 19 8 11
where each entry in the dictionary is a column. The index of each row is created
automatically taking the position of its elements inside the entry lists, starting from
0. Although it is very easy to create DataFrames from scratch, most of the time what
we will need to do is import chunks of data into a DataFrame structure, and we will
see how to do this in later examples.
Apart from DataFrame creation, Pandas offers many functions to manipulate
DataFrames. Among other things, it offers functions for aggregation,
manipulation, and transformation of the data. In the following sections, we will
introduce some of these functions.
To illustrate how we can use Pandas on a simple real problem, we will start by doing
some basic analysis of government data. For the sake of transparency, data produced
by government entities must be open, meaning that they can be freely used,
reused, and distributed by anyone. An example of this is Eurostat, which is the
home of European Commission data. Eurostat’s main role is to process and publish
comparable statistical information at the European level. The data in Eurostat are
provided by each member state, and it is free to reuse them for both noncommercial
and commercial purposes (with some minor exceptions).
Since the amount of data in the Eurostat database is huge, in this first study we
will focus only on data relative to indicators of educational funding by the
member states. Thus, the first thing to do is to retrieve such data from Eurostat.
Since open data have to be delivered in a plain text format, CSV (or any other
delimiter-separated value) formats are commonly used to store tabular data. In a
delimiter-separated value file, each line is a data record, and each record consists of
one or more fields separated by the delimiter character (usually a comma). Therefore,
the data we will use can be found, already processed, in the book’s GitHub repository
as the educ_figdp_1_Data.csv file.
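To make the delimiter-separated format concrete, here is a small sketch that parses a couple of made-up rows (not the real Eurostat data) from a CSV string with Pandas, using a colon as the missing-value marker just as the file above does:

```python
import io

import pandas as pd

# Two invented records; ":" marks a missing value, as in the Eurostat file
csv_text = "TIME,GEO,Value\n2010,Spain,4.2\n2011,Spain,:\n"
df = pd.read_csv(io.StringIO(csv_text), na_values=":")
print(df.shape)  # (2, 3)
```

The same na_values mechanism is used below when reading the real file.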
Reading
Let us start by reading the data we downloaded. First of all, we have to create a new
notebook called Open Government Data Analysis and open it. Then, after
ensuring that the educ_figdp_1_Data.csv file is stored in the path expected
by the code (here, a files/ch02 directory relative to the notebook), we will
write the following code to read and show the content:
In [1]:
edu = pd.read_csv(’files/ch02/educ_figdp_1_Data.csv’,
na_values = ’:’,
usecols = ["TIME","GEO","Value"])
edu
The na_values parameter sets the character that represents “not available data” in
the file. Normally, CSV files have a header with the names of the columns. If this
is the case, we can use the usecols parameter to select which columns in the file
will be used.
In this case, the DataFrame resulting from reading our data is stored in edu. The
output of the execution shows that the edu DataFrame size is 384 rows × 3 columns.
Since the DataFrame is too large to be fully displayed, three dots appear in the middle
of each column.
Besides this, Pandas also has functions for reading files in formats such as Excel,
HDF5, tabulated files, or even the content of the clipboard (read_excel(),
read_hdf(), read_table(), read_clipboard()). Whichever function
we use, the result of reading a file is stored as a DataFrame structure.
To see how the data looks, we can use the head() method, which shows just the
first five rows. If we use a number as an argument to this method, this will be the
number of rows that will be listed:
In [2]:
edu.head()
Similarly, the tail() method returns the last five rows by default.
In [3]:
edu.tail()
If we want to know the names of the columns or the names of the indexes, we
can use the DataFrame attributes columns and index respectively. The names of
the columns or indexes can be changed by assigning a new list of the same length to
these attributes. The values of any DataFrame can be retrieved as a Python array by
calling its values attribute.
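As a sketch of these attributes on a small made-up DataFrame (not the edu data), renaming columns by assignment and extracting the underlying array look like this:

```python
import pandas as pd

df = pd.DataFrame({"TIME": [2000, 2001], "Value": [4.2, 4.3]})
print(list(df.columns))         # the column names: ['TIME', 'Value']
df.columns = ["time", "value"]  # the new list must have the same length
arr = df.values                 # the cell values as a plain NumPy array
print(arr.shape)                # (2, 2)
```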
If we just want quick statistical information on all the numeric columns in a
DataFrame, we can use the function describe(). The result shows the count, the
mean, the standard deviation, the minimum and maximum, and the percentiles, by
default, the 25th, 50th, and 75th, for all the values in each column or series.
In [4]:
edu.describe()
Selecting Data
If we want to select a single column from a DataFrame, we just put its name between
square brackets. The result is a Series data structure, not a DataFrame, since only
one column is retrieved:
In [5]:
edu[’Value’]
Out[5]: 0 NaN
1 NaN
2 5.00
3 5.03
4 4.95
... ...
380 6.10
381 6.81
382 6.85
383 6.76
Name: Value, dtype: float64
If we want to select a subset of rows from a DataFrame, we can do so by indicating
a range of rows separated by a colon (:) inside the square brackets. This is commonly
known as a slice of rows:
In [6]:
edu[10:14]
This instruction returns the slice of rows from the 10th to the 13th position. Note
that the slice does not use the index labels as references, but the position. In this case,
the labels of the rows simply coincide with the position of the rows.
If we want to select a subset of columns and rows using the labels as our references
instead of the positions, we can use loc indexing:
In [7]:
edu.loc[90:94, [’TIME’,’GEO’]]
This returns all the rows between the indexes specified in the slice before the
comma, and the columns specified as a list after the comma. In this case, loc
references the index labels, which means that loc does not return the 90th to 94th
rows, but it returns all the rows between the row labeled 90 and the row labeled 94;
thus if the index 100 is placed between the rows labeled as 90 and 94, this row would
also be returned.
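The label-based behavior of loc can be checked on a small made-up DataFrame whose index places the label 100 between the labels 90 and 94 (an illustration only, not the edu data):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[90, 100, 92, 94])
# loc slices by label: every row between the row labeled 90 and the row labeled 94
sel = df.loc[90:94, ["x"]]
print(list(sel.index))  # [90, 100, 92, 94] -- the row labeled 100 is included
```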
Filtering Data
Another way to select a subset of data is by applying Boolean indexing. This indexing
is commonly known as a filter. For instance, if we want to filter out those values less
than or equal to 6.5, keeping only the rows whose value is greater than 6.5, we can
do it like this:
In [8]:
edu[edu[’Value’] > 6.5].tail()
Boolean indexing uses the result of a Boolean operation over the data, returning
a mask with True or False for each row. The rows marked True in the mask will
be selected. In the previous example, the Boolean operation edu[’Value’] >
6.5 produces a Boolean mask. When an element in the “Value” column is greater
than 6.5, the corresponding value in the mask is set to True, otherwise it is set to
False. Then, when this mask is applied as an index in edu[edu[’Value’] >
6.5], the result is a filtered DataFrame containing only rows with values higher
than 6.5. Of course, any of the usual comparison operators can be used for filtering:
< (less than), <= (less than or equal to), > (greater than), >= (greater than or equal
to), == (equal to), and != (not equal to).
Pandas uses the special value NaN (not a number) to represent missing values. In
Python, NaN is a special floating-point value returned by certain operations when
their result is undefined. A subtle feature of NaN values is that two NaN values are
never equal. Because of this, the only safe way to tell whether a value is missing
in a DataFrame is by using the isnull() function. Indeed, this function can be
used to filter rows with missing values:
In [9]:
edu[edu["Value"].isnull()].head()
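The fact that two NaN values are never equal, which is why isnull() is needed, can be verified directly on a small made-up Series:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: NaN is not even equal to itself
s = pd.Series([1.0, np.nan])
mask = s.isnull()        # the safe way to detect missing values
print(mask.tolist())     # [False, True]
```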
Manipulating Data
Once we know how to select the desired data, the next thing to know is how
to manipulate it. One of the most straightforward things we can do is to operate
on columns or rows using aggregation functions. Table 2.1 shows a list of the most
common aggregation functions. The result of these functions applied to a row or
column is always a number. Meanwhile, if a function is applied to a DataFrame or a
selection of rows and columns, then you can specify whether the function should be
applied to the rows for each column (setting the axis=0 keyword on the invocation
of the function), or to the columns for each row (setting the axis=1 keyword).
In [10]:
edu.max(axis = 0)
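The effect of the axis keyword can be seen on a tiny made-up DataFrame (not the edu data):

```python
import pandas as pd

df = pd.DataFrame({"wins": [30, 28], "draws": [6, 7]})
print(df.max(axis=0).tolist())  # [30, 7]: the maximum of each column
print(df.max(axis=1).tolist())  # [30, 28]: the maximum of each row
```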
Note that these are functions specific to Pandas, not the generic Python functions;
there are differences in their implementation. In Python, NaN values propagate
through all operations without raising an exception. In contrast, Pandas operations
exclude NaN values, treating them as missing data. For example, the Pandas max
function excludes NaN values, while the standard Python max function will take the
mathematical interpretation of NaN and return it as the maximum:
In [11]:
print("Pandas max function:", edu[’Value’].max())
print("Python max function:", max(edu[’Value’]))
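This difference can be reproduced on a small made-up Series. Note that the plain Python max result depends on where the NaN sits: every comparison against NaN is False, so a leading NaN is never replaced:

```python
import math

import pandas as pd

s = pd.Series([float("nan"), 4.5, 6.5])
print(s.max())        # 6.5: Pandas skips NaN values
m = max(s)            # built-in max keeps the leading NaN
print(math.isnan(m))  # True
```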
Beyond aggregation functions, arithmetic operations can be applied element-wise
to an entire column. For example, we can divide every value in the Value column
by 100:
In [12]:
s = edu["Value"]/100
s.head()
Out[12]: 0 NaN
1 NaN
2 0.0500
3 0.0503
4 0.0495
Name: Value, dtype: float64
However, we can apply any function to a DataFrame or Series by passing it as an
argument to the apply method. For example, in the following code we apply
the sqrt function from the NumPy library to compute the square root of each value
in the Value column.
In [13]:
s = edu["Value"].apply(np.sqrt)
s.head()
Out[13]: 0 NaN
1 NaN
2 2.236068
3 2.242766
4 2.224860
Name: Value, dtype: float64
If we need to design a specific function to apply, we can write an in-line function,
commonly known as a λ-function. A λ-function is a function without a name. It is
only necessary to specify the parameters it receives, between the lambda keyword
and the colon (:). In the next example, only one parameter is needed, which will be
the value of each element in the Value column. The value the function returns will
be the square of that value.
In [14]:
s = edu["Value"].apply(lambda d: d**2)
s.head()
Out[14]: 0 NaN
1 NaN
2 25.0000
3 25.3009
4 24.5025
Name: Value, dtype: float64
Another basic manipulation operation is to set new values in our DataFrame. This
can be done directly using the assignment operator (=) over a DataFrame. For example,
to add a new column to a DataFrame, we can assign a Series to a selection of a column
that does not exist. This will produce a new column in the DataFrame after all the
others. You must be aware that if a column with the same name already exists, the
previous values will be overwritten. In the following example, we assign the Series
that results from dividing the Value column by its maximum value to a new column
named ValueNorm.
In [15]:
edu[’ValueNorm’] = edu[’Value’]/edu[’Value’].max()
edu.tail()
Now, if we want to remove this column from the DataFrame, we can use the drop
function; this removes the indicated rows if axis=0, or the indicated columns if
axis=1. In Pandas, all the functions that change the contents of a DataFrame, such
as the drop function, normally return a copy of the modified data instead of
overwriting the DataFrame; the original DataFrame is therefore kept. If you do not
want to keep the old values, you can set the keyword inplace to True. By default,
this keyword is set to False, meaning that a copy of the data is returned.
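The copy-versus-inplace behavior can be sketched on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dropped = df.drop("b", axis=1)      # returns a modified copy
print(list(df.columns))             # ['a', 'b']: the original is unchanged
df.drop("b", axis=1, inplace=True)  # modifies df itself and returns None
print(list(df.columns))             # ['a']
```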
In [16]:
edu.drop(’ValueNorm’, axis = 1, inplace = True)
edu.head()
If instead we want to insert a new row at the bottom of the DataFrame,
we can use the Pandas concat function. This function receives two DataFrames
as arguments and returns a new DataFrame with the contents of both. To add the new
row at the bottom, we set its index to the maximum value of the index of the original
DataFrame plus one.
In [17]:
edu = pd.concat([edu, pd.DataFrame({’TIME’: 2000, ’Value’:
5.00, ’GEO’: ’a’},
index=[max(edu.index)+1])])
edu.tail()
Finally, if we want to remove this row, we need to use the drop function again,
now setting the axis to 0 and specifying the index of the row we want to remove.
Since we want to remove the last row, we can use the max function over the indexes
to determine which row it is.
In [18]:
edu.drop(max(edu.index), axis = 0, inplace = True)
edu.tail()
To remove NaN values, instead of the generic drop function we can use the specific
dropna() function. If we want to erase any row that contains a NaN value, we
have to set the how keyword to any. To restrict the check to a subset of columns,
we can specify them using the subset keyword.
In [19]:
eduDrop = edu.dropna(how = ’any’, subset = ["Value"])
eduDrop.head()
If, instead of removing the rows containing NaN, we want to fill them with another
value, we can use the fillna() method, specifying the value to be used. If we
want to fill only some specific columns, we have to pass the fillna() function
a dictionary with the names of the columns as keys and the filling values as values.
In [20]:
eduFilled = edu.fillna(value = {"Value": 0})
eduFilled.head()
Sorting
Another important functionality we will need when inspecting our data is sorting by
columns. We can sort a DataFrame by any column, using the sort_values function.
If we want to see the first five rows of data sorted in descending order (i.e., from the
largest to the smallest values) of the Value column, we just need to do this:
In [21]:
edu.sort_values(by = ’Value’, ascending = False,
inplace = True)
edu.head()
Note that the inplace keyword means that the DataFrame will be overwritten,
and hence no new DataFrame is returned. If instead of ascending = False we
use ascending = True, the values are sorted in ascending order (i.e., from the
smallest to the largest values).
If we want to return to the original order, we can sort by an index using the
sort_index function and specifying axis=0:
In [22]:
edu.sort_index(axis = 0, ascending = True, inplace = True)
edu.head()
Grouping Data
Another very useful way to inspect data is to group it according to some criteria. For
instance, in our example it would be nice to group all the data by country, regardless
of the year. Pandas has the groupby function that allows us to do exactly this. The
value returned by this function is a special grouped DataFrame. To obtain a proper
DataFrame as a result, it is necessary to apply an aggregation function, which will
then be applied to all the values in the same group.
For example, in our case, if we want a DataFrame showing the mean of the values
for each country over all the years, we can obtain it by grouping according to country
and using the mean function as the aggregation method for each group. The result
would be a DataFrame with countries as indexes and the mean values as the column:
In [23]:
group = edu[["GEO", "Value"]].groupby(’GEO’).mean()
group.head()
Out[23]: Value
GEO
Austria 5.618333
Belgium 6.189091
Bulgaria 4.093333
Cyprus 7.023333
Czech Republic 4.16833
Rearranging Data
Up until now, our indexes have been just an enumeration of rows without much
meaning. We can transform the arrangement of our data, redistributing the indexes
and columns for better manipulation, which normally leads to better performance.
We can rearrange our data using the pivot_table function, specifying which
columns will be the new indexes, the new values, and the new columns.
For example, imagine that we want to transform our DataFrame to a spreadsheet-
like structure with the country names as the index, while the columns will be the
years starting from 2006 and the values will be the previous Value column. To do
this, first we need to filter out the data and then pivot it in this way:
In [24]:
filtered_data = edu[edu["TIME"] > 2005]
pivedu = pd.pivot_table(filtered_data, values = ’Value’,
index = [’GEO’],
columns = [’TIME’])
pivedu.head()
Now we can use the new index to select specific rows by label, using the loc
operator:
In [25]:
pivedu.loc[[’Spain’,’Portugal’], [2006,2011]]
Ranking Data
Another useful way to inspect data is to rank it. For example, we would like to
know how each country ranks by year. To see this, we will use the Pandas rank
function. But first, we need to clean up our previous pivoted table a bit so that it only
contains real countries with real data. To do this, we first drop the Euro area entries
and shorten the Germany entry name, using the rename function, and then we drop
all the rows containing any NaN, using the dropna function.
Now we can perform the ranking using the rank function. Note here that the
parameter ascending=False makes the ranking go from the highest to the
lowest values. The Pandas rank function supports different tie-breaking methods,
specified with the method parameter. In our case, we use the first method, in
which ranks are assigned in the order the values appear in the array, avoiding gaps
in the ranking.
In [26]:
pivedu = pivedu.drop([
’Euro area (13 countries)’,
’Euro area (15 countries)’,
’Euro area (17 countries)’,
’Euro area (18 countries)’,
’European Union (25 countries)’,
’European Union (27 countries)’,
’European Union (28 countries)’
],
axis = 0)
pivedu = pivedu.rename(index = {’Germany (until 1990 former
territory of the FRG)’: ’Germany’})
pivedu = pivedu.dropna()
pivedu.rank(ascending = False, method = ’first’).head()
If we want to make a global ranking taking into account all the years, we can
sum up all the columns and rank the result. Then we can sort the resulting values to
retrieve the top five countries for the last 6 years, in this way:
In [27]:
totalSum = pivedu.sum(axis = 1)
totalSum.rank(ascending = False, method = ’dense’).sort_values().head()
Out[27]: GEO
Denmark 1
Cyprus 2
Finland 3
Malta 4
Belgium 5
dtype: float64
Notice that the method keyword argument in the rank function specifies how
items that compare equal receive their ranking. In the case of dense, items that
compare equal receive the same ranking number, and the next non-equal item receives
the immediately following ranking number.
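The difference between the dense and first tie-breaking methods can be seen on a small made-up Series with a tie:

```python
import pandas as pd

s = pd.Series([7.0, 7.0, 5.0])
# dense: tied items share a rank, with no gaps afterwards
print(s.rank(ascending=False, method="dense").tolist())  # [1.0, 1.0, 2.0]
# first: ties are broken by order of appearance
print(s.rank(ascending=False, method="first").tolist())  # [1.0, 2.0, 3.0]
```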
Plotting
Pandas DataFrames and Series can be plotted using the plot function, which relies
on the Matplotlib graphics library. For example, if we want to plot the accumulated
values for each country over the last 6 years, we can take the Series obtained in the
previous example and plot it directly by calling the plot function as shown in the
next cell:
In [28]:
totalSum = pivedu.sum(axis = 1).sort_values(ascending = False)
totalSum.plot(kind = ’bar’, style = ’b’, alpha = 0.4,
title = "Total Values for Country")
Out[28]:
Note that if we want the bars ordered from the highest to the lowest value, we
need to sort the values in the Series first. The parameter kind used in the plot
function defines which kind of graphic will be used, in our case a bar graph. The
parameter style refers to the style properties of the graphic; in our case, the color
of the bars is set to b (blue). The alpha channel can be modified by adding the
keyword parameter alpha with a value between 0 and 1, producing a more
translucent plot. Finally, the title keyword sets the name of the graphic.
It is also possible to plot a DataFrame directly. In this case, each column is treated
as a separate Series. For example, instead of plotting the accumulated value over
the years, we can plot the value for each year.
In [29]:
my_colors = [’b’, ’r’, ’g’, ’y’, ’m’, ’c’]
ax = pivedu.plot(kind = ’barh’,
stacked = True,
color = my_colors)
ax.legend(loc = ’center left’, bbox_to_anchor = (1, .5))
Out[29]:
In this case, we have used a horizontal bar graph (kind=’barh’), stacking all the
years in the same country bar. This is done by setting the parameter stacked
to True. The number of default colors in a plot is only 5; thus, if you have more
than 5 Series to show, you need to specify more colors, otherwise the same set of
colors will be reused. We can set a new set of colors using the keyword color
with a list of colors. Basic colors have a single-character code assigned to each; for
example, “b” is for blue, “r” for red, “g” for green, “y” for yellow, “m” for magenta,
and “c” for cyan. When several Series are shown in a plot, a legend is created to
identify each one. The name of each Series is the name of the corresponding column
in the DataFrame. By default, the legend goes inside the plot area. If we want to change
this, we can use the legend function of the axis object (the object returned
when the plot function is called). With the loc keyword, we can set the relative
position of the legend with respect to the plot: a combination of right or
left and upper, lower, or center. With bbox_to_anchor we can set an absolute
position with respect to the plot, allowing us to put the legend outside the graph.
2.8 Conclusions
This chapter has been a brief introduction to the most essential elements of a
programming environment for data scientists. The tutorial followed in this chapter is
just a simple introduction to the Pandas library. For more advanced uses, refer to the
library documentation.26
Acknowledgements This chapter was written by Eloi Puertas and Francesc Dantí.
26 https://pandas.pydata.org/docs/.
3 Descriptive Statistics
3.1 Introduction
Descriptive statistics applies the concepts, measures, and terms that are used to
describe the basic features of the samples in a study. These procedures are essential
to provide summaries about the samples as an approximation of the population.
Together with simple graphics, they form the basis of every quantitative analysis of
data. In order to describe the sample data and to be able to infer any conclusion, we
should go through several steps:
1. Data preparation: Given a specific example, we need to prepare the data for
generating statistically valid descriptions.
2. Descriptive statistics: This generates different statistics to describe and summa-
rize the data concisely and evaluate different ways to visualize them.
One of the first tasks when analyzing data is to collect and prepare the data in a format
appropriate for analysis of the samples. The most common steps for data preparation
involve the following operations.
1. Obtaining the data: Data can be read directly from a file or they might be obtained
by scraping the web.
2. Parsing the data: The right parsing procedure depends on what format the data
are in: plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning the data: Survey responses and other data files are almost always
incomplete. Sometimes there are multiple codes for things such as “not asked,” “did
not know,” and “declined to answer.” And there are almost always errors. A simple
strategy is to remove or ignore incomplete records.
4. Building data structures: Once you read the data, it is necessary to store them in
a data structure that lends itself to the analysis we are interested in. If the data fit
into the memory, building a data structure is usually the way to go. If not, usually
a database is built, which is an out-of-memory data structure. Most databases
provide a mapping from keys to values, so they serve as dictionaries.
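As a minimal sketch of steps 3 and 4, pandas can hold the records in a DataFrame and drop incomplete ones. This example uses a tiny made-up record set (not a real survey file); the column names are hypothetical:

```python
import pandas as pd

# A tiny made-up record set; None marks a missing or declined answer.
raw = [
    {'age': 39, 'country': 'United-States'},
    {'age': None, 'country': 'Canada'},    # incomplete record
    {'age': 52, 'country': None},          # incomplete record
    {'age': 28, 'country': 'India'},
]
records = pd.DataFrame(raw)

# The simple strategy from step 3: remove incomplete records.
clean = records.dropna()
print(len(clean), 'complete records out of', len(records))
```

`dropna()` discards any row containing a missing value; for large surveys a more careful per-column treatment may be preferable, but removal is the simplest valid baseline.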
Let us consider a public database called the “Adult” dataset, hosted on the UCI’s
Machine Learning Repository.1 It contains approximately 32,000 observations concerning different financial parameters related to the US population: age, sex, marital (marital status of the individual), country, income (Boolean variable: whether the person makes more than $50,000 per annum), education (the highest level of education achieved by the individual), occupation, capital gain, etc.
We will show that we can explore the data by asking questions like: “Are men
more likely to become high-income professionals than women, i.e., to receive an
income of over $50,000 per annum?”
1 https://archive.ics.uci.edu/ml/datasets/Adult.
3.2 Data Preparation
# 'file' is the open handle to the downloaded UCI 'adult.data' file, and
# chr_int() is the helper defined earlier that casts numeric strings to int.
data = []
for line in file:
    data1 = line.split(', ')
    if len(data1) == 15:
        data.append([chr_int(data1[0]), data1[1],
                     chr_int(data1[2]), data1[3],
                     chr_int(data1[4]), data1[5],
                     data1[6], data1[7], data1[8],
                     data1[9], chr_int(data1[10]),
                     chr_int(data1[11]),
                     chr_int(data1[12]),
                     data1[13], data1[14]
                     ])
df = pd.DataFrame(data)
df.columns = [
    'age', 'type_employer', 'fnlwgt',
    'education', 'education_num', 'marital',
    'occupation', 'relationship', 'race',
    'sex', 'capital_gain', 'capital_loss',
    'hr_per_week', 'country', 'income'
]
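The same loading step can be written more compactly with pandas' read_csv. This is a sketch rather than the book's code; to stay self-contained it first writes a two-line sample in the Adult format (with a real download, only the read_csv call is needed):

```python
import pandas as pd

# Two sample lines in the Adult format (15 fields separated by ', ').
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
with open("adult_sample.data", "w") as f:
    f.write(sample)

cols = ['age', 'type_employer', 'fnlwgt', 'education', 'education_num',
        'marital', 'occupation', 'relationship', 'race', 'sex',
        'capital_gain', 'capital_loss', 'hr_per_week', 'country', 'income']

# The multi-character separator ', ' requires the python parsing engine.
df = pd.read_csv("adult_sample.data", sep=", ", engine="python",
                 header=None, names=cols)
print(df.shape)
```

Unlike the manual loop, read_csv also strips the trailing newline from the income field and infers numeric column types automatically.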
The command shape gives exactly the number of data samples (in rows, in this
case) and features (in columns):
In [4]:
df.shape

Out[4]: (32561, 15)

We can count the samples per country by grouping on that column:

In [5]:
df.groupby('country').size().head()

Out[5]: country
?            583
Cambodia      19
Canada       121
China         75
Columbia      59
The first row shows the number of samples with unknown country, followed by the number of samples corresponding to the first countries in the dataset, in alphabetical order.
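If we would rather see the most frequent countries first, value_counts sorts by frequency instead of alphabetically. A sketch on a small stand-in Series (the real column would be df['country']):

```python
import pandas as pd

# Stand-in for df['country']; the real column has ~32,000 entries.
country = pd.Series(['United-States', 'Canada', 'United-States',
                     'China', 'United-States', 'Canada'])

# value_counts() sorts by frequency, most common first.
counts = country.value_counts()
print(counts)
```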
Let us split people according to their gender into two groups: men and women.
In [6]:
# income values keep the trailing newline of each raw line.
ml = df[(df.sex == 'Male')]
ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
fm = df[(df.sex == 'Female')]
fm1 = df[(df.sex == 'Female') & (df.income == '>50K\n')]
The data that come from performing a particular measurement on all the subjects
in a sample represent our observations for a single characteristic like country,
age, education, etc. These measurements and categories represent a sample
distribution of the variable, which in turn approximately represents the population
distribution of the variable. One of the main goals of exploratory data analysis is
to visualize and summarize the sample distribution, thereby allowing us to make
tentative assumptions about the population distribution.
3.3 Exploratory Data Analysis
The data in general can be categorical or quantitative. For categorical data, a simple
tabulation of the frequency of each category is the best non-graphical exploration
for data analysis. For example, we can ask ourselves what is the proportion of high-
income professionals in our database:
In [8]:
df1 = df[(df.income == '>50K\n')]
print('The rate of people with high income is: ',
      int(len(df1) / float(len(df)) * 100), '%.')
print('The rate of men with high income is: ',
      int(len(ml1) / float(len(ml)) * 100), '%.')
print('The rate of women with high income is: ',
      int(len(fm1) / float(len(fm)) * 100), '%.')
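The same rates can be computed more idiomatically by averaging a Boolean indicator grouped by sex. This sketch uses a tiny made-up DataFrame with the same 'sex' and 'income' columns, not the real dataset:

```python
import pandas as pd

# Tiny made-up stand-in with the same column names as the text.
toy = pd.DataFrame({
    'sex':    ['Male', 'Male', 'Female', 'Female', 'Male'],
    'income': ['>50K\n', '<=50K\n', '>50K\n', '<=50K\n', '>50K\n'],
})

# The mean of a Boolean Series is the proportion of True values.
high = (toy['income'] == '>50K\n')
print('Overall high-income rate:', 100 * high.mean(), '%')

# Grouping the indicator by sex gives the per-group rates directly.
rate_by_sex = 100 * high.groupby(toy['sex']).mean()
print(rate_by_sex)
```

This avoids building the four filtered DataFrames by hand and generalizes to any grouping column.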
3.3.1.1 Mean
One of the first measurements we use to have a look at the data is to obtain sample statistics from the data, such as the sample mean [1]. Given a sample of $n$ values, $\{x_i\}$, $i = 1, \ldots, n$, the mean, $\mu$, is the sum of the values divided by the number of values:2

$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

2 We will use the following notation: $X$ is a random variable, $\mathbf{x}$ is a column vector, $\mathbf{x}^T$ (the transpose of $\mathbf{x}$) is a row vector, $\mathbf{X}$ is a matrix, and $x_i$ is the $i$th element of a dataset.
In [9]:
print('The average age of men is: ',
      ml['age'].mean())
print('The average age of women is: ',
      fm['age'].mean())

ml_median_age = ml1['age'].median()
fm_median_age = fm1['age'].median()
print("Median age per men and women with high-income: ",
      ml_median_age, fm_median_age)
That value, $x_p$, is the $p$th quantile, or the $100 \times p$th percentile. For example, a 5-number summary is defined by the values $x_{\min}, Q_1, Q_2, Q_3, x_{\max}$, where $Q_1$ is the 25th percentile, $Q_2$ is the 50th percentile, and $Q_3$ is the 75th percentile.
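In pandas, the quantiles of the 5-number summary can be computed directly with quantile. A sketch on a small made-up age Series (a stand-in for df['age']):

```python
import pandas as pd

# A small made-up age sample.
age = pd.Series([22, 25, 31, 38, 40, 45, 52, 60, 67])

# x_min, Q1, Q2 (the median), Q3, x_max: the 5-number summary.
five = age.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five)
```

Note that describe() returns the same five values together with the count, mean, and standard deviation.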
Summarizing data by just looking at their mean, median, and variance can be danger-
ous: very different data can be described by the same statistics. The best thing to do
is to validate the data by inspecting them. We can have a look at the data distribution,
which describes how often each value appears (i.e., what is its frequency).
The most common representation of a distribution is a histogram, which is a graph
that shows the frequency of each value. Let us show the age of working men and
women separately.
In [12]:
ml_age = ml['age']
ml_age.hist(histtype='stepfilled', bins=20)

In [13]:
fm_age = fm['age']
fm_age.hist(histtype='stepfilled', bins=10)
The output can be seen in Fig. 3.1. If we want to compare the histograms, we can
plot them overlapping in the same graphic as follows:
Fig. 3.1 Histogram of the age of working men (left) and women (right)
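The overlapping comparison can be sketched as follows. Since the real ml_age and fm_age are not loaded here, two made-up normal samples stand in for them; the alpha transparency keeps both histograms visible where they overlap:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-ins for ml_age and fm_age from the text.
rng = np.random.default_rng(0)
ml_age = pd.Series(rng.normal(39, 12, 500).clip(17, 90))
fm_age = pd.Series(rng.normal(37, 12, 500).clip(17, 90))

# alpha < 1 keeps both distributions visible where they overlap.
ml_age.hist(histtype='stepfilled', bins=20, alpha=0.5, label='men')
fm_age.hist(histtype='stepfilled', bins=20, alpha=0.5, label='women')
plt.legend()
plt.savefig('hist_overlap.png')   # plt.show() in an interactive session
```

Using the same bins value for both calls makes the bar widths comparable, which the side-by-side plots in Fig. 3.1 (20 vs. 10 bins) do not guarantee.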