Python Machine Learning Case Studies
Five Case Studies for the Data Scientist

Danish Haroon

Python Machine Learning Case Studies
Danish Haroon
Karachi, Pakistan
ISBN-13 (pbk): 978-1-4842-2822-7
ISBN-13 (electronic): 978-1-4842-2823-4
DOI 10.1007/978-1-4842-2823-4
Library of Congress Control Number: 2017957234
Copyright © 2017 by Danish Haroon
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical
way, and transmission or information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark
symbol with every occurrence of a trademarked name, logo, or image we use the names, logos,
and images only in an editorial fashion and to the benefit of the trademark owner, with no
intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the
date of publication, neither the authors nor the editors nor the publisher can accept any legal
responsibility for any errors or omissions that may be made. The publisher makes no warranty,
express or implied, with respect to the material contained herein.
Cover image by Freepik (www.freepik.com)
Managing Director: Welmoed Spahr
Editorial Director: Todd Green
Acquisitions Editor: Celestin Suresh John
Development Editor: Matthew Moodie
Technical Reviewer: Somil Asthana
Coordinating Editor: Sanchita Mandal
Copy Editor: Lori Jacobs
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York,
233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505,
e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is
a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc
(SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected], or visit
http://www.apress.com/rights-permissions.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook
versions and licenses are also available for most titles. For more information, reference our Print
and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available
to readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-2822-7.
For more detailed information, please visit http://www.apress.com/source-code.
Printed on acid-free paper
Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Statistics and Probability
Chapter 2: Regression
Chapter 3: Time Series
Chapter 4: Clustering
Chapter 5: Classification
Appendix A: Chart types and when to use them
Index
Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Statistics and Probability
Case Study: Cycle Sharing Scheme - Determining Brand Persona
Performing Exploratory Data Analysis
  Feature Exploration
  Types of variables
  Univariate Analysis
  Multivariate Analysis
  Time Series Components
Measuring Center of Measure
  Mean
  Median
  Mode
  Variance
  Standard Deviation
  Changes in Measure of Center Statistics due to Presence of Constants
  The Normal Distribution
Correlation
  Pearson R Correlation
  Kendall Rank Correlation
  Spearman Rank Correlation
Hypothesis Testing: Comparing Two Groups
  t-Statistics
  t-Distributions and Sample Size
Central Limit Theorem
Case Study Findings
Applications of Statistics and Probability
  Actuarial Science
  Biostatistics
  Astrostatistics
  Business Analytics
  Econometrics
  Machine Learning
  Statistical Signal Processing
  Elections


Chapter 2: Regression
Case Study: Removing Inconsistencies in Concrete Compressive Strength
Concepts of Regression
  Interpolation and Extrapolation
  Linear Regression
  Least Squares Regression Line of y on x
  Multiple Regression
  Stepwise Regression
  Polynomial Regression
Assumptions of Regressions
  Number of Cases
  Missing Data
  Multicollinearity and Singularity
Features’ Exploration
  Correlation
Overfitting and Underfitting
Regression Metrics of Evaluation
  Explained Variance Score
  Mean Absolute Error
  Mean Squared Error
  R2
  Residual
  Residual Plot
  Residual Sum of Squares
Types of Regression
  Linear Regression
  Grid Search
  Ridge Regression
  Lasso Regression
  ElasticNet
  Gradient Boosting Regression
  Support Vector Machines
Applications of Regression
  Predicting Sales
  Predicting Value of Bond
  Rate of Inflation
  Insurance Companies
  Call Center
  Agriculture
  Predicting Salary
  Real Estate Industry


Chapter 3: Time Series
Case Study: Predicting Daily Adjusted Closing Rate of Yahoo
Feature Exploration
  Time Series Modeling
Evaluating the Stationary Nature of a Time Series Object
  Properties of a Time Series Which Is Stationary in Nature
  Tests to Determine If a Time Series Is Stationary
  Methods of Making a Time Series Object Stationary
Tests to Determine If a Time Series Has Autocorrelation
  Autocorrelation Function
  Partial Autocorrelation Function
  Measuring Autocorrelation
Modeling a Time Series
  Tests to Validate Forecasted Series
  Deciding Upon the Parameters for Modeling
Auto-Regressive Integrated Moving Averages
  Auto-Regressive Moving Averages
  Auto-Regressive
  Moving Average
  Combined Model
Scaling Back the Forecast
Applications of Time Series Analysis
  Sales Forecasting
  Weather Forecasting
  Unemployment Estimates
  Disease Outbreak
  Stock Market Prediction


Chapter 4: Clustering
Case Study: Determination of Short Tail Keywords for Marketing
Features’ Exploration
Supervised vs. Unsupervised Learning
  Supervised Learning
  Unsupervised Learning
Clustering
Data Transformation for Modeling
  Metrics of Evaluating Clustering Models
Clustering Models
  k-Means Clustering
  Applying k-Means Clustering for Optimal Number of Clusters
  Principle Component Analysis
  Gaussian Mixture Model
  Bayesian Gaussian Mixture Model
Applications of Clustering
  Identifying Diseases
  Document Clustering in Search Engines
  Demographic-Based Customer Segmentation


Chapter 5: Classification
Case Study: Ohio Clinic - Meeting Supply and Demand
Features’ Exploration
Performing Data Wrangling
Performing Exploratory Data Analysis
Features’ Generation
Classification
  Model Evaluation Techniques
  Ensuring Cross-Validation by Splitting the Dataset
  Decision Tree Classification
Kernel Approximation
  SGD Classifier
  Ensemble Methods
Random Forest Classification
  Gradient Boosting
Applications of Classification
  Image Classification
  Music Classification
  E-mail Spam Filtering
  Insurance


Appendix A: Chart types and when to use them
Pie chart
Bar graph
Histogram
Stem and Leaf plot
Box plot

Index
About the Author

Danish Haroon currently leads the Data Sciences
team at Market IQ Inc, a patented predictive analytics
platform focused on providing actionable, real-time
intelligence, culled from sentiment inflection points.
He received his MBA from Karachi School for Business
and Leadership, having served corporate clients and
their data analytics requirements. Most recently, he
led the data commercialization team at PredictifyME,
a startup focused on providing predictive analytics for
demand planning and real estate markets in the US.
His current research focuses on the amalgam of
data sciences for improved customer experiences (CX).
About the Technical Reviewer

Somil Asthana has a BTech from IITBHU India and
an MS from the University of New York at Buffalo (in
the United States), both in Computer Science. He is an
entrepreneur, machine learning wizard, and BigData
specialist consulting with Fortune 500 companies like
Sprint, Verizon, HPE, and Avaya. He has a startup
which provides BigData solutions and Data Strategies
to Data Driven Industries in the ecommerce and
content/media domains.
Acknowledgments

I would like to thank my parents and lovely wife for their continuous support throughout
this enlightening journey.

Introduction

This volume embraces machine learning approaches and Python to enable automatic
rendering of rich insights and solutions to business problems. The book uses a
hands-on case study-based approach to crack real-world applications where machine
learning concepts can provide a best fit. These smarter machines will enable your
business processes to achieve efficiencies in minimal time and resources.
Python Machine Learning Case Studies walks you through a step-by-step approach to
improve business processes and help you discover the pivotal points that frame corporate
strategies. You will read about machine learning techniques that can provide support to
your products and services. The book also highlights the pros and cons of each of these
machine learning concepts to help you decide which one best suits your needs.
By taking a step-by-step approach to coding you will be able to understand the
rationale behind model selection within the machine learning process. The book is
equipped with practical examples and code snippets to ensure that you understand the
data science approach for solving real-world problems.
Python Machine Learning Case Studies acts as an enabler for people from both
technical and non-technical backgrounds to apply machine learning techniques to
real-world problems. Each chapter starts with a case study that has a well-defined
business problem. The chapters then proceed by incorporating storylines and code
snippets to decide on the optimal solution. Exercises are laid out throughout the
chapters to enable hands-on practice of the concepts learned. Each chapter ends
with a highlight of real-world applications to which the concepts learned can be applied.
Following is a brief overview of the contents covered in each of the five chapters:
Chapter 1 covers the concepts of statistics and probability.
Chapter 2 talks about regression techniques and methods to fine-tune the model.
Chapter 3 exposes readers to time series models and covers the property of
stationarity in detail.
Chapter 4 uses clustering as an aid to segment the data for marketing purposes.
Chapter 5 talks about classification models and evaluation metrics to gauge the
goodness of these models.

CHAPTER 1

Statistics and Probability

The purpose of this chapter is to instill in you the basic concepts of traditional statistics
and probability. Certainly many of you might be wondering what it has to do with
machine learning. Well, in order to apply a best fit model to your data, the most important
prerequisite is for you to understand the data in the first place. This will enable you to find
out distributions within data, measure the goodness of data, and run some basic tests
to understand if some form of relationship exists between dependent and independent
variables. Let’s dive in.
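
As a minimal sketch of what "understanding the data" can look like in practice, the snippet below summarizes one variable and then checks whether a linear relationship exists between an independent and a dependent variable. The `ages` and `durations` values are made up for illustration and are not taken from the case study dataset; the helper names (`mean`, `median`, `sample_std`, `pearson_r`) are hypothetical and written in plain Python so that the code runs on both Python 2.7 and Python 3.

```python
from math import sqrt

# Hypothetical toy data (an assumption, not the book's dataset):
# rider age is the independent variable, trip duration in minutes the dependent one.
ages = [22, 25, 31, 38, 44, 52, 60]
durations = [9.5, 11.0, 12.5, 15.0, 16.5, 19.0, 22.0]


def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / float(len(xs))


def median(xs):
    """Middle value of the sorted data (average of the two middle values if even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2.0


def sample_std(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / float(len(xs) - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Describe the distribution of the dependent variable.
summary = {
    "mean": mean(durations),
    "median": median(durations),
    "std": sample_std(durations),
}

# Measure the strength of the linear relationship between the two variables.
r = pearson_r(ages, durations)

print(summary)
print("Pearson r between age and duration: %.3f" % r)
```

A value of r close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. In this toy data the relationship is strong by construction, so the result says nothing about the real riders.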

Note: This book incorporates Python 2.7.11 as the de facto standard for coding
examples. Moreover, you are required to have it installed for the Exercises as well.

So why do I prefer Python 2.7.11 over Python 3.x? Following are some of the reasons:
• Third-party library support for Python 2.x is relatively better than
support for Python 3.x. This means that there are a considerable
number of libraries in Python 2.x that lack support in Python 3.x.
• Some current Linux distributions and macOS provide Python 2.x
by default. The objective is to let readers, regardless of their OS
version, apply the code examples on their systems, and thus this
is the choice to go forward with.
• The above-mentioned facts are the reason why companies prefer
to work with Python 2.x or why they decide not to migrate their
code base from Python 2.x to Python 3.x.

Case Study: Cycle Sharing Scheme - Determining Brand Persona
Nancy and Eric were assigned the huge task of determining the brand persona
for a new cycle share scheme. They had to present their results at this year’s annual
board meeting in order to lay out a strong marketing plan for reaching out to
potential customers.

© Danish Haroon 2017
D. Haroon, Python Machine Learning Case Studies, DOI 10.1007/978-1-4842-2823-4_1

The cycle sharing scheme provides means for the people of the city to commute
using a convenient, cheap, and green transportation alternative. The service has 500
bikes at 50 stations across Seattle. Each of the stations has a dock locking system (where
all bikes are parked); kiosks (so customers can get a membership key or pay for a trip);
and a helmet rental service. A person can choose between purchasing a membership
key or a short-term pass. A membership key grants an annual membership, and the key
can be obtained from a kiosk. Advantages for members include quick retrieval of bikes
and unlimited 45-minute rentals. Short-term passes offer access to bikes for a 24-hour
or 3-day time interval. Riders can pick up and return the bikes at any of the 50 stations
citywide.
Jason started this service in May 2014 and since then had been focusing on
increasing the number of bikes as well as docking stations in order to increase
convenience and accessibility for his customers. Despite this expansion, customer
retention remained an issue. As Jason recalled, “We had planned to put in the investment
for a year to lay out the infrastructure necessary for the customers to start using it. We
had a strategy to make sure that the retention levels remain high to make this model self-
sustainable. However, it worked otherwise (i.e., the customer base didn’t catch up with
the rate of the infrastructure expansion).”
A private service would have had three alternatives to curb this problem: get
sponsors on board, increase service charges, or expand the pool of customers. Price hikes
were not an option for Jason as this was a publicly sponsored initiative with the goal of
providing affordable transportation to all. As for increasing the customer base, they had
to decide upon a marketing channel that guarantees broad reach at low cost.
Nancy, a marketer who had worked in the corporate sector for ten years, and Eric, a
data analyst, were explicitly hired to find a way to make things work around this problem.
The advantage on their side was that they were provided with the dataset of transaction
history and thus they didn’t have to go through the hassle of conducting marketing
research to gather data.
Nancy realized that attracting recurring customers on a minimal budget
required understanding the customers in the first place (i.e., persona). As she stated,
“Understanding the persona of your brand is essential, as it helps you reach a targeted
audience which is likely to convert at a higher probability. Moreover, this also helps in
reaching out to sponsors who target a similar persona. This two-fold approach can make
our bottom line positive.”
As Nancy and Eric contemplated the problem at hand, they had questions like the
following: Which attribute correlates the best with trip duration and number of trips?
Which generation adopts our service the most?
Following is the data dictionary of the Trips dataset that was provided to Nancy and
Eric:


Table 1-1. Data Dictionary for the Trips Data from Cycles Share Dataset

Feature name         Description
trip_id              Unique ID assigned to each trip
starttime            Day and time when the trip started, in PST
stoptime             Day and time when the trip ended, in PST
bikeid               ID attached to each bike
tripduration         Time of trip in seconds
from_station_name    Name of station where the trip originated
to_station_name      Name of station where the trip terminated
from_station_id      ID of station where trip originated
to_station_id        ID of station where trip terminated
usertype             Value can include either of the following: short-term pass holder or member
gender               Gender of the rider
birthyear            Birth year of the rider

Exercises for this chapter required Eric to install the packages shown in Listing 1-1.
He preferred to import all of them upfront to avoid bottlenecks while implementing the
code snippets on his local machine.
However, for Eric to import these packages in his code, he needed to install them in
the first place. He did so as follows:
1. Opened terminal/shell
2. Navigated to his code directory using terminal/shell
3. Installed pip:

python get-pip.py

4. Installed each package separately, for example:

pip install pandas
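Once the installs finish, a quick loop can confirm that every package is importable before the chapter's code is run. The following is only a minimal sketch and not part of Eric's workflow; the helper name check_packages is hypothetical, and the module names are the ones that appear in Listing 1-1.

```python
import importlib


def check_packages(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # Any failure while importing counts as "not usable".
            status[name] = False
    return status


# Third-party module names taken from Listing 1-1.
required = ["pandas", "matplotlib", "numpy", "scipy", "seaborn"]
for name, ok in sorted(check_packages(required).items()):
    print("%-12s %s" % (name, "ok" if ok else "missing, try: pip install %s" % name))
```

Any package reported as missing can then be installed with pip as in step 4.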

Listing 1-1. Importing Packages Required for This Chapter


%matplotlib inline

import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
import seaborn

Performing Exploratory Data Analysis


Eric recalled explaining exploratory data analysis in the following words:

What do I mean by exploratory data analysis (EDA)? Well, by this I
mean to see the data visually. Why do we need to see the data visually?
Well, if you have 1 million observations in your dataset,
it won’t be easy for you to understand the data just by looking at it,
so it would be better to plot it. But don’t you think it’s a waste of
time? No, not at all, because understanding the data lets us understand
the importance of features and their limitations.

Feature Exploration
Eric started off by loading the data into memory (see Listing 1-2).

Listing 1-2. Reading the Data into Memory


data = pd.read_csv('examples/trip.csv')

Nancy was curious to know how big the data was and what it looked like. Hence, Eric
wrote the code in Listing 1-3 to print some initial observations of the dataset to get a feel
of what it contains.

Listing 1-3. Printing Size of the Dataset and Printing First Few Rows
print len(data)
data.head()

Output

236065


Table 1-2. Print of Observations in the First Seven Columns of Dataset

trip_id  starttime         stoptime          bikeid    tripduration  from_station_name    to_station_name
431      10/13/2014 10:31  10/13/2014 10:48  SEA00298  985.935       2nd Ave & Spring St  Occidental Park/Occidental Ave S & S Washing...
432      10/13/2014 10:32  10/13/2014 10:48  SEA00195  926.375       2nd Ave & Spring St  Occidental Park/Occidental Ave S & S Washing...
433      10/13/2014 10:33  10/13/2014 10:48  SEA00486  883.831       2nd Ave & Spring St  Occidental Park/Occidental Ave S & S Washing...
434      10/13/2014 10:34  10/13/2014 10:48  SEA00333  865.937       2nd Ave & Spring St  Occidental Park/Occidental Ave S & S Washing...
435      10/13/2014 10:34  10/13/2014 10:49  SEA00202  923.923       2nd Ave & Spring St  Occidental Park/Occidental Ave S & S Washing...

Table 1-3. Print of Observations in the Last Five Columns of Dataset

from_station_id  to_station_id  usertype  gender  birthyear
CBD-06           PS-04          Member    Male    1960.0
CBD-06           PS-04          Member    Male    1970.0
CBD-06           PS-04          Member    Female  1988.0
CBD-06           PS-04          Member    Female  1977.0
CBD-06           PS-04          Member    Male    1971.0


After looking at Table 1-2 and Table 1-3, Nancy noticed that tripduration is
represented in seconds. Moreover, the unique identifiers for bike, from_station, and
to_station are strings, unlike the trip identifier, which is an integer.
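Nancy's observation can also be checked programmatically. The sketch below is an illustration only: it rebuilds three rows from Tables 1-2 and 1-3 by hand (the variable name sample is hypothetical) and asks pandas how each column is stored.

```python
import pandas as pd

# Three rows copied from Tables 1-2 and 1-3 (a subset of the columns).
sample = pd.DataFrame({
    "trip_id": [431, 432, 433],
    "bikeid": ["SEA00298", "SEA00195", "SEA00486"],
    "tripduration": [985.935, 926.375, 883.831],
    "from_station_id": ["CBD-06", "CBD-06", "CBD-06"],
    "birthyear": [1960.0, 1970.0, 1988.0],
})

# dtypes reports how pandas stores each column: integer and float columns
# are numeric, while identifier columns such as bikeid are stored as text.
print(sample.dtypes)

# Keep only the columns that pandas treats as numbers.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

On the full dataset the same two calls would be made on the data frame loaded in Listing 1-2.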

Types of variables
Nancy decided to go the extra mile and assigned a data type to each feature in the dataset.

Table 1-4. Nancy’s Approach to Classifying Variables into Data Types

Feature name                                                              Variable type
trip_id, bikeid, tripduration, from_station_id, to_station_id, birthyear  Numbers
starttime, stoptime                                                       Date
from_station_name, to_station_name, usertype, gender                      Text

After looking at the feature classification in Table 1-4, Eric noticed that Nancy had
correctly identified the data types, so it seemed an easy job for him to explain
what variable types mean. Eric recalled explaining the following:

In normal everyday interaction with data we usually represent numbers
as integers, text as strings, True/False as Boolean, etc. These are what
we refer to as data types. But the lingo in machine learning is a bit more
granular, as it splits the data types we knew earlier into variable types.
Understanding these variable types is crucial in deciding upon the type
of charts while doing exploratory data analysis or while deciding upon a
suitable machine learning algorithm to be applied on our data.

Continuous/Quantitative Variables
A continuous variable can have an infinite number of values within a given range. Unlike
discrete variables, they are not countable. Before exploring the types of continuous
variables, let’s understand what is meant by a true zero point.


True Zero Point


If a level of measurement has a true zero point, then a value of 0 means you have nothing.
Take, for example, a ratio variable which represents the number of cupcakes bought. A
value of 0 will signify that you didn’t buy even a single cupcake. The true zero point is a
strong discriminator between interval and ratio variables.
Let’s now explore the different types of continuous variables.

Interval Variables
Interval variables describe data that is continuous in nature and has a numerical
value. Take, for example, the temperature of a neighborhood measured on a daily basis.
The difference between intervals remains constant, such that the difference between 70
Celsius and 50 Celsius is the same as the difference between 80 Celsius and 100 Celsius.
We can compute the mean and median of interval variables; however, they don’t have a
true zero point.

Ratio Variables
Properties of interval variables are very similar to those of ratio variables, with the
difference that in ratio variables a 0 indicates the absence of the quantity being
measured. Take, for example, the distance covered by cars from a certain neighborhood.
Temperature in Celsius is an interval variable, so a value of 0 Celsius does not mean
absence of temperature. However, a value of 0 km means that the car covered no distance,
and thus distance is considered a ratio variable. Moreover, as evident from the name,
ratios of measurements are meaningful as well, such that a distance of 50 km is twice
a distance of 25 km.
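The practical consequence of a true zero point can be shown with a few lines of arithmetic. This is a sketch under one assumption that is not in the text above: the Kelvin scale is brought in purely as an example of a temperature scale that does have a true zero, and the function name celsius_to_kelvin is hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (no true zero) to Kelvin (true zero at absolute zero)."""
    return celsius + 273.15


# Ratio variable: 0 km means no distance, so ratios are meaningful.
distance_ratio = 50.0 / 25.0  # 50 km really is twice 25 km

# Interval variable: dividing Celsius readings gives a number, but it does not
# describe "twice as hot"; the ratio changes once the scale has a true zero.
celsius_ratio = 20.0 / 10.0
kelvin_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Differences, by contrast, are meaningful on both scales.
celsius_diff = 20.0 - 10.0
kelvin_diff = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

print(distance_ratio, celsius_ratio, round(kelvin_ratio, 3))
print(celsius_diff, round(kelvin_diff, 6))
```

The two temperature ratios disagree while the two differences agree, which is exactly the distinction between interval and ratio variables.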

Discrete Variables
A discrete variable has a finite set of values within a given range. Unlike continuous
variables, these values are countable. Let’s look at some examples of discrete variables
which are categorical in nature.

Ordinal Variables
Ordinal variables have values that are in an order from lowest to highest or vice versa.
These levels within ordinal variables can have unequal spacing between them. Take, for
example, the following levels:
1. Primary school
2. High school
3. College
4. University


The difference between primary school and high school in years is definitely not
equal to the difference between high school and college. If these differences were
constant, then this variable would have also qualified as an interval variable.

Nominal Variables
Nominal variables are categorical variables with no intrinsic order, and the
differences between their levels carry no numerical meaning. Examples of nominal
variables can be gender, month of the year, cars released by a manufacturer, and so on.
In the case of month of year, each month is a different level.
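The difference between ordinal and nominal variables can be made concrete with pandas categoricals. The snippet below is a minimal sketch that uses made-up observations; only the education levels and the gender categories come from the text.

```python
import pandas as pd

levels = ["Primary school", "High school", "College", "University"]

# Ordinal: the categories carry an order, so comparisons and min/max work.
education = pd.Series(pd.Categorical(
    ["College", "Primary school", "University", "High school"],
    categories=levels, ordered=True))
print(education.min(), "<", education.max())
print((education > "High school").tolist())

# Nominal: categories without an order; only equality and counting make sense.
gender = pd.Series(pd.Categorical(["Male", "Female", "Male", "Other"]))
print(gender.value_counts().to_dict())
```

Note that even for the ordered variable pandas offers no subtraction between levels, which matches the point that the spacing between ordinal levels is not defined.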

Dichotomous Variables
Dichotomous variables are nominal variables which have only two categories or levels.
Examples include
• Age: under 24 years, above 24 years
• Gender: male, female

Lurking Variable
A lurking variable is not among the explanatory (i.e., independent) or response
(i.e., dependent) variables and yet may influence the interpretation of the relationship
among these variables. For example, suppose we want to predict whether or not an applicant
will get admission to a college on the basis of his/her gender. A possible lurking variable
in this case can be the name of the department the applicant is seeking admission to.

Demographic Variable
Demography (from the Greek word meaning “description of people”) is the study of
human populations. The discipline examines size and composition of populations as well
as the movement of people from locale to locale. Demographers also analyze the effects
of population growth and its control. A demographic variable is a variable that is collected
by researchers to describe the nature and distribution of the sample used with inferential
statistics. Within applied statistics and research, these are variables such as age, gender,
ethnicity, socioeconomic measures, and group membership.

Dependent and Independent Variables


An independent variable is also referred to as an explanatory variable because it is
used to explain or predict the dependent variable, also referred to as a response variable
or outcome variable.
Taking the dataset into consideration, what are the dependent and independent
variables? Let’s say that Cycle Share System’s management approaches you and asks
you to build a system for them to predict the trip duration beforehand so that the supply
of cycles can be ensured. In that case, what is your dependent variable? Definitely
tripduration. And what are the independent variables? Well, these variables will comprise
the features which we believe influence the dependent variable (e.g., usertype, gender,
and the date and time of the trip).
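In code, this choice usually shows up as a split of the data frame into a response column and a set of feature columns. The following is a sketch on a hand-made four-row sample; the names trips, X, and y are hypothetical conventions rather than anything defined in this chapter.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the Trips dataset.
trips = pd.DataFrame({
    "tripduration": [985.935, 926.375, 883.831, 865.937],
    "usertype": ["Member", "Member", "Member", "Member"],
    "gender": ["Male", "Male", "Female", "Female"],
    "starttime": ["10/13/2014 10:31", "10/13/2014 10:32",
                  "10/13/2014 10:33", "10/13/2014 10:34"],
})

dependent = "tripduration"                         # response / outcome
independent = ["usertype", "gender", "starttime"]  # explanatory features

y = trips[dependent]
X = trips[independent]
print(X.shape, y.shape)
```

Keeping the two lists explicit makes it easy to revisit which features are believed to influence the response.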
Eric asked Nancy to classify the features into the variable types he had just explained.

Table 1-5. Nancy’s Approach to Classifying Variables into Variable Types

Feature name                                                              Variable type
trip_id, bikeid, tripduration, from_station_id, to_station_id, birthyear  Continuous
starttime, stoptime                                                       DateTime
from_station_name, to_station_name                                        String
usertype, gender                                                          Nominal

Nancy now had a clear idea of the variable types within machine learning, and also
which of the features qualify for which of those variable types (see Table 1-5). However,
despite looking at the initial observations of each of these features (see Table 1-2), she
couldn’t deduce the depth and breadth of information that each of those features contains.
She mentioned this to Eric, and Eric, being a data analytics guru, had an answer: perform
univariate analysis on features within the dataset.

Univariate Analysis
Univariate comes from the prefix “uni,” meaning one. This is the analysis performed on a
single variable and thus does not account for any sort of relationship among explanatory
variables.
Eric decided to perform univariate analysis on the dataset to better understand the
features in isolation (see Listing 1-4).

Listing 1-4. Determining the Time Range of the Dataset


data = data.sort_values(by='starttime')
data.reset_index()
print 'Date range of dataset: %s - %s'%(data.ix[1, 'starttime'],
data.ix[len(data)-1, 'stoptime'])

Output

Date range of dataset: 10/13/2014 10:32 - 9/1/2016 0:20


Eric knew that Nancy would have a hard time understanding the code, so he decided
to explain the parts that he felt were complex in nature. In regard to the code in Listing
1-4, Eric explained the following:

We started off by sorting the data frame by starttime. Do note that a
data frame is a data structure in pandas, into which we initially loaded
the data in Listing 1-2. A data frame helps arrange the data in a tabular
form and enables quick searching by means of hash values. Moreover,
a data frame comes with handy functions that make lives easier when
doing analysis on data. So what sorting did was to change the position
of records within the data frame, and hence the change in positions
disturbed the arrangement of the indexes which were earlier in an
ascending order. Hence, considering this, we decided to reset the indexes
so that the ordered data frame now has indexes in an ascending order.
Finally, we printed the date range that started from the first value of
starttime and ended with the last value of stoptime.
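Two details of this step are easy to miss in pandas. First, reset_index returns a new data frame by default, so its result has to be assigned for the reset to take effect. Second, sorting date strings orders them alphabetically rather than chronologically. The sketch below shows one way to handle both, assuming a recent pandas version (where the older .ix indexer is no longer available and .loc is used instead) and a hand-made three-row sample; it is not the book's listing.

```python
import pandas as pd

# Hypothetical rows, deliberately out of order.
trips = pd.DataFrame({
    "starttime": ["9/1/2016 0:20", "10/13/2014 10:31", "1/5/2015 8:00"],
    "stoptime": ["9/1/2016 0:45", "10/13/2014 10:48", "1/5/2015 8:20"],
})

# Parse the strings first; sorting raw strings would be alphabetical,
# not chronological.
for column in ["starttime", "stoptime"]:
    trips[column] = pd.to_datetime(trips[column], format="%m/%d/%Y %H:%M")

# sort_values and reset_index both return new frames, so assign the result.
trips = trips.sort_values(by="starttime").reset_index(drop=True)

first_start = trips.loc[0, "starttime"]
last_stop = trips.loc[len(trips) - 1, "stoptime"]
print("Date range of dataset: %s - %s" % (first_start, last_stop))
```

With drop=True the old index is discarded instead of being kept as an extra column.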

Eric’s analysis presented two insights. One is that the data ranges from October 2014
up to September 2016 (i.e., just under two years of data, spread across three calendar
years). Moreover, it seems like the cycle sharing service is usually operational beyond
the standard 9 to 5 business hours.
Nancy believed that short-term pass holders would take more trips than their
counterparts. She believed that most people would use the service on a daily basis rather
than purchasing the long-term membership. Eric thought otherwise; he believed that
new users would be short-term pass holders, but once they tried out the service and
became satisfied they would ultimately take up the membership to receive the perks and
benefits offered. He also believed that people tend to give more weight to services they
have paid for, and they make sure to get the maximum out of each buck spent. Thus, Eric
decided to plot a bar graph of trip frequencies by user type to validate his viewpoint
(see Listing 1-5). But before doing so he made a brief document of the commonly used
charts and the situations each fits best (see Appendix A for a copy). This document gave
Nancy his perspective for choosing a bar graph for the current situation.

Listing 1-5. Plotting the Distribution of User Types


groupby_user = data.groupby('usertype').size()
groupby_user.plot.bar(title = 'Distribution of user types')

[Figure: bar chart titled "Distribution of user types"; x-axis: usertype (Member, Short-Term Pass Holder); y-axis: number of trips]

Figure 1-1. Bar graph signifying the distribution of user types

Nancy didn’t understand the code snippet in Listing 1-5. She was confused by the
functionality of the groupby and size methods. She recalled asking Eric the following: “I can
understand that groupby groups the data by a given field, that is, usertype, in the current
situation. But what do we mean by size? Is it the same as count, that is, does it count the
trips falling within each of the grouped usertypes?”
Eric was surprised by Nancy’s deductions and he deemed them to be correct.
However, the bar graph presented insights (see Figure 1-1) in favor of Eric’s view, as the
members tend to take more trips than their counterparts.
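Nancy's reading of size is right for this use, with one nuance worth knowing in pandas: size counts every row in a group, whereas count only counts non-missing values in a column. The sketch below illustrates the difference on a small made-up sample in which some rows lack a birth year.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: some rows have no birth year recorded.
trips = pd.DataFrame({
    "usertype": ["Member", "Member", "Short-Term Pass Holder",
                 "Short-Term Pass Holder", "Member"],
    "birthyear": [1960.0, 1988.0, np.nan, np.nan, 1971.0],
})

grouped = trips.groupby("usertype")

# size() counts rows in each group, missing values included.
print(grouped.size())

# count() counts non-missing values, column by column.
print(grouped["birthyear"].count())
```

For counting trips per group, size is therefore the safer choice because it does not depend on which column happens to be complete.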
Nancy had recently read an article that talked about the gender gap among
people who prefer riding bicycles. The article mentioned a cycle sharing scheme in the UK
where 77% of the people who used the service were men. She wasn’t sure if a similar
phenomenon exists for people using the service in the United States. Hence Eric came up
with the code snippet in Listing 1-6 to answer the question at hand.

Listing 1-6. Plotting the Distribution of Gender


groupby_gender = data.groupby('gender').size()
groupby_gender.plot.bar(title = 'Distribution of genders')

[Figure: bar chart titled "Distribution of genders"; x-axis: gender (Female, Male, Other); y-axis: number of trips, 0 to 120000]

Figure 1-2. Bar graph signifying the distribution of genders

Figure 1-2 revealed that the gender gap exists in the States as well. Males seem to
dominate the trips taken as part of the program.
Nancy, being a marketing guru, was content with the analysis done so far. However,
she wanted to know more about the target customers at whom the company’s marketing
message would be aimed. Thus Eric decided to come up with the distribution of
birth years by writing the code in Listing 1-7. He believed this would help Nancy
understand the age groups that are most likely to ride a cycle or the ones that are more
prone to use the service.

Listing 1-7. Plotting the Distribution of Birth Years


data = data.sort_values(by='birthyear')
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution of birth years',
figsize = (15,4))

[Figure: bar chart titled "Distribution of birth years"; x-axis: birthyear (1931.0 to 1999.0); y-axis: number of trips, 0 to 14000]

Figure 1-3. Bar graph signifying the distribution of birth years

Figure 1-3 provided a very interesting illustration. The majority of the people who had
subscribed to this program belong to Generation Y (i.e., born in the early 1980s to mid
to late 1990s, also known as millennials). Nancy had recently read the reports published
by Elite Daily and CrowdTwist which said that millennials are the most loyal generation
to their favorite brands. One reason for this is their willingness to share thoughts and
opinions on products/services. These opinions thus form a huge corpus of experiences,
enough information for the millennials to make a conscious decision, a decision they will
remain loyal to for a long period. Hence Nancy was convinced that most millennials
would be members rather than short-term pass holders. Eric decided to populate a bar
graph to see if Nancy’s deduction holds true.

Listing 1-8. Plotting the Frequency of Member Types for Millennials


data_mil = data[(data['birthyear'] >= 1977) & (data['birthyear']<=1994)]
groupby_mil = data_mil.groupby('usertype').size()
groupby_mil.plot.bar(title = 'Distribution of user types')

[Figure: bar chart titled "Distribution of user types"; x-axis: usertype (Member only); y-axis: number of trips, 0 to 120000]

Figure 1-4. Bar graph of member types for millennials

After looking at Figure 1-4 Eric was surprised to see that Nancy’s deduction appeared
to be valid, and Nancy made a note to make sure that the brand engaged millennials as
part of the marketing plan.
Eric knew that more insights can pop up when more than one feature is used as part
of the analysis. Hence, he decided to give Nancy a sneak peek at multivariate analysis
before moving forward with more insights.

Multivariate Analysis
Multivariate analysis refers to the incorporation of multiple explanatory variables to
understand the behavior of a response variable. This seems to be the most feasible
and realistic approach considering the fact that entities within this world are usually
interconnected. Thus the variability in the response variable might be affected by the
variability in the interconnected explanatory variables.
Nancy believed males would dominate females in terms of the trips completed. The
graph in Figure 1-2, which showed that males had completed far more trips than any
other gender type, made her embrace this viewpoint. Eric thought that the best approach
to validate this viewpoint was a stacked bar graph (i.e., a bar graph for birth year, with each
bar split into one color per gender) (see Figure 1-5).

Listing 1-9. Plotting the Distribution of Birth Years by Gender Type


groupby_birthyear_gender = (data.groupby(['birthyear', 'gender'])['birthyear']
                            .count().unstack('gender').fillna(0))
groupby_birthyear_gender[['Male','Female','Other']].plot.bar(
    title = 'Distribution of birth years by Gender', stacked=True, figsize = (15,4))

[Figure: stacked bar chart titled "Distribution of birth years by Gender"; x-axis: birthyear (1931.0 to 1999.0); y-axis: number of trips, 0 to 14000; legend: gender (Male, Female, Other)]

Figure 1-5. Bar graph signifying the distribution of birth years by gender type

The code snippet in Listing 1-9 brought up some new aspects not previously
highlighted.

We first transformed the data frame by unstacking, that is, splitting,
the gender column into three columns, that is, Male, Female, and Other.
This meant that for each of the birth years we had the trip count for all
three gender types. Finally, a stacked bar graph was created by using this
transformed data frame.
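The effect of unstacking is easiest to see on a handful of rows. The following sketch repeats the groupby, count, unstack, and fillna chain from Listing 1-9 on a made-up six-row sample so that the long and wide forms can be compared side by side.

```python
import pandas as pd

# Hypothetical sample with the two columns used in Listing 1-9.
trips = pd.DataFrame({
    "birthyear": [1960.0, 1960.0, 1988.0, 1988.0, 1988.0, 1977.0],
    "gender": ["Male", "Female", "Female", "Female", "Male", "Other"],
})

# Long form: one count per (birthyear, gender) pair.
counts = trips.groupby(["birthyear", "gender"])["birthyear"].count()
print(counts)

# Wide form: one row per birth year, one column per gender.
# Pairs that never occur become NaN, which fillna(0) turns into 0.
wide = counts.unstack("gender").fillna(0)
print(wide)
```

Each row of the wide form corresponds to one bar in the stacked graph, and each column to one colored segment.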

It seemed as if males were dominating the distribution. It made sense as well, didn’t it?
It did: as seen earlier, the majority of the trips were taken by males, which
skewed the distribution in favor of males. However, subscribers born in 1947 were all
females. Moreover, females also dominated among those born in 1964 and 1994. Overall,
Nancy’s hypothesis and reasoning did hold true.
The analysis in Listing 1-8 had revealed that all millennials are members. Nancy was
curious to see what the distribution of user type was for the other age generations. Is it
that the majority of people in the other age generations were short-term pass holders?
Hence Eric brought a stacked bar graph into play yet again (see Figure 1-6).

Listing 1-10. Plotting the Distribution of Birth Years by User Types


groupby_birthyear_user = (data.groupby(['birthyear', 'usertype'])['birthyear']
                          .count().unstack('usertype').fillna(0))

groupby_birthyear_user['Member'].plot.bar(
    title = 'Distribution of birth years by Usertype', stacked=True, figsize = (15,4))

[Figure: bar chart titled "Distribution of birth years by Usertype"; x-axis: birthyear (1931.0 to 1999.0); y-axis: number of trips, 0 to 14000]

Figure 1-6. Bar graph signifying the distribution of birth years by user types

Whoa! Nancy was surprised to see the distribution of only one user type and not
two (i.e., members and short-term pass holders). Does this mean that birth year
information was present for only one user type? Eric decided to dig in further and
validate this (see Listing 1-11).

Listing 1-11. Validation If We Don’t Have Birth Year Available for Short-Term Pass Holders
data[data['usertype']=='Short-Term Pass Holder']['birthyear'].isnull().values.all()

Output

True

In the code in Listing 1-11, Eric first sliced the data frame to consider only short-term
pass holders. Then he went forward to find out if all the values in birth year are
missing (i.e., null) for this slice. Since that is the case, Nancy’s initially inferred hypothesis
was true: birth year data is only available for members. This made her recall her
prior deduction about the brand loyalty of millennials. The output for Listing 1-11
nullifies the deduction Nancy made after the analysis in Figure 1-4, because short-term
pass holders have no birth year and so could never have appeared in that plot. This made
Nancy sad, as the loyalty of millennials can’t be validated from the data at hand. Eric
believed that members have to provide details like birth year when applying for the
membership, something which is not a prerequisite for short-term pass holders. Eric
decided to test his deduction by checking whether gender is available for short-term pass
holders, for which he wrote the code in Listing 1-12.

Listing 1-12. Validation If We Don’t Have Gender Available for Short-Term Pass Holders
data[data['usertype']=='Short-Term Pass Holder']['gender'].isnull().values.all()

Output

True

Thus Eric concluded that we don’t have the demographic variables for user type
‘Short-Term Pass holders’.
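The two checks in Listings 1-11 and 1-12 can also be combined into a single summary of missing values per user type. This is a sketch on a made-up four-row sample that mirrors the pattern found above; the names is_missing and missing_share are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the pattern found in Listings 1-11 and 1-12.
trips = pd.DataFrame({
    "usertype": ["Member", "Short-Term Pass Holder", "Member",
                 "Short-Term Pass Holder"],
    "gender": ["Male", np.nan, "Female", np.nan],
    "birthyear": [1960.0, np.nan, 1988.0, np.nan],
})

# Share of missing values per user type for each demographic column:
# 1.0 means the column is entirely empty for that group.
is_missing = trips[["gender", "birthyear"]].isnull()
missing_share = is_missing.groupby(trips["usertype"]).mean()
print(missing_share)
```

A value strictly between 0 and 1 would point to partially missing data, a case the all() checks above cannot distinguish from fully populated columns.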
Nancy was interested to see how the frequency of trips varies across date and
time (i.e., a time series analysis). Eric was aware that trip start time is given with the data,
but for him to make a time series plot, he had to transform the date from string to
datetime format (see Listing 1-13). He also decided to do more: that is, split the datetime into
date components (i.e., year, month, day, and hour).


Listing 1-13. Converting String to datetime, and Deriving New Features


List_ = list(data['starttime'])

List_ = [datetime.datetime.strptime(x, "%m/%d/%Y %H:%M") for x in List_]


data['starttime_mod'] = pd.Series(List_,index=data.index)
data['starttime_date'] = pd.Series([x.date() for x in List_],index=data.index)
data['starttime_year'] = pd.Series([x.year for x in List_],index=data.index)
data['starttime_month'] = pd.Series([x.month for x in List_],index=data.index)
data['starttime_day'] = pd.Series([x.day for x in List_],index=data.index)
data['starttime_hour'] = pd.Series([x.hour for x in List_],index=data.index)

Eric made sure to explain the piece of code in Listing 1-13 to Nancy:

At first we converted the start time column of the data frame into a list.
Next we converted the string dates into Python datetime objects. We
then stored the list as a pandas Series and extracted the date part of each
datetime object. The time components of year, month, day, and hour
were derived from the list of datetime objects.
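The same derived features can be produced without an intermediate Python list. The sketch below is an alternative, not the book's method: it assumes pandas' to_datetime function and the .dt accessor, and it runs on a made-up three-row sample that uses the same text format as the starttime column.

```python
import pandas as pd

# Hypothetical sample in the same text format as the starttime column.
trips = pd.DataFrame({
    "starttime": ["10/13/2014 10:31", "10/13/2014 10:32", "9/1/2016 0:20"],
})

# Parse the whole column at once instead of looping over a list.
trips["starttime_mod"] = pd.to_datetime(trips["starttime"],
                                        format="%m/%d/%Y %H:%M")

# The .dt accessor exposes the components of each timestamp.
trips["starttime_date"] = trips["starttime_mod"].dt.date
trips["starttime_year"] = trips["starttime_mod"].dt.year
trips["starttime_month"] = trips["starttime_mod"].dt.month
trips["starttime_day"] = trips["starttime_mod"].dt.day
trips["starttime_hour"] = trips["starttime_mod"].dt.hour
print(trips.drop(columns=["starttime"]))
```

The column names match those created in Listing 1-13, so later listings that group by starttime_date would work with either approach.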

Now it was time for the time series analysis of trip duration over all days
provided within the dataset (see Listing 1-14).

Listing 1-14. Plotting the Distribution of Trip Duration over Daily Time
data.groupby('starttime_date')['tripduration'].mean().plot.bar(title =
'Distribution of Trip duration by date', figsize = (15,4))

[Figure: bar chart titled "Distribution of Trip duration by date"; x-axis: starttime_date; y-axis: mean tripduration, 500 to 3000]

Figure 1-7. Bar graph signifying the distribution of trip duration over daily time

Wow! There seems to be a definite pattern of trip duration over time.

Time Series Components


Eric decided to brief Nancy about the types of patterns that exist in a time series analysis.
This he believed would help Nancy understand the definite pattern in Figure 1-7.

Seasonal Pattern
A seasonal pattern (see Figure 1-8) refers to a seasonality effect that recurs after a fixed
known period. This period can be week of the month, week of the year, month of the year,
quarter of the year, and so on. This is the reason why seasonal time series are also referred
to as periodic time series.

[Figure: line plot of a seasonal component]

Figure 1-8. Illustration of seasonal pattern

Cyclic Pattern
A cyclic pattern (see Figure 1-9) is different from a seasonal pattern in that its
patterns repeat over cycles that have no fixed period.

[Figure: line plot; x-axis: Year (1975 to 1995); y-axis: Monthly housing sales (millions)]

Figure 1-9. Illustration of cyclic pattern

Trend
A trend (see Figure 1-10) is a long-term increase or decrease in a continuous variable.
This pattern might not be exactly linear over time, but when smoothing is applied it
resolves into one direction or the other.

[Figure: line plot; x-axis: Day (0 to 100); y-axis: US treasury bill contracts]

Figure 1-10. Illustration of trend
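The interplay between a seasonal pattern and a trend can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the series is generated rather than taken from the Trips dataset, the period of 12 steps is arbitrary, and a simple rolling mean stands in for whatever smoothing method one prefers.

```python
import numpy as np
import pandas as pd

# Synthetic series: a linear trend plus a pattern that repeats every
# 12 steps (the fixed, known period that makes it seasonal).
period = 12
t = np.arange(120)
trend = 0.5 * t
seasonal = 10.0 * np.sin(2 * np.pi * t / period)
series = pd.Series(trend + seasonal)

# Averaging over exactly one period cancels the seasonal swings,
# leaving a smoothed curve that follows the trend.
smoothed = series.rolling(window=period).mean()

# What is left after removing the smoothed trend is the seasonal
# pattern plus a constant offset caused by the trailing window.
remainder = (series - smoothed).dropna()

# Step-to-step change of the smoothed curve, i.e., the slope of the trend.
print(smoothed.dropna().diff().dropna().round(6).unique())
```

Choosing a window equal to the seasonal period is the design choice that matters here; a window of a different length would leave part of the seasonal swing in the smoothed curve.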

Eric decided to test Nancy’s grasp of time series concepts, so he asked her to provide her
thoughts on the time series plot in Figure 1-7. “What do you think of the time series plot?
Is the pattern seasonal or cyclic? Seasonal, is that right?”
Nancy’s reply amazed Eric once again. She said the following:

Yes, it is, because the pattern is repeating over a fixed interval of time,
that is, seasonality. In fact, we can split the distribution into three
distributions. One pattern is the seasonality that is repeating over time.
The second one is a flat density distribution. Finally, the last pattern is
the lines (that is, the hikes) over that density function. In the case of time
series prediction we can make estimations for a future time using each
of these distributions and add them up in order to predict within a calculated
confidence interval.

On the basis of her deduction it seemed like Nancy’s grades in her statistics elective
course had paid off. Nancy wanted answers to many more of her questions. Hence she
decided to challenge the readers with the Exercises that follow.


EXERCISES

1. Determine the distribution of number of trips by year. Do you
see a specific pattern?
2. Determine the distribution of number of trips by month. Do you
see a specific pattern?
3. Determine the distribution of number of trips by day. Do you see
a specific pattern?
4. Determine the distribution of number of trips by hour. Do you see
a specific pattern?
5. Plot a frequency distribution of trips on a daily basis.

Measuring Center of Measure


Eric believed that measures like mean, median, and mode help give a summary view
of the features in question. Taking this into consideration, he decided to walk Nancy
through the concepts of center of measure.

Mean
Mean in layman terms refers to the averaging out of numbers Mean is highly affected by
outliers, as the skewness introduced by outliers will pull the mean toward extreme values.
• Symbol:
• $\mu$ -> Parameter -> population mean
• $\bar{x}$ -> Statistic -> sample mean

• Rules of mean:

• $\mu_{a+bx} = a + b\mu_x$
• $\mu_{x+y} = \mu_x + \mu_y$
We will be using statistics.mean(data) in our coding examples. This will return the
sample arithmetic mean of data, a sequence or iterator of real-valued numbers.
Mean exists in two major variants.


Arithmetic Mean
An arithmetic mean is simpler than a geometric mean as it averages out the numbers
(i.e., it adds all the numbers and then divides the sum by the count of those numbers).
Take, for example, the grades of ten students who appeared in a mathematics test.
78, 65, 89, 93, 87, 56, 45, 73, 51, 81
Calculating the arithmetic mean gives

$$\text{mean} = \frac{78 + 65 + 89 + 93 + 87 + 56 + 45 + 73 + 51 + 81}{10} = 71.8$$

Hence the arithmetic mean of scores taken by students in their mathematics test was
71.8. Arithmetic mean is most suitable in situations when the observations (i.e., math
scores) are independent of each other. In this case it means that the score of one student
in the test won’t affect the score that another student will have in the same test.
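The calculation above can be checked with a short sketch. It uses statistics.mean, the function this chapter relies on, next to a manual sum-and-divide; the variable names are illustrative only, and the print calls assume Python 3.

```python
import statistics

# Grades of the ten students from the example above
grades = [78, 65, 89, 93, 87, 56, 45, 73, 51, 81]

# Arithmetic mean: sum of the values divided by how many values there are
manual_mean = sum(grades) / float(len(grades))
library_mean = statistics.mean(grades)

print('Manual mean: %.1f' % manual_mean)
print('statistics.mean: %.1f' % library_mean)
```

Both lines should report the same value, since statistics.mean performs the same sum-and-divide calculation.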

Geometric Mean
As we saw earlier, arithmetic mean is calculated for observations which are independent
of each other. However, this doesn’t hold true in the case of a geometric mean as it is
used to calculate mean for observations that are dependent on each other. For example,
suppose you invested your savings in stocks for five years. Returns of each year will be
invested back in the stocks for the subsequent year. Consider that we had the following
returns in each one of the five years:
60%, 80%, 50%, -30%, 10%
Are these returns dependent on each other? Well, yes! Why? Because the investment
of the next year is done on the capital garnered from the previous year, such that a loss in
the first year will mean less capital to invest in the next year and vice versa. So, yes, we will
be calculating the geometric mean. But how? We will do so as follows:
$$\left[(0.6 + 1)(0.8 + 1)(0.5 + 1)(-0.3 + 1)(0.1 + 1)\right]^{1/5} - 1 \approx 0.2717$$
Hence, an investment with these returns grows at an average rate of about 27.17% per
year over the five years. Looking at the calculation above, you can see that we first converted
percentages into decimals. Next we added 1 to each of them, which turns every return into a
growth factor so that negative returns can be multiplied like the others. Then we multiplied
all the terms together and applied a power to the result. The power applied was 1 divided by
the number of observations (i.e., five in this case). In the end we subtracted 1 from the result.
Subtraction was done to nullify the effect introduced by the addition of 1, which we did
initially with each term. The subtraction of 1 would not have been done had we not added 1
to each of the terms (i.e., yearly returns).
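The steps just described can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic; the variable names are assumptions made for this sketch and do not come from the book's listings.

```python
# Yearly returns from the example above, expressed as decimals
yearly_returns = [0.60, 0.80, 0.50, -0.30, 0.10]

# Turn each return into a growth factor (1 + return) and compound them
compound_growth = 1.0
for yearly_return in yearly_returns:
    compound_growth *= (1 + yearly_return)

# Take the n-th root of the compounded growth, then remove the added 1
n = len(yearly_returns)
geometric_mean_return = compound_growth ** (1.0 / n) - 1

print('Compounded growth factor: %.4f' % compound_growth)
print('Geometric mean return: %.4f' % geometric_mean_return)
```

Compounding the geometric mean return for five years reproduces the compounded growth factor, which is why it is the appropriate average when each year's capital depends on the previous year's result.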


Median
Median is a measure of central location alongside mean and mode, and it is less affected
by the presence of outliers in your data. When the number of observations in the data is
odd, the middle data point is returned as the median.
In this chapter we will use statistics.median(data) to calculate the median. This
returns the median (middle value) of numeric data if the number of values is odd, and
otherwise the mean of the two middle values if the number of values is even (the “mean of
middle two” method). If data is empty, StatisticsError is raised.
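A minimal sketch of the three behaviors described above (odd count, even count, and empty data) follows. The sample values are made up for illustration.

```python
import statistics

odd_count = [45, 51, 56, 65, 73]        # five values: the middle one is returned
even_count = [45, 51, 56, 65, 73, 78]   # six values: mean of the middle two

median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)
print('Median (odd count): %s' % median_odd)
print('Median (even count): %s' % median_even)

# An empty sequence raises StatisticsError
try:
    statistics.median([])
    empty_raised = False
except statistics.StatisticsError:
    empty_raised = True
print('Empty data raised StatisticsError: %s' % empty_raised)
```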

Mode
Mode is suitable for data which is discrete or nominal in nature. Mode returns the
observation in the dataset with the highest frequency. Mode remains unaffected by the
presence of outliers in data.
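As a small illustration, the sketch below applies statistics.mode to nominal and discrete data. The station labels and durations are hypothetical values chosen for this example, and each sample has a single most frequent value.

```python
import statistics

# Hypothetical nominal data: starting station of six trips
stations = ['Station A', 'Station B', 'Station A',
            'Station C', 'Station A', 'Station B']
busiest_station = statistics.mode(stations)

# Hypothetical discrete data with one extreme value at the end
durations = [10, 10, 10, 12, 15, 10000]
mode_duration = statistics.mode(durations)
mean_duration = statistics.mean(durations)

print('Most frequent station: %s' % busiest_station)
print('Mode of durations: %s' % mode_duration)
print('Mean of durations: %.2f' % mean_duration)
```

The extreme value moves the mean a long way but leaves the mode where it was.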

Variance
Variance represents variability of data points about the mean. A high variance means
that the data is highly spread out with a small variance signifying the data to be closely
clustered.
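1. Symbol: $\sigma_x^2$
2. Formula:

a. $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$ (sample variance)

b. $\sigma_x^2 = \sum (x_i - \mu_x)^2 \, p_i$ (variance of a variable whose values $x_i$ occur with probabilities $p_i$)

3. Why n-1 beneath variance calculation? Measured around the sample mean, the
squared deviations average out to be smaller than the population variance; hence,
the sum is divided by the degrees of freedom (n-1) instead of n to correct for this.
4. Rules of variance:

i. $\sigma_{a+bx}^2 = b^2 \sigma_x^2$

ii. $\sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2$ (if X and Y are independent variables)

$\sigma_{x-y}^2 = \sigma_x^2 + \sigma_y^2$

iii. $\sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2 + 2r\sigma_x\sigma_y$ (if X and Y have correlation r)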


We will be incorporating statistics.variance(data, xbar=None) to calculate variance
in our coding exercises. This will return the sample variance of data, an iterable of at least
two real-valued numbers.
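To make the n-1 divisor concrete, the sketch below computes the variance by hand with both divisors and compares the results with statistics.variance (sample) and statistics.pvariance (population). The data values are illustrative.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data with a mean of 5
n = len(values)
mean = sum(values) / float(n)
squared_deviations = sum((x - mean) ** 2 for x in values)

sample_variance = squared_deviations / (n - 1)    # divides by n - 1
population_variance = squared_deviations / n      # divides by n

print('Manual sample variance: %.4f' % sample_variance)
print('statistics.variance: %.4f' % statistics.variance(values))
print('Manual population variance: %.4f' % population_variance)
print('statistics.pvariance: %.4f' % statistics.pvariance(values))
```

Dividing by n-1 always gives the larger of the two figures, which compensates for measuring deviations from the sample mean rather than the population mean.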

Standard Deviation
Standard deviation, just like variance, also captures the spread of data about the mean.
The only difference is that it is the square root of the variance. This enables it to have the
same unit as that of the data and thus provides convenience in inferring explanations
from insights. Standard deviation is highly affected by outliers and skewed distributions.
• Symbol: $\sigma$
• Formula: $\sigma = \sqrt{\sigma^2}$
We measure standard deviation instead of variance because
• It is the natural measure of spread in a Normal distribution
• Same units as original observations

Changes in Measure of Center Statistics due to Presence of Constants
Let’s evaluate how measure of center statistics behave when data is transformed by
the introduction of constants. We will evaluate the outcomes for mean, median, IQR
(interquartile range), standard deviation, and variance. Let’s first start with what behavior
each of these exhibits when a constant “a” is added to or subtracted from each observation.
Addition: Adding a
• $\bar{x}_{new} = a + \bar{x}$
• $\text{median}_{new} = a + \text{median}$
• $\text{IQR}_{new} = \text{IQR}$
• $\sigma_{new} = \sigma$
• $\sigma_{new}^2 = \sigma^2$
Adding a constant to each of the observations shifted the mean and the median by that
constant. However, the IQR, standard deviation, and variance remained unaffected, because
they measure spread rather than location. Note that the same behavior will come through
when a constant is subtracted from each observation. Let’s see if the same behavior will
repeat when we multiply each observation within the data by a constant (i.e., “b”).


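Multiplication: Multiplying b
• $\bar{x}_{new} = b\bar{x}$
• $\text{median}_{new} = b \cdot \text{median}$
• $\text{IQR}_{new} = |b| \cdot \text{IQR}$
• $\sigma_{new} = |b|\sigma$
• $\sigma_{new}^2 = b^2\sigma^2$
Wow! Multiplying each observation within the data by a constant changed all five
statistics. (The spread measures scale by the absolute value of b, so they stay non-negative
even when b is negative.) Do note that you will achieve the same effect when all
observations within the data are divided by a constant term.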
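The effect of adding and multiplying constants can be checked numerically. The sketch below is a stdlib-only illustration: summarize is a hypothetical helper written for this example, the data and constants are made up, and statistics.quantiles requires Python 3.8 or later. It prints every statistic before and after each transformation so the behavior can be read off directly.

```python
import statistics

def summarize(values):
    """Return mean, median, IQR, standard deviation, and variance."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return {
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'iqr': q3 - q1,
        'std': statistics.stdev(values),
        'var': statistics.variance(values),
    }

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative data
a, b = 10.0, 3.0                                     # illustrative constants

original = summarize(values)
shifted = summarize([x + a for x in values])   # constant added
scaled = summarize([x * b for x in values])    # constant multiplied

for name in ('mean', 'median', 'iqr', 'std', 'var'):
    print('%-6s original=%8.3f  shifted=%8.3f  scaled=%8.3f'
          % (name, original[name], shifted[name], scaled[name]))
```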
After going through the description of center of measures, Nancy was interested in
understanding the trip durations in detail. Hence Eric came up with the idea to calculate
the mean and median trip durations. Moreover, Nancy wanted to determine the station
from which most trips originated in order to run promotional campaigns for existing
customers. Hence Eric decided to determine the mode of ‘from_station_name’ field.

■■Note Determining the measures of centers using the statistics package will require us
to transform the input data structure to a list type.

Listing 1-15. Determining the Measures of Center Using Statistics Package


import statistics

# The statistics package expects plain Python lists
trip_duration = list(data['tripduration'])
station_from = list(data['from_station_name'])
print('Mean of trip duration: %f' % statistics.mean(trip_duration))
print('Median of trip duration: %f' % statistics.median(trip_duration))
print('Mode of station originating from: %s' % statistics.mode(station_from))

Output

Mean of trip duration: 1202.612210
Median of trip duration: 633.235000
Mode of station originating from: Pier 69 / Alaskan Way & Clay St

The output of Listing 1-15 revealed that most trips originated from Pier 69/Alaskan
Way & Clay St station. Hence this was the ideal location for running promotional
campaigns targeted to existing customers. Moreover, the output showed the mean to
be greater than the median. Nancy was curious as to why the average (i.e., mean)
is greater than the central value (i.e., median). On the basis of what she had read, she
realized that this might be either due to some extreme values after the median or due to
the majority of values lying after the median. Eric decided to plot a distribution of the trip
durations (see Listing 1-16) in order to determine which premise holds true.


Listing 1-16. Plotting Histogram of Trip Duration


# Assumes plt refers to matplotlib.pyplot
data['tripduration'].plot.hist(
    bins=100, title='Frequency distribution of Trip duration')
plt.show()

[Figure: histogram titled “Frequency distribution of Trip duration”; x-axis: trip duration (0 to 30000), y-axis: Frequency (0 to 80000)]

Figure 1-11. Frequency distribution of trip duration

The distribution in Figure 1-11 has only one peak (i.e., mode). The distribution is
not symmetric and has a long tail toward the right-hand side of the mode. The
extreme values toward the right are negligible in quantity, but their extreme nature tends
to pull the mean toward themselves. This is the reason why the mean is greater than the
median.
The distribution in Figure 1-11 is therefore not a normal distribution, whose symmetric
shape is described next.

The Normal Distribution


Normal distribution, or in other words Gaussian distribution, is a continuous probability
distribution that is bell shaped. The important characteristic of this distribution is that
the mean lies at the center of this distribution with a spread (i.e., standard deviation)
around it. The majority of the observations in normal distribution lie around the mean
and fade off as they distance away from the mean. Some 68% of the observations lie
within 1 standard deviation from the mean; 95% of the observations lie within 2 standard
deviations from the mean, whereas 99.7% of the observations lie within 3 standard
deviations from the mean. A normal distribution with a mean of zero and a standard
deviation of 1 is referred to as a standard normal distribution. Figure 1-12 shows normal
distribution along with confidence intervals.


[Figure: bell curve titled “Normal distribution with confidence intervals”; x-axis from –4 to 4, y-axis from 0.00 to 0.40; the central region is labeled 0.683 and each region beside it is labeled 0.159]

Figure 1-12. Normal distribution and confidence levels

These are the most common confidence levels:

Confidence level    Formula
68%                 Mean ± 1 std.
95%                 Mean ± 2 std.
99.7%               Mean ± 3 std.
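The percentages in the table can be reproduced from the normal distribution itself. The sketch below uses the standard identity that the probability of a normal value falling within k standard deviations of the mean equals erf(k / √2); coverage_within is a hypothetical helper name used only for this illustration.

```python
import math

def coverage_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print('Within %d std: %.1f%%' % (k, 100 * coverage_within(k)))

# Area in one tail beyond 1 standard deviation
one_tail = (1 - coverage_within(1)) / 2
print('One tail beyond 1 std: %.3f' % one_tail)
```

The rounded figures line up with the 68%, 95%, and 99.7% levels above, and the one-tail area matches the 0.159 label in Figure 1-12.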

Skewness
Skewness is a measure of the lack of symmetry. The normal distribution shown
previously is symmetric and thus has no element of skewness. Two types of skewness
exist (i.e., positive and negative skewness).


[Figure: three frequency curves, (a) Negatively skewed (tail in the negative direction), (b) Normal (no skew), and (c) Positively skewed (tail in the positive direction), each marking the positions of the mean, median, and mode; the normal curve represents a perfectly symmetrical distribution]

Figure 1-13. Skewed and symmetric normal distributions

As seen from Figure 1-13, a relationship exists among measure of centers for each
one of the following variations:
• Symmetric distributions: Mean = Median = Mode
• Positively skewed: Mean > Median > Mode
• Negatively skewed: Mean < Median < Mode
Going through Figure 1-13 you will realize that the distribution in panel (c) has
a long tail on its right. This might be due to the presence of outliers.
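The relationship between skewness and the measures of center can be explored with a small experiment. The sketch below is an illustration only: skewness is a hypothetical helper implementing the moment coefficient of skewness (mean cubed deviation divided by the cubed standard deviation), and the three samples are made up.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / std cubed."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * std ** 3)

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]    # one large value on the right
left_tailed = [-x for x in right_tailed]    # mirror image of the sample above

for name, sample in [('symmetric', symmetric),
                     ('right tailed', right_tailed),
                     ('left tailed', left_tailed)]:
    print('%-12s skewness=%6.2f  mean=%6.2f  median=%6.2f'
          % (name, skewness(sample),
             statistics.mean(sample), statistics.median(sample)))
```

A positive coefficient indicates a tail on the right and a negative one a tail on the left; comparing the mean and median columns shows which way the tail pulls the mean.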

Outliers
Outliers refer to the values distinct from the majority of the observations. These occur
naturally, due to equipment failure, or because of entry mistakes.
In order to understand what outliers are, we need to look at Figure 1-14.

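[Figure: box plot marking the minimum value in the data, the 25th percentile (Q1), the median (Q2), the 75th percentile (Q3), and the maximum value in the data; the box spans the interquartile range (IQR) and potential outliers lie beyond the whiskers; the lower whisker is annotated as Maximum (Minimum Value in the Data, Q1 – 1.5*IQR)]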

Figure 1-14. Illustration of outliers using a box plot


From Figure 1-14 we can see that the observations lying outside the whiskers are
referred to as the outliers.

Listing 1-17. Interval of Values Not Considered Outliers


[Q1 – 1.5 (IQR), Q3 + 1.5 (IQR)] (where IQR = Q3 – Q1)

Values not lying within this interval are considered outliers. Knowing the values of
Q1 and Q3 is fundamental for this calculation to take place.
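The interval can be sketched with the standard library alone. In the example below, outlier_bounds is a hypothetical helper and the durations are made-up values; statistics.quantiles requires Python 3.8 or later, and its 'inclusive' method interpolates linearly between data points.

```python
import statistics

def outlier_bounds(values, k=1.5):
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

durations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical values
lower, upper = outlier_bounds(durations)

# Anything outside the interval is flagged as an outlier
outliers = [x for x in durations if x < lower or x > upper]
outlier_percent = 100.0 * len(outliers) / len(durations)

print('Interval: [%.2f, %.2f]' % (lower, upper))
print('Outliers: %s (%.1f percent)' % (outliers, outlier_percent))
```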
Is the presence of outliers good in the dataset? Usually not! So, how are we going to
treat the outliers in our dataset? Following are the most common methods for doing so:
• Remove the outliers: This is only possible when the proportion of
outliers to meaningful values is quite low, and the data values are
not on a time series scale. If the proportion of outliers is high, then
removing these values will hurt the richness of data, and models
applied won’t be able to capture the true essence that lies within.
However, in case the data is of a time series nature, removing
outliers from the data won’t be feasible, the reason being that for
a time series model to train effectively, data should be continuous
with respect to time. Removing outliers in this case will introduce
breaks within the continuous distribution.
• Replace outliers with means: Another way to approach this is
to calculate the mean of the values lying within the interval shown in
Figure 1-14 and use it to replace the
outliers. This will successfully bring the outliers in line with
the valid observations; however, this will remove the anomalies
that were otherwise present in the dataset, and their findings
could present interesting insights.
• Transform the outlier values: Another way to cope with outliers
is to limit them to the upper and lower boundaries of acceptable
data. The upper boundary can be calculated by plugging in the
values of Q3 and IQR into Q3 + 1.5IQR and the lower boundary
can be calculated by plugging in the values of Q1 and IQR into
Q1 – 1.5IQR.
• Variable transformation: Transformations are used to convert the
inherited distribution into a normal distribution. Outliers bring
non-normality to the data and thus transforming the variable can
reduce the influence of outliers. Methodologies of transformation
include, but are not limited to, natural log, conversion of data into
ratio variables, and so on.
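Three of the treatments listed above (mean replacement, limiting values to the boundaries, and a natural log transformation) can be sketched as follows. This is a stdlib-only illustration under stated assumptions: the helper names and data are hypothetical, statistics.quantiles needs Python 3.8 or later, and the log transformation only applies to strictly positive values.

```python
import math
import statistics

def iqr_bounds(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR) for the given values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def replace_with_mean(values, lower, upper):
    """Replace values outside [lower, upper] with the mean of the values inside it."""
    inlier_mean = statistics.mean(x for x in values if lower <= x <= upper)
    return [x if lower <= x <= upper else inlier_mean for x in values]

def cap_outliers(values, lower, upper):
    """Limit every value to the interval [lower, upper]."""
    return [min(max(x, lower), upper) for x in values]

durations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical values
lower, upper = iqr_bounds(durations)

mean_replaced = replace_with_mean(durations, lower, upper)
capped = cap_outliers(durations, lower, upper)
log_transformed = [math.log(x) for x in durations]   # requires x > 0

print('Mean replaced: %s' % mean_replaced)
print('Capped: %s' % capped)
print('Largest log value: %.3f' % max(log_transformed))
```

Mean replacement and capping change only the flagged values, whereas the log transformation rescales every observation and compresses the gap between the extreme value and the rest.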
Nancy was curious to find out whether outliers exist within our dataset—more
precisely in the tripduration feature. For that Eric decided to first create a box plot
(see Figure 1-15) by writing code in Listing 1-18 to see the outliers visually and then
checked the same by applying the interval calculation method in Listing 1-19.


Listing 1-18. Plotting a Box plot of Trip Duration


box = data.boxplot(column=['tripduration'])
plt.show()

[Figure: box plot of tripduration; y-axis from 0 to 30000]

Figure 1-15. Box plot of trip duration

Nancy was surprised to see a huge number of outliers in trip duration from the box
plot in Figure 1-15. She asked Eric if he could determine the proportion of trip duration
values which are outliers. She wanted to know if outliers are a tiny or majority portion of
the dataset. For that Eric wrote the code in Listing 1-19.

Listing 1-19. Determining Ratio of Values in Observations of tripduration Which Are Outliers

q75, q25 = np.percentile(trip_duration, [75, 25])
iqr = q75 - q25
print('Proportion of values as outlier: %f percent' % (
    (len(data) - len([x for x in trip_duration if q75+(1.5*iqr)
    >=x>= q25-(1.5*iqr)]))*100/float(len(data))))

Output

Proportion of values as outlier: 9.548218 percent


Eric explained the code in Listing 1-19 to Nancy as follows:

As seen in Figure 1-14, Q3 refers to the 75th percentile and Q1 refers
to the 25th percentile. Hence we use the numpy.percentile() method
to determine the values for Q1 and Q3. Next we compute the IQR by
subtracting Q1 from Q3. Then we determine the subset of values by
applying the interval as specified in Listing 1-17. We then used the
formula in Listing 1-20 to get the number of outliers.

Listing 1-20. Formula for Calculating Number of Outliers


Number of outliers values = Length of all values - Length of all non outliers values

In our code, len(data) determines Length of all values and Length of all non outliers
values is determined by len([x for x in trip_duration if q75+(1.5*iqr) >=x>=
q25-(1.5*iqr)]).
The formula in Listing 1-21 was then applied to calculate the ratio of values
considered outliers.

Listing 1-21. Formula for Calculating Ratio of Outlier Values


Ratio of outliers = ( Number of outliers values / Length of all values ) * 100

Nancy was relieved to see only 9.5% of the values within the dataset to be outliers.
Considering the time series nature of the dataset she knew that removing these outliers
wouldn’t be an option. Hence she knew that the only option she could rely on was to
apply transformation to these outliers to negate their extreme nature. However, she was
interested in observing the mean of the non-outlier values of trip duration. This she then
wanted to compare with the mean of all values calculated earlier in Listing 1-15.

Listing 1-22. Calculating the Mean of Non-Outlier Observations Lying Within tripduration


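# Mean of the values that fall inside the non-outlier interval
mean_trip_duration = np.mean([x for x in trip_duration if q75+(1.5*iqr)
    >=x>= q25-(1.5*iqr)])
# Upper whisker boundary: Q3 + 1.5*IQR
upper_whisker = q75+(1.5*iqr)
print('Mean of trip duration: %f' % mean_trip_duration)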

Output

Mean of trip duration: 711.726573


The mean of non-outlier trip duration values in Listing 1-22 (i.e., approximately 712)
is considerably lower than that calculated in the presence of outliers in Listing 1-15 (i.e.,
approximately 1,203). This best describes the notion that mean is highly affected by the
presence of outliers in the dataset.
Nancy was curious as to why Eric initialized the variable upper_whisker given that
it is not used anywhere in the code in Listing 1-22. Eric had a disclaimer for this:
“upper_whisker is the maximum value of the right (i.e., positive) whisker, i.e., the boundary
up to which all values are valid and beyond which any value is considered an outlier. You
will soon understand why we initialized it over here.”
Eric was interested to see the outcome statistics once the outliers were transformed
into valid value sets. Hence he decided to start with a simple outlier transformation to the
mean of valid values calculated in Listing 1-22.

Listing 1-23. Calculating Mean Scores for Observations Lying Within tripduration
def transform_tripduration(x):
    # Replace values above the upper whisker with the non-outlier mean
    if x > upper_whisker:
        return mean_trip_duration
    return x

data['tripduration_mean'] = data['tripduration'].apply(transform_tripduration)

data['tripduration_mean'].plot.hist(
    bins=100, title='Frequency distribution of mean transformed Trip duration')
plt.show()

Eric remembers walking Nancy through the code in Listing 1-23.

We initialized a function by the name of transform_tripduration.
The function will check if a trip duration value is greater than the upper
whisker boundary value, and if that is the case it will replace it with the
mean. Next we add tripduration_mean as a new column to the data
frame. We did so by custom modifying the already existing tripduration
column by applying the transform_tripduration function.

Nancy was of the opinion that the transformed distribution in Figure 1-16 is a
positively skewed distribution. Comparing Figure 1-16 to Figure 1-11 reveals that
the skewness has now decreased to a great extent after the transformation. Moreover,
the majority of the observations have a tripduration of 712 primarily because all values
greater than the upper whisker boundary are now converted into the mean of the
non-outlier values calculated in Listing 1-22. Nancy was now interested in understanding how
the measures of center appear for this transformed distribution. Hence Eric came up with
the code in Listing 1-24.


Listing 1-24. Determining the Measures of Center in Absence of Outliers

print('Mean of trip duration: %f' % data['tripduration_mean'].mean())
print('Standard deviation of trip duration: %f' % data['tripduration_mean'].std())
print('Median of trip duration: %f' % data['tripduration_mean'].median())

Output

Mean of trip duration: 711.726573
Standard deviation of trip duration: 435.517297
Median of trip duration: 633.235000

[Figure: two histograms, titled “Frequency distribution of z transformed Trip duration” (x-axis –2 to 14, Frequency 0 to 80000) and “Frequency distribution of mean transformed Trip duration” (x-axis 0 to 2500, Frequency 0 to 30000)]

Figure 1-16. Frequency distribution of mean transformed trip duration


Nancy was expecting the mean to appear the same as that in Listing 1-22 because
of the mean transformation of the outlier values. In Figure 1-16 she knew that the hike at
711.7 is the mode, which meant that after the transformation the mean is the same as that
of the mode. The thing that surprised her the most was that the median is approaching the
mean, which means that the positive skewness we saw in Figure 1-16 is not that strong.
On the basis of the findings in Figure 1-1, Nancy knew that males dominate females
in terms of trips taken. She was hence interested to see the trip duration of males and
repeat the outlier treatment for them as well. Hence she came up with these exercise
questions for you in the hopes of gaining further insights.

EXERCISES

1. Find the mean, median, and mode of the trip duration of gender
type male.
2. By looking at the numbers obtained earlier, in your opinion is
the distribution symmetric or skewed? If skewed, then is it
positively skewed or negatively skewed?
3. Plot a frequency distribution of trip duration for trips availed by
gender type male. Does it validate the inference you made
in the previous question?
4. Plot a box plot of the trip duration of trips taken by males. Do
you think any outliers exist?
5. Apply the formula in Listing 1-21 to determine the percentage of
observations that are outliers.
6. Perform the treatment of outliers by incorporating one of the
methods we discussed earlier for the treatment of outliers.

The multivariate analysis that Nancy and Eric had performed had yielded some good
insights. However, Nancy was curious to know if some statistical tests exist to determine
the strength of the relationship between two variables. She wanted to use this information
to determine the features which have the most impact on trip duration. The concept of
correlation popped up in Eric’s mind, and he decided to share his knowledge base before
moving on further with the analysis.


Correlation
Correlation refers to the strength and direction of the relationship between two
quantitative features. A correlation value of 1 means a perfect correlation in the positive
direction, whereas a correlation value of -1 means a perfect correlation in the negative
direction. A value of 0 means no linear correlation between the quantitative features. Please
note that correlation doesn’t imply causation; that is, a change in one entity doesn’t
necessarily cause a change in the other one.
Correlation of an attribute to itself will imply a correlation value of 1. Many machine
learning algorithms fail to provide optimum performance because of the presence of
multicollinearity. Multicollinearity refers to the presence of correlations among the
features of choice, and thus it is usually recommended to review all pair-wise correlations
among the features of a dataset before considering them for analysis.
Following are the most common types of correlations:

Pearson R Correlation
Pearson R correlation is the most common of the three and is usually suitable to calculate
the relationships between two quantitative variables which are linearly related and seem
to be normally distributed. Take, for example, two securities in the stock market closely
related to one another and examine the degree of relationship between them.

Kendall Rank Correlation


As compared to Pearson, which is suitable for normally distributed data, Kendall
rank correlation is a non-parametric test to determine the strength and direction
of relationship between two quantitative features. Non-parametric techniques do not
assume that the data follow a particular distribution such as the normal distribution.
To the contrary, parametric techniques assume a specific distribution, most commonly
the normal distribution.


Spearman Rank Correlation


Spearman rank correlation is a non-parametric test just like Kendall rank correlation,
with the difference that Spearman rank correlation is the Pearson correlation of the ranked
values, whereas Kendall rank correlation compares the ordering of pairs of observations.
Spearman rank correlation is most suitable for ordinal data.
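To see how the three coefficients differ, the sketch below implements each one in plain Python on a made-up dataset in which y rises with x but not along a straight line. The function names are hypothetical, the rank helper assumes there are no tied values, and the Kendall version shown is the simple tau-a form; a production analysis would more likely rely on a library routine.

```python
import math

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank positions starting at 1 (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / total pairs, no ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2.0)

x = [1, 2, 3, 4, 5, 6]
y = [value ** 3 for value in x]   # increases with x, but not linearly

print('Pearson:  %.3f' % pearson(x, y))
print('Spearman: %.3f' % spearman(x, y))
print('Kendall:  %.3f' % kendall(x, y))
```

Because the rank-based measures only look at ordering, they treat this relationship as perfect, while Pearson reports something short of 1 because the points do not lie on a straight line.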
Nancy was interested to see if change in age brings a linear change to trip duration.
For that Eric decided to bring Pearson R correlation into practice and decided to make a
scatter plot between the two quantities for them to see the relationship visually.

Listing 1-25. Pairplot of trip duration and age


data = data.dropna()
seaborn.pairplot(data, vars=['age', 'tripduration'], kind='reg')
plt.show()

[Figure: pairplot of age and tripduration with fitted regression lines; age axis from 10 to 90, tripduration axis from –5000 to 30000]

Figure 1-17. Pairplot between trip duration and age


While looking at Figure 1-17, Nancy didn’t find any definitive pattern between trip
duration and age. There is a minor positive correlation, as explained in Figure 1-18.

[Figure: three scatter plots, each with an x-axis and a y-axis starting at 0, titled Positive Correlation, Negative Correlation, and No Correlation]

Figure 1-18. Correlation directions

Nancy knew that a perfect positive correlation meant a value of 1; hence she wanted
to see if the correlation value between age and tripduration is positive and approaches 1
or not. Eric wrote the code in Listing 1-26 to make it possible.

Listing 1-26. Correlation Coefficient Between trip duration and age


pd.set_option('display.width', 100)
# Use the fully qualified option name; a bare 'precision' can be ambiguous in newer pandas
pd.set_option('display.precision', 3)

data['age'] = data['starttime_year'] - data['birthyear']

correlations = data[['tripduration','age']].corr(method='pearson')
print(correlations)

Output

                tripduration    age
tripduration    1.000           0.058
age             0.058           1.000

The correlation coefficient came out to be greater than 0 which according to Nancy’s
deduction was a positive correlation, but being much less than 1 meant it to be weak in
nature.
Nancy was aware that a simple analysis meant taking a feature into consideration
and analyzing it. Another more complex method was to split the feature into its categories
(e.g., splitting gender into male and female) and then performing the analysis on both
these chunks separately. She was confused as to which was the right approach and thus
asked Eric for his opinion. Eric thought of introducing the concept of t-statistics and came
up with a small demonstration for Nancy.

they had no dispute, nor once repented their bargain in a year and a
day, the steward ready to deliver it, asked where they would put it;
the husband produced a bag, and told him, in that. That, answered
the steward, is not big enough to hold it. So I told my wife, replied
the good man; and I believe we have had a hundred words about it.
Ay, said the steward, but they were not such as will butter any
cabbage to eat with this bacon; and so hung the flitch up again.
417. Two gentlemen, one named Chambers, the other Garret, riding
by Tyburn, said the first, This is a very pretty tenement, if it had but
a Garret. You fool, said Garret, don’t you know there must be
Chambers first?
418. Two gentlemen, one named Woodcock, the other Fuller,
walking together, happened to see an owl; said the last, That bird is
very much like a Woodcock. You are very wrong, said the first, for
it’s Fuller in the head, Fuller in the eyes, and Fuller all over.
419. An arch boy having taken notice of his schoolmaster’s often
reading a chapter in Corinthians, wherein is this sentence, ‘We shall
all be changed in the twinkling of an eye,’ privately erased the letter
c in the word changed. The next time the master read it, we shall all
be hanged in the twinkling of an eye.
420. A certain great man, who had been a furious party man, and
most surprisingly changed sides, by which he obtained a coronet,
was soon after at cards at a place where Lady T—nd was, and
complaining in the midst of the game, that he had a great pain in his
side, I thought your lordship had no side, said she.
421. A gentleman living in Jamaica, not long ago, had a wife not of
the most agreeable humour in the world; however, as an indulgent
husband, he had bought her a fine pad, which soon after gave her a
fall that broke her neck. Another gentleman in the same
neighbourhood, blessed likewise with a termagant spouse, asked the
widower, if he would sell his wife’s pad, for he had a great fancy for
it, and he would give him what he would for it. No, said the other, I
don’t care to sell it, for I am not sure that I shan’t marry again.
422. A scholar of Dr. Busby’s coming into a parlour where the doctor
had laid a fine bunch of grapes for his own eating, took it up and
said aloud, I publish the banns between these grapes and my
mouth; if any one knows any just cause or impediment why these
two should not be joined together, let them declare it. The doctor,
being but in the next room, overheard all that was said, and coming
into the school, he ordered the boy who had eaten his grapes to be
taken up, or, as they called it, horsed on another boy’s back; but
before he proceeded to the usual discipline, he cried out aloud, as
the delinquent had done: I publish the banns between my rod and
this boy’s breech, if any one knows any just cause or impediment
why these two should not be joined together, let them declare it. I
forbid the banns, cried the boy. Why so? said the doctor. Because
the parties are not agreed, replied the boy. Which answer so pleased
the doctor, who loved to find any readiness of wit in his scholars,
that he ordered the boy to be set down.
423. The late Sir Robert Henley, who was commonly pretty much in
debt, walking one day with two or three other gentlemen in the
Park, was accosted by a tradesman, who took him aside for a minute
or two, and when the baronet rejoined his company, he seemed to
be in a great passion, which his friends taking notice of, asked him
what was the matter? Why the rascal, said he, has been dunning me
for money I have owed him these seven years, with as much
impudence as if it was a debt of yesterday.
424. The late Mr. D—t, the player, a man of great humanity, as will
appear by the story, having heard that his landlady’s maid had cut
her throat with one of his razors, of which an account was brought
to him behind scenes at the time of the play; D—t, with great
concern and emotion, cried out, Zoons, I hope it was not with my
best razor!
425. Joe Haines, the player, being asked what could transport Mr.
Collier into so blind a zeal for the general suppression of the stage,
when only some particular authors had abused it; whereas the
stage, he could not but know, was generally allowed, when rightly
conducted, to be a delightful method of mending the morals? For
that reason, replied Haines; Collier is, by profession, a moral-mender
himself, and two of a trade, you know, can never agree.
426. Some gentlemen being at a tavern together, for want of better
diversion, one proposed play; but, said another of the company, I
have fourteen good reasons against gaming. What are they? said
another. In the first place, answered he, I have no money. Oh! said
the other, if you had four hundred reasons, you need not name
another.
427. A parson, in the country, taking his text from St. Matthew,
chap. viii. 14, ‘And Peter’s wife’s mother lay sick of a fever,’ preached
for three Sundays together on the same subject. Soon after, two
country fellows going across the church-yard, and hearing the bell
toll, one asked the other, who it was for? Nay, I can’t tell you;
perhaps, replied he, it is for Peter’s wife’s mother, for she has been
sick of a fever these three weeks.
428. The Hon. Mr. L— one morning, at the late Sir Robert Walpole’s
levee, as I sat by them, asked John Lawton for a pinch of snuff, who
told him he had none in his box, for he seldom took any, but now
and then to keep him awake at church. That, said the other, is the
most improper thing you can do there; for it quite destroys the
natural operation of the sermon.
429. I remember in the reign of the late Queen Anne, when disputes
ran high between Whig and Tory, some persons suffered party to
mix in every their minutest action. A Tory would not cock his hat in
the same manner that a Whig did, nor a Whig lady patch her face on
the same side that the Tory ladies patched theirs. A pleasant
instance of this strict adherence to party in trivial affairs, was Dick
W—l, who, being sent to parliament on the Tory interest, was resolved
to do nothing but what was on that side. The house, a few days
after he took his seat in it, happening to sit late, a motion was made
for candles to be brought in, which being put to the vote, Dick pulled
a high-flying member, who sat near him, by the sleeve, and asked
him if candles were for the church? And being answered in the
affirmative, very readily gave his voice for them, which otherwise he
would not have done.
430. A young fellow, not quite so wise as Solomon, eating some
Cheshire cheese full of mites, one night at the tavern: Now, said he,
have I done as much as Sampson, for I have slain my thousands
and my ten thousands. Yes, answered one of the company, and with
the same weapon too, the jawbone of an ass.
431. Poor Joe Miller going one day along the Strand, an impudent
Derby captain came swaggering up to him, and thrust between him
and the wall. I don’t use to give the wall, said he, to every
jackanapes. But I do, said Joe; and so made way for him.
432. When the late Duke of —— went over as Lord Lieutenant to
Ireland, he took an excellent man cook with him, but they had not
been there above a month, when, finding his grace kept a very
scurvy house, he gave him warning. What’s the reason, said the
duke, that you have a mind to leave me? Why, if I continue with
your excellency much longer, answered the cook, I shall quite forget
my trade.
433. A certain officer in the guards telling one night, in company
with Joe Miller, of several wonderful things he had seen abroad,
among the rest he told the company he had seen a pike caught that
was six feet long. That’s a trifle, said Joe, I have seen a half-pike, in
England, longer by a foot, and yet not worth twopence.
434. Jemmy Spiller, another of the jocose comedians, going one day
through Rag Fair, a place where they sell second-hand goods,
cheapened a leg of mutton, he saw hanging up there, at a butcher’s
stall. The butcher told him it was a groat a pound. Are you not an
unconscionable fellow, said Spiller, to ask such a price, when one
may have a new one for the same price in Clare Market?
435. A gentleman having a servant with a very thick skull, used
often to call him the king of fools. I wish, said the fellow one day,
you could make your words good, I should then be the greatest
monarch in the world.
436. A lawyer being sick, made his last will, and gave all his estate
to fools and madmen; being asked the reason for so doing: From
such, said he, I had it, and to such I give it again.
437. A thief being brought to Tyburn to be executed, the ordinary of
Newgate, in taking his last confession, asked him if he was not sorry
for having committed the robbery for which he was going to suffer?
The criminal answered, Yes, but that he was more sorry for not
having stolen enough to bribe the jury.
438. A certain poor unfortunate gentleman was so often pulled by
the sleeve by the bailiffs, that he was in continual apprehension of
them; and going one day through Tavistock Street, his coat sleeve
happened to hitch upon the iron spike of one of the rails; whereupon
he immediately turned about in a great surprise, and cried out, At
whose suit, sir? at whose suit?
439. A soldier in the late wars, a little before an engagement, found
a horse-shoe, and stuck it in his girdle; shortly after, in the heat of
the action, a bullet came and hit him upon that part. Well, said he, I
find a little armour will serve a turn, if it be put in the right place.
440. The late famous Arthur Moor, who was much in favour with the
Tory ministry, in the latter part of Queen Anne’s reign, had a lady
who was reckoned a woman of great wit and humour, but of political
principles quite opposite to those of her husband. After the death of
the Queen, when it was talked of as if the late ministers would have
been called to account, my Lord B—ke meeting Mrs. Moor one day,
in a visit, Well, madam, said he, you hear how terribly we are
threatened; you’ll come, I hope, and see me, when I go to Tower
Hill? Upon my word, my lord, said she, I should be extremely glad to
do it: but I believe I shall be engaged another way, for I am told my
Snub (the name by which she always called her husband) will be
obliged to go the same day to Tyburn.
441. The same lady, coming home one evening, told her husband
she wished him joy, for she heard he was to be made a lord. (This
was before the death of Queen Anne.) And pray, said he, what did
they say was to be my title? My Lord Tariff, replied she, which was a
sneer upon him, for having been engaged in settling a tariff of trade
which he was thought well skilled in. And why don’t you, when you
hear any one abuse your husband, spit in their face? said he. No, I
thank you, answered the lady, I don’t intend to spit myself into a
consumption.
442. The late Sir John Tash was a famous wine-merchant, and sold
great quantities of that liquor, but was supposed to make it chiefly
without much of the juice of the grape; therefore Alderman Parsons
meeting him one day, saluted him by the name of brother brewer. I
deal in wine, Mr. Alderman, said Sir John, and am no brewer. But I
know you are, replied the other, and can brew more by an inch of
candle, than I can with a caldron of coals.
443. A late archbishop having promised one of his chaplains, who
was a favourite, the first good living in his gift, that he should like,
and think worthy his acceptance; soon after hearing of the death of
an old rector, whose parsonage was worth about 300l. a year, sent
his chaplain to the place to see how he liked it; the doctor, when he
came back again, thanked his grace for the offer he had made him,
but said, he had met with such an account of the country, and the
neighbourhood, as was not at all agreeable to him, and therefore
should be glad, if his grace pleased, to wait till something else fell.
Another vacancy not long after happening, the archbishop sent him
also to view that; but he returned as before, not satisfied with it,
which did not much please his grace. A third living, much better than
either of the others becoming vacant, as he was told, the chaplain
was sent to take a view of that; and when he came back, Well, now,
said my lord, how do you like this last living? what objection can you
have to this? I like the country very well, my Lord, answered he, and
the house, the income, and the neighbourhood, but—— But! replied
the archbishop, what but can there be then? But, my lord, said he, I
found the old incumbent smoking his pipe at the gate of his house.
444. Two city ladies meeting at a visit, one a grocer’s wife, and the
other a cheesemonger’s (who perhaps stood more upon the punctilio
of precedence than some of their betters would have done at the
court end of the town) when they had risen up and taken their
leaves, the cheesemonger’s wife was going out of the room first,
upon which the grocer’s lady, pulling her back by the tail of her
gown, and stepping before her, No, madam, said she, nothing comes
after cheese.
445. Old Johnson, the player, who was not only a very good actor,
but a good judge of painting, and remarkable for making many dry
jokes, was shown a picture, done by a very indifferent hand, but
much commended, and was asked his opinion of it. Why, truly, said
he, the painter is a very good painter, and observes the Lord’s
commandments. What do you mean by that, Mr. Johnson? said one
who stood by. Why, I think, answered he, that he hath not made to
himself the likeness of anything that is in Heaven above, or that is in
the earth beneath, or that is in the water under the earth.
446. A certain noble lord in the county of Hants, who had not much
applied himself to letters, and was remarkable for his ill-spelling,
dining at a neighbouring gentleman’s house, took notice several
times, and commended a snuff-box he made use of; when my lord
was gone away, the gentleman’s wife said to her husband, My dear,
you did not observe how often my lord commended your snuff-box;
I dare say he would have been highly pleased if you had made him
an offer of it; if I was you I would send it after him. The gentleman
took his lady’s advice, and the next morning sent a servant away
with a letter, and the snuff-box, as a present to the lord.—The lady
judged right, for my lord was mightily delighted with it, and returned
a most complaisant letter of thanks for the present, and told the
gentleman, in his ill-spelling, that he was greatly obliged to him, and
in a few days would send him an elephant, (equivalent he would
have written). The gentleman, not at all liking my lord’s proposal,
sent his servant with a letter again next day, telling his lordship, that
he was very glad the box was so acceptable to him, and thanking
him for the honour he designed him, but begged he would not think
of sending what he mentioned, for it would not only be attended
with an expense, which he could not very well afford, being such a
devouring animal, but would bring such numbers of people to see it,
that it would make his house a perfect house of call. My lord, a little
while after, meeting the gentleman, told him, he was surprised at his
letter, and could not imagine what he meant by it. The elephant,
said he, that your lordship spoke of sending me. Elephant! said the
learned lord, how could a man of your understanding make such a
mistake? I said I would send you an equivalent. I beg your lordship’s
pardon, returned the gentleman, and am ashamed of being such a
dunce that I could not read your lordship’s letter.
447. Young Griffith Lloyd, of the county of Cardigan, being sent to
Jesus College, Oxford, where he was looked upon as an errant
dunce, wore a calf-skin waistcoat, tanned with the hair on, and
trimmed with a broad gold lace, and gold buttons. One of the
Oxonians, an eminent punster, said, that Griffith was like a dull book,
bound in calf-skin, and gilt, but very ill-lettered.
448. Old G——, the rich miser of Gloucestershire, going home one
day, between Wickivarr and Badminton, the way being greasy, after
a shower of rain, his foot slipped, and he fell off a high bank into a
wet ditch, where he was almost smothered; a countryman, who
knew his character, coming by, he begged him, for God’s sake, to
help him. Ay, said the countryman, give me your hand. Give being a
word that old G—— had a great aversion to, cried out, I thank you,
honest friend, I will lend you my hand with all my heart. I have often
heard, said the other, that you would never give anything in your
life, so you may lie there; and on he walked.
449. An old woman at the head of a table, said a satirical young
one, seems to revive the old Grecian custom of serving up a death’s
head with their banquets.
450. The famous Tony Lee, a player in King Charles the Second’s
reign, being killed in a tragedy, having a violent cold, could not
forbear coughing as he lay dead upon the stage, which occasioned a
good deal of laughter and noise in the house; he lifted up his head,
and speaking to the audience, said, This makes good what my poor
mother used to tell me; for she would often say that I should cough
in my grave, because I used to drink in my porridge. This set the
house in such good humour, that it produced a thundering peal of
applause, and made every one very readily pardon the solecism he
had before committed.
451. Tom S—, the organist of St. M—, being reckoned to have a fine
finger, drew many people to hear him, whom he would oftentimes
entertain with a voluntary after evening service, and his auditory
seeming one day greatly delighted with his performance, after the
church was cleared, Adad, sir, said his organ-blower, who was an
idiot, I think we did rarely to-day. We, sirrah! said Tom. Ay, we, to be
sure, answered the other; what would you have done without me?
The next Sunday, Tom sitting down to play, could not make his
organ speak, whereupon, calling to the bellows-blower, asked him
what he meant? why he did not blow? Shall it be we, then? said the
other.
452. A certain French gentleman, having been but a very little while
in England, was invited to a friend’s house, where a large bowl of
punch was made, a liquor he had never seen before, and which did
not at all agree with him; but having forgot the name of it, he asked
a person the next day, What dey call a dat liqur in England, which is
all de contradiction; where is de brandy to make it strong, and de
vater to make it small, de sugar to make it sweet, and de lemons to
make it sower. Punch, answered the other, I suppose you mean. Ay,
ponche, begar, cried monsieur, it almost ponche my brain out last
night.
453. The famous Captain Fitzpatrick, who married ’Squire Western’s
niece, and was reckoned an excellent hand at making bulls, was
walking one day with two or three ladies, a little way out of West
Chester, with his hat under his arm; the wind blowing very hard, one
of the ladies said, I wonder, captain, you will be so ceremonious to
walk bare-headed in such boisterous weather; pray, sir, put on your
hat. Arrah, by my shoul, dear madam, answered the captain, I have
been after trying two or three times already, and the wind is so high,
that I can’t keep my hat upon my head any longer than ’tis under
my arm.
454. The same gentleman being with the aforesaid ladies, in a
nobleman’s garden, where there was a large iron roller, told them,
he thought it was the biggest iron rolling-stone he had ever seen in
his life.
455. A philosopher being blamed by a stander-by, for defending an
argument weakly against the Emperor Adrian, replied, What! would
you have me contend with a man that commands thirty legions of
soldiers?
456. A painter turned physician; upon which change, a friend
applauded him, saying, You have done well, for before, your faults
could be discovered by the naked eye, but now they are hid.
457. Bishop Latimer preaching at court, said, that it was reported the
king was poor, and that they were seeking ways and means to make
him rich; but he added, For my part, I think the best way to make
the king rich, would be to give him a good post, or office, for all his
officers are rich.
458. Zelim, the first of the Ottoman Emperors that shaved his beard,
his predecessors having always worn it long, being asked by one of
his bashaws, why he altered the custom of his predecessors?
answered, Because you bashaws shall not lead me by the beard, as
you did them.
459. It being told Antigonus, in order to intimidate him, as he
marched to the field of battle, that the enemy would shoot such
volleys of arrows, as would intercept the light of the sun. I am glad
of it, replied he, for it being very hot, we shall then fight in the
shade.
460. A sailor having received ten guineas for turning Roman
Catholic, said to the priest who paid him the money, Sir, you ought
to give me ten guineas more, because it is so hard to believe
transubstantiation.
461. One seeing an affected coxcomb buying books, told him, His
bookseller was properly his upholsterer, for he furnished his room
rather than his head.
462. An arch wag once said, That tailors were like woodcocks, for
they got their sustenance by their long bills.
463. A complaint being made to the court of Spain of a certain
Viceroy of Mexico, the Secretary of State, who was his friend, wrote
him word, that he was accused at court of having extorted great
sums of money from the people under his government; which I
hope, said the Secretary, is true, or else you are undone.
464. At a religious meeting a lady persevered in standing on a
bench, and thus intercepting the view of others, though repeatedly
requested to sit down. A reverend old gentleman at last rose, and
said gravely, I think, if the lady knew that she had a large hole in
each of her stockings, she would not exhibit them in this way. This
had the desired effect—she immediately sunk down on her seat. A
young minister standing by, blushed to the temples, and said, O,
brother, how could you say what was not the fact? Not the fact!
replied the old gentleman; if she had not a large hole in each of her
stockings, I should like to know how she gets them on.
465. A gentleman in the country having the misfortune to have his
wife hang herself on an apple tree, a neighbour of his came to him
and begged he would give him a scion of that tree, that he might
graft it upon one in his own orchard; for who knows, said he, but it
may bear the same fruit!
466. St. Evremond said, in defence of Cardinal Mazarine, when he
was reproached with neglecting the good of the kingdom that he
might engross the riches of it, Well, let him get all the riches, and
then he will think of the good of the kingdom, for it will be all his
own.
467. The late Earl of S— kept an Irish footman, who, perhaps, was
as expert in making bulls as the most learned of his countrymen. My
lord having sent him one day with a present to a certain judge, the
judge in return sent my lord half-a-dozen live partridges with a
letter; the partridges fluttering in the basket upon Teague’s back, as
he was carrying them home, he set down the basket, and opened
the lid of it to quiet them, whereupon they all flew away. Oh! the
devil burn ye, said he, I am glad you are gone. But when he came
home, and my lord had read the letter, Well, Teague, said my lord, I
find there are half-a-dozen partridges in the letter. Arrah now, dear
sir, said Teague, I am glad you have found them in the letter, for
they are all lost out of the basket.
468. The same nobleman going out one day, called Teague to the
side of his chariot, and bade him tell Mr. Such-a-one, if he came,
that he should be at home at dinner-time. But when my lord was got
across the square in which he lived, Teague came puffing after him,
and calling to the coachman to stop; upon which my lord, pulling the
string, desired to know what Teague wanted; My lord, said he, you
bade me tell Mr. Such-a-one, if he came, that you would dine at
home; but what must I say if he don’t come?
469. A tailor’s boy being at church, heard it said that a remnant only
should be saved. Egad, said the boy, then my master makes plaguy
long remnants.
470. The renowned Mr. Wh—n, the famous astronomer, had made a
calculation that the world would be at an end in fifteen years, and
some time after offered to dispose of an estate; he asked the
gentleman who was about it, at the rate of thirty years purchase,
upon which the gentleman, in great surprise, demanded how he
could ask so many years purchase, when he very well knew the
world would be at an end in half the time.
471. Some thievish fellows being at a tavern, they agreed amongst
themselves to steal the silver cup that was brought up to them, and
when they were going by the bar, You are welcome, gentlemen,
kindly welcome, cried the landlord. Ah, said the fellow with the cup
to himself, I wish we were well gone too.
472. A waterman belonging to the Tower, being put by one of the
players into the upper gallery in Covent Garden playhouse, the
fellow, not being very sober, and falling asleep, tumbled into the pit;
but having the old proverb on his side, received little or no hurt; and
being told by some of his companions that he was now free of the
house, he went to Mr. Rich (the then manager) to put in his claim,
who very readily allowed it, with this proviso, that he should always
go out the same way he had come in.
473. One told another, who did not use to be clothed over often,
that his new coat was too short for him; That’s true, answered his
friend, but it will be long enough before I get another.
474. A gentleman who was travelling in Italy, saw one day, as he
passed along the road near Naples, a man standing up to his chin in
a puddle of dirty water; not able to guess at the meaning of it, he
cried out to him, What are you catching there, friend? Cold, replied
the other, for I have to sing the bass part at the opera to-night. But
suppose, said the gentleman, you catch your death. Why, then, said
the other, the opera will be damned.
475. In the reign of Queen Anne, when it was said Lord Oxford had
got a number of peers made at once, to serve a particular turn,
being met next day by Lord Wharton,—So, Robin, said he, I find
what you lost by tricks you have gained by honours.
476. A young gentleman who had stolen a ward, being in suit for her
fortune, before a late lord chancellor, and the counsel insisting much
on the equity of decreeing her a fortune for her maintenance, his
lordship turned briskly upon him with this sentence, That since the
suitor had stolen the flesh, he should get bread to it how he could.
477. A country fellow, who had served several years in the army
abroad, when the war was over, coming home to his friends, was
received amongst them with great rejoicing, and the miraculous
stories related by him were heard with no small pleasure. Well, said
the old father, and prythee Jack, what didst thou learn there? Learn,
sir, why I learnt to know that when I turned my shirt, the vermin had
a day’s march to my skin again.
478. An Irish barrister had a client of his own country who was a
sailor, and having been at sea for some time, his wife was married
again in his absence, so he was resolved to prosecute her; and
coming to advise with the counsellor, told him he must have
witnesses to prove that he was alive when his wife married again.
Arrah, by my shoul, but that shall be impossible, said the other, for
my shipmates are all gone to sea again upon a long voyage, and
shan’t return this twelve-month. Oh! then, answered the counsellor,
there can be nothing done in it, and what a pity it is that such a
brave cause should be lost now, only because you cannot prove
yourself to be alive.
479. King Charles the First being prevailed upon by one of his
courtiers to knight a very worthless fellow, of mean aspect, when he
was going to lay the sword upon his shoulder the new knight drew a
little back, and hung down his head as out of countenance; Don’t be
ashamed, said the king, ’tis I have most reason to be so.
480. One said Sir John Cutler looked very dismally when night came
on, not because it brought darkness with it, but because daylight
saved him a candle.
481. A man was reproached by another with barbarity in beating his
wife so severely as he often did; Go, you are a fool, and ignorant of
the scriptures, said he, else you would know that it was a proof of
my love for her, otherwise I would not be at the trouble; but he that
the Lord loveth he chastizeth, and so do I.
482. An Irish soldier once returning from battle in the night,
marching a little way behind his companion, called out to him, Hollo,
Pat, I have catch’d a tartar! Bring him along then! Ay, but he won’t
come. Why then come away without him. By Jasus, but he won’t let
me!
483. A very harmless Irishman, eating an apple-pie with some
quinces in it, Arrah now, dear honey, said he, if a few of these
quinces give such a flavour, how would an apple-pie taste made all
of quinces?
484. The late duke of Wharton, going through Holborn in a hackney
coach, with Phil. F—, saw a fellow drumming before the door of a
puppet-show; Now, this is a pretty employment, Phil., said the duke;
if you were reduced so low, that you were obliged to be either a
highwayman or drummer to a puppet-show, which would you
choose? Faith, my lord, answered Phil., I would be the highwayman
rather than the other. Ay, replied the duke, that confirms the opinion
I always had of you, that you have more pride than honesty.
485. Sir T. P. once in parliament brought in a bill that wanted some
amendment, which being not attended to by the house, he
frequently repeated that he thirsted to mend his bill. Upon which a
worthy member got up, and said, Mr. Speaker, I humbly move, since
the honourable member thirsts so very much, that he may be
allowed to mend his draught. This put the house in such a good
humour, that his request was granted.
486. An English gentleman asked Sir Richard Steele, who was an
Irishman, What was the reason that his countrymen were so
remarkable for blundering and making bulls? Faith, said the knight, I
believe there is something in the air of Ireland; and I dare say, if an
Englishman was born there he would do the same.
487. A gentleman who was a staunch Whig, disputing with a
Jacobite, said, he had two good reasons for being against the
interest of the pretender: What are those? said the other. The first,
replied he, is, that he is an impostor, not really King James’s son:
Why, that, said the Tory, would be a good reason, if it could be
proved. And, pray, sir, what is your other? Why, said the Whig, that
he is King James’s son.
488. Although the infirmities of nature are not proper subjects to be
made a jest of, yet when people take a great deal of pains to
conceal what everybody sees, there is nothing more ridiculous: of
this sort was old Cross the player, who, being very deaf, did not care
anybody should know it. Honest Joe Miller going with a friend one
day along Fleet Street, and seeing old Cross on the other side of the
way, told his acquaintance he should see some sport; so beckoning
to Cross with his finger, and stretching open his mouth as wide as he
could, as if he hallooed to him, though he said nothing, the old
fellow came puffing from the other side of the way; What the deuce,
said he, do you make such a noise for? do you think one can’t hear?
489. There is in Rome a certain broken statue called Pasquin, to
which, in the night time, people affix the libels they dare not own; a
kind of dumb satire on the vices of the grandees, not sparing even
the Pope himself, as may be seen by the following story:—A late
Pope, being descended from a very mean family, on his
advancement to the holy see, bestowed great preferment on most of
his poor relations; whereupon Pasquin, on the next great festival,
early in the morning, was observed to have an extremely dirty shirt
on, with a scroll of paper in his hand, whereon was written, How
now, Pasquin? What! so dirty upon a holiday? and under that his
answer: Alas! I have no clean linen, my washerwoman is made a
princess.
490. An Irishman and an Englishman falling out, the Hibernian told
him if he did not hold his tongue, he would break his impenetrable
head and let the brains out of his empty skull!
491. Rogers, when a certain M.P. wrote a review of his poems, and
said he wrote very well for a banker, wrote in return, the following:
They say he has no heart, but I deny it:
He has a heart, he gets his speeches by it.

492. A prisoner being brought up to Bow Street, the following
dialogue passed between him and the sitting magistrate:—How do
you live? Pretty well, sir, generally a joint and pudding at dinner. I
mean, sir, how do you get your bread? I beg your worship’s pardon;
sometimes at the baker’s, and sometimes at the chandler’s shop.
You may be as witty as you please, sir; but I mean simply to ask you
how do you do? Tolerably well, I thank your worship: I hope your
worship is well.
493. When Citizen Thelwall was on his trial at the Old Bailey for high
treason, during the evidence for the prosecution, he wrote the
following note, and sent it to his counsel, Mr. Erskine: I am
determined to plead my cause myself. Mr. Erskine wrote under it: If
you do you’ll be hanged;—to which Thelwall immediately returned
this reply: I’ll be hanged if I do.
494. Chateauneuf, keeper of the seals under Louis XIII. when a boy
of only nine years old, was asked many questions by a bishop, and
gave very prompt answers to them all. At length the prelate said, I
will give you an orange if you will tell me where God is? My lord,
replied the boy, I will give you two if you will tell me where He is
not.
495. A Mr. Johnstone having been lost in the dreadful conflagration
of the Theatre Royal Covent Garden, Mr. John Johnstone, of Drury
Lane, received a letter from an Irish friend, requesting to know, by
the return of post, if it was he that was really burned or not.
496. A gentleman who lived in Great Turnstile, Holborn, being the
subject of conversation in a party, a person inquired where he lived,
if he had a large house, kept a good table, &c. Oh! yes, answered
another, he lives in the greatest stile in Holborn.
497. Gentlemen and ladies,—said the facetious Beau Nash, the then
master of the ceremonies for Bath, introducing a most lovely woman
into the ball-room,—this is Mrs. Hobson. I have often heard of
Hobson’s choice, but never had the pleasure to view it until now,
and you must coincide with me that it reflects credit on his taste.
498. A gentleman on circuit narrating to Lord Norbury some
extravagant feat in sporting, mentioned that he had lately shot
thirty-three hares before breakfast. Thirty-three hairs! exclaimed his
lordship; Zounds, sir! then you must have been firing at a wig.
499. During Lord Townshend’s residence in Dublin, as viceroy, he
often went in disguise through the city. He had heard much of the
wit of a shoeblack, known by the name of Blind Peter, whose stand
was always at the Globe Coffee-house door; having found him out,
he stopped to get his boots cleaned; which was no sooner done than
his lordship asked Peter to give him change for a guinea. A guinea!
your honour, said the ragged wit, change for a guinea from me! Sir,
you may as well ask a Highlander for a knee-buckle. His lordship was
so well pleased, that he left him the gold.
500. A late nobleman, who was very avaricious, was upon the same
good terms with his lady as the elements of water and lightning
when they encounter in the atmosphere. I am of opinion, my lord,
said her ladyship, that you would marry the devil’s daughter, after
my decease, if her dowry were equal to your expectations. That is
impossible, my lady, replied the earl, for it is contrary to the law of
England to marry two sisters.
501. A gentleman staying late one night at the tavern, his wife sent
his servant for him about twelve. John, said he, go home and tell
your mistress it can be no more. The man returned, by his mistress’s
order, again at one, the answer then was, it could be no less. But,
sir, said the man, day has broke. With all my heart, replied the
master, he owes me nothing. But the sun is up, sir. And so he ought
to be, John, ought he not? He has farther to go than we have, I am
sure.
502. A noisy talkative spark, who had a handsome place in the king’s
revenue, more than he merited, was holding an argument one day
with a gentleman, at a public coffee-house; the controversy turned
upon some point of government, and his antagonist, who had
somewhat galled him by the strength of his argument, referred him
to such a place in history, where he would find how much he was
mistaken in the dispute. Phoo, said he, d’ye think I have no
other business but to read histories? Faith, said the other, ’tis pity
you had, till you had read a little more.
503. Susan, a country girl, desirous of matrimony, received from her
mistress a present of a 5l. bank note for her marriage portion. Her
mistress wished to see the object of Susan’s favour; and a very
diminutive fellow, swarthy as a Moor, and ugly as an ape, made his
appearance. Ah, Susan, said her mistress, what a strange choice you
have made! La, ma’am, said Susan, in such hard times as these,
when almost all the tall fellows are gone for soldiers, what more of a
man than this can you expect for a 5l. note?
504. There happened, when Swift was at Laracor in Ireland, the sale
of a farm and stock, the farmer being dead. Swift chanced to walk
past during the auction, just as a pen of poultry had been put up.
Roger (Swift’s clerk) bid for them, but was overbid by a farmer of
the name of Hatch. What, Roger, won’t you buy the poultry?
exclaimed Swift. No, sir, said Roger, I see they are just a going to
Hatch.
505. In a debate on the leather tax, in 1795, in the Irish House of
Commons, the Chancellor of the Exchequer (Sir John P——)
observed, with great emphasis, That, in the prosecution of the
present war, every man ought to give his last guinea to protect the
remainder. Mr. Vaudelure said, that however that might be, the tax
on leather would be severely felt by the barefooted peasantry of
Ireland. To which Sir Boyle Roache replied, that this could be easily
remedied, by making the under-leathers of wood.
506. Lieutenant Connolly, an Irishman in the service of the United
States, during the American war, chanced to take three Hessian
prisoners himself, without any assistance. Being asked by the
commander in chief how he had taken them? I surrounded them,
was the answer.
507. A seedsman being held to bail for having used inflammatory
language respecting the reform bill, a wag observed, It was probably
in the line of his profession—to promote business, he wished to sow
sedition.
508. When Quin and Garrick performed at the same theatre, and in
the same play, the night being very stormy, each ordered a chair. To
the mortification of Quin, Mr. Garrick’s chair came up first. Let me
get into the chair, cried the surly veteran—let me get into the chair,
and put little Davy into the lantern. By all means, said Garrick; I shall
ever be happy to give Mr. Quin light in anything.
509. The late Richard Russel, esq. had a renter’s share at Drury
Lane, where he used to go almost every evening; and,
notwithstanding his immense fortune, his penury was so great, that
rather than give a trifle to any of the women who attended in the
lobby-box to take care of his great coat on an evening, he used
constantly to pledge it for a shilling, at a pawnbroker’s near the
theatre, and redeem it when the performance was over, which cost
him one halfpenny interest.
510. A mountebank, expatiating on the virtues of his drawing salve,
and reciting many instances of its success, was interrupted by an old
woman, who asserted, rather iron-ically, that she had seen it draw
out of a door four rusty tenpenny nails, that defied the united efforts
of two of the strongest blacksmiths, with their hammers and pincers.
511. At the close of that season in which Shuter, the comedian, first
became so universally and deservedly celebrated in his Master
Stephen, in the revived comedy of Every Man in his Humour, he was
engaged for a few nights, in a principal city in the north of England.
It happened that the coach in which he went down (and in which
there was only an old gentleman and himself) was stopped on the
other side of Finchley Common by a highwayman. The old
gentleman, in order to save his own money, pretended to be asleep;