


TEXTS IN COMPUTER SCIENCE

The Data Science Design Manual

Steven S. Skiena
Texts in Computer Science

Series editor
David Gries
Orit Hazzan
Fred B. Schneider
More information about this series at http://www.springer.com/series/3191


Steven S. Skiena
Computer Science Department
Stony Brook University
Stony Brook, NY
USA

Series editors

David Gries
Department of Computer Science
Cornell University
Ithaca, NY
USA

Fred B. Schneider
Department of Computer Science
Cornell University
Ithaca, NY
USA

Orit Hazzan
Faculty of Education in Science and Technology
Technion—Israel Institute of Technology
Haifa
Israel

ISSN 1868-0941 ISSN 1868-095X (electronic)


Texts in Computer Science
ISBN 978-3-319-55443-3 ISBN 978-3-319-55444-0 (eBook)
DOI 10.1007/978-3-319-55444-0

Library of Congress Control Number: 2017943201

This book was advertised with a copyright holder in the name of the publisher in error, whereas the author(s) holds the copyright.

© The Author(s) 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and
therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be
true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or
implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher
remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Making sense of the world around us requires obtaining and analyzing data from
our environment. Several technology trends have recently collided, providing
new opportunities to apply our data analysis savvy to greater challenges than
ever before.
Computer storage capacity has increased exponentially; indeed remembering
has become so cheap that it is almost impossible to get computer systems to forget.
Sensing devices increasingly monitor everything that can be observed: video
streams, social media interactions, and the position of anything that moves.
Cloud computing enables us to harness the power of massive numbers of machines
to manipulate this data. Indeed, hundreds of computers are summoned
each time you do a Google search, scrutinizing all of your previous activity just
to decide which is the best ad to show you next.
The result of all this has been the birth of data science, a new field devoted
to maximizing value from vast collections of information. As a discipline, data
science sits somewhere at the intersection of statistics, computer science, and
machine learning, but it is building a distinct heft and character of its own.
This book serves as an introduction to data science, focusing on the skills and
principles needed to build systems for collecting, analyzing, and interpreting
data.
My professional experience as a researcher and instructor convinces me that
one major challenge of data science is that it is considerably more subtle than it
looks. Any student who has ever computed their grade point average (GPA) can
be said to have done rudimentary statistics, just as drawing a simple scatter plot
lets you add experience in data visualization to your resume. But meaningfully
analyzing and interpreting data requires both technical expertise and wisdom.
That so many people do these basics so badly provides my inspiration for writing
this book.

To the Reader
I have been gratified by the warm reception that my book The Algorithm Design
Manual [Ski08] has received since its initial publication in 1997. It has been
recognized as a unique guide to using algorithmic techniques to solve problems
that often arise in practice. The book you are holding covers very different
material, but with the same motivation.

In particular, here I stress the following basic principles as fundamental to
becoming a good data scientist:
• Valuing doing the simple things right: Data science isn’t rocket science.
Students and practitioners often get lost in technological space, pursuing
the most advanced machine learning methods, the newest open source
software libraries, or the glitziest visualization techniques. However, the
heart of data science lies in doing the simple things right: understanding
the application domain, cleaning and integrating relevant data sources,
and presenting your results clearly to others.
Simple doesn’t mean easy, however. Indeed it takes considerable insight
and experience to ask the right questions, and sense whether you are moving
toward correct answers and actionable insights. I resist the temptation
to drill deeply into clean, technical material here just because it is teachable.
There are plenty of other books which will cover the intricacies of
machine learning algorithms or statistical hypothesis testing. My mission
here is to lay the groundwork of what really matters in analyzing data.
• Developing mathematical intuition: Data science rests on a foundation of
mathematics, particularly statistics and linear algebra. It is important to
understand this material on an intuitive level: why these concepts were
developed, how they are useful, and when they work best. I illustrate
operations in linear algebra by presenting pictures of what happens to
matrices when you manipulate them, and statistical concepts by examples
and reductio ad absurdum arguments. My goal here is transplanting
intuition into the reader.
But I strive to minimize the amount of formal mathematics used in presenting
this material. Indeed, I will present exactly one formal proof in
this book, an incorrect proof where the associated theorem is obviously
false. The moral here is not that mathematical rigor doesn’t matter, because
of course it does, but that genuine rigor is impossible until after
there is comprehension.
• Think like a computer scientist, but act like a statistician: Data science
provides an umbrella linking computer scientists, statisticians, and domain
specialists. But each community has its own distinct styles of thinking and
action, which get stamped into the souls of its members.
In this book, I emphasize approaches which come most naturally to computer
scientists, particularly the algorithmic manipulation of data, the use
of machine learning, and the mastery of scale. But I also seek to transmit
the core values of statistical reasoning: the need to understand the application
domain, proper appreciation of the small, the quest for significance,
and a hunger for exploration.
No discipline has a monopoly on the truth. The best data scientists incorporate
tools from multiple areas, and this book strives to be a relatively
neutral ground where rival philosophies can come to reason together.

Equally important is what you will not find in this book. I do not emphasize
any particular language or suite of data analysis tools. Instead, this book provides
a high-level discussion of important design principles. I seek to operate at
a conceptual level more than a technical one. The goal of this manual is to get
you going in the right direction as quickly as possible, with whatever software
tools you find most accessible.

To the Instructor
This book covers enough material for an “Introduction to Data Science” course
at the undergraduate or early graduate student levels. I hope that the reader
has completed the equivalent of at least one programming course and has a bit
of prior exposure to probability and statistics, but more is always better than
less.
I have made a full set of lecture slides for teaching this course available online
at http://www.data-manual.com. Data resources for projects and assignments
are also available there to aid the instructor. Further, I make available online
video lectures using these slides to teach a full-semester data science course. Let
me help teach your class, through the magic of the web!
Pedagogical features of this book include:

• War Stories: To provide a better perspective on how data science techniques
apply to the real world, I include a collection of “war stories,” or
tales from our experience with real problems. The moral of these stories is
that these methods are not just theory, but important tools to be pulled
out and used as needed.

• False Starts: Most textbooks present methods as a fait accompli, obscuring
the ideas involved in designing them, and the subtle reasons why
other approaches fail. The war stories illustrate my reasoning process on
certain applied problems, but I weave such coverage into the core material
as well.

• Take-Home Lessons: Highlighted “take-home” lesson boxes scattered
through each chapter emphasize the big-picture concepts to learn from
each chapter.

• Homework Problems: I provide a wide range of exercises for homework
and self-study. Many are traditional exam-style problems, but there
are also larger-scale implementation challenges and smaller-scale interview
questions, reflecting the questions students might encounter when
searching for a job. Degree of difficulty ratings have been assigned to all
problems.
In lieu of an answer key, a Solution Wiki has been set up, where solutions to
all even-numbered problems will be solicited by crowdsourcing. A similar
system with my Algorithm Design Manual produced coherent solutions,
or so I am told. As a matter of principle I refuse to look at them, so let
the buyer beware.

• Kaggle Challenges: Kaggle (www.kaggle.com) provides a forum for data
scientists to compete in, featuring challenging real-world problems on fascinating
data sets, and scoring to test how good your model is relative to
other submissions. The exercises for each chapter include three relevant
Kaggle challenges, to serve as a source of inspiration, self-study, and data
for other projects and investigations.

• Data Science Television: Data science remains mysterious and even
threatening to the broader public. The Quant Shop is an amateur take
on what a data science reality show should be like. Student teams tackle
a diverse array of real-world prediction problems, and try to forecast the
outcome of future events. Check it out at http://www.quant-shop.com.
A series of eight 30-minute episodes has been prepared, each built around
a particular real-world prediction problem. Challenges include pricing art
at an auction, picking the winner of the Miss Universe competition, and
forecasting when celebrities are destined to die. For each, we observe as a
student team comes to grips with the problem, and learn along with them
as they build a forecasting model. They make their predictions, and we
watch along with them to see if they are right or wrong.
In this book, The Quant Shop is used to provide concrete examples of
prediction challenges, to frame discussions of the data science modeling
pipeline from data acquisition to evaluation. I hope you find them fun, and
that they will encourage you to conceive and take on your own modeling
challenges.

• Chapter Notes: Finally, each tutorial chapter concludes with a brief notes
section, pointing readers to primary sources and additional references.

Dedication
My bright and loving daughters Bonnie and Abby are now full-blown teenagers,
meaning that they don’t always process statistical evidence with as much alacrity
as I would desire. I dedicate this book to them, in the hope that their analysis
skills improve to the point that they always just agree with me.
And I dedicate this book to my beautiful wife Renee, who agrees with me
even when she doesn’t agree with me, and loves me beyond the support of all
creditable evidence.

Acknowledgments
My list of people to thank is large enough that I have probably missed some.
I will try to enumerate them systematically to minimize omissions, but ask
those I’ve unfairly neglected for absolution.
First, I thank those who made concrete contributions to help me put this
book together. Yeseul Lee served as an apprentice on this project, helping with
figures, exercises, and more during summer 2016 and beyond. You will see
evidence of her handiwork on almost every page, and I greatly appreciate her
help and dedication. Aakriti Mittal and Jack Zheng also contributed to a few
of the figures.
Students in my Fall 2016 Introduction to Data Science course (CSE 519)
helped to debug the manuscript, and they found plenty of things to debug. I
particularly thank Rebecca Siford, who proposed over one hundred corrections
on her own. Several data science friends/sages reviewed specific chapters for
me, and I thank Anshul Gandhi, Yifan Hu, Klaus Mueller, Francesco Orabona,
Andy Schwartz, and Charles Ward for their efforts here.
I thank all the Quant Shop students from Fall 2015 whose video and modeling
efforts are so visibly on display. I particularly thank Jan (Dini) Diskin-Zimmerman,
whose editing efforts went so far beyond the call of duty I felt like
a felon for letting her do it.
My editors at Springer, Wayne Wheeler and Simon Rees, were a pleasure to
work with as usual. I also thank all the production and marketing people who
helped get this book to you, including Adrian Pieron and Annette Anlauf.
Several exercises were originated by colleagues or inspired by other sources.
Reconstructing the original sources years later can be challenging, but credits
for each problem (to the best of my recollection) appear on the website.
Much of what I know about data science has been learned through working
with other people. These include my Ph.D. students, particularly Rami al-Rfou,
Mikhail Bautin, Haochen Chen, Yanqing Chen, Vivek Kulkarni, Levon Lloyd,
Andrew Mehler, Bryan Perozzi, Yingtao Tian, Junting Ye, Wenbin Zhang, and
postdoc Charles Ward. I fondly remember all of my Lydia project masters
students over the years, and remind you that my prize offer to the first one who
names their daughter Lydia remains unclaimed. I thank my other collaborators
with stories to tell, including Bruce Futcher, Justin Gardin, Arnout van de Rijt,
and Oleksii Starov.
I remember all members of the General Sentiment/Canrock universe, particularly
Mark Fasciano, with whom I shared the start-up dream and experienced
what happens when data hits the real world. I thank my colleagues at Yahoo
Labs/Research during my 2015–2016 sabbatical year, when much of this book
was conceived. I single out Amanda Stent, who enabled me to be at Yahoo
during that particularly difficult year in the company’s history. I learned valuable
things from other people who have taught related data science courses,
including Andrew Ng and Hans-Peter Pfister, and thank them all for their help.

If you have a procedure with ten parameters, you probably missed
some.
– Alan Perlis

Caveat
It is traditional for the author to magnanimously accept the blame for whatever
deficiencies remain. I don’t. Any errors, deficiencies, or problems in this book
are somebody else’s fault, but I would appreciate knowing about them so as to
determine who is to blame.
Steven S. Skiena
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-2424
http://www.cs.stonybrook.edu/~skiena
[email protected]
May 2017
Contents

1 What is Data Science?
1.1 Computer Science, Data Science, and Real Science
1.2 Asking Interesting Questions from Data
1.2.1 The Baseball Encyclopedia
1.2.2 The Internet Movie Database (IMDb)
1.2.3 Google Ngrams
1.2.4 New York Taxi Records
1.3 Properties of Data
1.3.1 Structured vs. Unstructured Data
1.3.2 Quantitative vs. Categorical Data
1.3.3 Big Data vs. Little Data
1.4 Classification and Regression
1.5 Data Science Television: The Quant Shop
1.5.1 Kaggle Challenges
1.6 About the War Stories
1.7 War Story: Answering the Right Question
1.8 Chapter Notes
1.9 Exercises

2 Mathematical Preliminaries
2.1 Probability
2.1.1 Probability vs. Statistics
2.1.2 Compound Events and Independence
2.1.3 Conditional Probability
2.1.4 Probability Distributions
2.2 Descriptive Statistics
2.2.1 Centrality Measures
2.2.2 Variability Measures
2.2.3 Interpreting Variance
2.2.4 Characterizing Distributions
2.3 Correlation Analysis
2.3.1 Correlation Coefficients: Pearson and Spearman Rank
2.3.2 The Power and Significance of Correlation
2.3.3 Correlation Does Not Imply Causation!
2.3.4 Detecting Periodicities by Autocorrelation
2.4 Logarithms
2.4.1 Logarithms and Multiplying Probabilities
2.4.2 Logarithms and Ratios
2.4.3 Logarithms and Normalizing Skewed Distributions
2.5 War Story: Fitting Designer Genes
2.6 Chapter Notes
2.7 Exercises

3 Data Munging
3.1 Languages for Data Science
3.1.1 The Importance of Notebook Environments
3.1.2 Standard Data Formats
3.2 Collecting Data
3.2.1 Hunting
3.2.2 Scraping
3.2.3 Logging
3.3 Cleaning Data
3.3.1 Errors vs. Artifacts
3.3.2 Data Compatibility
3.3.3 Dealing with Missing Values
3.3.4 Outlier Detection
3.4 War Story: Beating the Market
3.5 Crowdsourcing
3.5.1 The Penny Demo
3.5.2 When is the Crowd Wise?
3.5.3 Mechanisms for Aggregation
3.5.4 Crowdsourcing Services
3.5.5 Gamification
3.6 Chapter Notes
3.7 Exercises

4 Scores and Rankings
4.1 The Body Mass Index (BMI)
4.2 Developing Scoring Systems
4.2.1 Gold Standards and Proxies
4.2.2 Scores vs. Rankings
4.2.3 Recognizing Good Scoring Functions
4.3 Z-scores and Normalization
4.4 Advanced Ranking Techniques
4.4.1 Elo Rankings
4.4.2 Merging Rankings
4.4.3 Digraph-based Rankings
4.4.4 PageRank
4.5 War Story: Clyde’s Revenge
4.6 Arrow’s Impossibility Theorem
4.7 War Story: Who’s Bigger?
4.8 Chapter Notes
4.9 Exercises

5 Statistical Analysis
5.1 Statistical Distributions
5.1.1 The Binomial Distribution
5.1.2 The Normal Distribution
5.1.3 Implications of the Normal Distribution
5.1.4 Poisson Distribution
5.1.5 Power Law Distributions
5.2 Sampling from Distributions
5.2.1 Random Sampling beyond One Dimension
5.3 Statistical Significance
5.3.1 The Significance of Significance
5.3.2 The T-test: Comparing Population Means
5.3.3 The Kolmogorov-Smirnov Test
5.3.4 The Bonferroni Correction
5.3.5 False Discovery Rate
5.4 War Story: Discovering the Fountain of Youth?
5.5 Permutation Tests and P-values
5.5.1 Generating Random Permutations
5.5.2 DiMaggio’s Hitting Streak
5.6 Bayesian Reasoning
5.7 Chapter Notes
5.8 Exercises

6 Visualizing Data
6.1 Exploratory Data Analysis
6.1.1 Confronting a New Data Set
6.1.2 Summary Statistics and Anscombe’s Quartet
6.1.3 Visualization Tools
6.2 Developing a Visualization Aesthetic
6.2.1 Maximizing Data-Ink Ratio
6.2.2 Minimizing the Lie Factor
6.2.3 Minimizing Chartjunk
6.2.4 Proper Scaling and Labeling
6.2.5 Effective Use of Color and Shading
6.2.6 The Power of Repetition
6.3 Chart Types
6.3.1 Tabular Data
6.3.2 Dot and Line Plots
6.3.3 Scatter Plots
6.3.4 Bar Plots and Pie Charts
6.3.5 Histograms
6.3.6 Data Maps
6.4 Great Visualizations
6.4.1 Marey’s Train Schedule
6.4.2 Snow’s Cholera Map
6.4.3 New York’s Weather Year
6.5 Reading Graphs
6.5.1 The Obscured Distribution
6.5.2 Overinterpreting Variance
6.6 Interactive Visualization
6.7 War Story: TextMapping the World
6.8 Chapter Notes
6.9 Exercises

7 Mathematical Models
7.1 Philosophies of Modeling
7.1.1 Occam’s Razor
7.1.2 Bias–Variance Trade-Offs
7.1.3 What Would Nate Silver Do?
7.2 A Taxonomy of Models
7.2.1 Linear vs. Non-Linear Models
7.2.2 Blackbox vs. Descriptive Models
7.2.3 First-Principle vs. Data-Driven Models
7.2.4 Stochastic vs. Deterministic Models
7.2.5 Flat vs. Hierarchical Models
7.3 Baseline Models
7.3.1 Baseline Models for Classification
7.3.2 Baseline Models for Value Prediction
7.4 Evaluating Models
7.4.1 Evaluating Classifiers
7.4.2 Receiver-Operator Characteristic (ROC) Curves
7.4.3 Evaluating Multiclass Systems
7.4.4 Evaluating Value Prediction Models
7.5 Evaluation Environments
7.5.1 Data Hygiene for Evaluation
7.5.2 Amplifying Small Evaluation Sets
7.6 War Story: 100% Accuracy
7.7 Simulation Models
7.8 War Story: Calculated Bets
7.9 Chapter Notes
7.10 Exercises

8 Linear Algebra
8.1 The Power of Linear Algebra
8.1.1 Interpreting Linear Algebraic Formulae
8.1.2 Geometry and Vectors
8.2 Visualizing Matrix Operations
8.2.1 Matrix Addition
8.2.2 Matrix Multiplication
8.2.3 Applications of Matrix Multiplication
8.2.4 Identity Matrices and Inversion
8.2.5 Matrix Inversion and Linear Systems
8.2.6 Matrix Rank
8.3 Factoring Matrices
8.3.1 Why Factor Feature Matrices?
8.3.2 LU Decomposition and Determinants
8.4 Eigenvalues and Eigenvectors
8.4.1 Properties of Eigenvalues
8.4.2 Computing Eigenvalues
8.5 Eigenvalue Decomposition
8.5.1 Singular Value Decomposition
8.5.2 Principal Components Analysis
8.6 War Story: The Human Factors
8.7 Chapter Notes
8.8 Exercises

9 Linear and Logistic Regression 267


9.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
9.1.1 Linear Regression and Duality . . . . . . . . . . . . . . . 268
9.1.2 Error in Linear Regression . . . . . . . . . . . . . . . . . . 269
9.1.3 Finding the Optimal Fit . . . . . . . . . . . . . . . . . . . 270
9.2 Better Regression Models . . . . . . . . . . . . . . . . . . . . . . 272
9.2.1 Removing Outliers . . . . . . . . . . . . . . . . . . . . . . 272
9.2.2 Fitting Non-Linear Functions . . . . . . . . . . . . . . . . 273
9.2.3 Feature and Target Scaling . . . . . . . . . . . . . . . . . 274
9.2.4 Dealing with Highly-Correlated Features . . . . . . . . . . 277
9.3 War Story: Taxi Deriver . . . . . . . . . . . . . . . . . . . . . . . 277
9.4 Regression as Parameter Fitting . . . . . . . . . . . . . . . . . . 279
9.4.1 Convex Parameter Spaces . . . . . . . . . . . . . . . . . . 280
9.4.2 Gradient Descent Search . . . . . . . . . . . . . . . . . . . 281
9.4.3 What is the Right Learning Rate? . . . . . . . . . . . . . 283
9.4.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 285
9.5 Simplifying Models through Regularization . . . . . . . . . . . . 286
9.5.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . 286
9.5.2 LASSO Regression . . . . . . . . . . . . . . . . . . . . . . 287
9.5.3 Trade-Offs between Fit and Complexity . . . . . . . . . . 288
9.6 Classification and Logistic Regression . . . . . . . . . . . . . . . 289
9.6.1 Regression for Classification . . . . . . . . . . . . . . . . . 290
9.6.2 Decision Boundaries . . . . . . . . . . . . . . . . . . . . . 291
9.6.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 292
9.7 Issues in Logistic Classification . . . . . . . . . . . . . . . . . . . 295
9.7.1 Balanced Training Classes . . . . . . . . . . . . . . . . . . 295
9.7.2 Multi-Class Classification . . . . . . . . . . . . . . . . . . 297
9.7.3 Hierarchical Classification . . . . . . . . . . . . . . . . . . 298
9.7.4 Partition Functions and Multinomial Regression . . . . . 299
9.8 Chapter Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
9.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

10 Distance and Network Methods 303


10.1 Measuring Distances . . . . . . . . . . . . . . . . . . . . . . . . . 303
10.1.1 Distance Metrics . . . . . . . . . . . . . . . . . . . . . . . 304
10.1.2 The Lk Distance Metric . . . . . . . . . . . . . . . . . . . 305
10.1.3 Working in Higher Dimensions . . . . . . . . . . . . . . . 307
10.1.4 Dimensional Egalitarianism . . . . . . . . . . . . . . . . . 308
10.1.5 Points vs. Vectors . . . . . . . . . . . . . . . . . . . . . . 309
10.1.6 Distances between Probability Distributions . . . . . . . . 310
10.2 Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . 311
10.2.1 Seeking Good Analogies . . . . . . . . . . . . . . . . . . . 312
10.2.2 k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . 313
10.2.3 Finding Nearest Neighbors . . . . . . . . . . . . . . . . . 315
10.2.4 Locality Sensitive Hashing . . . . . . . . . . . . . . . . . . 317
10.3 Graphs, Networks, and Distances . . . . . . . . . . . . . . . . . . 319
10.3.1 Weighted Graphs and Induced Networks . . . . . . . . . . 320
10.3.2 Talking About Graphs . . . . . . . . . . . . . . . . . . . . 321
10.3.3 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . 323
10.4 PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
10.5 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
10.5.1 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . 330
10.5.2 Agglomerative Clustering . . . . . . . . . . . . . . . . . . 336
10.5.3 Comparing Clusterings . . . . . . . . . . . . . . . . . . . . 341
10.5.4 Similarity Graphs and Cut-Based Clustering . . . . . . . 341
10.6 War Story: Cluster Bombing . . . . . . . . . . . . . . . . . . . . 344
10.7 Chapter Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346

11 Machine Learning 351


11.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
11.1.1 Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 354
11.1.2 Dealing with Zero Counts (Discounting) . . . . . . . . . . 356
11.2 Decision Tree Classifiers . . . . . . . . . . . . . . . . . . . . . . . 357
11.2.1 Constructing Decision Trees . . . . . . . . . . . . . . . . . 359
11.2.2 Realizing Exclusive Or . . . . . . . . . . . . . . . . . . . . 361
11.2.3 Ensembles of Decision Trees . . . . . . . . . . . . . . . . . 362
11.3 Boosting and Ensemble Learning . . . . . . . . . . . . . . . . . . 363
11.3.1 Voting with Classifiers . . . . . . . . . . . . . . . . . . . . 363
11.3.2 Boosting Algorithms . . . . . . . . . . . . . . . . . . . . . 364
11.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . 366
11.4.1 Linear SVMs . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.4.2 Non-linear SVMs . . . . . . . . . . . . . . . . . . . . . . . 369
11.4.3 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
11.5 Degrees of Supervision . . . . . . . . . . . . . . . . . . . . . . . . 372
11.5.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . 372
11.5.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . 372
11.5.3 Semi-supervised Learning . . . . . . . . . . . . . . . . . . 374
11.5.4 Feature Engineering . . . . . . . . . . . . . . . . . . . . . 375
11.6 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
11.6.1 Networks and Depth . . . . . . . . . . . . . . . . . . . . . 378
11.6.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . 382
11.6.3 Word and Graph Embeddings . . . . . . . . . . . . . . . . 383
11.7 War Story: The Name Game . . . . . . . . . . . . . . . . . . . . 385
11.8 Chapter Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
11.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388

12 Big Data: Achieving Scale 391


12.1 What is Big Data? . . . . . . . . . . . . . . . . . . . . . . . . . . 392
12.1.1 Big Data as Bad Data . . . . . . . . . . . . . . . . . . . . 392
12.1.2 The Three Vs . . . . . . . . . . . . . . . . . . . . . . . . . 394
12.2 War Story: Infrastructure Matters . . . . . . . . . . . . . . . . . 395
12.3 Algorithmics for Big Data . . . . . . . . . . . . . . . . . . . . . . 397
12.3.1 Big Oh Analysis . . . . . . . . . . . . . . . . . . . . . . . 397
12.3.2 Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.3.3 Exploiting the Storage Hierarchy . . . . . . . . . . . . . . 401
12.3.4 Streaming and Single-Pass Algorithms . . . . . . . . . . . 402
12.4 Filtering and Sampling . . . . . . . . . . . . . . . . . . . . . . . . 403
12.4.1 Deterministic Sampling Algorithms . . . . . . . . . . . . . 404
12.4.2 Randomized and Stream Sampling . . . . . . . . . . . . . 406
12.5 Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
12.5.1 One, Two, Many . . . . . . . . . . . . . . . . . . . . . . . 407
12.5.2 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . 409
12.5.3 Grid Search . . . . . . . . . . . . . . . . . . . . . . . . . . 409
12.5.4 Cloud Computing Services . . . . . . . . . . . . . . . . . . 410
12.6 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
12.6.1 Map-Reduce Programming . . . . . . . . . . . . . . . . . 412
12.6.2 MapReduce under the Hood . . . . . . . . . . . . . . . . . 414
12.7 Societal and Ethical Implications . . . . . . . . . . . . . . . . . . 416
12.8 Chapter Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
12.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

13 Coda 423
13.1 Get a Job! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
13.2 Go to Graduate School! . . . . . . . . . . . . . . . . . . . . . . . 424
13.3 Professional Consulting Services . . . . . . . . . . . . . . . . . . 425

14 Bibliography 427
Chapter 1

What is Data Science?

The purpose of computing is insight, not numbers.


– Richard W. Hamming

What is data science? Like any emerging field, it hasn’t been completely defined
yet, but you know enough about it to be interested or else you wouldn’t be
reading this book.
I think of data science as lying at the intersection of computer science,
statistics, and substantive application domains. From computer science come
machine learning and high-performance computing technologies for dealing with
scale. From statistics comes a long tradition of exploratory data analysis,
significance testing, and visualization. From application domains in business
and the sciences come challenges worthy of battle, and evaluation standards to
assess when they have been adequately conquered.
But these are all well-established fields. Why data science, and why now? I
see three reasons for this sudden burst of activity:
• New technology makes it possible to capture, annotate, and store vast
amounts of social media, logging, and sensor data. After you have amassed
all this data, you begin to wonder what you can do with it.
• Computing advances make it possible to analyze data in novel ways and at
ever increasing scales. Cloud computing architectures give even the little
guy access to vast power when they need it. New approaches to machine
learning have led to amazing advances in longstanding problems, like
computer vision and natural language processing.
• Prominent technology companies (like Google and Facebook) and quan-
titative hedge funds (like Renaissance Technologies and TwoSigma) have
proven the power of modern data analytics. Success stories applying data
to such diverse areas as sports management (Moneyball [Lew04]) and elec-
tion forecasting (Nate Silver [Sil12]) have served as role models to bring
data science to a large popular audience.

© The Author(s) 2017
S.S. Skiena, The Data Science Design Manual,
Texts in Computer Science, DOI 10.1007/978-3-319-55444-0_1

This introductory chapter has three missions. First, I will try to explain how
good data scientists think, and how this differs from the mindset of traditional
programmers and software developers. Second, we will look at data sets in terms
of what they might be used for, and learn to ask the broader
questions they are capable of answering. Finally, I introduce a collection of
data analysis challenges that will be used throughout this book as motivating
examples.

1.1 Computer Science, Data Science, and Real Science
Computer scientists, by nature, don’t respect data. They have traditionally
been taught that the algorithm was the thing, and that data was just meat to
be passed through a sausage grinder.
So to qualify as an effective data scientist, you must first learn to think like
a real scientist. Real scientists strive to understand the natural world, which
is a complicated and messy place. By contrast, computer scientists tend to
build their own clean and organized virtual worlds and live comfortably within
them. Scientists obsess about discovering things, while computer scientists in-
vent rather than discover.
People’s mindsets strongly color how they think and act, causing misunder-
standings when we try to communicate outside our tribes. So fundamental are
these biases that we are often unaware we have them. Examples of the cultural
differences between computer science and real science include:

• Data vs. method centrism: Scientists are data driven, while computer
scientists are algorithm driven. Real scientists spend enormous amounts
of effort collecting data to answer their question of interest. They invent
fancy measuring devices, stay up all night tending to experiments, and
devote most of their thinking to how to get the data they need.
By contrast, computer scientists obsess about methods: which algorithm
is better than which other algorithm, which programming language is best
for a job, which program is better than which other program. The details
of the data set they are working on seem comparatively unexciting.

• Concern about results: Real scientists care about answers. They analyze
data to discover something about how the world works. Good scientists
care about whether the results make sense, because they care about what
the answers mean.
By contrast, bad computer scientists worry about producing plausible-
looking numbers. As soon as the numbers stop looking grossly wrong,
they are presumed to be right. This is because they are personally less
invested in what can be learned from a computation, as opposed to getting
it done quickly and efficiently.

• Robustness: Real scientists are comfortable with the idea that data has
errors. In general, computer scientists are not. Scientists think a lot about
possible sources of bias or error in their data, and how these possible problems
can affect the conclusions derived from them. Good programmers use
strong data-typing and parsing methodologies to guard against formatting
errors, but the concerns here are different.
Becoming aware that data can have errors is empowering. Computer
scientists chant “garbage in, garbage out” as a defensive mantra to ward
off criticism, a way to say that’s not my job. Real scientists get close
enough to their data to smell it, giving it the sniff test to decide whether
it is likely to be garbage.

• Precision: Nothing is ever completely true or false in science, while
everything is either true or false in computer science or mathematics.
Generally speaking, computer scientists are happy printing floating point
numbers to as many digits as possible: 8/13 = 0.61538461538. Real
scientists will use only two significant digits: 8/13 ≈ 0.62. Computer
scientists care what a number is, while real scientists care what it means.
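The two habits in the Precision bullet can be put side by side in a few lines of Python. This is only a sketch: the helper name round_sig is my own invention for illustration, not a library function.

```python
from math import floor, log10

def round_sig(x: float, digits: int = 2) -> float:
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # The position of the leading digit decides how many decimals to keep.
    magnitude = floor(log10(abs(x)))
    return round(x, digits - 1 - magnitude)

ratio = 8 / 13
print(f"{ratio:.11f}")      # 0.61538461538, every digit the machine offers
print(round_sig(ratio, 2))  # 0.62, the digits that carry meaning
```

Rounding by significant digits rather than by decimal places keeps the same level of care whether the number is 0.000615 or 615,000.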

Aspiring data scientists must learn to think like real scientists. Your job is
going to be to turn numbers into insight. It is important to understand the why
as much as the how.
To be fair, it benefits real scientists to think like data scientists as well. New
experimental technologies enable measuring systems on vastly greater scale than
ever possible before, through technologies like full-genome sequencing in biology
and full-sky telescope surveys in astronomy. With new breadth of view comes
new levels of vision.
Traditional hypothesis-driven science was based on asking specific questions
of the world and then generating the specific data needed to confirm or deny
it. This is now augmented by data-driven science, which instead focuses on
generating data on a previously unheard of scale or resolution, in the belief that
new discoveries will come as soon as one is able to look at it. Both ways of
thinking will be important to us:

• Given a problem, what available data will help us answer it?

• Given a data set, what interesting problems can we apply it to?

There is another way to capture this basic distinction between software en-
gineering and data science. It is that software developers are hired to build
systems, while data scientists are hired to produce insights.
This may be a point of contention for some developers. There exists an
important class of engineers who wrangle the massive distributed infrastructures
necessary to store and analyze, say, financial transaction or social media data
at full Facebook- or Twitter-level scale. Indeed, I will devote Chapter 12
to the distinctive challenges of big data infrastructures. These engineers are
building tools and systems to support data science, even though they may not
personally mine the data they wrangle. Do they qualify as data scientists?
This is a fair question, one I will finesse a bit so as to maximize the poten-
tial readership of this book. But I do believe that the better such engineers
understand the full data analysis pipeline, the more likely they will be able to
build powerful tools capable of providing important insights. A major goal of
this book is providing big data engineers with the intellectual tools to think like
big data scientists.

1.2 Asking Interesting Questions from Data


Good data scientists develop an inherent curiosity about the world around them,
particularly in the associated domains and applications they are working on.
They enjoy talking shop with the people whose data they work with. They ask
them questions: What is the coolest thing you have learned about this field?
Why did you get interested in it? What do you hope to learn by analyzing your
data set? Data scientists always ask questions.
Good data scientists have wide-ranging interests. They read the newspaper
every day to get a broader perspective on what is exciting. They understand that
the world is an interesting place. Knowing a little something about everything
equips them to play in other people’s backyards. They are brave enough to get
out of their comfort zones a bit, and driven to learn more once they get there.
Software developers are not really encouraged to ask questions, but data
scientists are. We ask questions like:

• What things might you be able to learn from a given data set?

• What do you/your people really want to know about the world?

• What will it mean to you once you find out?

Computer scientists traditionally do not really appreciate data. Think about
the way algorithm performance is experimentally measured. Usually the pro-
gram is run on “random data” to see how long it takes. They rarely even look
at the results of the computation, except to verify that it is correct and efficient.
Since the “data” is meaningless, the results cannot be important. In contrast,
real data sets are a scarce resource, which required hard work and imagination
to obtain.
Becoming a data scientist requires learning to ask questions about data, so
let’s practice. Each of the subsections below will introduce an interesting data
set. After you understand what kind of information is available, try to come
up with, say, five interesting questions you might explore/answer with access to
this data set.

Figure 1.1: Statistical information on the performance of Babe Ruth can be
found at http://www.baseball-reference.com.

The key is thinking broadly: the answers to big, general questions often lie
buried in highly-specific data sets, which were by no means designed to contain
them.

1.2.1 The Baseball Encyclopedia


Baseball has long had an outsized importance in the world of data science. This
sport has been called the national pastime of the United States; indeed, French
historian Jacques Barzun observed that “Whoever wants to know the heart and
mind of America had better learn baseball.” I realize that many readers are not
American, and even those who are might be completely uninterested in sports.
But stick with me for a while.
What makes baseball important to data science is its extensive statistical
record of play, dating back for well over a hundred years. Baseball is a sport of
discrete events: pitchers throw balls and batters try to hit them – that naturally
lends itself to informative statistics. Fans get immersed in these statistics as chil-
dren, building their intuition about the strengths and limitations of quantitative
analysis. Some of these children grow up to become data scientists. Indeed, the
success of Brad Pitt’s statistically-minded baseball team in the movie Moneyball
remains the American public’s most vivid contact with data science.
This historical baseball record is available at
http://www.baseball-reference.com. There you will find complete statistical
data on the performance of every player who ever stepped on the field. This
includes summary statistics of each
season’s batting, pitching, and fielding record, plus information about teams

Figure 1.2: Personal information on every major league baseball player is
available at http://www.baseball-reference.com.

and awards as shown in Figure 1.1.


But more than just statistics, there is metadata on the life and careers of all
the people who have ever played major league baseball, as shown in Figure 1.2.
We get the vital statistics of each player (height, weight, handedness) and their
lifespan (when/where they were born and died). We also get salary information
(how much each player got paid every season) and transaction data (how did
they get to be the property of each team they played for).
Now, I realize that many of you do not have the slightest knowledge of or
interest in baseball. This sport is somewhat reminiscent of cricket, if that helps.
But remember that as a data scientist, it is your job to be interested in the
world around you. Think of this as a chance to learn something.
So what interesting questions can you answer with this baseball data set?
Try to write down five questions before moving on. Don’t worry, I will wait here
for you to finish.

The most obvious types of questions to answer with this data are directly
related to baseball:

• How can we best measure an individual player’s skill or value?


• How fairly do trades between teams generally work out?
• What is the general trajectory of a player’s performance level as they mature
and age?
• To what extent does batting performance correlate with position played?
For example, are outfielders really better hitters than infielders?

These are interesting questions. But even more interesting are questions
about demographic and social issues. Almost 20,000 major league baseball play-
ers have taken the field over the past 150 years, providing a large, extensively-
documented cohort of men who can serve as a proxy for even larger, less well-
documented populations. Indeed, we can use this baseball player data to answer
questions like:

• Do left-handed people have shorter lifespans than right-handers?
Handedness is not captured in most demographic data sets, but has been diligently
assembled here. Indeed, analysis of this data set has been used to show
that right-handed people live longer than lefties [HC88]!

• How often do people return to live in the same place where they were
born? Locations of birth and death have been extensively recorded in this
data set. Further, almost all of these people played at least part of their
career far from home, thus exposing them to the wider world at a critical
time in their youth.

• Do player salaries generally reflect past, present, or future performance?

• To what extent have heights and weights been increasing in the population
at large?
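To see how little machinery the handedness question needs, here is a minimal Python sketch. It assumes each player has been reduced to a (hand, birth year, death year) tuple, which is my own simplification rather than the layout of the actual data set, and the records shown are made up purely to exercise the function.

```python
from statistics import mean

def mean_lifespan_by_hand(players):
    """Average lifespan in years for each throwing hand.

    `players` is an iterable of (hand, birth_year, death_year) tuples.
    Players who are still alive (death_year is None) are skipped.
    """
    lifespans = {}
    for hand, born, died in players:
        if died is None:
            continue
        lifespans.setdefault(hand, []).append(died - born)
    return {hand: mean(years) for hand, years in lifespans.items()}

# Invented records for illustration only; these are not real players.
toy_players = [
    ("L", 1900, 1970),
    ("L", 1910, 1990),
    ("R", 1905, 1985),
    ("R", 1920, None),
]
print(mean_lifespan_by_hand(toy_players))  # {'L': 75, 'R': 80}
```

Note that simply dropping living players biases the averages, since long-lived members of recent cohorts are excluded. A careful analysis would have to account for this before drawing any conclusion.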

There are two particular themes to be aware of here. First, the identifiers
and reference tags (i.e. the metadata) often prove more interesting in a data set
than the stuff we are supposed to care about, here the statistical record of play.
Second is the idea of a statistical proxy, where you use the data set you have
to substitute for the one you really want. The data set of your dreams likely
does not exist, or may be locked away behind a corporate wall even if it does.
A good data scientist is a pragmatist, seeing what they can do with what they
have instead of bemoaning what they cannot get their hands on.

1.2.2 The Internet Movie Database (IMDb)


Everybody loves the movies. The Internet Movie Database (IMDb) provides
crowdsourced and curated data about all aspects of the motion picture industry,
at www.imdb.com. IMDb currently contains data on over 3.3 million movies and
TV programs. For each film, IMDb includes its title, running time, genres, date
of release, and a full list of cast and crew. There is financial data about each
production, including the budget for making the film and how well it did at the
box office.
Finally, there are extensive ratings for each film from viewers and critics.
This rating data consists of scores on a zero to ten stars scale, cross-tabulated
into averages by age and gender. Written reviews are often included, explaining
why a particular critic awarded a given number of stars. There are also links
between films: for example, identifying which other films have been watched
most often by viewers of It’s a Wonderful Life.
Every actor, director, producer, and crew member associated with a film
merits an entry in IMDb, which now contains records on 6.5 million people.

Figure 1.3: Representative film data from the Internet Movie Database.

Figure 1.4: Representative actor data from the Internet Movie Database.

These happen to include my brother, cousin, and sister-in-law. Each actor
is linked to every film they appeared in, with a description of their role and
their ordering in the credits. Available data about each personality includes
birth/death dates, height, awards, and family relations.
So what kind of questions can you answer with this movie data?

Perhaps the most natural questions to ask IMDb involve identifying the
extremes of movies and actors:
• Which actors appeared in the most films? Earned the most money? Ap-
peared in the lowest rated films? Had the longest career or the shortest
lifespan?
• What was the highest rated film each year, or the best in each genre?
Which movies lost the most money, had the highest-powered casts, or got
the least favorable reviews?
Then there are larger-scale questions one can ask about the nature of the
motion picture business itself:
• How well does movie gross correlate with viewer ratings or awards? Do
customers instinctively flock to trash, or is virtue on the part of the cre-
ative team properly rewarded?
• How do Hollywood movies compare to Bollywood movies, in terms of rat-
ings, budget, and gross? Are American movies better received than foreign
films, and how does this differ between U.S. and non-U.S. reviewers?
• What is the age distribution of actors and actresses in films? How much
younger is the actress playing the wife, on average, than the actor playing
the husband? Has this disparity been increasing or decreasing with time?
• Live fast, die young, and leave a good-looking corpse? Do movie stars live
longer or shorter lives than bit players, or compared to the general public?
Assuming that people working together on a film get to know each other,
the cast and crew data can be used to build a social network of the movie
business. What does the social network of actors look like? The Oracle of
Bacon (https://oracleofbacon.org/) posits Kevin Bacon as the center of
the Hollywood universe and generates the shortest path to Bacon from any
other actor. Other actors, like Samuel L. Jackson, prove even more central.
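A shortest path of this kind can be found by breadth-first search over a graph in which two actors are linked whenever they share a film. The sketch below is my own illustration of that idea, not a description of how the Oracle of Bacon is implemented. The function names and the placeholder films and actors are invented.

```python
from collections import deque

def build_costar_graph(casts):
    """Map each actor to the set of actors they shared a film with."""
    graph = {}
    for cast in casts.values():
        for actor in cast:
            graph.setdefault(actor, set()).update(a for a in cast if a != actor)
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of actors, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        actor = queue.popleft()
        if actor == target:
            # Walk the parent links back to the source.
            path = []
            while actor is not None:
                path.append(actor)
                actor = parent[actor]
            return path[::-1]
        for costar in graph[actor]:
            if costar not in parent:
                parent[costar] = actor
                queue.append(costar)
    return None

# Placeholder films and actors, invented for illustration.
toy_casts = {
    "Film 1": ["A", "B"],
    "Film 2": ["B", "C"],
    "Film 3": ["C", "D"],
    "Film 4": ["E"],
}
graph = build_costar_graph(toy_casts)
print(shortest_path(graph, "A", "D"))  # ['A', 'B', 'C', 'D']
```

Because breadth-first search explores actors in order of increasing distance, the first path that reaches the target uses the fewest possible links.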
More critically, can we analyze this data to determine the probability that
someone will like a given movie? The technique of collaborative filtering finds
people who liked films that I also liked, and recommends other films that they
liked as good candidates for me. The 2007 Netflix Prize was a $1,000,000 com-
petition to produce a ratings engine 10% better than the proprietary Netflix
system. The ultimate winner of this prize (BellKor) used a variety of data
sources and techniques, including the analysis of links [BK07].
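The core idea of collaborative filtering fits in a few lines. What follows is a deliberately naive sketch under my own assumptions (each user is reduced to a set of liked films, and candidates are scored by overlap counts). It is not the Netflix system or the BellKor method, and the users and films are placeholders.

```python
from collections import Counter

def recommend(likes, user, top_n=3):
    """Suggest films liked by users whose tastes overlap with `user`.

    `likes` maps each user to the set of films they liked. Each candidate
    film is scored by the summed overlap of the users who liked it.
    """
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared taste, so no evidence either way
        for film in theirs - mine:
            scores[film] += overlap
    # Sort by score (descending), then title, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [film for film, _ in ranked[:top_n]]

# Placeholder users and films, invented for illustration.
toy_likes = {
    "me":    {"F1", "F2"},
    "user1": {"F1", "F2", "F3"},
    "user2": {"F2", "F4"},
    "user3": {"F5"},
}
print(recommend(toy_likes, "me"))  # ['F3', 'F4']
```

Film F3 ranks first because the user who liked it shares two films with me, while F5 never appears because its only fan shares none.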

Figure 1.5: The rise and fall of data processing, as witnessed by Google Ngrams.

1.2.3 Google Ngrams


Printed books have been the primary repository of human knowledge since
Gutenberg’s invention of movable type in 1439. Physical objects live somewhat
uneasily in today’s digital world, but technology has a way of reducing every-
thing to data. As part of its mission to organize the world’s information, Google
undertook an effort to scan all of the world’s published books. They haven’t
quite gotten there yet, but the 30 million books thus far digitized represent over
20% of all books ever published.
Google uses this data to improve search results, and provide fresh access
to out-of-print books. But perhaps the coolest product is Google Ngrams, an
amazing resource for monitoring changes in the cultural zeitgeist. It provides
the frequency with which short phrases occur in books published each year.
Each phrase must occur at least forty times in their scanned book corpus. This
eliminates obscure words and phrases, but leaves over two billion time series
available for analysis.
This rich data set shows how language use has changed over the past 200
years, and has been widely applied to cultural trend analysis [MAV+11]. Figure
1.5 uses this data to show how the word data fell out of favor when thinking
about computing. Data processing was the popular term associated with the
computing field during the punched card and spinning magnetic tape era of the
1950s. The Ngrams data shows that the rapid rise of Computer Science did not
eclipse Data Processing until 1980. Even today, Data Science remains almost
invisible on this scale.
Check out Google Ngrams at http://books.google.com/ngrams. I promise
you will enjoy playing with it. Compare hot dog to tofu, science against religion,
freedom to justice, and sex vs. marriage, to better understand this fantastic
telescope for looking into the past.
But once you are done playing, think of bigger things you could do if you
got your hands on this data. Assume you have access to the annual number
of references for all words/phrases published in books over the past 200 years.
Google makes this data freely available. So what are you going to do with it?

Observing the time series associated with particular words using the Ngrams
Viewer is fun. But more sophisticated historical trends can be captured by
aggregating multiple time series together. The following types of questions
seem particularly interesting to me:

• How has the amount of cursing changed over time? Use of the
four-letter words I am most familiar with seems to have exploded since 1960,
although it is perhaps less clear whether this reflects increased cussing or
lower publication standards.

• How often do new words emerge and get popular? Do these words tend
to stay in common usage, or rapidly fade away? Can we detect when
words change meaning over time, like the transition of gay from happy to
homosexual?

• Have standards of spelling been improving or deteriorating with time,
especially now that we have entered the era of automated spell checking?
Rarely-occurring words that are only one character removed from a
commonly-used word are likely candidates to be spelling errors (e.g. al-
gorithm vs. algorthm). Aggregated over many different misspellings, are
such errors increasing or decreasing?
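The spelling heuristic in the last bullet can be sketched directly: flag rare words that sit exactly one insertion, deletion, or substitution away from a common word. The function names, thresholds, and word counts below are all assumptions of mine, chosen only to illustrate the test.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                           # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # a single substituted character
    return a[i:] == b[i + 1:]            # a single extra character in b

def likely_misspellings(counts, rare_max, common_min):
    """Pair each rare word with the common words one edit away from it."""
    common = [w for w, n in counts.items() if n >= common_min]
    suspects = {}
    for word, n in counts.items():
        if n > rare_max:
            continue
        neighbors = [c for c in common if within_one_edit(word, c)]
        if neighbors:
            suspects[word] = neighbors
    return suspects

# Made-up counts for illustration; not taken from the Ngrams corpus.
toy_counts = {"algorithm": 50000, "logarithm": 30000, "algorthm": 45, "tofu": 40}
print(likely_misspellings(toy_counts, rare_max=100, common_min=1000))
# {'algorthm': ['algorithm']}
```

Comparing every rare word against every common word is quadratic, so a corpus with billions of entries would need an index (for example, hashing each word with one character deleted) rather than this brute-force loop.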

You can also use this Ngrams corpus to build a language model that captures
the meaning and usage of the words in a given language. We will discuss word
embeddings in Section 11.6.3, which are powerful tools for building language
models. Frequency counts reveal which words are most popular. The frequency
of word pairs appearing next to each other can be used to improve speech
recognition systems, helping to distinguish whether the speaker said that’s too
bad or that’s to bad. These millions of books provide an ample data set to build
representative models from.
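The word-pair idea can be illustrated with a toy example: count adjacent pairs in a corpus, then prefer the candidate transcription whose pairs have been seen most often. This is a crude stand-in for a real language model, which would use smoothed probabilities rather than raw counts. The three-sentence corpus and the function names are my own inventions.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent word pairs across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def pick_transcription(candidates, counts):
    """Choose the candidate whose word pairs are most frequent in the corpus."""
    def score(text):
        words = text.lower().split()
        return sum(counts[pair] for pair in zip(words, words[1:]))
    return max(candidates, key=score)

# A tiny invented corpus; the real Ngrams data covers millions of books.
toy_corpus = [
    "that's too bad",
    "it is too bad",
    "going to town",
]
counts = bigram_counts(toy_corpus)
print(pick_transcription(["that's too bad", "that's to bad"], counts))
# that's too bad
```

The pair (too, bad) appears twice in the corpus while (to, bad) never does, so the first candidate wins even though the two sound identical.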

1.2.4 New York Taxi Records


Every financial transaction today leaves a data trail behind it. Following these
paths can lead to interesting insights.
Taxi cabs form an important part of the urban transportation network. They
roam the streets of the city looking for customers, and then drive them to their
destination for a fare proportional to the length of the trip. Each cab contains
a metering device to calculate the cost of the trip as a function of time. This
meter serves as a record keeping device, and a mechanism to ensure that the
driver charges the proper amount for each trip.
The taxi meters currently employed in New York cabs can do many things
beyond calculating fares. They act as credit card terminals, providing a way
shown, and this will be of service in explaining the different parts as
they are referred to.
In the construction of engines, as will be more particularly pointed
out hereinafter, the inlet and exhaust valves are usually operated by
mechanical means, but certain engines are so constructed that the
inlet valve is automatic in its operation, and the exhaust valve only is
actuated mechanically.
In the drawings, Figs. 63 to 66, inclusive, both valves are operated
from cams on a secondary shaft, and in the first of these four figures
the crank has just turned the point where the piston is at its highest
limit, and is about to descend. Both valves A B are closed, and the
spark fires the charge, driving down the piston to its lowest limit.
In Fig. 64 the crank is shown about to move the piston upwardly,
and just as it turns the dead center the cam C, on the secondary
shaft, unseats the valve B, through the stem D. As the piston moves
upwardly, the burnt gases are forced out past the valve B.
When the piston reaches the highest point in its first revolution, as
shown in Fig. 65, the stem D drops off the cam C, thus closing the
discharge, and immediately the valve A is opened by the cam E
moving the valve stem F upwardly, and as the piston now descends,
fuel is now drawn in until the piston reaches its lowest point.
In Fig. 66 the crank is turning the dead center, and is about to
move upwardly, and the cams G E are now both in such position that
the valves A B are closed, and when the piston moves up again, to
complete the second revolution, the fuel gas within the cylinder is
compressed, and ready to be fired the moment the crank reaches
the position, shown in Fig. 63.
Fig. 65. Drawing in Charge.
Fig. 66. Compression.

The Ignition Point in the Cycle.—In practice, the firing takes place
before the crank has made the turn past the dead center, and this is
called pre-ignition, when the spark is advanced too far to the left.
The ignition should take place slightly before the crank turns,
because it takes a small interval of time for the charge to burn the
gases, and during this time the crank will have passed the dead
center, and started on its way downwardly.
From the diagrams it will be observed that two of the strokes,
namely the first and the third, are downward, and the second and
fourth are upward; the downward strokes take place during the
impulse and the admission, and the upward strokes during the
exhaust and the compression.
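The order of the four strokes, as laid out in Figs. 63 to 66, can be set down as a small table. This is only a sketch: the names `STROKES`, `stroke_at` and `power_strokes` are illustrative, and each stroke is counted as one half turn of the crank.

```python
# One entry per stroke, in the order of Figs. 63 to 66.
STROKES = [
    ("impulse", "down"),
    ("exhaust", "up"),
    ("admission", "down"),
    ("compression", "up"),
]


def stroke_at(half_turn: int):
    """Return (name, direction) of the stroke under way during a given half turn."""
    return STROKES[half_turn % 4]


def power_strokes(revolutions: int) -> int:
    """One impulse for every two revolutions of the crank."""
    return revolutions // 2
```

Counting half turns from zero, `stroke_at(0)` is the impulse and `stroke_at(3)` the compression, after which the cycle repeats.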
The Fly-Wheel.—As the impulse in this type can take place only at
each second revolution, it is obvious that some means must be
provided to keep the shaft moving during the two turns, and for this
purpose the fly-wheel is utilized.
Practice has found the multi-cylinder type the most valuable, in
connection with the fly-wheel, as in employing two or more cylinders
in line, a smaller fly wheel will be sufficient.
Impulses in 4-Cylinder Engine.—In such a case the four cylinders are
arranged so the impulse will be at four different points of the shaft,
and we may assume that the four cylinders in Figs. 63, 64, 65 and
66, show the relative positions of the four pistons in a four cylinder
engine.
The Cylinder Case, and Connections.—A cross section of a case and
the relative positions of the various parts, is shown in Fig. 67. The
cylinder A is provided with a water jacket B, so as to form a space C
around the cylinder which has an inlet pipe D at the bottom, and an
outlet pipe E at the upper end.

Fig. 67. Automatic Inlet Valve.

The inlet valve F is in the head of the cylinder, and it is held
against its seat by a tension spring G. The exhaust valve H is placed
in a lateral extension of the cylinder, in such a position that it is
directly above the secondary shaft I running through the crank case.
The stem J of the valve, is actuated by a cam K on the secondary
shaft, and it is, preferably, made in two parts, the upper being so
arranged that it has a limited longitudinal movement independently
of the lower part, and a spring is arranged so as to provide for
longitudinal thrust in either direction.
The crank shaft M has alongside the crank, a gear wheel N, which
meshes with a gear O on the secondary shaft I, this latter gear
being twice the diameter of the gear N.
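Because meshing gears turn at speeds inversely proportional to their diameters, a gear O twice the size of N makes the secondary shaft turn once for every two turns of the crank, so that each cam acts once per cycle. A minimal sketch, with a hypothetical function name and arbitrary diameters:

```python
def secondary_shaft_rpm(crank_rpm, crank_gear_diameter, secondary_gear_diameter):
    """Speed of the driven shaft for two meshing gears of the given diameters."""
    return crank_rpm * crank_gear_diameter / secondary_gear_diameter


# Gear N of 2 inches driving gear O of 4 inches (illustrative sizes only).
cam_speed = secondary_shaft_rpm(800, 2.0, 4.0)
```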
Piston and Crank Construction.—The piston is hollow, and the crank
is located as close to the head as possible. The piston has two or more
circumferential grooves, to receive packing rings. The rings are
made of very hard steel, and are turned up slightly larger than the
diameter of the cylinder, and then cut across diagonally, so they may
be sprung into place, and when in position they will bear against the
inside of the cylinder, and thus serve to prevent the passage of the
gases.
Calculating the Efficiency.—The great problem with every beginner
is to know something of the power of the engine, and how it is
determined. Considering that the boy knows nothing of the terms
used to designate the step we shall try to make the following
description as free from technicalities as possible.
In Fig. 68 a cylinder is represented, containing a piston A. B C
indicate the limits of the stroke, and for convenience this space is
provided with eleven marks to represent the pressure of the ignited
gases at various portions of the travel of the piston.
Pressure in Explosion.—When the explosion takes place, at B, the
pressure will be, approximately, 230 pounds per square inch of the
piston. When it moves to the next mark the pressure has decreased
to 220 pounds, at the next mark it is 200, and so on, until, at the
end of the stroke, opposite C, the pressure is only 40 pounds.

Fig. 68. Calculating Efficiency.


Expansion Line.—These figures represent the expansion line. It is
now necessary to get the mean effective pressure, which means that
we must know what the average pressure of the gas is in each
square inch from B to C.
Mean Effective Pressure.—This is obtained by adding together the
figures given in the sketch, and the result is, 1530. As eleven
pressures were required to produce this sum, it should be divided by
that number, making the result 148, avoiding fractions, as we shall
do in all the calculations.
The figures represent that the mean effective pressure of the
gases on the piston is 148 pounds. If this is multiplied by the area of
the piston, and this result by the stroke in feet and the number of
power strokes per minute, we get what is called foot pounds.
Foot Pounds.—Assuming that the diameter of the piston is 5
inches, which, figure, if multiplied by 3.1416, will give its area as a
little over 15-1/2 square inches. Let us assume the crank is 4 inches.
This will give a power stroke of 8 inches.
To find out how many power strokes there are in a minute, we
must know the revolutions, and this being taken at 800, and a
power stroke at only every other revolution, would mean that we
have 400 impulses, and each impulse traveled 8 inches, = 3200.
This represents inches, which must be converted into feet, so that
we have 266 feet of power strokes per minute.
First multiply the mean effective pressure on the cylinder, that is
148 × 15-1/2, which equals 2294. Then, 2294 × 266, equals
610,204. This product represents foot pounds.
Work or Energy.—A foot pound is the amount of work or energy
expended in raising a weight of one pound, through a distance of
one foot. If 550 pounds should be raised one foot in one second of
time it would represent one horse power of work accomplished. Kept
up for one minute of time, this rate would be equal to
550 × 60 = 33,000 foot pounds, and this would mean one horse power,
or the work done in one minute of time.
Fig. 69. Two-cycle Expansion Position.

In our above calculation we have determined how many foot
pounds we had in a minute of time, so that if we divide the foot
pounds 610,204, by 33,000, we shall get as a result, just under
18-1/2 horse power.
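The whole calculation can be retraced in a few lines. The sketch below uses the figures exactly as stated in the text (148 pounds mean effective pressure, 15-1/2 for the piston area, an 8 inch stroke, 800 revolutions) and drops fractions of a foot as the text does. The function name is illustrative. Note that 15-1/2 is the text's working figure; the geometric area of a 5 inch circle, 3.1416 × 5 × 5 / 4, would be about 19.6 square inches.

```python
def horsepower(mep_psi, piston_area_sq_in, stroke_in, rpm):
    """Return (foot pounds per minute, horse power) for a four-cycle cylinder."""
    power_strokes_per_min = rpm // 2                      # one impulse every other revolution
    travel_ft = power_strokes_per_min * stroke_in // 12   # inches to whole feet
    foot_pounds = mep_psi * piston_area_sq_in * travel_ft
    return foot_pounds, foot_pounds / 33000               # 33,000 foot pounds per minute = 1 h.p.


foot_pounds, hp = horsepower(148, 15.5, 8, 800)
```

With these inputs the product is 610,204 foot pounds, or about 18.49 horse power. Substituting a different piston area changes the result in direct proportion.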
The Two-Cycle Engine.—The longitudinal shell A, Fig. 69, is separate
from the crank case B, the latter being secured to the former by
flanges and bolts, as at C. The piston D is of such length that when
it reaches the limit of its compression stroke, as shown in this figure,
it covers both the supply port E and the discharge port F.
In its outward stroke the upper end clears both of these ports as
in Fig. 71, the discharge port F being the first to open, as shown in
Fig. 70.

Fig. 70. Exhausting.


Fig. 71. Compression.

Cycle of Operations.—The cycle of operation is as follows: The
inward stroke, which is in the direction of the head of the cylinder,
draws in the gaseous fuel through the valve G, and at its outward
stroke the gas in the crank case B is compressed, and the moment
the end of the piston passes the inlet port E, the gas passes through
the duct H into the cylinder above the piston.
The burnt gases within the cylinder pass out the discharge port F,
facilitated, in a measure, by the compressed inflowing gas. When
the piston again returns, and passes the discharge port, the gas is
trapped, and is compressed during the inward stroke of the piston.
The Crank Shaft.—The most important element in the engine is the
crank shaft. It is usually made of a single steel forging, and out of
this are turned up the crank wrists, the crank arms, and the bearings
which are placed intermediate the different cranks. It is made
extremely large to provide for any strain due to the fuel explosions,
and it is the most difficult part of the engine to turn out.

Fig. 72. Crank Shaft.

Special Metals.—Special metals are used by various manufacturers,
and the sizes and structural shapes are now so well understood that
few of them break, although in the early history of the engine this
was the weak and troublesome part of the car.
Improper alignment in the case, and poor or faulty bearings, were
responsible for many accidents, and now means have been found to
overcome most of these objections.
Engine Troubles.—When we come to consider the engine troubles,
so-called, we shall find there are legions of them. In these days
many of the troubles are easy to remedy, but to remedy them
means that the causes of troubles should be understood. A physician
cannot prescribe for a disease until he has made a diagnosis.
Sometimes the difficulty will be recognized by the symptoms, and
is easily adjusted. But suppose the firing is all right, and the engine
fails to pick up, and seems to be dying out, it may be attributable to
several causes, either one of which would account for it.
Difficulties Pointed Out.—If the engine seems to run down, and
fails to pick up quickly, it may be due to water in the carbureter, or
to a weak battery, or to leaks in the water jacket that will admit
water into the compression chamber, or the trouble may be faulty
compression.
Other things should be looked up: The pump may be out of order,
the connections loose, and thus permit waste through the leaks, or
there may be a stoppage somewhere in the water circulation, or the
water may be exhausted, or the gasoline too low or too poor for the
kind of carbureter which you have.
If anything is due to the engine itself, in the vast majority of
cases, it is due to poor compression. The engine is too often blamed
for faults which belong elsewhere. Nevertheless, it is well carefully to
examine the bearings, to look over the clutch, and the bearings in
the line leading to the drive shaft.
Starting the Engine.—In starting, some engines give a great deal of
trouble, usually due to wrong adjustment of the sparking device.
This should not be advanced too much. If the trouble is not at that
point, it may arise from too weak a suction, or an obstruction in the
carbureter itself.
Carbureter.—At slow turning speed of the engine, the carbureter is
very sluggish, because it must be started up from a condition of
repose, and unless there is the best of compression, the suction will
not be sufficient to dislodge or move the slightest impediment which
may be in the way.
Low Compression.—Low compression arises from numerous causes.
A carelessly screwed sparking plug; defective or partly blown out
gasket in the cylinder head; loose, or partly open compression cock;
a sticking valve; a rusted, or defective inlet valve; leak in the
combustion chamber; or a worn or scratched cylinder.
Whenever it is possible, the engine should be examined to
observe the condition of the piston rings. Sometimes the rings will
break into small pieces, and these parts will wear the most
perceptible creases in the cylinder walls. When such is the case they
will have to be taken out and lapped.
Mixtures.—Too rich a mixture has the effect, in many cases, of
causing a deposit of carbon which is bad for the engine. It coats the
walls of the cylinders, and is hard to remove. The application of
petroleum and alcohol, if allowed to remain in the cylinder for some
hours, will aid in taking it out, but removing the cylinder and
scraping is the only safe method.
The usual way to test the cylinders to see whether either misses
fire, is to cut out all of the spark plugs except one, and then test
that, and so with all the others in succession, and in this way the
location of the trouble will be discovered.
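The test just described is a simple process of elimination, and may be sketched as follows. The function `fires_alone` is a stand-in for the physical test of running the engine on a single plug; it is hypothetical and would be supplied by the operator's own observation.

```python
def find_misfiring(cylinders, fires_alone):
    """Test each cylinder with all other plugs cut out; return those that miss."""
    return [c for c in cylinders if not fires_alone(c)]


# Illustration only: pretend that cylinder 3 is the faulty one.
faulty = find_misfiring([1, 2, 3, 4], lambda cylinder: cylinder != 3)
```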
Spark Plugs.—It is also the case that carbon deposits on the plug
points will become heated up to such a point that pre-ignition will
take place. Over-heated cylinders may cause this, and in certain
cases, where the rotor arm wears, at the contact point, it leaves a
trail of metallic particles over which the current will travel.
The Weather.—Cold weather is often a serious check to the starting
of an engine; the water jacket, or some of the piping, may be frozen,
or the lubricating oil may become too thick to render proper service.
Drainage.—A careful operator will see to it that when the car is left
all the water will be drained from the pipes and the water jacket and
pump, and the parts can be dried out by running the engine for a
minute or so, during the time of draining, so as to heat up the parts.
CHAPTER X
COOLING SYSTEMS

Proper cooling is a necessary feature of all gasoline motors,
otherwise the intense heat of the burning fuel would expand the
pistons to such an extent as to prevent their free motion in the
cylinders, as well as destroy the spark plugs, injure the springs, and
make lubrication a difficult matter, if not impossible, by burning up
the oil.
Air Cooling.—Cooling was originally obtained by using air, which
was blown against the cylinders; but this was not generally
developed to a satisfactory degree except for small motors.
Air does not take up heat readily, whereas water is the greatest
absorbent known, and in the primary stages of the art water was
objected to on account of its weight, and for the further reason that
the jacketing of the engine was considered a needless expense.
One of the best known devices to increase the cooling capacity
with air cooling, and now largely used in motorcycles, is to provide
the cylinders with a plurality of thin broad ribs, annularly-disposed,
as shown in Fig. 72a.

Fig. 72a. Increasing Cooling Area.

Air-Cooling Devices.—A highly-heated metallic surface actually
repels such a subtile fluid as air, hence it is necessary to supply the
cylinders with a blast of air, and also provide a greater cooling area,
so that if the ribs themselves can be cooled, the temperature will be
decreased in proportion to the enlarged surface thus provided.
In using water this artifice is not necessary, because it will absorb
heat instantly along the surface in contact with the metal, and
quickly change the heated particles in favor of the cooler portions.
Water Cooling.—While heat will cause a circulation of water in a
definite direction, for the foregoing reason, it has been found that, in
practice, it is more practical to keep up the movement by mechanical
means.
This is done by a pump placed in the line of the circulating pipe,
and usually so arranged that the cold, or coldest, water is forced into
the circulating area around the cylinders.

Fig. 73. Movement of Heated Water.

Gravity System.—The natural circulation is founded on the principle
of the well known law, that heated water will flow upwardly, hence,
if a cylinder, such as A, Fig. 73, which has a water jacket around it,
has its lower end connected by a pipe B, from the bottom of a water
reservoir C, and the upper end of the jacket is provided with a pipe
connection D, with the upper part of the reservoir, the water will flow
from the bottom of the reservoir to the jacket, and from the top of
the jacket to the reservoir, in the direction of the arrows.
Locating the Reservoir.—This flow would be materially increased if
the reservoir should be located a considerable distance above the
jacket. But in an automobile it would be difficult to use an elevated
reservoir, and, furthermore, as means must be provided to cool the
water, such disposition of the reservoir would be still more
impracticable.

Fig. 74. Cooling System.

The area forward of the engine is the most available space for
placing the water tank, and, especially for the reasons that the
radiator itself may be utilized for inclosing the engine hood, and
because the air, which is only partially heated in passing through the
radiator, serves to keep the space within the hood reasonably cool.
Force System of Cooling.—Under the circumstances the water
should be caused to circulate by mechanical means, which, while it
adds another operative element to the machinery, is nevertheless so
much more effective that it is worth the care, attention and expense
which are involved.
The Radiator Connection.—In Fig. 74 a radiator, engine and
circulating system are connected together to show the relative
arrangement of the various elements, in which the pump A is placed
in the pipe line B running from the lower end of the radiator C to the
manifold D at the lower end of the water jacket of the engine.
The upper end of the radiator is connected by a pipe E with the
top of the jacket, and the pipes are thus so disposed as to be free of
the other mechanism, and are all contained within the hood of the
engine.
A fan F, suitably geared to the crank shaft of the engine, provides
a means for inducing an air current through the radiator whenever
the engine is running.
Radiators.—Much time and money has been spent in developing a
simple and efficient type of radiator. As, of necessity, it must be
made up of a multiplicity of parts, leakage is apt to occur, and while
in the past most of the constructions depended on soldering
together the various portions, it will be seen how insecure such a
system of construction must be necessarily.
Construction of Radiator.—In Fig. 75, is shown a front and a
sectional view of a portion of a simple type, which is made up of
square tubes A, their ends being fitted into square holes formed
through front and rear plates B C, and the tubes are so arranged
that there are small spaces D between the tubes.

Fig. 75. Radiator Type.

When water enters through the inlet tube E, it fills the spaces, and
being cooled moves downwardly, while the air rushing through the
open-ended tubes, cools down the water over the large area thus
afforded.
All radiators employ substantially the same construction, the
illustration given being merely to show the principle of the device.
A drain cock G, Fig. 74 should be placed in the system below the
radiator, in the pipe line B, so that water can be drained off from all
the pipes, to prevent liability of freezing. The diagram shows the fan
shaft connected and run by a belt H. This is not the best
construction, as it is not a positive drive. Most cars are provided with
gearing for this purpose.
Operation of Radiator.—The water is thus carried from the bottom
of the radiator to the water jacket space, and from the upper end of
the jacketed area to the top of the radiator, and used over again.
More or less of the water is lost by evaporation, so more must be
added from time to time, and the radiator should be kept as full as
possible to get the best results. If the water level falls too far below
the return pipe at the top of the radiator, the area of the heating
surface and the decreased quantity of water exposed to the cooling
surface, are likely to cause undue heating, or vaporization.
The Pump.—A variety of pumps are used, but they are generally
based on the principle of the turbine impelling system, or on
centrifugal action. A type which utilizes both these principles is
shown in Figs. 76 and 77, in which the former is a cross vertical
section of Fig. 77 along line 1, and the latter is a central vertical section
on line 2 of Fig. 76.
The device comprises a cylindrical shell A, with an inlet B, at one
edge near the front wall, and an outlet C at the upper edge near the
rear wall.
Pump Construction.—Within is a revoluble tubular hub D, with one
end E projecting, to which power is applied. A disk partition G is
secured to this hub, midway between its ends, and on each side of
the partition is a pair of oppositely-projecting convolute blades,
those on the inlet side, indicated by H, and the ones in the discharge
side by I.
Fig. 76. Side View of Pump.

Fig. 77. Section.

It will be noticed that the blades H on the intake side are so
disposed that their concave surfaces are on the advance sides while
those in the discharge end of the shell have their convex faces in the
retreating side.
Action of Pump.—The hub has inlet ports J below each blade, and
discharge ports K between each of the blades I. When rotating the
points of the blades H catch the water at the inlet and drive it
inwardly through the ports J, from which it passes through the hub
to the ports K, and is then violently thrown by centrifugal motion,
and by the action of the blades I to the discharge opening C.
Should the pump cease working there is always a free passage
way for the natural circulation of water through the pump.
CHAPTER XI
CARBURETERS

In considering carbureters it would be well to have an
understanding of what is meant by this term. It is the practice to call
the vaporized fuel from the carbureter, a gas; but this is a misnomer.
It is not a gas, but a vapor, being merely air which is charged with
small particles of gasoline.
Carbureted Air.—It has been frequently termed also a carbureted
fuel. This is a wrong term. What is meant is carbureted air, because
the air carries the fuel with it, and is impregnated with a carbon
charge.
Composition of Gasoline.—Gasoline contains, approximately, 82 per
cent. carbon, and 15 per cent. of hydrogen. This mixture of the two
fuel elements requires about two parts of oxygen to one part of the
gasoline, but as common air is only one-fifth oxygen and four-fifths
nitrogen, which does not aid in combustion, it is necessary to supply
five times the amount of air, which would mean at least fifteen parts
of air to one of the gasoline.
In speaking of parts it must not be understood, that reference is
made to parts in a liquid form, but it is necessary for the gasoline to
be put into the form of a gas, and this gas becomes the measure
from which we determine the parts.
Gasoline Expansion.—If a cubic inch of gasoline is converted into a
gas, it will occupy a space equal to about one cubic foot, which
means that it now has a volume, or bulk of 1728 cubic inches. Now,
for every 1728 cubic inches of this gas, there must be about 30,000 cubic inches of
air, in order to make a combustible fuel out of the mixture.
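The proportion of air to vapor implied by these figures is easily worked out. The sketch below simply restates the text's numbers; the function and parameter names are illustrative.

```python
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3   # 1728


def vapor_and_air(liquid_cubic_inches, air_per_cubic_foot_of_vapor=30_000):
    """Return (vapor volume, air volume, air-to-vapor ratio), volumes in cubic inches."""
    vapor = liquid_cubic_inches * CUBIC_INCHES_PER_CUBIC_FOOT
    air = liquid_cubic_inches * air_per_cubic_foot_of_vapor
    return vapor, air, air / vapor


vapor, air, ratio = vapor_and_air(1)
```

For one cubic inch of liquid this gives 1728 cubic inches of vapor to 30,000 of air, or a little over seventeen volumes of air to one of vapor.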
Requirements of a Carbureter.—A carbureter is designed to do
several well-defined things: First; it must be able to comminute, or
break up the liquid fuel into infinitesimally small particles.
Second; it must be able to properly mingle the vapor thus
produced.
Third; it should be so constructed that it will automatically check
the inflow of gasoline, and prevent flooding, or waste of the fuel.
Evaporation.—All liquids have the property known as vaporization,
and will change their form into a gaseous state at ordinary
temperatures. All solids will vaporize, if sufficient heat is applied. But
at the ordinary temperature, with which we have to deal, in
considering the use of carbureters, air is the factor which facilitates
the process.
Air Saturation.—Gasoline, confined in a vessel, will vaporize up to a
point where it completely saturates the air contained therein, and
then ceases. If allowed to stand in the open air, it will, in time,
entirely evaporate. This is true of water, also.
It is well, in this connection, to observe another thing. If the same
quantity of liquid is placed in two separate vessels, one very tall,
with a small surface of air in contact with the two surfaces, and the
other vessel very shallow, so it has a large surface in contact with
air, the latter will produce the most speedy evaporation. This shows
that contact with air is the factor of the greatest importance in
making a vapor.
Air Contact With Gasoline.—The office of a carbureter is to provide
the proper amount of air to the liquid fuel,—that is, up to that point
where it can be utilized as a fuel to the best advantage. If a drop of
gasoline, in one case is broken up into five hundred tiny particles,
and in the other case into one thousand, it is obvious that in the
latter case the air comes into contact with a greater surface of the
liquid than in the former case, hence will be so much more efficient,
for the following reason:
Perfect combustion is the desired object in the engine cylinder.
The more nearly the vapor approaches an impalpable gas the
quicker will it ignite. Furthermore, the more intimate the air and the
vapor are mixed the better will be the explosion or combustion.
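How much surface is gained by finer division can be estimated under a simplifying assumption not made in the text, namely that the drop breaks into equal spherical particles. On that assumption the total surface grows with the cube root of the number of particles, as the following sketch shows. The function name is illustrative.

```python
import math


def total_surface(volume, particles):
    """Combined surface area of a given volume split into equal spheres."""
    radius = (3 * volume / (4 * math.pi * particles)) ** (1 / 3)
    return particles * 4 * math.pi * radius ** 2


# Gain in exposed surface when the same drop is broken into 1000 rather than 500 particles.
gain = total_surface(1.0, 1000) / total_surface(1.0, 500)
```

Doubling the number of particles therefore multiplies the exposed surface by the cube root of two, a gain of roughly one quarter, and each further halving of the particle size adds as much again.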
Compression.—The compression of the carbureted air in the engine
cylinder performs certain very important things: When any gas is
compressed the temperature is increased, the theory being that at
each compression to one-half its volume, the temperature is
increased double its former heat.
If, therefore, compression in a cylinder reaches, say, 90 pounds,
the heat set up is sufficient to instantaneously break up the small
globules of gasoline, and at the same time produce a more intimate
unity, which tends to make a more efficient mixture than would be
possible without the compression.
Compression as a Mixing Means.—It will also be understood, that
compression permits the bringing together of a much larger amount
of fuel at each charge than would be possible without it, so that the
three factors, namely, the volatilizing action of the air, the mixing of
the air and vapor, and the compression, all serve to mix together the
elements which will produce an explosion when the proper heat is
finally applied.
Carbureter Types.—There are two distinct types of carbureters, one
in which the gasoline is forced out through a very fine nozzle, and at
the ejecting point is mixed with a current of air which passes to the
engine cylinders, and this is designated as the spraying device.
The other form of construction depends for carbureting the air on
exposing a large body of the gasoline to a passing blast of air, and is
called the surface type.
The Spraying Carbureter.—As most cars now use the spraying
system, that type will be considered first. There is no special form of
nozzle required to eject the fuel, and the distinctive feature of the
various designs has been to produce positive and regular feed and to
assure the proper mixture at all times during the operation of the
engine.
Dissecting the Carbureter.—For the purpose of making each
particular part of a carbureter clear and distinct, let us build up one,
so that special attention may be directed to the various operative
elements.
A cored cylindrical casting A, Fig. 78, is provided, which has a
large opening in its lower end that is closed by a plug B. This plug
has an upwardly-extending tubular projection B´. The upper end of
the cylinder has a cap C, open centrally, and having an opening
formed by a downwardly-projecting tube D, and this has a
contracted throat as at E.
The Mixing Chamber.—The exterior of the downwardly-projecting
cap tube, is turned up true, and fits into the tubular extension B´.
The particular feature of this sketch is to show the adjustment of the
needle valve which admits the gasoline, and the relative position of
the float.

Fig. 78. Carbureter Float and Needle.

The Float Chamber.—The circularly-formed chamber G, within which
the float operates, contains the liquid fuel. The inner end of the plug
B has a cross duct I, and centrally is an upwardly-projecting tubular
extension J, the bore being flaring, as shown, and in this the needle
valve K rests and is made adjustable at its upper threaded end.
When the needle valve is raised, gasoline flows through the duct I
upwardly past the flaring orifice, in J, and air is permitted to flow in
through the openings I around the central tube J, so that the air and
gasoline meet above the upper end of the tube.
The Venturi Tube.—The inwardly-projecting part E constitutes what
is called a venturi tube, the upwardly-rushing air between the
contracted opening formed around the tube at this point being such
that when the two fluids meet and spread out in the enlarged
opening above, the particles of gasoline are not only broken up
minutely, but are intimately mixed with the air.
Fig. 79. Carbureter Inlet Valve.

The Inlet Valve.—Now if this chamber G has at one side an
extension, like L, Fig. 79, means may be provided for adding a valve
to be controlled by the float. Within the extension is an upwardly-
moving needle valve M, which is designed to close the duct which
leads from the gasoline supply.
Between the valve and the float is the fulcrum O, of a lever N, the
short end of which engages with the upper end of the valve and the
long end rests on the float H, as shown. The movement of the float
above the predetermined point has the effect of seating the needle
valve M, thus cutting off the inflow of gasoline until that in the
chamber G is drawn out so that the float descends and again admits
a fresh supply.
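The action of the float and needle valve amounts to a simple on-and-off regulator, which may be imitated in a few lines. The quantities below (level, inflow and draw per step) are arbitrary units chosen for illustration, and the function name is hypothetical.

```python
def simulate_float_chamber(level, shutoff_level, inflow, draw, steps):
    """Return a list of (level, valve_open) pairs, one per time step."""
    history = []
    for _ in range(steps):
        valve_open = level < shutoff_level   # float below the set point: needle valve off its seat
        if valve_open:
            level += inflow                  # gasoline enters from the supply
        level = max(level - draw, 0)         # the engine draws fuel from the chamber
        history.append((level, valve_open))
    return history


history = simulate_float_chamber(level=0, shutoff_level=10, inflow=3, draw=1, steps=50)
```

In this run the level climbs to the shut-off point and then hovers about it, the valve opening and closing in turn, which is the behavior the float is intended to produce.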
Fig. 80. Carbureter Discharge Port.

Thus far we have the fuel oil control, together with the manner in
which the primary air supply is introduced. We shall now go a step
further, and illustrate the mixing chamber, discharge and throttle.
The Throttle Valve.—Referring to Fig. 80 it will be seen that directly
above the venturi tube described, is a space O. This is the mixing
chamber, which has an outlet P to the left, which connects with the
engine cylinders.
Within this tube is a throttle valve Q, operated by the throttle lever
on the steering wheel of the car. It is simply a disk which fits into the
interior of the conduit and is adapted to be turned by a stem R, on
which it is mounted.
While the lower inlets K are designed to supply the primary air for
carburetion, it is found necessary to admit a secondary supply, and
this should be taken into the mixing chamber directly instead of
passing the tube which conveys the oil.
The Secondary Air Supply.—The particular reasons for thus admitting
the air may be explained as follows: When the engine draws in a
supply of carbureted air, more or less of a vacuum is brought about
in the mixing chamber O. The faster the engine runs the richer will
the mixture become, because the additional suction draws in an
increasing quantity of gasoline, but the throat of the tube does not
change, and the requisite, proportionate quantity of air does not
follow, so that the mixture has too much fuel for the air.
Automatic Admission of Secondary Air.—If the engine should be
speeded up so that twice the amount of oil is drawn into the mixing
chamber, the additional suction will not, at the same time, draw in
twice the amount of air.
This necessitates a provision whereby the secondary air shall be
admitted automatically only at times when the suction exceeds the
normal requirement, so as to prevent too rich a mixture, which is
explained by reference to Fig. 81.

Fig. 81. Carbureter Secondary Air Inlet.

The extension S, on the right side of the shell, has an opening T,
with a seat to receive a weighted valve, like a ball U, preferably
reinforced by a spring V, which is capable of having its pressure on
the seat regulated by an adjusting screw W.
It will be obvious, therefore, that during the normal action of the
engine suction, no air will enter the duct T; but when an undue
vacuum exists in the chamber O, the ball valve U is raised, and
additional air is supplied to the carbureted air within the chamber.
Fig. 82. Complete Carbureter.

Carbureter Adjustment.—Each of these four elements has some
particular method of adjustment, as will be more particularly noticed
in the completely assembled carbureter, made up of the foregoing
illustrations, in which the details are refined and shown as actually
made in one of the well known types of carbureters.
Fig. 82 shows the different parts arranged in a practical manner, in
which the regulating arm for controlling the throttle, as well as the
secondary air supply and the gasoline inlets are capable of being
adjusted by special means.
Special Points Concerning Carbureters.—A rich mixture is
undesirable, except in the case of heavy loads and at slow speed, for
various reasons. It does not burn quickly, or explode as readily as a
lean one, and owing to the slow combustion the temperature in the
engine cylinder remains high to the end of the stroke.
Thin Mixtures.—On the other hand, a thin mixture will compress
better and burn with greater facility, and at the same time heat the
cylinder less than the rich mixture, to say nothing of the saving in
fuel. It has long been recognized that a carbureter will not act
uniformly with all engines. Some have better compression than
others, and some have more efficient sparking means. This has a
bearing on the character of the fuel delivered to the cylinders.
Speeds and Mixtures.—There is also a wide difference in the
performances of engines at high and at low speeds, as to the quality
of the mixtures required, so it will be seen that a carbureter which is
capable of being controlled for all emergencies, is the one to select.
Above all, the structure should be such that the valves can be
easily taken out for inspection and repairs. It is impossible to
prevent grit from finding its way into the gasoline, and it is
astonishing how the smallest piece of fiber, finding a lodgment in a
valve, will disarrange the entire power system.
Surface Carbureter.—These devices depend on presenting as large
an area of gasoline as possible, and then conducting the air flow
over the surface so as to take up the volatile hydro-carbon.
The Float.—Such devices also require a float to regulate the inflow
of fuel, and the distinctive feature of construction depends on
increasing or decreasing the area so exposed to the moving air
column.
Fig. 83 shows a well-known type of this character which is a
combination spray and surface carbureter. A U-shaped tube A, with
the air inlet at B, and discharge at C, has a butterfly valve D in its
latter end. Below the U-shaped bend, is a reservoir E to contain a
float F, vertically-movable around a central stem G which is part of
and projects down from the U-shaped tube.
Fig. 83. Surface Carbureter.

Through this stem G is a duct H, the lower end of which
communicates with the gasoline reservoir, or float chamber, and the
upper end has a small orifice leading to the U-shaped tube. A valve
stem I is adapted to regulate the inflow of gasoline through the
duct.
The Gasoline Inlet.—At one side of the reservoir is an extension J,
within which is a vertically disposed needle valve K, seated in the
duct I, by way of which gasoline is admitted. A lever M, pivoted at N,
has one end attached to the float F, and the other end is in
engagement with the needle valve K.
The float is so arranged as to permit the gasoline to flow up into
the U-shaped tube A, and form a small pool of the fuel before it
closes the needle valve K.
Securing Surface for Air Contact.—Directly above the oil inlet duct
H, the U-shaped tube is contracted by a downwardly-projecting wall
P, the object being to compel all the passing air to intimately come
into contact with the gasoline pool, and thus take up as much vapor
as possible.
In this arrangement the suction of the engine does not draw up
the gasoline from the reservoir, but all the energy is expended in
moving air through the tube, and past the contracted throat.
In starting the engine the float is momentarily depressed by the
pin Q, and a drain duct R is provided to prevent flooding of the tube
A.
CHAPTER XII
IGNITION SYSTEMS

The universal use of electricity as a means of igniting the fuel in
gasoline motors, makes it necessary that the novice should know
something of the fundamentals of the science.
Seeing the Effect of Electricity.—While it is impossible to see a
current, there are certain mechanical devices which enable it to be
seen by the effects produced on them. One of these devices is the
armature which, if placed across the poles of a horseshoe magnet,
will adhere to the magnet, by means of its magnetic pull.
Another exhibition is the spark caused by separating the contact
points of a conductor through which a current is flowing.
Action of a Current.—The current flowing over a wire acts
substantially the same as water flowing through a pipe, that is, the
quantity is dependent on the size of the wire, just as in water where
the diameter of the pipe determines the flow.
Amperes and Volts.—Water may flow sluggishly through a pipe, or
be forced through with great violence. So with an electric current.
Pressure, therefore, expresses the second similarity in the two
mediums.
The quantity of flow in an electric current is called amperes and
the pressure is designated as volts.
Conductivity.—All metals conduct a current with greater or less
facility. Silver is the best conductor, followed by copper. German
silver offers a great resistance, and many alloys offer greater or less
opposition to the flow.
Resistance.—The length of a wire also serves to check the flow, and
this may be overcome by enlarging the size of the wire, or by
increasing the pressure, or voltage.
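The relation between length, size of wire and pressure may be put in figures. The sketch below assumes the usual formula for a uniform wire, resistance = resistivity × length ÷ cross-section, together with Ohm's law. The function names, the wire size and the 6-volt pressure are chosen only for illustration, and the resistivities are approximate handbook values.

```python
# Approximate resistivities in ohm-metres at room temperature (assumed values).
RESISTIVITY = {"silver": 1.59e-8, "copper": 1.68e-8}


def wire_resistance(material, length_m, area_m2):
    """Resistance of a uniform wire: R = resistivity * length / area."""
    return RESISTIVITY[material] * length_m / area_m2


def current(volts, ohms):
    """Ohm's law: amperes = volts / ohms."""
    return volts / ohms


if __name__ == "__main__":
    area = 1.0e-6  # one square millimetre, a hypothetical wire size
    short = wire_resistance("copper", 10.0, area)
    long_ = wire_resistance("copper", 20.0, area)
    thick = wire_resistance("copper", 20.0, 2 * area)
    print("10 m wire, 6 volts:", current(6.0, short), "amperes")
    print("20 m wire, 6 volts:", current(6.0, long_), "amperes")
    print("20 m wire of double section, 6 volts:", current(6.0, thick), "amperes")
    print("20 m wire, 12 volts:", current(12.0, long_), "amperes")
```

Doubling the length halves the flow. Either doubling the cross-section of the wire or doubling the voltage restores it.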
Generating Electricity.—A current may be generated by a dynamo,
or by means of cells. The dynamo derives its motion from an engine,
which turns, what is called, the armature past a number of magnets,
called the field. The armature contains a series of wire wrappings,
extending around from end to end, and the field is composed of
metallic heads, each carrying a coil.
Magnetic Field.—When these coils have a current flowing through
them the heads become magnetized, and have what is called a
magnetic field surrounding them and extending out some distance,
and the armature coils pass through these magnetic fields.
As these wires cut the lines of force in the magnetic fields, a
current is set up in the armature, and as the armature windings are
connected up with the lead and the return wires which transmit the
current, it will be seen that the strength, or pressure of the current
depends on the speed of the armature movement.
Batteries.—The other method of generating a current is to use a
jar of electrolyte, a liquid which may be either an acid or a salt
solution. If certain metals which are opposite to each other, are
placed in this solution, a chemical action takes place, which results in
producing current, and this may be shown by connecting together
the two metals by a wire outside of the jar.
Metallic Couples.—Within the jar the solution serves as the
conductor between the two metals. Copper and zinc are two good
metal couples, in which zinc is the positive, and copper the negative.
As zinc is readily eaten away by the action of the electrolyte, carbon
is used instead.
What Determines Voltage.—Each cell with the two metals, will
furnish approximately two volts. It is immaterial whether the cell
contains a pint or a gallon of liquid, or what the size of the plates
may be. In any event the pressure will not be greater than two volts.
Controlling Amperage.—But the metal plates may be made very
large, or have a great surface in each cell. The greater the surface
the greater the amperage, so that while each cell has only two volts,
it may have a very small amperage, or it may have two, five, ten, or
even more amperes flowing therefrom.
Dry Batteries.—Instead of using cells with liquid in them, as the
electrolyte, a dry cell is made which acts efficiently. This is usually
made in the form of a zinc cup, within which is centrally held a
carbon rod, and the space around the rod is filled with ground
carbon and dioxide of manganese, and moistened with sal
ammoniac.
Cell Construction.—The zinc cell and the carbon have upwardly-
projecting posts to which the wires are attached, and when thus
made the top of the cup is closed with pitch, or some suitable
preparation to prevent evaporation and to retain the substances
within, and the whole is then inclosed in a jacket, usually of
pasteboard.
Usually these cells give one and a half volts, and are very durable.
This is, of course, a very low voltage, and it is necessary, for this
reason, to use at least a half dozen, to operate the coil used in an
ignition system.
Connecting Up Cells.—If we have a number of cells they can be
connected with each other so as to get an additional voltage as well
as greater amperage. This statement must be understood in a
definite way. Supposing we have six cells, each with an output of
1-1/2 volts, and an ampere flow of 25 in each. Together the six cells
give 1-1/2 × 6 = 9 volts, and multiplying 9 volts by 25 amperes
makes 225 watts.

Fig. 84. Series Wiring.

We may connect up the six cells in such a way that we can get
First: 9 volts, and 25 amperes, equal to 225 watts, or,
Second: 1-1/2 volts and 150 amperes, equal to 225 watts, or,
Third: 4-1/2 volts and 50 amperes, also equal to 225 watts.
In each case, you will see, we have 225 watts. These three
windings are designated as series, parallel, and series multiple.
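The arithmetic of the three arrangements can be checked with a short sketch. It assumes ideal, identical cells, in which voltages add along a series string and amperages add across parallel strings. The function name and its parameters are illustrative only.

```python
def battery_output(cell_volts, cell_amps, cells_in_series, strings_in_parallel):
    """Return (volts, amperes, watts) for a bank of identical, ideal cells.

    Voltage adds along each series string; amperage adds across the
    parallel strings. Internal resistance is ignored (an assumption).
    """
    volts = cell_volts * cells_in_series
    amps = cell_amps * strings_in_parallel
    return volts, amps, volts * amps


if __name__ == "__main__":
    arrangements = [
        ("series", 6, 1),           # six cells end to end
        ("parallel", 1, 6),         # six cells side by side
        ("series multiple", 3, 2),  # two strings of three cells each
    ]
    for name, in_series, in_parallel in arrangements:
        volts, amps, watts = battery_output(1.5, 25, in_series, in_parallel)
        print(f"{name}: {volts} volts, {amps} amperes, {watts} watts")
```

Whatever the arrangement, the number of cells is six and the product of volts and amperes is unchanged.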
The Series Connection.—The illustration, Fig. 84, shows the series
winding. Here the positive wire B is connected with the carbon pole
C, and the wire D, wired up with the zinc pole, E, the connections
being made directly through each cell, to the outlet wire F. Now, as
we have six cells, the combined voltage is 1-1/2 × 6 = 9 volts.
As, however, all the cells now act as one cell, the amperage is just
the same as of one cell, namely, 25.

Fig. 85. Parallel Wiring.

The Parallel Connection.—Fig. 85 shows the parallel connection.
Here all the carbon terminals A are connected together by a
wire B, and all the zinc terminals C by a wire D. In this method the
voltage of the battery is the same as that of a single cell, but the
amperage is the same as that of a single cell multiplied by the
number of cells, namely, 25 amperes × 6.
Series Multiple Connection.—The series multiple, Fig. 86, is so
arranged as to form two distinct batteries, 1 and 2. Each battery is
connected up in series, by means of the wires A, which join the
carbon and zinc. In this way we have at one end a pair of carbon
terminals which are joined by a wire B, and at the other end a pair
of zinc terminals, joined by a wire C.