Author(s): Daniel T. Larose
ISBN(s): 9780471666578, 0471666572
Edition: 1
Year: 2004
Language: english
DISCOVERING
KNOWLEDGE IN DATA
An Introduction to Data Mining

DANIEL T. LAROSE
Director of Data Mining
Central Connecticut State University

A JOHN WILEY & SONS, INC., PUBLICATION



Copyright ©2005 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.


Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400,
fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken,
NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department
within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:


Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose
p. cm.
Includes bibliographical references and index.
ISBN 0-471-66657-2 (cloth)
1. Data mining. I. Title.
QA76.9.D343L38 2005
006.3 12—dc22 2004003680

Printed in the United States of America


10 9 8 7 6 5 4 3 2 1
Dedication

To my parents,
And their parents,
And so on...

For my children,
And their children,
And so on...

2004 Chantal Larose


CONTENTS

PREFACE xi

1 INTRODUCTION TO DATA MINING 1

What Is Data Mining? 2


Why Data Mining? 4
Need for Human Direction of Data Mining 4
Cross-Industry Standard Process: CRISP–DM 5
Case Study 1: Analyzing Automobile Warranty Claims: Example of the
CRISP–DM Industry Standard Process in Action 8
Fallacies of Data Mining 10
What Tasks Can Data Mining Accomplish? 11
Description 11
Estimation 12
Prediction 13
Classification 14
Clustering 16
Association 17
Case Study 2: Predicting Abnormal Stock Market Returns Using
Neural Networks 18
Case Study 3: Mining Association Rules from Legal Databases 19
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees 21
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis 23
References 24
Exercises 25

2 DATA PREPROCESSING 27

Why Do We Need to Preprocess the Data? 27


Data Cleaning 28
Handling Missing Data 30
Identifying Misclassifications 33
Graphical Methods for Identifying Outliers 34
Data Transformation 35
Min–Max Normalization 36
Z-Score Standardization 37
Numerical Methods for Identifying Outliers 38
References 39
Exercises 39

3 EXPLORATORY DATA ANALYSIS 41

Hypothesis Testing versus Exploratory Data Analysis 41


Getting to Know the Data Set 42
Dealing with Correlated Variables 44
Exploring Categorical Variables 45
Using EDA to Uncover Anomalous Fields 50
Exploring Numerical Variables 52
Exploring Multivariate Relationships 59
Selecting Interesting Subsets of the Data for Further Investigation 61
Binning 62
Summary 63
References 64
Exercises 64

4 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION 67

Data Mining Tasks in Discovering Knowledge in Data 67


Statistical Approaches to Estimation and Prediction 68
Univariate Methods: Measures of Center and Spread 69
Statistical Inference 71
How Confident Are We in Our Estimates? 73
Confidence Interval Estimation 73
Bivariate Methods: Simple Linear Regression 75
Dangers of Extrapolation 79
Confidence Intervals for the Mean Value of y Given x 80
Prediction Intervals for a Randomly Chosen Value of y Given x 80
Multiple Regression 83
Verifying Model Assumptions 85
References 88
Exercises 88

5 k-NEAREST NEIGHBOR ALGORITHM 90

Supervised versus Unsupervised Methods 90


Methodology for Supervised Modeling 91
Bias–Variance Trade-Off 93
Classification Task 95
k-Nearest Neighbor Algorithm 96
Distance Function 99
Combination Function 101
Simple Unweighted Voting 101
Weighted Voting 102
Quantifying Attribute Relevance: Stretching the Axes 103
Database Considerations 104
k-Nearest Neighbor Algorithm for Estimation and Prediction 104
Choosing k 105
Reference 106
Exercises 106

6 DECISION TREES 107

Classification and Regression Trees 109


C4.5 Algorithm 116
Decision Rules 121
Comparison of the C5.0 and CART Algorithms Applied to Real Data 122
References 126
Exercises 126

7 NEURAL NETWORKS 128

Input and Output Encoding 129


Neural Networks for Estimation and Prediction 131
Simple Example of a Neural Network 131
Sigmoid Activation Function 134
Back-Propagation 135
Gradient Descent Method 135
Back-Propagation Rules 136
Example of Back-Propagation 137
Termination Criteria 139
Learning Rate 139
Momentum Term 140
Sensitivity Analysis 142
Application of Neural Network Modeling 143
References 145
Exercises 145

8 HIERARCHICAL AND k-MEANS CLUSTERING 147

Clustering Task 147


Hierarchical Clustering Methods 149
Single-Linkage Clustering 150
Complete-Linkage Clustering 151
k-Means Clustering 153
Example of k-Means Clustering at Work 153
Application of k-Means Clustering Using SAS Enterprise Miner 158
Using Cluster Membership to Predict Churn 161
References 161
Exercises 162

9 KOHONEN NETWORKS 163

Self-Organizing Maps 163


Kohonen Networks 165
Example of a Kohonen Network Study 166
Cluster Validity 170
Application of Clustering Using Kohonen Networks 170
Interpreting the Clusters 171
Cluster Profiles 175
Using Cluster Membership as Input to Downstream Data Mining Models 177


References 178
Exercises 178

10 ASSOCIATION RULES 180

Affinity Analysis and Market Basket Analysis 180


Data Representation for Market Basket Analysis 182
Support, Confidence, Frequent Itemsets, and the A Priori Property 183
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets 185
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules 186
Extension from Flag Data to General Categorical Data 189
Information-Theoretic Approach: Generalized Rule Induction Method 190
J-Measure 190
Application of Generalized Rule Induction 191
When Not to Use Association Rules 193
Do Association Rules Represent Supervised or Unsupervised Learning? 196
Local Patterns versus Global Models 197
References 198
Exercises 198

11 MODEL EVALUATION TECHNIQUES 200

Model Evaluation Techniques for the Description Task 201


Model Evaluation Techniques for the Estimation and Prediction Tasks 201
Model Evaluation Techniques for the Classification Task 203
Error Rate, False Positives, and False Negatives 203
Misclassification Cost Adjustment to Reflect Real-World Concerns 205
Decision Cost/Benefit Analysis 207
Lift Charts and Gains Charts 208
Interweaving Model Evaluation with Model Building 211
Confluence of Results: Applying a Suite of Models 212
Reference 213
Exercises 213

EPILOGUE: “WE’VE ONLY JUST BEGUN” 215

INDEX 217
PREFACE

WHAT IS DATA MINING?

Data mining is predicted to be “one of the most revolutionary developments of the
next decade,” according to the online technology magazine ZDNET News (February 8,
2001). In fact, the MIT Technology Review chose data mining as one of ten emerging
technologies that will change the world. According to the Gartner Group, “Data min-
ing is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.”
Because data mining represents such an important field, Wiley-Interscience and
Dr. Daniel T. Larose have teamed up to publish a series of volumes on data mining,
consisting initially of three volumes. The first volume in the series, Discovering
Knowledge in Data: An Introduction to Data Mining, introduces the reader to this
rapidly growing field of data mining.

WHY IS THIS BOOK NEEDED?

Human beings are inundated with data in most fields. Unfortunately, these valuable
data, which cost firms millions to collect and collate, are languishing in warehouses
and repositories. The problem is that not enough trained human analysts are available
who are skilled at translating all of the data into knowledge, and thence up the
taxonomy tree into wisdom. This is why this book is needed; it provides readers with:
• Models and techniques to uncover hidden nuggets of information
• Insight into how data mining algorithms work
• The experience of actually performing data mining on large data sets

Data mining is becoming more widespread every day, because it empowers
companies to uncover profitable patterns and trends from their existing databases.
Companies and institutions have spent millions of dollars to collect megabytes and
terabytes of data but are not taking advantage of the valuable and actionable infor-
mation hidden deep within their data repositories. However, as the practice of data
mining becomes more widespread, companies that do not apply these techniques
are in danger of falling behind and losing market share, because their competitors
are using data mining and are thereby gaining the competitive edge. In Discovering
Knowledge in Data, the step-by-step hands-on solutions of real-world business prob-
lems using widely available data mining techniques applied to real-world data sets
will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of
the latest methods for enhancing return on investment.

DANGER! DATA MINING IS EASY TO DO BADLY

The plethora of new off-the-shelf software platforms for performing data mining has
kindled a new kind of danger. The ease with which these GUI-based applications
can manipulate data, combined with the power of the formidable data mining algo-
rithms embedded in the black-box software currently available, makes their misuse
proportionally more hazardous.
Just as with any new information technology, data mining is easy to do badly. A
little knowledge is especially dangerous when it comes to applying powerful models
based on large data sets. For example, analyses carried out on unpreprocessed data
can lead to erroneous conclusions, or inappropriate analysis may be applied to data
sets that call for a completely different approach, or models may be derived that are
built upon wholly specious assumptions. If deployed, these errors in analysis can lead
to very expensive failures.

“WHITE BOX” APPROACH: UNDERSTANDING THE
UNDERLYING ALGORITHMIC AND MODEL STRUCTURES

The best way to avoid these costly errors, which stem from a blind black-box approach
to data mining, is to apply instead a “white-box” methodology, which emphasizes
an understanding of the algorithmic and statistical model structures underlying the
software. Discovering Knowledge in Data applies this white-box approach by:
• Walking the reader through the various algorithms
• Providing examples of the operation of the algorithm on actual large data sets
• Testing the reader’s level of understanding of the concepts and algorithms
• Providing an opportunity for the reader to do some real data mining on large
data sets

Algorithm Walk-Throughs
Discovering Knowledge in Data walks the reader through the operations and nuances
of the various algorithms, using small-sample data sets, so that the reader gets a
true appreciation of what is really going on inside the algorithm. For example, in
Chapter 8, we see the cluster centers being updated, moving toward the
center of their respective clusters. Also, in Chapter 9 we see just which type of network
weights will result in a particular network node “winning” a particular record.

Applications of the Algorithms to Large Data Sets


Discovering Knowledge in Data provides examples of the application of various
algorithms on actual large data sets. For example, in Chapter 7 a classification problem
is attacked using a neural network model on a real-world data set. The resulting
neural network topology is examined along with the network connection weights, as
reported by the software. These data sets are included at the book series Web site, so
that readers may follow the analytical steps on their own, using data mining software
of their choice.

Chapter Exercises: Checking to Make Sure That You Understand It


Discovering Knowledge in Data includes over 90 chapter exercises, which allow
readers to assess their depth of understanding of the material, as well as to have a
little fun playing with numbers and data. These include conceptual exercises, which
help to clarify some of the more challenging concepts in data mining, and “tiny
data set” exercises, which challenge the reader to apply the particular data mining
algorithm to a small data set and, step by step, to arrive at a computationally sound
solution. For example, in Chapter 6 readers are provided with a small data set and
asked to construct by hand, using the methods shown in the chapter, a C4.5 decision
tree model, as well as a classification and regression tree model, and to compare the
benefits and drawbacks of each.

Hands-on Analysis: Learn Data Mining by Doing Data Mining


Chapters 2 to 4 and 6 to 11 provide the reader with hands-on analysis problems,
representing an opportunity for the reader to apply his or her newly acquired data
mining expertise to solving real problems using large data sets. Many people learn
by doing. Discovering Knowledge in Data provides a framework by which the reader
can learn data mining by doing data mining. The intention is to mirror the real-world
data mining scenario. In the real world, dirty data sets need cleaning; raw data needs
to be normalized; outliers need to be checked. So it is with Discovering Knowledge in
Data, where over 70 hands-on analysis problems are provided. In this way, the reader
can “ramp up” quickly and be “up and running” his or her own data mining analyses
relatively shortly.
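
The preprocessing steps mentioned above (normalizing raw data and checking for outliers) can be illustrated with a minimal Python sketch. It is not taken from the book: the function names, the sample values, and the rule of thumb of flagging values whose z-score exceeds 3 in absolute value are assumptions made for illustration only.

```python
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Standardize values as (x - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("standard deviation is zero")
    return [(v - m) / s for v in values]


def flag_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold (assumed rule of thumb)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical field values, with one suspicious entry at the end.
minutes = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 10, 100]

scaled = min_max_normalize(minutes)   # smallest value maps to 0.0, largest to 1.0
suspects = flag_outliers(minutes)     # the value 100 is flagged
```

A flagged value is only a candidate for inspection; whether it is an error or a genuine extreme observation is a judgment for the analyst. Note also that with the sample standard deviation, a z-score above 3 cannot occur in very small samples, so the threshold is a convention rather than a fixed rule.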
For example, in Chapter 10 readers are challenged to uncover high-confidence,
high-support rules for predicting which customer will be leaving a company’s service.
In Chapter 11 readers are asked to produce lift charts and gains charts for a set of
classification models using a large data set, so that the best model may be identified.
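
To make "high-confidence, high-support rules" concrete, here is a minimal Python sketch of the two measures as they are commonly defined: the support of a rule A ⇒ B is the proportion of all records that contain both A and B, and the confidence is the proportion of records containing A that also contain B. The function name, the attribute names, and the five customer records are hypothetical and are not drawn from the book's data sets.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    Each record is a set of items (here, attributes that hold for a customer).
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)

    n_records = 0
    n_antecedent = 0
    n_both = 0
    for record in records:
        items = set(record)
        n_records += 1
        if antecedent <= items:
            n_antecedent += 1
            if both <= items:
                n_both += 1

    if n_records == 0:
        raise ValueError("no records supplied")

    support = n_both / n_records
    # Confidence is undefined when the antecedent never occurs; report 0.0 here.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical customer records; "churn" marks a customer who left the service.
records = [
    {"intl_plan", "churn"},
    {"intl_plan", "churn"},
    {"intl_plan"},
    {"voicemail_plan"},
    {"voicemail_plan", "churn"},
]

support, confidence = support_and_confidence(records, {"intl_plan"}, {"churn"})
```

In this toy data the rule "intl_plan ⇒ churn" holds in two of five records (support 0.4) and in two of the three records that contain the antecedent (confidence 2/3).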

DATA MINING AS A PROCESS

One of the fallacies associated with data mining implementation is that data mining
somehow represents an isolated set of tools, to be applied by some aloof analysis
department, and is related only inconsequentially to the mainstream business or re-
search endeavor. Organizations that attempt to implement data mining in this way
will see their chances of success greatly reduced. This is because data mining should
be viewed as a process.
Discovering Knowledge in Data presents data mining as a well-structured
standard process, intimately connected with managers, decision makers, and those
involved in deploying the results. Thus, this book is not only for analysts but also for
managers, who need to be able to communicate in the language of data mining. The
particular standard process used is the CRISP–DM framework: the Cross-Industry
Standard Process for Data Mining. CRISP–DM demands that data mining be seen
as an entire process, from communication of the business problem through data col-
lection and management, data preprocessing, model building, model evaluation, and
finally, model deployment. Therefore, this book is not only for analysts and man-
agers but also for data management professionals, database analysts, and decision
makers.

GRAPHICAL APPROACH, EMPHASIZING EXPLORATORY
DATA ANALYSIS

Discovering Knowledge in Data emphasizes a graphical approach to data analysis.
There are more than 80 screen shots of actual computer output throughout the book,
and over 30 other figures. Exploratory data analysis (EDA) represents an interesting
and exciting way to “feel your way” through large data sets. Using graphical and
numerical summaries, the analyst gradually sheds light on the complex relationships
hidden within the data. Discovering Knowledge in Data emphasizes an EDA approach
to data mining, which goes hand in hand with the overall graphical approach.

HOW THE BOOK IS STRUCTURED

Discovering Knowledge in Data provides a comprehensive introduction to the field.
Case studies are provided showing how data mining has been utilized successfully
(and not so successfully). Common myths about data mining are debunked, and
common pitfalls are flagged, so that new data miners do not have to learn these
lessons themselves.
The first three chapters introduce and follow the CRISP–DM standard process,
especially the data preparation phase and data understanding phase. The next seven
chapters represent the heart of the book and are associated with the CRISP–DM
modeling phase. Each chapter presents data mining methods and techniques for a
specific data mining task.
• Chapters 5, 6, and 7 relate to the classification task, examining the k-nearest
neighbor (Chapter 5), decision tree (Chapter 6), and neural network (Chapter
7) algorithms.
• Chapters 8 and 9 investigate the clustering task, with hierarchical and k-means
clustering (Chapter 8) and Kohonen network (Chapter 9) algorithms.
• Chapter 10 handles the association task, examining association rules through
the a priori and GRI algorithms.
• Finally, Chapter 11 covers model evaluation techniques, which belong to the
CRISP–DM evaluation phase.

DISCOVERING KNOWLEDGE IN DATA AS A TEXTBOOK

Discovering Knowledge in Data naturally fits the role of textbook for an introductory
course in data mining. Instructors may appreciate:
• The presentation of data mining as a process
• The “white-box” approach, emphasizing an understanding of the underlying
algorithmic structures:
  ◦ algorithm walk-throughs
  ◦ application of the algorithms to large data sets
  ◦ chapter exercises
  ◦ hands-on analysis
• The graphical approach, emphasizing exploratory data analysis
• The logical presentation, flowing naturally from the CRISP–DM standard pro-
cess and the set of data mining tasks
Discovering Knowledge in Data is appropriate for advanced undergraduate
or graduate courses. Except for one section in Chapter 7, no calculus is required.
An introductory statistics course would be nice but is not required. No computer
programming or database expertise is required.

ACKNOWLEDGMENTS

Discovering Knowledge in Data would have remained unwritten without the assis-
tance of Val Moliere, editor, Kirsten Rohsted, editorial program coordinator, and
Rosalyn Farkas, production editor, at Wiley-Interscience and Barbara Zeiders, who
copyedited the work. Thank you for your guidance and perseverance.
I wish also to thank Dr. Chun Jin and Dr. Daniel S. Miller, my colleagues in the
Master of Science in Data Mining program at Central Connecticut State University;
Dr. Timothy Craine, the chair of the Department of Mathematical Sciences; Dr. Dipak
K. Dey, chair of the Department of Statistics at the University of Connecticut; and
Dr. John Judge, chair of the Department of Mathematics at Westfield State College.
Your support was (and is) invaluable.
Thanks to my children, Chantal, Tristan, and Ravel, for sharing the computer
with me. Finally, I would like to thank my wonderful wife, Debra J. Larose, for her
patience, understanding, and proofreading skills. But words cannot express. . . .

Daniel T. Larose, Ph.D.


Director, Data Mining @CCSU
www.ccsu.edu/datamining
CHAPTER 1
INTRODUCTION TO
DATA MINING

WHAT IS DATA MINING?


WHY DATA MINING?
NEED FOR HUMAN DIRECTION OF DATA MINING
CROSS-INDUSTRY STANDARD PROCESS: CRISP–DM
CASE STUDY 1: ANALYZING AUTOMOBILE WARRANTY CLAIMS: EXAMPLE
OF THE CRISP–DM INDUSTRY STANDARD PROCESS IN ACTION
FALLACIES OF DATA MINING
WHAT TASKS CAN DATA MINING ACCOMPLISH?
CASE STUDY 2: PREDICTING ABNORMAL STOCK MARKET RETURNS USING
NEURAL NETWORKS
CASE STUDY 3: MINING ASSOCIATION RULES FROM LEGAL DATABASES
CASE STUDY 4: PREDICTING CORPORATE BANKRUPTCIES USING
DECISION TREES
CASE STUDY 5: PROFILING THE TOURISM MARKET USING k-MEANS
CLUSTERING ANALYSIS

About 13 million customers per month contact the West Coast customer service
call center of the Bank of America, as reported by CIO Magazine’s cover story
on data mining in May 1998 [1]. In the past, each caller would have listened to
the same marketing advertisement, whether or not it was relevant to the caller’s
interests. However, “rather than pitch the product of the week, we want to be as
relevant as possible to each customer,” states Chris Kelly, vice president and director
of database marketing at Bank of America in San Francisco. Thus, Bank of America’s
customer service representatives have access to individual customer profiles, so that
the customer can be informed of new products or services that may be of greatest
interest to him or her. Data mining helps to identify the type of marketing approach
for a particular customer, based on the customer’s individual profile.
Former President Bill Clinton, in his November 6, 2002 address to the Demo-
cratic Leadership Council [2], mentioned that not long after the events of September
11, 2001, FBI agents examined great amounts of consumer data and found that five
of the terrorist perpetrators were in the database. One of the terrorists possessed
30 credit cards with a combined balance totaling $250,000 and had been in the country
for less than two years. The terrorist ringleader, Mohammed Atta, had 12 different
addresses: two real homes and 10 safe houses. Clinton concluded that we should
proactively search through this type of data and that “if somebody has been here a
couple years or less and they have 12 homes, they’re either really rich or up to no
good. It shouldn’t be that hard to figure out which.”
Brain tumors represent the most deadly cancer among children, with nearly
3000 cases diagnosed per year in the United States, nearly half of which are fatal.
Eric Bremer [3], director of brain tumor research at Children’s Memorial Hospital
in Chicago, has set the goal of building a gene expression database for pediatric
brain tumors, in an effort to develop more effective treatment. As one of the first
steps in tumor identification, Bremer uses the Clementine data mining software suite,
published by SPSS, Inc., to classify the tumor into one of 12 or so salient types. As
we shall learn in Chapter 5, classification is one of the most important data mining
tasks.
These stories are examples of data mining.

WHAT IS DATA MINING?

According to the Gartner Group [4], “Data mining is the process of discovering
meaningful new correlations, patterns and trends by sifting through large amounts of
data stored in repositories, using pattern recognition technologies as well as statistical
and mathematical techniques.” There are other definitions:
• “Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are
both understandable and useful to the data owner” (Hand et al. [5]).
• “Data mining is an interdisciplinary field bringing together techniques from
machine learning, pattern recognition, statistics, databases, and visualization to
address the issue of information extraction from large data bases” (Evangelos
Simoudis in Cabena et al. [6]).

Data mining is predicted to be “one of the most revolutionary developments
of the next decade,” according to the online technology magazine ZDNET News [7].
In fact, the MIT Technology Review [8] chose data mining as one of 10 emerging
technologies that will change the world. “Data mining expertise is the most sought
after . . .” among information technology professionals, according to the 1999 Infor-
mation Week National Salary Survey [9]. The survey reports: “Data mining skills
are in high demand this year, as organizations increasingly put data repositories
online. Effectively analyzing information from customers, partners, and suppliers
has become important to more companies. ‘Many companies have implemented a
data warehouse strategy and are now starting to look at what they can do with all that
data,’ says Dudley Brown, managing partner of BridgeGate LLC, a recruiting firm in
Irvine, Calif.”
How widespread is data mining? Which industries are moving into this area?
Actually, the use of data mining is pervasive, extending into some surprising areas.
Consider the following employment advertisement [10]:

STATISTICS INTERN: SEPTEMBER–DECEMBER 2003

Work with Basketball Operations

Responsibilities include:
• Compiling and converting data into format for use in statistical models
• Developing statistical forecasting models using regression, logistic regression, data
mining, etc.
• Using statistical packages such as Minitab, SPSS, XLMiner
Experience in developing statistical models a differentiator, but not required.

Candidates who have completed advanced statistics coursework with a strong knowledge
of basketball and the love of the game should forward your résumé and cover letter to:

Boston Celtics
Director of Human Resources
151 Merrimac Street
Boston, MA 02114

Yes, the Boston Celtics are looking for a data miner. Perhaps the Celtics’ data
miner is needed to keep up with the New York Knicks, who are using IBM’s Advanced
Scout data mining software [11]. Advanced Scout, developed by a team led by Inder-
pal Bhandari, is designed to detect patterns in data. A big basketball fan, Bhandari
approached the New York Knicks, who agreed to try it out. The software depends on
the data kept by the National Basketball Association, in the form of “events” in every
game, such as baskets, shots, passes, rebounds, double-teaming, and so on. As it turns
out, the data mining uncovered a pattern that the coaching staff had evidently missed.
When the Chicago Bulls double-teamed Knicks’ center Patrick Ewing, the Knicks’
shooting percentage was extremely low, even though double-teaming should open up
an opportunity for a teammate to shoot. Based on this information, the coaching staff
was able to develop strategies for dealing with the double-teaming situation. Later,
16 of the 29 NBA teams also turned to Advanced Scout to mine the play-by-play
data.
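
The kind of pattern described above, a shooting percentage that differs sharply depending on whether the center was double-teamed, amounts to comparing a rate across conditions. The following Python sketch shows that comparison on a handful of invented events. It makes no claim about how Advanced Scout works internally; the event format, the function name, and the data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical play-by-play events: (center_double_teamed, shot_made).
events = [
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, True), (False, False), (False, True),
]


def shooting_pct_by_condition(events):
    """Return the fraction of shots made for each value of the condition."""
    attempts = defaultdict(int)
    made = defaultdict(int)
    for condition, shot_made in events:
        attempts[condition] += 1
        made[condition] += int(shot_made)
    return {condition: made[condition] / attempts[condition] for condition in attempts}


pct = shooting_pct_by_condition(events)
# In this invented sample, pct[True] is 0.25 and pct[False] is 0.75.
```

With so few events such a gap could easily be chance; on real play-by-play data an analyst would want many more observations before drawing a conclusion.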

WHY DATA MINING?

While waiting in line at a large supermarket, have you ever just closed your eyes and
listened? What do you hear, apart from the kids pleading for candy bars? You might
hear the beep, beep, beep of the supermarket scanners, reading the bar codes on the
grocery items, ringing up on the register, and storing the data on servers located at
the supermarket headquarters. Each beep indicates a new row in the database, a new
“observation” in the information being collected about the shopping habits of your
family and the other families who are checking out.
Clearly, a lot of data is being collected. However, what is being learned from
all this data? What knowledge are we gaining from all this information? Probably,
depending on the supermarket, not much. As early as 1984, in his book Megatrends
[12], John Naisbitt observed that “we are drowning in information but starved for
knowledge.” The problem today is not that there is not enough data and information
streaming in. We are, in fact, inundated with data in most fields. Rather, the problem
is that there are not enough trained human analysts available who are skilled at
translating all of this data into knowledge, and thence up the taxonomy tree into
wisdom.
The ongoing remarkable growth in the field of data mining and knowledge
discovery has been fueled by a fortunate confluence of a variety of factors:
• The explosive growth in data collection, as exemplified by the supermarket
  scanners above
• The storing of the data in data warehouses, so that the entire enterprise has
  access to a reliable current database
• The availability of increased access to data from Web navigation and intranets
• The competitive pressure to increase market share in a globalized economy
• The development of off-the-shelf commercial data mining software suites
• The tremendous growth in computing power and storage capacity

NEED FOR HUMAN DIRECTION OF DATA MINING

Many software vendors market their analytical software as being plug-and-play out-
of-the-box applications that will provide solutions to otherwise intractable problems
without the need for human supervision or interaction. Some early definitions of data
mining followed this focus on automation. For example, Berry and Linoff, in their
book Data Mining Techniques for Marketing, Sales and Customer Support [13], gave
the following definition for data mining: “Data mining is the process of exploration
and analysis, by automatic or semi-automatic means, of large quantities of data in
order to discover meaningful patterns and rules” (emphasis added). Three years later,
in their sequel, Mastering Data Mining [14], the authors revisit their definition of
data mining and state: “If there is anything we regret, it is the phrase ‘by automatic
or semi-automatic means’ . . . because we feel there has come to be too much focus
on the automatic techniques and not enough on the exploration and analysis. This has
misled many people into believing that data mining is a product that can be bought
rather than a discipline that must be mastered.”
Very well stated! Automation is no substitute for human input. As we shall
learn shortly, humans need to be actively involved at every phase of the data mining
process. Georges Grinstein of the University of Massachusetts at Lowell and AnVil,
Inc., stated it like this [15]:

Imagine a black box capable of answering any question it is asked. Any question. Will
this eliminate our need for human participation as many suggest? Quite the opposite.
The fundamental problem still comes down to a human interface issue. How do I phrase
the question correctly? How do I set up the parameters to get a solution that is applicable
in the particular case I am interested in? How do I get the results in reasonable time
and in a form that I can understand? Note that all the questions connect the discovery
process to me, for my human consumption.

Rather than asking where humans fit into data mining, we should instead inquire about
how we may design data mining into the very human process of problem solving.
Further, the very power of the formidable data mining algorithms embedded in
the black-box software currently available makes their misuse proportionally more
dangerous. Just as with any new information technology, data mining is easy to
do badly. Researchers may apply inappropriate analysis to data sets that call for a
completely different approach, for example, or models may be derived that are built
upon wholly specious assumptions. Therefore, an understanding of the statistical and
mathematical model structures underlying the software is required.

CROSS-INDUSTRY STANDARD PROCESS: CRISP–DM

There is a temptation in some companies, due to departmental inertia and
compartmentalization, to approach data mining haphazardly, to reinvent the wheel and
duplicate effort. A cross-industry standard was clearly required that is industry-neutral,
tool-neutral, and application-neutral. The Cross-Industry Standard Process
for Data Mining (CRISP–DM) [16] was developed in 1996 by analysts representing
DaimlerChrysler, SPSS, and NCR. CRISP provides a nonproprietary and freely avail-
able standard process for fitting data mining into the general problem-solving strategy
of a business or research unit.
According to CRISP–DM, a given data mining project has a life cycle consisting
of six phases, as illustrated in Figure 1.1. Note that the phase sequence is adaptive.
That is, the next phase in the sequence often depends on the outcomes associated
with the preceding phase. The most significant dependencies between phases are
indicated by the arrows. For example, suppose that we are in the modeling phase.
Depending on the behavior and characteristics of the model, we may have to return to
the data preparation phase for further refinement before moving forward to the model
evaluation phase.
The iterative nature of CRISP is symbolized by the outer circle in Figure 1.1.
Often, the solution to a particular business or research problem leads to further ques-
tions of interest, which may then be attacked using the same general process as before.

Figure 1.1 CRISP–DM is an iterative, adaptive process. (Diagram of the six phases arranged in a cycle: Business/Research Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.)

Lessons learned from past projects should always be brought to bear as input into
new projects. Although issues encountered during the evaluation phase can conceivably
send the analyst back to any of the previous phases for amelioration, for simplicity we
show only the most common loop, back to the modeling phase. Following is an outline
of each phase.

CRISP–DM: The Six Phases


1. Business understanding phase. The first phase in the CRISP–DM standard
process may also be termed the research understanding phase.
a. Enunciate the project objectives and requirements clearly in terms of the
business or research unit as a whole.
b. Translate these goals and restrictions into the formulation of a data mining
problem definition.
c. Prepare a preliminary strategy for achieving these objectives.
2. Data understanding phase
a. Collect the data.
b. Use exploratory data analysis to familiarize yourself with the data and dis-
cover initial insights.
c. Evaluate the quality of the data.
d. If desired, select interesting subsets that may contain actionable patterns.
3. Data preparation phase
a. Prepare from the initial raw data the final data set that is to be used for all
subsequent phases. This phase is very labor intensive.
b. Select the cases and variables you want to analyze and that are appropriate
for your analysis.
c. Perform transformations on certain variables, if needed.
d. Clean the raw data so that it is ready for the modeling tools.
4. Modeling phase
a. Select and apply appropriate modeling techniques.
b. Calibrate model settings to optimize results.
c. Remember that often, several different techniques may be used for the same
data mining problem.
d. If necessary, loop back to the data preparation phase to bring the form of
the data into line with the specific requirements of a particular data mining
technique.
5. Evaluation phase
a. Evaluate the one or more models delivered in the modeling phase for quality
and effectiveness before deploying them for use in the field.
b. Determine whether the model in fact achieves the objectives set for it in the
first phase.
c. Establish whether some important facet of the business or research problem
has not been accounted for sufficiently.
d. Come to a decision regarding use of the data mining results.
6. Deployment phase
a. Make use of the models created: Model creation does not signify the com-
pletion of a project.
b. Example of a simple deployment: Generate a report.
c. Example of a more complex deployment: Implement a parallel data mining
process in another department.
d. For businesses, the customer often carries out the deployment based on your
model.
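The adaptive sequencing of the six phases can be sketched as a small table of permitted moves. This is only an illustration under stated assumptions: the names `Phase`, `ALLOWED_MOVES`, and `is_valid_path` are hypothetical, and the table encodes just the loops mentioned in this chapter (modeling back to data preparation, evaluation back to modeling, and deployment feeding a new project), not an official CRISP–DM specification.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed reading of the outline above: each phase may move forward,
# and a few phases may loop back to an earlier one.
ALLOWED_MOVES = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.MODELING, Phase.DEPLOYMENT},
    # The outer circle: a finished project raises new questions.
    Phase.DEPLOYMENT: {Phase.BUSINESS_UNDERSTANDING},
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is a permitted move."""
    return all(nxt in ALLOWED_MOVES[cur] for cur, nxt in zip(path, path[1:]))


# A project that loops back once from modeling to data preparation.
project = [
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
]
print(is_valid_path(project))
```

A real project may of course revisit any earlier phase; the table simply makes the most common dependencies explicit.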
You can find out much more information about the CRISP–DM standard process
at www.crisp-dm.org. Next, we turn to an example of a company applying CRISP–
DM to a business problem.

CASE STUDY 1
ANALYZING AUTOMOBILE WARRANTY CLAIMS: EXAMPLE OF THE
CRISP–DM INDUSTRY STANDARD PROCESS IN ACTION [17]

Quality assurance continues to be a priority for automobile manufacturers, including
DaimlerChrysler. Jochen Hipp of the University of Tübingen, Germany, and Guido Lindner of
DaimlerChrysler AG, Germany, investigated patterns in the warranty claims for DaimlerChrysler
automobiles.

1. Business Understanding Phase

DaimlerChrysler’s objectives are to reduce costs associated with warranty claims and im-
prove customer satisfaction. Through conversations with plant engineers, who are the technical
experts in vehicle manufacturing, the researchers are able to formulate specific business prob-
lems, such as the following:
• Are there interdependencies among warranty claims?
• Are past warranty claims associated with similar claims in the future?
• Is there an association between a certain type of claim and a particular garage?

The plan is to apply appropriate data mining techniques to try to uncover these and other
possible associations.

2. Data Understanding Phase

The researchers make use of DaimlerChrysler’s Quality Information System (QUIS), which
contains information on over 7 million vehicles and is about 40 gigabytes in size. QUIS
contains production details about how and where a particular vehicle was constructed, including
an average of 30 or more sales codes for each vehicle. QUIS also includes warranty claim
information, which the garage supplies, in the form of one of more than 5000 potential
causes.
The researchers stressed the fact that the database was entirely unintelligible to domain
nonexperts: “So experts from different departments had to be located and consulted; in brief a
task that turned out to be rather costly.” They emphasize that analysts should not underestimate
the importance, difficulty, and potential cost of this early phase of the data mining process, and
that shortcuts here may lead to expensive reiterations of the process downstream.

3. Data Preparation Phase

The researchers found that although relational, the QUIS database had limited SQL access.
They needed to select the cases and variables of interest manually, and then manually derive
new variables that could be used for the modeling phase. For example, the variable number of
days from selling date until first claim had to be derived from the appropriate date attributes.
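As an illustration of this kind of derived variable, the sketch below computes the number of days from the selling date until the first claim. The function name and the example dates are hypothetical; the actual QUIS attribute names are not given in the case study.

```python
from datetime import date


def days_to_first_claim(selling_date, claim_dates):
    """Days from the selling date to the earliest claim, or None if no claims."""
    if not claim_dates:
        return None
    return (min(claim_dates) - selling_date).days


# Hypothetical vehicle: sold on 1 March 2001, with two later claims.
sold = date(2001, 3, 1)
claims = [date(2002, 1, 10), date(2001, 6, 15)]
print(days_to_first_claim(sold, claims))
```

Returning `None` for vehicles without claims keeps "no claim yet" distinct from "claim on the day of sale," which would otherwise both look like zero.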
They then turned to proprietary data mining software, which had been used at
DaimlerChrysler on earlier projects. Here they ran into a common roadblock—that the data
format requirements varied from algorithm to algorithm. The result was further exhaustive pre-
processing of the data, to transform the attributes into a form usable for model algorithms. The
researchers mention that the data preparation phase took much longer than they had planned.

4. Modeling Phase

Since the overall business problem from phase 1 was to investigate dependence among the war-
ranty claims, the researchers chose to apply the following techniques: (1) Bayesian networks
and (2) association rules. Bayesian networks model uncertainty by explicitly representing the
conditional dependencies among various components, thus providing a graphical visualization
of the dependency relationships among the components. As such, Bayesian networks represent
a natural choice for modeling dependence among warranty claims. The mining of association
rules is covered in Chapter 10. Association rules are also a natural way to investigate depen-
dence among warranty claims since the confidence measure represents a type of conditional
probability, similar to Bayesian networks.
The details of the results are confidential, but we can get a general idea of the type of
dependencies uncovered by the models. One insight the researchers uncovered was that a
particular combination of construction specifications doubles the probability of encountering
an automobile electrical cable problem. DaimlerChrysler engineers have begun to investigate
how this combination of factors can result in an increase in cable problems.
The researchers investigated whether certain garages had more warranty claims of a certain
type than did other garages. Their association rule results showed that, indeed, the confidence
levels for the rule “If garage X, then cable problem,” varied considerably from garage to garage.
They state that further investigation is warranted to reveal the reasons for the disparity.

5. Evaluation Phase

The researchers were disappointed that the support for sequential-type association rules was
relatively small, thus precluding generalization of the results, in their opinion. Overall,
the researchers state: “In fact, we did not find any rule that our domain experts would judge
as interesting, at least at first sight.” According to this criterion, then, the models were found
to be lacking in effectiveness and to fall short of the objectives set for them in the business
understanding phase. To account for this, the researchers point to the “legacy” structure of the
database, for which automobile parts were categorized by garages and factories for historic or
technical reasons and not designed for data mining. They suggest adapting and redesigning the
database to make it more amenable to knowledge discovery.

6. Deployment Phase

The researchers have identified the foregoing project as a pilot project, and as such, do not intend
to deploy any large-scale models from this first iteration. After the pilot project, however, they
have applied the lessons learned from this project, with the goal of integrating their methods
with the existing information technology environment at DaimlerChrysler. To further support
the original goal of lowering claims costs, they intend to develop an intranet offering mining
capability of QUIS for all corporate employees.

What lessons can we draw from this case study? First, the general impression
one draws is that uncovering hidden nuggets of knowledge in databases is a rocky road.
In nearly every phase, the researchers ran into unexpected roadblocks and difficulties.
This tells us that actually applying data mining for the first time in a company requires
asking people to do something new and different, which is not always welcome.
Therefore, if they expect results, corporate management must be 100% supportive of
new data mining initiatives.
Another lesson to draw is that intense human participation and supervision are
required at every stage of the data mining process. For example, the algorithms require
specific data formats, which may require substantial preprocessing (see Chapter 2).
Regardless of what some software vendor advertisements may claim, you can’t just
purchase some data mining software, install it, sit back, and watch it solve all your
problems. Data mining is not magic. Without skilled human supervision, blind use
of data mining software will only provide you with the wrong answer to the wrong
question applied to the wrong type of data. The wrong analysis is worse than no
analysis, since it leads to policy recommendations that will probably turn out to be
expensive failures.
Finally, from this case study we can draw the lesson that there is no guarantee of
positive results when mining data for actionable knowledge, any more than when one
is mining for gold. Data mining is not a panacea for solving business problems. But
used properly, by people who understand the models involved, the data requirements,
and the overall project objectives, data mining can indeed provide actionable and
highly profitable results.

FALLACIES OF DATA MINING

Speaking before the U.S. House of Representatives Subcommittee on Technology,
Information Policy, Intergovernmental Relations, and Census, Jen Que Louie, president
of Nautilus Systems, Inc., described four fallacies of data mining [18]. Two of
these fallacies parallel the warnings we described above.
• Fallacy 1. There are data mining tools that we can turn loose on our data
  repositories and use to find answers to our problems.
  ◦ Reality. There are no automatic data mining tools that will solve your problems
    mechanically “while you wait.” Rather, data mining is a process, as we have
    seen above. CRISP–DM is one method for fitting the data mining process
    into the overall business or research plan of action.
• Fallacy 2. The data mining process is autonomous, requiring little or no human
  oversight.
  ◦ Reality. As we saw above, the data mining process requires significant human
    interactivity at each stage. Even after the model is deployed, the introduction
    of new data often requires an updating of the model. Continuous quality
    monitoring and other evaluative measures must be assessed by human analysts.
• Fallacy 3. Data mining pays for itself quite quickly.
  ◦ Reality. The return rates vary, depending on the startup costs, analysis
    personnel costs, data warehousing preparation costs, and so on.
• Fallacy 4. Data mining software packages are intuitive and easy to use.
  ◦ Reality. Again, ease of use varies. However, data analysts must combine
    subject matter knowledge with an analytical mind and a familiarity with the
    overall business or research model.
To the list above, we add two more common fallacies:
• Fallacy 5. Data mining will identify the causes of our business or research
  problems.
  ◦ Reality. The knowledge discovery process will help you to uncover patterns
    of behavior. Again, it is up to humans to identify the causes.
• Fallacy 6. Data mining will clean up a messy database automatically.
  ◦ Reality. Well, not automatically. As a preliminary phase in the data mining
    process, data preparation often deals with data that has not been examined or
    used in years. Therefore, organizations beginning a new data mining operation
    will often be confronted with the problem of data that has been lying around
    for years, is stale, and needs considerable updating.
The discussion above might be termed what data mining cannot or should
not do. Next we turn to a discussion of what data mining can do.

WHAT TASKS CAN DATA MINING ACCOMPLISH?

Next, we investigate the main tasks that data mining is usually called upon to accom-
plish. The following list shows the most common data mining tasks.
• Description
• Estimation
• Prediction
• Classification
• Clustering
• Association

Description
Sometimes, researchers and analysts are simply trying to find ways to describe patterns
and trends lying within data. For example, a pollster may uncover evidence that
those who have been laid off are less likely to support the present incumbent in
the presidential election. Descriptions of patterns and trends often suggest possible
explanations for such patterns and trends. For example, those who are laid off are now
less well off financially than before the incumbent was elected, and so would tend to
prefer an alternative.
Data mining models should be as transparent as possible. That is, the results of
the data mining model should describe clear patterns that are amenable to intuitive in-
terpretation and explanation. Some data mining methods are more suited than others to
transparent interpretation. For example, decision trees provide an intuitive and human-
friendly explanation of their results. On the other hand, neural networks are compara-
tively opaque to nonspecialists, due to the nonlinearity and complexity of the model.
High-quality description can often be accomplished by exploratory data anal-
ysis, a graphical method of exploring data in search of patterns and trends. We look
at exploratory data analysis in Chapter 3.

Estimation
Estimation is similar to classification except that the target variable is numerical rather
than categorical. Models are built using “complete” records, which provide the value
of the target variable as well as the predictors. Then, for new observations, estimates
of the value of the target variable are made, based on the values of the predictors.
For example, we might be interested in estimating the systolic blood pressure reading
of a hospital patient, based on the patient’s age, gender, body-mass index, and blood
sodium levels. The relationship between systolic blood pressure and the predictor
variables in the training set would provide us with an estimation model. We can then
apply that model to new cases.
Examples of estimation tasks in business and research include:
• Estimating the amount of money a randomly chosen family of four will spend
  for back-to-school shopping this fall.
• Estimating the percentage decrease in rotary-movement sustained by a National
  Football League running back with a knee injury.
• Estimating the number of points per game that Patrick Ewing will score when
  double-teamed in the playoffs.
• Estimating the grade-point average (GPA) of a graduate student, based on that
  student’s undergraduate GPA.
Consider Figure 1.2, where we have a scatter plot of the graduate grade-point
averages (GPAs) against the undergraduate GPAs for 1000 students. Simple linear
regression allows us to find the line that best approximates the relationship between
these two variables, according to the least-squares criterion. The regression line,
indicated in blue in Figure 1.2, may then be used to estimate the graduate GPA of a
student given that student’s undergraduate GPA. Here, the equation of the regression
line (as produced by the statistical package Minitab, which also produced the graph)
is ŷ = 1.24 + 0.67x. This tells us that the estimated graduate GPA ŷ equals 1.24 plus
0.67 times the student’s undergraduate GPA. For example, if your undergrad GPA is
3.0, your estimated graduate GPA is ŷ = 1.24 + 0.67(3) = 3.25. Note that this point
(x = 3.0, ŷ = 3.25) lies precisely on the regression line, as do all linear regression
predictions.

Figure 1.2 Regression estimates lie on the regression line. (Scatter plot of graduate GPA against undergraduate GPA, with the fitted regression line.)

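The estimate can be reproduced in a few lines. The sketch below is a minimal illustration, not the Minitab procedure: `estimate_graduate_gpa` applies the reported line ŷ = 1.24 + 0.67x, and `fit_simple_linear_regression` shows the standard least-squares formulas for the slope and intercept. Both function names are hypothetical, and the three sample points are made up so that they lie exactly on the reported line.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_graduate_gpa(undergrad_gpa, intercept=1.24, slope=0.67):
    """Apply the regression line reported in the text."""
    return intercept + slope * undergrad_gpa


# Hypothetical points chosen to lie exactly on y = 1.24 + 0.67x.
xs = [2.0, 3.0, 4.0]
ys = [2.58, 3.25, 3.92]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(round(intercept, 2), round(slope, 2))
print(round(estimate_graduate_gpa(3.0), 2))
```

With real data the points scatter around the line, so the fitted coefficients summarize the overall trend rather than reproducing any single student's GPA.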
The field of statistical analysis supplies several venerable and widely used
estimation methods. These include point estimation and confidence interval estima-
tions, simple linear regression and correlation, and multiple regression. We examine
these methods in Chapter 4. Neural networks (Chapter 7) may also be used for esti-
mation.

Prediction
Prediction is similar to classification and estimation, except that for prediction, the
results lie in the future. Examples of prediction tasks in business and research include:
• Predicting the price of a stock three months into the future (Figure 1.3)
• Predicting the percentage increase in traffic deaths next year if the speed limit
  is increased
• Predicting the winner of this fall’s baseball World Series, based on a comparison
  of team statistics
• Predicting whether a particular molecule in drug discovery will lead to a
  profitable new drug for a pharmaceutical company
Any of the methods and techniques used for classification and estimation may
also be used, under appropriate circumstances, for prediction. These include the
traditional statistical methods of point estimation and confidence interval estimations,
simple linear regression and correlation, and multiple regression, investigated in
Chapter 4, as well as data mining and knowledge discovery methods such as neural
network (Chapter 7), decision tree (Chapter 6), and k-nearest neighbor (Chapter 5)
methods. An application of prediction using neural networks is examined later in the
chapter in Case Study 2.

Figure 1.3 Predicting the price of a stock three months in the future. (Plot of stock price by quarter, 1st through 4th, with the future price marked “?”.)

Classification
In classification, there is a target categorical variable, such as income bracket, which,
for example, could be partitioned into three classes or categories: high income, middle
income, and low income. The data mining model examines a large set of records, each
record containing information on the target variable as well as a set of input or predictor
variables. For example, consider the excerpt from a data set shown in Table 1.1.
Suppose that the researcher would like to be able to classify the income brackets of
persons not currently in the database, based on other characteristics associated with
that person, such as age, gender, and occupation. This task is a classification task, very
nicely suited to data mining methods and techniques. The algorithm would proceed
roughly as follows. First, examine the data set containing both the predictor variables
and the (already classified) target variable, income bracket. In this way, the algorithm
(software) “learns about” which combinations of variables are associated with which
income brackets. For example, older females may be associated with the high-income
bracket. This data set is called the training set. Then the algorithm would look at
new records, for which no information about income bracket is available. Based on
the classifications in the training set, the algorithm would assign classifications to the
new records. For example, a 63-year-old female professor might be classified in the
high-income bracket.
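The learn-then-assign procedure described above can be sketched with a nearest-neighbor rule, one of the methods treated later in the book. This is only an illustration under stated assumptions: the training set is the three-record excerpt of Table 1.1, and the distance function, including the `AGE_SCALE` constant, is a hypothetical choice made for this example rather than anything prescribed by the text.

```python
# Training set: (age, gender, occupation, income bracket), from Table 1.1.
TRAINING_SET = [
    (47, "F", "Software engineer", "High"),
    (28, "M", "Marketing consultant", "Middle"),
    (35, "M", "Unemployed", "Low"),
]

# Assumed scale so that age differences are comparable to category mismatches.
AGE_SCALE = 50.0


def distance(a, b):
    """Scaled age difference plus one unit for each categorical mismatch."""
    age_a, gender_a, occupation_a = a
    age_b, gender_b, occupation_b = b
    return (
        abs(age_a - age_b) / AGE_SCALE
        + (gender_a != gender_b)
        + (occupation_a != occupation_b)
    )


def classify(new_record, training_set=TRAINING_SET):
    """Assign the income bracket of the most similar training record."""
    nearest = min(training_set, key=lambda row: distance(new_record, row[:3]))
    return nearest[3]


# The 63-year-old female professor from the text.
print(classify((63, "F", "Professor")))
```

With only three training records the result is driven almost entirely by the gender match, which is a reminder that a classifier is only as good as its training set.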
Examples of classification tasks in business and research include:
• Determining whether a particular credit card transaction is fraudulent
• Placing a new student into a particular track with regard to special needs
• Assessing whether a mortgage application is a good or bad credit risk
• Diagnosing whether a particular disease is present
• Determining whether a will was written by the actual deceased, or fraudulently
  by someone else
• Identifying whether or not certain financial or personal behavior indicates a
  possible terrorist threat
TABLE 1.1 Excerpt from Data Set for Classifying Income

Subject  Age  Gender  Occupation            Income Bracket
001      47   F       Software engineer     High
002      28   M       Marketing consultant  Middle
003      35   M       Unemployed            Low
...

For example, in the medical field, suppose that we are interested in classifying
the type of drug a patient should be prescribed, based on certain patient characteristics,
such as the age of the patient and the patient’s sodium/potassium ratio. Figure 1.4 is
a scatter plot of patients’ sodium/potassium ratio against patients’ ages for a sample
of 200 patients. The particular drug prescribed is symbolized by the shade of the
points. Light gray points indicate drug Y; medium gray points indicate drug A or X;
dark gray points indicate drug B or C. This plot was generated using the Clementine
data mining software suite, published by SPSS.
In this scatter plot, Na/K (sodium/potassium ratio) is plotted on the Y (vertical)
axis and age is plotted on the X (horizontal) axis. Suppose that we base our prescription
recommendation on this data set.

Figure 1.4 Which drug should be prescribed for which type of patient? (Scatter plot of Na/K ratio against age.)

1. Which drug should be prescribed for a young patient with a high sodium/
potassium ratio?
◦ Young patients are on the left in the graph, and high sodium/potassium ratios
are in the upper half, which indicates that previous young patients with
high sodium/potassium ratios were prescribed drug Y (light gray points). The
recommended prediction (classification) for such patients is drug Y.
2. Which drug should be prescribed for older patients with low sodium/potassium
ratios?
◦ Patients in the lower right of the graph have been taking different prescriptions,
indicated by either dark gray (drugs B and C) or medium gray (drugs A
and X). Without more specific information, a definitive classification cannot
be made here. For example, perhaps these drugs have varying interactions
with beta-blockers, estrogens, or other medications, or are contraindicated
for conditions such as asthma or heart disease.
Graphs and plots are helpful for understanding two- and three-dimensional re-
lationships in data. But sometimes classifications need to be based on many different
predictors, requiring a many-dimensional plot. Therefore, we need to turn to more so-
phisticated models to perform our classification tasks. Common data mining methods
used for classification are k-nearest neighbor (Chapter 5), decision tree (Chapter 6),
and neural network (Chapter 7). An application of classification using decision trees
is examined in Case Study 4.

Clustering
Clustering refers to the grouping of records, observations, or cases into classes of
similar objects. A cluster is a collection of records that are similar to one another, and
dissimilar to records in other clusters. Clustering differs from classification in that
there is no target variable for clustering. The clustering task does not try to classify,
estimate, or predict the value of a target variable. Instead, clustering algorithms seek
to segment the entire data set into relatively homogeneous subgroups or clusters,
where the similarity of the records within the cluster is maximized and the similarity
to records outside the cluster is minimized.
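To make the idea of segmenting records without a target variable concrete, here is a minimal sketch of k-means, one of the clustering methods covered later in the book. The text does not prescribe an algorithm at this point, so this is an illustrative choice; the data points and starting centers are hypothetical, and the function names are our own.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def k_means(points, centers, max_iterations=100):
    """Alternate between assigning points and updating centers until stable."""
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(p, centers[i]),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical records with two numeric attributes each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
print(clusters)
```

Note that no record carries a class label: the two groups emerge only from the similarity of the records to one another.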
Claritas, Inc. [19] is in the clustering business. Among the services they provide
is a demographic profile of each of the geographic areas in the country, as defined
by zip code. One of the clustering mechanisms they use is the PRIZM segmentation
system, which describes every U.S. zip code area in terms of distinct lifestyle types
(Table 1.2). Just go to the company’s Web site [19], enter a particular zip code, and
you are shown the most common PRIZM clusters for that zip code.
What do these clusters mean? For illustration, let’s look up the clusters for
zip code 90210, Beverly Hills, California. The resulting clusters for zip code 90210
are:
• Cluster 01: Blue Blood Estates
• Cluster 10: Bohemian Mix
• Cluster 02: Winner’s Circle
• Cluster 07: Money & Brains
• Cluster 08: Young Literati

TABLE 1.2 The 62 Clusters Used by the PRIZM Segmentation System

01 Blue Blood Estates 02 Winner’s Circle 03 Executive Suites 04 Pools & Patios
05 Kids & Cul-de-Sacs 06 Urban Gold Coast 07 Money & Brains 08 Young Literati
09 American Dreams 10 Bohemian Mix 11 Second City Elite 12 Upward Bound
13 Gray Power 14 Country Squires 15 God’s Country 16 Big Fish, Small Pond
17 Greenbelt Families 18 Young Influentials 19 New Empty Nests 20 Boomers & Babies
21 Suburban Sprawl 22 Blue-Chip Blues 23 Upstarts & Seniors 24 New Beginnings
25 Mobility Blues 26 Gray Collars 27 Urban Achievers 28 Big City Blend
29 Old Yankee Rows 30 Mid-City Mix 31 Latino America 32 Middleburg Managers
33 Boomtown Singles 34 Starter Families 35 Sunset City Blues 36 Towns & Gowns
37 New Homesteaders 38 Middle America 39 Red, White & Blues 40 Military Quarters
41 Big Sky Families 42 New Eco-topia 43 River City, USA 44 Shotguns & Pickups
45 Single City Blues 46 Hispanic Mix 47 Inner Cities 48 Smalltown Downtown
49 Hometown Retired 50 Family Scramble 51 Southside City 52 Golden Ponds
53 Rural Industria 54 Norma Rae-Ville 55 Mines & Mills 56 Agri-Business
57 Grain Belt 58 Blue Highways 59 Rustic Elders 60 Back Country Folks
61 Scrub Pine Flats 62 Hard Scrabble

Source: Claritas, Inc.


The description for cluster 01, Blue Blood Estates, is: “Established executives,
professionals, and ‘old money’ heirs that live in America’s wealthiest suburbs. They
are accustomed to privilege and live luxuriously—one-tenth of this group’s members
are multimillionaires. The next affluence level is a sharp drop from this pinnacle.”
Examples of clustering tasks in business and research include:
• Target marketing of a niche product for a small-capitalization business that does
  not have a large marketing budget
• For accounting auditing purposes, to segment financial behavior into benign
  and suspicious categories
• As a dimension-reduction tool when the data set has hundreds of attributes
• For gene expression clustering, where very large quantities of genes may exhibit
  similar behavior
Clustering is often performed as a preliminary step in a data mining process,
with the resulting clusters being used as further inputs into a different technique
downstream, such as neural networks. We discuss hierarchical and k-means clustering
in Chapter 8 and Kohonen networks in Chapter 9. An application of clustering is
examined in Case Study 5.

Association
The association task for data mining is the job of finding which attributes “go to-
gether.” Most prevalent in the business world, where it is known as affinity analysis or
market basket analysis, the task of association seeks to uncover rules for quantifying
the relationship between two or more attributes. Association rules are of the form “If
antecedent, then consequent,” together with a measure of the support and confidence
associated with the rule. For example, a particular supermarket may find that of the
1000 customers shopping on a Thursday night, 200 bought diapers, and of those 200
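who bought diapers, 50 bought beer. Thus, the association rule would be “If buy
diapers, then buy beer” with a support of 50/1000 = 5% (the proportion of all
customers who bought both items) and a confidence of 50/200 = 25% (the proportion
of diaper buyers who also bought beer).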
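
The arithmetic in the diapers-and-beer example can be checked with a few lines of Python. This is only a sketch: the function name rule_metrics is hypothetical. Conventions for support differ between sources. Some report the share of transactions containing the antecedent (here 200/1000 = 20%), whereas the more common definition uses the share containing both antecedent and consequent (here 50/1000 = 5%). The sketch reports both so that the two readings can be compared.

```python
def rule_metrics(n_total, n_antecedent, n_both):
    """Return (antecedent_share, support, confidence) for a rule "if A then B".

    n_total      -- number of transactions
    n_antecedent -- transactions containing A
    n_both       -- transactions containing both A and B
    """
    antecedent_share = n_antecedent / n_total   # P(A)
    support = n_both / n_total                  # P(A and B)
    confidence = n_both / n_antecedent          # P(B | A)
    return antecedent_share, support, confidence


# Counts taken from the supermarket example in the text.
share, support, confidence = rule_metrics(n_total=1000, n_antecedent=200, n_both=50)
print(f"antecedent share = {share:.0%}")
print(f"support          = {support:.0%}")
print(f"confidence       = {confidence:.0%}")
```

Whichever convention is adopted, the confidence of 25% is unaffected, since it depends only on the 200 diaper buyers and the 50 among them who bought beer.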
Examples of association tasks in business and research include:
• Investigating the proportion of subscribers to a company’s cell phone plan that
  respond positively to an offer of a service upgrade
• Examining the proportion of children whose parents read to them who are
  themselves good readers
• Predicting degradation in telecommunications networks
• Finding out which items in a supermarket are purchased together and which
  items are never purchased together
• Determining the proportion of cases in which a new drug will exhibit dangerous
  side effects
We discuss two algorithms for generating association rules, the a priori algo-
rithm and the GRI algorithm, in Chapter 10. Association rules were utilized in Case
Study 1. We examine another application of association rules in Case Study 3.
18 CHAPTER 1 INTRODUCTION TO DATA MINING

Next we examine four case studies, each of which demonstrates a particular
data mining task in the context of the CRISP–DM data mining standard process.

CASE STUDY 2
PREDICTING ABNORMAL STOCK MARKET RETURNS
USING NEURAL NETWORKS [20]

1. Business/Research Understanding Phase

Alan M. Safer, of California State University–Long Beach, reports that stock market trades
made by insiders usually have abnormal returns. Increased profits can be made by outsiders
using legal insider trading information, especially by focusing on attributes such as company
size and the time frame for prediction. Safer is interested in using data mining methodol-
ogy to increase the ability to predict abnormal stock price returns arising from legal insider
trading.

2. Data Understanding Phase

Safer collected data from 343 companies, extending from January 1993 to June 1997 (the
source of the data being the Securities and Exchange Commission). The stocks used in the
study were all of the stocks that had insider records for the entire period and were in the S&P
600, S&P 400, or S&P 500 (small, medium, and large capitalization, respectively) as of June
1997. Of the 946 resulting stocks that met this description, Safer chose only those stocks that
underwent at least two purchase orders per year, to assure a sufficient amount of transaction
data for the data mining analyses. This resulted in 343 stocks being used for the study. The
variables in the original data set include the company, name and rank of the insider, transaction
date, stock price, number of shares traded, type of transaction (buy or sell), and number of
shares held after the trade. To assess an insider’s prior trading patterns, the study examined the
previous 9 and 18 weeks of trading history. The prediction time frames for predicting abnormal
returns were established as 3, 6, 9, and 12 months.

3. Data Preparation Phase

Safer decided that the company rank of the insider would not be used as a study attribute, since
other research had shown it to be of mixed predictive value for predicting abnormal stock price
returns. Similarly, he omitted insiders who were uninvolved with company decisions. (Note
that the present author does not necessarily agree with omitting variables prior to the modeling
phase, because of earlier findings of mixed predictive value. If they are indeed of no predictive
value, the models will so indicate, presumably. But if there is a chance of something interesting
going on, the model should perhaps be given an opportunity to look at it. However, Safer is the
domain expert in this area.)

4. Modeling Phase

The data were split into a training set (80% of the data) and a validation set (20%). A neural
network model was applied, which uncovered the following results:

a. Certain industries had the most predictable abnormal stock returns, including:
   • Industry group 36: electronic equipment, excluding computer equipment
   • Industry group 28: chemical products
   • Industry group 37: transportation equipment
   • Industry group 73: business services
b. Predictions that looked further into the future (9 to 12 months) had increased ability to
identify unusual insider trading variations than did predictions that had a shorter time
frame (3 to 6 months).
c. It was easier to predict abnormal stock returns from insider trading for small companies
than for large companies.
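
The 80%/20% partition mentioned at the start of the modeling phase can be sketched as follows. The source does not say how Safer assigned records to the two sets, so the random shuffle, the seed, and the function name train_validation_split are all assumptions made for illustration. For time-ordered data such as stock transactions, a split by date may well be more appropriate than a random one.

```python
import random


def train_validation_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into training and validation sets."""
    shuffled = list(records)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)     # seeded, so the split is repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-ins for transaction records.
records = list(range(100))
train, validation = train_validation_split(records)
print(len(train), len(validation))
```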

5. Evaluation Phase

Safer concurrently applied a multivariate adaptive regression spline (MARS, not covered here)
model to the same data set. The MARS model uncovered many of the same findings as the
neural network model, including results (a) and (b) from the modeling phase. Such a conflu-
ence of results is a powerful and elegant method for evaluating the quality and effectiveness
of the model, analogous to getting two independent judges to concur on a decision. Data
miners should strive to produce such a confluence of results whenever the opportunity arises.
This is possible because often more than one data mining method may be applied appropri-
ately to the problem at hand. If both models concur as to the results, this strengthens our
confidence in the findings. If the models disagree, we should probably investigate further.
Sometimes, one type of model is simply better suited to uncovering a certain type of re-
sult, but sometimes, disagreement indicates deeper problems, requiring cycling back to earlier
phases.

6. Deployment Phase

The publication of Safer’s findings in Intelligent Data Analysis [20] constitutes one method of
model deployment. Now, analysts from around the world can take advantage of his methods to
track the abnormal stock price returns of insider trading and thereby help to protect the small
investor.

CASE STUDY 3
MINING ASSOCIATION RULES FROM LEGAL DATABASES [21]

1. Business/Research Understanding Phase

The researchers, Sasha Ivkovic and John Yearwood of the University of Ballarat, and Andrew
Stranieri of La Trobe University, Australia, are interested in whether interesting and actionable
association rules can be uncovered in a large data set containing information on applicants for
government-funded legal aid in Australia. Because most legal data is not structured in a manner
easily suited to most data mining techniques, application of knowledge discovery methods to
legal data has not developed as quickly as in other areas. The researchers’ goal is to improve
the delivery of legal services and just outcomes in law, through improved use of available legal
data.

2. Data Understanding Phase

The data are provided by Victoria Legal Aid (VLA), a semigovernmental organization that
aims to provide more effective legal aid for underprivileged people in Australia. Over 380,000
applications for legal aid were collected from the 11 regional offices of VLA, spanning 1997–
1999, including information on more than 300 variables. In an effort to reduce the number of
variables, the researchers turned to domain experts for assistance. These experts selected seven
of the most important variables for inclusion in the data set: gender, age, occupation, reason for
refusal of aid, law type (e.g., civil law), decision (i.e., aid granted or not granted), and dealing
type (e.g., court appearance).

3. Data Preparation Phase

The VLA data set turned out to be relatively clean, containing very few records with missing or
incorrectly coded attribute values. This is in part due to the database management system used
by the VLA, which performs quality checks on input data. The age variable was partitioned
into discrete intervals such as “under 18,” “over 50,” and so on.

4. Modeling Phase

Rules were restricted to having only a single antecedent and a single consequent. Many in-
teresting association rules were uncovered, along with many uninteresting rules, which is the
typical scenario for association rule mining. One such interesting rule was: If place of birth =
Vietnam, then law type = criminal law, with 90% confidence.
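
To make the restriction to a single antecedent and a single consequent concrete, here is a minimal sketch that enumerates every such rule in a small table and keeps those whose confidence reaches a threshold. The records below are invented toy data, not VLA data, and the function name single_rules and the 0.8 threshold are assumptions for illustration only.

```python
from collections import Counter
from itertools import permutations


def single_rules(records, min_confidence=0.8):
    """Return {(antecedent, consequent): (support, confidence)} for rules of the
    form "if attr_a = x then attr_b = y" that meet the confidence threshold."""
    antecedent_counts = Counter()
    pair_counts = Counter()
    for rec in records:
        items = list(rec.items())             # (attribute, value) pairs
        antecedent_counts.update(items)
        pair_counts.update(permutations(items, 2))
    rules = {}
    for (a, b), n_both in pair_counts.items():
        confidence = n_both / antecedent_counts[a]
        if confidence >= min_confidence:
            rules[(a, b)] = (n_both / len(records), confidence)
    return rules


# Invented toy records; attribute names loosely follow the study's variables.
toy_records = [
    {"gender": "M", "law_type": "criminal", "decision": "granted"},
    {"gender": "M", "law_type": "criminal", "decision": "granted"},
    {"gender": "M", "law_type": "criminal", "decision": "refused"},
    {"gender": "F", "law_type": "family", "decision": "granted"},
    {"gender": "F", "law_type": "civil", "decision": "refused"},
]

rules = single_rules(toy_records)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"if {a[0]} = {a[1]} then {b[0]} = {b[1]}: "
          f"support {support:.0%}, confidence {confidence:.0%}")
```

Even on five records the output mixes potentially informative rules with trivial ones, which mirrors the remark that interesting and uninteresting rules tend to be uncovered together.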
The researchers proceeded on the accurate premise that association rules are interesting
if they spawn interesting hypotheses. A discussion among the researchers and experts for the
reasons underlying the association rule above considered the following hypotheses:
• Hypothesis A: Vietnamese applicants applied for support only for criminal law and not
  for other types, such as family and civil law.
• Hypothesis B: Vietnamese applicants committed more crime than other groups.
• Hypothesis C: There is a lurking variable. Perhaps Vietnamese males are more likely
  than females to apply for aid, and males are more associated with criminal law.
• Hypothesis D: The Vietnamese did not have ready access to VLA promotional material.

The panel of researchers and experts concluded informally that hypothesis A was most
likely, although further investigation is perhaps warranted, and no causal link can be assumed.
Note, however, the intense human interactivity throughout the data mining process. Without
the domain experts’ knowledge and experience, the data mining results in this case would not
have been fruitful.

5. Evaluation Phase

The researchers adopted a unique evaluative methodology for their project. They brought in
three domain experts and elicited from them their estimates of the confidence levels for each of
144 association rules. These estimated confidence levels were then compared with the actual
confidence levels of the association rules uncovered in the data set.

6. Deployment Phase

A useful Web-based application, WebAssociator, was developed, so that nonspecialists could
take advantage of the rule-building engine. Users select the single antecedent and single conse-
quent using a Web-based form. The researchers suggest that WebAssociator could be deployed
as part of a judicial support system, especially for identifying unjust processes.

CASE STUDY 4
PREDICTING CORPORATE BANKRUPTCIES USING
DECISION TREES [22]

1. Business/Research Understanding Phase

The recent economic crisis in East Asia has spawned an unprecedented level of corporate
bankruptcies in that region and around the world. The goal of the researchers, Tae Kyung
Sung from Kyonggi University, Namsik Chang from the University of Seoul, and Gunhee
Lee of Sogang University, Korea, is to develop models for predicting corporate bankruptcies
that maximize the interpretability of the results. They felt that interpretability was important
because a negative bankruptcy prediction can itself have a devastating impact on a financial
institution, so that firms that are predicted to go bankrupt demand strong and logical reaso-
ning.
If one’s company is in danger of going under, and a prediction of bankruptcy could itself
contribute to the final failure, that prediction had better be supported by solid “trace-able” evidence,
not by a simple up/down decision delivered by a black box. Therefore, the researchers chose
decision trees as their analysis method, because of the transparency of the algorithm and the
interpretability of results.

2. Data Understanding Phase

The data included two groups, Korean firms that went bankrupt in the relatively stable growth
period of 1991–1995, and Korean firms that went bankrupt in the economic crisis conditions of
1997–1998. After various screening procedures, 29 firms were identified, mostly in the
manufacturing sector. The financial data were collected directly from the Korean Stock Exchange,
and verified by the Bank of Korea and the Korea Industrial Bank.

3. Data Preparation Phase

Fifty-six financial ratios were identified by the researchers through a search of the literature
on bankruptcy prediction, 16 of which were then dropped due to duplication. There remained
40 financial ratios in the data set, including measures of growth, profitability, safety/leverage,
activity/efficiency, and productivity.

4. Modeling Phase

Separate decision tree models were applied to the “normal-conditions” data and the “crisis-
conditions” data. As we shall learn in Chapter 6, decision tree models can easily generate rule
sets. Some of the rules uncovered for the normal-conditions data were as follows:
• If the productivity of capital is greater than 19.65, predict nonbankrupt with 86%
  confidence.
• If the ratio of cash flow to total assets is greater than −5.65, predict nonbankrupt with
  95% confidence.
• If the productivity of capital is at or below 19.65 and the ratio of cash flow to total assets
  is at or below −5.65, predict bankrupt with 84% confidence.
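
Part of the appeal of such rule sets is that they translate directly into readable code. The sketch below encodes the three normal-conditions rules listed above. Only some of the rules are quoted in the text, so this is not the researchers’ full model. The function name, the argument names, and the order in which the first two rules are checked are assumptions for illustration.

```python
def predict_normal_conditions(productivity_of_capital, cash_flow_to_total_assets):
    """Apply the three quoted normal-conditions rules.

    Returns (prediction, confidence), where confidence is the figure
    reported alongside the matching rule.
    """
    if productivity_of_capital > 19.65:
        return "nonbankrupt", 0.86
    if cash_flow_to_total_assets > -5.65:
        return "nonbankrupt", 0.95
    # Both ratios are at or below their thresholds.
    return "bankrupt", 0.84


# Hypothetical firms, chosen only to exercise each rule once.
print(predict_normal_conditions(25.0, -10.0))
print(predict_normal_conditions(10.0, 3.2))
print(predict_normal_conditions(10.0, -8.0))
```

Every prediction can be traced back to a specific threshold comparison, which is the kind of transparency the researchers were seeking.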

Some of the rules uncovered for the crisis-conditions data were as follows:

• If the productivity of capital is greater than 20.61, predict nonbankrupt with 91%
  confidence.
• If the ratio of cash flow to liabilities is greater than 2.64, predict nonbankrupt with 85%
  confidence.
• If the ratio of fixed assets to stockholders’ equity and long-term liabilities is greater than
  87.23, predict nonbankrupt with 86% confidence.
• If the productivity of capital is at or below 20.61, and the ratio of cash flow to liabilities
  is at or below 2.64, and the ratio of fixed assets to stockholders’ equity and long-term
  liabilities is at or below 87.23, predict bankrupt with 84% confidence.

Cash flow and productivity of capital were found to be important regardless of the eco-
nomic conditions. While cash flow is well known in the bankruptcy prediction literature, the
identification of productivity of capital was relatively rare, which therefore demanded further
verification.

5. Evaluation Phase

The researchers convened an expert panel of financial specialists, which unanimously selected
productivity of capital as the most important attribute for differentiating firms in danger of
bankruptcy from other firms. Thus, the unexpected results discovered by the decision tree
model were verified by the experts.
To ensure that the model was generalizable to the population of all Korean manufacturing
firms, a control sample of nonbankrupt firms was selected, and the attributes of the control
sample were compared to those of the companies in the data set. It was found that the con-
trol sample’s average assets and average number of employees were within 20% of the data
sample.
Finally, the researchers applied multiple discriminant analysis as a performance benchmark.
Many of the 40 financial ratios were found to be significant predictors of bankruptcy, and the
final discriminant function included variables identified by the decision tree model.

6. Deployment Phase

There was no deployment identified per se. As mentioned earlier, deployment is often at
the discretion of users. However, because of this research, financial institutions in Korea are
now better aware of the predictors for bankruptcy for crisis conditions, as opposed to normal
conditions.

CASE STUDY 5
PROFILING THE TOURISM MARKET USING k-MEANS
CLUSTERING ANALYSIS [23]

1. Business/Research Understanding Phase

The researchers, Simon Hudson and Brent Ritchie, of the University of Calgary, Alberta,
Canada, are interested in studying intraprovince tourist behavior in Alberta. They would like
to create profiles of domestic Albertan tourists based on the decision behavior of the tourists.
The overall goal of the study was to form a quantitative basis for the development of an
intraprovince marketing campaign, sponsored by Travel Alberta. Toward this goal, the main
objectives were to determine which factors were important in choosing destinations in Alberta,
to evaluate the domestic perceptions of the “Alberta vacation product,” and to attempt to
comprehend the travel decision-making process.

2. Data Understanding Phase

The data were collected in late 1999 using a phone survey of 13,445 Albertans. The respondents
were screened according to those who were over 18 and had traveled for leisure at least
80 kilometers for at least one night within Alberta in the past year. Only 3071 of these 13,445
completed the survey and were eligible for inclusion in the study.

3. Data Preparation Phase

One of the survey questions asked the respondents to indicate the extent to which each of
13 listed factors influenced their travel decisions. These factors were then used as the
variables upon which the cluster analysis was performed, and included such factors as the
quality of accommodations, school holidays, and weather conditions.

4. Modeling Phase

Clustering is a natural method for generating segment profiles. The researchers chose k-means
clustering, since that algorithm is quick and efficient as long as you know the number of
clusters you expect to find. They explored models with between two and six clusters before
settling on a five-cluster solution as best reflecting reality. Brief profiles of the clusters are as follows:
• Cluster 1: the young urban outdoor market. Youngest of all clusters, equally balanced
  genderwise, with school schedules and budgets looming large in their travel decisions.
• Cluster 2: the indoor leisure traveler market. Next youngest and predominantly female, mostly
  married with children, with visiting family and friends a major factor in travel plans.
• Cluster 3: the children-first market. More married and more children than any other
  cluster, with children’s sports and competition schedules having great weight in deciding
  where to travel in Alberta.
• Cluster 4: the fair-weather-friends market. Second-oldest, slightly more male group,
  with weather conditions influencing travel decisions.
• Cluster 5: the older, cost-conscious traveler market. The oldest of the clusters, most
  influenced by cost/value considerations and a secure environment when making Alberta
  travel decisions.
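
The practice of trying several values of k before choosing one can be illustrated with a small, self-contained k-means sketch. Nothing here reproduces the researchers’ software or data. The points are an invented two-dimensional toy set, the function names are hypothetical, and the deterministic farthest-point seeding is an assumption chosen so that the run is repeatable (k-means is more commonly started from random seeds). The within-cluster sum of squares (WCSS) printed for each k is one common aid for comparing candidate solutions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, wcss)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:      # assignments stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Invented toy data: two well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

results = {k: kmeans(points, k) for k in range(2, 7)}
for k, (labels, centroids, wcss) in results.items():
    print(f"k = {k}: WCSS = {wcss:.3f}, labels = {labels}")
```

In practice the choice among candidate values of k rests on interpretability as much as on any numeric criterion, as the five-cluster profiles above suggest.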

5. Evaluation Phase

Discriminant analysis was used to verify the “reality” of the cluster categorizations, correctly
classifying about 93% of subjects into the right clusters. The discriminant analysis also showed
that the differences between clusters were statistically significant.

6. Deployment Phase

These study findings resulted in the launching of a new marketing campaign, “Alberta, Made to
Order,” based on customizing the marketing to the cluster types uncovered in the data mining.
More than 80 projects were launched, through a cooperative arrangement between government
and business. “Alberta, Made to Order” television commercials have now been viewed about
20 times by over 90% of adults under 55. Travel Alberta later found an increase of over 20%
in the number of Albertans who indicated Alberta as a “top-of-the-mind” travel destination.

REFERENCES
1. Peter Fabris, Advanced navigation, CIO Magazine, May 15, 1998,
http://www.cio.com/archive/051598-mining.html.
2. Bill Clinton, New York University speech, Salon.com, December 6, 2002,
http://www.salon.com/politics/feature/2002/12/06/clinton/print.html.
3. Mining Data to Save Children with Brain Tumors, SPSS, Inc., http://spss.com/success/.
4. The Gartner Group, www.gartner.com.
5. David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press,
Cambridge, MA, 2001.
6. Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi, Discov-
ering Data Mining: From Concept to Implementation, Prentice Hall, Upper Saddle River,
NJ, 1998.
7. Rachel Konrad, Data mining: Digging user info for gold, ZDNET News, February 7, 2001,
http://zdnet.com.com/2100-11-528032.html?legacy=zdnn.
8. The Technology Review Ten, MIT Technology Review, January/February 2001.
9. Jennifer Mateyaschuk, The 1999 National IT Salary Survey: Pay up, Information Week,
http://www.informationweek.com/731/salsurvey.htm.
10. The Boston Celtics, http://www.nba.com/celtics/.
11. Peter Gwynne, Digging for data, Think Research,
domino.watson.ibm.com/comm/wwwr-thinkresearch.nsf/pages/datamine296.html.
12. John Naisbitt, Megatrends, 6th ed., Warner Books, New York, 1986.
13. Michael Berry and Gordon Linoff, Data Mining Techniques for Marketing, Sales and
Customer Support, Wiley, Hoboken, NJ, 1997.
14. Michael Berry and Gordon Linoff, Mastering Data Mining, Wiley, Hoboken, NJ, 2000.
15. Quoted in: Mihael Ankerst, The perfect data mining tool: Interactive or automated? Report
on the SIGKDD-2002 Panel, SIGKDD Explorations, Vol. 5, No. 1, July 2003.
16. Peter Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinart,
Colin Shearer, and Rudiger Wirth, CRISP–DM Step-by-Step Data Mining Guide, 2000,
http://www.crisp-dm.org/.
17. Jochen Hipp and Guido Lindner, Analyzing warranty claims of automobiles: an appli-
cation description following the CRISP–DM data mining process, in Proceedings of the
5th International Computer Science Conference (ICSC ’99), pp. 31–40, Hong Kong, De-
cember 13–15, 1999, © Springer.

18. Jen Que Louie, President of Nautilus Systems, Inc. (www.nautilus-systems.com),
testimony before the U.S. House of Representatives Subcommittee on Technology, Infor-
mation Policy, Intergovernmental Relations, and Census, Congressional Testimony, March
25, 2003.
19. www.Claritas.com.
20. Alan M. Safer, A comparison of two data mining techniques to predict abnormal stock
market returns, Intelligent Data Analysis, Vol. 7, pp. 3–13, 2003.
21. Sasha Ivkovic, John Yearwood, and Andrew Stranieri, Discovering interesting association
rules from legal databases, Information and Communication Technology Law, Vol. 11,
No. 1, 2002.
22. Tae Kyung Sung, Namsik Chang, and Gunhee Lee, Dynamics of modeling in data min-
ing: interpretive approach to bankruptcy prediction, Journal of Management Information
Systems, Vol. 16, No. 1, pp. 63–85, 1999.
23. Simon Hudson and Brent Ritchie, Understanding the domestic market using cluster analysis:
a case study of the marketing efforts of Travel Alberta, Journal of Vacation Marketing,
Vol. 8, No. 3, pp. 263–276, 2002.

EXERCISES
1. Refer to the Bank of America example early in the chapter. Which data mining task or
tasks are implied in identifying “the type of marketing approach for a particular customer,
based on the customer’s individual profile”? Which tasks are not explicitly relevant?
2. For each of the following, identify the relevant data mining task(s):
a. The Boston Celtics would like to approximate how many points their next opponent
will score against them.
b. A military intelligence officer is interested in learning about the respective proportions
of Sunnis and Shias in a particular strategic region.
c. A NORAD defense computer must decide immediately whether a blip on the radar is
a flock of geese or an incoming nuclear missile.
d. A political strategist is seeking the best groups to canvass for donations in a particular
county.
e. A homeland security official would like to determine whether a certain sequence of
financial and residence moves implies a tendency to terrorist acts.
f. A Wall Street analyst has been asked to find out the expected change in stock price for
a set of companies with similar price/earnings ratios.

3. For each of the following meetings, explain which phase in the CRISP–DM process is
represented:
a. Managers want to know by next week whether deployment will take place. Therefore,
analysts meet to discuss how useful and accurate their model is.
b. The data mining project manager meets with the data warehousing manager to discuss
how the data will be collected.
c. The data mining consultant meets with the vice president for marketing, who says that
he would like to move forward with customer relationship management.
d. The data mining project manager meets with the production line supervisor to discuss
implementation of changes and improvements.
e. The analysts meet to discuss whether the neural network or decision tree models should
be applied.
4. Discuss the need for human direction of data mining. Describe the possible consequences
of relying on completely automatic data analysis tools.
5. CRISP–DM is not the only standard process for data mining. Research an alternative
methodology. (Hint: SEMMA, from the SAS Institute.) Discuss the similarities and dif-
ferences with CRISP–DM.
6. Discuss the lessons drawn from Case Study 1. Why do you think the author chose a case
study where the road was rocky and the results less than overwhelming?
7. Consider the business understanding phase of Case Study 2.
a. Restate the research question in your own words.
b. Describe the possible consequences for any given data mining scenario of the data
analyst not completely understanding the business or research problem.
8. Discuss the evaluation method used for Case Study 3 in light of Exercise 4.
9. Examine the association rules uncovered in Case Study 4.
a. Which association rule do you think is most useful under normal conditions? Under
crisis conditions?
b. Describe how these association rules could be used to help decrease the rate of company
failures in Korea.
10. Examine the clusters found in Case Study 5.
a. Which cluster do you find yourself or your relatives in?
b. Describe how you would use the information from the clusters to increase tourism in
Alberta.
CHAPTER 2
DATA PREPROCESSING

WHY DO WE NEED TO PREPROCESS THE DATA?
DATA CLEANING
HANDLING MISSING DATA
IDENTIFYING MISCLASSIFICATIONS
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
DATA TRANSFORMATION
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS

Chapter 1 introduced us to data mining and the CRISP–DM standard process for data
mining model development. The case studies we looked at in Chapter 1 gave us an idea
of how businesses and researchers apply phase 1 in the data mining process, business
understanding or research understanding. We saw examples of how businesses and
researchers first enunciate project objectives, then translate these objectives into the
formulation of a data mining problem definition, and finally, prepare a preliminary
strategy for achieving these objectives.
Here in Chapter 2 we examine the next two phases of the CRISP–DM standard
process, data understanding and data preparation. We show how to evaluate the qual-
ity of the data, clean the raw data, deal with missing data, and perform transformations
on certain variables.
All of Chapter 3 is devoted to exploratory data analysis, a very important aspect of the
data understanding phase. The heart of any data mining project is the modeling phase,
which we begin examining in Chapter 4.

WHY DO WE NEED TO PREPROCESS THE DATA?

Much of the raw data contained in databases is unpreprocessed, incomplete, and noisy.
For example, the databases may contain:
• Fields that are obsolete or redundant
• Missing values
• Outliers
• Data in a form not suitable for data mining models
• Values not consistent with policy or common sense

Discovering Knowledge in Data: An Introduction to Data Mining, by Daniel T. Larose
ISBN 0-471-66657-2 Copyright © 2005 John Wiley & Sons, Inc.

To be useful for data mining purposes, the databases need to undergo prepro-
cessing, in the form of data cleaning and data transformation. Data mining often
deals with data that hasn’t been looked at for years, so that much of the data con-
tains field values that have expired, are no longer relevant, or are simply missing.
The overriding objective is to minimize GIGO: to minimize the “garbage” that gets
into our model so that we can minimize the amount of garbage that our models give
out.
Dorian Pyle, in his book Data Preparation for Data Mining [1], estimates that
data preparation alone accounts for 60% of all the time and effort expended in the
entire data mining process. In this chapter we examine two principal methods for
preparing the data to be mined: data cleaning and data transformation.

DATA CLEANING

To illustrate the need to clean up data, let’s take a look at some of the types of errors
that could creep into even a tiny data set, such as that in Table 2.1. Let’s discuss,
attribute by attribute, some of the problems that have found their way into the data
set in Table 2.1. The customer ID variable seems to be fine. What about zip?
Let’s assume that we are expecting all of the customers in the database to have the
usual five-numeral U.S. zip code. Now, customer 1002 has this strange (to American
eyes) zip code of J2S7K7. If we were not careful, we might be tempted to classify this
unusual value as an error and toss it out, until we stop to think that not all countries
use the same zip code format. Actually, this is the zip code of St. Hyacinthe, Quebec,
Canada, so it probably represents real data from a real customer. What has evidently
occurred is that a French-Canadian customer has made a purchase and put their home
zip code down in the field required. Especially in this era of the North American Free
Trade Agreement, we must be ready to expect unusual values in fields such as zip
codes, which vary from country to country.
What about the zip code for customer 1004? We are unaware of any countries
that have four-digit zip codes, such as the 6269 indicated here, so this must be an error,

TABLE 2.1 Can You Find Any Problems in This Tiny Data Set?

Customer ID   Zip      Gender   Income     Age   Marital Status   Transaction Amount

1001          10048    M        75000      C     M                5000
1002          J2S7K7   F        −40000     40    W                4000
1003          90210             10000000   45    S                7000
1004          6269     M        50000      0     S                1000
1005          55101    F        99999      30    D                3000
right? Probably not. Zip codes for the New England states begin with the numeral 0.
Unless the zip code field is defined to be character (text) and not numeric, the software
will probably chop off the leading zero, which is apparently what happened here. The
zip code is probably 06269, which refers to Storrs, Connecticut, home of the University
of Connecticut.
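
One way to guard against the lost leading zero is to treat zip codes as text and pad short all-digit values back to five characters. The sketch below is an illustration only. The function name normalize_zip is hypothetical, and the padding rule rests on the assumption that any all-digit value shorter than five characters is a U.S. zip code that was stored as a number. Values containing letters, such as the Canadian code above, are passed through unchanged.

```python
def normalize_zip(raw):
    """Return the zip code as text, restoring leading zeros dropped by numeric storage."""
    text = str(raw).strip()
    if text.isdigit() and len(text) < 5:
        return text.zfill(5)     # e.g. a numeric 6269 becomes the text "06269"
    return text                  # leave five-digit and non-numeric codes alone


# Zip values as they appear in Table 2.1 (the 6269 entry stored as a number).
for raw in ["10048", "J2S7K7", "90210", 6269, "55101"]:
    print(repr(raw), "->", normalize_zip(raw))
```

A repaired value is still only a best guess, so it may be worth logging which records were padded in case the assumption turns out to be wrong for some of them.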
The next field, gender, contains a missing value for customer 1003. We detail
methods for dealing with missing values later in the chapter.
The income field, which we assume is measuring annual gross income, has three
potentially anomalous values. First, customer 1003 is shown as having an income of
$10,000,000 per year. Although entirely possible, especially when considering the
customer’s zip code (90210, Beverly Hills), this value of income is nevertheless an
outlier, an extreme data value. Certain statistical and data mining modeling techniques
do not function smoothly in the presence of outliers; we examine methods of handling
outliers later in the chapter.
Poverty is one thing, but it is rare to find an income that is negative, as our
poor customer 1002 has. Unlike customer 1003’s income, customer 1002’s reported
income of −$40,000 lies beyond the field bounds for income and therefore must be
an error. It is unclear how this error crept in, with perhaps the most likely explanation
being that the negative sign is a stray data entry error. However, we cannot be sure and
should approach this value cautiously, attempting to communicate with the database
manager most familiar with the database history.
So what is wrong with customer 1005’s income of $99,999? Perhaps nothing;
it may in fact be valid. But if all the other incomes are rounded to the nearest $5000,
why the precision with customer 1005? Often, in legacy databases, certain specified
values are meant to be codes for anomalous entries, such as missing values. Perhaps
99999 was coded in an old database to mean missing. Again, we cannot be sure and
should refer to the “wetware.”
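The income checks discussed above can be collected into one small routine. The sketch below is a hedged illustration, not a prescribed method: the function name income_flags, the flag labels, the lower bound of 0, the treatment of 99999 as a legacy code, and the outlier threshold of 1,000,000 are all assumptions made for this example. The routine only flags values for follow-up; it does not change or delete anything.

```python
INCOME_MIN = 0            # assumed lower field bound for annual income
SENTINEL_CODES = {99999}  # assumed legacy code that may mean "missing"


def income_flags(income, outlier_threshold=1_000_000):
    """Return a list of reasons an income value deserves a second look.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    flags = []
    if income < INCOME_MIN:
        flags.append("below_field_bound")
    if income in SENTINEL_CODES:
        flags.append("possible_missing_code")
    if income % 5000 != 0:
        # The other incomes in Table 2.1 are multiples of 5000.
        flags.append("not_rounded_to_5000")
    if income >= outlier_threshold:
        flags.append("possible_outlier")
    return flags


# Income values from Table 2.1, keyed by customer ID.
incomes = {1001: 75000, 1002: -40000, 1003: 10_000_000, 1004: 50000, 1005: 99999}
report = {cid: income_flags(value) for cid, value in incomes.items()}
for cid, flags in report.items():
    print(cid, flags)
```

A flagged record is a question for the database manager, not a verdict: the $10,000,000 income may be real, and 99999 may be a genuine income.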
Finally, are we clear as to which unit of measure the income variable is measured
in? Databases often get merged, sometimes without bothering to check whether such
merges are entirely appropriate for all fields. For example, it is quite possible that
customer 1002, with the Canadian zip code, has an income measured in Canadian
dollars, not U.S. dollars.
The age field has a couple of problems. Although all the other customers have
numerical values for age, customer 1001’s “age” of C probably reflects an earlier cat-
egorization of this man’s age into a bin labeled C. The data mining software will defi-
nitely not like this categorical value in an otherwise numerical field, and we will have
to resolve this problem somehow. How about customer 1004’s age of 0? Perhaps there
is a newborn male living in Storrs, Connecticut, who has made a transaction of $1000.
More likely, the age of this person is missing and was coded as 0 to indicate
this or some other anomalous condition (e.g., refused to provide the age information).
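A simple screen for the two age problems might look as follows. This is a sketch under assumptions: the helper name check_age and its labels are hypothetical, and treating 0 as a possible missing-value code is a guess that should be confirmed with whoever maintains the database.

```python
def check_age(raw):
    """Classify an age entry as usable, non-numeric, or a possible missing code.

    Hypothetical helper for illustration only.
    """
    text = str(raw).strip()
    if not text.isdecimal():
        # Catches categorical leftovers such as the bin label "C".
        return ("non_numeric", None)
    age = int(text)
    if age == 0:
        # Assumption: 0 may have been used to encode a missing age.
        return ("possible_missing_code", None)
    return ("ok", age)


# Age entries from Table 2.1, read as text, keyed by customer ID.
ages = {1001: "C", 1002: "40", 1003: "45", 1004: "0", 1005: "30"}
checked = {cid: check_age(value) for cid, value in ages.items()}
for cid, result in checked.items():
    print(cid, result)
```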
Of course, keeping an age field in a database is a minefield in itself, since the
passage of time will quickly make the field values obsolete and misleading. It is better
to keep date-type fields (such as birthdate) in a database, since these are constant and
may be transformed into ages when needed.
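Deriving age from a stored birthdate is a short computation. The sketch below assumes a hypothetical function name, age_on, and a made-up birthdate; it counts whole years and subtracts one if the birthday has not yet occurred in the reference year.

```python
from datetime import date


def age_on(birthdate, as_of):
    """Return the age in whole years on the date as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)


# Hypothetical birthdate, used only to illustrate the calculation.
birthdate = date(1980, 6, 15)
day_before = age_on(birthdate, date(2005, 6, 14))
on_birthday = age_on(birthdate, date(2005, 6, 15))
print(day_before, on_birthday)
```

Because the birthdate never changes, the same record yields a correct age whenever the analysis is run, which is the advantage the paragraph above describes.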
The marital status field seems fine, right? Maybe not. The problem lies in the
meaning behind these symbols. We all think we know what these symbols mean, but
are sometimes surprised. For example, if you are in search of cold water in a rest room
in Montreal and turn on the faucet marked C, you may be in for a surprise, since the
C stands for chaud, which is French for hot. There is also the problem of ambiguity.
In Table 2.1, for example, does the S for customers 1003 and 1004 stand for single
or separated?
The transaction amount field seems satisfactory as long as we are confident
that we know what unit of measure is being used and that all records are transacted
in this unit.

HANDLING MISSING DATA

Missing data is a problem that continues to plague data analysis methods. Even as
our analysis methods gain sophistication, we continue to encounter missing values
in fields, especially in databases with a large number of fields. The absence of infor-
mation is rarely beneficial. All things being equal, more data is almost always better.
Therefore, we should think carefully about how we handle the thorny issue of missing
data.
To help us tackle this problem, we will introduce ourselves to a new data
set, the cars data set, originally compiled by Barry Becker and Ronny Kohavi
of Silicon Graphics, and available at the SGI online data repository at
www.sgi.com/tech/mlc/db. The data set, also available on the book series Web site
accompanying the text, consists of information about 261 automobiles manufactured
in the 1970s and 1980s, including gas mileage, number of cylinders, cubic inches,
horsepower, and so on.
Suppose, however, that some of the field values were missing for certain records.
Figure 2.1 provides a peek at the first 10 records in the data set, with some of

Figure 2.1 Some of our field values are missing!


successfully, and sat down stiffly to my task, first calling to my aid the lofty
and clear perceptions, the noble and sonorous expressions of my old
instructor, the Archbishop of Grenada.

I began by laying it down as a first maxim of political philosophy, that
the vital functions, the respiration as it were of all monarchy, depended on
the strict administration of the finances; that in our particular case, that duty
became imperiously urgent, irresistibly impressing on our consciences; and
that the revenue should be considered as the nerves and sinews of Spain, to
hold her rivals in check and keep her enemies in awe. After this general
declamation, I pointed out to the sovereign—for to him the memorial was
addressed—that by cutting down all pensions and perquisites dependent on
the ordinary income, he would not thereby deprive himself of that truly
royal pleasure, a princely munificence towards those of his subjects who
had established a fair claim to his favors; because, without drawing upon
his treasury, he had the means of distributing more acceptable rewards; that
for one branch of service, there were viceroyalties, lieutenancies, orders of
merit, and all sorts of military commissions; for another, high judicial
situations with salaries annexed, civil offices of magistracy with sounding
titles to give them consequence; and though last, not least, all the temporal
possessions of the church to animate the piety of its spiritual pastors.

This memorial, which was much longer than the first, occupied me
nearly three days; but as luck would have it, my performance was exactly to
my master's mind, who, finding it written with sententious cogency, and
bristled up with metaphors in the declamatory parts, complimented me in
the highest terms. That is vastly well expressed indeed, said he, laying his
finger on a passage here and there, and picking out all the most inflated
sentences he could find: that language bears the stamp of fine composition,
and might pass for the production of a classic. Courage, my friend! I foresee
that your services will be worth their weight in gold. And yet,
notwithstanding the applauses he lavished on my classical composition, a
few of his own heightening touches, he thought, would make it read still
better. He put a good deal of his own stuff into it, and the medley was
manufactured into a piece of eloquence which was considered as
unanswerable by the king and all the court. The whole city joined in opinion
with the higher orders, deriving the most flattering hopes of the future from
these grand promises, and concluding that the monarchy must recover its
pristine splendor during the ministry of so illustrious a character. His
excellency, finding that my sermon on economy was fraught with practical
inferences of utility to him, was kind enough to wish that I should profit by
the exercise of my own talents. In conformity therefore with his new system
of patronage, he gave me an annuity of five hundred crowns on the
commandery of Castile; and the acceptance of it was so much the more
palatable, as no dirty work had been done for it, but it was honestly though
cheaply earned.

CHAPTER VII.

GIL BLAS MEETS WITH HIS FRIEND FABRICIO ONCE MORE; THE
ACCIDENT, PLACE, AND CIRCUMSTANCES DESCRIBED, WITH
THE PARTICULARS OF THEIR CONVERSATION TOGETHER.

Nothing gave his lordship greater pleasure than to hear the general
decision of Madrid on the conduct of his administration. Not a day passed
but he inquired what they were saying of him in the political world. He kept
spies in pay, to bring him an exact account of what was going on in the city.
They particularized the most trivial discourses which they overheard; and
their orders being to suppress nothing, his self-love was grazed now and
then, for the people have a way of bolting out home truths, without any nice
calculation where they may glance.

Finding that the count loved political small talk, I made it my business to
frequent places of public resort after dinner, and to chime in with the
conversation of genteel people whenever opportunity offered. Should the
measures of government happen to be canvassed among them, I pricked up
my ears, and greedily took in their discourse; if any thing worth repeating
was said, his excellency was sure to hear of it. It can scarcely be necessary
to hint, that I never carried home any thing which was not likely to pay for
the porterage.

One day, returning from one of these little conversational parties, my
road lay in front of a hospital. It occurred to me to go in. I walked through
two or three wards filled with diseased patients, and examined their beds to
see that they were properly taken care of. Among these unhappy wretches,
whom I could not look at without the most painful feelings, I observed one
whose features struck me: it surely could be no other than Fabricio, my
countryman and chum! To look at him more closely, I drew near his
bedside, and finding, beyond a possibility of doubt, that it was the poet
Nunez, I stopped to look at him for a few seconds without saying a word.
He also fixed his regards on me. At length breaking silence, Do not my eyes
deceive me? said I. Is it indeed Fabricio, and here? It is indeed, answered
he, coldly, and you need not wonder at it. Since we parted, I have been
working indefatigably at the trade of an author: I have written novels, plays,
and works of genius in every department. My brain is fairly spun out, and
here I am.

I could not help laughing at such a sketch of literary biography, and still
more at the serious air of the accompanying action. What! cried I, has your
muse brought you to this pass? Has she played you such a jade's trick as
this? Even as you witness, answered he; this establishment is a sort of half-
pay receptacle for invalids on the muster-roll of disabled wit. You have
acted discreetly, my good friend, to lay yourself out for promotion in a
different line. But they tell me, you are no longer a courtier, and that your
prospects in political life were all blasted; nay, they went so far as to affirm,
that you were committed to close custody by the king's order. They told you
no more than the truth, replied I; the delightful vision of political eminence
wherein you left me last, soon shifted the scene of my incoherent dreams to
a prison and complete destitution. But for all that, my friend, here you
behold me again in a better plight than ever. That is quite out of the
question, said Nunez: your deportment is discreet and decent; you have not
that supercilious and devil-take-the-hindmost sort of aspect which good
keep communicates to the human face. The reverses of this checkered life,
replied I, have brought me down to the level of the more modest virtues; I
have taken a lesson in the school of adversity, to enjoy the possession of a
good stud without riding the great horse.

Tell me then candidly, cried Fabricio, raising his head upon his hand
with his elbow upon the pillow, what your present occupation can possibly
be. A steward perhaps to some nobleman out at elbows, or man of business
to some rich widow! Something better than either the one or the other,
rejoined I; but excuse me from saying more at present: another time your
curiosity shall be satisfied. It is enough at present to assure you that my
means are equal to my inclination, and that you may command
independence through me; but then you must submit to an embargo on your
wit, and a non-intercourse act between you and the faculty of writing,
whether in verse or prose. Can you make this sacrifice to my friendship? I
have already made it to the powers above, said he, in my last critical
sickness. A Dominican made me forswear poetry, as an amusement
bordering on criminality, but at all events beside the turnpike-road of good
sense. I wish you joy, my dear Nunez, replied I; but beware of a revoke.
There is not the least danger on that head, rejoined he: the Muses and I have
agreed on terms of separation: just as you came in at that door, I was
conning over a farewell ode. Good Master Fabricio, said I, with a wise
swagging to and fro of my head, it is a doubtful question whether your vow
of abjuration ought to pass current with the Dominican and myself: you
seem over head and ears in love with those virgins incarnate. No, no,
contended he peevishly, I have cut the connection asunder. Nay, more, I
have quarrelled with their keepers, the public. The readers of these days do
not deserve an author of more genius than themselves: I should be sorry to
write down to their comprehension. You are not to suppose that this is the
language of disgust; it is my sincere and well-weighed opinion. Applause
and hisses are just the same to me. It is a toss up who fails and who
succeeds: the wit of to-day is the blockhead of to-morrow. What cursed
fools our dramatists must be, to care for anything but their poundage when
their plays happen to be received! It is all very well for a few nights! But
only fancy a revival at the end of twenty years, and what a figure they will
cut then! The audiences of the present day turn up their noses at the stock
pieces of the last age, and it is a question whether their taste will fare better
with their more critical descendants. If that conjecture be probable, the
inventors of clap-traps now will be the butt of cat-calls hereafter. It is just
the same with novel writers, and all other manufacturers of unnecessary
literature; they strut and fret for an hour, and then are no more seen or heard
of. The glories of successful authorship are the mere vapors of a murky
atmosphere, meteors of a marsh, foul coruscations of a dunghill, cathedral
tapers to put out the galaxy, blue flames of coarse paper held over a candle.

Though these caricatures of rival renown were the mere creations of
jealousy in the poet of the Asturias, it was not my business to correct his ill
temper. I am delighted, said I, that wit and you have had so serious a
quarrel, and that the diarrhœa of your inventive faculties has been cured by
an astringent. You may depend on it, I will put you in the way of a good
livelihood, without drawing deep upon your intellectual credit. So much the
better, cried he; wit smells like carrion in my nostrils, or rather like a
pungent and deleterious perfume; fragrant to the sense, but corrosive to the
vitals. I heartily wish, my dear Fabricio, resumed I, that you may always
keep in that mind. Only wash your hands completely of poetry, and, you
may depend on it, I will enable you to keep your head above water without
picking or stealing. In the mean while, added I, slipping a purse of sixty
pistoles into his hand, accept this as a slight instance of my regard.

O friend like the friends in days of yore, cried the son of barber Nunez,
out of his wits with joy and gratitude, it was heaven itself which sent you
into this hospital, whence your goodness is now discharging me! Before we
parted, I gave him my address, and invited him to come and see me as soon
as his health would permit. He opened his eyes as an oyster does its shell,
when I told him that I lodged under the minister's roof. O illustrious Gil
Blas! said he, great as Pompey and fortunate as Sylla, whose lot it is to be
hand in glove with the dictators of modern times! I rejoice most
disinterestedly in your good fortune, because it is so very evident what a
noble use you make of it.

CHAPTER VIII.

GIL BLAS GETS FORWARD PROGRESSIVELY IN HIS MASTER'S
AFFECTIONS. SCIPIO'S RETURN TO MADRID, AND ACCOUNT OF
HIS JOURNEY.

The Count of Olivarez, whom I shall henceforward call my lord duke,
because the king was pleased to confer that dignity on him about this time,
was infested with a weakness which I did not suffer to pass without taking
toll; it was a furious desire of being beloved. The moment he fancied that
any one really liked him, his heart was caught in a trap. This was not lost
upon my keen sense of character. It was not enough to do precisely as he
ordered; I superadded a zeal in the execution which made him mine. I laid
myself out to his liking in every thing, and provided beforehand for his
most eccentric wishes.

By conduct like this, which almost always answers, I became by degrees
my master's favorite; and he, on the other hand, as if he had got round to my
blind side also, wormed himself into my affections by giving me his own.
So forward did I get into his good graces, as to halve his confidence with
Signor Carnero, his principal secretary.

Carnero had played my game, and that so successfully as to be intrusted
with the greater mysteries. We two, therefore, were the keepers of the prime
minister's conscience, and held the keys of all his secrets; with this
difference, that Carnero was consulted on state affairs, myself about his
private concerns, dividing the business into two separate departments; and
we were each of us equally pleased with our own. We lived together
without jealousy, and certainly without attachment. I had every reason to be
satisfied with my quarters, where continual intercourse gave me an
opportunity of prying into the duke's inmost soul, which was a masked
battery to all mankind beside, but plain as a pikestaff to me, when he no
longer questioned the sincerity of my attachment to him.

Santillane, said he one day, you were witness to the Duke of Lerma's
possession of an authority more like that of an absolute monarch than a
favorite minister; and yet I am still happier than he was at the very summit
of his good fortune. He had two formidable enemies in his own son, the
Duke of Uzeda, and in the confessor of Philip the Third: but there is no one
now about the king who has credit enough to stand in my way, or even, as I
am aware, the slightest inclination to do me mischief.

It is true, continued he, that on my accession to the ministry, it was my
first care to remove all hangers-on from about the prince but those of my
own family or connections. By means of viceroyalties or embassies I got rid
of all the nobility who, by their personal merit, could have interfered with
me in the good graces of the sovereign, whom I mean to engross entirely to
myself; so that I may say at the present moment, no statesman of the time
holds me in check by the ascendency of his personal influence. You see, Gil
Blas, I open my mind to you. As I have reason to think that you are mine,
heart and soul, I have chosen to put you in possession of everything. You
are a clever youth, with reflection, penetration, and discretion; in short, you
are just the very creature to acquit yourself of all possible little offices in all
possible directions; you are also a young fellow of very promising parts,
and must, in the nature of things, be in my interests.

There was no standing the attack which these flattering representations
were calculated to make upon the weakly-defended fortress of my
philosophy. Unauthorized whims of avarice and ambition mounted
suddenly into my head, and brought forward certain sentiments of political
speculation which were supposed to have been in abeyance. I gave the
minister an assurance that I should fulfil his intentions to the utmost of my
power, and held myself in readiness to execute, without examination or
interference, all the orders it might be his pleasure to give me.

While I was thus disposed to take fortune in her affable fit, Scipio
returned from his peregrination. I have no long story for you, said he. The
lords of Leyva were delighted at your reception from the king, and at the
manner in which the Count of Olivarez and you came to understand one
another.

My friend, said I, you would have delighted them still more, had you
been able to tell them on what a footing I am now with my lord. My
advances since your departure have been prodigious. Happy man be his
dole, my dear master, answered he: my mind forebodes that we shall cut a
figure.

Let us change the subject, said I, and talk of Oviedo. You have been in
the Asturias. How did you leave my mother? Ah, sir! replied he, with an
undertaker's decency of countenance, I have a melancholy tale to tell you
from that quarter. O heaven! exclaimed I, my mother then is dead! Six
months since, said my secretary, did the good lady pay the debt of nature,
and your uncle, Signor Gil Perez, about the same period.

My mother's death preyed upon my susceptible nature, though in my
childhood I had not received from her those little fondling indications of
maternal love so necessary to amalgamate with the more serious
convictions of filial duty. The good canon, too, came in for his share in
bringing me up according to the rules of godliness and honesty. My serious
grief was not lasting; but I never lost sight of a certain tender recollection,
whenever the idea of my dear relations shot across my mind.

CHAPTER IX.

HOW MY LORD DUKE MARRIED HIS ONLY DAUGHTER, AND TO
WHOM; WITH THE BITTER CONSEQUENCES OF THAT
MARRIAGE.

Very shortly after the son of Cosclina's return, my lord duke fell into a
brown study; and it lasted a complete week. I conceived, of course, that he
was brooding over some great measure of government; but family concerns
were the object of his musings. Gil Blas, said he one day after dinner, you
may perceive that my mind is a good deal distracted. Yes, my good friend, I
am pondering over an affair of the utmost consequence to my feelings. You
shall know all about it.

My daughter, Donna Maria, pursued he, is marriageable, and of course
beset with suitors. The Count de Niéblés, eldest son of the Duke de Medina
Sidonia, head of the Guzman family, and Don Lewis de Haro, eldest son of
the Marquis de Carpio and my eldest sister, are the two most likely
competitors. The latter, in particular, is superior in point of merit to all his
rivals, so that the whole court has fixed on him for my son-in-law.
Nevertheless, without entering into private motives for treating him, as well
as the Count de Niéblés, with a refusal, my present views are fixed upon
Don Ramires Nunez de Guzman, Marquis of Toral, head of the Guzmans
d'Abrados, another branch of the family. To that nobleman and his progeny,
by my daughter, I mean to leave all my property, and to entail on them the
title of Count d'Olivarez, with the additional dignity of grandee; so that my
grandchildren and their descendants, issue of the Abrados and Olivarez
branch, will be considered as taking precedence in the house of Guzman.

Tell me now, Santillane, added he, do you not like my project? Excuse
me, my lord, pleaded I, with a shrug; the design is worthy of the genius
which gave birth to it: my only fear is, lest the Duke of Medina Sidonia
should think fit to be out of humor at it. Let him take it as he list, resumed
the minister; I give myself very little concern about that. His branch is no
favorite with me: they have choused that of Abrados out of their precedence
and many of their privileges. I shall be far less affected by his ill humors
than by the disappointment of my sister, the Marchioness de Carpio, when
she sees my daughter slip through her son's fingers. But let that be as it may,
I am determined to please myself, and Don Ramires shall be the man; it is a
settled point.

My lord duke, having announced this firm resolve, did not carry it into
effect without giving a new proof of his singular policy. He presented a
memorial to the king, entreating him and the queen, in concert, to do him
the honor of taking the choice of a husband for his daughter on themselves,
at the same time acquainting them with the pretensions of the suitors, and
professing to abide by their election; but he took care, when naming the
Marquis de Toral, to evince clearly whither his own wishes pointed. The
king, therefore, with a blind deference for his minister, answered thus:—

"I think that Don Ramires Nunez deserves Donna Maria; but determine
for yourself. The match of your own choosing will be most agreeable to me.

(Signed) THE KING."


The minister made a point of showing this answer everywhere; and
affecting to consider it as a royal mandate, hastened his daughter's marriage
with the Marquis de Toral; a death-blow to the hopes of the Marchioness de
Carpio and the rest of the Guzmans who had been speculating on an
alliance with Donna Maria. These rival players of a losing game, not being
able to break off the match, put the best face they could upon it, and made
the fashionable world to resound with their costly celebrations of the event.
A superficial observer might have fancied that the whole family was
delighted with the arrangement; but the pouters and ill-wishers were soon
revenged most cruelly at my lord duke's expense. Donna Maria was brought
to bed of a daughter at the end of ten months; the infant was still-born, and
the mother died a few days afterwards.

What a loss for a father who had no eyes, as one may say, but for his
daughter, and in her loss felt the miscarriage of his design to quash the right
of precedence in the branch of Medina Sidonia! Stung to the quick by his
misfortune, he shut himself up for several days, and was visible to no one
but myself; a sincere sympathizer, from the recollection of my own
experience in his sorrow. The occasion drew forth fresh tears to Antonia's
memory. The death of the Marchioness de Toral, under circumstances so
similar, tore open a wound imperfectly skinned over, and so exasperated my
affliction, that the minister, though he had enough to do with his own
sufferings, could not help taking notice of mine. It seemed unaccountable
how exactly his feelings were echoed. Gil Blas, said he one day, when my
tears seemed to feed upon indulgence, my greatest consolation consists in
having a bosom friend so much alive to all my distresses. Ah! my lord,
answered I, giving him the full credit of my amiable tenderness, I must be
ungrateful and degenerate in my nature if I did not lament as for myself.
Can I be aware that you mourn over a daughter of accomplished merit,
whom you loved so tenderly, without shedding tears of fellow-feeling? No,
my lord, I am too much naturalized to you on the side of obligation not to
take a permanent interest in all your pleasures and disappointments.

CHAPTER X.

GIL BLAS MEETS WITH THE POET NUNEZ BY ACCIDENT, AND
LEARNS THAT HE HAS WRITTEN A TRAGEDY, WHICH IS ON THE
POINT OF BEING BROUGHT OUT AT THE THEATRE ROYAL. THE
ILL FORTUNE OF THE PIECE, AND THE GOOD FORTUNE OF ITS
AUTHOR.

The minister began to pick up his crumbs, and myself consequently to
get into feather again, when one evening I went out alone in the carriage to
take an airing. On the road I met the poet of the Asturias, who had been lost
to my knowledge ever since his discharge from the hospital. He was very
decently dressed. I called him up, gave him a seat in my carriage, and we
drove together to St. Jerome's meadow.

Master Nunez, said I, it is lucky for me to have met you accidentally; for
otherwise I should not have had the pleasure... No severe speeches,
Santillane, interrupted he with considerable eagerness: I must own frankly
that I did not mean to keep up your acquaintance, and I will tell you the
reason. You promised me a good situation provided I abjured poetry; but I
have found a very excellent one on condition of keeping my talents in
constant play. I accepted the latter alternative, as squaring best with my own
humor. A friend of mine got me an employment under Don Bertrand Gomez
del Ribero, treasurer of the king's galleys. This Don Bertrand, wanting to
have a wit in his pay, and finding my turn for poetical composition very
much in unison with his own sense of what is excellent, has chosen me in
preference to five or six authors who offered themselves as candidates for
the place of his private secretary.

I am delighted at the news, my dear Fabricio, said I, for this Don
Bertrand must be very rich. Rich indeed! answered he; they say that he does
not know himself how much he is worth. However that may be, my
business under him is as follows: He prides himself on his turn for
gallantry, at the same time wishing to pass for a man of genius; he therefore
keeps up an epistolary intercourse of wit with several ladies who have an
infinite deal, and borrows my brain to indite such letters as may amplify the
opinion of his sprightliness and elegance. I write to one for him in verse, to
another in prose, and sometimes carry the letters myself, to prove the agility
of my heels as well as the ingenuity of my head.

But you do not tell me, said I, what I most want to know. Are you well
paid for your epigrammatic cards of compliment? Yes, most plentifully,
answered he. Rich men are not always open-handed; and I know some who
are downright curmudgeons; but Don Bertrand has behaved in the most
handsome manner. Besides a salary of two hundred pistoles, I receive some
little occasional perquisites from him, sufficient to set me above the world,
and enable me to live on an equal footing with some choice spirits of the
literary circles, who are willing, like myself, to set care at defiance. But
then, resumed I, has your treasurer critical skill enough to distinguish the
beauties of a performance from its blemishes? The least likely man in the
world, answered Nunez; a flippant-tongued smatterer, with a miserable
assortment of materials for judging. Yet he gives himself out for chief
justice and lord president of Apollo's tribunal. His decisions are
adventurous, if not always lucky; while his opinions are maintained in so
high a tone and with so bullying a challenge of infallibility, that nine times
out of ten the issue of an argument is silence, though not conviction, on the
part of the opponent, as a measure of precaution against the gathering storm
of foul language and contemptuous sneers.

You may readily suppose, continued he, that I take especial care never to
contradict him, though it almost exceeds human patience to forbear; for, to
say nothing of the unpalatable phrases that might be hailed down on my
defenceless head, I should stand a very good chance of being shoved by the
shoulders out of doors. I therefore am discreet enough to approve what he
praises, and to condemn without mitigation or appeal whatever he is
pleased to find fault with. By this easy compliance—for poets are
compelled to acquire a knack of knocking under to those by whom they
live, not even excepting their booksellers—I have gained the esteem and
friendship of my patron. He has employed me to write a tragedy on a plot of
his own. I have executed it under his inspection; and if the piece succeeds, a
percentage on the laud and honor must accrue to him.

I asked our poet what was the title of his tragedy. He informed me that it
was "The Count of Saldagna," and that it would come out in two or three
days. I told him that I wished it all possible success, and thought so
favorably of his genius as to entertain considerable hopes. So do I, said he;
but hope never tells a more flattering tale than in the ear of a dramatic
author. You might as well attempt to fix the wind by nailing the
weathercock as speculate on the reception of a new piece with an audience.

At length the day of performance arrived. I could not go to the play,
being prevented by official business. The only thing to be done was to send
Scipio, that he might bring me back word how it went off, for I was
sincerely interested in the event. After waiting impatiently for his return, in
he came with a long face, which boded no good. Well, said I, how was "The
Count of Saldagna" welcomed by the critics? Very roughly, answered he;
never was there a play more brutally handled; I left the house in high anger
at the injustice and insolence of the pit. It serves him right, rejoined I.
Nunez is no better than a madman, to be always running his head against
the stone walls of a theatre. If he was in his senses, could he have preferred
the hisses and catcalls of an unfeeling mob to the ease and dignity he might
have commanded under my patronage? Thus did I inveigh with friendly
vehemence against the poet of the Asturias, and disturb the even tenor of
my mind for an event which the sufferer hailed with joy, and inserted
among the well-omened particulars of his journal.

He came to see me within two days, and appeared in high spirits.
Santillane, cried he, I am come to receive your congratulations. My fortune
is made, my friend, though my play is marred. You know what a mistake
they made on the first and last night of "The Count of Saldagna;" hissed
instead of applauding! You would have thought all the wild beasts of the
forest had been let loose, with their ears fortified against the softening
power of poetry; but the more they bellowed, the better I fared, and they
have roared me into a provision for life.

There was no knowing what to make of this incident in the drama of our
poet's adventures. What is all this, Fabricio? said I; how can theatrical
damnation have conjured up such Elysian ecstasy? It is exactly so,
answered he; I told you before that Don Bertrand had thrown in some of the
circumstances; and he was fully convinced that there was no defect but in
the taste of the spectators. They might be very good judges; but, if they
were, he was no judge at all! Nunez, said he this morning,

Victrix causa Diis placuit, sed victa Catoni.[*]

[*] Members of parliament, and the ladies, will probably expect a
translation of these hard words; but I refer the former to their
dictionaries, to which they bade a long farewell on leaving Eton or
Harrow, and the latter to an extended paraphrase of five acts in the
tragedy of Cato. Those of the softer sex who may think the Stoic
philosophy rude and uncouth, will feel their nerves vibrate in unison
with the love scenes. TRANSLATOR.

Your piece has been ill received by the public; but against that you may
place my entire approbation, and thus you ought to set your heart at rest. By
way of something to balance the bad taste of the age, I shall settle an
annuity of two thousand crowns on you: go to my solicitor, and let him
draw the deed. We have been about it: the treasurer has signed and sealed;
my first quarter is paid in advance...

I wished Fabricio joy on the unhappy fate of "The Count of Saldagna;"
and probably most authors would have envied his failure more than all the
success that ever succeeded. You are in the right, continued he, to prefer my
fortune to my fame. What a lucky peal of disapprobation in double choir! If
the public had chosen to ring the changes on my merits rather than my
misdeeds, what would they have done for my pocket? A mere paltry
nothing. The common pay of the theatre might have kept me from starving;
but the wind of popular malice has blown me a comfortable pension,
engrossed on safe and legal parchment.

CHAPTER XI.

SANTILLANE GIVES SCIPIO A SITUATION; THE LATTER SETS OUT
FOR NEW SPAIN.

My secretary could not look at the unexpected good luck of Nunez the
poet without envy; he talked of nothing else for a week. The whims of that
baggage, fortune, said he, are most unaccountable: she delights to turn her
lottery wheel into the lap of a sorry author, while she deals out her
disappointments like a step-mother to the race of good ones. I should have
no objection, though, if she would throw me up a prize in one of her vertical
progresses. That is likely enough to happen, said I, and sooner than you
imagine. Here you are in her temple; for it is scarcely too presumptuous to
call the house of a prime minister the temple of fortune, where favors are
conferred by wholesale, and votaries grow fat on the spoils of her altar. That
is very true, sir, answered he; but we must have patience, and wait till the
happy moment comes. Take my advice while it is worth having, Scipio,
replied I, and make your mind easy: perhaps you are on the eve of some
good appointment. And so it turned out; for within a few days an
opportunity offered of employing him advantageously in my lord duke's
service; and I did not suffer the happy moment to pass by.

I was engaged in chat one morning with Don Raymond Caporis, the
prime minister's steward, and our conversation turned on the sources of his
excellency's income. My lord, said he, enjoys the commanderies of all the
military orders, yielding a revenue of forty thousand crowns a year; and he
is only obliged to wear the cross of Alcantara. Moreover, his three offices of
great chamberlain, master of the horse, and high chancellor of the Indies,
bring him in an income of two hundred thousand crowns; and yet all this is
nothing in comparison of the immense sums which he receives through
other transatlantic channels; but you will be puzzled to guess how. When
vessels clear out from Seville or Lisbon for those parts of the world, he
ships wine, oil, grain, and other articles, the produce of his own estate; and
his consignments are duty free. With that perquisite in his pocket, he sells
his merchandise for four times its current price in Spain, and then lays out
the money in spices, coloring materials, and other things which cost next to
nothing in the new world, and are sold very dear in Europe. Already has he
realized some millions by this traffic, without detracting from the dues of
his royal master.

You will easily account for it, continued he, that the people concerned in
carrying on this trade return with great fortunes in their pockets; for my lord
thinks it but reasonable that they should divide their diligence between his
business and their own.

That shrewd son of chance and opportunity, of whom we are speaking,
overheard our conversation, and could not help interrupting Don Raymond
to the following purport: Upon my word, Signor Caporis, I should like to be
one of those people; for I am fond of travelling, and have long wished to
see Mexico. Your inclinations as a tourist shall soon be gratified, said the
steward, if Signor de Santillane will not stand in the way of your wishes.
However particular I may think it my duty to be about the persons whom I
send to the West Indies in that capacity,—and they are all of my
appointment,—you shall be placed on the list at all adventures, if your
master wishes it. You will confer on me a particular favor, said I to Don
Raymond; be so good as to do it in kindness to me. Scipio is a young fellow
much in my good graces, very capable in business, and will be found
irreproachable in his conduct. In a word, I would as soon answer for him as
myself.

That being the case, replied Caporis, he has only to repair immediately
to Seville: the ships are to sail for South America in a month. I shall give
him a letter at his departure for a man who will put him in the way of
making a fortune, without the slightest interference in his excellency's dues
and profits, which ought to be held sacred by him.

Scipio, delighted with his berth, was in haste to set out for Seville, with a
thousand crowns, with which I furnished him, to make purchases of wine
and oil in Andalusia, and enable him to trade on his own bottom in the West
Indies. And yet, overjoyed as he was to make a voyage, and as he hoped his
fortune therewithal, he could not part from me without tears; and the
separation raised the waters even from my dry fountains.

CHAPTER XII.

DON ALPHONSO DE LEYVA COMES TO MADRID; THE MOTIVE OF
HIS JOURNEY A SEVERE AFFLICTION TO GIL BLAS, AND A
CAUSE OF REJOICING SUBSEQUENT THEREON.

No sooner had I parted with Scipio than one of the minister's pages
brought me a note conceived in the following terms: "If Signor de
Santillane will take the trouble of calling at the sign of Saint Gabriel, in the
street of Toledo, he will there see a friend who is not indifferent to him."

Who can this nameless friend possibly be? said I to myself. What can be
the meaning of all this mystery? Obviously to occasion me the pleasure of a
surprise. I attended the summons immediately, and on my arrival at the
place appointed, was not a little astonished to find Don Alphonso de Leyva
there. Is it possible! exclaimed I: you here, my lord? Yes, my dear Gil Blas,
answered he with a close compression of my hand in his, it is Don
Alphonso himself. Well! but what brings you to Madrid? said I. You will be
not a little startled, rejoined he, and no less vexed at the occasion of my
journey. They have taken my government of Valencia from me, and the
prime minister has sent for me to give an account of my conduct. For a
whole quarter of an hour I was like a man stupefied; then, recovering the
powers of speech, Of what, said I, are you accused? I know nothing at all
about it, answered he; but my disgrace is probably owing to a visit paid
about three weeks ago to the Cardinal Duke of Lerma, who was banished
about a month since to his seat at Denia.

Yes, indeed! cried I in a pet, you may well attribute your misfortune to
that imprudent visit: there is no occasion to look out for causes and effects
elsewhere; but give me leave to say that you have not acted with your usual
good sense, in claiming acquaintance with that favorite out of favor. The
leap is taken, and the neck broken, said he; and I have nothing to do but to
make the best of a bad bargain: I shall retire with my family to our paternal
estate at Leyva, where the remnant of my days will glide away in peace and
obscurity. What taunts and teases me is the requisition of appearing before a
haughty minister, who may receive me with all the insolence of office. How
humiliating to the pride of a Spaniard! And yet it is a measure of necessity;
but before the degrading ceremony took place, I wanted to talk it over with
you. Sir, said I, do not announce your arrival to the minister, till I have
ascertained the nature of the reports to your discredit, for there are few evils
without a remedy. Whatever may be your alleged crimes, you will give me
leave, if you please, to act in the affair as gratitude and friendship shall
dictate. With this assurance, I left him at his inn, and promised to let him
hear from me soon.

As I had taken no active part in state affairs since the two memorials, in
which my eloquence was so signally displayed, I went to look for Carnero,
with a view to inquire whether Don Alphonso's government was really
taken from him. He answered in the affirmative, but professed not to know
the reason. Finding how things stood, I determined to apply at head-
quarters, and to learn the grounds of grievance from his lordship's own
mouth.

My spirits were really harassed, so that there was no need of putting on
the trappings and the suits of woe, to attract my lord duke's notice. What is
the matter, Santillane? said he as soon as he saw me. I perceive a marked
unhappiness on your countenance, and tears just ready to trickle down your
cheeks. Has any one behaved ill to you? Tell me, and you shall have your
revenge. My lord, answered I in a melancholy tone, even though my grief
would seek to hide itself, it must have vent: my despair is past endurance.
The report goes that Don Alphonso is no longer governor of Valencia; a
severer stroke could not have been inflicted on me. What say you, Gil Blas?
replied the minister in astonishment: what interest can you take in this Don
Alphonso and his government? On this question, I detailed at length my
obligations to the lords of Leyva, and modestly stated my own interference
with the Duke of Lerma, to obtain the appointment for my friend.

When his excellency had heard me through with the most polite and kind
attention, he spoke thus: Make yourself easy, Gil Blas. Besides my entire
ignorance of what you have just told me, I must own that I considered Don
Alphonso as the cardinal's creature. Only put yourself in my place: was not
the visit to his eminence a most suspicious circumstance? Yet I am willing
to believe that, owing his preferment to that minister, he might have
remembered him in his adversity from a motive of pure gratitude. I am
sorry for having displaced a man who owed his elevation to you; but if I
have pulled down your handiwork I can build it up again. I mean to do still
more than the Duke of Lerma for you. Your friend Don Alphonso was only
governor of Valencia; I appoint him viceroy of Arragon: you may send him
word so yourself, and order him hither to take the oaths.

At these words, my feelings changed from extreme grief to an excess of
joy, which completely caricatured the mediocrity of common sense, and
made me utter an incoherent rhapsody of thanks: but the want of method in
the madness of my discourse was not taken amiss; and on my hinting that
Don Alphonso was already at Madrid, he told me that I might present him
this very day. I ran to the sign of Saint Gabriel, and communicated my own
raptures to Don Cæsar's son, by informing him of his new appointment. He
could not believe what I told him, but found it a hard matter to persuade
himself that the prime minister, though likely enough to be very well
disposed towards me, should extend his friendship so far as to dispose of
viceroyalties at my instance. I carried him with me to my lord duke, who
received him very affably, complimented him on his uniform good conduct
in his government of Valencia, and finished by saying that the king,
considering him as qualified for a higher station, had named him for the
viceroyalty of Arragon. Besides, added he, your family is of a rank not to
disparage the dignity of the office, so that the Arragonese nobility will have
no plea for excepting against the choice of the court.

His excellency made no mention of me, and the public was kept in the
dark as to my share in the business; indeed, this prudent silence was lucky
both for Don Alphonso and the minister, since the tongues of defamers
would have been busy in taking to pieces the pretensions of a viceroy who
owed his preferment to my patronage.

As soon as Don Cæsar's son could speak with certainty of his new
honors, he sent off an express for Valencia with the information to his father
and Seraphina, who soon arrived in Madrid. Their first object was to find
me out, and ply me thick and threefold with acknowledgments. What a
proud and affecting sight for me, to behold the three persons in the world
nearest my heart, vying with each other in their testimonies of affection and
gratitude! The pleasure my zeal seemed personally to give them was equal
to the dignity conferred on their house by the post of viceroy. They even
talked with me on a footing of equality, and scarcely remembered my
original distance or servitude in the fervor of their present feelings. But not
to dwell on unnecessary topics, Don Alphonso, having taken the oaths and
returned thanks, left Madrid with his family, to take up his abode at
Saragossa. He made his public entry with appropriate magnificence; and the
Arragonese caused it to appear, by their cordial reception, that I had a very
pretty knack at picking out a viceroy.

CHAPTER XIII.

GIL BLAS MEETS DON GASTON DE COGOLLOS AND DON ANDREW
DE TORDESILLAS AT THE DRAWING-ROOM, AND ADJOURNS
WITH THEM TO A MORE CONVENIENT PLACE. THE STORY OF
DON GASTON AND DONNA HELENA DE GALISTEO CONCLUDED.
SANTILLANE RENDERS SOME SERVICE TO TORDESILLAS.

I was up to the hilts in joy at having so marvellously metamorphosed an
ex-governor into a viceroy; the lords of Leyva themselves were not primed
and loaded so near to bursting. But very soon I had another opportunity of
employing my credit in the beaten track of friendship; and there is the more
occasion to quote these instances, that my readers may clearly discern with
how different a man they are in company, from that graceless Gil Blas, who,
under the former ministry, carried on a shameless traffic in the honors and
emoluments of the state.

One day I was waiting in the king's antechamber, in conversation with
some noblemen, who, knowing me to stand well with the prime minister,
were not ashamed of taking me by the hand. In the crowd was Don Gaston
de Cogollos, whom I had left a prisoner in the tower of Segovia. He was
with Don Andrew de Tordesillas, the warden. I readily quitted my company
to go and renew my acquaintance with my two friends. If they were
astonished at the sight of me, I was no less so to find them here. After
mutual greetings, Don Gaston said, Signor de Santillane, we have many
inquiries to make of each other, and this place affords little opportunity for
private intercourse; allow me to request your company where we may open
our hearts freely. I made no objection; we pushed our way through the
crowd, and left the palace. Don Gaston's carriage was ready waiting in the
street: we all three got into it, and drove to the great market-place, where
the bull-fights are exhibited. There Cogollos lived in a very handsome
house.

Signor Gil Blas, said Don Andrew on our entrance, at your departure
from Segovia you seemed to have conceived a thorough hatred against the
court, and to have formed a settled purpose of abandoning it forever. Such
was, in fact, my design, answered I; nor were my sentiments at all changed
during the lifetime of the late king; but when the prince his son came to the
throne, I had a mind to see whether the new monarch would know me
again. He did so, and received me favorably, with a strong recommendation
to the prime minister, who admitted me to his friendship, and took me more
into his confidence than ever did the Duke of Lerma. This, Signor Don
Andrew, is my story. And now tell me whether you still hold your office in
the tower of Segovia. No, indeed, answered he; my lord duke has removed
me, and put another in my room. He probably considered me as entirely
devoted to his predecessor. And I, said Don Gaston, was set at liberty for
the contrary reason; the prime minister was no sooner informed that my
imprisonment was by the Duke of Lerma's order, than he ordered me to be
released. The present business, Signor Gil Blas, is to relate the subsequent
particulars of my adventures.

The first thing I did, continued he, after thanking Don Andrew for his
kind attentions during my confinement, was to repair to Madrid. I presented
myself before the Count Duke of Olivarez, who said, You need not be
apprehensive of any blemish on your character in consequence of your late
misfortune; you are honorably acquitted: nay, your innocence is so much
the more satisfactorily established, as the Marquis of Villareal, with whom
you were supposed to be implicated, was not guilty. Though a Portuguese,
and related to the Duke of Braganza, he is less in his interests than in those
of the king my master. That connection, therefore, ought not to have been
imputed to you as a crime; but, to repair your wrongs, the king has given
you a lieutenant's commission in the Spanish guards. This I accepted,
begging it as a favor of his excellency to allow me, before I joined my
regiment, to go and see my aunt, Donna Eleonora de Laxarilla, at Coria.
The minister gave me leave of absence for a month, and I departed with
only one servant.

We had got beyond Colmenar, and were threading a narrow pass
between two mountains, when we came within sight of a gentleman
defending himself bravely against three men, who all fell upon him
together. I did not hesitate about going to his aid, but hastened forward and
planted myself by his side. I remarked, while we were fighting, that our
enemies were masked, and that we had to do with expert swordsmen. But
we triumphed over the united advantages of their skill and disparity. I ran
one of the three through the body; he fell from his horse, and the two others
immediately betook themselves to flight. The victory indeed was scarcely
less fatal to us than to the wretch whom I had killed, for we were both
dangerously wounded. But conceive my surprise, when I discovered the
gentleman to be Combados, the husband of Donna Helena. He was no less
astonished at recognizing me as his defender. Ah, Don Gaston! exclaimed
he, was it you, then, who came to my assistance? When you took my part so
generously, you little thought it was the person who had snatched your
mistress from you. I really did not know it, answered I; but though I had, do
you think I could have wavered about doing as I have done? Can you
entertain so ill an opinion of me as to believe my soul so sordid? No, no,
replied he; I think better of you; and should I die of my wounds, it will be
my prayer that yours may not disable you from profiting by my death.
Combados, said I, though I have not yet forgotten Donna Helena, know that
I do not pant after the possession of her charms at the expense of your life;
so far from it, that I congratulate myself on having contributed to your
rescue from assassination, since by so doing I have performed an acceptable
service to your wife.

While we were communing together, my servant dismounted, and
drawing near to the gentleman stretched at his length, took off his mask,
when Combados, with sensations of gratitude for his deliverance, distinctly
traced the features. It is Caprara, exclaimed he; that treacherous cousin,
who, in mere disgust at having missed a rich inheritance which he had
unjustly disputed with me, has long since cherished a murderous design
against my life, and fixed on this day to put it in execution; but heaven has
turned him over to its determined vengeance, and made him the victim of
his own attempt.

While this conversation was going on, our blood was flowing at the
same rate, and we were becoming more exhausted every minute.
Nevertheless, disabled as we were, we had strength enough to reach the
town of Villarejo, which lies within a gunshot or two from the field of
battle. At the very first house of call we sent for surgeons. The most expert
came at our summons. He examined our wounds, and reported them as
dangerous. After taking off the bandages and dressing them a second time,
he pronounced those of Don Blas to be mortal. Of mine he thought more
favorably, and the event corresponded with his prognostic.

Combados, finding himself consigned to the grave, thought only of due
preparation for a most serious event. He sent an express to his wife, with an
account of what had happened, particularizing his present sad condition.
Donna Helena soon arrived at Villarejo. Her mind was drawn different ways
by two opposite occasions of distress—the hazard of her husband's life, and
the fear of feeling the revival of a half extinguished flame at the sight of
me. This sight occasioned her to experience a terrible agitation. Madam,
said Don Blas when she appeared in his presence, you are come just in time
to receive my farewell. I am at the point of death, and I consider my fate as
a punishment from heaven for having taken you from Don Gaston by a
feint: far from murmuring at it, I exhort you with my last breath to restore
to him a heart which I had stolen from him. Donna Helena answered him
only by her tears; and indeed it was the best answer she could make; for she
had neither forgotten her first love, nor the artifices whereby she had been
influenced to renounce her plighted faith.

It happened, as the surgeon had anticipated, that in less than three days
Combados died of his wounds, while mine, on the contrary, wore the
appearance of convalescence. The young widow, whom no earthly
considerations could detach from the care of transporting her late husband's
remains to Coria, that they might be deposited with due honors in the
family vault, left Villarejo on her return, after inquiring, merely as a matter
of course, how I was going on. As soon as I was well enough to be
removed, I bent my course to Coria, where my recovery was soon
ascertained. My aunt, Donna Eleonora, and Don George de Galisteo, were
determined that my marriage with Helena should take place forthwith, lest
some new caprice of fortune should part us once more. The ceremony was
privately performed, on account of the late melancholy event, and within a
few days I returned to Madrid with Donna Helena. As my leave of absence
had expired, I was afraid lest the minister should have superseded me in my
lieutenancy; but he had not filled up the vacancy, and received my
apologies very graciously.

Thus am I, continued Cogollos, lieutenant of the Spanish guards, and my
situation is exactly to my mind. The circle of my friends is respectable and
pleasant, and I live at my ease among them. Would I could say as much!
exclaimed Don Andrew; but I am very far from being satisfied with my lot:
I have lost my appointment, which was not without its advantages, and have
no friends of sufficient interest to procure me a better berth. Excuse me,
Signor Don Andrew, cried I, with a sort of upbraiding smile, you have a
friend in me who may chance to be better than no friend at all. I have told
you already that I am a greater favorite with my lord duke than with the
Duke of Lerma; and will you tell me to my face that you have no interest at
court? Have you not already experienced the contrary? Recollect that,
through the Archbishop of Grenada's powerful recommendation, I procured
you a nomination for Mexico, where you would have made your fortune, if
love had not stepped in and marred it at Alicant. My means are now more
extensive, since I have the ear of the prime minister. I give myself up to you
then, replied Tordesillas; but do not send me into New Spain, though the
first appointment in the colonies were at your disposal.

Here we were interrupted by Donna Helena, who came into the room,
and improved even upon the visions of my fancy by the reality of her
charms. Cogollos introduced me as the companion who had solaced the
tedious hours of his imprisonment. Yes, madam, said I to Donna Helena, my
conversation did indeed soothe his sorrows, for it turned on you. The
compliment was not thrown away, and I took my leave with repeated
congratulations. With respect to Tordesillas, I assured him that within a
week he should know how far my power, as well as will, extended.

Nor were these mere words. On the very next day, the opportunity
occurred. Santillane, said his excellency, the place of governor in the royal
prison of Valladolid is vacant: it is worth more than three hundred pistoles a
year, and is yours if you will accept of it. Not if it were worth ten thousand
ducats, answered I, for it would carry me away from your lordship. But,
replied the minister, you may fill it by deputy, and only visit occasionally.
That is as it may be, rejoined I; but I shall only accept it on condition of
resigning in favor of Don Andrew de Tordesillas, a brave and loyal
gentleman; I should like to give him this place in acknowledgment of his
kindness to me in the tower of Segovia.

[Illustration: Gil Blas accepting appointment]

This plea made the minister laugh heartily, and say, As far as I see, Gil
Blas, you mean to make yourself a general patron. Even so be it, my friend;
the vacancy is yours for Tordesillas; but tell me unfeignedly what fellow-
feeling you have in the business, for you are not such a fool as to throw
away your interest for nothing. My lord, answered I, Don Andrew charged
me nothing for all his acts of friendship; and should not a man repay his
obligations? You are become highly moral and self-mortified, replied his
excellency; rather more so than under the last administration. Precisely so,
rejoined I; then evil communication corrupted my principles; bargain and
sale were the order of the day, and I conformed to the established practice:
now, all preferment is allotted on the footing of a meritorious free gift, and
my integrity shall not be the last to fall in with the fashion.

CHAPTER XIV.

SANTILLANE'S VISIT TO POET NUNEZ; THE COMPANY AND
CONVERSATION.

One day, after dinner, a fancy seized me to go and see the poet of the
Asturias, feeling a sort of curiosity to know on what floor he lodged. I
repaired to the house of Signor Don Bertrand Gomez del Ribero, and asked
for Nunez. He does not live here now, said the porter, but over the way, in
apartments at the back of the house. I went thither, and, crossing a small
court, entered an unfurnished parlor, where my friend Fabricio was sitting at
table, doing the honors to five or six guests from the hamlet and liberty of
Parnassus.

They were at the latter end of a feast, and of course at the beginning of
an affray; but as soon as they perceived me, a dead silence succeeded to
their obstreperous argumentation. Nunez rose from his seat with much
pomp and circumstance of politeness to receive me, saying, Gentlemen,
Signor de Santillane! He does me the honor to visit me under this humble
roof; as the favorite of the prime minister, you will all join with me in
tendering your humble services. At this introduction, the worshipful
company got up and made their best bows; for my rank could not fail of
procuring me respect from the manufacturers of dedications. Though I was
neither hungry nor thirsty, it was impossible not to sit down and drink a
toast in such society.

My presence appearing to be a restraint, Gentlemen, said I, it should
seem that I have interrupted your conversation: resume it, or you drive me
away. My learned friends, said Fabricio, were discussing the "Iphigenia" of
Euripides. The bachelor, Melchior de Villégas, a clever man of the first rank
in the republic of letters, resumed the topic by asking Don Jacinto de
Romerate which was the point of interest in that tragedy. Don Jacinto
ascribed it to the imminent danger of Iphigenia. The bachelor contended,
offering to prove his proposition by all the evidence admissible at the bar of
logic or criticism, that the danger of a trumpery girl had nothing to do with
the real sympathy of that affecting piece. What has to do with it then?
bawled the old licentiate Gabriel of Leon, indignantly. It turns with the
wind, replied the bachelor.

The whole company burst into a shout of laughter at this assertion,
which they were far from considering as serious; and I myself thought that
Melchior had only launched it by way of adding the zest of wit to the
severity of critical discussion. But I was out in my calculation respecting
the character of that eminent scholar: he had not a grain of sprightliness or
pleasantry in his whole composition. Laugh as you please, gentlemen,
replied he, very coolly; I maintain that there is no circumstance but the
wind, unless it be the weathercock, to interest, to strike, to rouse the
passions of the spectator. Figure to yourselves a multitudinous army
assembled for the purpose of laying siege to Troy; take into the account the
eager haste of the officers and common men to carry their enterprise into
execution, that they may return with their best legs foremost into Greece,
where they have left everything most dear to them—their household gods,
their wives and their children: all this while a mischievous wind from the
wrong quarter keeps them port-bound at Aulis, and, as it were, drives a nail
into the very head of the expedition; so that, till better weather, it was
impossible to go and lay siege to Priam's town. Wind and weather,
therefore, make up the interest of this tragedy. My good wishes are with the
Greeks; my whole faculties are wrapped up in the success of their design;
the sailing of their fleet is with me the only hinge of the fable, and I look at
the danger of Iphigenia with somewhat of a self-interested complacency,
because by her death the winding up of the story into a brisk and favorable
gale was likely to be accelerated.

As soon as Villégas had finished his criticism, the laugh burst out more
than ever at his expense. Nunez was sly enough to side with him, that a
fairer scope and broader mark might be presented to the shafts of malicious
wit which were let fly from all the quarters in the shipman's card at this
poster of the sea and land. But the bachelor, eying them all with sublime
indifference and supreme contempt, gave them to understand how low in
the list of the ignorant and vulgar they ranked in his estimation. Every
moment did I expect to see these vaporing spirits kindle into a blaze, and
wage war against the hairy honors of each other's brainless skulls; but the
joke was not carried to that length: they confined their hostilities to
opprobrious epithets, and took their leave when they had eaten and drunk as
much as they could get.

After their departure, I asked Fabricio why he had separated himself
from his treasurer, and whether they had quarrelled. Quarrelled! answered
he: Heaven defend me from such a misfortune! I am on better terms than
ever with Signor Don Bertrand, who gave his consent to my living apart
from him: here, therefore, I receive my friends, and take my pleasure with
them unmolested. You know very well that I am not of a temper to lay up
treasures for those who are to come after me; and as it happens luckily, I am
now in circumstances to give my little classical entertainments every day. I
am delighted at it, my dear Nunez, replied I, and once more wish you joy on
the success of your last tragedy: the great Lope, by his eight hundred
dramatic pieces, never made a quarter of the money which you have got by
the damnation of your "Count de Saldagna."

BOOK THE TWELFTH.

CHAPTER I.

GIL BLAS SENT TO TOLEDO BY THE MINISTER. THE PURPOSE OF
HIS JOURNEY AND ITS SUCCESS.

For nearly a month his excellency had been saying to me every day,
Santillane, the time is approaching when I shall call your choicest powers of
address into action; but the time that was coming never came. It is a long
lane, however, where there is no turning; and his excellency at length spoke
to me nearly as follows: They say that there is, in the company of
comedians at Toledo, a young actress of much note for her personal and
professional fascinations; it is affirmed that she dances and sings like all the
Muses and Graces put together, and that the whole theatre rings with
applause at her performance: to these perfections is added matchless and
irresistible beauty. Such a star should only shine within the circle of a court.
The king has a taste for the stage, for music, and for dancing; nor must he
be debarred from the pleasure of seeing and hearing such a prodigy. I have
determined on sending you to Toledo, that you may judge for yourself
whether she really is so extraordinary an actress: on your feeling of her
merit my measures shall be taken; for I have unlimited confidence in your
discernment.

I undertook to bring his lordship a good account of this business, and
made my arrangements for setting out with one servant, but not in the
minister's livery, by way of conducting matters more warily; and that
precaution relished well with his excellency. On my arrival at Toledo, I had
scarcely alighted at the inn, when the landlord, taking me for some country
gentleman, said, Please your honor, you are probably come to be present at
