How To Lie With Bad Data
R. D. De Veaux and D. J. Hand
Abstract. As Huff’s landmark book made clear, lying with statistics can be
accomplished in many ways. Distorting graphics, manipulating data or using
biased samples are just a few of the tried and true methods. Failing to use the
correct statistical procedure or failing to check the conditions for when the
selected method is appropriate can distort results as well, whether the motives
of the analyst are honorable or not. Even when the statistical procedure and
motives are correct, bad data can produce results that have no validity at all.
This article provides some examples of how bad data can arise, what kinds
of bad data exist, how to detect and measure bad data, and how to improve
the quality of data that have already been collected.
Key words and phrases: Data quality, data profiling, data rectification, data consistency, accuracy, distortion, missing values, record linkage, data warehousing, data mining.
Kruskal (1981) devoted much of his time to “inconsistent or clearly wrong data, especially in large data sets.” As just one example, he cited a 1960 census study that showed 62 women, aged 15 to 19, with 12 or more children. Coale and Stephan (1962) pointed out similar anomalies when they found a large number of 14-year-old widows. In a classic study by Wolins (1962), a researcher attempted to obtain raw data from 37 authors of articles appearing in American Psychological Association journals. Of the seven data sets that were actually obtained, three contained gross data errors.

A 1986 study by the U.S. Census estimated that between 3 and 5% of all census enumerators engaged in some form of fabrication of questionnaire responses without actually visiting the residence. This practice was widespread enough to warrant its own term: curbstoning, which is the “enumerator jargon for sitting on the curbstone filling out the forms with made-up information” (Wainer, 2004). While curbstoning does not imply bad data per se, at the very least, such practices imply that the data set you are analyzing does not describe the underlying mechanism you think you are describing.

What exactly are bad data? The quality of data is relative both to the context and to the question one is trying to answer. If data are wrong, then they are obviously bad, but context can make the distinction more subtle. In a regression analysis, errors in the predictor variables may bias the estimates of the regression coefficients, and this will matter if the aim hinges on interpreting these values, but it will not matter if the aim is predicting response values for new cases drawn from the same distribution. Likewise, whether data are “good” also depends on the aims: precise, accurate measurements are useless if one is measuring the wrong thing. Increasingly in the modern world, especially in data mining, we are confronted with secondary data analysis: the analysis of data that have been collected for some other purpose (e.g., analyzing billing data for transaction patterns). The data may have been perfect for the original aim, but could have serious deficiencies for the new analysis.

For this paper, we will take a rather narrow view of data quality. In particular, we are concerned with data accuracy, so that, for us, “poor quality data are defined as erroneous values assigned to attributes of some entity,” as in Pierce (1997). A broader perspective might also take account of relevance, timeliness, existence, coherence, completeness, accessibility, security and other data attributes. For many problems, for example, data gradually become less and less relevant—a phenomenon sometimes termed data decay or population drift (Hand, 2004a). Thus the characteristics collected on mortgage applicants 25 years ago would probably not be of much use for developing a predictive risk model for new applicants, no matter how accurately they were measured at the time. In some environments, the time scale that renders a model useless can become frighteningly short. A model of customer behavior on a web site may quickly become out of date. Sometimes different aspects of this broader interpretation of data quality work in opposition. Timeliness and accuracy provide an obvious example (and, indeed, one which is often seen when economic time series are revised as more accurate information becomes available).

From the perspective of the statistical analyst, there are three phases in data evolution: collection, preliminary analysis and modeling. Of course, the easiest way to deal with bad data is to prevent poor data from being collected in the first place. Much of sample survey methodology and experimental design is devoted to this subject, and many famous stories of analysis gone wrong are based on faulty survey designs or experiments. The Literary Digest poll proclaiming Landon’s win over Roosevelt in 1936 that starred in Chapter 1 of Huff (1954) is just one of the more famous examples. At the other end of the process, we have resistant and robust statistical procedures explicitly designed to perform adequately even when a percentage of the data do not conform or are inaccurate, or when the assumptions of the underlying model are violated.

In this article we will concentrate on the “middle” phase of bad data evolution—that is, on its discovery and correction. Of course, no analysis proceeds linearly through the process of initial collection to final report. The discoveries in one phase can impact the entire analysis. Our purpose will be to discuss how to recognize and discover these bad data using a variety of examples, and to discuss their impact on subsequent statistical analysis. In the next section we discuss the causes of bad data. Section 3 discusses the ways in which data can be bad. In Section 4 we turn to the problem of detecting bad data and in Section 5 we provide some guidelines for improving data quality. We summarize and present our conclusions in Section 6.

2. WHAT ARE THE CAUSES OF BAD DATA?
There is an infinite variety to the ways in which data can go bad, and the specifics depend on the underlying process that generates the data. Data may be distorted from the outset during the initial collection phase, or they may be distorted when the data are transcribed, transferred, merged or copied. Finally, they may deteriorate, change definition or otherwise go through transformations that render them less representative of the original underlying process they were designed to measure.

The breakdown in the collection phase can occur whether the data are collected by instrument or directly recorded by human beings. Examples of breakdowns at the instrument level include instrument drift, initial miscalibration, or a large random or otherwise unpredictable variation in measurement. As an example of instrument-level data collection, consider the measurement of the concentration of a particular chemical compound by gas chromatography, as used in routine drug testing. When reading the results of such a test, it is easy to think that a machine measures the amount of the compound in an automatic and straightforward way, and thus that the resulting data are measuring some quantity directly. It turns out to be a bit more complicated. At the outset, a sample of the material of interest is injected into a stream of carrier gas, where it travels down a silica column heated by an oven. The column then separates the mixture of compounds according to their relative attraction to a material called the adsorbent. This stream of different compounds travels “far enough” (via choices of column length and gas flow rates) so that by the time they pass by the detector, they are well separated (at least in theory). At this point, both the arrival time and the concentration of the compound are recorded by an electromechanical device (depending on the type of detector used). The drifts inherent in the oven temperature, gas flow, detector sensitivity and a myriad of other environmental conditions can affect the recorded numbers. To determine actual amounts of material present, a known quantity must be tested at about the same time and the machine must be calibrated. Thus the number reported as a simple percentage of compound present has not only been subjected to many potential sources of error in its raw form, but is actually the output of a calibration model.
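To make the last point concrete, here is a minimal sketch of the kind of calibration step involved; the standard concentrations, peak areas and units are hypothetical, not taken from any real assay.

```python
import numpy as np

# Hypothetical calibration standards: known concentrations (e.g., ng/mL)
# and the peak areas the detector reported for them.
known_conc = np.array([10.0, 25.0, 50.0, 100.0])
peak_area = np.array([118.0, 290.0, 601.0, 1185.0])

# Fit a straight-line calibration curve: area is roughly slope * concentration + intercept.
slope, intercept = np.polyfit(known_conc, peak_area, deg=1)

# Invert the curve to turn a new sample's peak area into a reported concentration.
# Any drift in the standards or the detector flows directly into this number.
sample_area = 455.0
estimated_conc = (sample_area - intercept) / slope
print(f"estimated concentration: {estimated_conc:.1f}")
```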
Examples of data distortion at the human level include misreading of a scale, incorrect copying of values from an instrument, transposition of digits and misplaced decimal points. Of course, such mistakes are not always easy to detect. Even if every data value is checked for plausibility, it often takes expert knowledge to know if a data value is reasonable or absurd. Consider the report in The Times of London that some surviving examples of the greater mouse-eared bat, previously thought to be extinct, had been discovered hibernating in West Sussex. It went on to assert that “they can weigh up to 30 kg” (see Hand, 2004b, Chapter 4). A considerable amount of entertaining correspondence resulted from the fact that they had misstated the weight by three decimal places.

Sometimes data are distorted from the source itself, either knowingly or not. Examples occur in survey work and tax returns, just to name two. It is well known to researchers of sexual behavior that men tend to report more lifetime sexual partners than women, a situation that is highly unlikely sociologically (National Statistics website: www.statistics.gov.uk). Some data are deliberately distorted to prevent disclosure of confidential information collected by governments in, for example, censuses (e.g., Willenborg and de Waal, 2001) and health care data.

Even if the data are initially recorded accurately, data can be compromised by data integration, data warehousing and record linkage. Often a wide range of sources of different types are involved (e.g., in the pharmaceutical sector, data from clinical trials, animal trials, manufacturers, marketing, insurance claims and postmarketing surveillance might be merged). At a more mundane level, records that describe different individuals might be inappropriately merged because they are described by the same key. When going through his medical records for insurance purposes, one of the authors discovered that he was recorded as having had his tonsils removed as a child. A subsequent search revealed the fact that the records of someone else with the same name (but a different address) had been mixed in with his. More generally, what is good quality for (the limited demands made of) an operational data base may not be good quality for (potentially unlimited demands made of) a data warehouse.
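The same-key problem is easy to reproduce; in the toy merge below (names, addresses and procedures are invented), two different people who happen to share a key end up with each other's history attached.

```python
import pandas as pd

# Two registries keyed only by name; the people and fields are illustrative.
patients = pd.DataFrame({"name": ["J. Smith", "A. Jones"],
                         "address": ["12 Elm St", "3 Oak Ave"]})
procedures = pd.DataFrame({"name": ["J. Smith", "J. Smith"],
                           "procedure": ["tonsillectomy", "appendectomy"],
                           "address": ["98 Birch Rd", "12 Elm St"]})

# Merging on the shared name alone attaches another J. Smith's procedure
# to our J. Smith; the conflicting addresses are the only clue.
merged = patients.merge(procedures, on="name", how="left", suffixes=("", "_proc"))
print(merged[["name", "address", "address_proc", "procedure"]])
```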
In a data warehouse, the definitions, sources and other information for the variables are contained in a dictionary, often referred to as metadata. In a large corporation it is often the IT (information technology) group that has responsibility for maintaining both the data warehouse and metadata. Merging sources and checking for consistent definitions form a large part of their duties.
A recent example in bioinformatics shows that data problems are not limited to business and economics. In a recent issue of The Lancet, Petricoin et al. (2002) reported an ability to distinguish between serum samples from healthy women, those with ovarian cancers and women with a benign ovarian disease. It was so exciting that it prompted the “U.S. Congress to pass a resolution urging continued funding to drive a new diagnostic test toward the clinic” (Check, 2004). The researchers trained an algorithm on 50 cancer spectra and 50 normals, and then predicted 116 new spectra. The results were impressive, with the algorithm correctly identifying all 50 of the cancers, 47 out of 50 normals, and classifying the 16 benign disease spectra as “other.” Statisticians Baggerly, Morris and Coombes (2004) attempted to reproduce the Petricoin et al. results, but were unable to do so. Finally, they concluded that the three types of spectra had been preprocessed differently, so that the algorithm correctly identified differences in the data, much of which had nothing to do with the underlying biology of cancer.

A more subtle source of data distortion is a change in the measurement or collection procedure. When the cause of the change is explicit and recognized, this can be adjusted for, at least to some extent. Common examples include a change in the structure of the Dow Jones Industrial Average or the recent U.K. change from the Retail Price Index to the European Union standard Harmonized Index of Consumer Prices. In other cases, one might not be aware of the change. Some of the changes can be subtle. In looking at historical records to assess long-term temperature changes, Jones and Wigley (1990) noted that “changing landscapes affect temperature readings in ways that may produce spurious temperature trends.” In particular, the location of the weather station assigned to a city may have changed. During the 19th century, most cities and towns were too small to impact temperature readings. As urbanization increased, urban heat islands directly affected temperature readings, creating bias in the regional trends. While global warming may be a contributor, the dominant factor is the placement of the weather station, which moved several times. As it became more and more surrounded by the city, the temperature increased, mainly because the environment itself had changed.

A problem related to changes in the collection procedure is not knowing the true source of the data. In scientific analysis, data are often preprocessed by technicians and scientists before being analyzed. The statistician may be unaware of (or uninterested in) the details of the processing. To create accurate models, however, it can be important to know the source, and therefore the accuracy, of the measurements. Consider a study of the effect of ocean bottom topography on sea ice formation in the southern oceans (De Veaux, Gordon, Comiso and Bacherer, 1993). After learning that wind can have a strong effect on sea ice formation, the statistician, wanting to incorporate this predictor into a model, asked one of the physicists whether any wind data existed. It was difficult to imagine very many Antarctic stations with anemometers, and so he was very surprised when the physicist replied, “Sure, there’s plenty of it.” Excitedly he asked what spatial resolution the physicist could provide. When the physicist countered with “what resolution do you want?” the statistician became suspicious. He probed further and asked whether they really had anemometers set up on a 5 km grid on the sea ice. He said, “Of course not. The wind data come from a global weather model—I can generate them at any resolution you want!” It turned out that all the other satellite data had gone through some sort of preprocessing before they were given to the statistician. Some were processed from actual direct measurements, some were processed through models and some, like the wind, were produced solely from models. Of course, this (as with curbstoning) does not necessarily imply that the resulting data are bad, but it should at least serve to warn the analyst that the data may not be what they were thought to be.

Each of these different mechanisms for data distortion has its own set of detection and correction challenges. Ensuring good data collection through survey and/or experimental design is certainly an important first step. A bad design that results in data that are not representative of the phenomenon being studied can render even the best analysis worthless. At the next step, detecting errors can be attempted in a variety of ways, a topic to which we will return in Section 4.

3. IN HOW MANY WAYS?

Data can be bad in an infinite variety of ways, and some authors have attempted to construct taxonomies of data distortion (e.g., Kim et al., 2003). An important simple categorization is into missing data and distorted values.

3.1 Missing Data

Data can be missing at two levels: entire records might be absent, or one or more individual fields may be missing. If entire records are missing, any analysis may well be describing or making inferences about a population different from that intended. The possibility that entire records may be missing is particularly problematic, since there will often be no way of knowing this. Individual fields can be missing for a huge variety of reasons, and the mechanism by which they are missing is likely to influence their distribution over the data, but at least when individual fields are missing one can see that this is the case.
If the missingness of a particular value is unrelated either to the response or predictor variables (missing completely at random—Little and Rubin, 1987, give technical definitions), then case deletion can be employed. However, even ignoring the potential bias problems, complete case deletion can severely reduce the effective sample size. In many data mining situations with a large number of variables, even though each field has only a relatively small proportion of missing values, all of the records may have some values missing, so that the case deletion strategy leaves one with no data at all.

Complications arise when the pattern of missing data does depend on the values that would have been recorded. If, for example, there are no records for patients who experience severe pain, inferences to the entire pain distribution will be impossible (at least, without making some pretty strong distributional assumptions). Likewise, poor choice of a missing value code (e.g., 0 or 99 for age) or accidental inclusion of a missing value code in the analysis (e.g., 99,999 for age) has been known to lead to mistaken conclusions.
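As a concrete illustration of the sentinel-code problem, the sketch below screens an age field for suspicious codes before they can contaminate an analysis; the column name, the codes and the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy records; the column name and the sentinel codes are illustrative.
df = pd.DataFrame({"age": [34, 99, 0, 57, 99999, 41]})

# Values that are legal numbers but are really "unknown" codes.
sentinels = [0, 99, 99999]

# Count how often each suspected code occurs before touching anything.
counts = df["age"].value_counts()
print(counts[counts.index.isin(sentinels)])

# Recode the sentinels as proper missing values so they cannot
# silently inflate or deflate a mean, a regression or a histogram.
df["age"] = df["age"].replace(sentinels, np.nan)
```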
Sometimes missingness arises because of the nature of the problem, and presents real theoretical and practical issues. For example, in personal banking, banks accept those loan applicants whom they expect to repay the loans. For such people, the bank eventually discovers the true outcome (repay, do not repay), but for those rejected for a loan, the true outcome is unknown: it is a missing value. This poses difficulties when the bank wants to construct new predictive models (Hand and Henley, 1993; Hand, 2001). If a loan application asks for household income, replacing a missing value by a mean or even by a model-based imputation may lead to a highly optimistic assessment of risk.

When the missingness in a predictor is related directly to the response, it may be useful for exploratory and prediction purposes to create indicator variables for each predictor, where the indicator is a binary variable recording whether the predictor is missing or not. For categorical predictor variables, missing values can be treated simply as a new category. In a study of dropout rates from a clinical trial for a depression drug, it was found that the single most important indicator of ultimately dropping out from the study was not the depression score on the second week’s test, as indicated from complete case analysis, but simply the indicator of whether the patient showed up to take it (De Veaux, Donahue and Small, 2002).
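A minimal pandas sketch of these two devices, with invented column names and values, might look like this.

```python
import numpy as np
import pandas as pd

# Illustrative data: one numeric and one categorical predictor with holes.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 43000],
    "region": ["N", None, "S", "E", None],
})

# Binary flag recording whether the numeric predictor was missing;
# the flag itself can turn out to be a useful predictor, as in the dropout example.
df["income_missing"] = df["income"].isna().astype(int)

# For a categorical predictor, missingness can simply become a new level.
df["region"] = df["region"].fillna("missing")
print(df)
```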
3.2 Distorted Data

Although there are an unlimited number of possible causes of distortion, a first split can be made into those attributable to instrumentation and those attributable to human agency. Floor and ceiling effects are examples of the first kind (instruments here can be mechanical or electronic, but also questionnaires), although in this case it is sometimes possible to foresee that such things might occur and take account of this in the statistical modeling. Human distortions can arise from misreading instruments or misrecording values at any level. Brunskill (1990) gave an illustration from public records of birth weights, where ounces are commonly confused with pounds, the number 1 is confused with 11 and errors in decimal placement produce order-of-magnitude errors. In such cases, using ancillary information such as gestation times or newborn heights can help to spot gross errors. Some data collection procedures, in an attempt to avoid missing data, actually introduce distortions. A data set we analyzed had a striking number of doctors born on November 11, 1911. It turned out that most doctors (or their secretaries) wanted to avoid typing in age information, but because the program insisted on a value and the choice of 00/00/00 was invalid, the easiest way to bypass the system was simply to type 11/11/11. Such errors might not seem of much consequence, but they can be crucial. Confusion between English and metric units was responsible for the loss of the $125 million Martian Climate Orbiter space probe (The New York Times, October 1, 1999). Jet Propulsion Laboratory engineers mistook acceleration readings measured in English units of pound-seconds for the metric measure of force in newton-seconds. In 1985, in a precedent-setting case, the Supreme Court ruled that Dun & Bradstreet had to pay $350,000 in libel damages to a small Vermont construction company. A part-time student worker had apparently entered the wrong data into the Dun & Bradstreet data base. As a result, Dun & Bradstreet issued a credit report that mistakenly identified the construction company as bankrupt (Percy, 1986).

4. HOW TO DETECT DATA ERRORS

While it may be obvious that a value is missing from a record, it is often less obvious that a value is in error. The presence of errors can (sometimes) be proven, but the absence of errors cannot. There is no guarantee that a data set that looks perfect will not contain mistakes. Some of these mistakes may be intrinsically undetectable: they might be values that are well within the range of the data and could easily have occurred. Moreover, since errors can occur in an unlimited number of ways, there is no end to the list of possible tests for detecting errors. On the other hand, strategic choice of tests can help to pinpoint the root causes that lead to errors and, hence, to the identification of changes in the data collection process that will lead to the greatest improvement in data quality.

When the data collection can be repeated, the results of the duplicate measurements, recordings or transcriptions (e.g., the double entry system used in clinical trials) can be compared by automatic methods. In this “duplicate performance method,” a machine checks for any differences in the two data records. All discrepancies are noted, and the only remaining errors are those where both collectors made the same mistake. Strayhorn (1990) and West and Winkler (1991) provided statistical methods for estimating that proportion. In another quality control method, known errors are added to a data set whose integrity is then assessed by an external observer. With this “known errors” method, the number of errors that remain can be estimated statistically from the success of the observer in discovering the known errors (Strayhorn, 1990; West and Winkler, 1991). Taking this further, one can build models (similar to those developed for software reliability) that estimate how many errors are likely to remain in a data set based on extrapolation from the rate of discovery of errors. At some point one decides that the impact of remaining errors on the conclusions is likely to be sufficiently small that one can ignore them.
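A back-of-the-envelope version of the seeded “known errors” idea (a deliberate simplification for illustration, not the exact Strayhorn or West and Winkler formulation) assumes that the checker detects genuine and planted errors at the same rate, and scales up accordingly.

```python
def estimate_remaining_errors(seeded: int, seeded_found: int, real_found: int) -> float:
    """Rough count of undetected genuine errors, assuming the checker
    catches genuine and seeded errors at the same rate."""
    detection_rate = seeded_found / seeded              # fraction of planted errors caught
    estimated_total_real = real_found / detection_rate  # scale up the genuine errors caught
    return estimated_total_real - real_found            # genuine errors presumably still in the data

# Illustrative numbers: 40 errors were planted, the checker caught 30 of them
# and also found 75 genuine errors, suggesting roughly 25 genuine errors remain.
print(estimate_remaining_errors(seeded=40, seeded_found=30, real_found=75))
```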
Automatic methods of data collection use metadata information to check for consistency across multiple records or variables, integrity (e.g., correct data type), plausibility (within the possible range of the data) and coherence between related variables (e.g., number of sons plus number of daughters equals number of children). Sometimes redundant data can be collected with such checks in mind. However, one cannot rely on software to protect one from mistakes. Even when such automatic methods are in place, the analyst should spend some time looking for errors in the data prior to any modeling effort.
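Checks of this kind are easy to express directly; the following sketch applies a range check and a coherence check to a few invented household records.

```python
import pandas as pd

# Toy household records; the field names and limits are illustrative.
df = pd.DataFrame({
    "age": [34, 212, 57],
    "sons": [1, 0, 2],
    "daughters": [2, 1, 1],
    "children": [3, 1, 4],
})

# Plausibility: values must fall inside a believable range.
bad_age = ~df["age"].between(0, 120)

# Coherence: related fields must agree with one another.
bad_children = df["sons"] + df["daughters"] != df["children"]

# Flag the offending rows for inspection rather than silently "fixing" them.
print(df[bad_age | bad_children])
```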
Data profiling is the use of exploratory and data mining tools aimed at identifying errors, rather than at the substantive questions of interest. When the number of predictor variables is manageable, simple plots such as bar charts, histograms, scatterplots and time series plots can be invaluable. The human eye has evolved to detect anomalies, and this ability should be exploited by presenting the data in a form that makes anomalies easy to see. Such plots have become prevalent in statistical packages for examining missing data patterns. Hand, Blunt, Kelly and Adams (2000) gave the illustration of a plot showing a point for each missing value in a rectangular array of 1012 potential sufferers from osteoporosis measured on 45 variables. It is immediately clear which cases and which variables account for most of the problems.
Unfortunately, as we face larger and larger data sets, we are also faced with increasing difficulty in data profiling. The missing value plot described above works for a thousand cases, but would probably not be so effective for 10 million. Even in this case, however, a Pareto chart of percent missing for each variable may be useful for deciding where to spend data preparation effort. Knowing that a variable is 96% missing makes one think pretty hard about including it in a model. On the other hand, separate manual examination of each of 30,000 gene expression variables is not to be recommended.
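The Pareto chart of missingness is simple to produce; here is a small sketch with a made-up table standing in for a wide data base.

```python
import numpy as np
import pandas as pd

# Small stand-in for a wide table with many candidate predictors.
df = pd.DataFrame({
    "x1": [1, np.nan, 3, np.nan, np.nan],
    "x2": [5, 6, np.nan, 8, 9],
    "x3": [np.nan] * 5,
})

# Percent missing per column, sorted so the worst offenders come first.
pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing)

# pct_missing.plot.bar() gives the Pareto-style chart described above;
# a column that is, say, 96% missing deserves a hard look before it enters a model.
```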
When even simple summaries of all the variables in a data base are not feasible, some methods for reducing the number of potential predictors in the models might be warranted. We see an important role for data mining tools here. It may be wise to reverse the usual paradigm of “explore the data first, then model.” Instead, exploratory models of the data can be useful as a first step and can serve two purposes (De Veaux, 2002). First, models such as tree models and clustering can highlight groups of anomalous cases. Second, the models can be used to reduce the number of potential predictor variables and enable the analyst to examine the remaining predictors in more detail. The resulting process is a circular one, with more examination possible at each subsequent modeling phase. Simply checking whether 500 numerical predictor variables are categorical or quantitative without the aid of metadata is a daunting (and tedious) task. In one analysis, we were asked to develop a fraud detection model for a large credit card bank. In the data set was one potential predictor variable that ranged from around 2000 to 9000, roughly symmetric and unimodal, which was selected as a highly significant predictor for fraud in a stepwise logistic regression model. It turned out that this predictor was a categorical variable (SIC code) used to specify the industry from which the product purchases in the transaction came. Useless as a predictor in a logistic regression model, it had escaped detection as a categorical variable among the several hundred potential candidates. Once the preliminary model had whittled the candidate predictors down to a few dozen, it was easy to use standard data analysis techniques to detect which were appropriate for the final model.
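A cheap screen that would have caught the SIC code is to profile the distinct-value counts of every numeric column: a field that takes only a handful of distinct values across thousands of rows is probably a code rather than a measurement. The sketch below uses simulated transactions; the column names and values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated transactions: 'amount' is genuinely quantitative, while 'sic_code'
# is an industry code that merely looks numeric.
toy = pd.DataFrame({
    "amount": rng.gamma(2.0, 50.0, size=1000).round(2),
    "sic_code": rng.choice([5411, 5812, 5965, 7011, 7995], size=1000),
})

# Distinct-value counts per column: few distinct values among many rows
# is a strong hint that the field should be treated as categorical.
summary = toy.nunique().to_frame("n_distinct")
summary["fraction_distinct"] = summary["n_distinct"] / len(toy)
print(summary.sort_values("n_distinct"))
```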
5. IMPROVING DATA QUALITY

The best way to improve the quality of data is to improve things in the data collection phase. The ideal would be to prevent errors from arising in the first place. Prevention and detection have a reciprocal role to play here. Once one has detected data errors, one can investigate why they occurred and prevent them from happening in the future. Once it has been recognized (detected) that the question “How many miles do you commute to work each day?” permits more than one interpretation, mistakes can be prevented by rewording. Progress toward direct keyboard or other electronic data entry systems means that error detection tools can be applied in real time at data entry—when there is still an opportunity to correct the data. At the data base phase, metadata can be used to ensure that the data conform to expected forms, and relationships between variables can be used to cross-check entries. If the data can be collected more than once, the rate of discovery of errors can be used as the basis for a statistical model to reveal how many undetected errors are likely to remain in the data base.

Various other principles also come into play when considering how to improve data quality. For example, a Pareto principle often applies: most of the errors are attributable to just a few variables. This may happen simply because some variables are intrinsically less reliable (and less important) than others. Sometimes it is possible to improve the overall level of quality significantly by removing just a few of these low quality variables. This has a complementary corollary: a law of diminishing returns suggests that successive attempts to improve the quality of the data are likely to lead to less and less improvement. If one has a particular analytic aim in mind, then one might reasonably assert that data errors that do not affect the conclusions do not matter. Moreover, for those that do matter, perhaps the ease with which they can be corrected should have some bearing on the effort that goes into detecting them—although the overriding criterion should be the loss consequent on the error being made. This is allied with the point that the base rate of errors should be taken into account: if one expects to find many errors, then it is worth attempting to find them, since the rewards, in terms of an improved data base, are likely to be large. In a well-understood environment, it might even be possible to devise useful error detection and correction resource allocation strategies.

Sometimes an entirely different approach to improving data quality can be used. This is simply to hide the poor quality by coarsening or aggregating the data. In fact, a simple example of this implicitly occurs all the time: rather than reporting uncertain and error-prone final digits of measured variables, researchers round to the nearest digit.

6. CONCLUSIONS AND FURTHER DISCUSSION

This article has been about data quality from the perspective of an analyst called upon to extract some meaning from the data. We have already remarked that there are also other aspects to data quality, and these are of equal importance when action is to be taken or decisions made on the basis of the data. These include such aspects as timeliness (the most sophisticated analysis applied to out-of-date data will be of limited value), completeness and, of central importance, fitness for purpose. Data quality, in the abstract, is all very well, but what may be perfectly fine for one use may be woefully inadequate for another. Thus ISO 8402 defines quality as “the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs.”

It is also important to maintain a sense of proportion in assessing and deciding how to cope with data distortions. In one large quality control problem in polymer viscosity, each 1% improvement was worth about $1,000,000 a year, but viscosity itself could be measured only to a standard deviation of around 8%. Before bothering about the accuracy of the predictor variables, it was first necessary to find improved ways to measure the response. In an entirely different context, much work in the personal banking sector concentrates on improved models for predicting risk—where, again, a slight improvement translates into millions of dollars of increased profit. In general, however, these models are based on retrospective data—data drawn from distributions that are unlikely still to apply. We need to be sure that the inaccuracies induced by this population drift do not swamp the apparent improvements we have made.

Data quality is a key issue throughout science, commerce and industry, and entire disciplines have grown up to address particular aspects of the problem. In manufacturing and, to a lesser extent, the service industries, we have schools for quality control and total quality management (Six Sigma, Kaizen, etc.). In large part, these are concerned with reducing random variation. In official statistics, strict data collection protocols are typically used.
Of course, ensuring high quality data does not come without a cost. The bottom line is that one must weigh up the potential gains to be made from capturing and recording better quality data against the costs of ensuring that quality. No matter how much money one spends, and how much resource one consumes in attempting to detect and prevent bad data, the unfortunate fact is that bad data will always be with us.
REFERENCES

Baggerly, K. A., Morris, J. S. and Coombes, K. R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 20 777–785.

Brunskill, A. J. (1990). Some sources of error in the coding of birth weight. American J. Public Health 80 72–73.

Check, E. (2004). Proteomics and cancer: Running before we can walk? Nature 429 496–497.

Coale, A. J. and Stephan, F. F. (1962). The case of the Indians and the teen-age widows. J. Amer. Statist. Assoc. 57 338–347.

De Veaux, R. D. (2002). Data mining: A view from down in the pit. Stats (34) 3–9.

De Veaux, R. D., Donahue, R. and Small, R. D. (2002). Using data mining techniques to harvest information in clinical trials. Presentation at Joint Statistical Meetings, New York.

De Veaux, R. D., Gordon, A., Comiso, J. and Bacherer, N. E. (1993). Modeling of topographic effects on Antarctic sea-ice using multivariate adaptive regression splines. J. Geophysical Research—Oceans 98 20,307–20,320.

Hand, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225–240. Glenlake Publishing, Chicago.

Hand, D. J. (2004a). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209–232. Springer, Berlin.

Hand, D. J. (2004b). Measurement Theory and Practice: The World Through Quantification. Arnold, London.

Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit (with discussion). Statist. Sci. 15 111–131.

Hand, D. J. and Henley, W. E. (1993). Can reject inference ever work? IMA J. of Mathematics Applied in Business and Industry 5(4) 45–55.

Huff, D. (1954). How to Lie with Statistics. Norton, New York.

Jones, P. D. and Wigley, T. M. L. (1990). Global warming trends. Scientific American 263(2) 84–91.

Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. and Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery 7 81–99.

Klein, B. D. (1998). Data quality in the practice of consumer product management: Evidence from the field. Data Quality 4(1).

Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. J. Amer. Statist. Assoc. 76 505–515.

Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM 29 4–11.

Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.

Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco.

Madnick, S. E. and Wang, R. Y. (1992). Introduction to the TDQM research program. Working Paper 92-01, Total Data Quality Management Research Program.

Morey, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM 25 337–342.

Percy, T. (1986). My data, right or wrong. Datamation 32(11) 123–124.

Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C. and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 572–577.

Pierce, E. (1997). Modeling database error rates. Data Quality 3(1). Available at www.dataquality.com/dqsep97.htm.

PricewaterhouseCoopers (2004). The Tech Spotlight 22. Available at www.pwc.com/extweb/manissue.nsf/docid/2D6E2F57E06E022F85256B8F006F389A.

Redman, T. C. (1992). Data Quality: Management and Technology. Bantam, New York.

Strayhorn, J. M. (1990). Estimating the errors remaining in a data set: Techniques for quality control. Amer. Statist. 44 14–18.

Wainer, H. (2004). Curbstoning IQ and the 2000 presidential election. Chance 17(4) 43–46.

West, M. and Winkler, R. L. (1991). Data base error trapping and prediction. J. Amer. Statist. Assoc. 86 987–996.

Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York.

Wolins, L. (1962). Responsibility for raw data. American Psychologist 17 657–658.