How To Lie With Bad Data
R. D. De Veaux and D. J. Hand
Abstract. As Huff’s landmark book made clear, lying with statistics can be
accomplished in many ways. Distorting graphics, manipulating data or using
biased samples are just a few of the tried and true methods. Failing to use the
correct statistical procedure or failing to check the conditions for when the
selected method is appropriate can distort results as well, whether the motives
of the analyst are honorable or not. Even when the statistical procedure and
motives are correct, bad data can produce results that have no validity at all.
This article provides some examples of how bad data can arise, what kinds
of bad data exist, how to detect and measure bad data, and how to improve
the quality of data that have already been collected.
Key words and phrases: Data quality, data profiling, data rectification, data consistency, accuracy, distortion, missing values, record linkage, data warehousing, data mining.
Kruskal (1981) devoted much of his time to “inconsistent or clearly wrong data, especially in large data sets.” As just one example, he cited a 1960 census study that showed 62 women, aged 15 to 19, with 12 or more children. Coale and Stephan (1962) pointed out similar anomalies when they found a large number of 14-year-old widows. In a classic study by Wolins (1962), a researcher attempted to obtain raw data from 37 authors of articles appearing in American Psychological Association journals. Of the seven data sets that were actually obtained, three contained gross data errors.

A 1986 study by the U.S. Census estimated that between 3 and 5% of all census enumerators engaged in some form of fabrication of questionnaire responses without actually visiting the residence. This practice was widespread enough to warrant its own term: curbstoning, which is the “enumerator jargon for sitting on the curbstone filling out the forms with made-up information” (Wainer, 2004). While curbstoning does not imply bad data per se, at the very least, such practices imply that the data set you are analyzing does not describe the underlying mechanism you think you are describing.

What exactly are bad data? The quality of data is relative both to the context and to the question one is trying to answer. If data are wrong, then they are obviously bad, but context can make the distinction more subtle. In a regression analysis, errors in the predictor variables may bias the estimates of the regression coefficients, and this will matter if the aim hinges on interpreting these values, but it will not matter if the aim is predicting response values for new cases drawn from the same distribution. Likewise, whether data are “good” also depends on the aims: precise, accurate measurements are useless if one is measuring the wrong thing. Increasingly in the modern world, especially in data mining, we are confronted with secondary data analysis: the analysis of data that have been collected for some other purpose (e.g., analyzing billing data for transaction patterns). The data may have been perfect for the original aim, but could have serious deficiencies for the new analysis.

For this paper, we will take a rather narrow view of data quality. In particular, we are concerned with data accuracy, so that, for us, “poor quality data are defined as erroneous values assigned to attributes of some entity,” as in Pierce (1997). A broader perspective might also take account of relevance, timeliness, existence, coherence, completeness, accessibility, security and other data attributes. For many problems, for example, data gradually become less and less relevant—a phenomenon sometimes termed data decay or population drift (Hand, 2004a). Thus the characteristics collected on mortgage applicants 25 years ago would probably not be of much use for developing a predictive risk model for new applicants, no matter how accurately they were measured at the time. In some environments, the time scale that renders a model useless can become frighteningly short. A model of customer behavior on a web site may quickly become out of date. Sometimes different aspects of this broader interpretation of data quality work in opposition. Timeliness and accuracy provide an obvious example (and, indeed, one which is often seen when economic time series are revised as more accurate information becomes available).

From the perspective of the statistical analyst, there are three phases in data evolution: collection, preliminary analysis and modeling. Of course, the easiest way to deal with bad data is to prevent poor data from being collected in the first place. Much of sample survey methodology and experimental design is devoted to this subject, and many famous stories of analysis gone wrong are based on faulty survey designs or experiments. The Literary Digest poll proclaiming Landon’s win over Roosevelt in 1936 that starred in Chapter 1 of Huff (1954) is just one of the more famous examples. At the other end of the process, we have resistant and robust statistical procedures explicitly designed to perform adequately even when a percentage of the data do not conform or are inaccurate, or when the assumptions of the underlying model are violated.

In this article we will concentrate on the “middle” phase of bad data evolution—that is, on its discovery and correction. Of course, no analysis proceeds linearly through the process of initial collection to final report. The discoveries in one phase can impact the entire analysis. Our purpose will be to discuss how to recognize and discover these bad data using a variety of examples, and to discuss their impact on subsequent statistical analysis. In the next section we discuss the causes of bad data. Section 3 discusses the ways in which data can be bad. In Section 4 we turn to the problem of detecting bad data and in Section 5 we provide some guidelines for improving data quality. We summarize and present our conclusions in Section 6.

2. WHAT ARE THE CAUSES OF BAD DATA?
There is an infinite variety to the ways in which data can go bad, and the specifics depend on the underlying process that generates the data. Data may be distorted from the outset during the initial collection phase, or they may be distorted when the data are transcribed, transferred, merged or copied. Finally, they may deteriorate, change definition or otherwise go through transformations that render them less representative of the original underlying process they were designed to measure.

The breakdown in the collection phase can occur whether the data are collected by instrument or directly recorded by human beings. Examples of breakdowns at the instrument level include instrument drift, initial miscalibration, or a large random or otherwise unpredictable variation in measurement. As an example of instrument-level data collection, consider the measurement of the concentration of a particular chemical compound by gas chromatography, as used in routine drug testing. When reading the results of such a test, it is easy to think that a machine measures the amount of the compound in an automatic and straightforward way, and thus that the resulting data are measuring some quantity directly. It turns out to be a bit more complicated. At the outset, a sample of the material of interest is injected into a stream of carrier gas, where it travels down a silica column heated by an oven. The column then separates the mixture of compounds according to their relative attraction to a material called the adsorbent. This stream of different compounds travels “far enough” (via choices of column length and gas flow rates) so that by the time they pass by the detector, they are well separated (at least in theory). At this point, both the arrival time and the concentration of the compound are recorded by an electromechanical device (depending on the type of detector used). The drifts inherent in the oven temperature, gas flow, detector sensitivity and a myriad of other environmental conditions can affect the recorded numbers. To determine actual amounts of material present, a known quantity must be tested at about the same time and the machine must be calibrated. Thus the number reported as a simple percentage of compound present has not only been subjected to many potential sources of error in its raw form, but is actually the output of a calibration model.
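To make the last point concrete, here is a minimal sketch of the kind of calibration step involved; the standard concentrations, peak areas and units are hypothetical, not taken from any real assay.

```python
import numpy as np

# Hypothetical calibration standards: known concentrations (e.g., ng/mL)
# and the peak areas the detector reported for them.
known_conc = np.array([10.0, 25.0, 50.0, 100.0])
peak_area = np.array([118.0, 290.0, 601.0, 1185.0])

# Fit a straight-line calibration curve: area is roughly slope * concentration + intercept.
slope, intercept = np.polyfit(known_conc, peak_area, deg=1)

# Invert the curve to turn a new sample's peak area into a reported concentration.
# Any drift in the standards or the detector flows directly into this number.
sample_area = 455.0
estimated_conc = (sample_area - intercept) / slope
print(f"estimated concentration: {estimated_conc:.1f}")
```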
Examples of data distortion at the human level include misreading of a scale, incorrect copying of values from an instrument, transposition of digits and misplaced decimal points. Of course, such mistakes are not always easy to detect. Even if every data value is checked for plausibility, it often takes expert knowledge to know if a data value is reasonable or absurd. Consider the report in The Times of London that some surviving examples of the greater mouse-eared bat, previously thought to be extinct, had been discovered hibernating in West Sussex. It went on to assert that “they can weigh up to 30 kg” (see Hand, 2004b, Chapter 4). A considerable amount of entertaining correspondence resulted from the fact that they had misstated the weight by three decimal places.

Sometimes data are distorted from the source itself, either knowingly or not. Examples occur in survey work and tax returns, just to name two. It is well known to researchers of sexual behavior that men tend to report more lifetime sexual partners than women, a situation that is highly unlikely sociologically (National Statistics website: www.statistics.gov.uk). Some data are deliberately distorted to prevent disclosure of confidential information collected by governments in, for example, censuses (e.g., Willenborg and de Waal, 2001) and health care data.

Even if the data are initially recorded accurately, data can be compromised by data integration, data warehousing and record linkage. Often a wide range of sources of different types are involved (e.g., in the pharmaceutical sector, data from clinical trials, animal trials, manufacturers, marketing, insurance claims and postmarketing surveillance might be merged). At a more mundane level, records that describe different individuals might be inappropriately merged because they are described by the same key. When going through his medical records for insurance purposes, one of the authors discovered that he was recorded as having had his tonsils removed as a child. A subsequent search revealed the fact that the records of someone else with the same name (but a different address) had been mixed in with his. More generally, what is good quality for (the limited demands made of) an operational data base may not be good quality for (potentially unlimited demands made of) a data warehouse.
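The same-key problem is easy to reproduce; in the toy merge below (names, addresses and procedures are invented), two different people who happen to share a key end up with each other's history attached.

```python
import pandas as pd

# Two registries keyed only by name; the people and fields are illustrative.
patients = pd.DataFrame({"name": ["J. Smith", "A. Jones"],
                         "address": ["12 Elm St", "3 Oak Ave"]})
procedures = pd.DataFrame({"name": ["J. Smith", "J. Smith"],
                           "procedure": ["tonsillectomy", "appendectomy"],
                           "address": ["98 Birch Rd", "12 Elm St"]})

# Merging on the shared name alone attaches another J. Smith's procedure
# to our J. Smith; the conflicting addresses are the only clue.
merged = patients.merge(procedures, on="name", how="left", suffixes=("", "_proc"))
print(merged[["name", "address", "address_proc", "procedure"]])
```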
In a data warehouse, the definitions, sources and other information for the variables are contained in a dictionary, often referred to as metadata. In a large corporation it is often the IT (information technology) group that has responsibility for maintaining both the data warehouse and metadata. Merging sources and checking for consistent definitions form a large part of their duties.
A recent example in bioinformatics shows that data problems are not limited to business and economics. In a recent issue of The Lancet, Petricoin et al. (2002) reported an ability to distinguish between serum samples from healthy women, those with ovarian cancers and women with a benign ovarian disease. It was so exciting that it prompted the “U.S. Congress to pass a resolution urging continued funding to drive a new diagnostic test toward the clinic” (Check, 2004). The researchers trained an algorithm on 50 cancer spectra and 50 normals, and then predicted 116 new spectra. The results were impressive, with the algorithm correctly identifying all 50 of the cancers, 47 out of 50 normals, and classifying the 16 benign disease spectra as “other.” Statisticians Baggerly, Morris and Coombes (2004) attempted to reproduce the Petricoin et al. results, but were unable to do so. Finally, they concluded that the three types of spectra had been preprocessed differently, so that the algorithm correctly identified differences in the data, much of which had nothing to do with the underlying biology of cancer.

A more subtle source of data distortion is a change in the measurement or collection procedure. When the cause of the change is explicit and recognized, this can be adjusted for, at least to some extent. Common examples include a change in the structure of the Dow Jones Industrial Average or the recent U.K. change from the Retail Price Index to the European Union standard Harmonized Index of Consumer Prices. In other cases, one might not be aware of the change. Some of the changes can be subtle. In looking at historical records to assess long-term temperature changes, Jones and Wigley (1990) noted that “changing landscapes affect temperature readings in ways that may produce spurious temperature trends.” In particular, the location of the weather station assigned to a city may have changed. During the 19th century, most cities and towns were too small to impact temperature readings. As urbanization increased, urban heat islands directly affected temperature readings, creating bias in the regional trends. While global warming may be a contributor, the dominant factor is the placement of the weather station, which moved several times. As it became more and more surrounded by the city, the temperature increased, mainly because the environment itself had changed.

A problem related to changes in the collection procedure is not knowing the true source of the data. In scientific analysis, data are often preprocessed by technicians and scientists before being analyzed. The statistician may be unaware of (or uninterested in) the details of the processing. To create accurate models, however, it can be important to know the source, and therefore the accuracy, of the measurements. Consider a study of the effect of ocean bottom topography on sea ice formation in the southern oceans (De Veaux, Gordon, Comiso and Bacherer, 1993). After learning that wind can have a strong effect on sea ice formation, the statistician, wanting to incorporate this predictor into a model, asked one of the physicists whether any wind data existed. It was difficult to imagine very many Antarctic stations with anemometers, and so he was very surprised when the physicist replied, “Sure, there’s plenty of it.” Excitedly he asked what spatial resolution the physicist could provide. When the physicist countered with “what resolution do you want?” the statistician became suspicious. He probed further and asked whether they really had anemometers set up on a 5 km grid on the sea ice. He said, “Of course not. The wind data come from a global weather model—I can generate them at any resolution you want!” It turned out that all the other satellite data had gone through some sort of preprocessing before they were given to the statistician. Some were processed from actual direct measurements, some were processed through models and some, like the wind, were produced solely from models. Of course, this (as with curbstoning) does not necessarily imply that the resulting data are bad, but it should at least serve to warn the analyst that the data may not be what they were thought to be.

Each of these different mechanisms for data distortion has its own set of detection and correction challenges. Ensuring good data collection through survey and/or experimental design is certainly an important first step. A bad design that results in data that are not representative of the phenomenon being studied can render even the best analysis worthless. At the next step, detecting errors can be attempted in a variety of ways, a topic to which we will return in Section 4.

3. IN HOW MANY WAYS?

Data can be bad in an infinite variety of ways, and some authors have attempted to construct taxonomies of data distortion (e.g., Kim et al., 2003). An important simple categorization is into missing data and distorted values.

3.1 Missing Data

Data can be missing at two levels: entire records might be absent, or one or more individual fields may be missing. If entire records are missing, any analysis may well be describing or making inferences about a population different from that intended. The possibility that entire records may be missing is particularly problematic, since there will often be no way of knowing this. Individual fields can be missing for a huge variety of reasons, and the mechanism by which they are missing is likely to influence their distribution over the data, but at least when individual fields are missing one can see that this is the case.
If the missingness of a particular value is unrelated either to the response or predictor variables (missing completely at random—Little and Rubin, 1987, give technical definitions), then case deletion can be employed. However, even ignoring the potential bias problems, complete case deletion can severely reduce the effective sample size. In many data mining situations with a large number of variables, even though each field has only a relatively small proportion of missing values, all of the records may have some values missing, so that the case deletion strategy leaves one with no data at all.

Complications arise when the pattern of missing data does depend on the values that would have been recorded. If, for example, there are no records for patients who experience severe pain, inferences to the entire pain distribution will be impossible (at least, without making some pretty strong distributional assumptions). Likewise, poor choice of a missing value code (e.g., 0 or 99 for age) or accidental inclusion of a missing value code in the analysis (e.g., 99,999 for age) has been known to lead to mistaken conclusions.
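As a concrete illustration of the sentinel-code problem, the sketch below screens an age field for suspicious codes before they can contaminate an analysis; the column name, the codes and the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy records; the column name and the sentinel codes are illustrative.
df = pd.DataFrame({"age": [34, 99, 0, 57, 99999, 41]})

# Values that are legal numbers but are really "unknown" codes.
sentinels = [0, 99, 99999]

# Count how often each suspected code occurs before touching anything.
counts = df["age"].value_counts()
print(counts[counts.index.isin(sentinels)])

# Recode the sentinels as proper missing values so they cannot
# silently inflate or deflate a mean, a regression or a histogram.
df["age"] = df["age"].replace(sentinels, np.nan)
```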
Sometimes missingness arises because of the nature of the problem, and presents real theoretical and practical issues. For example, in personal banking, banks accept those loan applicants whom they expect to repay the loans. For such people, the bank eventually discovers the true outcome (repay, do not repay), but for those rejected for a loan, the true outcome is unknown: it is a missing value. This poses difficulties when the bank wants to construct new predictive models (Hand and Henley, 1993; Hand, 2001). If a loan application asks for household income, replacing a missing value by a mean or even by a model-based imputation may lead to a highly optimistic assessment of risk.

When the missingness in a predictor is related directly to the response, it may be useful for exploratory and prediction purposes to create indicator variables for each predictor, where the indicator is a binary variable recording whether the predictor is missing or not. For categorical predictor variables, missing values can be treated simply as a new category. In a study of dropout rates from a clinical trial for a depression drug, it was found that the single most important indicator of ultimately dropping out from the study was not the depression score on the second week’s test, as indicated from complete case analysis, but simply the indicator of whether the patient showed up to take it (De Veaux, Donahue and Small, 2002).
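A minimal pandas sketch of these two devices, with invented column names and values, might look like this.

```python
import numpy as np
import pandas as pd

# Illustrative data: one numeric and one categorical predictor with holes.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 43000],
    "region": ["N", None, "S", "E", None],
})

# Binary flag recording whether the numeric predictor was missing;
# the flag itself can turn out to be a useful predictor, as in the dropout example.
df["income_missing"] = df["income"].isna().astype(int)

# For a categorical predictor, missingness can simply become a new level.
df["region"] = df["region"].fillna("missing")
print(df)
```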
3.2 Distorted Data

Although there are an unlimited number of possible causes of distortion, a first split can be made into those attributable to instrumentation and those attributable to human agency. Floor and ceiling effects are examples of the first kind (instruments here can be mechanical or electronic, but also questionnaires), although in this case it is sometimes possible to foresee that such things might occur and take account of this in the statistical modeling. Human distortions can arise from misreading instruments or misrecording values at any level. Brunskill (1990) gave an illustration from public records of birth weights, where ounces are commonly confused with pounds, the number 1 is confused with 11 and errors in decimal placement produce order-of-magnitude errors. In such cases, using ancillary information such as gestation times or newborn heights can help to spot gross errors. Some data collection procedures, in an attempt to avoid missing data, actually introduce distortions. A data set we analyzed had a striking number of doctors born on November 11, 1911. It turned out that most doctors (or their secretaries) wanted to avoid typing in age information, but because the program insisted on a value and the choice of 00/00/00 was invalid, the easiest way to bypass the system was simply to type 11/11/11. Such errors might not seem of much consequence, but they can be crucial. Confusion between English and metric units was responsible for the loss of the $125 million Martian Climate Orbiter space probe (The New York Times, October 1, 1999). Jet Propulsion Laboratory engineers mistook acceleration readings measured in English units of pound-seconds for the metric measure of force in newton-seconds. In 1985, in a precedent-setting case, the Supreme Court ruled that Dun & Bradstreet had to pay $350,000 in libel damages to a small Vermont construction company. A part-time student worker had apparently entered the wrong data into the Dun & Bradstreet data base. As a result, Dun & Bradstreet issued a credit report that mistakenly identified the construction company as bankrupt (Percy, 1986).

4. HOW TO DETECT DATA ERRORS

While it may be obvious that a value is missing from a record, it is often less obvious that a value is in error. The presence of errors can (sometimes) be proven, but the absence of errors cannot. There is no guarantee that a data set that looks perfect will not contain mistakes. Some of these mistakes may be intrinsically undetectable: they might be values that are well within the range of the data and could easily have occurred. Moreover, since errors can occur in an unlimited number of ways, there is no end to the list of possible tests for detecting errors. On the other hand, strategic choice of tests can help to pinpoint the root causes that lead to errors and, hence, to the identification of changes in the data collection process that will lead to the greatest improvement in data quality.

When the data collection can be repeated, the results of the duplicate measurements, recordings or transcriptions (e.g., the double entry system used in clinical trials) can be compared by automatic methods. In this “duplicate performance method,” a machine checks for any differences in the two data records. All discrepancies are noted, and the only remaining errors are those where both collectors made the same mistake. Strayhorn (1990) and West and Winkler (1991) provided statistical methods for estimating that proportion. In another quality control method, known errors are added to a data set whose integrity is then assessed by an external observer. With this “known errors” method, the number of errors that remain can be estimated statistically from the success of the observer in discovering the known errors (Strayhorn, 1990; West and Winkler, 1991). Taking this further, one can build models (similar to those developed for software reliability) that estimate how many errors are likely to remain in a data set based on extrapolation from the rate of discovery of errors. At some point one decides that the impact of remaining errors on the conclusions is likely to be sufficiently small that one can ignore them.
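A back-of-the-envelope version of the seeded “known errors” idea (a deliberate simplification for illustration, not the exact Strayhorn or West and Winkler formulation) assumes that the checker detects genuine and planted errors at the same rate, and scales up accordingly.

```python
def estimate_remaining_errors(seeded: int, seeded_found: int, real_found: int) -> float:
    """Rough count of undetected genuine errors, assuming the checker
    catches genuine and seeded errors at the same rate."""
    detection_rate = seeded_found / seeded              # fraction of planted errors caught
    estimated_total_real = real_found / detection_rate  # scale up the genuine errors caught
    return estimated_total_real - real_found            # genuine errors presumably still in the data

# Illustrative numbers: 40 errors were planted, the checker caught 30 of them
# and also found 75 genuine errors, suggesting roughly 25 genuine errors remain.
print(estimate_remaining_errors(seeded=40, seeded_found=30, real_found=75))
```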
Automatic methods of data collection use metadata information to check for consistency across multiple records or variables, integrity (e.g., correct data type), plausibility (within the possible range of the data) and coherence between related variables (e.g., number of sons plus number of daughters equals number of children). Sometimes redundant data can be collected with such checks in mind. However, one cannot rely on software to protect one from mistakes. Even when such automatic methods are in place, the analyst should spend some time looking for errors in the data prior to any modeling effort.
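Checks of this kind are easy to express directly; the following sketch applies a range check and a coherence check to a few invented household records.

```python
import pandas as pd

# Toy household records; the field names and limits are illustrative.
df = pd.DataFrame({
    "age": [34, 212, 57],
    "sons": [1, 0, 2],
    "daughters": [2, 1, 1],
    "children": [3, 1, 4],
})

# Plausibility: values must fall inside a believable range.
bad_age = ~df["age"].between(0, 120)

# Coherence: related fields must agree with one another.
bad_children = df["sons"] + df["daughters"] != df["children"]

# Flag the offending rows for inspection rather than silently "fixing" them.
print(df[bad_age | bad_children])
```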
Data profiling is the use of exploratory and data mining tools aimed at identifying errors, rather than at the substantive questions of interest. When the number of predictor variables is manageable, simple plots such as bar charts, histograms, scatterplots and time series plots can be invaluable. The human eye has evolved to detect anomalies, and this ability should be exploited by presenting the data in a form that makes anomalies easy to see. Such plots have become prevalent in statistical packages for examining missing data patterns. Hand, Blunt, Kelly and Adams (2000) gave the illustration of a plot showing a point for each missing value in a rectangular array of 1012 potential sufferers from osteoporosis measured on 45 variables. It is immediately clear which cases and which variables account for most of the problems.
Unfortunately, as we face larger and larger data sets, we are also faced with increasing difficulty in data profiling. The missing value plot described above works for a thousand cases, but would probably not be so effective for 10 million. Even in this case, however, a Pareto chart of percent missing for each variable may be useful for deciding where to spend data preparation effort. Knowing that a variable is 96% missing makes one think pretty hard about including it in a model. On the other hand, separate manual examination of each of 30,000 gene expression variables is not to be recommended.
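The Pareto chart of missingness is simple to produce; here is a small sketch with a made-up table standing in for a wide data base.

```python
import numpy as np
import pandas as pd

# Small stand-in for a wide table with many candidate predictors.
df = pd.DataFrame({
    "x1": [1, np.nan, 3, np.nan, np.nan],
    "x2": [5, 6, np.nan, 8, 9],
    "x3": [np.nan] * 5,
})

# Percent missing per column, sorted so the worst offenders come first.
pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing)

# pct_missing.plot.bar() gives the Pareto-style chart described above;
# a column that is, say, 96% missing deserves a hard look before it enters a model.
```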
When even simple summaries of all the variables in a data base are not feasible, some methods for reducing the number of potential predictors in the models might be warranted. We see an important role for data mining tools here. It may be wise to reverse the usual paradigm of “explore the data first, then model.” Instead, exploratory models of the data can be useful as a first step and can serve two purposes (De Veaux, 2002). First, models such as tree models and clustering can highlight groups of anomalous cases. Second, the models can be used to reduce the number of potential predictor variables and enable the analyst to examine the remaining predictors in more detail. The resulting process is a circular one, with more examination possible at each subsequent modeling phase. Simply checking whether 500 numerical predictor variables are categorical or quantitative without the aid of metadata is a daunting (and tedious) task. In one analysis, we were asked to develop a fraud detection model for a large credit card bank. In the data set was one potential predictor variable that ranged from around 2000 to 9000, roughly symmetric and unimodal, which was selected as a highly significant predictor for fraud in a stepwise logistic regression model. It turned out that this predictor was a categorical variable (SIC code) used to specify the industry from which the product purchases in the transaction came. Useless as a predictor in a logistic regression model, it had escaped detection as a categorical variable among the several hundred potential candidates. Once the preliminary model had whittled the candidate predictors down to a few dozen, it was easy to use standard data analysis techniques to detect which were appropriate for the final model.
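A cheap screen that would have caught the SIC code is to profile the distinct-value counts of every numeric column: a field that takes only a handful of distinct values across thousands of rows is probably a code rather than a measurement. The sketch below uses simulated transactions; the column names and values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated transactions: 'amount' is genuinely quantitative, while 'sic_code'
# is an industry code that merely looks numeric.
toy = pd.DataFrame({
    "amount": rng.gamma(2.0, 50.0, size=1000).round(2),
    "sic_code": rng.choice([5411, 5812, 5965, 7011, 7995], size=1000),
})

# Distinct-value counts per column: few distinct values among many rows
# is a strong hint that the field should be treated as categorical.
summary = toy.nunique().to_frame("n_distinct")
summary["fraction_distinct"] = summary["n_distinct"] / len(toy)
print(summary.sort_values("n_distinct"))
```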
5. IMPROVING DATA QUALITY

The best way to improve the quality of data is to improve things in the data collection phase. The ideal would be to prevent errors from arising in the first place. Prevention and detection have a reciprocal role to play here. Once one has detected data errors, one can investigate why they occurred and prevent them from happening in the future. Once it has been recognized (detected) that the question “How many miles do you commute to work each day?” permits more than one interpretation, mistakes can be prevented by rewording. Progress toward direct keyboard or other electronic data entry systems means that error detection tools can be applied in real time at data entry—when there is still an opportunity to correct the data. At the data base phase, metadata can be used to ensure that the data conform to expected forms, and relationships between variables can be used to cross-check entries. If the data can be collected more than once, the rate of discovery of errors can be used as the basis for a statistical model to reveal how many undetected errors are likely to remain in the data base.

Various other principles also come into play when considering how to improve data quality. For example, a Pareto principle often applies: most of the errors are attributable to just a few variables. This may happen simply because some variables are intrinsically less reliable (and less important) than others. Sometimes it is possible to improve the overall level of quality significantly by removing just a few of these low quality variables. This has a complementary corollary: a law of diminishing returns suggests that successive attempts to improve the quality of the data are likely to lead to less and less improvement. If one has a particular analytic aim in mind, then one might reasonably assert that data errors that do not affect the conclusions do not matter. Moreover, for those that do matter, perhaps the ease with which they can be corrected should have some bearing on the effort that goes into detecting them—although the overriding criterion should be the loss consequent on the error being made. This is allied with the point that the base rate of errors should be taken into account: if one expects to find many errors, then it is worth attempting to find them, since the rewards, in terms of an improved data base, are likely to be large. In a well-understood environment, it might even be possible to devise useful error detection and correction resource allocation strategies.

Sometimes an entirely different approach to improving data quality can be used. This is simply to hide the poor quality by coarsening or aggregating the data. In fact, a simple example of this implicitly occurs all the time: rather than reporting uncertain and error-prone final digits of measured variables, researchers round to the nearest digit.

6. CONCLUSIONS AND FURTHER DISCUSSION

This article has been about data quality from the perspective of an analyst called upon to extract some meaning from the data. We have already remarked that there are also other aspects to data quality, and these are of equal importance when action is to be taken or decisions made on the basis of the data. These include such aspects as timeliness (the most sophisticated analysis applied to out-of-date data will be of limited value), completeness and, of central importance, fitness for purpose. Data quality, in the abstract, is all very well, but what may be perfectly fine for one use may be woefully inadequate for another. Thus ISO 8402 defines quality as “the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs.”

It is also important to maintain a sense of proportion in assessing and deciding how to cope with data distortions. In one large quality control problem in polymer viscosity, each 1% improvement was worth about $1,000,000 a year, but viscosity itself could be measured only to a standard deviation of around 8%. Before bothering about the accuracy of the predictor variables, it was first necessary to find improved ways to measure the response. In an entirely different context, much work in the personal banking sector concentrates on improved models for predicting risk—where, again, a slight improvement translates into millions of dollars of increased profit. In general, however, these models are based on retrospective data—data drawn from distributions that are unlikely still to apply. We need to be sure that the inaccuracies induced by this population drift do not swamp the apparent improvements we have made.

Data quality is a key issue throughout science, commerce and industry, and entire disciplines have grown up to address particular aspects of the problem. In manufacturing and, to a lesser extent, the service industries, we have schools for quality control and total quality management (Six Sigma, Kaizen, etc.). In large part, these are concerned with reducing random variation. In official statistics, strict data collection protocols are typically used.
Of course, ensuring high quality data does not come without a cost. The bottom line is that one must weigh up the potential gains to be made from capturing and recording better quality data against the costs of ensuring that quality. No matter how much money one spends, and how much resource one consumes in attempting to detect and prevent bad data, the unfortunate fact is that bad data will always be with us.
REFERENCES

Baggerly, K. A., Morris, J. S. and Coombes, K. R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 20 777–785.

Brunskill, A. J. (1990). Some sources of error in the coding of birth weight. American J. Public Health 80 72–73.

Check, E. (2004). Proteomics and cancer: Running before we can walk? Nature 429 496–497.

Coale, A. J. and Stephan, F. F. (1962). The case of the Indians and the teen-age widows. J. Amer. Statist. Assoc. 57 338–347.

De Veaux, R. D. (2002). Data mining: A view from down in the pit. Stats (34) 3–9.

De Veaux, R. D., Donahue, R. and Small, R. D. (2002). Using data mining techniques to harvest information in clinical trials. Presentation at Joint Statistical Meetings, New York.

De Veaux, R. D., Gordon, A., Comiso, J. and Bacherer, N. E. (1993). Modeling of topographic effects on Antarctic sea-ice using multivariate adaptive regression splines. J. Geophysical Research—Oceans 98 20,307–20,320.

Hand, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225–240. Glenlake Publishing, Chicago.

Hand, D. J. (2004a). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209–232. Springer, Berlin.

Hand, D. J. (2004b). Measurement Theory and Practice: The World Through Quantification. Arnold, London.

Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit (with discussion). Statist. Sci. 15 111–131.

Hand, D. J. and Henley, W. E. (1993). Can reject inference ever work? IMA J. of Mathematics Applied in Business and Industry 5(4) 45–55.

Huff, D. (1954). How to Lie with Statistics. Norton, New York.

Jones, P. D. and Wigley, T. M. L. (1990). Global warming trends. Scientific American 263(2) 84–91.

Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. and Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery 7 81–99.

Klein, B. D. (1998). Data quality in the practice of consumer product management: Evidence from the field. Data Quality 4(1).

Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. J. Amer. Statist. Assoc. 76 505–515.

Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM 29 4–11.

Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.

Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco.

Madnick, S. E. and Wang, R. Y. (1992). Introduction to the TDQM research program. Working Paper 92-01, Total Data Quality Management Research Program.

Morey, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM 25 337–342.

Percy, T. (1986). My data, right or wrong. Datamation 32(11) 123–124.

Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C. and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 572–577.

Pierce, E. (1997). Modeling database error rates. Data Quality 3(1). Available at www.dataquality.com/dqsep97.htm.

PricewaterhouseCoopers (2004). The Tech Spotlight 22. Available at www.pwc.com/extweb/manissue.nsf/docid/2D6E2F57E06E022F85256B8F006F389A.

Redman, T. C. (1992). Data Quality: Management and Technology. Bantam, New York.

Strayhorn, J. M. (1990). Estimating the errors remaining in a data set: Techniques for quality control. Amer. Statist. 44 14–18.

Wainer, H. (2004). Curbstoning IQ and the 2000 presidential election. Chance 17(4) 43–46.

West, M. and Winkler, R. L. (1991). Data base error trapping and prediction. J. Amer. Statist. Assoc. 86 987–996.

Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York.

Wolins, L. (1962). Responsibility for raw data. American Psychologist 17 657–658.