Data quality is one of the most important problems in data management,
since dirty data often leads to inaccurate data analytics results and
incorrect business decisions. Poor data across businesses and the U.S.
government are reported to cost trillions of dollars a year. Multiple surveys
show that dirty data is the most common barrier faced by data scientists.
Not surprisingly, developing effective and efficient data cleaning solutions
is challenging and is rife with deep theoretical and engineering problems.
This book is about data cleaning, which is used to refer to all kinds
of tasks and activities to detect and repair errors in the data. Rather than
focus on a particular data cleaning task, we give an overview of the end-
to-end data cleaning process, describing various error detection and repair
methods, and attempt to anchor these proposals with multiple taxonomies
and views. Specifically, we cover four of the most common and important
data cleaning tasks, namely, outlier detection, data transformation,
error repair (including imputing missing values), and data deduplication.
Furthermore, due to the increasing popularity and applicability of machine
learning techniques, we include a chapter that specifically explores how
machine learning techniques are used for data cleaning, and how data
cleaning is used to improve machine learning models.
This book is intended to serve as a useful reference for researchers
and practitioners who are interested in the area of data quality and data
cleaning. It can also be used as a textbook for a graduate course. Although
we aim at covering state-of-the-art algorithms and techniques, we
recognize that data cleaning is still an active field of research and therefore
provide future directions of research whenever appropriate.

ABOUT ACM BOOKS


ACM Books is a series of high-quality books
published by ACM for the computer science
community. ACM Books publications are widely
distributed in print and digital formats by major
booksellers and are available to libraries and
library consortia. Individual ACM members may access ACM
Books publications via separate annual subscription.
BOOKS.ACM.ORG • WWW.MORGANCLAYPOOLPUBLISHERS.COM
Data Cleaning
ACM Books

Editor in Chief
M. Tamer Özsu, University of Waterloo

ACM Books is a series of high-quality books for the computer science community, published
by ACM and many in collaboration with Morgan & Claypool Publishers. ACM Books
publications are widely distributed in both print and digital formats through booksellers
and to libraries (and library consortia) and individual ACM members via the ACM Digital
Library platform.

Data Cleaning
Ihab F. Ilyas, University of Waterloo
Xu Chu, Georgia Institute of Technology
2019

Conversational UX Design: A Practitioner’s Guide to the Natural
Conversation Framework
Robert J. Moore, IBM Research–Almaden
Raphael Arar, IBM Research–Almaden
2019

Heterogeneous Computing: Hardware and Software Perspectives
Mohamed Zahran, New York University
2019

Hardness of Approximation Between P and NP
Aviad Rubinstein, Stanford University
2019

Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker
Editor: Michael L. Brodie, Massachusetts Institute of Technology
2018

The Handbook of Multimodal-Multisensor Interfaces, Volume 2:
Signal Processing, Architectures, and Detection of Emotion and Cognition
Editors: Sharon Oviatt, Monash University
Björn Schuller, University of Augsburg and Imperial College London
Philip R. Cohen, Monash University
Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)
Gerasimos Potamianos, University of Thessaly
Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence
(DFKI)
2018
Declarative Logic Programming: Theory, Systems, and Applications
Editors: Michael Kifer, Stony Brook University
Yanhong Annie Liu, Stony Brook University
2018
The Sparse Fourier Transform: Theory and Practice
Haitham Hassanieh, University of Illinois at Urbana-Champaign
2018
The Continuing Arms Race: Code-Reuse Attacks and Defenses
Editors: Per Larsen, Immunant, Inc.
Ahmad-Reza Sadeghi, Technische Universität Darmstadt
2018
Frontiers of Multimedia Research
Editor: Shih-Fu Chang, Columbia University
2018
Shared-Memory Parallelism Can Be Simple, Fast, and Scalable
Julian Shun, University of California, Berkeley
2017
Computational Prediction of Protein Complexes from Protein Interaction
Networks
Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience
Chern Han Yong, Duke-National University of Singapore Medical School
Limsoon Wong, National University of Singapore
2017
The Handbook of Multimodal-Multisensor Interfaces, Volume 1:
Foundations, User Modeling, and Common Modality Combinations
Editors: Sharon Oviatt, Incaa Designs
Björn Schuller, University of Passau and Imperial College London
Philip R. Cohen, Voicebox Technologies
Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI)
Gerasimos Potamianos, University of Thessaly
Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence
(DFKI)
2017
Communities of Computing: Computer Science and Society in the ACM
Thomas J. Misa, Editor, University of Minnesota
2017
Text Data Management and Analysis: A Practical Introduction to Information
Retrieval and Text Mining
ChengXiang Zhai, University of Illinois at Urbana–Champaign
Sean Massung, University of Illinois at Urbana–Champaign
2016
An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia, Stanford University
2016
Reactive Internet Programming: State Chart XML in Action
Franck Barbier, University of Pau, France
2016
Verified Functional Programming in Agda
Aaron Stump, The University of Iowa
2016
The VR Book: Human-Centered Design for Virtual Reality
Jason Jerald, NextGen Interactions
2016
Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age
Robin Hammerman, Stevens Institute of Technology
Andrew L. Russell, Stevens Institute of Technology
2016
Edmund Berkeley and the Social Responsibility of Computer Professionals
Bernadette Longo, New Jersey Institute of Technology
2015
Candidate Multilinear Maps
Sanjam Garg, University of California, Berkeley
2015
Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing
John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business
and Government, John F. Kennedy School of Government, Harvard University
2015
A Framework for Scientific Discovery through Video Games
Seth Cooper, University of Washington
2014
Trust Extension as a Mechanism for Secure Code Execution on Commodity
Computers
Bryan Jeffrey Parno, Microsoft Research
2014
Embracing Interference in Wireless Systems
Shyamnath Gollakota, University of Washington
2014
Data Cleaning

Ihab F. Ilyas
University of Waterloo

Xu Chu
Georgia Institute of Technology

ACM Books #28


Copyright © 2019 by Association for Computing Machinery

All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means—electronic, mechanical, photocopy,
recording, or any other except for brief quotations in printed reviews—without the prior
permission of the publisher.
Designations used by companies to distinguish their products are often claimed as trade-
marks or registered trademarks. In all instances in which the Association for Computing
Machinery is aware of a claim, the product names appear in initial capital or all capital
letters. Readers, however, should contact the appropriate companies for more complete
information regarding trademarks and registration.

Data Cleaning
Ihab F. Ilyas
Xu Chu
http://books.acm.org

ISBN: 978-1-4503-7152-0 hardcover
ISBN: 978-1-4503-7153-7 paperback
ISBN: 978-1-4503-7154-4 ePub
ISBN: 978-1-4503-7155-1 eBook
Series ISSN: 2374-6769 print 2374-6777 electronic
DOIs:
10.1145/3310205 Book
10.1145/3310205.3310206 Preface
10.1145/3310205.3310207 Chapter 1
10.1145/3310205.3310208 Chapter 2
10.1145/3310205.3310209 Chapter 3
10.1145/3310205.3310210 Chapter 4
10.1145/3310205.3310211 Chapter 5
10.1145/3310205.3310212 Chapter 6
10.1145/3310205.3310213 Chapter 7
10.1145/3310205.3310214 Chapter 8
10.1145/3310205.3310215 References/Index/Bios

A publication in the ACM Books series, #28


Editor in Chief: M. Tamer Özsu, University of Waterloo

This book was typeset in Arnhem Pro 10/14 and Flama using ZzTEX.
Cover photo: Jason Dorfman MIT / CSAIL

First Edition
10 9 8 7 6 5 4 3 2 1
To my family: Francis, Aida, Mirette, Andrew and Marina

To my wife Jianmei and my daughter Hannah


Contents

Preface
Figure and Table Credits

Chapter 1 Introduction
1.1 Data Cleaning Workflow
1.2 Book Scope

Chapter 2 Outlier Detection
2.1 A Taxonomy of Outlier Detection Methods
2.2 Statistics-Based Outlier Detection
2.3 Distance-Based Outlier Detection
2.4 Model-Based Outlier Detection
2.5 Outlier Detection in High-Dimensional Data
2.6 Conclusion

Chapter 3 Data Deduplication
3.1 Similarity Metrics
3.2 Predicting Duplicate Pairs
3.3 Clustering
3.4 Blocking for Deduplication
3.5 Distributed Data Deduplication
3.6 Record Fusion and Entity Consolidation
3.7 Human-Involved Data Deduplication
3.8 Data Deduplication Tools
3.9 Conclusion

Chapter 4 Data Transformation
4.1 Syntactic Data Transformations
4.2 Semantic Data Transformations
4.3 ETL Tools
4.4 Conclusion

Chapter 5 Data Quality Rule Definition and Discovery
5.1 Functional Dependencies
5.2 Conditional Functional Dependencies
5.3 Denial Constraints
5.4 Other Types of Constraints
5.5 Conclusion

Chapter 6 Rule-Based Data Cleaning
6.1 Violation Detection
6.2 Error Repair
6.3 Conclusion

Chapter 7 Machine Learning and Probabilistic Data Cleaning
7.1 Machine Learning for Data Deduplication
7.2 Machine Learning for Data Repair
7.3 Data Cleaning for Analytics and Machine Learning

Chapter 8 Conclusion and Future Thoughts

References
Index
Author Biographies

Preface

Data quality is one of the most important problems in data management, since
dirty data often leads to inaccurate data analytics results and incorrect business
decisions. Poor data across businesses and the U.S. government are reported to
cost trillions of dollars a year. Multiple surveys show that dirty data is the most
common barrier faced by data scientists. Not surprisingly, developing effective and
efficient data cleaning solutions is challenging and is rife with deep theoretical and
engineering problems.
Data cleaning is used to refer to all kinds of tasks and activities to detect and
repair errors in the data. Rather than focus on a particular data cleaning task, in
this book, we give an overview of the end-to-end data cleaning process, describing
various error detection and repair methods, and attempt to anchor these proposals
with multiple taxonomies and views. Specifically, we cover four of the most
common and important data cleaning tasks, namely, outlier detection, data
transformation, error repair (including imputing missing values), and data
deduplication. Furthermore, due to the increasing popularity and applicability of machine
learning techniques, we include a chapter that specifically explores how machine
learning techniques are used for data cleaning, and how data cleaning is used to
improve machine learning models.
This book is intended to serve as a useful reference for researchers and
practitioners who are interested in the area of data quality and data cleaning. It can also
be used as a textbook for a graduate course. Although we aim at covering
state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active
field of research and therefore provide future directions of research whenever
appropriate.

Ihab Ilyas
Xu Chu
March 2019
Figure and Table Credits

Figures
Figure 2.3 Based On: Patrick Wessa. Free statistics software, office for research development
and education, version 1.1.23-r7. http://www.wessa.net, 2012
Figure 2.4 Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000.
LOF: identifying density-based local outliers. SIGMOD Rec. 29, 2 (May 2000), 93–104. DOI:
10.1145/335191.335388.
Figure 2.5 Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A
survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58 pages. DOI: 10.1145/1541880.1541882.
Figure 2.6 Charu C. Aggarwal. Outlier Analysis. Springer, 2013.
Figure 2.7 Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Conditional
anomaly detection. IEEE Trans. Knowl. and Data Eng., 19(5), 2007.
Figure 3.3 Based On: Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the
prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the
2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). ACM, New
York, NY, USA, 85–96. DOI: 10.1145/2213836.2213847.
Figure 3.6 Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM Comput. Surv. 41, 1,
Article 1 (January 2009), 41 pages. DOI: 10.1145/1456650.1456651.
Figure 3.7 Based On: George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, and Shai Ben-
David. Modeling and querying possible repairs in duplicate detection. Proc. VLDB Endowment,
2(1): 598–609, (August 2009), 598–609. DOI: 10.14778/1687627.1687695.
Figure 3.8 Based On: George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, and Shai Ben-
David. Modeling and querying possible repairs in duplicate detection. Proc. VLDB Endowment,
2(1): 598–609, (August 2009), 598–609. DOI: 10.14778/1687627.1687695.
Figure 3.11 Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. Crowder:
Crowdsourcing entity resolution. Proc. VLDB Endowment, 5(11): 1483–1494, DOI: 10.14778/
2350229.2350263.
Figure 3.12 Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. Crowder:
Crowdsourcing entity resolution. Proc. VLDB Endowment, 5(11): 1483–1494, DOI: 10.14778/
2350229.2350263.
Figure 3.13 Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan
Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourcing for entity
matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of
Data (SIGMOD ’14). ACM, New York, NY, USA, 601–612. DOI: 10.1145/2588555.2588576.
Figure 3.14 Pradap Konda, Sanjib Das, Paul Suganthan GC, AnHai Doan, Adel Ardalan, Jeffrey
R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, et al. Magellan: Toward
building entity matching management systems. Proc. VLDB Endowment, 9(12): 1197–1208,
2016.
Figure 3.15 Based on: Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales,
Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. Data curation at scale:
The data tamer system. In Proc. 6th Biennial Conf. on Innovative Data Systems Research, 2013.
http://cidrdb.org/
Figure 4.3 Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter’s Wheel: An
Interactive Data Cleaning System. In Proceedings of the 27th International Conference on
Very Large Data Bases (VLDB ’01), Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano
Paraboschi, Kotagiri Ramamohanarao, and Richard Thomas Snodgrass (Eds.). Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 381–390.
Figure 4.4 Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, and Jeffrey Heer. 2011.
Proactive wrangling: mixed-initiative end-user programming of data transformation scripts.
In Proceedings of the 24th annual ACM symposium on User interface software and technology (UIST
’11). ACM, New York, NY, USA, 65–74. DOI: 10.1145/2047196.2047205. and Jeffrey Heer, Joseph
Hellerstein, and Sean Kandel. Predictive interaction for data transformation. In Proc. 7th
Biennial Conf. on Innovative Data Systems Research, 2015. and Sean Kandel, Andreas Paepcke,
Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: interactive visual specification of data
transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (CHI ’11). ACM, New York, NY, USA, 3363–3372. DOI: 10.1145/1978942.1979444.
Figure 4.5 Copyright © 2007 Free Software Foundation, Inc. http://fsf.org/
Figure 4.6 Sumit Gulwani. 2011. Automating string processing in spreadsheets using
input-output examples. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium
on Principles of programming languages (POPL ’11). ACM, New York, NY, USA, 317–330.
DOI: 10.1145/1926385.1926423.
Figure 4.7 Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, and Jeffrey Heer. 2011.
Proactive wrangling: mixed-initiative end-user programming of data transformation scripts.
In Proceedings of the 24th annual ACM symposium on User interface software and technology (UIST
’11). ACM, New York, NY, USA, 65–74. DOI: 10.1145/2047196.2047205.
Figure 4.8 Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti and M. Stonebraker,
DataXFormer: A robust transformation discovery system, 2016 IEEE 32nd International
Conference on Data Engineering (ICDE), Helsinki, 2016, pp. 1134–1145. DOI: 10.1109/ICDE
.2016.7498319.
Figure 4.9 Based On: Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti and
M. Stonebraker, DataXFormer: A robust transformation discovery system, 2016 IEEE
32nd International Conference on Data Engineering (ICDE), Helsinki, 2016, pp. 1134–1145.
DOI: 10.1109/ICDE.2016.7498319.
Figure 4.10 Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti and M. Stonebraker,
DataXFormer: A robust transformation discovery system, 2016 IEEE 32nd International
Conference on Data Engineering (ICDE), Helsinki, 2016, pp. 1134–1145. DOI: 10.1109/ICDE
.2016.7498319.
Figure 4.11 Based On: Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti and
M. Stonebraker, DataXFormer: A robust transformation discovery system, 2016 IEEE
32nd International Conference on Data Engineering (ICDE), Helsinki, 2016, pp. 1134–1145.
DOI: 10.1109/ICDE.2016.7498319.
Figure 5.3 Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional
Dependency Discovery. In Proceedings of the 2016 International Conference on Management of
Data (SIGMOD ’16). ACM, New York, NY, USA, 821–833. DOI: 10.1145/2882903.2915203.
Figure 5.5 Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. 2017. Efficient denial
constraint discovery with hydra. Proc. VLDB Endow. 11, 3 (November 2017), 311–323. DOI:
10.14778/3157794.3157800.
Figure 5.6 Grace Fan, Wenfei Fan, and Floris Geerts. Detecting errors in numeric attributes.
In Proc. 15th Int. Conf. on Web-Age Information Management, pages 125–137. Springer, 2014a.
Figure 5.7 Jiannan Wang and Nan Tang. 2014. Towards dependable data repairing with fixing
rules. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
(SIGMOD ’14). ACM, New York, NY, USA, 457–468. DOI: 10.1145/2588555.2610494.
Figure 5.8 Matteo Interlandi and Nan Tang. Proof positive and negative in data cleaning. In
Proc. 31st Int. Conf. on Data Engineering, 2015.
Figure 5.9 Matteo Interlandi and Nan Tang. Proof positive and negative in data cleaning. In
Proc. 31st Int. Conf. on Data Engineering, 2015.
Figure 5.10 Matteo Interlandi and Nan Tang. Proof positive and negative in data cleaning. In
Proc. 31st Int. Conf. on Data Engineering, 2015.
Figure 6.2 Based On: Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti.
Descriptive and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of
Data, pages 445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.3 Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, and Dan Suciu. 2011.
Tracing data errors with view-conditioned causality. In Proceedings of the 2011 ACM SIGMOD
International Conference on Management of data (SIGMOD ’11). ACM, New York, NY, USA, 505–
516. DOI: 10.1145/1989323.1989376.
Figure 6.4 Eugene Wu and Samuel Madden. Scorpion: Explaining away outliers in aggregate
queries. Proceedings of the VLDB Endowment, Vol. 6, No. 8. Copyright 2013 VLDB Endowment
2150-8097/13/06 553–564.
Figure 6.5 Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti. Descriptive
and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages
445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.6 Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti. Descriptive
and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages
445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.7 Based On: Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti.
Descriptive and prescriptive data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of
Data, pages 445–456, 2014. DOI: 10.1145/2588555.2610520.
Figure 6.9 Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. That’s
all folks! LLUNATIC goes open source. Proceedings of the VLDB Endowment, Vol. 7, No. 13.
Copyright 2014 VLDB Endowment 2150-8097/14/08:1565–1568.
Figure 6.12 Maksims Volkovs, Fei Chiang, Jaroslaw Szlichta, and Renée J. Miller. Continuous
data cleaning. In Proc. 30th Int. Conf. on Data Engineering, pages 244–255, 2014.
Figure 6.14 George Beskales, Ihab F. Ilyas, and Lukasz Golab. Sampling the repairs of
functional dependency violations under hard constraints. Proc. VLDB Endowment, 3(1–2):
197–207, DOI: 10.14778/1920841.1920870.
Figure 6.15 Solmaz Kolahi and Laks V. S. Lakshmanan. 2009. On approximating optimum
repairs for functional dependency violations. In Proceedings of the 12th International
Conference on Database Theory (ICDT ’09), Ronald Fagin (Ed.). ACM, New York, NY, USA, 53–62.
DOI: 10.1145/1514894.1514901.
Figure 6.16 Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani,
and Ihab F. Ilyas. Guided data repair. Proc. VLDB Endowment, 4(5): 279–289, DOI: 10.14778/
1952376.1952378.
Figure 6.17 Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani,
and Ihab F. Ilyas. Guided data repair. Proc. VLDB Endowment, 4(5): 279–289, DOI: 10.14778/
1952376.1952378.
Figure 6.18 Wenfei Fan and Floris Geerts. Foundations of Data Quality Management. Synthesis
Lectures on Data Management. 2012. © Morgan & Claypool.
Figure 6.19 Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan
Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases
and Crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 1247–1261. DOI: 10.1145/
2723372.2749431.
Figure 6.20 Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan
Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases
and Crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 1247–1261. DOI: 10.1145/
2723372.2749431.
Figure 6.23 George Beskales, Ihab F. Ilyas, and Lukasz Golab. Sampling the repairs of
functional dependency violations under hard constraints. Proc. VLDB Endowment, 3(1–2):
197–207, DOI: 10.14778/1920841.1920870.
Figure 7.1 Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using
active learning. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining (KDD ’02). ACM, New York, NY, USA, 269–278. DOI: 10.1145/775047
.775087.
Figure 7.2 Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park,
Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for
entity matching: A design space exploration. In Proceedings of the 2018 International Conference
on Management of Data (SIGMOD ’18). ACM, New York, NY, USA, 19–34. DOI: 10.1145/3183713
.3196926.
Figure 7.3 Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park,
Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for
entity matching: A design space exploration. In Proceedings of the 2018 International Conference
on Management of Data (SIGMOD ’18). ACM, New York, NY, USA, 19–34. DOI: 10.1145/3183713
.3196926.
Figure 7.8 Jiannan Wang, Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Tim Kraska,
and Tova Milo. A sample-and-clean framework for fast and accurate query processing on
dirty data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 469–480, 2014. DOI:
10.1145/2588555.2610505.
Figure 7.9 Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg.
Activeclean: Interactive data cleaning for statistical modeling. Proc. VLDB Endowment,
9(12, August 2016): 948–959. DOI: 10.14778/2994509.2994514.

Tables
Table 3.2 Jens Bleiholder and Felix Naumann. 2009. Data fusion. ACM Comput. Surv. 41, 1,
Article 1 (January 2009), 41 pages. DOI: 10.1145/1456650.1456651 and Xin Luna Dong and Felix
Naumann. Data fusion: resolving data conflicts for integration. Proc. VLDB Endowment, 2(2):
1654–1655, 2009.
Table 4.1 Based On: Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer.
2011. Wrangler: interactive visual specification of data transformation scripts. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). ACM, New York, NY,
USA, 3363–3372. DOI: 10.1145/1978942.1979444.
Table 5.2 Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava, and Bei Yu. On
generating near-optimal tableaux for conditional functional dependencies. Proc. VLDB
Endowment, 1(1): 376–390, DOI: 10.14778/1453856.1453900.
Table 6.1 Based On: Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Holistic data cleaning: Putting
violations into context. In Proc. 29th Int. Conf. on Data Engineering, pages 458–469, 2013b.
Table 6.3 Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2011. Interaction
between record matching and data repairing. In Proceedings of the 2011 ACM SIGMOD
International Conference on Management of data (SIGMOD ’11). ACM, New York, NY, USA, 469–
480. DOI: 10.1145/1989323.1989373.
Table 7.1 Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean:
holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10, 11 (August 2017),
1190–1201. DOI: 10.14778/3137628.3137631.
1
Introduction
Enterprises have been acquiring large amounts of data from a variety of sources in
order to build large data repositories that power their applications, with the goal of
enabling richer and more informed analytics. Data collection and acquisition often
introduce errors in data, e.g., missing values, typos, mixed formats, replicated
entries for the same real-world entity, and violations of business and data integrity
rules. A survey about the state of data science and machine learning (ML) reveals
that dirty data is the most common barrier faced by workers dealing with data.1
With the popularity of data science, it has become increasingly evident that data
curation, unification, preparation, and cleaning are key enablers in unleashing the
value of data.2 According to another survey of about 80 data scientists conducted
by CrowdFlower and published in Forbes,3 data scientists spend more than 60% of
their time in cleaning and organizing data, and 57% of data scientists regard
cleaning and organizing data as the least enjoyable part of their work. Not surprisingly,
developing effective and efficient data cleaning solutions is challenging and is rife
with deep theoretical and engineering problems.

1. https://www.kaggle.com/surveys/2017
2. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
3. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/

Regardless of the type of data errors to be fixed, data cleaning activities usually
consist of two phases: (1) error detection, where various errors and violations are
identified and possibly validated by experts; and (2) error repair, where updates to
the database are applied (or suggested to human experts) to bring the data to a
cleaner state suitable for downstream applications and analytics. Error detection
techniques can be either quantitative or qualitative. Specifically, quantitative error
detection techniques often involve statistical methods to identify abnormal behaviors
and errors [Hellerstein 2008] (e.g., “a salary that is three standard deviations
away from the mean salary is an error”), and hence have been mostly studied in the
context of outlier detection [Aggarwal 2013]. On the other hand, qualitative error
detection techniques rely on descriptive approaches to specify patterns or constraints
of a consistent data instance, and for that reason these techniques identify those
data that violate such patterns or constraints as errors. For example, in a descriptive
statement about a company HR database, “for two employees working at the same
branch of the company, the senior employee cannot earn less salary than the junior
employee,” if we find two employees with a violation of the rule, it is likely that there
is an error in at least one of them.
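
The two detection styles described above can be illustrated with a minimal Python sketch. It is not taken from the book: the function names, the record fields (`id`, `branch`, `level`, `salary`), and the use of a numeric `level` to decide who is "senior" are all assumptions made for illustration. The first function applies the three-standard-deviation rule from the salary example; the second checks the HR rule about senior and junior employees at the same branch.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Quantitative check: return values lying more than `threshold`
    population standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mu) > threshold * sigma]


def seniority_violations(employees):
    """Qualitative check: return (senior_id, junior_id) pairs at the same
    branch where the senior employee earns less than the junior one.

    Assumption: a higher `level` means a more senior employee.
    """
    violations = []
    for a in employees:
        for b in employees:
            if (a["branch"] == b["branch"]
                    and a["level"] > b["level"]
                    and a["salary"] < b["salary"]):
                violations.append((a["id"], b["id"]))
    return violations


# Hypothetical data: nineteen typical salaries and one extreme value.
salaries = [50] * 19 + [500]
suspicious_salaries = zscore_outliers(salaries)

# Hypothetical HR records; employee 1 is senior to employee 2 but earns less.
staff = [
    {"id": 1, "branch": "NYC", "level": 5, "salary": 90},
    {"id": 2, "branch": "NYC", "level": 3, "salary": 100},
    {"id": 3, "branch": "SF", "level": 1, "salary": 200},
]
suspicious_pairs = seniority_violations(staff)
```

Note that the rule check only reports a violating pair; as the text says, it indicates that at least one of the two records is likely wrong, not which one.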
Various surveys and books detail specific aspects of data quality and data cleaning.
For example, Rahm and Do [2000] classify different types of errors occurring in
an Extract-Transform-Load (ETL) process, and survey the tools available for cleaning
data in an ETL process. Some work focuses on the effect of incomplete data
on query answering [Grahne 1991] and the use of a Chase procedure [Maier et al.
1979] for dealing with incomplete data [Greco et al. 2012]. Hellerstein [2008] focuses
on cleaning quantitative numerical data using mainly statistical techniques.
Bertossi [2011] provides complexity results for repairing inconsistent data and performing
consistent query answering on inconsistent data. Fan and Geerts [2012]
discuss the use of data quality rules in data consistency, data currency, and data
completeness, and their interactions. Dasu and Johnson [2003] summarize how
techniques in exploratory data mining can be integrated with data quality management.
Ganti and Sarma [2013] focus on an operator-centric approach for developing
a data cleaning solution, involving the development of customizable operators that
can be used as building blocks for developing common solutions. Ilyas and Chu
[2015] provide taxonomies and example algorithms for qualitative error detection
and repairing techniques. Multiple surveys and tutorials have been published to
summarize different definitions of outliers and the algorithms for detecting them
[Hodge and Austin 2004, Chandola et al. 2009, Aggarwal 2013]. Data deduplication,
a long-standing problem that has been studied for decades [Fellegi and Sunter
1969], has also been extensively surveyed [Koudas et al. 2006, Elmagarmid et al.
2007, Herzog et al. 2007, Dong and Naumann 2009, Naumann and Herschel 2010,
Getoor and Machanavajjhala 2012].
This book, however, focuses on the end-to-end data cleaning process, describing
various error detection and repair methods, and attempts to anchor these proposals
with multiple taxonomies and views. Our goals are (1) to allow researchers and
general readers to understand the scope of current techniques and highlight gaps
and possible new directions of research; and (2) to give practitioners and system
implementers a variety of choices and solutions for their data cleaning activities.
[Figure 1.1 diagram. Labels: Data; Discovery; PDFs, rules, patterns, etc.; Error detection;
Errors; Error repair; External sources; Knowledge bases.]

Figure 1.1 A typical data cleaning workflow with an optional discovery step, error detection step, and
error repair step.

In what follows, we give a brief overview of the book’s scope as well as a chapter
outline.

1.1 Data Cleaning Workflow

Figure 1.1 shows a typical data cleaning workflow, consisting of an optional discovery
and profiling step, an error detection step, and an error repair step. To clean
a dirty dataset, we often need to model various aspects of this data, e.g., schema,
patterns, probability distributions, and other metadata. One way to obtain such
metadata is by consulting domain experts, typically a costly and time-consuming
process. The discovery and profiling step is used to discover these metadata automatically.
Given a dirty dataset and the associated metadata, the error detection
step finds part of the data that does not conform to the metadata, and declares this
subset to contain errors. The errors surfaced by the error detection step can be in
various forms, such as outliers, violations, and duplicates. Finally, given the errors
detected and the metadata that generate those errors, the error repair step produces
data updates that are applied to the dirty dataset. Since there are many uncertainties
in the data cleaning process, external sources such as knowledge bases and human
experts are consulted whenever possible and feasible to ensure the accuracy of the
cleaning workflow.
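
The three steps of the workflow can be sketched end to end in a few lines of Python. This is a toy illustration under strong assumptions, not a method from the book: the "metadata" is just the most frequent value of one column for each value of another (a crude stand-in for a discovered rule), the repair is a simple majority-vote overwrite, and the column names `zip` and `city` are hypothetical.

```python
from collections import Counter, defaultdict


def discover_mapping(rows, key, dependent):
    """Discovery/profiling step: for each `key` value, learn the most
    frequent `dependent` value observed in the data."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key]][row[dependent]] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}


def detect_errors(rows, key, dependent, mapping):
    """Error detection step: indexes of rows that do not conform to the
    discovered metadata."""
    return [i for i, row in enumerate(rows)
            if row[dependent] != mapping[row[key]]]


def repair(rows, key, dependent, mapping, errors):
    """Error repair step: return an updated copy of the data in which the
    flagged cells are overwritten with the majority value."""
    cleaned = [dict(row) for row in rows]
    for i in errors:
        cleaned[i][dependent] = mapping[cleaned[i][key]]
    return cleaned


# Hypothetical dirty dataset: the third row misspells the city.
dirty = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},
    {"zip": "60601", "city": "Chicago"},
]

metadata = discover_mapping(dirty, "zip", "city")
errors = detect_errors(dirty, "zip", "city", metadata)
clean = repair(dirty, "zip", "city", metadata, errors)
```

Returning a repaired copy rather than updating in place keeps the original rows available, which matters when proposed updates are to be reviewed by a human expert or checked against an external source before being applied.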

Example 1.1 Consider Table 1.1 containing employee records for a U.S. company. Every tuple
specifies a person in a company with her id (GID), name (FN, LN), level (LVL),