(Ebook) Geographic Data Mining and Knowledge Discovery, Second Edition (Chapman & Hall CRC Data Mining and Knowledge Discovery Series) by Harvey J. Miller, Jiawei Han ISBN 9781420073973, 1420073974 download
Geographic
Data Mining and
Knowledge Discovery
Second Edition
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis. This
series encourages the integration of mathematical, statistical, and computational methods and
techniques through the publication of a broad range of textbooks, reference works, and hand-
books. The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.
Geographic
Data Mining and
Knowledge Discovery
Second Edition
Edited by
Harvey J. Miller
Jiawei Han
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher can-
not assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Geographic data mining and knowledge discovery / editors, Harvey J. Miller and
Jiawei Han. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4200-7397-3 (hard back : alk. paper)
1. Geodatabases. 2. Data mining. I. Miller, Harvey J. II. Han, Jiawei. III. Title.
G70.2.G4365 2009
910.285’6312--dc22 2009010969
Acknowledgments
About the Editors
List of Contributors
Chapter 15 Periodic Pattern Discovery from Trajectories of Moving Objects
Huiping Cao, Nikos Mamoulis, and David W. Cheung
Eveline Bernier
Laval University
Quebec City, Canada

Arnold P. Boedihardjo
Virginia Tech
Blacksburg, Virginia

Huiping Cao
University of Hong Kong
Hong Kong

Martin Charlton
National University of Ireland
County Kildare, Ireland

Sanjay Chawla
University of Sydney
Sydney, Australia

David W. Cheung
University of Hong Kong
Hong Kong

Urška Demšar
National University of Ireland
County Kildare, Ireland

Rodolphe Devillers
Memorial University of Newfoundland
St. John’s, Canada

Mark Gahegan
University of Auckland
Auckland, New Zealand

Marc Gervais
Laval University
Quebec City, Canada

Diansheng Guo
University of South Carolina
Columbia, South Carolina

Otto Huisman
International Institute for GeoInformation Science and Earth Observation (ITC)
Enschede, Netherlands

Micheline Kamber
Burnaby, Canada

Menno-Jan Kraak
International Institute for GeoInformation Science and Earth Observation (ITC)
Enschede, Netherlands

Antonietta Lanza
Università degli Studi di Bari
Bari, Italy

Patrick Laube
University of Melbourne
Victoria, Australia
1.1 INTRODUCTION
Similar to many research and application fields, geography has moved from a data-
poor and computation-poor to a data-rich and computation-rich environment. The
scope, coverage, and volume of digital geographic datasets are growing rapidly. Public
and private sector agencies are creating, processing, and disseminating digital data on
land use, socioeconomic conditions, and infrastructure at very detailed levels of geo-
graphic resolution. New high spatial and spectral resolution remote sensing systems
and other monitoring devices are gathering vast amounts of geo-referenced digital
imagery, video, and sound. Geographic data collection devices linked to location-aware
technologies (LATs) such as global positioning system (GPS) receivers allow
field researchers to collect unprecedented amounts of data. LATs linked to or embed-
ded in devices such as cell phones, in-vehicle navigation systems, and wireless Internet
clients provide location-specific content in exchange for tracking individuals in space
and time. Information infrastructure initiatives such as the U.S. National Spatial Data
Infrastructure are facilitating data sharing and interoperability. Digital geographic data
repositories on the World Wide Web are growing rapidly in both number and scope.
The amount of data that geographic information processing systems can handle will
continue to increase exponentially through the mid-21st century.
Traditional spatial analytical methods were developed in an era when data collec-
tion was expensive and computational power was weak. The increasing volume and
diverse nature of digital geographic data easily overwhelm mainstream spatial anal-
ysis techniques that are oriented toward teasing scarce information from small and
homogeneous datasets. Traditional statistical methods, particularly spatial statistics,
have high computational burdens. These techniques are confirmatory and require
the researcher to have a priori hypotheses. Therefore, traditional spatial analytical
techniques cannot easily discover new and unexpected patterns, trends, and relation-
ships that can be hidden deep within very large and diverse geographic datasets.
In March 1999, the National Center for Geographic Information and Analysis
(NCGIA) — Project Varenius held a workshop on discovering geographic knowl-
edge in data-rich environments in Kirkland, Washington, USA. The workshop
brought together a diverse group of stakeholders with interests in developing and
applying computational techniques for exploring large, heterogeneous digital geo-
graphic datasets. Drawing on papers submitted to that workshop, in 2001 we pub-
lished Geographic Data Mining and Knowledge Discovery, a volume that brought
together some of the cutting-edge research in the area of geographic data mining and
geographic knowledge discovery in a data-rich environment. There has been much
progress in geographic knowledge discovery (GKD) over the past eight years, includ-
ing the development of new techniques for geographic data warehousing (GDW),
spatial data mining, and geo-visualization. In addition, there has been a remarkable
rise in the collection and storage of data on spatiotemporal processes and mobile
objects, with a consequential rise in knowledge discovery techniques for these data.
The second edition of Geographic Data Mining and Knowledge Discovery is a
major revision of the first edition. We selected chapters from the first edition and
asked authors for updated manuscripts that reflect changes and recent developments
in their particular domains. We also solicited new chapters on topics that were not
covered well in the first edition but have become more prominent recently. This
includes several new chapters on spatiotemporal and mobile objects databases, a
topic only briefly mentioned in the 2001 edition.
This chapter introduces geographic data mining and GKD. In this chapter, we pro-
vide an overview of knowledge discovery from databases (KDD) and data mining.
We identify why geographic data is a nontrivial special case that requires distinctive
consideration and techniques. We also review the current state-of-the-art in GKD,
including the existing literature and the contributions of the chapters in this volume.
type of data that increasingly comprise enterprise databases and the novelty of the
patterns sought in KDD.
KDD goes beyond the traditional domain of statistics to accommodate data not
normally amenable to statistical analysis. Statistics usually involves a small and clean
(noiseless) numeric database scientifically sampled from a large population with spe-
cific questions in mind. Many statistical models require strict assumptions (such as
independence, stationarity of underlying processes, and normality). In contrast, the
data being collected and stored in many enterprise databases are noisy, nonnumeric,
and possibly incomplete. These data are also collected in an open-ended manner
without specific questions in mind (Hand 1998). KDD encompasses principles and
techniques from statistics, machine learning, pattern recognition, numeric search,
and scientific visualization to accommodate the new data types and data volumes
being generated through information technologies.
KDD is more strongly inductive than traditional statistical analysis. The gen-
eralization process of statistics is embedded within the broader deductive process
of science. Statistical models are confirmatory, requiring the analyst to specify a
model a priori based on some theory, test these hypotheses, and perhaps revise
the theory depending on the results. In contrast, the deeply hidden, interesting
patterns being sought in a KDD process are (by definition) difficult or impos-
sible to specify a priori, at least with any reasonable degree of completeness.
KDD is more concerned about prompting investigators to formulate new predic-
tions and hypotheses from data as opposed to testing deductions from theories
through a sub-process of induction from a scientific database (Elder and Pregibon
1996; Hand 1998). A guideline is that if the information being sought can only be
vaguely described in advance, KDD is more appropriate than statistics (Adriaans
and Zantinge 1996).
KDD more naturally fits in the initial stage of the deductive process when the
researcher forms or modifies theory based on ordered facts and observations from
the real world. In this sense, KDD is to information space as microscopes, remote
sensing, and telescopes are to atomic, geographic, and astronomical spaces, respec-
tively. KDD is a tool for exploring domains that are too difficult to perceive with
unaided human abilities. For searching through a large information wilderness, the
powerful but focused laser beams of statistics cannot compete with the broad but
diffuse floodlights of KDD. However, floodlights can cast shadows and KDD cannot
compete with statistics in confirmatory power once the pattern is discovered.
per week”), two dimensions (e.g., “total sales by item and store”) and so on, up to
N dimensions. The data cube is an N-dimensional generalization of the more com-
monly known SQL aggregation functions and “Group-By” operator. However, the
analogous SQL query only generates the zero and one-dimensional aggregations;
the data cube operator generates these and the higher dimensional aggregations all
at once (Gray et al. 1997).
The power set of aggregations over selected dimensions is called a “data cube”
because the logical arrangement of aggregations can be viewed as a hypercube in
an N-dimensional information space (see Gray et al. 1997, Figure 2). The data cube
can be pre-computed and stored in its entirety, computed “on-the-fly” only when
requested, or partially pre-computed and stored (see Harinarayan, Rajaraman and
Ullman 1996). The data cube can support standard OLAP operations including roll-
up, drill-down, slice, dice, and pivot on measures computed by different aggregation
operators, such as max, min, average, top-10, variance, and so on.
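The idea of a data cube as the power set of aggregations can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the sales records, the dimension names, and the helper `data_cube` are hypothetical and do not come from Gray et al. (1997). It sums one measure over every subset of the dimensions, from the zero-dimensional grand total up to the full N-dimensional grouping.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records: (item, store, week, units_sold).
SALES = [
    ("apples", "north", 1, 10),
    ("apples", "south", 1, 5),
    ("pears", "north", 1, 7),
    ("apples", "north", 2, 3),
]
DIMENSIONS = ("item", "store", "week")


def data_cube(rows, dimensions):
    """Sum the last field of each row over every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        # Each combination of k dimension indexes is one group-by in the cube.
        for dims in combinations(range(len(dimensions)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[-1]
            names = tuple(dimensions[i] for i in dims)
            cube[names] = dict(totals)
    return cube


cube = data_cube(SALES, DIMENSIONS)
print(cube[()])          # grand total (zero dimensions)
print(cube[("item",)])   # one-dimensional aggregation by item
```

With three dimensions the sketch produces 2^3 = 8 group-bys. A roll-up corresponds to moving from a key such as `("item", "store")` to `("item",)`, and a drill-down is the reverse. Computing every group-by eagerly, as done here, corresponds to the fully pre-computed option; real systems often materialize only part of the cube.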
TABLE 1.1
Data-Mining Tasks and Techniques

Knowledge Type                 Description                               Techniques

Segmentation or clustering     Determining a finite set of implicit      Cluster analysis
                               groups that describe the data

Classification                 Predict the class label that a set of     Bayesian classification
                               data belongs to based on some             Decision tree induction
                               training datasets                         Artificial neural networks
                                                                         Support vector machine (SVM)

Association                    Finding relationships among itemsets      Association rules
                               or association/correlation rules, or      Bayesian networks
                               predict the value of some attribute
                               based on the value of other attributes

Deviations                     Finding data items that exhibit           Clustering and other
                               unusual deviations from expectations      data-mining methods
                                                                         Outlier detection
                                                                         Evolution analysis

Trends and regression          Lines and curves summarizing the          Regression
analysis                       database, often over time                 Sequential pattern extraction

Generalizations                Compact descriptions of the data          Summary rules
                                                                         Attribute-oriented induction
Smyth (1996), as well as several of the chapters in this current volume for other
overviews and classifications of data-mining techniques.
Segmentation or clustering involves partitioning a selected set of data into mean-
ingful groupings or classes. It usually applies cluster analysis algorithms to examine
the relationships between data items and determine a finite set of implicit classes
so that the intraclass similarity is maximized and interclass similarity is minimized.
The commonly used data-mining technique of cluster analysis determines a set of
classes and assignments to these classes based on the relative proximity of data items
in the information space. Cluster analysis methods for data mining must accommo-
date the large data volumes and high dimensionalities of interest in data mining; this
usually requires statistical approximation or heuristics (see Farnstrom, Lewis and
Elkan 2000). Bayesian classification methods, such as AutoClass, determine classes
and a set of weights or class membership probabilities for data items (see Cheeseman
and Stutz 1996).
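As a rough illustration of proximity-based cluster analysis, the sketch below uses k-means (Lloyd's algorithm). This is one common clustering technique chosen here for brevity; it is not the specific method of any work cited above, and it is not the AutoClass approach. The sample points and the choice to seed the centroids with the first k points are assumptions made only for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical 2-D data with two visually obvious groups.
POINTS = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans(POINTS, k=2)
print(labels)
print(centroids)
```

Each pass compares every point with every centroid, which is one reason large, high-dimensional datasets push practical methods toward approximation or heuristics.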
Classification refers to finding rules or methods to assign data items into pre-
existing classes. Many classification methods have been developed over many years
of research in statistics, pattern recognition, machine learning, and data mining,
including decision tree induction, naïve Bayesian classification, neural networks,
support vector machines, and so on. Decision or classification trees are hierarchi-
cal rule sets that generate an assignment for each data item with respect to a set of
known classes. Entropy-based methods such as ID3 and C4.5 (Quinlan 1986, 1992)
derive these classification rules from training examples. Statistical methods include
the chi-square automatic interaction detector (CHAID) (Kass 1980) and the classi-
fication and regression tree (CART) method (Breiman et al. 1984). Artificial neural
networks (ANNs) can be used as nonlinear clustering and classification techniques.
Unsupervised ANNs such as Kohonen Maps are a type of neural clustering where
weighted connectivity after training reflects proximity in information space of the
input data (see Flexer 1999). Supervised ANNs such as the well-known feed forward/
back propagation architecture require supervised training to determine the appropri-
ate weights (response function) to assign data items into known classes.
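The entropy-based splitting used by ID3-style decision tree induction can be sketched as follows. This shows only the information gain criterion for choosing a split attribute, not a full tree builder and not the gain-ratio refinement of C4.5. The training records and the attribute names (`land_use`, `slope`, `flood_risk`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting the rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical training examples with a known class label (flood_risk).
TRAINING = [
    {"land_use": "urban", "slope": "flat", "flood_risk": "high"},
    {"land_use": "urban", "slope": "steep", "flood_risk": "high"},
    {"land_use": "forest", "slope": "flat", "flood_risk": "low"},
    {"land_use": "forest", "slope": "steep", "flood_risk": "low"},
]

# Pick the attribute whose split gives the largest information gain.
best = max(
    ["land_use", "slope"],
    key=lambda a: information_gain(TRAINING, a, "flood_risk"),
)
print(best)
```

In this toy dataset `land_use` separates the classes perfectly while `slope` carries no information, so a tree builder following this criterion would split on `land_use` first and then recurse on each branch.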
Associations are rules that predict the object relationships as well as the value
of some attribute based on the value of other attributes (Ester, Kriegel and Sander
1997). Bayesian networks are graphical models that maintain probabilistic depen-
dency relationships among a set of variables. These networks encode a set of con-
ditional probabilities as directed acyclic networks with nodes representing variables
and arcs extending from cause to effect. We can infer these conditional probabilities
from a database using several statistical or computational methods depending on the
nature of the data (see Buntine 1996; Heckerman 1997). Association rules are a particular
type of dependency relationship. An association rule is an expression $X \Rightarrow Y$
$(c\%, r\%)$ where $X$ and $Y$ are disjoint sets of items from a database, $c\%$ is the confidence
and $r\%$ is the support. Confidence is the proportion of database transactions
containing $X$ that also contain $Y$; in other words, the conditional probability $P(Y \mid X)$.
Support is the proportion of database transactions that contain both $X$ and $Y$, i.e., the union of
$X$ and $Y$, $P(X \cup Y)$ (see Hipp, Güntzer and Nakhaeizadeh 2000). Mining association
rules is a difficult problem since the number of potential rules is exponential
with respect to the number of data items. Algorithms for mining association rules
typically use breadth-first or depth-first search with branching rules based on mini-
mum confidence or support thresholds (see Agrawal et al. 1996; Hipp, Güntzer and
Nakhaeizadeh 2000).
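The two measures attached to an association rule can be computed directly from their definitions. The sketch below is a minimal illustration: the transactions and the function names `support` and `confidence` are hypothetical, and it evaluates a single given rule rather than searching the space of candidate rules.

```python
def support(transactions, items):
    """Proportion of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Proportion of transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(TRANSACTIONS, {"bread", "milk"})
rule_confidence = confidence(TRANSACTIONS, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

A mining algorithm would call functions like these many times, discarding candidate itemsets whose support falls below a minimum threshold before generating rules from the survivors. That pruning is what keeps the exponential search manageable.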
Deviations are data items that exhibit unexpected deviations or differences from
some norm. These cases are either errors that should be corrected/ignored or rep-
resent unusual cases that are worthy of additional investigation. Outliers are often
a byproduct of other data-mining methods, particularly cluster analysis. However,
rather than treating these cases as “noise,” special-purpose outlier detection meth-
ods search for these unusual cases as signals conveying valuable information (see
Breunig et al. 1999).
Trends are lines and curves fitted to the data, including linear and logistic regres-
sion analysis, that are very fast and easy to estimate. These methods are often com-
bined with filtering techniques such as stepwise regression. Although the data often
violate the stringent regression assumptions, violations are less critical if the esti-
mated model is used for prediction rather than explanation (i.e., estimated parame-
ters are not used to explain the phenomenon). Sequential pattern extraction explores
time series data looking for temporal correlations or pre-specified patterns (such as
curve shapes) in a single temporal data series (see Agrawal and Srikant 1995; Berndt
and Clifford 1996).
Generalization and characterization are compact descriptions of the database.
As the name implies, summary rules are a relatively small set of logical statements
that condense the information in the database. The previously discussed classifica-
tion and association rules are specific types of summary rules. Another type is a
characteristic rule; this is an assertion that data items belonging to a specified con-
cept have stated properties, where “concept” is some state or idea generalized from
particular instances (Klösgen and Żytkow 1996). An example is “all professors in
the applied sciences have high salaries.” In this example, “professors” and “applied
sciences” are high-level concepts (as opposed to low-level measured attributes such
as “assistant professor” and “computer science”) and “high salaries” is the asserted
property (see Han, Cai and Cercone 1993).
A powerful method for finding many types of summary rules is attribute-ori-
ented induction (also known as generalization-based mining). This strategy per-
forms hierarchical aggregation of data attributes, compressing data into increasingly
generalized relations. Data-mining techniques can be applied at each level to extract
features or patterns at that level of generalization (Han and Fu 1996). Background
knowledge in the form of a concept hierarchy provides the logical map for aggregat-
ing data attributes. A concept hierarchy is a sequence of mappings from low-level
to high-level concepts. It is often expressed as a tree whose leaves correspond to
measured attributes in the database with the root representing the null descriptor
(“any”). Concept hierarchies can be derived from experts or from data cardinality
analysis (Han and Fu 1996).
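The climb through a concept hierarchy can be sketched in a few lines. The example is an assumption-laden illustration rather than the algorithm of Han and Fu (1996): the hierarchy, the rows, and the helpers `generalize` and `induce` are hypothetical, and the hierarchy is stored as a simple child-to-parent mapping with "any" as the root.

```python
from collections import Counter

# Hypothetical concept hierarchy: each concept maps to its parent; "any" is the root.
HIERARCHY = {
    "assistant professor": "professor",
    "full professor": "professor",
    "professor": "any",
    "computer science": "applied sciences",
    "civil engineering": "applied sciences",
    "history": "humanities",
    "applied sciences": "any",
    "humanities": "any",
}


def generalize(value, levels=1):
    """Replace a value by its ancestor `levels` steps up the hierarchy."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)  # the root maps to itself
    return value


def induce(rows, levels=1):
    """Generalize every attribute and merge identical tuples, keeping counts."""
    return Counter(tuple(generalize(v, levels) for v in row) for row in rows)


# Hypothetical low-level records: (rank, department).
ROWS = [
    ("assistant professor", "computer science"),
    ("full professor", "civil engineering"),
    ("full professor", "history"),
]
relation = induce(ROWS)
print(relation)
```

Each additional level compresses the relation further, until every tuple collapses to the null descriptor. The counts kept alongside each generalized tuple are what allow data-mining techniques to be applied at any chosen level of generalization.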
A potential problem that can arise in a data-mining application is the large num-
ber of patterns generated. Typically, only a small proportion of these patterns will
encapsulate interesting knowledge. The vast majority may be trivial or irrelevant. A
data-mining engine should present only those patterns that are interesting to particu-
lar users. Interestingness measures are quantitative techniques that separate inter-
esting patterns from trivial ones by assessing the simplicity, certainty, utility, and
novelty of the generated patterns (Silberschatz and Tuzhilin 1996; Tan, Kumar and
Srivastava 2002). There are many interestingness measures in the literature; see Han
and Kamber (2006) for an overview.
Keim and Kriegel (1994) and Lee and Ong (1996) describe software systems
that incorporate visualization techniques for supporting database querying and data
mining. Keim and Kriegel (1994) use visualization to support simple and complex
query specification, OLAP, and querying from multiple independent databases. Lee
and Ong’s (1996) WinViz software uses multidimensional visualization techniques
to support OLAP, query formulation, and the interpretation of results from unsuper-
vised (clustering) and supervised (decision tree) segmentation techniques. Fayyad,
Grinstein and Wierse (2001) provide a good overview of visualization methods for
data mining.