Data Mining: Concepts and Techniques, 2nd Ed.
Earl Cox
Location-Based Services
Jochen Schiller and Agnès Voisard
Soumen Chakrabarti
Jim Melton
Terry Halpin
Joe Celko
Joe Celko
Richard T. Snodgrass
Jim Melton
Don Chamberlin
Distributed Algorithms
Nancy A. Lynch
University of I
AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO
Application submitted
Preface
Chapter 1 Introduction
Relational Databases
Data Warehouses
Transactional Databases
Cluster Analysis
Outlier Analysis
Evolution Analysis
1.7
1.9
2.3.2 Noisy Data
2.3.3 Data Cleaning as a Process
Data Integration
Data Transformation
Data Reduction
2.5.1
2.5.2
Dimensionality Reduction
Numerosity Reduction
Summary
Exercises
3.4.1
3.4.2
3.5.1
Discovery-Driven Exploration
5.2.3
5.2.6
5.3.1
6.4.3
Classification by Backpropagation
A Multilayer Feed-Forward Ne
Backpropagation
6.7
k-Nearest-Neighbor Classifier
6.10.2 Rough Set Approach
6.10.3
Prediction
7.2
7.2.4
CLIQUE: A Dimension-Growt
Statistical Distribution-Based O
Frequent-Pattern Mining in Da
8.2
9.2 Social Network Analysis
9.2.2 Characteristics of Social Networks
9.2.3 Link Mining: Tasks and Challenges
9.2.4 Mining on Social Networks
9.3
9.3.3
10.3
Multidimensional Analysis of M
Automatic Classification of We
Appendix
OLE D
An Introduction to Microsoft's
Bibliography
Index
This book has several strong features that set it apart [...] mining.
It presents a very broad yet in-depth coverage fro[...],
especially regarding several recent research topics on [...]
mining, social network analysis, and multirelational data [...]. The [...]
advanced topics are written to be as self-contained [...] in
order of interest by the reader. All of the major m[...] are presented.
Because we take a database point of view to data [...], many
important topics in data mining, such as scalable a[...] OLAP
analysis, that are often overlooked or minimally [...].
To the Instructor
To the Professional
1.1
The abundance of data, coupled with the need for [...] been described as a
data rich but information poor situation. [...] tremendous amount of data, collected and
stored in large and [...] far exceeded our human ability for comprehension
without [...]. As a result, data collected in large data repositories become [...] that are [...].
[Figure: diagram labeled Knowledge, Database, Data Warehouse, World Wide Web, and Other Repos[...].]
1.3
AllElectronics).
[Figure: data warehouse diagram with steps labeled Clean, Transform, Load, and Refresh.]
[Figure: data cube with dimensions address (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1, Q2, Q3, Q4), and item (types: home entertainment, computer, phone, security). Panel (a) shows the cube; panel (b) shows a drill-down on time (months: Jan, Feb, March) and a second operation that groups address by countries.]
and so on.
Object-Relational Databases
Data Streams
Can Be Mined?
$\mathit{buys}(X, \text{"computer"}) \Rightarrow \mathit{buys}(X, \text{"software"})$
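In general, the class labels are not present in the training data simply because they are not
known to begin with. Clustering can be used to generate such labels. The objects are clustered
or grouped based on the principle of maximizing the intraclass similarity and minimizing the
interclass similarity. That is, clusters of objects are formed so that objects within a cluster have
high similarity in comparison to one another, but are very dissimilar to objects in other clusters.
Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
Clustering can also facilitate taxonomy formation, that is, the organization of observations
into a hierarchy of classes that group similar events together.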
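To make the clustering principle concrete, the following minimal Python sketch compares the average distance between objects in the same cluster with the average distance between objects in different clusters. The points, the cluster names, and the use of Euclidean distance as a stand-in for dissimilarity are all assumptions made for illustration; they are not taken from the text.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D observations, already grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5)],
    "B": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)],
}

def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    d = [dist(p, q)
         for points in clusters.values()
         for p, q in combinations(points, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    d = [dist(p, q)
         for a, b in combinations(clusters.values(), 2)
         for p in a
         for q in b]
    return sum(d) / len(d)

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A grouping that follows the principle should show a small intra-cluster distance (high similarity within a cluster) relative to the inter-cluster distance.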
1.5
So, you may ask, are all of the patterns interesting? [...] only a small fraction of the patterns
potentially generated would [...].
This raises some serious questions for data mining. [...] pattern interesting? Can a data
mining system generate [...]? Can a data mining system generate only interesting patterns?
$\mathit{confidence}(X \Rightarrow Y) = P(Y \mid X)$
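As an illustration of the confidence measure, the sketch below estimates P(Y | X) from a small list of transactions by dividing the fraction of transactions containing both X and Y by the fraction containing X. The transactions and the helper names `support` and `confidence` are hypothetical and chosen only for this example.

```python
# Hypothetical transactions; each is the set of items one customer bought.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate P(Y | X) as support(X and Y together) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

conf = confidence({"computer"}, {"software"}, transactions)
print(f"confidence(computer => software) = {conf:.2f}")
```

Here two of the three transactions that contain a computer also contain software, so the estimated confidence is 2/3.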
[Figure: Data Mining at the center, surrounded by Database technology, Statistics, Information science, Visualization, and Other disciplines.]
Background knowledge
Concept hierarchies
Simplicity
Novelty
tasks, from data characterization [...]. [...] task has different requirements. The design
of an effective [...] requires a deep understanding of the power, limitations [...]
group by T.cust_ID
1.9
similarity analysis). These tasks may [...] different ways and require the
development of numerous [...]
Exercises
Is it another hype?
Bibliographic Notes
[Figure: data preprocessing diagram with panels labeled Data transformation (example values 22, 32, 100, 59, 48) and Data reduction (a table of attributes A1, A2, A3, ... by transactions T1, T2, T3, T4, ..., T2000, reduced to fewer attributes and transactions).]
\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
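\[ \mathit{median} = L_1 + \left( \frac{N/2 - \left(\sum \mathit{freq}\right)_l}{\mathit{freq}_{\mathit{median}}} \right) \mathit{width} \]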
\[ \mathit{mean} - \mathit{mode} = 3 \times (\mathit{mean} - \mathit{median}) \]
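This implies that the mode for unimodal frequency curves that are moderately skewed can easily
be computed if the mean and median values are known. In a unimodal
frequency curve with perfect symmetric data distribution, the mean, median, and mode are all at
the same center value. However, data in most real applications are not
symmetric. They may instead be either positively skewed, where the mode occurs at a value that
is smaller than the median, or negatively skewed, where the mode occurs at a value greater than the median
(Figure 2.2(c)).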
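The relationship among the three measures can be checked on a small sample. The sketch below uses Python's standard `statistics` module on a hypothetical, moderately right-skewed data set (not from the text) and estimates the mode by rearranging the empirical relation mean - mode = 3 × (mean - median). Since the relation is only approximate, the estimate need not equal the observed mode exactly.

```python
import statistics

# Hypothetical, moderately right-skewed sample (not from the text).
data = [1, 2, 2, 2, 3, 3, 4, 5, 9]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Rearranging mean - mode = 3 * (mean - median) gives an estimate of the mode.
estimated_mode = mean - 3 * (mean - median)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"mode estimated from mean and median: {estimated_mode:.2f}")
```

In this positively skewed sample the mode falls below the median, which in turn falls below the mean.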
$\mathit{IQR} = Q_3 - Q_1$.
quartile.
Minimum, Q1, Median, Q3, Maximum.
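A five-number summary and the interquartile range can be computed along the following lines. This is a sketch that assumes Python's `statistics.quantiles` with its default interpolation method; other quartile conventions can give slightly different values for Q1 and Q3. The price list is hypothetical, with each value counted once.

```python
import statistics

# Hypothetical unit prices, each counted once.
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

minimum, q1, median, q3, maximum = five_number_summary(prices)
iqr = q3 - q1  # interquartile range: Q3 - Q1
print(minimum, q1, median, q3, maximum, iqr)
```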
Figure 2.3 Boxplot for the unit price data for items sold
at fo[...] time period. [Plot omitted; boxes are labeled Branch 1, Branch 2, Branch 3.]
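\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \left[ \sum x_i^2 - \frac{1}{N} \left( \sum x_i \right)^2 \right] \]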
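The two algebraically equivalent forms of the variance can be verified numerically. The following sketch uses a hypothetical sample and compares both forms against `statistics.pvariance`, which divides by N in the same way.

```python
import math
import statistics

# Hypothetical observations (not from the text).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(data)
mean = sum(data) / n

# Definition: average squared deviation from the mean.
var_definition = sum((x - mean) ** 2 for x in data) / n

# Equivalent form: needs only the sum and the sum of squares (one pass over the data).
var_one_pass = (sum(x * x for x in data) - sum(data) ** 2 / n) / n

std_dev = math.sqrt(var_definition)
print(f"variance={var_definition:.2f} standard deviation={std_dev:.2f}")
```

The second form is convenient for large data sets because the two sums can be accumulated in a single scan, although it can lose precision when the values are very large relative to their spread.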
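Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical
data presentation software packages, there are other popular types of graphs for the
display of data summaries and distributions. These include histograms, quantile plots, q-q
plots, scatter plots, and loess curves. Such graphs are very helpful for the visual
inspection of your data.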
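Figure 2.4 shows a histogram for the data set of Table 2.1, where buckets are defined by equal-width ranges representing $20 increments and the frequency is the count of items sold.
Histograms are at least a century old and are a widely used univariate graphical method.
However, they may not be as effective as the quantile plot, q-q plot, and boxplot methods for
comparing groups of univariate observations.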
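Equal-width bucketing of the kind used for such a histogram can be sketched as follows. The function name, the starting value of 40, and the price list are assumptions for illustration. Each price is counted once here, whereas a histogram of items sold would weight each price by its sales count.

```python
from collections import Counter

def equal_width_histogram(values, width, start=0):
    """Count values in buckets [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): c
        for k, c in sorted(counts.items())
    }

# Hypothetical unit prices, bucketed in $20 increments starting at $40.
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]
hist = equal_width_histogram(prices, width=20, start=40)

for (low, high), count in hist.items():
    print(f"[{low}, {high}): {count}")
```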
Table 2.1 A set of unit price data for items sold at a branch [...]
Unit price: 40, 43, 47, ..., 74, 75, 78, ..., 115, 117, 120
there may not be a value with exactly a fraction [...]. Note that the
0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is
Q3.
Let
\[ f_i = \frac{i - 0.5}{N} \]
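The coordinates of a quantile plot follow directly from this definition: sort the data in increasing order and pair the i-th value with f_i = (i - 0.5)/N. A minimal sketch, with a hypothetical function name and sample:

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

# Hypothetical sample of four observations.
points = quantile_plot_points([30, 10, 20, 40])
for f, x in points:
    print(f"f = {f:.3f}, x = {x}")
```

The f_i values step from 1/(2N), slightly above zero, to 1 - 1/(2N), slightly below one, in equal increments of 1/N.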
Figure 2.5 A quantile plot for the unit price data of Table 2.1.
compare their Q1, median, Q3, and other $f_i$ values at a glance. [...] plot for
the unit price data of Table 2.1.
[Plots omitted; surviving axis labels: Branch 1 (unit p[...]), Items sold, and Unit price.]
Figure 2.7 A scatter plot for the data set of Table 2.1.
Figure 2.10 A loess curve for the data set of Table 2.1.
2.3 Data Cleaning
Bin 1: 4, 4, 15
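One way a bin such as the one above can arise is smoothing by bin boundaries, in which each value in a bin is replaced by the closer of the bin's minimum and maximum. The sketch below assumes that the original contents of the bin were 4, 8, 15. That input is an assumption, since only the smoothed result survives in the text.

```python
def smooth_by_bin_boundaries(bin_values):
    """Replace each value by the closer of the bin's minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    return [low if v - low <= high - v else high for v in bin_values]

# Assumed original bin contents (hypothetical input).
smoothed = smooth_by_bin_boundaries([4, 8, 15])
print("Bin 1:", ", ".join(str(v) for v in smoothed))
```

The value 8 lies 4 units from the lower boundary and 7 units from the upper one, so it is replaced by 4.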
\[ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \]
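\[ \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93 \]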
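The chi-square statistic can be computed from a contingency table by deriving each expected count from the row and column totals. In the sketch below, the 2 × 2 table of observed counts is an assumption, chosen so that the expected counts for the first column come out to the 90 and 210 that appear in the worked computation. The function name is hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of the two attributes.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2

# Assumed observed counts (hypothetical layout: rows and columns are
# the values of two categorical attributes).
observed = [
    [250, 200],
    [50, 1000],
]
chi2 = chi_square(observed)
print(f"chi-square = {chi2:.2f}")
```

Summing unrounded terms can differ in the last digit from a hand computation that rounds each term to two decimals first.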
\[ v' = \frac{v - \min_A}{\max_A - \min_A} (\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
[...] transformed to $\frac{73{,}600 - 12{,}000}{[\ldots]}$ [...]
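Min-max normalization can be sketched in a few lines. The function name and the default target range [0.0, 1.0] are assumptions. In the usage example, the value 73,600 and the minimum 12,000 come from the text, while the maximum of 98,000 is assumed for illustration.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# 73,600 and 12,000 appear in the text; the maximum of 98,000 is an assumption.
normalized = min_max_normalize(73_600, min_a=12_000, max_a=98_000)
print(f"{normalized:.3f}")
```

Because the mapping is linear, the relationships among the original values are preserved. A later value that falls outside the original [min_a, max_a] range would map outside the target range.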
normalized to $v'$ by computing