Data Mining Data Warehousing and Knowledge Discovery
Overview
Why Data Mining?
Data Mining concepts
Data Mining algorithms
  Tabular data mining: Association, Classification and Clustering
  Sequence data mining
  Streaming data mining
Security
Data Mining
Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
No query, but an interestingness criterion
Data Mining
Data + Interestingness criteria = Hidden patterns
Type of data + Type of interestingness criteria = Type of patterns
Type of Data
Tabular
  Relational
  Multi-dimensional
Spatial
Temporal
Tree (Ex: XML data)
Graphs (Ex: WWW, BioMolecular data)
Sequence (Ex: DNA, activity logs)
Text, Multimedia
Type of Interestingness
Data (ten transactions of three items each):
T1:  Bag, Uniform, Crayons
T2:  Books, Bag, Uniform
T3:  Bag, Uniform, Pencil
T4:  Bag, Pencil, Books
T5:  Uniform, Crayons, Bag
T6:  Bag, Pencil, Books
T7:  Crayons, Uniform, Bag
T8:  Books, Crayons, Bag
T9:  Uniform, Crayons, Pencil
T10: Pencil, Uniform, Books
Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag,Uniform}, {Bag,Crayons}, {Bag,Pencil}, {Bag,Books}, {Uniform,Crayons}, {Uniform,Pencil}, {Pencil,Books}
Interesting 3-element item-sets: {Bag,Uniform,Crayons}
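The interesting item-sets listed above can be reproduced with a short level-wise search. This is a minimal sketch, not the deck's own implementation. It assumes the item grid is read column-wise as ten three-item transactions and that an item-set is interesting when it occurs in at least 3 of them; `TRANSACTIONS`, `MIN_SUPPORT`, `support` and `frequent_itemsets` are illustrative names.

```python
from itertools import combinations

# Assumption: the item grid, read column-wise, gives ten transactions.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]
MIN_SUPPORT = 3  # assumed threshold: at least 3 of the 10 transactions


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Level-wise (apriori-style) search; returns {size: set of frozensets}."""
    items = sorted(set().union(*transactions))
    current = {frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= min_support}
    result, size = {}, 1
    while current:
        result[size] = current
        size += 1
        # Join step: unions of two frequent sets that have the next size.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        current = {c for c in candidates
                   if support(c, transactions) >= min_support}
    return result


if __name__ == "__main__":
    for size, sets in frequent_itemsets(TRANSACTIONS, MIN_SUPPORT).items():
        print(size, sorted(sorted(s) for s in sets))
```

The prune step is what keeps the search cheap: a 3-element candidate is only counted against the data if all of its 2-element subsets were already found to be frequent.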
1. Use apriori to generate frequent itemsets of different sizes
2. At each iteration divide each frequent itemset X into two parts LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X)/support(LHS)
4. Discard all rules whose confidence is less than minconf.
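Steps 2 to 4 can be sketched as follows for a single frequent itemset. This is an illustration under assumptions: the transactions are the item grid read column-wise, supports are counted directly instead of being reused from apriori, and `MIN_CONF = 0.6` is an assumed value because the slides do not state minconf. The names `rules_from` and `support` are hypothetical.

```python
from itertools import combinations

# Assumption: the item grid, read column-wise, gives ten transactions.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]
MIN_CONF = 0.6  # assumed minconf; not given in the slides


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def rules_from(itemset, min_conf):
    """Split frequent itemset X into each LHS/RHS pair; keep confident rules."""
    x = frozenset(itemset)
    rules = []
    for r in range(1, len(x)):
        for lhs in combinations(sorted(x), r):
            lhs = frozenset(lhs)
            confidence = support(x) / support(lhs)
            if confidence >= min_conf:
                rules.append((sorted(lhs), sorted(x - lhs), confidence))
    return rules


if __name__ == "__main__":
    for lhs, rhs, conf in rules_from({"Bag", "Uniform", "Crayons"}, MIN_CONF):
        print(lhs, "->", rhs, round(conf, 2))
```

Raising `MIN_CONF` discards the weaker rules first, since confidence is the only filter applied once the itemset itself is known to be frequent.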
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}
People who buy a school uniform and a set of crayons are likely to buy a school bag.
People who buy just a set of crayons are likely to buy a school bag and a school uniform as well.
Classification Techniques
Decision Tree Identification

Outlook    Temp  Play?
Sunny      30    Yes
Overcast   15    No
Sunny      16    Yes
Cloudy     27    Yes
Overcast   25    Yes
Overcast   17    No
Cloudy     17    No
Cloudy     35    Yes
Classification Techniques
Hunt's method for decision tree identification:
Given N element types and m decision classes:
1. For i ← 1 to N do
   1. Add element i to the (i-1)-element item-sets from the previous iteration
   2. Identify the set of decision classes for each item-set
   3. If an item-set has only one decision class, then that item-set is done; remove it from subsequent iterations
2. done
Classification Techniques
Decision Tree Identification Example

Outlook    Temp      Play?
Sunny      Warm      Yes
Overcast   Chilly    No
Sunny      Chilly    Yes
Cloudy     Pleasant  Yes
Overcast   Pleasant  Yes
Overcast   Chilly    No
Cloudy     Chilly    No
Cloudy     Warm      Yes

Decision classes by Outlook:
Sunny      Yes
Cloudy     Yes/No
Overcast   Yes/No
Classification Techniques
Decision Tree Identification Example

Decision classes after adding Temp to the Cloudy item-set:
Cloudy     Warm      Yes
Cloudy     Pleasant  Yes
Cloudy     Chilly    No

Decision classes after adding Temp to the Overcast item-set:
Overcast   Chilly    No
Overcast   Pleasant  Yes
Classification Techniques
Decision Tree Identification Example

Outlook (Yes/No)
  Sunny: Yes
  Cloudy (Yes/No)
    Warm: Yes
    Pleasant: Yes
    Chilly: No
  Overcast (Yes/No)
    Chilly: No
    Pleasant: Yes
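The tree above can be derived mechanically. The sketch below is a recursive formulation of the same top-down idea, not a literal transcription of the item-set iteration: it adds one attribute per level and stops as soon as a group of rows has a single decision class. It assumes the example table with temperatures already discretised into Warm, Pleasant and Chilly; `ROWS`, `ATTRIBUTES` and `build_tree` are illustrative names.

```python
from collections import defaultdict

# Assumption: the example table with temperatures already discretised.
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]
ATTRIBUTES = ["Outlook", "Temp"]  # order in which attributes are considered


def build_tree(rows, depth=0):
    """Return a class label, or a dict mapping attribute values to subtrees."""
    classes = {row[-1] for row in rows}
    if len(classes) == 1:
        # Only one decision class left: this branch is done.
        return classes.pop()
    if depth == len(ATTRIBUTES):
        # Attributes exhausted without a clear decision, e.g. "Yes/No".
        return "/".join(sorted(classes, reverse=True))
    groups = defaultdict(list)
    for row in rows:
        groups[row[depth]].append(row)
    return {value: build_tree(group, depth + 1)
            for value, group in groups.items()}


if __name__ == "__main__":
    print(build_tree(ROWS))
```

Swapping the two entries of `ATTRIBUTES` (and the matching column order in `ROWS`) yields a different tree, which is one way to see the order sensitivity of this technique.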
Classification Techniques
Decision Tree Identification Example
Top-down technique for decision tree identification
The decision tree created is sensitive to the order in which items are considered
If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.
Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
[Figure: example point (Cloudy, Pleasant, Play); axis labels: Sunny, Cloudy; Warm, Pleasant, Chilly; Play, Don't Play]
Clustering Techniques
General Strategy:
1. Draw a graph connecting items which are close to one another with edges.
2. Partition the graph into maximally connected subcomponents.
   1. Construct an MST for the graph
   2. Merge items that are connected by the minimum-weight edges of the MST into a cluster
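One way to read the strategy above is sketched here. It builds the complete graph of Euclidean distances, extracts an MST with Kruskal's algorithm, and merges points joined by MST edges no longer than a cut-off. The cut-off `max_edge` is an assumption, since the slides do not say how light an edge must be to trigger a merge; all function names are illustrative.

```python
import math
from itertools import combinations


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Kruskal's algorithm on the complete graph of Euclidean distances."""
    edges = sorted((math.dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(len(points)), 2))
    parent = list(range(len(points)))
    mst = []
    for weight, a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:  # the edge joins two components: keep it
            parent[root_a] = root_b
            mst.append((weight, a, b))
    return mst


def mst_clusters(points, max_edge):
    """Merge points joined by MST edges no longer than max_edge (assumed)."""
    parent = list(range(len(points)))
    for weight, a, b in mst_edges(points):
        if weight <= max_edge:
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
    print(mst_clusters(pts, max_edge=2.0))
```

The result lists point indices per cluster. Dropping the heavy MST edges is equivalent to merging along the light ones, so either view gives the same partition.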
Clustering Techniques
Clustering types:
Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level
Clustering Techniques
Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, ..., xn, and threshold t:
1. j ← 1, k ← 1, Clusters = {}
2. Repeat
   1. Find the nearest neighbour of xj
   2. Let the nearest neighbour be in cluster m
   3. If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
   4. j ← j+1
3. until j > n
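The loop above might be written as follows. This is a sketch under two assumptions: distances are Euclidean, and the nearest neighbour of xj is searched only among elements that already belong to a cluster (so the first element always opens a cluster). The function name and the use of integer cluster labels are illustrative.

```python
import math


def nearest_neighbour_clustering(points, t):
    """Return one cluster label per point, in input order."""
    labels = []  # labels[j] is the cluster index of points[j]
    k = 0        # number of clusters created so far
    for j, p in enumerate(points):
        if j == 0:
            # No clustered neighbour exists yet: open the first cluster.
            labels.append(k)
            k += 1
            continue
        # Nearest neighbour among the points that are already clustered.
        nearest = min(range(j), key=lambda i: math.dist(p, points[i]))
        if math.dist(p, points[nearest]) > t:
            labels.append(k)  # too far away: create a new cluster
            k += 1
        else:
            labels.append(labels[nearest])  # join the neighbour's cluster m
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 2)]
    print(nearest_neighbour_clustering(pts, t=1.5))
```

Because each element is placed exactly once, the outcome depends on the order in which the elements are presented.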
Clustering Techniques
Iterative partitional clustering:
Given n elements x1, x2, ..., xn, and k clusters, each with a center:
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the centroid of each cluster
3. Repeat the above two steps with the new centroids until the algorithm converges
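The three steps above correspond to the familiar k-means loop. The following is a minimal sketch, assuming Euclidean distance, caller-supplied initial centers, and convergence defined as "the centers stop moving"; `k_means` and `max_iter` are illustrative names rather than anything from the slides.

```python
import math


def k_means(points, centers, max_iter=100):
    """Return (final centers, clusters) for the given initial centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 1: assign each point to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Step 2: recompute centroids (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        # Step 3: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(k_means(pts, centers=[(0, 0), (10, 0)]))
```

The result depends on the initial centers, so in practice the loop is often run from several starting points and the best partition kept.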
The order of items within an itemset does not matter, but the order of itemsets matters
A subsequence is a sequence with some itemsets deleted
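The subsequence definition above can be checked with a few lines. This sketch follows the definition literally: itemsets are compared as sets (so item order inside an itemset is ignored) and a candidate matches only if its itemsets appear in the same order in the longer sequence. The function name and the sample sequences are hypothetical.

```python
def is_subsequence(candidate, sequence):
    """True if candidate can be obtained by deleting itemsets from sequence."""
    remaining = iter(sequence)
    # Each candidate itemset must match some later itemset of the sequence;
    # the shared iterator guarantees matches are found in order.
    return all(any(set(c) == set(s) for s in remaining) for c in candidate)


if __name__ == "__main__":
    seq = [{"a"}, {"b", "c"}, {"a"}, {"d"}]
    print(is_subsequence([{"a"}, {"c", "b"}, {"d"}], seq))
    print(is_subsequence([{"d"}, {"a"}], seq))
```

Some formulations of sequential pattern mining relax the set comparison to containment (`<=`); that variant is not what the definition here states.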
Interesting 3-sequences = {}
Most specific state machine
[Figure: state-machine diagrams with transitions labelled a, b and c; example sequences: aabcb, aac, aabc]
Data Warehousing
A platform for online analytical processing (OLAP)
Warehouses collect transactional data from several transactional databases and organize it in a fashion amenable to analysis
Also called data marts
A critical component of the decision support system (DSS) of enterprises
Some typical DW queries:
  Which item sells best in each region that has retail outlets?
  Which advertising strategy is best for South India?
  Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?
Data Warehousing
[Figure: data warehousing architecture; labels: OLTP, Data Cleaning, Inventory]
OLTP vs OLAP
Transactional Data (OLTP)                    Analysis Data (OLAP)
Small or medium size databases               Very large databases
Transient data                               Archival data
Frequent insertions and updates              Infrequent updates
Small query shadow                           Very large query shadow
Normalization important to handle updates    De-normalization important to handle queries
Data Cleaning
Performs logical transformation of transactional data to suit the data warehouse
Model of operations → model of enterprise
Usually a semi-automatic process
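As a small illustration of moving from a model of operations to a model of the enterprise, the sketch below rolls per-order rows up into per-customer totals. The column names echo the Orders (Order_id, Price, Cust_id) and Sales (Cust_id, Tot_sales) labels of the data cleaning figure; the rows themselves and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactional rows: (order_id, price, cust_id)
ORDERS = [
    (1, 250.0, "C1"),
    (2, 100.0, "C2"),
    (3, 50.0, "C1"),
]


def total_sales_per_customer(orders):
    """Summarise operational order rows into one warehouse row per customer."""
    totals = defaultdict(float)
    for _order_id, price, cust_id in orders:
        totals[cust_id] += price
    return dict(totals)


if __name__ == "__main__":
    # Each entry plays the role of a (Cust_id, Tot_sales) row.
    print(total_sales_per_customer(ORDERS))
```

A real cleaning step would also reconcile identifiers and formats across source databases, which is part of why the process is usually only semi-automatic.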
Data Cleaning
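[Figure: table Orders (Order_id, Price, Cust_id); data warehouse labels: Customers, Products, Orders, Inventory, Price, Time; table Sales (Cust_id, Cust_prof, Tot_sales)]
[Figure: multi-dimensional view with a Time dimension; operations shown: Drill-down, Collapse dimensions]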
Star Schema
[Figure: a central fact table linked to four dimension tables, each labelled Dim Tbl_1]
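A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table names `dim_region`, `dim_item` and `fact_sales`, their columns and the sample rows are all hypothetical. It joins the fact table to two dimension tables to answer a query of the kind "which item sells best in each region".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_item   (item_id   INTEGER PRIMARY KEY, name TEXT);
-- The fact table holds measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    region_id INTEGER REFERENCES dim_region(region_id),
    item_id   INTEGER REFERENCES dim_item(item_id),
    quantity  INTEGER
);
INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_item   VALUES (1, 'Bag'), (2, 'Uniform');
INSERT INTO fact_sales VALUES (1, 1, 5), (1, 2, 9), (2, 1, 7), (2, 2, 3), (2, 1, 4);
""")

# Total quantity per (region, item), largest totals first within each region.
totals = conn.execute("""
    SELECT r.name, i.name, SUM(f.quantity) AS qty
    FROM fact_sales f
    JOIN dim_region r ON r.region_id = f.region_id
    JOIN dim_item   i ON i.item_id   = f.item_id
    GROUP BY r.name, i.name
    ORDER BY r.name, qty DESC
""").fetchall()

# Keep the first (largest) row seen for each region.
best_by_region = {}
for region, item, qty in totals:
    best_by_region.setdefault(region, (item, qty))

print(best_by_region)
```

Keeping descriptive attributes in the dimension tables and only keys and measures in the fact table is what gives the schema its star shape.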
References
R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
R. Agrawal, R. Srikant. Mining Sequential Patterns. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. The Quest Data Mining System. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
Surajit Chaudhuri, Umesh Dayal. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
Jennifer Widom. Research Problems in Data Warehousing. Proc. of Int'l Conf. on Information and Knowledge Management, 1995.
A. Shoshani. OLAP and Statistical Databases: Similarities and Differences. Proc. of ACM PODS 1997.
Panos Vassiliadis, Timos Sellis. A Survey on Logical Models for OLAP Databases. ACM SIGMOD Record.
M. Gyssens, Laks V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. Proc. of VLDB 1997, Athens, Greece.
Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions Based on Consistent Patterns. Proc. of CoopIS 1999, Edinburgh, UK.
Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000, Como, Italy.