Data Mining and Decision Trees: Prof. Sin-Min Lee Department of Computer Science
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s:
Data mining and data warehousing, multimedia databases, and Web databases
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Other Applications
Text mining (news group, email, documents) and Web analysis.
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
Associations/co-relations between product sales
Corporate Analysis and Risk Management
Finance planning and asset evaluation:
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions
group customers into classes and apply a class-based pricing procedure
set pricing strategy in a highly competitive market
Approach
use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage accidents to collect on insurance
money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of doctors and ring of references
Retail
Analysts estimate that 38% of retail shrink is due to dishonest employees.
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
[Figure: knowledge discovery process; labels: Databases, Data Integration, Data Cleaning, Data Mining]
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation:
  Find useful features, dimensionality/variable reduction, invariant representation.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation:
  visualization, transformation, removing redundant patterns, etc.
The company is especially interested in the characteristics of insurants with a highly deviating claim behavior. With data mining, these so-called risk profiles can be discovered, and the company can use this information to adapt its premium policy.
This allows the company to mail its prospects selectively, thus maximizing the response. For example:
1. Company X sends a mailing to a number of prospects.
2. The response is 2%.
Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Decision Trees
A decision tree is a special case of a state-space graph. It is a rooted tree in which each internal node corresponds to a decision, with a subtree at these nodes for each possible outcome of the decision.
Decision trees can be used to model problems in which a series of decisions leads to a solution.
The possible solutions of the problem correspond to the paths from the root to the leaves of the decision tree.
Example: The n-queens problem
How can we place n queens on an n×n chessboard so that no two queens can capture each other?
A queen can move any number of squares horizontally, vertically, and diagonally. Here, the possible target squares of the queen Q are marked with an x.
[Figure: chessboard with a queen Q; every square in its row, column, and diagonals is marked with an x]
Question: How many possible configurations of 4×4 chessboards containing 4 queens are there?
Answer: There are 16!/(12!·4!) = (13·14·15·16)/(2·3·4) = 13·7·5·4 = 1820 possible configurations. Shall we simply try them out one by one until we encounter a solution? No, it is generally useful to think about a search problem more carefully and discover constraints on the problem's solutions. Such constraints can dramatically reduce the size of the relevant state space.
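To make the size of the unconstrained state space concrete, here is a minimal brute-force sketch. It enumerates every way of putting 4 queens on 16 squares and keeps the configurations in which no two queens attack each other. The names `attacks`, `configurations`, and `solutions` are illustrative choices, not part of the original slides.

```python
from itertools import combinations
from math import comb

N = 4
squares = [(row, col) for row in range(N) for col in range(N)]

def attacks(a, b):
    """True if queens on squares a and b share a row, column, or diagonal."""
    (r1, c1), (r2, c2) = a, b
    return r1 == r2 or c1 == c2 or abs(r1 - r2) == abs(c1 - c2)

# Every choice of 4 of the 16 squares is one configuration.
configurations = list(combinations(squares, N))

# A configuration is a solution if no pair of its queens attacks each other.
solutions = [
    config for config in configurations
    if not any(attacks(a, b) for a, b in combinations(config, 2))
]

print(len(configurations), comb(16, 4))  # the two counts should agree
print(len(solutions))                    # how many configurations are solutions
```

Trying all configurations works for a 4×4 board, but the count grows quickly with n, which is why the slides look for constraints next.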
Obviously, in any solution of the n-queens problem, there must be exactly one queen in each column of the board. Otherwise, the two queens in the same column could capture each other. Therefore, we can describe the solution of this problem as a sequence of n decisions:
[Figure: decision tree for the 4-queens problem with levels "place 1st queen", "place 2nd queen", "place 3rd queen", "place 4th queen"; each node shows the queens placed so far]
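The sequence of n decisions can be sketched as a depth-first walk through the decision tree: one queen per column, one branch per row, and a branch is cut off as soon as the new queen could be captured. This is a minimal backtracking sketch, not necessarily the procedure used in the lecture; `solve_n_queens`, `safe`, and `place` are hypothetical names.

```python
def solve_n_queens(n):
    """Return all solutions; each is a tuple giving the queen's row in every column."""
    solutions = []

    def safe(rows, new_row):
        # The new queen goes into column len(rows); compare it with earlier columns.
        new_col = len(rows)
        for col, row in enumerate(rows):
            if row == new_row or abs(row - new_row) == new_col - col:
                return False
        return True

    def place(rows):
        if len(rows) == n:            # a leaf: all n decisions have been made
            solutions.append(tuple(rows))
            return
        for row in range(n):          # one branch per possible outcome
            if safe(rows, row):
                rows.append(row)
                place(rows)           # descend into the subtree
                rows.pop()            # backtrack and try the next row

    place([])
    return solutions

for solution in solve_n_queens(4):
    print(solution)
```

Each printed tuple is one root-to-leaf path of the decision tree; paths that violate a constraint are abandoned before they reach a leaf.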
Neural Network
Many inputs and a single output
Trained on signal and background sample
Well understood and mostly accepted in HEP
Decision Tree
Find a good split
Sufficient statistics to compute info gain: count matrix
outlook   temperature  humidity  windy  play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no
[Figure: count matrices for the attributes outlook, temperature, humidity (high, normal), and windy (FALSE, TRUE)]
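As a sketch of why the count matrix is a sufficient statistic, the following code builds, for each attribute of the weather data, a matrix of class counts per attribute value and computes the information gain from those counts alone. The entropy-based definition of info gain is assumed here, since the slides do not spell out the formula; the function names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")
# The 14 rows of the weather data; the last field is the class "play".
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def count_matrix(rows, attr_index):
    """Map each attribute value to a Counter of class labels."""
    matrix = defaultdict(Counter)
    for row in rows:
        matrix[row[attr_index]][row[-1]] += 1
    return matrix

def info_gain(rows, attr_index):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy(Counter(row[-1] for row in rows).values())
    after = sum(
        sum(counts.values()) / len(rows) * entropy(counts.values())
        for counts in count_matrix(rows, attr_index).values()
    )
    return before - after

gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:12s} {gain:.3f}")
print("best split:", best)
```

Once the count matrix is built, `info_gain` never looks at individual rows again, which is the point of calling the matrix a sufficient statistic.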
Decision trees
Simple depth-first construction
Needs entire data to fit in memory
Unsuitable for large data sets
Need to scale up
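A minimal sketch of simple depth-first construction on the weather data follows. It assumes an ID3-style procedure (pick the attribute with the highest information gain, then recurse into each branch), which the slides do not name explicitly. Note that every recursive call receives its rows as an in-memory list, which illustrates why this approach needs the entire data set to fit in memory. All function names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")
# The 14 rows of the weather data; the last field is the class "play".
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def partition(rows, index):
    """Group rows by their value for the attribute at position index."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    return groups

def gain(rows, index):
    """Information gain of splitting rows on the attribute at position index."""
    remainder = sum(
        len(group) / len(rows) * entropy([r[-1] for r in group])
        for group in partition(rows, index).values()
    )
    return entropy([r[-1] for r in rows]) - remainder

def build(rows, available):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not available:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(available, key=lambda i: gain(rows, i))
    rest = [i for i in available if i != best]
    # Depth-first: each subtree is finished before the next branch is started.
    return {
        ATTRIBUTES[best]: {
            value: build(group, rest)
            for value, group in partition(rows, best).items()
        }
    }

tree = build(ROWS, list(range(len(ATTRIBUTES))))
print(tree)
```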
Decision Trees
Planning Tool
Enable a business to quantify decision making
Useful when the outcomes are uncertain
Places a numerical value on likely or potential outcomes
Allows comparison of different possible decisions to be made
Limitations:
How accurate is the data used in the construction of the tree?
How reliable are the estimates of the probabilities?
Data may be historical: does this data relate to real time?
Necessity of factoring in the qualitative factors: human resources, motivation, reaction, relations with suppliers and other stakeholders
Process
The Process
Expand by opening new outlet
  Economic growth rises (0.7): expected outcome 300,000
  Economic growth declines (0.3): expected outcome -500,000
Maintain current status: expected outcome 0
A square denotes the point where a decision is made. In this example, a business is contemplating opening a new outlet. The uncertainty is the state of the economy: if the economy continues to grow healthily, the option is estimated to yield profits of 300,000. However, if the economy fails to grow as expected, the potential loss is estimated at 500,000. There is also the option to do nothing and maintain the current status quo! This would have an outcome of 0.
The circle denotes the point where different outcomes could occur. The estimates of the probability and the knowledge of the expected outcome allow the firm to make a calculation of the likely return. In this example it is:
Economic growth rises: 0.7 x 300,000 = 210,000
Economic growth declines: 0.3 x -500,000 = -150,000
The calculation would suggest it is wise to go ahead with the decision (a net benefit figure of +60,000).
Expand by opening new outlet
  Economic growth rises (0.5): expected outcome 300,000
  Economic growth declines (0.5): expected outcome -500,000
Maintain current status: expected outcome 0
Look what happens, however, if the probabilities change. If the firm is unsure of the potential for growth, it might estimate it at 50:50. In this case the outcomes will be:
Economic growth rises: 0.5 x 300,000 = 150,000
Economic growth declines: 0.5 x -500,000 = -250,000
In this instance, the net benefit is -100,000: the decision looks less favourable!
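The two calculations can be reproduced with a short sketch that treats each option as a list of (probability, outcome) pairs and picks the option with the highest expected value. The figures come from the example above; the names `expected_value`, `best_decision`, `optimistic`, and `uncertain` are illustrative only.

```python
def expected_value(outcomes):
    """Sum of probability * outcome over the branches of a chance node (circle)."""
    return sum(probability * outcome for probability, outcome in outcomes)

def best_decision(options):
    """At a decision node (square), pick the option with the highest expected value."""
    values = {name: expected_value(outcomes) for name, outcomes in options.items()}
    return max(values, key=values.get), values

# Growth judged likely: 0.7 rise, 0.3 decline.
optimistic = {
    "expand": [(0.7, 300_000), (0.3, -500_000)],
    "maintain": [(1.0, 0)],
}
# Growth judged uncertain: 50:50.
uncertain = {
    "expand": [(0.5, 300_000), (0.5, -500_000)],
    "maintain": [(1.0, 0)],
}

for label, options in (("0.7 / 0.3", optimistic), ("0.5 / 0.5", uncertain)):
    choice, values = best_decision(options)
    print(label, {name: round(value) for name, value in values.items()}, "->", choice)
```

The comparison against the "maintain" option makes explicit what the slides imply: a negative net benefit means doing nothing is the better choice under those probability estimates.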
Advantages
Disadvantages