4 Datamining
What is data mining?
Data mining process
Data mining functions
Data mining technologies
Text mining and Web mining
Deploy data mining for competitive advantage
Data Mining
Data mining is a process of identifying hidden patterns and relationships within data
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
The use of a specific class of tools (data mining techniques) in the analysis of data.
Data mining explores data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then validates the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction; predictive data mining is the most common type of data mining and the one with the most direct business applications.
The data warehouse that enterprises are building until now have largely ignored
Factors that make data mining feasible
Organizations are gathering more data from on-line transaction processing systems (TPS), at lower storage cost.
High computation power allows the use of complex data mining algorithms.
With data mining, it is possible to better manage product warranties, predict purchases of retail stock, unearth fraud, determine credit risk, and define new products and services.
http://www.almaden.ibm.com/cs/quest/TECH.html
Association - occurrences are linked to a single event. Beer purchasers also buy peanuts 70% of the time.
Sequences - events are linked over time. a new
Classification - patterns are recognized that describe the characteristics of a group, such as customers who cancel credit cards
Association
Given a database of transactions, where each transaction consists of a set of items, discover all associations such that the presence of one set of items in a transaction implies the presence of another set of items.
Association rules
In 80% of the cases when people buy bread, they also buy milk. This tells us of the association between bread and milk. We represent it as:
bread => milk | 80%
This should be read as "Bread means or implies milk, 80% of the time." Here 80% is the "confidence factor" of the rule. Association rules can involve more than two items. For example:
bread, milk => jam | 60%
bread => milk, jam | 40%
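The confidence factor of a rule can be computed directly from a transaction list. The following is a minimal sketch, assuming each transaction is a Python set of item names; the function name `confidence` and the six baskets are invented for illustration and are not data from the slides.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # the rule never fires, so report zero confidence
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


# Hypothetical market baskets, made up for this example.
baskets = [
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread"},
    {"milk"},
]

print(confidence(baskets, {"bread"}, {"milk"}))         # 0.8
print(confidence(baskets, {"bread", "milk"}, {"jam"}))  # 0.5
```

In this toy data, bread appears in five baskets and four of them also contain milk, so the rule bread => milk has 80% confidence.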
Supermarket shelf management. Goal: To identify items that are bought together by sufficiently many customers. Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
In point-of-sale transaction sequences:
Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
Classification definition
Each record contains a set of attributes (predictors) and a categorical variable, also known as the class. Examples: light/regular Coke, delayed flight/not, competitive eBay bidding/not, fraudulent/not, respondent/not.
Find a model that maps the values of the predictors to class membership. Goal: previously unseen records should be assigned a class as accurately as possible. Classification algorithms include the naive rule, naive Bayes, k-nearest neighbors, classification trees, and neural nets.
Example of Classification
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes: when does a customer buy, what does he buy, how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
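The label, learn, and apply steps above can be sketched with k-nearest neighbors, one of the classification algorithms named earlier. This is an illustrative sketch only: the two numeric attributes and all six labeled records are hypothetical, and nothing here is the method used in any real fraud system.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority class among the k training records nearest to `query`."""
    nearest = sorted(train, key=lambda record: math.dist(record[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled past transactions: (attribute vector, class).
# The two attributes are arbitrary numeric features invented for illustration.
past_transactions = [
    ((1.0, 1.0), "fair"),
    ((1.2, 0.8), "fair"),
    ((0.9, 1.1), "fair"),
    ((5.0, 5.2), "fraud"),
    ((5.1, 4.9), "fraud"),
    ((4.8, 5.0), "fraud"),
]

print(knn_predict(past_transactions, (1.1, 0.9)))  # fair
print(knn_predict(past_transactions, (5.0, 5.0)))  # fraud
```

A new transaction is assigned the class that dominates among its closest labeled neighbors, so the quality of the labels and of the distance measure drives the result.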
Deviation/Anomaly Detection
Typical network traffic at University level may reach over 100 million connections per day
Clustering
The process of organizing objects into groups whose members are similar in some way.
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? There is no absolute best criterion that is independent of the final aim of the clustering.
Distance-based, fit-to-descriptive concepts.
An unsupervised learning problem.
Clustering Definition
Given a set of records (rows), each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity measures:
Euclidean distance if attributes are continuous.
Other problem-specific measures.
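One common distance-based way to meet this definition is k-means, which is not named on the slides and is used here only as an example. The sketch below assumes continuous attributes, Euclidean distance, and a deliberately naive initialization (the first k points); the data points are invented.

```python
import math


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def assign(points, centers):
    """Put each point into the cluster of its nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=10):
    centers = list(points[:k])  # naive initialization: the first k points
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(points, centers)


# Hypothetical two-attribute records forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(clusters)
```

The result depends on the initial centers and on the choice of k, which reflects the point above that no single criterion defines the best clustering.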
given a large database of customer data containing their properties and past buying records;
Biology: categorizing plants and animals given their features;
Libraries: book arrangement on shelves;
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house type, value and geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Prediction
Predict a value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
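For the linear case, prediction can be illustrated with an ordinary least-squares line fitted to one predictor. This is a minimal sketch; the advertising and sales figures are invented so that the fit is exact, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales (sales = 2 * ad + 1).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)         # 2.0 1.0
print(slope * 5.0 + intercept)  # 11.0, the predicted sales at expenditure 5.0
```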
http://www.thearling.com/text/dmtechniques/dmtechniques.htm
Statistical analysis
In most well-known statistical packages, traditional statistical methods are supplemented by some elements of data mining, but the main data analysis methods remain classical: correlation, regression, factor analysis, and similar techniques. Such systems cannot determine the form of dependencies hidden in the data; they require the user to provide his/her own hypotheses, which the system then tests.
The same data is used for model development and reliability assessment.
Good for describing relationships (e.g., regression).
Over-fitting can be common; limited predictive abilities.
Different datasets are used for model development, calibration, and assessment.
The objective is prediction.
In Data Mining:
Focus is on predictive accuracy: how will the model perform on a new dataset?
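The difference between assessing a model on its own development data and on a new dataset can be shown with a small holdout experiment. The sketch below is illustrative only: the income values and labels are invented, the "memorizer" is a deliberately over-fitted lookup table, and `threshold_rule` is a hand-written rule rather than a fitted model.

```python
from collections import Counter


def accuracy(model, records):
    """Fraction of records whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in records) / len(records)


def fit_memorizer(train):
    """Over-fitted model: memorizes every training record exactly."""
    table = dict(train)
    default = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda income: table.get(income, default)


def threshold_rule(income):
    """Simple hypothetical rule: spending is high when income is at least 50."""
    return "high" if income >= 50 else "low"


# Hypothetical (income, spending level) records; (40, "high") is a noisy case.
train = [(10, "low"), (20, "low"), (30, "low"), (40, "high"),
         (60, "high"), (70, "high"), (80, "high")]
holdout = [(15, "low"), (35, "low"), (65, "high"), (90, "high")]

memorizer = fit_memorizer(train)
print(accuracy(memorizer, train), accuracy(memorizer, holdout))            # 1.0 0.5
print(accuracy(threshold_rule, train), accuracy(threshold_rule, holdout))
```

The memorizer looks perfect on the data it was built from and drops to 50% on the holdout set, while the simpler rule misses the one noisy training record but classifies every holdout record correctly.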
Over-fitting
[Figure: over-fitting illustrated on a plot with axes "Credit card spending" and "Level of Income"]
Time-Series Forecasting
Time-series forecasting is a forecasting method that uses a set of historical values to predict an outcome. These historic values, often referred to as a "time series", are spaced equally over time and can represent anything from monthly sales data to daily electricity consumption to hourly call volumes. Time-series forecasting assumes that a time series is a combination of a pattern and some random error. The goal is to separate the pattern from the error by understanding the pattern's trend, its long-term increase or decrease, and its seasonality, the change caused by seasonal factors such as fluctuations in use and demand.
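Separating a trend from a seasonal pattern can be illustrated with a moving average whose window equals the season length. This is a sketch under stated assumptions: the series is synthetic (a linear trend plus a four-period seasonal effect that sums to zero, with no random error), and `moving_average` is a hypothetical helper.

```python
def moving_average(series, window):
    """Trailing moving average; one value for each full window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Synthetic equally spaced series: trend 10 + 2*t plus a seasonal effect of period 4.
season = [3, -1, -2, 0]  # sums to zero over one full season
series = [10 + 2 * t + season[t % 4] for t in range(12)]

trend = moving_average(series, window=4)
print(series)
print(trend)  # rises by exactly 2 per step: the seasonal effect is averaged out
```

Because each window covers one full season, the seasonal terms cancel and only the trend remains; with real data the random error would be smoothed but not removed.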
http://www.decisioneering.com/time-series-forecasting.html
Decision Tree
This method can be applied to classification tasks. Applying it to a training set produces a hierarchical structure of classifying rules of the type "IF...THEN...". This structure has the form of a tree (similar to a species identification key in botany or zoology).
To decide which class an object or a situation should be assigned to, one answers the questions located at the tree nodes, starting from the root. Following this procedure, one eventually arrives at one of the final nodes (called leaves), which states the class to which the object should be assigned.
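Walking a tree from the root to a leaf can be sketched with nested dictionaries. The tree below is an assumption for illustration: it encodes the debt and employment rules listed under Rule Induction, and the key names `question` and `branches` are invented.

```python
# Each internal node asks one question; each leaf is a class label (a plain string).
tree = {
    "question": "debt",
    "branches": {
        "high": "high risk",
        "low": {
            "question": "employment",
            "branches": {
                "salaried": "low risk",
                "self-employed": "medium risk",
            },
        },
    },
}


def classify(node, record):
    """Answer the question at each node until a leaf (class label) is reached."""
    while isinstance(node, dict):
        answer = record[node["question"]]
        node = node["branches"][answer]
    return node


print(classify(tree, {"debt": "high", "employment": "salaried"}))      # high risk
print(classify(tree, {"debt": "low", "employment": "self-employed"}))  # medium risk
```

Each root-to-leaf path corresponds to one "IF...THEN..." rule.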
Rule Induction
If Debt is High then Risk is High.
If Debt is Low and Salaried then Risk is Low.
If Debt is Low and Self-employed then Risk is Medium.
Given: questionnaire with financial and personal information.
Question: should money be lent?
Simple statistical method covers 90% of cases.
Borderline cases referred to loan officers.
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
age
years with current employer
years at current address
years with the bank
other credit cards possessed,
human experts only 50%
A neural network imitates the structure of living neural tissue built from separate neurons. To make meaningful predictions, a neural network first has to be trained on data describing previous situations for which both the input parameters and the correct reactions to them are known.
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
An artificial neural network consists of a number of small primitive processing units linked together via weighted, directed connections. A learning algorithm is used to train neural networks based on sample data
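How weighted connections turn inputs into outputs can be sketched with a forward pass through one hidden layer. This is a minimal sketch: the layer sizes, the sigmoid activation, and all weight values are assumptions chosen for illustration, and the training step (adjusting the weights from sample data) is omitted.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each unit sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical network: 3 inputs, 2 hidden units, 2 outputs. Weights are made up.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-0.5, 0.7]]
output_b = [0.0, 0.0]

print(forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b))
```

A learning algorithm would repeatedly compare such outputs with the known correct reactions and adjust the weights to reduce the error.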
[Figure: neural network with inputs x1, x2, x3 (x3: Employment), weights w1, w2, w3, an input layer, a hidden layer, and an output layer with outputs Y1, Y2]
This approach has proved effective in image recognition problems. However, experience shows that it is not well suited to, say, financial or serious medical applications, because knowledge reflected in the weights of a couple of hundred inter-neural connections cannot be analyzed and interpreted by a human.
Genetic Algorithm
A genetic algorithm is a search technique used in computing to find true or approximate solutions to optimization and search problems, and is often abbreviated as GA. Genetic algorithms are categorized as global search heuristics. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination).
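The roles of inheritance, selection, crossover, and mutation can be seen in a very small genetic algorithm. The sketch below is illustrative only: the toy problem (maximize the number of 1 bits in a string), the population size, the mutation rate, and the use of elitism are all assumptions, not details from the slides.

```python
import random


def fitness(bits):
    """Toy objective: the number of 1 bits in the candidate solution."""
    return sum(bits)


def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = [population[0][:]]          # elitism: copy the best unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]  # one-point crossover (inheritance)
            child = [1 - b if rng.random() < mutation_rate else b for b in child]  # mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history


best, history = evolve()
print(fitness(best), history[0])
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next.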
http://www.statsoft.com/textbook/stdatmin.html#Models%20for%20Data%20Mining
Text Mining
Application of data mining to unstructured or less structured text files. It entails generating meaningful numerical indices from the unstructured text and then processing these indices with various data mining algorithms.
Find the hidden content of documents, including additional useful relationships.
Relate documents across previously unnoticed divisions.
Group documents by common themes.
Automatic detection of e-mail spam or phishing through analysis of the document content.
Automatic processing of messages or emails to route a message to the most appropriate party to process that message.
Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses.
Analysis of related scientific publications in journals to create an automated summary view of a particular discipline.
Creation of a relationship view of a document collection.
Qualitative analysis of documents to detect deception.
In 2007, Europol's Serious Crime division developed an analysis system in order to track transnational organized crime.
Eliminate commonly used words (stopwords).
Replace words with their stems or roots (stemming algorithms).
Consider synonyms and phrases.
Calculate the weights of the remaining terms.
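These preprocessing steps can be sketched in a few lines. Everything specific here is an assumption for illustration: the stopword list is tiny, the suffix stripper is a crude stand-in for a real stemming algorithm, the weights are plain relative term frequencies, and the synonym and phrase step is left out.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "are"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemming algorithm


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_weights(text):
    """Remove stopwords, stem the rest, and weight terms by relative frequency."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    terms = [stem(t) for t in tokens if t and t not in STOPWORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


weights = term_weights("The claims are processed and the claim is routed.")
print(weights)  # {'claim': 0.5, 'process': 0.25, 'rout': 0.25}
```

The resulting dictionary is one simple form of the numerical indices that the data mining algorithms then process.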
Web Mining
The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools
Web content mining: the extraction of useful information from Web pages.
Web structure mining: the development of useful information from the links included in Web documents.
Web usage mining: the extraction of useful information from the data generated through webpage visits, transactions, etc.
Determine the lifetime value of clients.
Design cross-marketing strategies across products.
Evaluate promotional campaigns.
Target electronic ads and coupons at user groups.
Predict user behavior.
Present dynamic information to users.
Social network analysis views social relationships in terms of network theory, consisting of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals, such as friendship, kinship, organizational position, sexual relationships, etc.)
Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, communication studies, economics, geography, history, information science, organizational studies, social psychology, development studies, and sociolinguistics, and is now commonly available as a consumer tool.
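The node-and-tie view of a social network maps naturally onto a graph data structure. The sketch below is an illustration only: the four actors and their ties are invented, and degree centrality is just one of many measures used in social network analysis.

```python
from collections import defaultdict

# Hypothetical undirected ties (e.g., friendships) between four actors.
ties = [("Ann", "Bob"), ("Ann", "Cara"), ("Bob", "Cara"), ("Cara", "Dan")]


def build_network(ties):
    """Map each node to the set of nodes it is tied to."""
    network = defaultdict(set)
    for a, b in ties:
        network[a].add(b)
        network[b].add(a)
    return network


def degree_centrality(network):
    """Share of the other nodes that each node is directly tied to."""
    others = len(network) - 1
    return {node: len(neighbors) / others for node, neighbors in network.items()}


network = build_network(ties)
print(degree_centrality(network))
```

In this toy network Cara is tied to all three other actors, so her degree centrality is 1.0, while Dan, with a single tie, scores one third.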
The act of building data-mining models does not, by itself, guarantee any business value.
To be used as a competitive weapon, data mining must be part of a larger process that ensures that the information learned by data mining is transformed into actionable results.
Problem definition
Wish to understand and separate the customer bases for two product lines: long distance and Internet access service.
Very competitive market.
Time to react is limited.
Broad-based marketing programs are inefficient for customer retention and cross-sell.
Cost: $275-$400 for each new subscriber.
Discovery
Who are the most important, most profitable customers based on a lifetime value calculation?
A new user type was identified: Power Users, heavy phone users who are constantly on the phone.
Implementation
Create marketing campaigns that provide compelling offers to Power Users.
Multiple offers may be made, and data mining is used to determine which offers are most effective for which types of people at different times.
A customer-loyalty program to retain as many of the Power Users as possible before they leave.
Taking Action
Campaigns are best targeted at the time a customer contacts you.
The point of contact: a call center or a Web site interaction.
Data-mining models need to be integrated into the customer touch points.
A customer calls for an interpretation of a billing item.
The operator retrieves the customer information from the call center program.
While the operator explains to the customer, data mining generates campaign targeting based on up-to-date information.
A tailored product recommendation and special discount offer are displayed to the operator.
The operator relays the offers to the customer, referring to a displayed script.
Check the success of the marketing campaign in real time.
Customer response is captured for campaign refinement.
Evaluate the effectiveness of the data mining model.
Dynamic learning engine for fine tuning.
Integration
Integrating data mining with business strategies and marketing campaigns.
Integrating data mining with a decision delivery mechanism.
Creating a feedback loop to monitor the success of the campaigns.