4 Datamining

This document provides an overview of data mining. It defines data mining as the process of discovering hidden patterns in large data sets. The key sections discuss what data mining is, the typical data mining process, common data mining functions like association, classification, clustering and prediction, technologies used for data mining like statistics, decision trees and neural networks, and how data mining differs from classical statistical analysis in its focus on prediction rather than model fit.

Uploaded by

ironchefff
Copyright
© Attribution Non-Commercial (BY-NC)

Data Mining

What is data mining?
Data mining process
Data mining functions
Data mining technologies
Text mining and Web mining
Deploy data mining for competitive advantage

Data Mining

What is Data Mining?

Data mining is a process of identifying hidden patterns and relationships within data

What is Data Mining?

Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
The use of a specific class of tools (data mining techniques) in the analysis of data

What is Data Mining?

Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction. Predictive data mining is the most common type of data mining and the one with the most direct business applications.

The DM process Data view

Data mining became feasible

The data warehouses that enterprises have been building have, until now, been largely ignored
Factors that make data mining feasible:
Organizations are gathering more data from on-line transaction processing systems (TPS)
Storage costs are lower
High computation power allows the use of complex data mining algorithms

The Use of Data Mining

With data mining, it is possible to better manage product warranties, predict purchases of retail stock, unearth fraud, determine credit risk, and define new products and services.

The importance of data mining


"Data mining will become much more important, and companies will throw away nothing about their customers because it will be so valuable. If you're not doing this, you're out of business."
--- Dr. Penzias, a Nobel Prize winner interviewed in ComputerWorld in January 1999

The process of Data Mining

The DM Process - Overview


[Figure: overview of the DM process, with the labels "Reporting", "Different techniques: (10%)", and "(90%)"]

The steps in Data Mining (1)


1. Develop an understanding of the purpose of the data mining project.
2. Obtain the data set to be used in the analysis (random sampling).
3. Explore, clean, and preprocess the data (missing data and outliers).

The steps in Data Mining (2)


4. Reduce the data if necessary and separate it into training, validation, and test datasets. Reduction includes eliminating unnecessary variables, transforming variables, and creating new variables.
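The partitioning in step 4 can be sketched in a few lines of Python. This is only an illustration: the 60/20/20 proportions, the function name split_dataset, and the fixed seed are assumptions made for the example, not values given in the slides.

```python
import random

def split_dataset(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and cut them into training, validation, and test partitions.

    The fractions are illustrative assumptions; whatever remains after the
    training and validation cuts becomes the test partition.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Hypothetical data set: 100 record IDs
train, valid, test = split_dataset(range(100))
```

Each record lands in exactly one partition, so a model fitted on the training partition can be tuned on the validation partition and judged once on the test partition.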

The steps in Data Mining (3)


5. Determine the data mining task (classification, prediction, clustering, etc.).
6. Choose the data mining techniques to be used (regression, neural nets, etc.).
7. Use algorithms to perform the task. This is typically an iterative process.

The steps in Data Mining (4)


8. Interpret the results of the algorithm. Select the best algorithm and test its performance.
9. Deploy the model. Integrate the model into operational systems and run it on real records to produce decisions or actions.

Data Mining Functions

http://www.almaden.ibm.com/cs/quest/TECH.html

Information obtained from Data Mining

Data mining yields five basic types of information:

Association - occurrences are linked to a single event. Example: beer purchasers also buy peanuts 70% of the time
Sequences - events are linked over time. Example: a new carpet purchase is linked to new curtains
Classification - patterns are recognized that describe the characteristics of a group, such as customers who cancel credit cards

Information obtained from Data Mining

Clustering - discovers previously unknown groupings. Example: "Buyers of expensive sports cars are typically young urban professionals, whereas luxury sedans are bought by elderly wealthy persons."
Forecasting - estimates future values, such as inventory turnover

Association

Given a database of transactions, where each transaction consists of a set of items, discover all associations such that the presence of one set of items in a transaction implies the presence of another set of items.

Association rules

In 80% of the cases when people buy bread, they also buy milk. This tells us of the association between bread and milk.
We represent it as: bread => milk | 80%
This should be read as: "Bread means or implies milk, 80% of the time." Here 80% is the "confidence factor" of the rule.
Association rules can involve more than 2 items. For example:

bread, milk => jam | 60%
bread => milk, jam | 40%
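The confidence factor of a rule can be computed directly from a transaction list: count the transactions that contain the left-hand side, then see what fraction of those also contain the right-hand side. The sketch below uses six made-up baskets chosen so that bread => milk comes out at 80%; the basket data and function names are assumptions for illustration only.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    n_both = count_containing(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent

# Made-up baskets (assumption): 5 contain bread, 4 of those also contain milk
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread"},
    {"milk"},
]

bread_to_milk = confidence(baskets, {"bread"}, {"milk"})          # 4 of 5 bread baskets
bread_to_milk_jam = confidence(baskets, {"bread"}, {"milk", "jam"})  # 2 of 5 bread baskets
```

Note that the multi-item confidences for these baskets differ from the 60% and 40% quoted above, because the baskets are invented rather than taken from the slide's data.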

Association Rule Discovery Applications

Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

Sequential Pattern Discovery: Definition

Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

Sequential Pattern Discovery: Examples

In point-of-sale transaction sequences:
Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)

Classification definition

Given a collection of records (training set):
Each record contains a set of attributes (predictors) and a categorical variable, also known as the class.
Example classes: light/regular coke, delayed flight/not, competitive eBay bidding/not, fraudulent/not, respondent/not

Find a model: values of predictors -> class membership.
Goal: previously unseen records should be assigned a class as accurately as possible.
Classification algorithms: Naive rule, Naive Bayes, k-Nearest Neighbors, classification trees, neural nets, etc.
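As one concrete instance of "values of predictors -> class membership", here is a minimal k-Nearest Neighbors sketch. The two numeric attributes, the six labeled records, and the choice k=3 are all made up for the example.

```python
import math
from collections import Counter

def knn_predict(training_set, query, k=3):
    """Assign `query` the majority class among its k closest training records.

    training_set is a list of (attributes, class_label) pairs, where
    attributes is a tuple of numbers.
    """
    by_distance = sorted(training_set, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training records (assumption): two numeric predictors and a class
training_set = [
    ((1.0, 1.0), "fair"),
    ((1.2, 0.8), "fair"),
    ((0.9, 1.1), "fair"),
    ((5.0, 5.2), "fraud"),
    ((5.1, 4.9), "fraud"),
    ((4.8, 5.0), "fraud"),
]

label_near_first_group = knn_predict(training_set, (1.1, 1.0))
label_near_second_group = knn_predict(training_set, (5.0, 5.0))
```

A previously unseen record is classified by the labels of the training records that resemble it most, which is the "assign a class as accurately as possible" goal in its simplest form.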

Example of Classification

Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes: when a customer buys, what the customer buys, how often the customer pays on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.

Deviation/Anomaly Detection

Detect significant deviations from normal behavior
Applications:

Credit Card Fraud Detection

Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day
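One simple way to flag "significant deviations from normal behavior" is to measure how many standard deviations each value lies from the mean. The slides do not name a method, so the z-score rule, the 2.5 threshold, and the spending figures below are all assumptions for illustration.

```python
import statistics

def find_anomalies(values, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean.

    The threshold is an illustrative assumption. With very small samples a
    single extreme value inflates the standard deviation, so the threshold
    has to stay modest.
    """
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up daily card spending (assumption): nine ordinary days and one spike
daily_spend = [42, 38, 45, 40, 41, 39, 44, 43, 40, 950]
suspicious = find_anomalies(daily_spend)
```

Real fraud or intrusion detection would work with many attributes and far more data; this only shows the basic idea of scoring distance from normal behavior.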

Clustering

The process of organizing objects into groups whose members are similar in some way
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.
But how do we decide what constitutes a good clustering? There is no absolute best criterion that is independent of the final aim of the clustering.
Distance-based, fit-to-descriptive concepts
An unsupervised learning problem

Clustering Definition

Given a set of records (rows), each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity measures:
Euclidean distance if attributes are continuous.
Other problem-specific measures.
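A distance-based clustering with Euclidean distance can be sketched with k-means. The slides do not name k-means, so treat it as one possible method; the six points and the deliberately naive initialisation (the first k points become the starting centroids) are assumptions for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to its
    nearest centroid (Euclidean distance) and recomputing the centroids."""
    centroids = [points[i] for i in range(k)]  # naive deterministic start (assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up two-attribute records (assumption)
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

No class labels are supplied anywhere, which is what makes this an unsupervised problem: the groups emerge from the similarity measure alone.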

Clustering vs. Classification

Possible Use of Clustering

Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records
Biology: categorizing plants and animals given their features
Libraries: book arrangement on shelves
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds
City-planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones
WWW: document classification; clustering weblog data to discover groups of similar access patterns

Prediction

Predict the value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
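For the linear case, prediction reduces to fitting a line by ordinary least squares and reading off the value at a new input. The advertising and sales figures below are invented (they lie exactly on sales = 2 * spend + 5) and the function name fit_line is an assumption for the sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up history (assumption): advertising expenditure and resulting sales
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 60 + intercept  # forecast for a spend of 60
```

With real data the points would scatter around the line, and the quality of the prediction should be checked on records that were not used for fitting.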

http://www.thearling.com/text/dmtechniques/dmtechniques.htm

Data Mining Technologies

Data mining technologies

Technology used for data mining


visualization
statistical analysis
decision trees
rule induction
neural networks

Statistical analysis

Although most well-known statistical packages supplement traditional statistical methods with some elements of data mining, their main data analysis methods remain classical: correlation, regression, factor analysis, and similar techniques. Such systems cannot determine the form of the dependencies hidden in the data; they require the user to provide hypotheses, which the system then tests.

Data Mining Vs. Classical Statistical Analysis

In Classical Statistical Analysis:

The same data is used for model development and reliability assessment
Good for describing relationships (e.g., regression)
Over-fitting can be common, which limits predictive ability

In Data Mining:

Different datasets are used for model development, calibration, and assessment
The objective is prediction

The Three Sisters of analysis

Classical statistical analysis: focus is on fit of data to model

Data mining: focus is on predictive accuracy. How will the model perform on a new dataset?

The Fit Concept

error terms (e) / residuals

Over-fitting

[Figure: credit card spending versus level of income]

Time-Series Forecasting

Time-series forecasting is a forecasting method that uses a set of historical values to predict an outcome. These historic values, often referred to as a "time series", are spaced equally over time and can represent anything from monthly sales data to daily electricity consumption to hourly call volumes. Time-series forecasting assumes that a time series is a combination of a pattern and some random error. The goal is to separate the pattern from the error by understanding the pattern's trend, its long-term increase or decrease, and its seasonality, the change caused by seasonal factors such as fluctuations in use and demand.
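Separating trend from seasonality can be illustrated with two small helpers: a moving average over one full seasonal cycle smooths the seasonal swings away and exposes the trend, while averaging each position in the cycle exposes the seasonal shape. The quarterly figures are made up, and the moving-average approach is only one of several ways to do this.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values (smooths out seasonality
    when the window spans one full seasonal cycle)."""
    return [sum(series[i:i + window]) / window for i in range(len(series) - window + 1)]

def seasonal_means(series, period):
    """Average value at each position within the seasonal cycle."""
    return [
        sum(series[pos::period]) / len(series[pos::period])
        for pos in range(period)
    ]

# Made-up quarterly sales for three years (assumption): rising trend plus a quarterly pattern
quarterly_sales = [10, 20, 30, 40, 14, 24, 34, 44, 18, 28, 38, 48]

trend = moving_average(quarterly_sales, window=4)
by_quarter = seasonal_means(quarterly_sales, period=4)
```

Here the trend rises steadily by one unit per quarter, while the per-quarter averages show the fourth quarter running well above the first. A forecast would extend the trend and add back the seasonal offset for the quarter in question.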

http://www.decisioneering.com/time-series-forecasting.html

Decision Tree

This method can be applied to classification tasks.
Applying this method to a training set produces a hierarchical structure of classifying rules of the type "IF...THEN...". This structure has the form of a tree (similar to a species identification key in botany or zoology).

Decision Tree

To decide which class an object or a situation should be assigned to, one answers the questions located at the tree nodes, starting from the root. Following this procedure, one eventually arrives at one of the final nodes (called leaves), which states the class to which the object should be assigned.

Decision tree

ID  Debt  Income  Employment     Credit risk
1   High  High    Self-employed  Bad
2   High  High    Salaried       Bad
3   High  Low     Salaried       Bad
4   Low   Low     Salaried       Good
5   Low   Low     Self-employed  Bad
6   Low   High    Self-employed  Good
7   Low   High    Salaried       Good
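One tree that is consistent with the seven records above can be written as nested IF...THEN tests. The tree drawn on the original slide is not recoverable from the text, so the ordering of the questions below (debt first, then employment, then income) is an assumption; it is simply a tree that classifies all seven training records correctly.

```python
# The seven training records from the table: (id, debt, income, employment, credit_risk)
records = [
    (1, "High", "High", "Self-employed", "Bad"),
    (2, "High", "High", "Salaried", "Bad"),
    (3, "High", "Low", "Salaried", "Bad"),
    (4, "Low", "Low", "Salaried", "Good"),
    (5, "Low", "Low", "Self-employed", "Bad"),
    (6, "Low", "High", "Self-employed", "Good"),
    (7, "Low", "High", "Salaried", "Good"),
]

def classify(debt, income, employment):
    """Walk the tree from the root question to a leaf and return the class."""
    if debt == "High":              # root node
        return "Bad"
    if employment == "Salaried":    # second-level node, reached when debt is low
        return "Good"
    # Leaf decision for low-debt, self-employed applicants depends on income
    return "Good" if income == "High" else "Bad"

# Records the tree gets wrong (expected to be none for the training data)
misclassified = [r[0] for r in records if classify(r[1], r[2], r[3]) != r[4]]
```

Fitting the training set perfectly says nothing yet about new applicants; that still has to be checked on records held out from training.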

Decision Tree

Rule Induction

If Debt is High then Risk is High
If Debt is Low and Employment is Salaried then Risk is Low
If Debt is Low and Employment is Self-employed then Risk is Medium

Processing loan applications


(American Express)

Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
No! Borderline cases are the most active customers

Enter machine learning


1000 training examples of borderline cases
20 attributes: age, years with current employer, years at current address, years with the bank, other credit cards possessed, ...
Learned rules: correct on 70% of cases (human experts only 50%)

Rules could be used to explain decisions to customers

Artificial Neural Networks

Imitates the structure of live neural tissue, which is built from separate neurons
To make meaningful predictions, a neural network first has to be trained on data describing previous situations for which both the input parameters and the correct reactions to them are known.

Artificial Neural Networks

http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

Artificial Neural Networks

An artificial neural network consists of a number of small primitive processing units linked together via weighted, directed connections. A learning algorithm is used to train neural networks based on sample data
[Figure: a network with inputs x1, x2, x3 in the input layer, a hidden layer, and outputs Y1, Y2 in the output layer; the connections carry weights w1, w2, w3]
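The way signals flow through weighted, directed connections can be sketched as a forward pass: each unit sums its weighted inputs, adds a bias, and squashes the result. The layer sizes follow the figure (three inputs, one hidden layer, two outputs), but the two hidden units, the sigmoid activation, and every weight value are arbitrary assumptions; no training is shown.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Output of one layer: each unit applies sigmoid(weighted sum of inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate the inputs through the hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Arbitrary untrained weights (assumption): 3 inputs -> 2 hidden units -> 2 outputs
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-0.5, 0.5]]
output_b = [0.0, 0.0]

outputs = forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
```

A learning algorithm would adjust the weights and biases so that the outputs for the sample data move toward the known correct answers.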

Artificial Neural Networks

[Figure: a network with inputs x1 (Debt), x2 (Income), and x3 (Employment), connection weights wij, and output Y (Risk)]

Application of Neural Networks

This approach has proved effective in image recognition problems. However, experience shows that it is not well suited to, say, financial or serious medical applications: knowledge expressed as the weights of a couple of hundred inter-neuron connections cannot be analyzed and interpreted by a human.

Neural Networks Software


http://www.wardsystems.com

Genetic Algorithm

A genetic algorithm is a search technique used in computing to find true or approximate solutions to optimization and search problems, and is often abbreviated as GA. Genetic algorithms are categorized as global search heuristics. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination).
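The selection, crossover, and mutation steps can be sketched on a toy problem: evolve a bit string containing as many 1s as possible. The problem itself, the population size, the mutation rate, and the fixed random seed are all assumptions chosen to keep the example small and repeatable.

```python
import random

def fitness(bits):
    """Toy objective (assumption): the number of 1s in the bit string."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # selection: the fitter half breeds
        children = [population[0][:]]              # elitism: carry the best forward unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)         # single-point crossover (recombination)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

best = evolve()
```

Because the best individual is always carried forward, the best fitness never decreases from one generation to the next. As a heuristic, though, the search is not guaranteed to reach the true optimum.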

Genetic Algorithm

http://www.statsoft.com/textbook/stdatmin.html#Models%20for%20Data%20Mining

Text Mining

Text Mining
Application of data mining to nonstructured or less structured text files. It entails the generation of meaningful numerical indices from the unstructured text and then processing these indices using various data mining algorithms

Text mining helps organizations

Find the hidden content of documents, including additional useful relationships
Relate documents across previously unnoticed divisions
Group documents by common themes

Applications of text mining

Automatic detection of e-mail spam or phishing through analysis of the document content
Automatic processing of messages or e-mails to route each message to the most appropriate party to process it
Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses

Applications of text mining

Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
Creation of a relationship view of a document collection
Qualitative analysis of documents to detect deception
In 2007, Europol's Serious Crime division developed an analysis system in order to track transnational organized crime.

How to mine Text?

Eliminate commonly used words (stopwords)
Replace words with their stems or roots (stemming algorithms)
Consider synonyms and phrases
Calculate the weights of the remaining terms
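Three of these four steps can be sketched in a few lines: stopword removal, stemming, and term weighting. The stopword list is a tiny made-up sample, the stemmer is a crude suffix stripper rather than a real stemming algorithm, the weight is plain relative frequency, and the synonym/phrase step is left out entirely; all of these are simplifying assumptions.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "for"}
# Crude suffix list standing in for a real stemming algorithm (assumption)
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, provided at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_weights(text):
    """Relative frequency of each stemmed, non-stopword term in the text."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    stems = [stem(t) for t in tokens if t and t not in STOPWORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

weights = term_weights("The customer claims the warranty. Warranty claims are rising.")
```

The resulting numbers are the "meaningful numerical indices" that ordinary data mining algorithms can then process.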

Web Mining

Web Mining
The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools

Types of Web Mining

Web Mining

Web content mining: the extraction of useful information from Web pages
Web structure mining: the development of useful information from the links included in Web documents
Web usage mining: the extraction of useful information from the data generated through webpage visits, transactions, etc.

Uses for Web mining:


Determine the lifetime value of clients
Design cross-marketing strategies across products
Evaluate promotional campaigns
Target electronic ads and coupons at user groups
Predict user behavior
Present dynamic information to users

Web Mining

Social network analysis

Social network analysis

Social network analysis views social relationships in terms of network theory, consisting of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals, such as friendship, kinship, organizational position, sexual relationships, etc.)
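The node-and-tie view lends itself to simple measurements. One of the most basic is degree centrality: the share of the other actors that a given actor is directly tied to. The slides do not name a specific measure, so this choice, the four actors, and their ties are assumptions for illustration; ties are treated as undirected and self-ties are assumed absent.

```python
from collections import defaultdict

def degree_centrality(ties):
    """For each node, the fraction of the other nodes it is directly tied to.

    `ties` is a list of (node_a, node_b) pairs, treated as undirected.
    """
    neighbours = defaultdict(set)
    for a, b in ties:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}

# Made-up friendship ties among four actors (assumption)
ties = [("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Bob", "Cat")]
centrality = degree_centrality(ties)
most_connected = max(centrality, key=centrality.get)
```

In this small network Ann is tied to every other actor, so she has the highest possible score of 1.0.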

Social Network Analysis

Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, communication studies, economics, geography, history, information science, organizational studies, social psychology, development studies, and sociolinguistics and is now commonly available as a consumer tool

Deploying Data Mining for Competitive Advantage

Deploying Data Mining for Competitive Advantage

The act of building data-mining models does not, by itself, guarantee any business value.
To be used as a competitive weapon, data mining must be part of a larger process that ensures that the information learned by data mining is transformed into actionable results.

A process of deploying data mining for competitive advantage

Problem definition
Discovery
Implementation
Taking action
Monitoring the results

Problem definition

Wish to understand and separate the customer bases for two product lines: long distance and Internet access service
Very competitive market
Time to react limited
Broad-based marketing programs inefficient for customer retention and cross-sell
Cost $275-$400 for each new subscriber

Discovery

Who are the most important, most profitable customers based on a lifetime value calculation?
A new user type was identified: Power Users, heavy phone users who are constantly on the phone

Implementation

Create marketing campaigns that provide compelling offers to Power Users
Multiple offers may be made, and data mining is used to determine which offers are most effective for which types of people at different times
A customer-loyalty program to retain as many of the Power Users as possible before they leave

Taking Action

Campaigns are best targeted at the time a customer contacts you
The point of contact: a call center or a Web site interaction
Data-mining models need to be integrated into customer touch points

Customer interaction process


A customer calls for an explanation of a billing item
The operator retrieves customer information from the call center program
While the operator explains the item to the customer, data mining generates campaign targeting based on up-to-date information
A tailored product recommendation and special discount offer are displayed to the operator
The operator relays the offers to the customer, referring to a displayed script

Monitoring the results

Check the success of the marketing campaign in real time
Customer response is captured for campaign refinement
Evaluate the effectiveness of the data mining model
Dynamic learning engine for fine tuning

Integration

Integrating data mining with business strategies and marketing campaigns
Integrating data mining with a decision delivery mechanism
Creating a feedback loop to monitor the success of the campaigns

Data Mining Case studies
