Data Mining - Detailed - Simple Terms

Data mining is a technique used to explore large datasets to find patterns and make useful predictions. It involves techniques like clustering, classification, association rule mining, and sequential pattern mining. Clustering groups similar data points together, while classification predicts which category a data point belongs to based on predefined classes. Association rule mining finds relationships between variables, and sequential pattern mining identifies ordered sequences of events in transaction data. Data mining can be used for both descriptive purposes, such as summarizing trends in the data, and predictive purposes, like forecasting future outcomes. The results of data mining can help businesses make better decisions.

Uploaded by

Shivangi Patel
Copyright
© All Rights Reserved

Data Mining

Basic Concept:
Data mining is a Business Analytics technique used to explore large data sets in order to:
• Describe the data by summarizing it, finding trends or patterns, finding associations or correlations, and grouping it into clusters.
• Predict outcomes by classification, regression, and identifying deviations.

In simple terms, data mining means exploring large data sets to make useful decisions.
= = =================================================================
Other names for Data Mining
Data mining is also known as knowledge discovery or knowledge discovery from data (KDD), knowledge extraction, data/pattern analysis, information harvesting, etc.
= = =================================================================
Which type of Business Analytics technique is Data Mining?
Data mining includes techniques for both description and prediction: some of its techniques are descriptive and some are predictive. The main techniques are listed below by category.

Descriptive BA techniques of data mining include:
• Clustering
• Association rule discovery
• Sequential Pattern Discovery

Predictive BA techniques of data mining include:
• Classification
• Regression
• Sequential Pattern Discovery
• Deviation Detection

= = =================================================================
Explanation of each of the above techniques of Data Mining with applications
1. Clustering and 2. Classification
Both techniques group objects by one or more features. The differences between the two are as follows.
Clustering: the categories into which the objects/data will be grouped are not pre-defined.
Classification: the categories into which the data will be grouped are pre-defined.
Clustering, Application 1: Market segmentation. A company conducts focus group discussions with customers about their need for a health drink. The software automatically creates categories and places responses with similar characteristics in the same cluster (group). For instance:
o People who said they need a health drink for strong bones are placed in one group.
o People who said they need it for more energy during intense workouts and sports are placed in another group.
o People who said they need it for building immunity are placed in a third group, and so on.
Here the existing data is simply grouped into clusters, with the same type of data kept in one cluster. No prediction is made; the data is only described.

Classification, Application 1: Placing a customer in a risk category when a bank grants a loan. Suppose a customer applies for a loan. The bank already has 3 categories of customers based on their probability of defaulting in the future: High Risky, Moderate Risky and Less Risky. The bank collects the customer's data on age, occupation, income, past credit history, etc. Using a pre-defined set of decision rules (a decision tree), the software predicts how likely that customer is to default on the loan and places the customer in one of the above categories. The future event (repayment or default) has not happened yet, but the software makes a prediction about the customer based on the existing data and the pre-defined rules.
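The loan example can be sketched as a handful of hand-written rules. This is only an illustration: the function name `classify_risk`, the fields, and every threshold below are assumptions made up for the sketch, not rules any real bank uses. What it shows is that the three categories exist before any customer is looked at.

```python
# Hypothetical rule-based classifier for the loan example.
# All thresholds and field names are invented for illustration.

def classify_risk(age, monthly_income, past_defaults):
    """Walk a tiny hand-written decision tree and return a pre-defined category."""
    if past_defaults > 0:
        # In this sketch, any earlier default decides the outcome on its own.
        return "High Risky"
    if monthly_income < 30000:
        # No defaults, but a low income: middle category.
        return "Moderate Risky"
    if age < 25:
        # Good income, but a short earning history: middle category.
        return "Moderate Risky"
    return "Less Risky"

applicants = [
    {"name": "A", "age": 40, "monthly_income": 80000, "past_defaults": 0},
    {"name": "B", "age": 23, "monthly_income": 45000, "past_defaults": 0},
    {"name": "C", "age": 35, "monthly_income": 20000, "past_defaults": 2},
]

predictions = {
    a["name"]: classify_risk(a["age"], a["monthly_income"], a["past_defaults"])
    for a in applicants
}
print(predictions)
```

In practice the rules would be learned from past loan records rather than typed in by hand, but the output is the same kind of thing: one of the pre-defined categories for each new customer.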
Clustering, Application 2: Assume that you want to analyze a speech given by a politician (analyzing the text, that is, the content he spoke). The software will interpret the words spoken/written, create categories, and cluster similar content into groups as under:
o Wherever he spoke about women empowerment, women safety, women education, women rights, domestic violence, physical assault against women, or women trafficking, the software will keep all this content in one category, “women-related issues”.
o Wherever he spoke about GDP, inflation, taxes, or the spending power of people, the software will keep all this content in one common category, “economic issues”.
o Wherever he spoke about school education, child trafficking, child labor, or child health, the software will keep all this content in the category “children-related issues”.
o And so on.
The software itself will understand the context (meaning) of the content and group words/content of similar meaning into one group.

Classification, Application 2: A retailer wants to find out the future profitability or lifetime value of its customers and group them into “highly profitable”, “profitable” or “less profitable” customers. The company will collect the following data from the customers’ past purchase history:
o How recently did the customer purchase? (last date of purchase)
o How frequently did he shop? (dates of purchases in the given duration, say 2 years)
o What were the monetary values of his past transactions? (the bill amount of each purchase)
From these 3 pieces of information, the software itself will predict the lifetime value of each customer and classify the customer into one of the 3 categories: “highly profitable”, “profitable” or “less profitable”.
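The retailer example (recency, frequency, monetary value) can be sketched as below. The cut-offs of 90 days, 5 purchases and 10000 in total spend, the function name `rfm_segment`, and the sample purchase histories are all assumptions chosen for illustration; a real retailer would derive them from its own data.

```python
from datetime import date

def rfm_segment(purchases, today):
    """purchases: list of (purchase_date, bill_amount) for one customer."""
    last_purchase = max(day for day, _ in purchases)
    recency_days = (today - last_purchase).days          # how recently?
    frequency = len(purchases)                           # how often?
    monetary = sum(amount for _, amount in purchases)    # how much in total?

    # Hypothetical scoring: one point for each condition the customer meets.
    score = int(recency_days <= 90) + int(frequency >= 5) + int(monetary >= 10000)

    if score == 3:
        return "highly profitable"
    if score == 2:
        return "profitable"
    return "less profitable"

today = date(2024, 1, 1)
history = {
    "C1": [(date(2023, 12, 1), 3000), (date(2023, 10, 5), 2500),
           (date(2023, 8, 1), 2000), (date(2023, 5, 1), 1500),
           (date(2023, 2, 1), 1500)],
    "C2": [(date(2023, 11, 15), 6000), (date(2023, 6, 1), 5000)],
    "C3": [(date(2022, 3, 1), 400)],
}

segments = {customer: rfm_segment(p, today) for customer, p in history.items()}
print(segments)
```

The three category names are fixed in advance and the software only decides which one each customer falls into, which is what makes this classification rather than clustering.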
Type of analytics:
Clustering: the software does not predict; it simply describes the data by keeping data with similar characteristics in the same group, so the type of analytics is Descriptive.
Classification: the software predicts on the basis of a set of pre-defined decision rules, so the type of analytics is Predictive.

Type of learning:
Clustering: the categories did not exist beforehand and the software creates them automatically, so this is un-supervised learning.
Classification: the categories have to be given to the software in order to group the data, so this is supervised learning.
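To make the un-supervised side concrete, here is a minimal sketch of k-means, one common clustering method (the notes do not name a specific algorithm, so this choice is an assumption). The data, weekly workout hours of survey respondents, is invented. Notice that no category names are supplied: the groups emerge from the numbers alone.

```python
def kmeans_1d(values, k, iterations=20):
    """Very small 1-D k-means (needs k >= 2). Returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Hypothetical survey answers: weekly workout hours of nine respondents.
workout_hours = [1, 2, 1.5, 6, 7, 6.5, 12, 13, 14]
centroids, labels = kmeans_1d(workout_hours, k=3)
print(centroids, labels)
```

The analyst still has to look at each cluster afterwards and give it a meaningful name, such as "light", "regular" and "intense" exercisers.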

3. Association (Association rule discovery)


It is a technique which indicates that certain data (an event) is linked to other data (another event).
The Chi-Square test and the Correlation test (studied in your subject of Research Methodology or Business Statistics) are applications of association discovery.

Application 1:
Is there any association between gender and preference for the color pink? Here gender is nominal data and preference is measured, for example, on a Likert scale (interval/ordinal data), so the Chi-Square test can be used to check the association.
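As a sketch of how the Chi-Square statistic for this kind of question is computed, the code below uses an invented table of counts (gender by preference, with the Likert answers collapsed into three levels). The counts and the function name `chi_square` are assumptions for illustration only.

```python
def chi_square(table):
    """table: rows of observed counts. Returns (statistic, degrees_of_freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if gender and preference were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts. Columns: likes pink, neutral, dislikes pink.
observed = [
    [10, 20, 30],   # male respondents
    [30, 20, 10],   # female respondents
]
statistic, dof = chi_square(observed)

# 5.991 is the chi-square critical value for 2 degrees of freedom at the 5% level.
associated = statistic > 5.991
print(statistic, dof, associated)
```

A statistic above the critical value suggests that the two variables are associated; a statistical package would report the exact p-value instead of comparing against a table value.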

Application 2:
A restaurant wants to know the factors associated/ related with overall customer
satisfaction with the restaurant. Following data is available:
For each aspect below, the customer rates satisfaction on a five-point scale: Highly satisfied / Satisfied / Neither satisfied nor dissatisfied / Dissatisfied / Highly dissatisfied.

Rate your satisfaction with our restaurant on:
• Food quality
• Food taste
• Variety
• Ambience
• Staff service
• Staff behavior
• etc.

Rate your overall satisfaction with our restaurant (same five-point scale).

Here, a Correlation test can be used.

If the data is normally distributed, the Pearson Correlation test will be used; if the data is not normally distributed, Spearman’s Rank Correlation test will be used.

Reason: the purpose is to find association/correlation.

Variables at a time: 2. Overall satisfaction is the common variable; it is paired first with food quality, and then one by one with food taste, variety, ambience, etc.
Type of data of variables: both variables are on a Likert scale (interval/ordinal).
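The two correlation coefficients can be sketched in plain Python as below. The ratings are invented (1 = highly dissatisfied, 5 = highly satisfied), and the helper names `pearson`, `ranks` and `spearman` are just labels for this sketch. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks, with tied answers sharing an average rank.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """1-based ranks; tied values (common with Likert data) share an average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ratings from six customers.
food_quality = [5, 4, 4, 2, 1, 3]
overall = [5, 4, 5, 2, 1, 3]

print("Pearson:", pearson(food_quality, overall))
print("Spearman:", spearman(food_quality, overall))
```

The same two calls would be repeated with food taste, variety, ambience, etc. in place of `food_quality`, always against `overall`.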

Application 3:
Market-Basket Analysis
Assume that a retailer needs to decide which items should be kept together on the shelf. It can correlate the sales data of different items. This is how one store found that when the sales of diapers picked up, the sales of beer picked up too. Having found this relationship between the sales of diapers and beer, the store placed the two items together.
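A minimal sketch of market-basket analysis follows. It uses the usual association-rule measures (support, confidence and lift) rather than a correlation of sales totals, which is a substitution on my part; the five transactions are invented.

```python
# Hypothetical shopping baskets, one set of items per bill.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Share of all baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Values above 1 mean the items appear together more often than chance alone suggests."""
    return confidence(antecedent, consequent) / support(consequent)

rule_support = support({"diaper", "beer"})
rule_confidence = confidence({"diaper"}, {"beer"})
rule_lift = lift({"diaper"}, {"beer"})
print(rule_support, rule_confidence, rule_lift)
```

A rule such as "diaper → beer" with a lift above 1 is the kind of finding that would justify placing the two items near each other.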

4. Sequential pattern mining


This data mining technique focuses on identifying a series of events that take place in sequence. It discovers similar patterns/trends in the transaction data of a certain period.

If it is used just to describe the existing pattern, it falls under Descriptive BA; however, if the existing pattern/trend is used to forecast the future trend, it falls under Predictive BA.

Application 1: Tracking patterns

For example, you might see that sales of a certain product spike just before the holidays, or notice that warmer weather drives more people to your website.

Application 2: Predicting the product a customer is most likely to buy after purchasing an item (Predictive)
This technique shows which products customers buy together at different times of the year. Businesses can then use this information to recommend those products to customers, offering better deals based on their past purchasing frequency.

For instance, this technique can reveal which items of clothing customers are more likely to buy after an initial purchase of, say, a pair of shoes. Understanding sequential patterns can help organizations recommend additional items to customers to encourage sales.

For instance, a customer shopping sequence: first buy a computer, then a CD-ROM, and then a digital camera, all within 3 months.
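The idea of a shopping sequence can be sketched by counting how many customers bought item A at some point before item B. The customer histories and the minimum support of 2 are invented, and the 3-month time window mentioned above is ignored to keep the sketch short.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each listed in the order the items were bought.
sequences = {
    "cust1": ["computer", "CD-ROM", "digital camera"],
    "cust2": ["computer", "digital camera"],
    "cust3": ["computer", "CD-ROM"],
    "cust4": ["shoes", "socks"],
}

pair_counts = Counter()
for items in sequences.values():
    # combinations() preserves list order, so (a, b) means "a was bought before b".
    # set() makes each customer count at most once per pair.
    pair_counts.update(set(combinations(items, 2)))

min_support = 2  # keep only patterns seen for at least 2 customers
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)
```

A frequent pair such as ("computer", "CD-ROM") suggests recommending a CD-ROM to customers who have just bought a computer.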

Application 3: Stock Market Price Prediction (Predictive)

The past prices of a company's stock are studied using trend analysis, and the future price of the stock is predicted on the basis of the existing pattern.
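One very simple form of trend analysis is to extend the average past change into the future (sometimes called the drift method). The sketch below uses invented closing prices and a hypothetical helper `drift_forecast`; it illustrates the idea of projecting an existing pattern and is not a realistic model for trading.

```python
def drift_forecast(prices, steps_ahead=1):
    """Extend the average change per period `steps_ahead` periods into the future."""
    if len(prices) < 2:
        raise ValueError("need at least two prices to measure a trend")
    average_change = (prices[-1] - prices[0]) / (len(prices) - 1)
    return prices[-1] + steps_ahead * average_change

# Hypothetical daily closing prices of one stock.
closing_prices = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]
next_price = drift_forecast(closing_prices)
print(next_price)
```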

5. Outlier Detection
In data mining, anomaly detection (also called outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. Anomalies are also known as outliers, novelties, noise, deviations, and exceptions. An anomaly is an item that deviates from the common average within a dataset or a combination of data. It therefore indicates that something out of the ordinary has happened and requires additional attention.
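"Deviates from the common average" can be sketched with a z-score check: measure how many standard deviations each value lies from the mean and flag the ones beyond a threshold. The data and the threshold of 2 are assumptions for illustration; real systems use more robust methods.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical number of transactions per day on one account.
daily_transactions = [52, 48, 50, 51, 49, 50, 47, 53, 50, 500]
outliers = find_outliers(daily_transactions)
print(outliers)
```

The flagged day is not automatically fraud or an error; it is simply the observation that deserves a closer look.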

Application 1: Cyber Security - Intrusion detection

This technique can identify an attempt to enter the company’s network system. Intrusion detection identifies suspicious patterns that may indicate a network or system attack from someone attempting to break into or compromise a system.

Example: Network-Based Intrusion Detection. A network intrusion detection system usually consists of several sensors at different nodes and is also known as an Intrusion Detection and Prevention system. It monitors network traffic by connecting to a network hub or network tap, inspects the content of individual packets for malicious traffic, and takes corrective action against the corresponding attack in real time. Many more intrusion detection techniques are described in the links at the end of this document, for additional reading.

Application 2: Preventing Money Laundering (Money Laundering Prevention Solution for banks)

Ayasdi (software) offers an anomaly-detection-based solution for detecting and stopping money-laundering transactions. They claim the software analyzes the entities that customers are paying or receiving payments from, to make sure the funds are coming from legitimate sources. Additionally, they claim to accomplish this by analyzing customer behavior over time and detecting potentially harmful patterns as they appear. The Ayasdi AML offering contains the following four capabilities:
• Auto Feature Engineering: Automatically detecting aspects of transactional data that are likely to reveal potential fraud patterns.
• Intelligent Segmentation: Creating thresholds that form customer segments based on their transaction history and real-time behavior.
• Behavioral Insights: Keeping track of daily changes in customer behavior and creating lists of customers showing the most deviation.
• Intelligent Event Triage: Recognizing which transaction events require further investigation and which ones can be treated as acceptable deviations.

6. Regression (y = a + bx)
Regression is used to estimate the value of one variable given the values of other variables. It analyzes how the value of one attribute (the dependent variable, y) depends on the values of other attributes (the independent variables, x), mainly attributes of the same item. In y = a + bx, a is the intercept and b is the slope. Regression can be simple linear regression, multiple regression, or another type.

Application 1: Predicting a child’s behavior based on family history.

Application 2: Determining sales on the basis of advertising budget (what-if analysis), the example we discussed in class (simple linear regression), OR determining sales on the basis of advertising budget, sales promotion budget and event budget (multiple regression).
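The simple linear regression case, y = a + bx with sales as y and advertising budget as x, can be sketched with ordinary least squares. The figures are invented and deliberately lie on a perfect straight line so the result is easy to check by hand; real data would scatter around the line.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x. Returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Slope b: how much y changes, on average, for a one-unit change in x.
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    # Intercept a: the fitted value of y when x is zero.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, constructed so that sales = 5 + 2 * budget.
ad_budget = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

a, b = fit_line(ad_budget, sales)

# What-if analysis: expected sales if the advertising budget is raised to 60.
what_if_sales = a + b * 60
print(a, b, what_if_sales)
```

Multiple regression follows the same idea with one slope per budget (advertising, sales promotion, event) and is normally fitted with a statistics package rather than by hand.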
OTHER APPLICATIONS OF DATA MINING ARE FOUND IN THE PPT (Data
Mining_New)

Data Mining Process (details can be found in the book / Data Mining_New ppt / internet; in simple terms it is as follows):

1. Business Understanding
What is the business purpose/problem/issue/opportunity for which data mining is to be done?
For example, to retain its customers, a restaurant needs to know how satisfied they are with the restaurant.

2. Data Understanding
Identify the relevant data required to take a decision related to this problem/issue.
For example, to know customer satisfaction, the restaurant can use data such as overall customer satisfaction and satisfaction with food quality, taste, variety, ambience, staff behavior, etc. It can also look at sales data, the number of customers lost in a given time period, etc.

3. Data Preparation
The purpose of data preparation (more commonly called data preprocessing) is to take the data identified in the previous step and prepare it for analysis by data mining methods (the ETL process).

4. Model Building
Select the appropriate data mining technique (classification, clustering, regression, etc.) to solve the given business situation.

5. Testing and Evaluation
This step assesses whether the selected model (or models) meets the business objectives and, if so, to what extent.

6. Deployment
The knowledge gained from the exploration needs to be organized and presented in a way that the end user can understand and benefit from (presenting the results of the analysis).

https://techdifferences.com/difference-between-classification-and-clustering.html

https://www.upgrad.com/blog/most-common-examples-of-data-mining/

https://www.talend.com/resources/data-mining-techniques/

https://data-flair.training/blogs/data-mining-techniques/

https://www.datasciencecentral.com/profiles/blogs/the-7-most-important-data-mining-techniques

https://www.guru99.com/data-mining-tutorial.html

https://www.ijert.org/research/outlier-detection-for-different-applications-review-IJERTV2IS3508.pdf

https://emerj.com/ai-sector-overviews/anomaly-detection-in-banking/
