Data Mining - Detailed - Simple Terms
Basic Concept:
Data mining is a Business Analytics technique used to explore large data sets in order to:
o Describe the data by summarizing it, finding trends or patterns, making associations or correlations, and grouping it into clusters.
o Predict by classification, regression, and identifying deviations.
In simple terms, data mining means exploring large data sets in order to make useful decisions.
= = =================================================================
Other names for Data Mining
Data mining is also known as knowledge discovery or knowledge discovery from data (KDD),
knowledge extraction, data/pattern analysis, information harvesting, etc.
= = =================================================================
Which type of Business Analytics technique is Data Mining?
Data mining includes techniques for both description and prediction: some of its techniques
are descriptive and some are predictive. The techniques are listed below by category:

Descriptive BA techniques of Data mining     Predictive BA techniques of Data mining
Clustering                                   Classification
Association rule discovery                   Regression
Sequential pattern discovery                 Sequential pattern discovery
                                             Deviation detection
= = =================================================================
Explanation of each of the above techniques of Data Mining with applications
Clustering and classification are both techniques that sort objects into groups based on one or
more features. However, they differ as set out below.

1. Clustering
The categories into which the objects/data will be grouped are not pre-defined.

Application 1: Use of clustering in market segmentation. A company conducted focus group
discussions with customers about their need for a health drink. The software automatically
creates categories and places data with similar characteristics in the same cluster (group).
For instance:
o People who said they need a health drink for strong bones are placed in one group.
o People who said they need a health drink for more energy during intense workouts and
sports are placed together in another group.
o Those who said they need it for building immunity are kept in one common group, and so on.
Here, the existing data is simply grouped into different clusters, with the same type of data
kept in one cluster. No prediction is made; the data is only described.

Application 2: Assume that you want to analyze a speech given by a politician (analyzing the
text, i.e. the content of what was said). The software interprets the words spoken/written,
creates categories, and clusters similar content into groups as follows:
o Wherever the politician spoke about women's empowerment, women's safety, women's
education, women's rights, domestic violence, physical assault against women, or trafficking
of women, the software keeps all of this content in one category, "women-related issues".
o Wherever the politician spoke about GDP, inflation, taxes, or people's spending power, the
software keeps all of this content in one common category, "economic issues".
o Wherever the politician spoke about school education, child trafficking, child labor, or
child health, the software keeps it in the category "children-related issues".
o And so on.
The software itself works out the context (meaning) of the content and groups words/content
of similar meaning together.

As the software does not predict but simply describes the data by keeping data with similar
characteristics in the same group, the type of analytics is Descriptive.

As the categories did not exist beforehand and the software creates them (the groups)
automatically, clustering is used for unsupervised learning.

2. Classification
The categories into which the data will be grouped are pre-defined.

Application 1: Classifying customers into risk categories when a bank grants loans. For
instance, a customer comes to the bank for a loan. The bank already has 3 categories of
customers based on their probability of defaulting in the future: High Risky, Moderate Risky
and Less Risky. The bank takes the customer's data on age, occupation, income, past credit
history, etc. On the basis of a pre-defined set of rules that guide decision making (a decision
tree), the software predicts how risky that particular customer will be in terms of defaulting
on the loan and places the customer in one of the above categories (High Risky, Moderate
Risky or Less Risky).
Here, the future event has not happened yet, but the software makes a prediction about the
customer based on the existing data and the pre-defined rules.

Application 2: A retailer wants to find out the future profitability or lifetime value of its
customers and group them into "highly profitable", "profitable" or "less profitable" customers.
The company collects the following data from the customers' past purchase history:
o How recently did the customer purchase? (date of last purchase)
o How frequently did the customer shop? (dates of purchases in a given period, say 2 years)
o What were the monetary values of the customer's past transactions? (the bill amount of each
purchase)
From these 3 pieces of information, the software itself predicts the lifetime value of each
customer and classifies the customer into one of the 3 categories: "highly profitable",
"profitable" or "less profitable".

As the software predicts on the basis of a set of pre-defined decision rules, the type of
analytics is Predictive.

As the categories have to be given to the software for it to group the data, classification is
used in supervised learning.
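The contrast between the two techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the spending figures, and the income and missed-payment thresholds are all hypothetical, and a real decision tree would normally learn its rules from labelled past loans instead of having them written by hand. The first function is a simple one-dimensional k-means that receives no category labels and forms the groups itself; the second assigns each customer to one of the three pre-defined risk categories.

```python
def kmeans_1d(values, k, iterations=10):
    """Clustering: group numbers into k clusters with no pre-defined categories."""
    ordered = sorted(values)
    # Deterministic start (an assumption of this sketch): spread the k
    # initial centers evenly across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            # Put each value into the cluster whose center is nearest.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def classify_loan_risk(income, missed_payments):
    """Classification: assign one of three pre-defined categories.

    The thresholds below are invented purely for illustration.
    """
    if missed_payments >= 3:
        return "High Risky"
    if missed_payments >= 1 or income < 30000:
        return "Moderate Risky"
    return "Less Risky"


# Hypothetical monthly spend on health drinks; the groups emerge from the data.
spend = [12, 50, 90, 10, 52, 95, 11, 49, 92]
print(kmeans_1d(spend, k=3))  # [[10, 11, 12], [49, 50, 52], [90, 92, 95]]

# Hypothetical loan applicants; the categories were fixed in advance.
print(classify_loan_risk(income=50000, missed_payments=0))  # Less Risky
print(classify_loan_risk(income=50000, missed_payments=4))  # High Risky
```

Note that the clustering function is only told how many groups to form, never what the groups mean, whereas the classifier can only ever return one of the labels it was given.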
3. Association Rule Discovery
Application 1:
Is there any association between gender and preference for the color pink? Here gender is
nominal data and preference is measured, for example, on a Likert scale, which is ordinal data
(often treated as interval data). To check the association between two such categorical
variables, the Chi-Square test can be used.
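A minimal sketch of the Chi-Square calculation for the gender and color preference question follows. The counts are made up for illustration, the function name is hypothetical, and the sketch assumes every row and column of the table has at least one observation. It compares the observed counts with the counts expected if gender and preference were unrelated.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Hypothetical survey counts. Columns: likes pink, neutral, dislikes pink.
observed = [
    [30, 10, 10],  # female respondents
    [10, 10, 30],  # male respondents
]
chi2, dof = chi_square_statistic(observed)

# Critical value of the chi-square distribution at the 5% level for 2 degrees of freedom.
CRITICAL_VALUE_DF2 = 5.991
print(chi2, dof)                  # 20.0 2
print(chi2 > CRITICAL_VALUE_DF2)  # True: the data suggest an association
```

Because the statistic exceeds the critical value, the hypothesis that gender and preference are independent would be rejected for these invented counts.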
Application 2:
A restaurant wants to know the factors associated with overall customer satisfaction with the
restaurant. The following survey data is available:

Rate your satisfaction    Highly      Satisfied   Neither satisfied   Dissatisfied   Highly
with our restaurant:      satisfied               nor dissatisfied                   dissatisfied
Food quality
Food taste
Variety
Ambience
Staff service
Staff behavior
etc.

Rate your overall
satisfaction with
our restaurant
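One simple way to see which factors move together with overall satisfaction is to code the answers from 1 (highly dissatisfied) to 5 (highly satisfied) and correlate each factor with the overall rating. The sketch below does this with Pearson correlation; the ratings are invented, the names are hypothetical, and because Likert data is ordinal a rank-based measure such as Spearman correlation may be the better choice in practice. It also assumes that no factor received identical ratings from every respondent.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings from five customers (1 = highly dissatisfied, 5 = highly satisfied).
ratings = {
    "Food taste": [5, 4, 2, 1, 3],
    "Ambience": [3, 3, 4, 2, 3],
    "Staff behavior": [4, 4, 3, 2, 3],
}
overall = [5, 4, 2, 1, 3]

# Rank the factors by how strongly they are associated with overall satisfaction.
ranked = sorted(
    ((pearson(scores, overall), factor) for factor, scores in ratings.items()),
    reverse=True,
)
for r, factor in ranked:
    print(f"{factor}: r = {r:.2f}")
```

A high correlation only shows that a factor and overall satisfaction rise and fall together; it does not by itself show that the factor causes the satisfaction.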
Application 3:
Market Basket Analysis
Assume that a retailer needs to decide which items should be kept together on the shelf. It can
correlate the sales data of different items. This is how one store found that when sales of
diapers picked up, sales of beer picked up as well. Having found a relationship between the
sales of diapers and beer, the store placed the two items together.
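Market basket analysis is usually expressed with three measures for a pair of items: support (how often the two appear together), confidence (how often the second appears given the first), and lift (how much more often they appear together than chance would suggest). The sketch below computes them for every pair; the transactions are made up and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions):
    """Compute support, confidence and lift for every pair of items."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore repeats within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        rules[(a, b)] = {
            "support": support,
            # Share of baskets containing a that also contain b.
            "confidence": count / item_counts[a],
            # Above 1 means a and b occur together more often than by chance.
            "lift": support / ((item_counts[a] / n) * (item_counts[b] / n)),
        }
    return rules


# Hypothetical sales transactions.
transactions = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["diapers", "bread"],
    ["beer", "chips"],
    ["milk", "bread"],
]
rule = pair_rules(transactions)[("beer", "diapers")]
print(rule)
```

In this invented data the beer and diapers pair has a lift above 1, which is the kind of signal that would prompt a retailer to place the two items together.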
4. Sequential Pattern Discovery
If this technique is used just to describe an existing pattern, it falls under Descriptive BA;
however, if the existing pattern/trend is used to forecast the future trend, it falls under
Predictive BA.
For instance, this technique can reveal which items of clothing customers are more likely to buy
after an initial purchase of, say, a pair of shoes. Understanding sequential patterns can help
organizations recommend additional items to customers to encourage sales.
Another example is a customer shopping sequence: first buy a computer, then a CD-ROM, and then
a digital camera, all within 3 months.
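A minimal sketch of the idea: for each ordered pair of items, count the share of customers who bought the first item at some point before the second. The purchase histories are invented, the function name is hypothetical, and the 3-month time window from the example is left out for simplicity (each history is assumed to be already limited to the period of interest).

```python
from collections import Counter


def ordered_pair_support(sequences):
    """Share of customers who bought one item at some point before another."""
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()  # count each ordered pair once per customer
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical purchase histories, each listed in order of purchase.
sequences = [
    ["computer", "CD-ROM", "digital camera"],
    ["computer", "digital camera"],
    ["CD-ROM", "computer"],
    ["computer", "CD-ROM"],
]
support = ordered_pair_support(sequences)
print(support[("computer", "CD-ROM")])  # 0.5
print(support[("CD-ROM", "computer")])  # 0.25
```

Unlike market basket analysis, the order matters here: "computer then CD-ROM" and "CD-ROM then computer" are counted as different patterns.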
5. Outlier Detection
In data mining, anomaly detection (also called outlier detection) is the identification of items,
events or observations that do not conform to an expected pattern or to the other items in a
dataset. Anomalies are also known as outliers, novelties, noise, deviations, and exceptions. An
anomaly is an item that deviates from the common average within a dataset or a combination of
data. It therefore indicates that something out of the ordinary has happened and requires
additional attention.
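One common and simple way to find values that deviate from the average is the z-score: how many standard deviations a value lies from the mean. The sketch below flags values beyond a chosen threshold. The transaction amounts, the function name, and the threshold of 2.5 are assumptions made for illustration; real systems typically use more robust methods.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily transaction amounts with one unusual value.
amounts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
print(find_outliers(amounts))  # [500]
```

A single extreme value also inflates the mean and the standard deviation, so with very small samples a z-score can fail to cross a high threshold; this is one reason to treat the threshold as something to tune.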
Example: Network-Based Intrusion Detection
A network intrusion detection system usually consists of several sensors at different nodes and
is also known as an Intrusion Detection and Prevention System. It checks the network traffic by
connecting to a network hub or network tap. Basically, it inspects the content of individual
packets for malicious traffic and takes corrective action in real time against the corresponding
attack. Many more intrusion detection techniques are described in the links below for
additional reading.
Ayasdi (software) offers an anomaly detection based solution for detecting and stopping money-
laundering transactions. They claim the software analyzes the entities customers are paying or
receiving payments from to make sure the funds are coming from legitimate sources.
Additionally, they claim to accomplish this by analyzing customer behavior over time and
detecting potentially harmful patterns as they appear. The Ayasdi AML offering contains the
following four capabilities:
Auto Feature Engineering: Automatically detecting aspects of transactional data that are likely
to reveal potential fraud patterns.
Intelligent Segmentation: Creating thresholds that form customer segments based on their
transaction history and real time behavior.
Behavioral Insights: Keeping track of daily changes in customer behavior and creating lists of
customers showing the most deviation.
Intelligent Event Triage: Recognizing which transaction events require further investigation
and which ones can be treated as acceptable deviations.
6. Regression (y = a + bx)
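Regression is used to estimate the value of one variable (the dependent variable, y) given the
values of other variables (the independent variables, x). The technique analyzes how the value
of one attribute depends on the values of other attributes, mainly those present in the same
item. In y = a + bx, a is the intercept and b is the slope, i.e. the change in y for a one-unit
change in x. Regression can be simple linear regression (one independent variable), multiple
regression (several independent variables), or other types as well.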
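The line y = a + bx can be fitted with the ordinary least squares formulas: b is the covariance of x and y divided by the variance of x, and a is the mean of y minus b times the mean of x. A minimal sketch follows; the data and the variable names are hypothetical and chosen so that the points lie exactly on a line.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much y changes for a one-unit change in x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the value of y when x is zero.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: advertising spend (x) and sales (y).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

a, b = fit_line(ad_spend, sales)
print(a, b)       # 1.0 2.0
print(a + b * 6)  # 13.0, the predicted sales for a spend of 6
```

Once a and b are known, the prediction for any new x is just a + bx, which is what makes regression a predictive technique.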
Data Mining Process (details can be found in the book / Data Mining_New ppt / internet;
in simple terms it is as follows):
1. Business Understanding
What is the business purpose/problem/issue/opportunity for which data mining is to be
done?
For example, to retain its customers, a restaurant needs to know how satisfied they are with
the restaurant.
2. Data Understanding
Identify the relevant data required to take a decision on this problem/issue.
For example, to understand customer satisfaction, the restaurant can use data such as overall
customer satisfaction and satisfaction with food quality, taste, variety, ambience, staff
behavior, etc. It can also look at sales data, the number of customers lost in a given time
period, etc.
3. Data Preparation
The purpose of data preparation (more commonly called data preprocessing) is to take the
data identified in the previous step and prepare it for analysis by data mining methods (for
example, through an ETL, i.e. extract, transform, load, process).
4. Model Building
Selecting the appropriate data mining technique (whether to use classification, clustering,
regression etc. to solve our given business situation)
5. Testing and Evaluation
This step assesses the degree to which the selected model (or models) meets the business
objectives and, if so, to what extent.
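For a classification model, one simple way to carry out this step is to compare the model's predictions with the actual outcomes on data that was not used to build the model. The sketch below computes the accuracy and a confusion count; the labels are invented, the function name is hypothetical, and which measure truly reflects the business objective depends on the problem.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion counts keyed by (actual, predicted))."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Hypothetical held-out customers: actual outcome vs. the model's prediction.
actual = ["High", "Low", "Low", "High", "Low"]
predicted = ["High", "Low", "High", "High", "Low"]

accuracy, confusion = evaluate(actual, predicted)
print(accuracy)                    # 0.8
print(confusion[("Low", "High")])  # 1 low-risk customer wrongly flagged as high risk
```

The confusion counts matter because different mistakes have different business costs: wrongly rejecting a good customer is not the same as wrongly accepting a risky one.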
6. Deployment
The knowledge gained from such exploration will need to be organized and presented in a way
that the end user can understand and benefit from (presenting the result of analysis)
https://techdifferences.com/difference-between-classification-and-clustering.html
https://www.upgrad.com/blog/most-common-examples-of-data-mining/
https://www.talend.com/resources/data-mining-techniques/
https://data-flair.training/blogs/data-mining-techniques/
https://www.datasciencecentral.com/profiles/blogs/the-7-most-important-data-mining-techniques
https://www.guru99.com/data-mining-tutorial.html
https://www.ijert.org/research/outlier-detection-for-different-applications-review-IJERTV2IS3508.pdf
https://emerj.com/ai-sector-overviews/anomaly-detection-in-banking/