Module 7 Introduction To Data Mining
DATA MINING
Learning Objectives
At the end of the module, the student should be able to:
1. Define data mining and some common approaches used in data
mining;
2. Distinguish among a database, a data warehouse, and a
data mart;
3. Differentiate Online Analytical Processing (OLAP) from Online
Transactional Processing (OLTP);
4. Describe the data mining methodologies.
Data Mining
Data mining is a field of business analytics focused on better
understanding characteristics and patterns among variables in
large databases using a variety of statistical and analytical
tools (Evans, 2017).
Data mining includes a wide variety of statistical procedures
for exploring data, including regression analysis (Evans, 2017).
Data mining attempts to discover patterns, trends, and
relationships among data, especially nonobvious and
unexpected patterns (Albright & Winston, 2020).
Data Mining (Jaggia et al., 2021)
Data mining describes the process of applying a set of
analytical techniques necessary for the development of
machine learning and artificial intelligence.
The goal of data mining is to uncover hidden patterns
and relationships in data, which allows us to gain insights
and derive relevant information to help make decisions
(Jaggia et al., 2021).
Data Mining (Albright and Winston, 2020)
The place to start is with a data warehouse. A data warehouse is a huge
database that is designed specifically to study patterns in data. It should:
1. Combine data from multiple sources to discover as many relationships as
possible;
2. Contain accurate and consistent data;
3. Be structured to enable quick and accurate responses to a variety of
queries; and
4. Allow follow-up responses to specific relevant questions.
A data warehouse represents a type of database that is specifically
structured to enable data mining.
Data Mining (Jaggia et al., 2021, page 318)
Data mining is recognized as a building block of machine learning and
artificial intelligence.
Data Mining Process (Jaggia et al., 2021, page 319)
There is a growing need for the establishment of standards in this field.
Two commonly adopted methodologies are CRISP-DM and SEMMA.
What is CRISP-DM Methodology?
When conducting data mining
analysis, practitioners generally adopt
either CRISP-DM methodology or
SEMMA methodology.
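CRISP-DM stands for Cross-Industry
Standard Process for Data Mining and
consists of six major phases: business
understanding, data understanding, data
preparation, modeling, evaluation, and
deployment. It was developed in the 1990s
by SPSS, Teradata, Daimler AG, NCR, and OHRA.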
https://www.guru99.com/oltp-vs-olap.html
OLTP vs. OLAP
• Examples of OLTP applications are ATM centers, online
banking, online booking, sending text messages, etc.
• Examples of the use of OLAP are as follows:
• Spotify analyzes the songs its users play to build a
personalized homepage of songs and playlists.
• Netflix's movie recommendation system.
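The contrast in the examples above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the bookings table, its columns, and the sample rows are hypothetical, and a single in-memory database stands in for what would normally be two separate systems (an operational database for OLTP and a data warehouse for OLAP).

```python
import sqlite3

# Hypothetical in-memory database; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, customer TEXT, "
    "month TEXT, amount REAL)"
)

def record_booking(customer, month, amount):
    """OLTP-style work: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute(
            "INSERT INTO bookings (customer, month, amount) VALUES (?, ?, ?)",
            (customer, month, amount),
        )

def revenue_by_month():
    """OLAP-style work: a read-only query that summarizes many rows."""
    rows = conn.execute(
        "SELECT month, SUM(amount) FROM bookings GROUP BY month ORDER BY month"
    )
    return dict(rows.fetchall())

# Invented sample transactions, for illustration only.
record_booking("Ana", "2021-01", 120.0)
record_booking("Ben", "2021-01", 80.0)
record_booking("Ana", "2021-02", 50.0)
summary = revenue_by_month()
```

Each call to record_booking touches one row and finishes quickly, which is the OLTP pattern; revenue_by_month scans and aggregates the accumulated history, which is the OLAP pattern.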
https://www.xplenty.com/blog/snowflake-schemas-vs-star-schemas-what-are-they-and-how-are-they-different/#:~:text=Star%20schemas%20will%20only%20join,for%20datamarts%20with%20simple%20relationships.
Benefits of the OLTP method
• OLTP administers the daily transactions of an organization.
• OLTP widens the customer base of an organization by
simplifying individual processes.
• OLTP systems are optimized for transaction processing
rather than data analysis, so they can handle many
simultaneous transactions. OLAP systems are not suited to
this workload because they hold large volumes of data
integrated from different sources into a consolidated
database.
OLTP vs. OLAP Comparison Chart
Online Analytical Processing (OLAP) | Online Transactional Processing (OLTP)
Consists of historical data from various databases | Consists only of current operational data
• Scaled values of the inputs enter the network at the left, they are weighted by
the W values and summed, and these sums are sent to the hidden nodes.
• At the hidden nodes, the sums are “squished” by an S-shaped logistic-type
function.
• These squished values are then weighted and summed, and the sum is sent to
the output node, where it is squished again and rescaled.
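The three steps above can be sketched as a single forward pass. This is a minimal illustration, not the textbook's implementation: it assumes one hidden layer, no bias terms, the standard logistic function as the S-shaped "squishing" function, and a linear rescaling of the output to a target range. The names forward, hidden_weights, output_weights, y_min, and y_max are hypothetical.

```python
import math

def logistic(z):
    """S-shaped function that maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights, y_min, y_max):
    """One forward pass through a network with a single hidden layer.

    inputs:         scaled input values
    hidden_weights: one list of weights per hidden node (the W values)
    output_weights: one weight per hidden node, feeding the output node
    y_min, y_max:   assumed range used to rescale the final output
    """
    # Weight and sum the inputs for each hidden node, then squish the sum.
    hidden = [
        logistic(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    # Weight and sum the squished values, then squish again at the output node.
    squished = logistic(sum(w * h for w, h in zip(output_weights, hidden)))
    # Rescale from (0, 1) back to the units of the target variable.
    return y_min + squished * (y_max - y_min)
```

For example, forward([0.2, 0.7], [[0.5, -1.0], [1.5, 0.3]], [0.8, -0.4], 10, 20) returns a prediction strictly between 10 and 20. Training, which chooses the weights, is outside the scope of this sketch.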
© 2020 Cengage. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-
protected website or school-approved learning management system for classroom use.
D. Classification Trees (Albright and Winston, 2020)
An example of a classification tree:
This classification tree leads directly to the following rules:
1. If a person makes fewer than 4 mall trips,
a. If the person lives in the West, classify as a trier.
b. If the person doesn’t live in the West, classify as a non-trier.
2. If the person makes 4 or 5 mall trips,
a. If the person doesn’t live in the East, classify as a trier.
b. If the person lives in the East, classify as a non-trier.
3. If the person makes at least 6 mall trips, classify as a trier.
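The rules above translate directly into nested conditions. The sketch below assumes that the number of mall trips is a whole number and that the region is given as a string such as "West" or "East"; the function name classify and the label strings are illustrative choices.

```python
def classify(mall_trips, region):
    """Apply the three classification-tree rules to one person."""
    if mall_trips < 4:
        # Rule 1: fewer than 4 trips, split on whether the person lives in the West.
        return "trier" if region == "West" else "non-trier"
    if mall_trips <= 5:
        # Rule 2: 4 or 5 trips, split on whether the person lives in the East.
        return "non-trier" if region == "East" else "trier"
    # Rule 3: at least 6 trips.
    return "trier"
```

Each path through the function corresponds to one leaf of the tree, which is why a classification tree is easy to explain to a non-technical audience.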
Clustering Methods (Albright and Winston, 2020)
• Probably the most common unsupervised method is clustering,
known in marketing circles as segmentation.
• It tries to group entities (customers, companies, cities, etc.)
into similar clusters, based on the values of their variables.
• There are no fixed groups like the triers and non-triers in
classification.
• Instead, the purpose of clustering is to discover the number
of groups and their characteristics, based entirely on the
data.
• Clustering or segmentation tries to attach cases to categories (or
clusters), with high similarity within categories and high dissimilarity
across categories.
• The key to all clustering methods is the development of a
dissimilarity measure. Once a dissimilarity measure is developed, a
clustering algorithm attempts to find clusters of rows where rows
within a cluster are similar and rows in different clusters are
dissimilar.
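As an illustration of a dissimilarity measure, the sketch below uses Euclidean distance between rows of numeric values. This is only one common choice and is an assumption here, since the text above does not name a specific measure. It also assumes the variables are already on comparable scales; otherwise the variable with the largest units dominates the distance.

```python
import math

def euclidean(row_a, row_b):
    """Straight-line distance between two rows of numeric values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))

def dissimilarity_matrix(rows):
    """Pairwise distances; entry [i][j] compares row i with row j."""
    return [[euclidean(a, b) for b in rows] for a in rows]

# Hypothetical customers described by two scaled variables.
customers = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
distances = dissimilarity_matrix(customers)
```

Small entries in the matrix mark rows that a clustering algorithm would tend to place in the same cluster; large entries mark rows it would tend to separate.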
• A popular application of cluster analysis is called customer or
market segmentation, where companies analyze a large amount of
customer-related demographic and behavioral data and group
customers into different market segments.
• Two common clustering techniques are hierarchical clustering and
K-means clustering.
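To make K-means concrete, here is a minimal sketch of the usual two-step iteration: assign each row to its nearest cluster center, then move each center to the mean of its rows. Two simplifications are assumptions of this sketch rather than part of the method: the first k rows are used as starting centers (real software typically uses random or smarter starts), and squared Euclidean distance is the dissimilarity measure.

```python
def k_means(rows, k, max_iter=100):
    """Cluster rows (tuples of numbers) into k groups.

    Returns (labels, centroids), where labels[i] is the cluster of rows[i].
    """
    # Assumption for reproducibility: start from the first k rows.
    centroids = [tuple(row) for row in rows[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins the cluster with the nearest center.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])),
            )
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

For instance, k_means([(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0)], 2) separates the two rows near (1, 1.5) from the two rows near (9, 8.5). The number of clusters k must be chosen by the analyst, whereas hierarchical clustering builds a nested sequence of groupings instead.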
• For example, a credit card company might group customers into
those who pay off their account balance every month versus those
who carry a monthly balance, and within these two customer
segments, group them further according to their spending habits.
• The company would likely target each of the customer segments
with different promotion and advertising campaigns or design
different financial products for each group.
Common Clustering Methods (Jaggia et al., 2021)
Agglomerative clustering (agglomerative nesting) is referred to as AGNES, while divisive clustering (divisive analysis) is referred to as DIANA.
Association Rule Analysis (Jaggia et al., 2021)
• Another widely used unsupervised data mining technique, it is also
referred to as affinity analysis or market basket analysis.
• It is essentially a “what goes with what” study designed to identify
events that tend to occur together.
• For example, retail companies seek to identify products that
consumers tend to purchase together. This type of information is
useful for retail store managers in displaying their products on the
shelf or when promotional campaigns are developed.
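A simple way to see "what goes with what" is to count how often each pair of products appears in the same basket. The sketch below reports, for every pair, the share of baskets that contain both items, a quantity usually called support. The basket data are invented for illustration, and the function name pair_support is hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Return the share of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

# Invented transactions, for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
support = pair_support(baskets)
```

Here bread and butter appear together in three of the four baskets, so a store manager might shelve them near each other or promote them jointly.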
Forecasting Methods (Jaggia et al., 2021)
Quantitative Forecasting Methods (Jaggia et al., 2021)
SIMPLE SMOOTHING TECHNIQUES
1. Moving Average Technique
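As a brief illustration of the moving average technique, the sketch below averages the most recent m observations and uses that average as the forecast for the next period. The window length m is chosen by the analyst, and the function names are hypothetical.

```python
def moving_averages(series, m):
    """All m-period averages, one for each window of m consecutive values."""
    return [sum(series[i - m:i]) / m for i in range(m, len(series) + 1)]

def moving_average_forecast(series, m):
    """Forecast for the next period: the mean of the last m observations."""
    if m <= 0 or m > len(series):
        raise ValueError("m must be between 1 and the length of the series")
    return sum(series[-m:]) / m
```

For the series 10, 20, 30, 40 with m = 2, the two-period averages are 15, 25, and 35, and the forecast for the next period is 35. A larger m gives a smoother series that reacts more slowly to recent changes.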
2. Simple Exponential Smoothing Technique
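Simple exponential smoothing can be sketched as a running weighted average in which each new level equals alpha times the latest observation plus (1 - alpha) times the previous level. Two assumptions are made here: the first level is set equal to the first observation (a common convention), and the smoothing constant alpha lies between 0 and 1 and is supplied by the analyst.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The last level serves as the forecast for the next period.
    """
    level = series[0]  # assumption: initialize with the first observation
    levels = [level]
    for y in series[1:]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels
```

With the series 10, 20, 30 and alpha = 0.5, the levels are 10, 15, and 22.5, so the forecast for the fourth period is 22.5. Values of alpha close to 1 track recent observations closely, while values close to 0 smooth more heavily.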
LINEAR REGRESSION MODELS FOR TREND AND SEASONALITY
1. The Linear Trend Model
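The linear trend model treats the series as a straight line in time, y_t = b0 + b1 * t plus an error term, with t = 1, 2, ..., n. The sketch below estimates b0 and b1 by ordinary least squares using the textbook closed-form formulas; the function names are hypothetical, and in practice a regression routine in a statistics package would normally be used.

```python
def fit_linear_trend(series):
    """Least-squares estimates (b0, b1) for y_t = b0 + b1 * t, t = 1..n."""
    n = len(series)
    t_mean = (n + 1) / 2
    y_mean = sum(series) / n
    # Slope: co-variation of t and y divided by the variation of t.
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series, start=1))
    sxx = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    return b0, b1

def trend_forecast(b0, b1, t):
    """Forecast for period t from the fitted line."""
    return b0 + b1 * t
```

For the series 3, 5, 7, 9 the fit gives b0 = 1 and b1 = 2, so the forecast for period 5 is 11.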
2. The Linear Trend Model with Seasonality
NONLINEAR REGRESSION MODELS FOR TREND AND SEASONALITY
3. The Exponential Trend Model
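An exponential trend can be sketched by fitting a straight line to the natural log of the series, ln(y_t) = b0 + b1 * t, and then converting forecasts back with the exponential function. The sketch assumes every observation is positive. Some treatments add a correction term based on the regression's standard error when converting back; that adjustment is omitted here for simplicity. The function names are hypothetical.

```python
import math

def fit_exponential_trend(series):
    """Least-squares estimates (b0, b1) for ln(y_t) = b0 + b1 * t, t = 1..n."""
    n = len(series)
    logs = [math.log(y) for y in series]  # requires every y to be positive
    t_mean = (n + 1) / 2
    log_mean = sum(logs) / n
    sxy = sum((t - t_mean) * (ly - log_mean) for t, ly in enumerate(logs, start=1))
    sxx = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    b1 = sxy / sxx
    b0 = log_mean - b1 * t_mean
    return b0, b1

def exponential_forecast(b0, b1, t):
    """Convert the fitted log-scale line back to the original units."""
    return math.exp(b0 + b1 * t)
```

For a series that doubles every period, such as 2, 4, 8, 16, the fitted slope b1 is approximately ln(2) and the forecast for period 5 is approximately 32.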
4. The Polynomial Trend Model
NONLINEAR REGRESSION MODELS WITH SEASONALITY
1. The Exponential Trend Model with Seasonal Dummy Variables
2. The Quadratic Trend Model with Seasonal Dummy Variables
References
• Albright, C. and Winston, W. (2020). Business Analytics: Data Analysis
and Introduction to Decision Making, 5th Edition. Cengage Learning.
• Jaggia, S., Kelly, A., Lertwachara, K. and Chen, L. (2021). Business
Analytics: Communicating with Numbers. McGraw-Hill Education.