Module 7 Introduction To Data Mining

Data mining is a process that uses statistical and machine learning techniques to uncover hidden patterns in large datasets. It involves developing models from sample data, known as training data, in order to discover patterns in new data. There are two main methodologies for data mining: CRISP-DM, which consists of six phases for conducting data mining analysis, and SEMMA, which focuses on a core set of tasks. Data mining relies on data warehouses, which are large databases designed to analyze patterns in historical data from multiple sources. Data warehouses differ from operational databases in that they focus on analytical querying rather than transaction processing.

INTRODUCTION TO DATA MINING
Learning Objectives
At the end of the module, the student should be able to:
1. Define data mining and some common approaches used in data
mining;
2. Distinguish among a database, a data warehouse, and a
data mart;
3. Differentiate Online Analytical Processing (OLAP) and Online
Transactional Processing (OLTP);
4. Describe the data mining methodologies.
Data Mining
Data mining is a field of business analytics focused on better
understanding characteristics and patterns among variables in
large databases using a variety of statistical and analytical
tools (Evans, 2017).
Data mining includes a wide variety of statistical procedures
for exploring data, including regression analysis (Evans, 2017).
Data mining attempts to discover patterns, trends, and
relationships among data, especially nonobvious and
unexpected patterns (Albright & Winston, 2020).
Data Mining (Jaggia et al, 2021)
Data mining describes the process of applying a set of
analytical techniques necessary for the development of
machine learning and artificial intelligence.
The goal of data mining is to uncover hidden patterns
and relationships in data, which allows us to gain insights
and derive relevant information to help make decisions
(Jaggia et al., 2021).
Data Mining (Albright and Winston, 2020)
The place to start is with a data warehouse. A data warehouse is a huge
database that is designed specifically to study patterns in data. It should:
1. Combine data from multiple sources to discover as many relationships as
possible;
2. Contain accurate and consistent data;
3. Be structured to enable quick and accurate responses to a variety of
queries; and
4. Allow follow-up responses to specific relevant questions.
A data warehouse represents a type of database that is specifically
structured to enable data mining.
Data Mining (Jaggia et al, 2021, page 318)
Data mining is recognized as a building block of machine learning and
artificial intelligence.
Data Mining Process (Jaggia et al, 2021, page 319)
There is a growing need for the establishment of standards in this field.
Two commonly adopted are CRISP-DM and SEMMA methodologies
Data Mining Process (Jaggia et al, 2021, page 319)
What is CRISP-DM Methodology?
When conducting data mining
analysis, practitioners generally adopt
either CRISP-DM methodology or
SEMMA methodology.
CRISP-DM stands for Cross-Industry
Standard Process for Data Mining and
consists of six major phases. It was
developed in the 1990s by SPSS,
Teradata, Daimler AG, NCR, and OHRA.
Data Mining Process (Jaggia et al, 2021, page 319)

Some practitioners prefer the SEMMA methodology. Developed by
the SAS Institute, this methodology focuses on a core set of tasks
(Sample, Explore, Modify, Model, and Assess) and provides a
step-by-step process for analyzing data.
Database vs Data Warehouse
• A database is an organized collection of information stored in a way
that makes logical sense and that facilitates easier search, retrieval,
manipulation, and analysis of data.
• Perhaps the most common way of classifying databases is SQL vs.
NoSQL (also known as relational vs. non-relational).
• A data warehouse is a system that aggregates and stores information
from a variety of disparate sources within an organization.
• The goal of a data warehouse is explicitly business-oriented; it is
designed to facilitate decision-making by allowing end-users to
consolidate and analyze information from different sources.
https://www.xplenty.com/blog/data-warehouse-vs-database-what-are-the-key-differences/
Database vs. Data Warehouse
• A database and a data warehouse both store data. A database stores
real-time information about one particular part of your business, while
a data warehouse stores historical data about your business; it does
not store current information, nor is it updated in real time.
• A database's main job is to process the daily transactions that your
company makes, e.g., recording which items have sold. A data
warehouse, in contrast, is a system that pulls together data from many
different sources within an organization for reporting and analysis.
The reports created from complex queries within a data warehouse are
used to make business decisions.
Data warehouse and Data mart (Albright and Winston, 2020)
• A data warehouse is a huge database designed specifically to study
patterns in data.
• A data mart is a scaled-down version, or part, of a data warehouse,
structured specifically for one part of an organization, such as sales.
Database vs. Data Warehouse
• A data warehouse is a relational or multidimensional database that is
designed for query and analysis.
• A database, however, is focused on the day-to-day operations of the
company, while a data warehouse is used to analyze historical data
and extract insights from it.
• Data warehouses are not optimized for transaction processing, which
is the domain of OLTP systems; rather, a data warehouse is designed
for analytical processing, the main focus of OLAP.
• The distinction between Online Analytical Processing (OLAP) and
Online Transactional Processing (OLTP) is the key difference between
a data warehouse and a database.
Database vs Data Warehouse
• Databases. This type of processing responds immediately to
user requests, and so is used to process the day-to-day
operations of a business in real time. For example, if a user
wants to reserve a hotel room using an online booking form,
the process is executed with OLTP.
• Data warehouses. This type of processing gives analysts the
power to look at your data from different points of view. For
example, even though your database records sales data for
every minute of every day, you may just want to know the
total amount sold each day.
Database vs. Data Warehouse
• Databases use OnLine Transactional Processing (OLTP) to
delete, insert, replace, and update large numbers of short
online transactions quickly.
• Data warehouses use OnLine Analytical Processing (OLAP)
to analyze massive volumes of data rapidly.
Database vs. Data Warehouse Comparison Chart
Parameter          | Database                               | Data Warehouse
Use                | Recording data                         | Analyzing data
Processing Methods | OnLine Transactional Processing (OLTP) | OnLine Analytical Processing (OLAP)
Concurrent Users   | Thousands                              | Limited number
Use Cases          | Small transactions                     | Complex analysis
Downtime           | Always available                       | Some scheduled downtime
Optimization       | For CRUD (create, read, update, and delete) operations | For complex analysis
Data Type/Timeline | Real-time detailed data                | Summarized historical data
https://www.xplenty.com/blog/data-warehouse-vs-database-what-are-the-key-differences/
OLTP vs. OLAP
• OLTP and OLAP: The two terms look similar but refer to
different kinds of systems.
• Online transaction processing (OLTP) captures,
stores, and processes data from transactions in real time.
• Online analytical processing (OLAP) uses complex queries
to analyze aggregated historical data from OLTP systems.
• Examples of OLTP applications are ATM centers, online
banking, online booking, sending text messages, etc.

https://www.guru99.com/oltp-vs-olap.html
OLTP vs. OLAP
• Examples of the use of OLAP are as follows:
• Spotify analyzes the songs its users play to build a
personalized homepage of songs and playlists.
• Netflix movie recommendation system.

Source: Difference between OLAP and OLTP in DBMS - GeeksforGeeks


KEY DIFFERENCE between OLTP and OLAP:
• Online Analytical Processing (OLAP) is a category of software tools
that analyze data stored in a database whereas Online transaction
processing (OLTP) supports transaction-oriented applications in a
3-tier architecture.
• OLAP is characterized by a large volume of data while OLTP is
characterized by large numbers of short online transactions.
• In OLAP, a data warehouse is created specifically to
integrate different data sources into a consolidated
database, whereas OLTP uses a traditional DBMS.
Benefits of using OLAP services
• OLAP creates a single platform for all types of business
analytical needs, including planning, budgeting,
forecasting, and analysis.
• The main benefit of OLAP is the consistency of
information and calculations.
• Easily apply security restrictions on users and objects
to comply with regulations and protect sensitive data.

https://www.xplenty.com/blog/snowflake-schemas-vs-star-schemas-what-are-they-and-how-are-they-different/
Benefits of OLTP method
• It administers daily transactions of an organization.
• OLTP widens the customer base of an organization by
simplifying individual processes.
• OLTP systems are optimized for transactional throughput
rather than data analysis, so they can handle the many
simultaneous short transactions that OLAP systems, which
consolidate large volumes of data from different sources,
are not designed to process.
https://www.xplenty.com/blog/snowflake-schemas-vs-star-schemas-what-are-they-and-how-are-they-different/
OLTP vs. OLAP Comparison Chart
Online Analytical Processing (OLAP)                                      | Online Transactional Processing (OLTP)
Consists of historical data from various databases                       | Consists only of current operational data
Subject oriented; used for data mining, analytics, decision making, etc. | Application oriented; used for business tasks
The data is used in planning, problem solving, and decision making       | The data is used to perform day-to-day fundamental operations
Provides a multi-dimensional view of different business tasks            | Reveals a snapshot of present business tasks
A large amount of data is stored, typically in TB or PB                  | The size of the data is relatively small as historical data is archived, e.g., MB or GB
Generally managed by CEOs, managing directors, and general managers      | Managed by clerks, managers, and encoders
Mostly read and only rarely write operations                             | Both read and write operations
Methods of Data mining (Albright and Winston, 2020)
Once a data warehouse is in place, analysts can begin to mine the data with a collection of
methodologies:
• Classification analysis
• Prediction
• Cluster analysis
• Market basket analysis
• Forecasting
Numerous software packages are available that perform various data mining
procedures.
Supervised and Unsupervised Data Mining Techniques
(Albright and Winston, 2020)
• In supervised data mining techniques, there is a dependent variable
that the method is trying to predict.

Source: Jaggia et al, 2021, p. 320


Supervised and Unsupervised Data Mining Techniques
(Albright and Winston, 2020)
• In unsupervised data mining techniques, there is no dependent
variable. Instead, these techniques search for patterns and
structure among all of the variables.
• Clustering or segmentation is the most common unsupervised
method.
• Another popular unsupervised method is market basket
analysis (also called association analysis), where patterns of
customer purchases are examined to see which items
customers tend to purchase together, in the same “market
basket.”
Supervised and Unsupervised Data Mining Techniques
(Jaggia et al, 2021)
Classification Methods (Albright and Winston, 2020)

• One of the most important problems studied in data mining is the
classification problem.
• This is basically the same problem attacked by regression
analysis, but now the dependent variable is categorical.
• Each of the classification methods has the same
objective: to use data from the explanatory variables to
classify each record (person, company, or whatever) into
one of the known categories.
Classification Methods (Albright and Winston, 2020)

• It attempts to find variables that are related to a categorical
(often binary) variable.
• For example, classification analysis would attempt to find
explanatory variables that would help predict whether a credit card
holder will pay their balances in a reasonable amount of time or not.
Classification Methods (Albright and Winston, 2020)
• Data partitioning plays an important role in classification.
• The data set is partitioned into two or even three distinct subsets before
algorithms are applied.
• The first subset, usually with about 70% to 80% of the records, is called
the training set. The algorithm is trained with data in the training set.
• The second subset, called the testing set, usually contains the rest of
the data. The model from the training set is tested on the testing set.
• Some software packages might also let you specify a third subset, often
called a prediction set, where the values of the dependent variables
are unknown. Then you can use the model to classify these unknown
values.
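The partitioning step above can be sketched in a few lines of Python (an illustrative sketch; the 70/30 split, the fixed seed, and the `partition` helper are assumptions, not from the textbooks):

```python
import random

def partition(records, train_frac=0.7, seed=42):
    """Split records into a training set and a testing set."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = records[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]   # (training set, testing set)

records = list(range(100))
train, test = partition(records, train_frac=0.7)
print(len(train), len(test))   # 70 30
```

The algorithm is then trained on `train`, and the resulting model is evaluated on `test`, which it never saw during training.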
A. Logistic Regression (Albright and Winston, 2020)
• Logistic regression is a popular method for classifying individuals,
given the values of a set of explanatory variables.
• It estimates the probability that an individual is in a particular
category.
• It uses a nonlinear function of the explanatory variables for
classification.
• It is essentially regression with a binary (0-1) dependent variable.
• For the two-category problem, the binary variable indicates
whether an observation is in category 0 or category 1.
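As a minimal sketch of the idea, the following fits a one-variable logistic regression by gradient ascent on a toy two-category dataset (the data, learning rate, and `fit_logistic` helper are hypothetical, for illustration only):

```python
import math

def sigmoid(z):
    """Nonlinear function mapping any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Estimate P(y=1 | x) = sigmoid(b0 + b1*x) by gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)   # current estimated probability
            b0 += lr * (y - p)         # log-likelihood gradient w.r.t. b0
            b1 += lr * (y - p) * x     # log-likelihood gradient w.r.t. b1
    return b0, b1

# Toy data: larger x values tend to belong to category 1
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   1,   1]
b0, b1 = fit_logistic(xs, ys)
classify = lambda x: 1 if sigmoid(b0 + b1 * x) >= 0.5 else 0
print([classify(x) for x in xs])  # data is separable, so [0, 0, 0, 1, 1, 1]
```

The estimated probability, not just the 0/1 label, is often the useful output, e.g., ranking credit card holders by their probability of paying on time.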
B. Discriminant Analysis (Albright and Winston, 2020)

• Most software packages include another classification procedure
called discriminant analysis.
• This is a classical technique developed many decades ago
that is still in use.
• It is somewhat similar to logistic regression and has the
same basic goals.
• However, it is not as prominent in data mining discussions
as logistic regression.
C. Neural Network (Albright and Winston, 2020)
• The neural network (or neural net) methodology is an attempt
to model the complex behavior of the human brain.
• It sends inputs (the values of explanatory variables) through a
complex nonlinear network to produce one or more outputs
(the values of the dependent variable).
• It can be used to predict a categorical dependent variable or a
numeric dependent variable.
C. Neural Network (Albright and Winston, 2020)
• The biggest advantage of neural nets is that they often
provide more accurate predictions than any other
methodology, especially when relationships are highly
nonlinear.
• However, neural nets do not provide easily interpretable
equations where you can see the contributions of the
individual explanatory variables.
Neural Networks (Albright and Winston, 2020)
• Each neural net has an associated network diagram, like the
one described below.

❑This figure assumes two inputs and one output.


❑The network also includes a “hidden layer” in the middle
with two hidden nodes.
© 2020 Cengage. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-
protected website or school-approved learning management system for classroom use.
Neural Networks (Albright and Winston, 2020)

❑Scaled values of the inputs enter the network at the left; they are weighted
by the W values and summed, and these sums are sent to the hidden nodes.
❑At the hidden nodes, the sums are “squished” by an S-shaped logistic-type
function.
❑These squished values are then weighted and summed, and the sum is sent to
the output node, where it is squished again and rescaled.
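The forward pass described above can be sketched directly (the weights are hypothetical, and this sketch omits the bias terms and output rescaling that real packages include):

```python
import math

def squish(z):
    """The S-shaped logistic-type function applied at hidden and output nodes."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w_hidden, w_out):
    """Forward pass for the diagram: 2 inputs -> 2 hidden nodes -> 1 output."""
    # Weighted sums of the inputs are sent to each hidden node, then squished
    h = [squish(w[0] * x1 + w[1] * x2) for w in w_hidden]
    # The squished hidden values are weighted, summed, and squished again
    return squish(w_out[0] * h[0] + w_out[1] * h[1])

# Hypothetical weights, one pair per hidden node plus one pair for the output
w_hidden = [(0.5, -0.3), (-0.2, 0.8)]
w_out = (1.0, -1.0)
y = forward(0.4, 0.7, w_hidden, w_out)
print(round(y, 3))
```

Training a neural net consists of adjusting these W values so that the outputs match the observed dependent variable.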

D. Classification Trees (Albright and Winston, 2020)

• Classification trees are also capable of discovering nonlinear
relationships, but they are more intuitive.
• This method, which has many variations, has existed for decades and
has been implemented in a variety of software packages.

D. Classification Trees (Albright and Winston, 2020)
An example of a classification tree:
This classification tree leads directly to the following rules:
1. If a person makes less than 4 mall trips,
a. If the person lives in the West, classify as a trier.
b. If the person doesn’t live in the West, classify as a non-trier.
2. If the person makes 4 or 5 mall trips,
a. If the person doesn’t live in the East, classify as a trier.
b. If the person lives in the East, classify as a non trier.
3. If the person makes at least 6 mall trips, classify as a trier.

The ability of classification trees to provide such simple rules, plus
fairly accurate classifications, has made this a very popular
classification technique.
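The three rules above translate directly into code, which is exactly why classification trees are considered so interpretable (a sketch; the function name and record format are assumptions):

```python
def classify_shopper(mall_trips, region):
    """Apply the classification-tree rules above to one shopper record."""
    if mall_trips < 4:
        # Rule 1: fewer than 4 mall trips -> region decides
        return "trier" if region == "West" else "non-trier"
    elif mall_trips <= 5:
        # Rule 2: 4 or 5 mall trips -> East is the exception
        return "non-trier" if region == "East" else "trier"
    else:
        # Rule 3: at least 6 mall trips -> always a trier
        return "trier"

print(classify_shopper(2, "West"))   # trier
print(classify_shopper(5, "East"))   # non-trier
print(classify_shopper(7, "South"))  # trier
```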

Clustering Methods (Albright and Winston, 2020)
• Probably the most common unsupervised method is clustering,
known in marketing circles as segmentation.
• It tries to group entities (customers, companies, cities, etc.)
into similar clusters, based on the values of their variables.
• There are no fixed groups like the triers and nontriers in
classification.
• Instead, the purpose of clustering is to discover the number
of groups and their characteristics, based entirely on the
data.
Clustering Methods (Albright and Winston, 2020)
• Clustering or segmentation tries to attach cases to categories (or
clusters), with high similarity within categories and high dissimilarity
across categories.
• The key to all clustering methods is the development of a
dissimilarity measure. Once a dissimilarity measure is developed, a
clustering algorithm attempts to find clusters of rows where rows
within a cluster are similar and rows in different clusters are
dissimilar.
Clustering Methods (Albright and Winston, 2020)
• A popular application of cluster analysis is called customer or
market segmentation, where companies analyze a large amount of
customer-related demographic and behavioral data and group
customers into different market segments.
• Two common clustering techniques are hierarchical clustering and
K-means clustering.
Clustering Methods (Albright and Winston, 2020)
• For example, a credit card company might group customers into
those who pay off their account balance every month versus those
who carry a monthly balance, and within these two customer
segments, group them further according to their spending habits.
• The company would likely target each of the customer segments
with different promotion and advertising campaigns or design
different financial products for each group.
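A bare-bones version of K-means clustering, one of the two techniques mentioned above, can be sketched as follows (the toy points and starting centers are hypothetical; real packages choose initial centers automatically):

```python
def kmeans(points, centers, iters=10):
    """Basic K-means: assign each point to its nearest center, then recenter."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Dissimilarity measure: squared Euclidean distance to each center
            i = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Move each center to the mean of the points assigned to it
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else ctr
                   for cl, ctr in zip(clusters, centers)]
    return centers, clusters

# Two obvious customer groups, e.g., (monthly spend, balance carried)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Here the number of clusters (two) was chosen in advance; in practice analysts try several values and inspect the resulting segments.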
Common Clustering Methods (Jaggia et al, 2021)

Agglomerative clustering (AGglomerative NESting) is referred to as AGNES, while divisive clustering (DIvisive ANAlysis) is referred to as DIANA.
Common Clustering Methods (Jaggia et al, 2021)
Association Rule Analysis (Jaggia et al, 2021)
• Another widely used unsupervised data mining technique, it is also
referred to as affinity analysis or market basket analysis.
• It is essentially a “what goes with what” study designed to identify
events that tend to occur together.
• For example, retail companies seek to identify products that
consumers tend to purchase together. This type of information is
useful for retail store managers in displaying their products on the
shelf or when promotional campaigns are developed.
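A first step in a market basket analysis is simply counting which items co-occur; full association rule analysis then computes measures such as support and confidence from these counts. A sketch (the baskets are hypothetical):

```python
from itertools import combinations
from collections import Counter

def pair_counts(baskets):
    """Count how often each pair of items appears in the same market basket."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair a canonical order, e.g. ("bread", "butter")
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
counts = pair_counts(baskets)
print(counts[("bread", "butter")])  # bought together in 3 of the 4 baskets -> 3
```

A store manager might respond to the bread-butter pattern by shelving the items together or bundling them in a promotion.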
Forecasting Methods (Jaggia et al, 2021)
Quantitative Forecasting Methods (Jaggia et al, 2021)
SIMPLE SMOOTHING TECHNIQUES
1. Moving Average Technique
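A sketch of the moving average technique: the forecast for the next period is the mean of the most recent m observations (the sales series and m = 3 are hypothetical):

```python
def moving_average(series, m=3):
    """m-period moving averages: each value is the mean of the previous
    m observations, and the last one is the forecast for the next period."""
    return [sum(series[i - m:i]) / m for i in range(m, len(series) + 1)]

sales = [20, 24, 22, 26, 25, 27]
print(moving_average(sales, m=3))  # [22.0, 24.0, 24.33..., 26.0]
```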
Quantitative Forecasting Methods (Jaggia et al, 2021)
SIMPLE SMOOTHING TECHNIQUES
2. Simple Exponential Smoothing Technique
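Simple exponential smoothing maintains a level L_t = alpha * y_t + (1 - alpha) * L_{t-1}, and the latest level is the forecast for the next period. A sketch (the series and alpha value are hypothetical):

```python
def simple_exp_smoothing(series, alpha=0.2):
    """Update L_t = alpha * y_t + (1 - alpha) * L_{t-1} and return the
    final level, which serves as the forecast for the next period."""
    level = series[0]              # initialize the level at the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

sales = [20, 24, 22, 26, 25, 27]
print(simple_exp_smoothing(sales, alpha=0.5))  # 25.75
```

Larger alpha values weight recent observations more heavily; alpha = 1 reduces the forecast to the most recent observation.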
Quantitative Forecasting Methods (Jaggia et al, 2021)
LINEAR REGRESSION MODELS FOR TREND AND SEASONALITY
1. The Linear Trend Model
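The linear trend model fits y_t = b0 + b1 * t by least squares and extrapolates the line. A sketch (the series is hypothetical; it grows by exactly 3 units per period so the fit is exact):

```python
def linear_trend(series):
    """Least-squares fit of y_t = b0 + b1 * t, with t = 1, 2, ..., n."""
    n = len(series)
    ts = range(1, n + 1)
    t_bar = sum(ts) / n
    y_bar = sum(series) / n
    # Standard least-squares slope and intercept formulas
    b1 = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, series))
          / sum((t - t_bar) ** 2 for t in ts))
    b0 = y_bar - b1 * t_bar
    return b0, b1

b0, b1 = linear_trend([13, 16, 19, 22, 25])
print(b0, b1)                # 10.0 3.0
forecast_t6 = b0 + b1 * 6    # extrapolate the trend one period ahead
print(forecast_t6)           # 28.0
```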
Quantitative Forecasting Methods (Jaggia et al, 2021)
LINEAR REGRESSION MODELS FOR TREND AND SEASONALITY
2. The Linear Trend Model with Seasonality
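With seasonality, dummy variables shift the trend line for each quarter, with one quarter serving as the baseline. The sketch below only evaluates a fitted model; the coefficients are hypothetical, not estimated from data:

```python
def trend_season_forecast(t, quarter, b0, b1, seasonal):
    """y_t = b0 + b1*t + seasonal adjustment. Quarter 1 is the baseline,
    so its dummy-variable coefficients are all zero."""
    return b0 + b1 * t + seasonal.get(quarter, 0.0)

# Hypothetical fitted coefficients for illustration
b0, b1 = 100.0, 2.5
seasonal = {2: 8.0, 3: -5.0, 4: 12.0}  # dummy-variable coefficients for Q2-Q4
print(trend_season_forecast(9, 1, b0, b1, seasonal))   # 122.5
print(trend_season_forecast(12, 4, b0, b1, seasonal))  # 142.0
```

In practice the coefficients are estimated by regressing the series on t and the three quarterly dummy variables.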
Quantitative Forecasting Methods (Jaggia et al, 2021)
NONLINEAR REGRESSION MODELS FOR TREND AND SEASONALITY
3. The Exponential Trend Model
Quantitative Forecasting Methods (Jaggia et al, 2021)
NONLINEAR REGRESSION MODELS FOR TREND AND SEASONALITY
4. The Polynomial Trend Model
Quantitative Forecasting Methods (Jaggia et al, 2021)
NONLINEAR REGRESSION MODELS WITH SEASONALITY
1. The Exponential Trend Model with SEASONAL DUMMY VARIABLES
Quantitative Forecasting Methods (Jaggia et al, 2021)
NONLINEAR REGRESSION MODELS WITH SEASONALITY
2. The Quadratic Trend Model with SEASONAL DUMMY VARIABLES
Reference
• Albright, C., & Winston, W. (2020). Business Analytics: Data Analysis
and Decision Making (5th ed.). Cengage Learning.
• Jaggia, S., Kelly, A., Lertwachara, K., & Chen, L. (2021). Business
Analytics: Communicating with Numbers. McGraw-Hill Education.
