Handout 2 Data Mining

Data mining is the process of extracting knowledge from large datasets using various statistical and analytical tools to identify patterns and relationships among variables. It is essential for organizations to manage and analyze vast amounts of data, enabling informed decision-making and uncovering hidden insights. Techniques include supervised and unsupervised learning, clustering, classification, and association rule mining, with applications across marketing, finance, and manufacturing.

Uploaded by

mishhra.shailja

11/30/2024

Slide - 1

What is Data Mining

Slide - 2

What is Data Mining

It is the process of mining knowledge from large amounts of data.

[Diagram: Data Mining Techniques → Useful Data]

Slide - 3


Data Mining
• Data mining is focused on better understanding of
characteristics and patterns among variables in large
databases using a variety of statistical and analytical tools.
– It is used to identify relationships among variables in
large data sets and understand hidden patterns that
they may contain.

Slide - 4

Why do we use Data Mining


• Companies and organizations collect huge amounts of data
from different sources and platforms.
• As the size of a database increases, it becomes difficult to
manually search it for useful information.
• Data mining techniques, which include AI and complex
mathematical algorithms, are used to extract specific and
useful data.
• This specific data helps in decision making.

Slide - 5

Why do we use Data Mining


• We also get trends, patterns, and insights from the collected data.
• Data mining is also called “Knowledge Discovery in
Databases (KDD).”
• The term “data mining” came into use around 1990.

Slide - 6


Data Mining
• Data mining can be considered part descriptive and part
prescriptive analytics.
• In descriptive analytics, data-mining tools help analysts to
identify patterns in data.
• Excel charts and PivotTables, for example, are useful tools
for describing patterns and analyzing data sets; however,
they require manual intervention.
• Regression analysis and forecasting models help us to
predict relationships or future values of variables of
interest.

Slide - 7

Data Mining Overview (1/12)


• The terms ‘artificial intelligence,’ ‘machine learning,’ and ‘data
mining’ are often used interchangeably.
• Their definitions overlap with no clear boundaries.
• They describe applications of computer software used to obtain
insightful solutions that traditional data analysis techniques may
not be able to achieve.
• In a very broad sense, artificial intelligence is used to describe
computer systems that demonstrate human-like intelligence and
cognitive abilities
– Deduction
– Pattern recognition
– Interpretation of complex data
• Examples: Deep Blue playing chess, Watson


Slide - 8

Data Mining Overview (2/12)


• Machine learning describes techniques that integrate self-learning
algorithms. (The term was coined by Arthur Samuel of IBM in 1959.)
• It is an application of artificial intelligence that allows the
computer to learn automatically without human intervention
or assistance.
• Machine learning techniques can uncover hidden patterns
and relationships in data.
• They use self-learning algorithms to evaluate results and improve
performance over time.
• Example: Predict rider demand to strategically dispatch
drivers for Uber.

Slide - 9


Data Mining Overview (3/12)


• Data mining describes the process of applying a set of
analytical techniques necessary for the development of
machine learning and artificial intelligence.
• Data mining is often recognized as a building block of
machine learning and artificial intelligence.
– Uncover hidden patterns and relationships in data
– Gain insights and derive relevant information to help make
decisions

• Data mining techniques are used for data segmentation,
pattern recognition, classification, and prediction.
• Example: Group customers into segments for customized
promotions.

Slide - 10

Data Mining Overview (4/12) : Process


• Data mining is a complex process of examining data and
applying analytical techniques to gain valuable insights.
• Requires a systematic approach to managing and
conducting data mining projects.
• A popular approach is based on the Cross-Industry
Standard Process for Data Mining (CRISP-DM)
methodology.
• Although there are other data mining methodologies, many
practitioners prefer CRISP-DM.
• It emphasizes business goals and objectives prior to
preparing the data and choosing analysis techniques.

Slide - 11

Data Mining Overview (5/12)


• CRISP-DM was developed in the 1990s by a group of five
companies: SPSS, Teradata, Daimler AG, NCR, and OHRA.
• CRISP-DM consists of six major phases.
1. Business understanding: situational context, specific objectives, project
schedule, deliverables
2. Data understanding: collecting raw data, preliminary results, potential
hypotheses
3. Data preparation: record and variable selection, wrangling, cleaning
4. Modeling: selection and execution of data mining techniques, convert or
transform data to formats/types needed for certain analyses, document
assumptions, cross-validation
5. Evaluation: evaluate performance of competing models, select best
models, review and interpret results, develop recommendations
6. Deployment: develop a set of actionable insights and a strategy for
deployment/monitoring/feedback

Slide - 12


Data Mining Overview (6/12)


Slide - 13

Data Mining Overview (7/12)


• It is important to note that not every step of the CRISP-DM framework is
needed for all data mining applications.
• The data preparation phase plays a significant role in the data mining
process.
• An analyst or analytics team tends to spend a sizable portion of the
project time (often 80%) on understanding, cleansing, transforming, and
preparing data leading up to the modeling activities.
• The CRISP-DM methodology is popular among data mining
practitioners because it offers a holistic approach to data mining with
detailed phases, tasks, and activities.
• Other data mining methodologies include SEMMA (for Sample, Explore,
Modify, Model, and Assess) and KDD (Knowledge Discovery in
Databases).

Slide - 14

Data Mining Overview (8/12)


• Data mining algorithms are classified into two types of techniques
depending on the way they learn about data.
– Supervised data mining techniques are used for developing predictive models.
– Unsupervised data mining techniques are effective for data exploration,
dimension reduction, and pattern recognition.

• The key distinction between supervised and unsupervised techniques is
that, in supervised data mining, the target variable is identified.
– In regression models, the target variable is the response variable.
– The historical values of the target variable exist in the data set.
• Data mining algorithms can examine the impact of the predictor
variables on the target variable.
• In contrast, in unsupervised data mining, no target variable is
identified.

Slide - 15


Data Mining Overview (9/12)


• Some of the most commonly used supervised data mining
algorithms are based on classic statistical techniques.
• Examples include the linear regression model and the logistic
regression model.
• Use information on the predictor variables (x_1, x_2, …, x_k) to
predict and/or describe changes in the target variable (y).
• A regression model is therefore “trained” or “supervised” because
the known values of the target variable are used to build the
model.
• The performance of the model can be evaluated based on how
the predicted values deviate from the actual values.

Slide - 16
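
The idea of a "trained" regression model and of judging it by how far predictions fall from actual values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a single predictor rather than x_1, …, x_k, the data values are invented, and the helper names fit_simple_regression and rmse are hypothetical, not from the handout.

```python
# Hypothetical data (made up): one predictor x and a numerical target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def rmse(actual, predicted):
    """Root mean squared error: typical size of the prediction errors."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5


# "Training" uses the known target values y to estimate the coefficients.
b0, b1 = fit_simple_regression(x, y)
predicted = [b0 + b1 * xi for xi in x]

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("RMSE:", round(rmse(y, predicted), 3))
```

In practice the error would be measured on records held out from training, which is what the cross-validation step in CRISP-DM is for.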

Data Mining Overview (10/12)


• Common applications of supervised data mining include
classification and prediction models.
• In a classification model, the target variable is categorical.
– Predict the class memberships of new cases
– Example: classify a stock as buy, hold, or sell
• In a prediction model, the target variable is numerical.
– Predict the target for a new case
– Example: spending of a customer
• Other machine learning algorithms: k-Nearest Neighbors,
naïve Bayes, Decision Trees


Slide - 17

Data Mining Overview (11/12)


• Unsupervised data mining requires no knowledge of the
target variable.
• The algorithms allow the computer to identify patterns and
relationships in the data without any specific guidance from
the analyst.
• Unsupervised learning is considered to be an important part
of exploratory data analysis and descriptive analytics.
• Used prior to conducting supervised learning in order to
understand the data set, formulate questions, or summarize
data.
• Common applications of unsupervised learning include
dimension reduction and pattern recognition.

Slide - 18


Data Mining Overview (12/12)


• Dimension reduction converts a set of high-dimensional
data (large number of variables) into data with fewer
dimensions without losing much of the information.
– Deploy before other data mining methods
– Reduce information redundancy, improve model stability
– Relevant for big data to bring out important patterns and build more
stable models

• Pattern recognition is the process of recognizing patterns using machine
learning.
– Recurring sequences
– Frequent combinations
– Recognizable features
– Common characteristics

Slide - 19
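
As a rough illustration of dimension reduction, the sketch below replaces two highly correlated variables with a single score while keeping almost all of the variance. The technique used is principal component analysis (PCA), which is one common choice and is an assumption here, since the slide does not name a specific method. The data and the function name first_principal_component are hypothetical.

```python
import math

# Hypothetical data (made up): two variables that carry nearly the same information.
data = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8), (10.0, 10.0)]


def first_principal_component(points):
    """PCA for two variables using the closed-form eigen-decomposition of a 2x2 matrix.

    Returns (direction, scores, explained), where explained is the share of
    total variance kept by the single new variable.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    root = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
    lam1 = (sxx + syy + root) / 2
    lam2 = (sxx + syy - root) / 2
    # Eigenvector belonging to the larger eigenvalue.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # Project each centered record onto that direction: two columns become one.
    scores = [(p[0] - mx) * direction[0] + (p[1] - my) * direction[1] for p in points]
    explained = lam1 / (lam1 + lam2)
    return direction, scores, explained


direction, scores, explained = first_principal_component(data)
print("direction:", direction)
print("share of variance kept:", round(explained, 4))
```

Because the two columns are almost redundant, one score per record is enough for later methods, which is the "reduce information redundancy" point on the slide.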

Data Mining Techniques include

Statistics:
1. Cluster techniques
2. Regression
3. Classification
4. Segmentation

AI:
Different AI algorithms

ML:
1. k-NN algorithm
2. Apriori algorithm
3. k-means algorithm
4. Naïve Bayes algorithm

Shopping on Amazon

Slide - 20

Shopping on Amazon

Slide - 21


The Scope of Data Mining


• Cluster Analysis
– identifying groups in which elements are in some way similar
• Classification
– analyzing data to predict how to classify a new data element
• Association
– analyzing databases to identify natural associations among
variables and create rules for target marketing or buying
recommendations
• Cause-and-effect Modeling
– developing analytic models to describe relationships between
metrics that drive business performance

Slide - 22

Cluster Analysis
• Cluster analysis, also called data segmentation, is a
collection of techniques that seek to group or segment a
collection of objects (observations or records) into subsets
or clusters, such that those within each cluster are more
closely related to one another than objects assigned to
different clusters.
– The objects within clusters should exhibit a high
amount of similarity, whereas those in different clusters
will be dissimilar.

Slide - 23

Clustering Methods
• Hierarchical clustering
– Agglomerative
clustering methods,
which proceed by series
of fusions of the n
objects into groups.
– Divisive clustering
methods, which
separate n objects
successively into finer
groupings.

Slide - 24


Single Linkage Clustering


• An agglomerative method that keeps forming clusters from
the individual objects until only one cluster is left.
• In the single linkage method, the distance between two
clusters r and s, is defined as the minimum
distance between any object in cluster r and any object in
cluster s.

Slide - 28

Dendrogram
• Visualization of the clustering process. The y-axis
measures the intercluster distance. A dendrogram shows
the sequence in which clusters are formed as you move up
the diagram.

Slide - 33
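
The single linkage procedure and the merge sequence that a dendrogram displays can be sketched as follows. This is a minimal, unoptimized illustration: the points are invented, Euclidean distance is assumed, and the function name single_linkage is hypothetical. Each recorded merge distance corresponds to the height at which two branches would join on the y-axis.

```python
import math

# Hypothetical 2-D records (made up); each starts as its own cluster.
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 9.0)]


def single_linkage(points):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are lists of record indices.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance between any object in
                # cluster a and any object in cluster b.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


merges = single_linkage(points)
for left, right, height in merges:
    print(left, "+", right, "merged at distance", round(height, 3))
```

Reading the printed merges from first to last is the same as reading a dendrogram from the bottom up; cutting the sequence before the large jump in distance leaves the natural groups.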

Classification
• Classification methods seek to classify a categorical
outcome into one of two or more categories based on
various data attributes.
• For each record in a database, we have a categorical
variable of interest and a number of additional predictor
variables.
• For a given set of predictor variables, we would like to
assign the best value of the categorical variable.

Slide - 34


Classification Techniques
• k-Nearest Neighbors (k-NN) Algorithm
– Finds records in a database that have similar numerical
values of a set of predictor variables.
• Discriminant Analysis
– Uses predefined classes based on a set of linear
discriminant functions of the predictor variables.

Slide - 42

k-Nearest Neighbors (k-NN)


• The k-nearest neighbors (k-NN) algorithm is a
classification scheme that attempts to find records in a
database that are similar to one we wish to classify.
Similarity is based on the “closeness” of a record to
numerical predictors in the other records, using normalized
Euclidean distances.

Slide - 43

k-Nearest Neighbor Rules


• The nearest neighbor to a record is the one that has
the smallest distance from it.
– If k = 1, then the 1-NN rule classifies a record in the
same category as its nearest neighbor.
– The k-NN rule finds the k nearest neighbors to each record
we want to classify and then assigns the classification
held by the majority of the k nearest
neighbors.
• Typically, various values of k are used and then results
inspected to determine which is best.

Slide - 44
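
The k-NN rule with normalized Euclidean distances might look like the sketch below. It is illustrative only: the records and labels are invented, z-score normalization is assumed as the normalization method, and the function names normalize and knn_classify are hypothetical.

```python
import math
from collections import Counter

# Hypothetical training records (made up): (income in $000, age) and a known class.
train_x = [(30, 25), (35, 30), (40, 28), (80, 45), (85, 50), (90, 48)]
train_y = ["no", "no", "no", "buy", "buy", "buy"]


def normalize(records):
    """Z-score each predictor so that no variable dominates the distance.

    Assumes every predictor varies in the training data (non-zero std).
    """
    columns = list(zip(*records))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(columns, means)
    ]
    scaled = [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in records]
    return scaled, means, stds


def knn_classify(train_x, train_y, new_record, k):
    """Assign the majority class among the k nearest training records."""
    scaled, means, stds = normalize(train_x)
    z = tuple((v - m) / s for v, m, s in zip(new_record, means, stds))
    ranked = sorted(zip(scaled, train_y), key=lambda pair: math.dist(pair[0], z))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Try several values of k, as the slide suggests, and compare the results.
for k in (1, 3, 5):
    print(k, knn_classify(train_x, train_y, (82, 47), k))
```

Using an odd k with two classes avoids tied votes; with more classes a tie-breaking rule would be needed.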


Discriminant Analysis
• Discriminant analysis is a technique for classifying a set
of observations into predefined classes. The purpose is to
determine the class of an observation based on a set of
predictor variables.
• With only two classification groups, we can apply
regression analysis. Unfortunately, when there are more
than two, linear regression cannot be applied, and special
software must be used.

Slide - 47
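
For the two-class case, a linear discriminant function can be computed by hand. The sketch below uses Fisher's linear discriminant with a pooled covariance matrix and equal prior probabilities; these are assumptions, since the slides do not specify which variant is meant. The data and the function names fit_lda and classify are hypothetical, and only two predictors are handled so that the matrix inverse can be written out directly.

```python
# Hypothetical records (made up): two predictors per record, two predefined classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]


def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 matrix of summed squared deviations and cross-products."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(class0, class1):
    """Return (weights, cutoff) of the linear discriminant function."""
    mu0, mu1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, mu0), scatter(class1, mu1)
    dof = len(class0) + len(class1) - 2
    pooled = [[(s0[i][j] + s1[i][j]) / dof for j in range(2)] for i in range(2)]
    # Explicit inverse of the 2x2 pooled covariance matrix.
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [
        [pooled[1][1] / det, -pooled[0][1] / det],
        [-pooled[1][0] / det, pooled[0][0] / det],
    ]
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [
        inv[0][0] * diff[0] + inv[0][1] * diff[1],
        inv[1][0] * diff[0] + inv[1][1] * diff[1],
    ]
    # With equal priors the cutoff is the score of the midpoint between class means.
    mid = [(mu0[0] + mu1[0]) / 2, (mu0[1] + mu1[1]) / 2]
    cutoff = w[0] * mid[0] + w[1] * mid[1]
    return w, cutoff


def classify(w, cutoff, x):
    """Class 1 if the discriminant score exceeds the cutoff, else class 0."""
    score = w[0] * x[0] + w[1] * x[1]
    return 1 if score > cutoff else 0


w, cutoff = fit_lda(class0, class1)
print("weights:", w, "cutoff:", cutoff)
print(classify(w, cutoff, (2.5, 2.5)), classify(w, cutoff, (6.5, 7.5)))
```

With more than two classes, one discriminant function per class is compared instead of a single cutoff, which is why dedicated software is normally used.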

Association Rule Mining


• Association rule mining, often called affinity analysis,
seeks to uncover associations and/or correlation
relationships in large data sets.
– Association rules identify attributes that occur together
frequently in a given data set.
– Market basket analysis, for example, is used to
determine groups of items consumers tend to purchase
together.
• Association rules provide information in the form of if-then
(antecedent-consequent) statements.

Slide - 51
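
An if-then rule from market basket data is usually judged by how often it holds. The sketch below computes support, confidence, and lift for one rule; these three measures are standard in association rule mining but are not named on the slide, so treat them as an addition. The transactions and the function names support and rule_metrics are hypothetical.

```python
# Hypothetical market baskets (made up); each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule: if antecedent then consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift


sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, transactions)
print("if diapers then beer:")
print("support", round(sup, 2), "confidence", round(conf, 2), "lift", round(lift, 2))
```

A lift above 1 suggests that the antecedent and consequent appear together more often than they would if purchases were independent.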

Cause-and-Effect Modeling
• Correlation analysis can help us develop cause-and-effect
models that relate lagging and leading measures.
– Lagging measures tell us what has happened and are
often external business results such as profit, market
share, or customer satisfaction.
– Leading measures predict what will happen and are
usually internal metrics such as employee satisfaction,
productivity, and turnover.

Slide - 57
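
One simple way to check whether a leading measure anticipates a lagging one is to correlate the leading measure in one period with the lagging measure a period later. The sketch below does this with the Pearson correlation coefficient. The quarterly figures are invented, and the function names pearson and lagged_correlation are hypothetical. A high correlation supports, but does not prove, a cause-and-effect relationship.

```python
# Hypothetical quarterly data (made up).
employee_satisfaction = [3.1, 3.4, 3.0, 3.8, 4.0, 3.6, 4.2, 4.1]        # leading
customer_satisfaction = [70.0, 62.5, 68.0, 60.5, 76.0, 80.5, 72.0, 84.0]  # lagging


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lagged_correlation(leading, lagging, lag):
    """Correlate leading[t] with lagging[t + lag]."""
    if lag == 0:
        return pearson(leading, lagging)
    return pearson(leading[:-lag], lagging[lag:])


same_quarter = lagged_correlation(employee_satisfaction, customer_satisfaction, 0)
next_quarter = lagged_correlation(employee_satisfaction, customer_satisfaction, 1)
print("same quarter:", round(same_quarter, 3))
print("one quarter later:", round(next_quarter, 3))
```

In this made-up series the one-quarter-lag correlation is much stronger than the same-quarter one, which is the pattern expected when the internal metric leads the external result.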


Data Mining Advantages


• Marketing/Retailing:
• Data mining can benefit direct marketers by providing
precise and helpful trends regarding their target audience's
purchase habits. These trends enable marketers to focus their
marketing efforts on their target market more precisely.
For example, a software company may promote its new product
to consumers with a long history of purchasing software.
• Data mining can aid marketers in making predictions about
the goods their target customers may be interested in buying.
Marketers can use these predictions to surprise consumers and
enhance the shopping experience.

Slide - 60

Data Mining Advantages


• Banking/Crediting:
• Financial companies can benefit from data mining in areas
like credit documentation and loan records.
• A bank, for instance, can determine the degree of risk
associated with each specific loan by assessing prior
consumers who share comparable features.
• Data mining can also assist credit card issuers in alerting
customers to possibly fraudulent credit card transactions.
Credit card issuers can cut their losses even though data
mining technology does not always predict fraudulent
charges with 100% accuracy.
Slide - 61

Data Mining Advantages


• Manufacturing:
• Manufacturers can spot defective equipment and establish the
best control parameters by using data mining on operational
engineering data.
• For instance, semiconductor manufacturers face a dilemma:
even when manufacturing conditions at different wafer production
facilities are similar, the quality of the wafers is not the same,
and some even have faults for unexplained reasons.
• Data mining has been used to identify the control parameter
ranges that result in the fabrication of the golden wafer. The
desired grade wafers are then produced using those ideal
control settings.

Slide - 62


Data Mining Advantages


• Customer Identification:
• Every consumer in the market is unique. Their
fundamental behavior and traits differ.
• With the right methodology, it becomes easier to understand
their preferences. Data mining helps businesses identify their
clients better, increasing the likelihood that those clients will
buy their products.

Slide - 63

Data Mining Advantages


• Detecting Criminal Activities:
• Governments and other institutions can use market analysis
data to identify criminals.
• For instance, the data can be structured to make it easier to
analyze a customer's prior transactions. As a result, it might
quickly reveal any fraudulent activity.

Slide - 64

Data Mining Advantages


• Marketing Techniques:
• Businesses can build data models using data mining
approaches.
• Using these models, they can quickly determine which people would be
interested in their products. This makes it more likely that the
products they introduce will be profitable.
• New products chosen this way are therefore more likely to help the
company's profits grow.

Slide - 65


Data Mining Advantages


• Criminal Justice:
• By discovering patterns in location, crime type, habit, and other
behavior patterns, data mining can help law enforcement locate
and apprehend criminal offenders.

Slide - 66

Data Mining Disadvantages


• Privacy Issues:
• Businesses gather data about their customers in various ways
to understand the trends in their buying habits. Particularly now
that the internet is booming with social networks, e-commerce,
forums, and blogs, concerns about personal privacy have been
growing significantly.
• People worry that their personal information will be collected
and used unethically, which could cause them serious trouble.
• Moreover, businesses don't last forever; on occasion, they
might be bought out by another company or go out of business
entirely. When that happens, the personal information they
possess may be sold or leaked.

Slide - 67

Data Mining Disadvantages


• Safety Concerns:
• A major concern is security. Social Security numbers, birthdays,
salary information, and other details about customers and
employees are held by businesses, but how well this
information is protected remains an open question.
• Many large corporations like Ford Motor Credit Company and
Sony Pictures have seen hackers access and steal large
amounts of consumer data.
• Credit card data was stolen, and identity theft became a major
issue because so much financial and personal information was
available.

Slide - 68


Data Mining Disadvantages


• Misused or inaccurate information:
• Data mining techniques can be used improperly to gather
information for unethical objectives.
• Using this information to their advantage, unethical individuals
or organizations could discriminate against a certain group of
people or take advantage of the weak.
• A further drawback of data mining is its imperfect accuracy.
Inaccurate information will have major repercussions if used to
make decisions.

Slide - 69

Data Mining Disadvantages


• Expensive:
• Data mining can be a particularly expensive process. For instance,
businesses need to hire more staff and technical experts to
ensure that data mining is done properly. Advanced data mining
software is necessary for many firms but may be expensive.
Because it may not yield enough useful insights, data mining
often costs more than it saves for small enterprises.

Slide - 70

Data Mining Disadvantages


• Technical Knowledge:
• Various mining tools are available, depending on the intended
use. They each have a distinctive algorithm and
design.
• Selecting the appropriate tool is only possible with the
required technical knowledge. Therefore, it is necessary to assign
a competent specialist to handle the tool selection.

Slide - 71


Data Mining Disadvantages


• Accuracy:
• Even though data mining has created a framework for simple
data collection with its techniques, its accuracy is still
constrained. Making decisions can be complicated by
erroneous information that has been acquired.

Slide - 72

Data Mining Disadvantages


• Large databases are needed for data mining:
• Although data mining is one of the most effective tools in a
marketer's arsenal, it has its challenges.
• One such disadvantage is that huge datasets are necessary for
data mining to be effective.
• For instance, if an email list contains just 100 subscribers,
the data from those emails will not be enough for data
mining.
• On the other hand, more information will be available, and data
mining will be more successful, if the list has 100,000 subscribers.

Slide - 73

Data Mining Disadvantages


• Data mining methods are not perfect:
• Data mining does not always produce accurate information.
There are numerous methods for analyzing data, some
of which are more precise than others.
• Predictive models, for instance, rely on the expectation that
particular data patterns will be discovered. When only limited
data back a forecast, this can result in overestimating how
accurate it will turn out.
• Another problem arises when a database contains missing
data that must be considered to produce an accurate analysis.

Slide - 74
