
UNIT 3

Data Mining Basics

















What is the KDD Process?


The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases.

It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, subsampling, and transformations
of that database.

An Outline of the Steps of the KDD Process

The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:

1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the
task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form or
a set of such representations as classification rules or trees, regression,
clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
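The steps above can be sketched end to end on a toy data set. The following stdlib-only Python sketch covers cleaning (step 3), choosing a task and algorithm (steps 5-6), mining (step 7), and interpretation (step 8); the records, attribute names, and the simple one-rule learner are all invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical training records; None marks a missing field (steps 1-2: target data).
records = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": None,   "buys": "yes"},
    {"age": "old",   "income": "low",  "buys": "no"},
    {"age": "old",   "income": "low",  "buys": "no"},
]

# Step 3 (cleaning): fill each missing field with that attribute's most common value.
for attr in ("age", "income"):
    mode = Counter(r[attr] for r in records if r[attr] is not None).most_common(1)[0][0]
    for r in records:
        r[attr] = r[attr] or mode

# Steps 5-7 (task, algorithm, mining): classification with a one-rule learner --
# for each attribute, map each of its values to the majority class, then keep
# the single attribute whose rule is most accurate on the training data.
def one_rule(attr):
    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[attr]][r["buys"]] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    acc = sum(rule[r[attr]] == r["buys"] for r in records) / len(records)
    return rule, acc

best_attr, (rule, acc) = max(((a, one_rule(a)) for a in ("age", "income")),
                             key=lambda t: t[1][1])
# Step 8 (interpretation): on this toy data, "age" alone predicts "buys" perfectly.
```

A real KDD run would replace each toy step with a production counterpart (e.g. statistical imputation, dimensionality reduction, a full classification or clustering algorithm), but the iterative shape of the process is the same.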

The terms knowledge discovery and data mining are distinct.

KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the decision
of what qualifies as knowledge. It also includes the choice of encoding schemes,
preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data
without the additional steps of the KDD process.

Definitions Related to the KDD Process

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Data: A set of facts, F.
Pattern: An expression E in a language L describing facts in a subset FE of F.
Process: KDD is a multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.
Valid: Discovered patterns should be true on new data with some degree of certainty; they should generalize to the future (other data).
Novel: Patterns must be novel (not previously known).
Useful: Actionable; patterns should potentially lead to some useful actions.
Understandable: The process should lead to human insight; patterns must be made understandable in order to facilitate a better understanding of the underlying data.

The Business Context of Data Mining


Why should an organisation practise data mining if it brings no impact to its business? In product marketing, the marketing manager should identify the segment of the population most likely to respond to the product. Identifying these segments involves understanding the overall population and deploying the right technique to classify it. Likewise, in predictive modelling, there are several channels through which to interact with customers: direct marketing, print advertising, telemarketing, radio and television advertising, and so on. It is only through data mining that an analyst can conclude which channel is optimal for sending communication to customers.

In addition to segmenting and targeting, data mining is also popularly used for budgeting the marketing spend, so that the budget allocation can be optimised across marketing drivers. The analysis is carried out based on the previous year's spend and its impact on sales. With the spend information for each driver (Print, TV, Radio, Online, etc.), one can determine the ROI for each driver, which uncovers the impact of these channels on sales. Based on this analysis, the marketing manager can allocate media spend in the coming year to achieve the most effective results on sales.
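The spend-to-ROI analysis described above can be sketched in a few lines. All figures, channel names, and the proportional allocation policy below are invented for illustration:

```python
# Hypothetical previous-year spend and attributed sales per marketing driver.
spend = {"Print": 40_000, "TV": 120_000, "Radio": 25_000, "Online": 60_000}
sales = {"Print": 70_000, "TV": 300_000, "Radio": 30_000, "Online": 180_000}

# ROI per driver: incremental return per unit of spend.
roi = {ch: (sales[ch] - spend[ch]) / spend[ch] for ch in spend}

# One simple allocation policy: split next year's budget in proportion to ROI.
budget = 250_000
total = sum(roi.values())
allocation = {ch: round(budget * r / total) for ch, r in roi.items()}
```

Under this toy data, Online shows the highest ROI and therefore receives the largest share of the next-year budget; a real analysis would also model diminishing returns per channel.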

Process improvement through data mining


The role of data in manufacturing has long been understated or unstated. New forms of data use and data analytics have transformed the way companies approach quality improvement. Experts in the field report a considerable shift from exclusive dependence on post-manufacturing inspection and retrospective analysis to the prediction and early identification of problem areas and maintenance requirements. New sources of data, from sensors to call-center conversations, are taking traditional product inspection to a new level. By transforming the management of quality and safety in asset-based businesses, these innovations are gradually improving the manufacturing sector. Data transforms technology, and this is only the beginning of striking changes.

The quality and safety revolution in organizations was marked by numerous technical breakthroughs, such as real-time data from connected vehicle sensors and GPS, and text derived from warranty reports and transcriptions of call-center conversations, to name a few. This data is now combined in a repository that allows for multiple data formats and analysis across them. This is exactly where machine learning algorithms come into play: their role is to identify trends in the data and to make predictions.

Why use data mining?

Businesses use data mining to draw conclusions and solve specific problems. One
of the key benefits of data mining is that it is fundamentally applicable to any
process and helps improve the flexibility and efficiency of operations. Thus,
data use in manufacturing facilitates schedule adherence, monitoring automation,
modeling for capacity, and reduction of waste. The departments are completely
transformed and factories become smarter by achieving full data transparency.

How manufacturing businesses take advantage of data mining

ABB, a manufacturer of global importance, currently uses process mining for its purchase-to-pay and production processes. Earlier, employees at the ABB plant in Hanau, Germany, would extract evaluations from their SAP systems several times a day, import them into Excel, and use complex formulas to analyze and understand processes. Today, the relevant production and assembly team leaders at ABB receive an email in the morning that outlines the previous day's production variants, throughput times, and number of rejections. As a result, the plant's full ecosystem of quality-improvement processes is immediately visible through process mining. The system only gets better at identifying patterns as more data is fed in. Instead of relying on complex manual analysis, operational processes now provide instant results.

Drastic changes have impacted the vehicle manufacturing industry too. In this sector, products are relatively expensive, with high-end manufacturers focusing on service and product quality. The business benefits of data-driven innovation include faster identification and resolution of quality problems, as well as reduced warranty spending, which amounts to between 2 and 6 percent of total sales in the automobile industry. For the customers and users of these vehicles and machines, early identification and preventive maintenance often result in greater uptime. For instance, in one case involving an automotive company, 28,000 vehicles were saved from recall by the identification of a problem before the vehicles hit the market.

Data mining tools can be very beneficial for discovering interesting and useful patterns in complicated manufacturing quality-improvement processes, and these patterns can be used to improve manufacturing quality. However, data accumulated in manufacturing plants has unique characteristics, such as an unbalanced distribution of the target attribute and a small training set relative to the number of input features. Still, business process improvement has to start somewhere. An approach that incorporates big data, analytics, and business intelligence is simply the most reliable, proven way to make improvements that last. Once you know what to measure, track it, analyse it, and improve it, you will have the right foundations in place to enhance processes throughout your business. Time and product waste will be things of the past.

Data mining as a tool for research and knowledge development in nursing
The ability to collect and store data has grown at a dramatic rate in all disciplines over the past two
decades. Healthcare has been no exception. The shift toward evidence-based practice and outcomes
research presents significant opportunities and challenges to extract meaningful information from massive
amounts of clinical data to transform it into the best available knowledge to guide nursing practice. Data
mining, a step in the process of Knowledge Discovery in Databases, is a method of unearthing
information from large data sets. Built upon statistical analysis, artificial intelligence, and machine learning
technologies, data mining can analyze massive amounts of data and provide useful and interesting
information about patterns and relationships that exist within the data that might otherwise be missed. As
domain experts, nurse researchers are in ideal positions to use this proven technology to transform the
information that is available in existing data repositories into useful and understandable knowledge to
guide nursing practice and for active interdisciplinary collaboration and research.

Data mining in marketing


• Data mining technology allows businesses to learn more about their customers and make smart marketing
decisions.
• The data mining business grows 10 percent a year as the amount of data produced is booming.
• Data mining information can help to
– increase return on investment (ROI)
– improve CRM and market analysis
– reduce marketing campaign costs
– facilitate fraud detection and customer retention.
• The 4Ps are one of the best ways of defining marketing:
–Product (or Service)
–Price
–Place
–Promotion

Benefits Using Data Mining in Marketing


• Predict future trends
• Understand customer purchase habits
• Help with decision making
• Improve company revenue and lower costs
• Market basket analysis
• Quick fraud detection

Barriers Using Data Mining in Marketing


• User privacy/security
• Overwhelming amount of data
• Great cost at the implementation stage
• Possible misuse of information
• Possible inaccuracy of data

Data Mining Techniques for Marketing


• Knowledge-based Marketing
• Market Basket Analysis
• Social Media Marketing

Knowledge-based Marketing
• It is marketing which makes use of the macro- and micro-environmental knowledge available to the
marketing function in an organization.
• The three major areas of application of data mining for knowledge-based marketing are
customer profiling, deviation analysis, and trend analysis.
• Customer profiling systems can analyse the frequency of purchases, so companies know how
often customers buy a product or visit the store.
• Deviation analysis gives the marketer the capability to query changes that occurred as a
result of recent price changes or promotions.
• Trend analysis can determine trends in sales, costs, and profits by product or market in order
to achieve the highest amount of sales.

Market Basket Analysis


• One of the most common and useful types of data analysis for marketing and retailing.
• Determines what products customers purchase together.
• Improves the effectiveness of marketing and sales tactics using customer data already available to
the company.
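Market basket analysis reduces to counting how often items co-occur across transactions. A minimal sketch on invented transactions, computing the support of item pairs and the confidence of one association rule:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (sets of items bought together).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
# Support: fraction of all transactions containing the pair.
support = {pair: c / n for pair, c in pair_counts.items()}
# Confidence of the rule bread -> milk: P(milk | bread).
confidence = pair_counts[("bread", "milk")] / item_counts["bread"]
```

Here {bread, milk} appears in 2 of 4 baskets (support 0.5), and 2 of the 3 bread baskets also contain milk (confidence 2/3). Algorithms such as Apriori apply the same counting idea efficiently to large item sets.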

Social Media Marketing


• SMM is a form of internet marketing that implements various social media networks in order to
achieve marketing communication and branding goals.
• SMM primarily covers activities involving social sharing of content, videos, and images for
marketing purposes, as well as paid social media advertising.

Data Mining Tools for Marketing


• WEKA
• Rapid Miner
• R-Programming Tool
• Python Based Orange and NTLK
• KNIME
Major Data Mining Techniques: Classification and Prediction

There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −

• Classification
• Prediction
Classification models predict categorical class labels, while prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict
the expenditure in dollars of potential customers on computer equipment, given their
income and occupation.

What is classification?
Following are the examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants)
are risky and which are safe.
• A marketing manager at a company needs to predict whether a customer with a given profile
will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.

What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are asked to predict a numeric value,
so the data analysis task is an example of numeric prediction. In this case, a
model or predictor is constructed that predicts a continuous-valued function, or
ordered value.
Note − Regression analysis is a statistical methodology that is most often used for
numeric prediction.

How Does Classification Work?


With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes two
steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set, made up of database tuples and their associated
class labels.
• Each tuple in the training set belongs to a predefined class, as determined by its class
label. Tuples are also referred to as samples, objects, or data points.

Using Classifier for Classification


In this step, the classifier is used for classification. Here the test data is used to estimate
the accuracy of classification rules. The classification rules can be applied to the new
data tuples if the accuracy is considered acceptable.
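The two steps can be sketched with any simple classifier; below, a nearest-centroid rule stands in for the learned model. The loan features, labels, and test tuples are invented for illustration:

```python
from math import dist

# Hypothetical training tuples: (features, class label) for loan applications.
# Features: (income in $1000s, years at current job).
training = [((30, 1), "risky"), ((25, 2), "risky"), ((80, 9), "safe"), ((70, 8), "safe")]
test_set = [((28, 1), "risky"), ((75, 9), "safe")]

# Step 1 (learning): build the classifier from the training set -- here, a
# centroid (mean feature vector) per class.
centroids = {}
for label in {lab for _, lab in training}:
    pts = [x for x, lab in training if lab == label]
    centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))

def classify(x):
    # Assign x to the class with the nearest centroid.
    return min(centroids, key=lambda lab: dist(x, centroids[lab]))

# Step 2 (classification): estimate accuracy on held-out test tuples before
# applying the classifier to genuinely new data.
accuracy = sum(classify(x) == y for x, y in test_set) / len(test_set)
```

The important point is the separation of the two phases: the model is fitted only on training tuples, and the accuracy estimate comes only from tuples the model has not seen.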
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction. Preparing the
data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing values.
Noise is removed by applying smoothing techniques, and the problem of missing values
is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis
is used to determine whether any two given attributes are related.
• Data Transformation and reduction − The data can be transformed by any of the following
methods.
o Normalization − Normalization involves scaling all values of a given attribute so
that they fall within a small specified range. It is used when the learning step
employs neural networks or methods involving distance measurements.
o Generalization − The data can also be transformed by generalizing it to the higher
concept. For this purpose we can use the concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet
transformation, binning, histogram analysis, and clustering.
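Min-max normalization, mentioned above, rescales an attribute into a fixed range in two lines. The income values below are invented:

```python
# Min-max normalization: rescale an attribute's values into the range [0.0, 1.0].
values = [12_000, 35_000, 58_000, 98_000]   # hypothetical income attribute
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
# The smallest value maps to 0.0, the largest to 1.0, and the rest fall between.
```

After scaling, attributes measured in very different units (e.g. dollars and years) contribute comparably to distance-based methods.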

Comparison of Classification and Prediction Methods


Here is the criteria for comparing the methods of Classification and Prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label
correctly; accuracy of a predictor refers to how well a given predictor can guess the
value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or
predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct predictions from
given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor
efficiently given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor can be understood.

Classification by Decision Tree Induction


A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.
Consider, for example, a decision tree for the concept buy_computer, which indicates whether a
customer at a company is likely to buy a computer. Each internal node represents
a test on an attribute, and each leaf node represents a class.

The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm


In 1980, the machine learning researcher J. Ross Quinlan developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser). Later he presented C4.5, the
successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and
the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D
Algorithm: Generate_decision_tree

Input:
    Data partition D, a set of training tuples and their
    associated class labels.
    attribute_list, the set of candidate attributes.
    Attribute_selection_method, a procedure to determine the
    splitting criterion that best partitions the data tuples
    into individual classes. This criterion includes a
    splitting_attribute and either a split point or a
    splitting subset.

Output:
    A decision tree.

Method:
    create a node N;
    if tuples in D are all of the same class C then
        return N as a leaf node labeled with class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the
        majority class in D;                        // majority voting
    apply Attribute_selection_method(D, attribute_list)
        to find the best splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and
        multiway splits are allowed then            // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;  // remove splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;  // a partition
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by
            Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
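The attribute selection step in the algorithm above is commonly implemented with information gain, the measure behind ID3. A small sketch with invented class labels and an invented split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Toy class labels for partition D, and the sub-partitions produced by
# splitting D on a hypothetical discrete attribute.
labels = ["yes", "yes", "yes", "no", "no", "no"]
partitions = [["yes", "yes", "no"], ["yes", "no", "no"]]

# Information gain: entropy before the split minus the weighted
# average entropy of the partitions after the split.
gain = entropy(labels) - sum(len(p) / len(labels) * entropy(p)
                             for p in partitions)
```

The attribute with the highest gain becomes the splitting_attribute at the current node; C4.5 refines this with the gain ratio to penalize attributes with many values.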

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity
The cost complexity is measured by the following two parameters −

• Number of leaves in the tree, and


• Error rate of the tree.
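These two parameters can be combined into a single pruning criterion: penalize each candidate tree's error rate by a charge per leaf, then keep the cheapest tree. The candidate trees and the alpha penalty below are invented for illustration:

```python
# Hypothetical pruning candidates: name -> (number of leaves, training-error rate).
candidates = {"full": (8, 0.02), "pruned_a": (5, 0.05), "pruned_b": (3, 0.10)}

def cost_complexity(leaves, error, alpha):
    # Penalized cost: error rate plus a per-leaf complexity charge.
    return error + alpha * leaves

alpha = 0.02  # strength of the complexity penalty
best = min(candidates, key=lambda name: cost_complexity(*candidates[name], alpha))
# With this alpha, the moderately pruned tree wins the size/accuracy trade-off.
```

A larger alpha favors smaller trees, a smaller alpha favors accuracy; libraries such as scikit-learn expose the same idea via a `ccp_alpha` parameter.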

KNN Algorithm
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does
not make any underlying assumptions about the distribution of the data (as opposed to other
algorithms, such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinate
points into groups identified by an attribute; for example, points labelled 'Red' or
'Green'. Now, given another set of data points (also called testing data), we must
allocate each of these points a group by analyzing the training set. The unclassified
points are initially unlabelled (marked as 'White').

Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups.
Now, given an unclassified point, we can assign it to a group by observing what group its
nearest neighbors belong to. This means a point close to a cluster of points classified as
‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the
second point (5.5, 4.5) should be classified as ‘Red’.
Algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[]. This means each element of this
array represents a tuple (x, y).
2. for i=0 to m:
3. Calculate Euclidean distance d(arr[i], p).
4. Make set S of K smallest distances obtained. Each of these distances corresponds to an
already classified data point.
5. Return the majority label among S.
K can be kept as an odd number so that a clear majority can be computed in the case
where only two groups are possible (e.g. Red/Blue). With increasing K, we get smoother
decision boundaries across different classifications. The accuracy of the above
classifier also increases as we increase the number of data points in the training set.
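The algorithm above translates directly into code. The training points below are invented so that the two test points from the intuition section, (2.5, 7) and (5.5, 4.5), classify as described:

```python
from collections import Counter
from math import dist

def knn_classify(train, p, k=3):
    # Label point p by majority vote among its k nearest training points.
    # train is a list of ((x, y), label) pairs; dist() is Euclidean distance.
    neighbors = sorted(train, key=lambda t: dist(t[0], p))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical training set with two groups.
train = [((2, 7), "Green"), ((3, 8), "Green"), ((2, 6), "Green"),
         ((5, 4), "Red"), ((6, 5), "Red"), ((6, 3), "Red")]

print(knn_classify(train, (2.5, 7)))    # -> Green
print(knn_classify(train, (5.5, 4.5)))  # -> Red
```

Sorting all m distances costs O(m log m) per query; production implementations use spatial indexes such as k-d trees or ball trees to find the K nearest neighbors faster.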
