Dmbi Unit-3
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, subsampling, and transformations
of that database.
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the
task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form, or a
set of such representations, such as classification rules or trees, regression
models, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
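The steps above can be sketched as a minimal pipeline in Python. The data, the outlier threshold, and the "pattern" extracted are all invented for illustration; a real KDD process would use proper preprocessing and mining algorithms.

```python
# A minimal, illustrative sketch of the KDD steps above (made-up data).
raw = [("2024-01", 120.0), ("2024-02", None), ("2024-03", 9000.0),
       ("2024-04", 130.0), ("2024-05", 125.0)]

# Step 3: cleaning and preprocessing -- drop missing fields and obvious outliers.
cleaned = [(m, v) for m, v in raw if v is not None and v < 1000]

# Step 4: reduction and projection -- keep only the numeric feature.
values = [v for _, v in cleaned]

# Step 7: "mining" -- here, the trivial pattern is just the mean level.
pattern = sum(values) / len(values)

# Step 8: interpretation -- compare each record against the pattern.
deviations = [round(v - pattern, 2) for v in values]
print(pattern, deviations)
```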
KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the decision
of what qualifies as knowledge. It also includes the choice of encoding schemes,
preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data
without the additional steps of the KDD process.
In addition to segmentation and targeting, data mining is also popularly used for budgeting
marketing spend, so that the budget can be allocated optimally across marketing drivers. The
analysis is carried out on the previous year's spend and its impact on sales. With the spend
information for each driver, such as Print, TV, Radio, and Online, one can determine the ROI
for each driver, which uncovers the impact of these channels on sales. Based on this analysis,
the marketing manager can allocate media spend in the coming year to achieve the most
effective results on sales.
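The ROI computation described above can be sketched as follows. The spend and incremental-sales figures are invented, and allocating next year's budget in proportion to ROI is just one naive policy, not a recommended attribution model.

```python
# Hedged sketch: ROI per marketing driver from last year's spend and the
# incremental sales attributed to each channel (all figures invented).
spend = {"Print": 50_000, "TV": 200_000, "Radio": 30_000, "Online": 80_000}
incremental_sales = {"Print": 60_000, "TV": 340_000,
                     "Radio": 45_000, "Online": 200_000}

roi = {ch: (incremental_sales[ch] - spend[ch]) / spend[ch] for ch in spend}

# Allocate next year's budget in proportion to ROI (one naive policy).
total_roi = sum(roi.values())
budget = 400_000
allocation = {ch: round(budget * roi[ch] / total_roi) for ch in roi}
print(roi)
print(allocation)
```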
Businesses use data mining to draw conclusions and solve specific problems. One
of the key benefits of data mining is that it is fundamentally applicable to any
process and helps improve the flexibility and efficiency of operations. Thus,
data use in manufacturing facilitates schedule adherence, monitoring automation,
modeling for capacity, and reduction of waste. The departments are completely
transformed and factories become smarter by achieving full data transparency.
Drastic changes have impacted the vehicle manufacturing industry too. In this sector,
the products are relatively expensive, with high-end manufacturers focusing on
service and product quality. They note that the business benefits related to the
introduction of data-driven innovations have all the chances to speed up
identification and resolution of quality problems, as well as cut warranty spending,
which amounts to between 2 % and 6 % of total sales in the automobile industry. For the
customers and users of these vehicles and machines, early identification and
preventive maintenance often results in greater uptime. For instance, in one case
involving an automotive company, 28,000 vehicles were saved from recall by the
identification of a problem before vehicles hit the market.
Data mining tools can be very beneficial for discovering interesting and useful
patterns in complicated manufacturing quality improvement processes. These
patterns can be used to improve manufacturing quality. However, data
accumulated in manufacturing plants have unique characteristics, such as
unbalanced distribution of the target attribute, and a small training set relative to
the number of input features. Still, business process improvement has to
start somewhere. Using an approach that incorporates big data, analytics, and
business intelligence is simply the most reliable, proven way to make
improvements that last. Once you know what to measure, track it, analyse it, and
improve it, you’ll have the right foundations in place to enhance processes
throughout your business. Time and product waste will become things of the past.
Knowledge-based Marketing
• It is marketing which makes use of the macro- and micro-environmental knowledge that is
available to the marketing functional unit in an organization.
• There are three major areas of application of data mining for knowledge-based marketing:
customer profiling, deviation analysis, and trend analysis.
• Customer profiling systems can analyse the frequency of purchases, so companies can learn how
often customers buy a product or visit the store.
• Deviation analysis gives the marketer the capability to query changes that occurred as a
result of recent price changes or promotions.
• Trend analysis can determine trends in sales, costs, and profits by product or market in order
to achieve the highest sales.
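The purchase-frequency profiling mentioned above can be illustrated with a few lines of Python. The transaction log is invented, and the "frequent customer" threshold of two purchases is an arbitrary assumption.

```python
# Sketch of customer profiling by purchase frequency (illustrative data).
from collections import Counter

transactions = ["alice", "bob", "alice", "carol", "alice", "bob"]
frequency = Counter(transactions)

# Customers purchasing at least twice (threshold chosen arbitrarily):
frequent = {c for c, n in frequency.items() if n >= 2}
print(frequency, frequent)
```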
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
• Classification
• Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants)
are risky and which are safe.
• A marketing manager at a company needs to analyze customers with a given profile to determine
who will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.
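A classifier for the loan example could look like the toy function below. The income and debt-ratio thresholds are invented for illustration; a real classifier would learn such rules from training data rather than hard-code them.

```python
# Toy classifier predicting the categorical label "safe" or "risky"
# from two made-up attributes (thresholds are assumptions, not learned).
def classify_loan(income, existing_debt):
    """Return 'safe' or 'risky' for a loan applicant."""
    if income >= 50_000 and existing_debt / income < 0.4:
        return "safe"
    return "risky"

print(classify_loan(80_000, 10_000))  # -> safe
print(classify_loan(30_000, 20_000))  # -> risky
```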
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example, we are asked to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a
model or a predictor will be constructed that predicts a continuous-valued function or
ordered value.
Note − Regression analysis is a statistical methodology that is most often used for
numeric prediction.
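The regression-based numeric prediction noted above can be sketched with an ordinary least-squares line fit in plain Python. The (income, spend) pairs are invented and deliberately lie on a straight line so the fit is exact.

```python
# Minimal numeric prediction: ordinary least-squares line fit (invented data).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

incomes = [30, 40, 50, 60]      # income in thousands of dollars
spends  = [300, 400, 500, 600]  # spend during the sale

slope, intercept = fit_line(incomes, spends)

def predict(x):
    return slope * x + intercept

print(predict(45))  # predicted spend for a $45k income -> 450.0
```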
Input:
Data partition, D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion consists of a
splitting_attribute and, possibly, either a split-point or a
splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class, C, then
return N as a leaf node labeled with the class C;
if attribute_list is empty then
return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find
the best splitting_criterion, and label node N with it;
if splitting_attribute is discrete-valued and multiway
splits are allowed then
attribute_list = attribute_list - splitting_attribute;
for each outcome j of splitting_criterion:
let Dj be the set of data tuples in D satisfying outcome j;
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate_decision_tree(Dj,
attribute_list) to node N;
end for
return N;
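A compact Python sketch of this tree-generation procedure is given below, using information gain as the attribute-selection method. The tiny weather-style dataset is invented, and the implementation handles only discrete-valued attributes.

```python
# Minimal recursive decision-tree induction (ID3-style), illustrative only.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Attribute_selection_method: pick the attribute with highest info gain."""
    def gain(a):
        total = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total -= len(sub) / len(labels) * entropy(sub)
        return total
    return max(attrs, key=gain)

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:            # all tuples of one class -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)
    node = {a: {}}
    for v in set(r[a] for r in rows):    # one branch per outcome of the split
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node[a][v] = build_tree([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [x for x in attrs if x != a])
    return node

rows = [{"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": False}]
labels = ["no", "yes", "yes"]
print(build_tree(rows, labels, ["outlook", "windy"]))
```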
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
KNN Algorithm
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does
not make any underlying assumptions about the distribution of the data (as opposed to
other algorithms such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute.
As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points a
group by analyzing the training set. Note that the unclassified points are marked as
‘White’.
Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups.
Now, given an unclassified point, we can assign it to a group by observing what group its
nearest neighbors belong to. This means a point close to a cluster of points classified as
‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the
second point (5.5, 4.5) should be classified as ‘Red’.
Algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[]. This means each element of this
array represents a tuple (x, y).
2. for i=0 to m:
3. Calculate Euclidean distance d(arr[i], p).
4. Make set S of K smallest distances obtained. Each of these distances corresponds to an
already classified data point.
5. Return the majority label among S.
K can be kept as an odd number so that we can calculate a clear majority in the case
where only two groups are possible (e.g. Red/Blue). With increasing K, we get smoother,
more defined boundaries across different classifications. Also, the accuracy of the above
classifier increases as we increase the number of data points in the training set.
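The KNN steps above can be implemented in a few lines of Python. The training points below are invented to mimic the Green/Red clusters from the text (they are not the missing table's actual values), and k = 3 is an arbitrary odd choice.

```python
# The KNN steps above as a small function (illustrative training data).
import math
from collections import Counter

def knn_classify(train, p, k=3):
    """train: list of ((x, y), label); p: unknown point (x, y)."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], p))  # Euclidean
    k_labels = [label for _, label in nearest[:k]]
    return Counter(k_labels).most_common(1)[0][0]              # majority vote

train = [((1, 6), "Green"), ((2, 8), "Green"), ((3, 7), "Green"),
         ((5, 3), "Red"),   ((6, 4), "Red"),   ((7, 5), "Red")]

print(knn_classify(train, (2.5, 7)))    # -> Green
print(knn_classify(train, (5.5, 4.5)))  # -> Red
```

Keeping k odd, as the text notes, avoids ties when only two class labels are possible.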