Data Mining Lab Manual
DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian
(but looks and acts like a quarter).
Owns_telephone. German phone rates are much higher than in Canada so fewer
people own telephones.
There are 20 attributes used in judging a loan applicant. The goal is to classify the
applicant into one of two categories, good or bad.
EXPERIMENT-1
Aim: To list all the categorical (or nominal) attributes and the real-valued attributes using the Weka
mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to Open file and browse to bank.csv, which is already stored in the system.
5) Clicking on any attribute in the left panel will show the basic statistics on that selected
attribute.
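The same inspection can also be scripted with the Weka Java API; a minimal sketch, assuming bank.csv is in the working directory:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Minimal sketch of Experiment 1 from the Weka Java API: load bank.csv and list
// which attributes are nominal (categorical) and which are numeric (real-valued).
public class ListAttributeTypes {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));
        Instances data = loader.getDataSet();

        for (int i = 0; i < data.numAttributes(); i++) {
            String kind = data.attribute(i).isNominal() ? "nominal"
                        : data.attribute(i).isNumeric() ? "numeric" : "other";
            System.out.println(data.attribute(i).name() + " : " + kind);
        }
    }
}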
Sample output:
EXPERIMENT-2
Aim: To identify rules involving some of the important attributes (a) manually and (b) using
Weka.
Tools/Apparatus: Weka mining tool.
Theory:
Association rule mining is defined as follows: Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D be a set
of transactions called the database. Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined as an implication of the form X => Y, where
X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule, respectively.
To illustrate the concepts, we use a small example from the supermarket domain.
The set of items is I = {milk, bread, butter, beer}, and a small database of five transactions is used, where 1
codes the presence and 0 the absence of an item in a transaction. An
example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers
also buy butter.
Note: this example is extremely small. In practical applications, a rule needs a support of several
hundred transactions before it can be considered statistically significant, and datasets often
contain thousands or millions of transactions.
To select interesting rules from the set of all possible rules, constraints on various measures of
significance and interest can be used. The best-known constraints are minimum thresholds on
support and confidence. The support supp(X) of an itemset X is defined as the proportion of
transactions in the data set which contain the itemset. In the example database, the itemset
{milk,bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5
transactions).
The confidence of a rule X => Y is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in
the database, which means that for 50% of the transactions containing milk and bread the rule is
correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability
of finding the RHS of the rule in transactions under the condition that these transactions also
contain the LHS.
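The support and confidence arithmetic above can be checked with a few lines of code. The five transactions below are illustrative assumptions chosen to reproduce supp({milk, bread}) = 0.4 and conf({milk, bread} => {butter}) = 0.5 from the text; they are not taken from the bank data.

import java.util.*;

// Minimal sketch: computing support and confidence for {milk, bread} => {butter}
// over a tiny, made-up five-transaction database.
public class SupportConfidence {
    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            new HashSet<>(Arrays.asList("milk", "bread")),
            new HashSet<>(Arrays.asList("butter")),
            new HashSet<>(Arrays.asList("beer")),
            new HashSet<>(Arrays.asList("milk", "bread", "butter")),
            new HashSet<>(Arrays.asList("bread")));

        Set<String> lhs = new HashSet<>(Arrays.asList("milk", "bread"));
        Set<String> both = new HashSet<>(lhs);
        both.add("butter");

        double suppLhs = support(db, lhs);      // 2/5 = 0.4
        double suppBoth = support(db, both);    // 1/5 = 0.2
        System.out.println("supp(X)      = " + suppLhs);
        System.out.println("supp(X u Y)  = " + suppBoth);
        System.out.println("conf(X => Y) = " + suppBoth / suppLhs);  // 0.5
    }

    // Proportion of transactions that contain every item of the itemset.
    static double support(List<Set<String>> db, Set<String> itemset) {
        long hits = db.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / db.size();
    }
}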
ALGORITHM:
Apriori Pseudocode
Apriori(T, ε)
    L1 <- { large 1-itemsets that appear in more than ε transactions }
    k <- 2
    while L(k-1) is not empty
        C(k) <- Generate(L(k-1))
        for each transaction t in T
            C(t) <- Subset(C(k), t)
            for each candidate c in C(t)
                count[c] <- count[c] + 1
        L(k) <- { c in C(k) | count[c] >= ε }
        k <- k + 1
    return the union over k of L(k)
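In Weka, the Apriori learner is available as weka.associations.Apriori. A minimal sketch of running it from the Java API, assuming a purely nominal dataset such as supermarket.arff (a sample file shipped with Weka) is in the working directory:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: mining association rules with Weka's Apriori implementation.
public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("supermarket.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);     // report the 10 best rules
        apriori.setMinMetric(0.9);   // minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the discovered rules
    }
}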
EXPERIMENT-3
Aim: To create a decision tree by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
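The procedure mirrors the loading steps of Experiment 1, followed by choosing a tree learner under the Classify tab. A minimal Java sketch of such a run, assuming bank.arff (an ARFF version of bank.csv with the class attribute last) is in the working directory:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 3: train a J48 decision tree on the bank data.
public class BuildTree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        J48 tree = new J48();        // C4.5-style decision tree learner
        tree.buildClassifier(data);

        System.out.println(tree);    // textual form of the induced tree
    }
}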
EXPERIMENT-4
Aim: To find the percentage of examples that are classified correctly by the decision tree model
created above, i.e., testing on the training set.
Tools/Apparatus: Weka mining tool.
Theory:
A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 4" in diameter. Even though these features
depend on the existence of the other features, a naive Bayes classifier considers all of these
properties to contribute independently to the probability that this fruit is an apple.
An advantage of the naive Bayes classifier is that it requires only a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because the variables are assumed independent, only the variances of the variables for each class
need to be determined, and not the entire covariance matrix.
The naive Bayes probabilistic model:
The probability model for a classifier is a conditional model
P(C | F1, ..., Fn) over a dependent class variable C with a small number of outcomes or
classes, conditional on several feature variables F1 through Fn. The problem is that if the
number of features n is large, or when a feature can take on a large number of values, then basing
such a model on probability tables is infeasible. We therefore reformulate the model to make it
more tractable.
Using Bayes' theorem, we write
P(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)
Now the "naive" conditional independence assumptions come into play: assume that each feature
Fi is conditionally independent of every other feature Fj for j ≠ i.
This means that p(Fi | C, Fj) = p(Fi | C),
and so the joint model can be expressed as
p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... p(Fn | C) = p(C) Π p(Fi | C)
This means that under the above independence assumptions, the conditional distribution over the
class variable C can be expressed like this:
p(C | F1, ..., Fn) = (1/Z) p(C) Π p(Fi | C)
where Z is a scaling factor depending only on F1, ..., Fn, i.e., a constant if the values of the
feature variables are known.
Models of this form are much more manageable, since they factor into a so-called class prior
p(C) and independent probability distributions p(Fi | C). If there are k classes and if a model for
each p(Fi | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes
model has (k − 1) + n r k parameters. In practice, k = 2 (binary classification) and r = 1
(Bernoulli variables as features) are common, and so the total number of parameters of the naive
Bayes model is 2n + 1, where n is the number of binary features used for prediction.
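To make the factorization and the scaling factor Z concrete, here is a tiny hand-worked sketch for two classes and two binary features; all probabilities in it are illustrative assumptions, not values estimated from the bank data.

// Minimal numeric sketch of the naive Bayes factorization above, for two classes
// and two binary features. All probabilities are made-up, illustrative values.
public class NaiveBayesHandWorked {
    public static void main(String[] args) {
        double pYes = 0.5, pNo = 0.5;               // class priors p(C)
        double pF1GivenYes = 0.8, pF1GivenNo = 0.3; // p(F1 = true | C)
        double pF2GivenYes = 0.6, pF2GivenNo = 0.4; // p(F2 = true | C)

        // Unnormalised scores p(C) * prod_i p(Fi | C) for an instance with F1 = F2 = true.
        double scoreYes = pYes * pF1GivenYes * pF2GivenYes;   // 0.24
        double scoreNo  = pNo  * pF1GivenNo  * pF2GivenNo;    // 0.06
        double z = scoreYes + scoreNo;                        // the scaling factor Z

        System.out.println("p(YES | F1, F2) = " + scoreYes / z); // 0.8
        System.out.println("p(NO  | F1, F2) = " + scoreNo  / z); // 0.2
    }
}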
Bayes' theorem: P(h | D) = P(D | h) P(h) / P(D), where
P(h): prior probability of hypothesis h
P(D): prior probability of training data D
P(h | D): probability of h given D
P(D | h): probability of D given h
Naive Bayes Classifier: Derivation
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Go to the Classify tab.
7) Choose Classifier -> trees.
8) Select NBTree, i.e., the naive Bayesian tree, and run it with the "Use training set" test option (a Java sketch of the same run is shown below).
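A minimal Java sketch of the same run; it assumes bank.arff is in the working directory and that weka.classifiers.trees.NBTree is available (in newer Weka releases it may have to be installed as a package, or weka.classifiers.bayes.NaiveBayes can be substituted):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.NBTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 4: evaluate the classifier on the training set
// itself and report the percentage of correctly classified examples.
public class TrainingSetAccuracy {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        NBTree cls = new NBTree();
        cls.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(cls, data);               // test on the training data
        System.out.println(eval.toSummaryString());  // correctly classified %, kappa, errors
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}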
Sample output:
=== Evaluation on training set ===
Correctly Classified Instances      554      92.3333 %
Incorrectly Classified Instances     46       7.6667 %
Kappa statistic                       0.845
Mean absolute error                   0.1389
Root mean squared error               0.2636
Relative absolute error              27.9979 %
Root relative squared error          52.9137 %
Total Number of Instances           600

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.894    0.052    0.935      0.894   0.914      0.936     YES
               0.948    0.106    0.914      0.948   0.931      0.936     NO
Weighted Avg.  0.923    0.081    0.924      0.923   0.923      0.936

=== Confusion Matrix ===
   a   b   <-- classified as
 245  29 |   a = YES
  17 309 |   b = NO
EXPERIMENT-5
EXPERIMENT-6
Aim: To create a decision tree by cross-validation on the training data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Theory:
Decision tree learning, used in data mining and machine learning, uses a decision tree as a
predictive model which maps observations about an item to conclusions about the item's target
value. In these tree structures, leaves represent classifications and branches represent
conjunctions of features that lead to those classifications. In decision analysis, a decision tree can
be used to visually and explicitly represent decisions and decision making. In data mining, a
decision tree describes data rather than decisions; the resulting classification tree can then be used
as an input for decision making.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Go to the Classify tab.
7) Choose Classifier -> trees.
8) Select J48.
9) Under Test options, select Cross-validation (a Java sketch of the same run is shown below).
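A minimal Java sketch of the same cross-validation run, assuming bank.arff is in the working directory with the class attribute last:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 6: 10-fold cross-validation of a J48 decision tree.
public class CrossValidateTree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // 10 folds

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}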
Sample output:
=== Stratified cross-validation ===
Correctly Classified Instances      539      89.8333 %
Incorrectly Classified Instances     61      10.1667 %
Kappa statistic                       0.7942
Mean absolute error                   0.167
Root mean squared error               0.305
Relative absolute error              33.6511 %
Root relative squared error          61.2344 %
Total Number of Instances           600

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.861    0.071    0.911      0.861   0.886      0.883     YES
               0.929    0.139    0.889      0.929   0.909      0.883     NO
Weighted Avg.  0.898    0.108    0.899      0.898   0.898      0.883

=== Confusion Matrix ===
   a   b   <-- classified as
 236  38 |   a = YES
  23 303 |   b = NO
EXPERIMENT-7
Aim: To delete one attribute in the GUI Explorer and observe the effect, using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with the list
of available filters.
7) Select weka.filters.unsupervised.attribute.Remove.
8) Next, click on the text box immediately to the right of the "Choose" button.
9) In the resulting dialog box enter the index of the attribute to be filtered out (make sure that the
"invertSelection" option is set to false).
10) Then click "OK". Now, in the filter box you will see "Remove -R 1".
11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute
and create a new working relation.
12) To save the new working relation as an ARFF file, click on the Save button in the top panel.
13) Go to Open file and browse to the newly saved file (the one with the attribute deleted).
14) Go to the Classify tab.
15) Choose Classifier -> trees.
16) Select J48.
17) Under Test options, select "Use training set".
18) If needed, select the class attribute from the drop-down list above the Start button (a Java sketch of the same filtering and classification is shown below).
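A minimal Java sketch of the same filtering and classification, assuming bank.arff is in the working directory with the class attribute last:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch of Experiment 7: remove the first attribute (e.g. "id") with the
// Remove filter, then train and evaluate J48 on the training set.
public class RemoveAttributeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("1");     // same as "Remove -R 1" in the GUI
        remove.setInvertSelection(false);
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);
        filtered.setClassIndex(filtered.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(filtered);

        Evaluation eval = new Evaluation(filtered);
        eval.evaluateModel(tree, filtered);
        System.out.println(eval.toSummaryString());
    }
}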
Sample output:
EXPERIMENT-8
Aim: To select some attributes in the GUI Explorer, perform classification, and observe the effect,
using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Select some of the attributes from the attributes list which are to be removed, and remove them. After this step only
the attributes necessary for classification are left in the attributes panel.
7) Then go to the Classify tab.
8) Choose Classifier -> trees.
9) Select J48.
10) Under Test options, select "Use training set".
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output panel.
14) Right-click on the entry in the result list and select the "Visualize tree" option.
15) Compare the output results with those of the 4th experiment.
16) Check whether the accuracy has increased or decreased (a Java sketch that keeps only a chosen subset of attributes is shown below).
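A minimal Java sketch that keeps only a chosen subset of attributes before classifying; the attribute indices used are placeholders, and bank.arff is assumed to be in the working directory with the class attribute last:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch of Experiment 8: keep only a chosen subset of attributes and
// classify with J48. The indices "1-3,last" are a placeholder choice.
public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();

        Remove keep = new Remove();
        keep.setAttributeIndices("1-3,last");  // attributes to KEEP (placeholder choice)
        keep.setInvertSelection(true);         // invert: remove everything else
        keep.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, keep);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(reduced);

        Evaluation eval = new Evaluation(reduced);
        eval.evaluateModel(tree, reduced);
        System.out.println(eval.pctCorrect() + " % correct on the training set");
    }
}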
EXPERIMENT-9
Aim: To create a decision tree with cross-validation on the training data set while changing the cost matrix
in the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Go to the Classify tab.
7) Choose Classifier -> trees.
8) Select J48.
9) Under Test options, select "Use training set".
10) Click on "More options...".
11) Select "Cost-sensitive evaluation" and click on the Set button.
12) Set the matrix values and click on Resize. Then close the window.
13) Click OK.
14) Click Start.
15) We can see the output details in the Classifier output panel.
16) Under Test options, select Cross-validation.
17) Set Folds, e.g., 10 (a Java sketch of a cost-sensitive cross-validation run is shown below).
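A minimal Java sketch of a cost-sensitive cross-validation run. The 2x2 cost values are illustrative assumptions, and the CostMatrix.setCell(...) setter is the one used in recent Weka releases (older versions may use a different setter name):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 9: cross-validate J48 under a cost matrix so that
// the two kinds of misclassification are penalised differently.
public class CostSensitiveEvaluationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        CostMatrix costs = new CostMatrix(2);   // 2 classes: YES / NO
        costs.setCell(0, 1, 5.0);               // cost of misclassifying class 0 as class 1 (illustrative)
        costs.setCell(1, 0, 1.0);               // cost of misclassifying class 1 as class 0 (illustrative)

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}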
Sample output:
EXPERIMENT-10
Aim: To check whether a short rule or a long rule is better, i.e., to examine the bias, by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
This depends on the attribute set and on the relationships among the attributes that we
want to study; it should be decided based on the database and the user's requirements.
EXPERIMENT-11
Aim: To create a decision tree using the prune mode and reduced-error pruning, and show the
accuracy on a cross-validated data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Theory:
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier -> trees.
9) Select NBTree, i.e., the naive Bayesian tree.
10) Under Test options, select "Use training set" (a Java sketch using reduced-error pruning with cross-validation, as in the aim, is shown below).
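A minimal Java sketch for this experiment. Reduced-error pruning is an option of the J48 learner, so J48 is used here rather than the NBTree named in the GUI steps; bank.arff is assumed to be in the working directory with the class attribute last:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 11: J48 with reduced-error pruning, evaluated by
// 10-fold cross-validation.
public class ReducedErrorPruningDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();
        pruned.setReducedErrorPruning(true);   // prune using a held-out portion of the data

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(pruned, data, 10, new Random(1));
        System.out.println("Reduced-error pruned J48: " + eval.pctCorrect() + " % correct");
    }
}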
Sample output:
EXPERIMENT-12
Aim: To compare the OneR classifier, which uses a single attribute and rule, with the J48 and PART
classifiers, by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab and run OneR, J48 and PART in turn with the same test option (a Java sketch comparing the three is shown below).
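A minimal Java sketch comparing OneR, J48 and PART under the same 10-fold cross-validation, assuming bank.arff is in the working directory with the class attribute last:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 12: compare OneR (a single-attribute rule learner)
// with J48 and PART on the same data.
public class CompareRuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] learners = { new OneR(), new J48(), new PART() };
        for (Classifier c : learners) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(c.getClass().getSimpleName() + ": "
                + String.format("%.2f", eval.pctCorrect()) + " % correct");
        }
    }
}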
Sample output:
J48
java weka.classifiers.trees.J48 -t c:/temp/bank.arff
OneR
PART
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: labor-neg-data
Instances: 57
Attributes: 17
  duration
  wage-increase-first-year
  wage-increase-second-year
  wage-increase-third-year
  cost-of-living-adjustment
  working-hours
  pension
  standby-pay
  shift-differential
  education-allowance
  statutory-holidays
  vacation
  longterm-disability-assistance
  contribution-to-dental-plan
  bereavement-assistance
  contribution-to-health-plan
  class
Test mode: evaluate on training data

Clustered Instances
  0    36 ( 63%)
  1     5 (  9%)
  2    16 ( 28%)
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: labor-neg-data
Instances: 57
Attributes: 17
  duration
  wage-increase-first-year
  wage-increase-second-year
  wage-increase-third-year
  cost-of-living-adjustment
  working-hours
  pension
  standby-pay
  shift-differential
  education-allowance
  statutory-holidays
  vacation
  longterm-disability-assistance
  contribution-to-dental-plan
  bereavement-assistance
  class
Ignored:
  contribution-to-health-plan
Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 122.05464734126849

Initial starting points (random):
Cluster 0: 1,5.7,3.971739,3.913333,none,40,empl_contr,7.444444,4,no,11,generous,yes,full,yes,good
Cluster 1: 1,2,3.971739,3.913333,tc,40,ret_allw,4,0,no,11,generous,no,none,no,bad

Missing values globally replaced with mean/mode

Final cluster centroids (first attributes shown):
                                  Cluster#
Attribute                     Full Data        0        1
                                 (57.0)   (43.0)   (14.0)
==========================================================
duration                         2.1607    2.213        2
wage-increase-first-year         3.8036   4.2024   2.5786
wage-increase-second-year        3.9717    4.221   3.2062
wage-increase-third-year         3.9133   4.0329   3.5462
cost-of-living-adjustment          none     none     none

Clustered Instances
  0    43 ( 75%)
  1    14 ( 25%)

Incorrectly clustered instances: 31.0   54.386 %
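The clustering run above can be reproduced from the Weka Java API; a minimal sketch, assuming labor.arff (the labor-neg-data sample shipped with Weka) is in the working directory:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch of the SimpleKMeans run above: the class attribute is removed
// before clustering, as the Explorer does for ignored attributes.
public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("labor.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("last");   // drop the class attribute
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);                 // -N 2
        km.setSeed(10);                       // -S 10
        km.buildClusterer(noClass);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(noClass);
        System.out.println(km);                           // centroids, iterations, SSE
        System.out.println(eval.clusterResultsToString()); // clustered instances per cluster
    }
}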