Data Mining Lab Manual
DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian
(but looks and acts like a quarter).
Owns_telephone. German phone rates are much higher than in Canada so fewer
people own telephones.
There are 20 attributes used in judging a loan applicant. The goal is to classify the
applicant into one of two categories, good or bad.
EXPERIMENT-1
Aim: To list all the categorical (or nominal) attributes and the real-valued attributes using the Weka
mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to Open file and browse to bank.csv, which is already stored in the system.
5) Clicking on any attribute in the left panel will show the basic statistics on that selected
attribute.
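The same inspection can also be scripted with the Weka Java API; a minimal sketch, assuming bank.csv is in the working directory:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Minimal sketch of Experiment 1 from the Weka Java API: load bank.csv and list
// which attributes are nominal (categorical) and which are numeric (real-valued).
public class ListAttributeTypes {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));
        Instances data = loader.getDataSet();

        for (int i = 0; i < data.numAttributes(); i++) {
            String kind = data.attribute(i).isNominal() ? "nominal"
                        : data.attribute(i).isNumeric() ? "numeric" : "other";
            System.out.println(data.attribute(i).name() + " : " + kind);
        }
    }
}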
Sample output:
EXPERIMENT-2
Aim: To identify rules involving some of the important attributes (a) manually and (b) using
Weka.
Tools/Apparatus: Weka mining tool.
Theory:
Association rule mining is defined as follows: Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D be a set
of transactions called the database. Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined as an implication of the form X => Y, where
X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule, respectively.
To illustrate the concepts, we use a small example from the supermarket domain.
The set of items is I = {milk, bread, butter, beer}, and a small database of five transactions is used, where 1
codes the presence and 0 the absence of an item in a transaction. An
example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers
also buy butter.
Note: this example is extremely small. In practical applications, a rule needs a support of several
hundred transactions before it can be considered statistically significant, and datasets often
contain thousands or millions of transactions.
To select interesting rules from the set of all possible rules, constraints on various measures of
significance and interest can be used. The best-known constraints are minimum thresholds on
support and confidence. The support supp(X) of an itemset X is defined as the proportion of
transactions in the data set which contain the itemset. In the example database, the itemset
{milk,bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5
transactions).
The confidence of a rule X => Y is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in
the database, which means that for 50% of the transactions containing milk and bread the rule is
correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability
of finding the RHS of the rule in transactions under the condition that these transactions also
contain the LHS.
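The support and confidence arithmetic above can be checked with a few lines of code. The five transactions below are illustrative assumptions chosen to reproduce supp({milk, bread}) = 0.4 and conf({milk, bread} => {butter}) = 0.5 from the text; they are not taken from the bank data.

import java.util.*;

// Minimal sketch: computing support and confidence for {milk, bread} => {butter}
// over a tiny, made-up five-transaction database.
public class SupportConfidence {
    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            new HashSet<>(Arrays.asList("milk", "bread")),
            new HashSet<>(Arrays.asList("butter")),
            new HashSet<>(Arrays.asList("beer")),
            new HashSet<>(Arrays.asList("milk", "bread", "butter")),
            new HashSet<>(Arrays.asList("bread")));

        Set<String> lhs = new HashSet<>(Arrays.asList("milk", "bread"));
        Set<String> both = new HashSet<>(lhs);
        both.add("butter");

        double suppLhs = support(db, lhs);      // 2/5 = 0.4
        double suppBoth = support(db, both);    // 1/5 = 0.2
        System.out.println("supp(X)      = " + suppLhs);
        System.out.println("supp(X u Y)  = " + suppBoth);
        System.out.println("conf(X => Y) = " + suppBoth / suppLhs);  // 0.5
    }

    // Proportion of transactions that contain every item of the itemset.
    static double support(List<Set<String>> db, Set<String> itemset) {
        long hits = db.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / db.size();
    }
}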
ALGORITHM:
Apriori Pseudocode
Apriori(T, ε)
    L1 <- { large 1-itemsets that appear in more than ε transactions }
    k <- 2
    while L(k-1) is not empty
        C(k) <- Generate(L(k-1))
        for each transaction t in T
            C(t) <- Subset(C(k), t)
            for each candidate c in C(t)
                count[c] <- count[c] + 1
        L(k) <- { c in C(k) | count[c] >= ε }
        k <- k + 1
    return the union over k of L(k)
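In Weka, the Apriori learner is available as weka.associations.Apriori. A minimal sketch of running it from the Java API, assuming a purely nominal dataset such as supermarket.arff (a sample file shipped with Weka) is in the working directory:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: mining association rules with Weka's Apriori implementation.
public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("supermarket.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);     // report the 10 best rules
        apriori.setMinMetric(0.9);   // minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori); // prints the discovered rules
    }
}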
EXPERIMENT-3
Aim: To create a decision tree by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
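The procedure mirrors the loading steps of Experiment 1, followed by choosing a tree learner under the Classify tab. A minimal Java sketch of such a run, assuming bank.arff (an ARFF version of bank.csv with the class attribute last) is in the working directory:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 3: train a J48 decision tree on the bank data.
public class BuildTree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        J48 tree = new J48();        // C4.5-style decision tree learner
        tree.buildClassifier(data);

        System.out.println(tree);    // textual form of the induced tree
    }
}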
EXPERIMENT-4
Aim: To find the percentage of examples that are classified correctly by the decision tree model
created above, i.e., testing on the training set.
Tools/Apparatus: Weka mining tool.
Theory:
A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 4" in diameter. Even though these features
depend on the existence of the other features, a naive Bayes classifier considers all of these
properties to contribute independently to the probability that this fruit is an apple.
An advantage of the naive Bayes classifier is that it requires only a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because the variables are assumed independent, only the variances of the variables for each class
need to be determined, and not the entire covariance matrix.
The naive Bayes probabilistic model:
The probability model for a classifier is a conditional model
P(C | F1, ..., Fn) over a dependent class variable C with a small number of outcomes or
classes, conditional on several feature variables F1 through Fn. The problem is that if the
number of features n is large, or when a feature can take on a large number of values, then basing
such a model on probability tables is infeasible. We therefore reformulate the model to make it
more tractable.
Using Bayes' theorem, we write
P(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)
Now the "naive" conditional independence assumptions come into play: assume that each feature
Fi is conditionally independent of every other feature Fj for j ≠ i.
This means that p(Fi | C, Fj) = p(Fi | C),
and so the joint model can be expressed as
p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... p(Fn | C) = p(C) Π p(Fi | C)
This means that under the above independence assumptions, the conditional distribution over the
class variable C can be expressed like this:
p(C | F1, ..., Fn) = (1/Z) p(C) Π p(Fi | C)
where Z is a scaling factor depending only on F1, ..., Fn, i.e., a constant if the values of the
feature variables are known.
Models of this form are much more manageable, since they factor into a so-called class prior
p(C) and independent probability distributions p(Fi | C). If there are k classes and if a model for
each p(Fi | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes
model has (k − 1) + n r k parameters. In practice, k = 2 (binary classification) and r = 1
(Bernoulli variables as features) are common, and so the total number of parameters of the naive
Bayes model is 2n + 1, where n is the number of binary features used for prediction.
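To make the factorization and the scaling factor Z concrete, here is a tiny hand-worked sketch for two classes and two binary features; all probabilities in it are illustrative assumptions, not values estimated from the bank data.

// Minimal numeric sketch of the naive Bayes factorization above, for two classes
// and two binary features. All probabilities are made-up, illustrative values.
public class NaiveBayesHandWorked {
    public static void main(String[] args) {
        double pYes = 0.5, pNo = 0.5;               // class priors p(C)
        double pF1GivenYes = 0.8, pF1GivenNo = 0.3; // p(F1 = true | C)
        double pF2GivenYes = 0.6, pF2GivenNo = 0.4; // p(F2 = true | C)

        // Unnormalised scores p(C) * prod_i p(Fi | C) for an instance with F1 = F2 = true.
        double scoreYes = pYes * pF1GivenYes * pF2GivenYes;   // 0.24
        double scoreNo  = pNo  * pF1GivenNo  * pF2GivenNo;    // 0.06
        double z = scoreYes + scoreNo;                        // the scaling factor Z

        System.out.println("p(YES | F1, F2) = " + scoreYes / z); // 0.8
        System.out.println("p(NO  | F1, F2) = " + scoreNo  / z); // 0.2
    }
}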
Bayes' theorem: P(h | D) = P(D | h) P(h) / P(D), where
P(h): prior probability of hypothesis h
P(D): prior probability of training data D
P(h | D): probability of h given D
P(D | h): probability of D given h
Naive Bayes Classifier: Derivation
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Go to the Classify tab.
7) Choose Classifier -> trees.
8) Select NBTree, i.e., the naive Bayesian tree, and run it with the "Use training set" test option (a Java sketch of the same run is shown below).
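A minimal Java sketch of the same run; it assumes bank.arff is in the working directory and that weka.classifiers.trees.NBTree is available (in newer Weka releases it may have to be installed as a package, or weka.classifiers.bayes.NaiveBayes can be substituted):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.NBTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 4: evaluate the classifier on the training set
// itself and report the percentage of correctly classified examples.
public class TrainingSetAccuracy {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        NBTree cls = new NBTree();
        cls.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(cls, data);               // test on the training data
        System.out.println(eval.toSummaryString());  // correctly classified %, kappa, errors
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}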
Sample output:
=== Evaluation on training set ===
Correctly Classified Instances      554      92.3333 %
Incorrectly Classified Instances     46       7.6667 %
Kappa statistic                       0.845
Mean absolute error                   0.1389
Root mean squared error               0.2636
Relative absolute error              27.9979 %
Root relative squared error          52.9137 %
Total Number of Instances           600

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.894    0.052    0.935      0.894   0.914      0.936     YES
               0.948    0.106    0.914      0.948   0.931      0.936     NO
Weighted Avg.  0.923    0.081    0.924      0.923   0.923      0.936

=== Confusion Matrix ===
   a   b   <-- classified as
 245  29 |   a = YES
  17 309 |   b = NO
EXPERIMENT-5
EXPERIMENT-6
Aim: To create a decision tree by cross-validation on the training data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Theory:
Decision tree learning, used in data mining and machine learning, uses a decision tree as a
predictive model which maps observations about an item to conclusions about the item's target
value. In these tree structures, leaves represent classifications and branches represent
conjunctions of features that lead to those classifications. In decision analysis, a decision tree can
be used to visually and explicitly represent decisions and decision making. In data mining, a
decision tree describes data rather than decisions; the resulting classification tree can then be used
as an input for decision making.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Go to the Classify tab.
7) Choose Classifier -> trees.
8) Select J48.
9) Under Test options, select Cross-validation (a Java sketch of the same run is shown below).
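A minimal Java sketch of the same cross-validation run, assuming bank.arff is in the working directory with the class attribute last:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 6: 10-fold cross-validation of a J48 decision tree.
public class CrossValidateTree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // 10 folds

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}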
Sample output:
=== Stratified cross-validation ===
Correctly Classified Instances      539      89.8333 %
Incorrectly Classified Instances     61      10.1667 %
Kappa statistic                       0.7942
Mean absolute error                   0.167
Root mean squared error               0.305
Relative absolute error              33.6511 %
Root relative squared error          61.2344 %
Total Number of Instances           600

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.861    0.071    0.911      0.861   0.886      0.883     YES
               0.929    0.139    0.889      0.929   0.909      0.883     NO
Weighted Avg.  0.898    0.108    0.899      0.898   0.898      0.883

=== Confusion Matrix ===
   a   b   <-- classified as
 236  38 |   a = YES
  23 303 |   b = NO
EXPERIMENT-7
Aim: To delete one attribute in the GUI Explorer and observe the effect, using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with the list
of available filters.
7) Select weka.filters.unsupervised.attribute.Remove.
8) Next, click on the text box immediately to the right of the "Choose" button.
9) In the resulting dialog box enter the index of the attribute to be filtered out (make sure that the
"invertSelection" option is set to false).
10) Then click "OK". Now, in the filter box you will see "Remove -R 1".
11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute
and create a new working relation.
12) To save the new working relation as an ARFF file, click on the Save button in the top panel.
13) Go to Open file and browse to the newly saved file (the one with the attribute deleted).
14) Go to the Classify tab.
15) Choose Classifier -> trees.
16) Select J48.
17) Under Test options, select "Use training set".
18) If needed, select the class attribute from the drop-down list above the Start button (a Java sketch of the same filtering and classification is shown below).
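A minimal Java sketch of the same filtering and classification, assuming bank.arff is in the working directory with the class attribute last:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch of Experiment 7: remove the first attribute (e.g. "id") with the
// Remove filter, then train and evaluate J48 on the training set.
public class RemoveAttributeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("1");     // same as "Remove -R 1" in the GUI
        remove.setInvertSelection(false);
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);
        filtered.setClassIndex(filtered.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(filtered);

        Evaluation eval = new Evaluation(filtered);
        eval.evaluateModel(tree, filtered);
        System.out.println(eval.toSummaryString());
    }
}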
Sample output:
EXPERIMENT-8
Aim: To select some attributes in the GUI Explorer, perform classification, and observe the effect,
using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Select some of the attributes from the attributes list which are to be removed, and remove them. After this step only
the attributes necessary for classification are left in the attributes panel.
7) Then go to the Classify tab.
8) Choose Classifier -> trees.
9) Select J48.
10) Under Test options, select "Use training set".
11) If needed, select the class attribute.
12) Now click Start.
13) Now we can see the output details in the Classifier output panel.
14) Right-click on the entry in the result list and select the "Visualize tree" option.
15) Compare the output results with those of the 4th experiment.
16) Check whether the accuracy has increased or decreased (a Java sketch that keeps only a chosen subset of attributes is shown below).
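A minimal Java sketch that keeps only a chosen subset of attributes before classifying; the attribute indices used are placeholders, and bank.arff is assumed to be in the working directory with the class attribute last:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch of Experiment 8: keep only a chosen subset of attributes and
// classify with J48. The indices "1-3,last" are a placeholder choice.
public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();

        Remove keep = new Remove();
        keep.setAttributeIndices("1-3,last");  // attributes to KEEP (placeholder choice)
        keep.setInvertSelection(true);         // invert: remove everything else
        keep.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, keep);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(reduced);

        Evaluation eval = new Evaluation(reduced);
        eval.evaluateModel(tree, reduced);
        System.out.println(eval.pctCorrect() + " % correct on the training set");
    }
}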
EXPERIMENT-9
Aim: To create a decision tree with cross-validation on the training data set while changing the cost matrix
in the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Go to the Classify tab.
7) Choose Classifier -> trees.
8) Select J48.
9) Under Test options, select "Use training set".
10) Click on "More options...".
11) Select "Cost-sensitive evaluation" and click on the Set button.
12) Set the matrix values and click on Resize. Then close the window.
13) Click OK.
14) Click Start.
15) We can see the output details in the Classifier output panel.
16) Under Test options, select Cross-validation.
17) Set Folds, e.g., 10 (a Java sketch of a cost-sensitive cross-validation run is shown below).
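A minimal Java sketch of a cost-sensitive cross-validation run. The 2x2 cost values are illustrative assumptions, and the CostMatrix.setCell(...) setter is the one used in recent Weka releases (older versions may use a different setter name):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 9: cross-validate J48 under a cost matrix so that
// the two kinds of misclassification are penalised differently.
public class CostSensitiveEvaluationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        CostMatrix costs = new CostMatrix(2);   // 2 classes: YES / NO
        costs.setCell(0, 1, 5.0);               // cost of misclassifying class 0 as class 1 (illustrative)
        costs.setCell(1, 0, 1.0);               // cost of misclassifying class 1 as class 0 (illustrative)

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}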
Sample output:
EXPERIMENT-10
Aim: To check whether a short rule or a long rule is better, i.e., to examine the bias, by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
This depends on the attribute set and on the relationships among the attributes that we
want to study; it should be decided based on the database and the user's requirements.
EXPERIMENT-11
Aim: To create a decision tree using the prune mode and reduced-error pruning, and show the
accuracy on a cross-validated data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Theory:
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose Classifier -> trees.
9) Select NBTree, i.e., the naive Bayesian tree.
10) Under Test options, select "Use training set" (a Java sketch using reduced-error pruning with cross-validation, as in the aim, is shown below).
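A minimal Java sketch for this experiment. Reduced-error pruning is an option of the J48 learner, so J48 is used here rather than the NBTree named in the GUI steps; bank.arff is assumed to be in the working directory with the class attribute last:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 11: J48 with reduced-error pruning, evaluated by
// 10-fold cross-validation.
public class ReducedErrorPruningDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();
        pruned.setReducedErrorPruning(true);   // prune using a held-out portion of the data

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(pruned, data, 10, new Random(1));
        System.out.println("Reduced-error pruned J48: " + eval.pctCorrect() + " % correct");
    }
}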
Sample output:
EXPERIMENT-12
Aim: To compare the OneR classifier, which uses a single attribute and rule, with the J48 and PART
classifiers, by training on the data set using the Weka mining tool.
Tools/Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to Open file and browse to bank.csv, which is already stored in the system.
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab and run OneR, J48 and PART in turn with the same test option (a Java sketch comparing the three is shown below).
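A minimal Java sketch comparing OneR, J48 and PART under the same 10-fold cross-validation, assuming bank.arff is in the working directory with the class attribute last:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of Experiment 12: compare OneR (a single-attribute rule learner)
// with J48 and PART on the same data.
public class CompareRuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] learners = { new OneR(), new J48(), new PART() };
        for (Classifier c : learners) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(c.getClass().getSimpleName() + ": "
                + String.format("%.2f", eval.pctCorrect()) + " % correct");
        }
    }
}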
Sample output:
J48
java weka.classifiers.trees.J48 -t c:/temp/bank.arff
OneR
PART
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: labor-neg-data
Instances: 57
Attributes: 17
  duration
  wage-increase-first-year
  wage-increase-second-year
  wage-increase-third-year
  cost-of-living-adjustment
  working-hours
  pension
  standby-pay
  shift-differential
  education-allowance
  statutory-holidays
  vacation
  longterm-disability-assistance
  contribution-to-dental-plan
  bereavement-assistance
  contribution-to-health-plan
  class
Test mode: evaluate on training data

Clustered Instances
  0    36 ( 63%)
  1     5 (  9%)
  2    16 ( 28%)
Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: labor-neg-data
Instances: 57
Attributes: 17
  duration
  wage-increase-first-year
  wage-increase-second-year
  wage-increase-third-year
  cost-of-living-adjustment
  working-hours
  pension
  standby-pay
  shift-differential
  education-allowance
  statutory-holidays
  vacation
  longterm-disability-assistance
  contribution-to-dental-plan
  bereavement-assistance
  class
Ignored:
  contribution-to-health-plan
Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 122.05464734126849

Initial starting points (random):
Cluster 0: 1,5.7,3.971739,3.913333,none,40,empl_contr,7.444444,4,no,11,generous,yes,full,yes,good
Cluster 1: 1,2,3.971739,3.913333,tc,40,ret_allw,4,0,no,11,generous,no,none,no,bad

Missing values globally replaced with mean/mode

Final cluster centroids (first attributes shown):
                                  Cluster#
Attribute                     Full Data        0        1
                                 (57.0)   (43.0)   (14.0)
==========================================================
duration                         2.1607    2.213        2
wage-increase-first-year         3.8036   4.2024   2.5786
wage-increase-second-year        3.9717    4.221   3.2062
wage-increase-third-year         3.9133   4.0329   3.5462
cost-of-living-adjustment          none     none     none

Clustered Instances
  0    43 ( 75%)
  1    14 ( 25%)

Incorrectly clustered instances: 31.0   54.386 %
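The clustering run above can be reproduced from the Weka Java API; a minimal sketch, assuming labor.arff (the labor-neg-data sample shipped with Weka) is in the working directory:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch of the SimpleKMeans run above: the class attribute is removed
// before clustering, as the Explorer does for ignored attributes.
public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("labor.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("last");   // drop the class attribute
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);                 // -N 2
        km.setSeed(10);                       // -S 10
        km.buildClusterer(noClass);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(noClass);
        System.out.println(km);                           // centroids, iterations, SSE
        System.out.println(eval.clusterResultsToString()); // clustered instances per cluster
    }
}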