COC131 Tutorial w6
Briefly inspect the output produced by each Associator and try
to interpret its meaning.
(b) In association rule mining the number of possible association rules can be very large even with tiny datasets, so it is in our best interest to reduce the set of rules found to only the most interesting ones. This is usually achieved by setting minimum thresholds on support and confidence values. Still in the Associate view, select the Apriori algorithm again, click on the textbox next to the Choose button and try, in turn, different values for the following parameters: lowerBoundMinSupport (minimum threshold for support) and minMetric (minimum threshold for confidence). As you change these parameter values, what do you notice about the rules that are found by the associator? Note that the parameter numRules limits the maximum number of rules that the associator looks for; you can try changing this value too. (A code sketch after part (d) shows how these same parameters can be set programmatically.)
(c) This time run the Apriori algorithm with the outputItemSets parameter set to true. You will notice that the algorithm now also outputs a list of Generated sets of large itemsets: at different levels. If you have the module's Data Mining book by Witten & Frank with you, you can compare and contrast the Apriori associator's output with the association rules on pages 114-116 (I will have a couple of copies circulating in the lab during the session, just ask me for one). I also strongly recommend reading through chapter 4.5 in your own time while playing with the weather data in Weka; this chapter gives a nice & easy introduction to association rules. Notice in particular how the item sets and association rules produced by Weka compare with tables 4.10-4.11 in the book.
(d) Compare the association rules output by Apriori and Tertius (you can do this by navigating through the already built associator models in the Result list on the left side of the screen). Make sure that the Apriori algorithm shows at least 20 rules. How do the association rules generated by the two different methods compare to each other?
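For reference, here is a minimal sketch of running Apriori with the parameters from parts (b) and (c) through the Weka Java API instead of the Explorer GUI. It assumes weka.jar is on your classpath; the file path and the parameter values are only illustrative, so adjust them to your own install.

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriSketch {
        public static void main(String[] args) throws Exception {
            // Path is an assumption: point this at the weather data in your Weka install
            Instances data = DataSource.read("data/weather.nominal.arff");

            Apriori apriori = new Apriori();
            apriori.setNumRules(20);              // cap on the number of rules reported
            apriori.setLowerBoundMinSupport(0.2); // minimum threshold for support
            apriori.setMinMetric(0.9);            // minimum threshold for confidence
            apriori.setOutputItemSets(true);      // also print the large item sets, as in (c)

            apriori.buildAssociations(data);
            System.out.println(apriori);          // same rules you see in the Associate view
        }
    }

If your Weka version still ships Tertius, it can be swapped in analogously and built with buildAssociations in the same way.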
Something to always remember about association rules is that they should not be used for prediction directly, that is, without further analysis or domain knowledge, as they do not necessarily indicate causality. They are, however, a very helpful starting point for further exploration and for building a better understanding of our data.
3. As you should certainly know by this point, a correlation matrix and a scatter plot matrix can be very useful for identifying associations between parameters. To remind yourself of this, it might be helpful to look back at tutorials 2, 3 or 5.
The simple linear regression model is

    y = β₀ + β₁x + ε,

where β₀ is the intercept, β₁ the slope, and ε an error term.
So the most accurate model is the one that yields the best-fit line to the data in question: we are looking for the minimal sum of squared deviations between the actual and fitted values. This is called the method of least squares, written out below.
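In LaTeX notation, the least-squares criterion and its standard closed-form solution for the simple linear model above are:

    % Choose the coefficients that minimise the sum of squared residuals
    \hat{\beta}_0, \hat{\beta}_1
      = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

    % Closed-form solution in terms of the sample means \bar{x}, \bar{y}
    \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
    \qquad
    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}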
Now that we have briefly reminded ourselves of the very basics of regression, let's move directly on to an example in Weka.
(a) In Weka go back to the Preprocess tab. Open the iris dataset (iris.arff; this should be in the ./data/ directory of the Weka install).
(b) In the Attributes section (bottom left of the screen) select the class feature and click Remove. We need to do this, as simple linear regression cannot deal with non-numeric values.
(c) Next select the Classify tab to get into the Classification perspective of Weka, and choose LinearRegression (under functions).
(d) Clicking on the textbox next to the Choose button brings up the parameter editor window. Click on the More button to get information about the parameters. Make sure that attributeSelectionMethod is set to No attribute selection and eliminateColinearAttributes is set to False.
(e) Finally, make sure that you select the parameter petalwidth in the dropdown box just under the Test options. Hit Start to run the regression. Inspect the results; in particular, pay attention to the Linear Regression Model formula returned, and the coefficients and intercept of the straight line equation. As this is a numeric prediction/regression problem, accuracy is measured with Root Mean Squared Error, Mean Absolute Error and the like. As most of you will have noticed, you can repeat this process for regressing the other features in turn, and compare how well the different features can be predicted. (A sketch of the same experiment run through the Weka Java API follows below.)
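To tie the steps together, here is a minimal sketch of the same experiment via the Weka Java API. It assumes weka.jar is on your classpath; the file path is an assumption, and the 10-fold cross-validation at the end is just one reasonable way to obtain the RMSE and MAE figures mentioned above.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class IrisRegressionSketch {
        public static void main(String[] args) throws Exception {
            // Path is an assumption: point this at iris.arff in your Weka install
            Instances data = DataSource.read("data/iris.arff");

            // Step (b): drop the nominal class attribute (the last attribute),
            // as simple linear regression cannot deal with non-numeric values
            Remove remove = new Remove();
            remove.setAttributeIndices("last");
            remove.setInputFormat(data);
            Instances numeric = Filter.useFilter(data, remove);

            // Step (e): regress petalwidth (the last remaining attribute)
            numeric.setClassIndex(numeric.numAttributes() - 1);

            // Step (d): no attribute selection, no collinearity elimination
            LinearRegression lr = new LinearRegression();
            lr.setAttributeSelectionMethod(new SelectedTag(
                LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));
            lr.setEliminateColinearAttributes(false);

            lr.buildClassifier(numeric);
            System.out.println(lr); // coefficients and intercept of the fitted line

            // Report the error measures discussed above via 10-fold cross-validation
            Evaluation eval = new Evaluation(numeric);
            eval.crossValidateModel(lr, numeric, 10, new Random(1));
            System.out.println("RMSE: " + eval.rootMeanSquaredError());
            System.out.println("MAE:  " + eval.meanAbsoluteError());
        }
    }

Changing the class index picked out with setClassIndex lets you regress each of the other features in turn, mirroring the comparison suggested in step (e).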