IS328 Assignment1-Unlocked
This assignment covers both the theoretical and practical aspects of this course. The marking rubric is based heavily on Data & Information
Management, in line with the course outline and the BSE program map. Rubrics have been taken from ACS-SCIMS rubrics V1.0. This
assessment covers the following course learning outcomes:
CLO 2: Perform pre-processing tasks to refine data sets
CLO 3: Apply various data mining methods for interpreting results
Overview
The goals of this assignment are:
• As a class, to become familiar with the WEKA tool and understand some of the data preprocessing methods, attribute selection
methods, classification and clustering algorithms.
• As a team of 2 members, to discuss your findings on the chosen questions and consolidate your learning in a report (answers
to each question: algorithm, working screenshots and results, comparison and analysis).
• Make a consolidated report about your findings.
• This assignment is an important part of the course and counts for 20% of your final grade. Grades will be based on the completeness
of your findings, your analysis and the quality of the report.
In the bank-data.arff data set, each record is uniquely identified by a customer id (the "id" attribute). Remove this attribute before the data mining
step by using the attribute filters in WEKA. In the Filter panel, click on the "Choose" button. This will show a popup window with a list
of available filters. Scroll down the list and select the weka.filters.unsupervised.attribute.Remove filter.
Next, click on the text field immediately to the right of the "Choose" button. In the resulting dialog box, enter the index of the attribute to be
filtered out (this can be a range, or a list separated by commas). In this case, enter 1, which is the index of the "id" attribute (see the left
panel). Make sure that the invertSelection option is set to false (otherwise everything except attribute 1 will be filtered out). Then click "OK".
Now, in the filter box you will see Remove -R 1.
Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and create a new working relation (whose name
now includes the details of the filter that was applied). The result is depicted. Display the result.
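The effect of the Remove filter can also be sketched outside the GUI. The Python below is only an illustration, not WEKA's implementation: the function name remove_attribute and the three-attribute sample are hypothetical, and the sketch assumes a dense ARFF file with no quoted commas in the data rows.

```python
def remove_attribute(arff_text, index):
    """Drop the attribute at 1-based position `index` from simple ARFF text."""
    kept_lines = []
    attributes_seen = 0
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if in_data:
            # Data rows: delete the matching column (comment/blank lines kept).
            if stripped and not stripped.startswith("%"):
                values = stripped.split(",")  # assumes no quoted commas
                del values[index - 1]
                line = ",".join(values)
            kept_lines.append(line)
        elif lowered.startswith("@attribute"):
            # Header: skip the declaration of the attribute being removed.
            attributes_seen += 1
            if attributes_seen != index:
                kept_lines.append(line)
        else:
            if lowered.startswith("@data"):
                in_data = True
            kept_lines.append(line)
    return "\n".join(kept_lines)


# Hypothetical three-attribute sample, not the real bank-data.arff.
sample = """@relation bank-sample
@attribute id string
@attribute age numeric
@attribute sex {FEMALE,MALE}
@data
ID001,48,FEMALE
ID002,40,MALE"""

print(remove_attribute(sample, 1))
```

Calling remove_attribute(sample, 1) mirrors Remove -R 1: the first @attribute line and the first value of every data row are dropped. Unlike WEKA, the sketch does not rename the relation.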
2. Discretization
Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on
numeric or continuous attributes. There are 3 such attributes in this data set: "age", "income", and "children". In the case of the "children"
attribute, the only possible values are 0, 1, 2, and 3. In this case, we have opted for keeping all of these values in the data. This
means we can simply discretize by removing the keyword "numeric" as the type for the "children" attribute in the ARFF file and replacing
it with the set of discrete values. Do this directly in a text editor and save the resulting relation in a separate file, bank-data2.arff.
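The manual header edit described above can be sketched in Python as follows. The helper name make_nominal and the two-line header are hypothetical, and the sketch assumes the attribute is declared on a single line ending in the keyword numeric.

```python
import re


def make_nominal(arff_text, name, labels):
    """Rewrite '@attribute <name> numeric' as a nominal declaration."""
    pattern = re.compile(
        r"^(@attribute\s+" + re.escape(name) + r"\s+)numeric[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    nominal = "{" + ",".join(labels) + "}"
    # A function replacement avoids any escaping issues in the new text.
    return pattern.sub(lambda m: m.group(1) + nominal, arff_text)


# Hypothetical header fragment for illustration.
header = "@attribute income numeric\n@attribute children numeric\n"
print(make_nominal(header, "children", ["0", "1", "2", "3"]))
```

Only the declaration changes; the data rows already contain the values 0 to 3, so they need no editing.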
Rely on WEKA to perform discretization on the "age" and "income" attributes. Here, divide each of these into 3 bins (intervals). The
WEKA discretization filter can divide the ranges blindly, or use various statistical techniques to automatically determine the best way of
partitioning the data. In this case, perform simple binning.
First, load the filtered data set into WEKA by opening the file "bank-data2.arff". Select the "children" attribute in this new data set and note that it
is now a categorical attribute with four possible discrete values. Now, once again activate the Filter dialog box, but this time, select
weka.filters.unsupervised.attribute.Discretize.
Next, to change the defaults for this filter, click on the box immediately to the right of the "Choose" button. This will open the Discretize
Filter dialog box. Enter the index for the attributes to be discretized. In this case, enter 1, corresponding to the attribute "age". Also, enter 3
as the number of bins (note that it is possible to discretize more than one attribute at the same time by using a list of attribute indices). Since
this is simple binning, all of the other available options are set to "false".
Click "Apply" in the Filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins. To examine
the results, save the new working relation in the file bank-data3.arff.
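To see what simple binning does to the values, here is a minimal Python sketch of equal-width binning, which is what "simple binning" is taken to mean here. The function names and the list of ages are hypothetical; the minimum of 18 and maximum of 67 were chosen only because they reproduce the 34.333333 cut point that appears in the label "(-inf-34.333333]" quoted later in this section.

```python
def equal_width_cut_points(values, bins):
    """Return the bins - 1 boundaries that split [min, max] into equal widths."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]


def bin_index(value, cut_points):
    """Index of the bin for `value`; each bin is closed on its right edge."""
    for i, cut in enumerate(cut_points):
        if value <= cut:
            return i
    return len(cut_points)


ages = [18, 25, 34, 35, 50, 51, 67]  # hypothetical ages, not the real data
cuts = equal_width_cut_points(ages, 3)
print([round(c, 6) for c in cuts])
print([bin_index(a, cuts) for a in ages])
```

With these assumed ages the bin width is (67 - 18) / 3, so the first boundary falls at about 34.333333. An age of 34 lands in the first bin and an age of 35 in the second.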
Next, apply the same process to discretize the "income" attribute into 3 bins. Again, WEKA automatically performs the binning and replaces
the values in the "income" column with the appropriate automatically generated labels. Save the new file as bank-data3.arff, replacing
the older version.
Clearly, the WEKA labels, while readable, leave much to be desired as far as naming conventions go. Therefore, use the global search/replace
function in TextPad/WordPad to replace these labels with more succinct and readable ones. For example, use Replace All to change the age label
"(-inf-34.333333]" to the label "0_34".
Note that the new label now appears in place of the old one both in the attribute section of the ARFF file as well as in the relevant data
records. Repeat this manual re-labeling process with all of the WEKA-assigned labels for the "age" and the "income" attributes. Also,
change the relation name in the ARFF file to bank-data-final and save the file as bank-data-final.arff.
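The global search/replace and the relation rename can be sketched in Python. This is only an illustration of the manual step: the function name relabel and the sample text are hypothetical, and only the mapping to "0_34" comes from the text above; the other two labels are made up. The sketch assumes the labels appear unquoted; if WEKA wrapped them in quotes when saving, the quotes would be left in place and should be checked by hand.

```python
import re


def relabel(arff_text, label_map, relation_name):
    """Replace every old label with its new one and rename the relation."""
    for old, new in label_map.items():
        arff_text = arff_text.replace(old, new)
    return re.sub(
        r"^@relation\s+.*$",
        lambda m: "@relation " + relation_name,
        arff_text,
        count=1,
        flags=re.IGNORECASE | re.MULTILINE,
    )


# Hypothetical fragment; the real file has more attributes and rows.
sample = (
    "@relation bank-data3\n"
    "@attribute age {(-inf-34.333333],(34.333333-50.666667],(50.666667-inf)}\n"
    "@data\n"
    "(-inf-34.333333]\n"
    "(50.666667-inf)\n"
)
label_map = {
    "(-inf-34.333333]": "0_34",        # mapping given in the text
    "(34.333333-50.666667]": "35_51",  # hypothetical label
    "(50.666667-inf)": "52_67",        # hypothetical label
}
print(relabel(sample, label_map, "bank-data-final"))
```

As in the editor, each label changes both in the attribute declaration and in the data records.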
3. Missing Values
II. Do not perform all required tasks correctly and consistently | II. Perform most of the required tasks correctly and consistently | II. Perform all the required tasks correctly and consistently | 5
III. Provide inaccurate and/or incomplete reports | III. Provide relatively accurate and complete reports | III. Provide accurate and complete reports
Sub Total & comments
A.
1. You suspect marked differences in promotional purchasing trends between female and male Acme credit card customers. You wish
to confirm or refute your suspicion. Perform a supervised data mining session using the CreditCardPromotion database (ccpromo.arff)
in conjunction with PART. Use sex as the output attribute. Designate all other attributes as input attributes and use all 15 instances
for training. Write a summary confirming or refuting your hypothesis. Base the analysis on the rules created for each class.
2. Repeat the exercise using J48 rather than PART but base the analysis on the created decision tree.
B.
1. For this question, use WEKA’s J48 decision tree algorithm to perform a data mining session with the cardiology patient data. Open
the WEKA Explorer and load the cardiology-weka.arff file. This is the mixed form of the dataset containing both categorical and
numeric data. The data contains 303 instances representing patients who have a heart condition (sick) as well as those who do not.
a. What attribute did J48 choose as the top-level decision tree node?
b. Draw a diagram showing the attributes and values for the first two levels of the J48 created decision tree.
c. What percent of the instances were correctly classified?
d. How many healthy class instances were correctly classified?
e. How many sick class instances were falsely classified as healthy individuals?
f. Determine how True Positive Rate (TP Rate) and False Positive Rate (FP Rate) are computed.
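For part f, the usual definitions are TP Rate = TP / (TP + FN) and FP Rate = FP / (FP + TN), computed per class from the confusion matrix. The Python below is a minimal sketch of that arithmetic. The function names are hypothetical and the counts are made up for illustration; they are not the cardiology results.

```python
def class_rates(matrix, k):
    """Return (TP rate, FP rate) for class index k.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                       # actual k, predicted other
    fp = sum(row[k] for row in matrix) - tp        # actual other, predicted k
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


# Made-up counts for illustration only.
# Rows are actual classes, columns are predicted classes: [healthy, sick].
confusion = [[40, 10],
             [5, 45]]

print(class_rates(confusion, 0))  # rates for the healthy class
print(class_rates(confusion, 1))  # rates for the sick class
print(accuracy(confusion))
```

In this made-up matrix, 40 of the 50 actual healthy instances are predicted healthy, and 5 of the 50 actual sick instances are wrongly predicted healthy, which is where the healthy-class rates come from.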
C.
Load the CreditScreening dataset into the WEKA Explorer. Make sure that class is designated as the output attribute.
a. Use J48 together with 10-fold cross validation to mine the data. Record your results including the attributes used to create the root
node and first level of the decision tree.
b. Use Info Gain attribute evaluation to determine the most predictive categorical attribute for each of the two classes. Return to WEKA's
Preprocess mode. Eliminate all but the two most predictive input attributes from the attribute list. Be sure to keep the output
attribute class. Use J48 with 10-fold cross validation to mine the data. Record your results. Compare the results to those seen in part
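As background for the Info Gain evaluation, information gain for a categorical attribute is the class entropy minus the weighted entropy of the class within each attribute value. The Python below is a minimal sketch of that calculation, not WEKA's code; the function names and the four-instance toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the weighted class entropy within each value."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: four instances with a two-valued class.
classes = ["+", "+", "-", "-"]
perfect = ["t", "t", "f", "f"]   # separates the classes completely
useless = ["t", "f", "t", "f"]   # tells us nothing about the class

print(info_gain(perfect, classes))
print(info_gain(useless, classes))
```

An attribute that splits the classes cleanly scores the full class entropy (1 bit here), while one that is independent of the class scores zero. Ranking attributes by this score is the idea behind keeping only the most predictive ones.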
D.
Load the CreditScreening dataset into the WEKA Explorer. Make sure that class is designated as the output attribute.
a. Use SimpleCart together with 10-fold cross validation to mine the data. How many nodes are seen in the decision tree? What is the
classification accuracy?
b. Compare your results with those seen in question C part b.
E.
Use WordPad or MS Word to open the soybean dataset located in the folder c:\program files\weka-3-6\data or in your WEKA data set folder. This dataset
represents one of the more famous data mining successes. Classification accuracy of unseen instances is likely to be above 90% with most
classifiers.
a. Scroll through the file to get a better understanding of the dataset. Open WEKA’s Explorer and load this dataset. Classify the data
by applying J48 with a 10-fold cross validation. Report your results.
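The idea behind 10-fold cross validation can be sketched in Python: the instances are divided into 10 folds, and each fold is held out once for testing while the other 9 are used for training. The function name and the instance count of 25 are hypothetical. WEKA's Explorer is understood to also shuffle and stratify the folds, which this sketch deliberately omits.

```python
def k_fold_splits(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Deal instance indices round-robin into k folds.
    folds = [list(range(start, n_instances, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, test


# Hypothetical data set size for illustration.
splits = list(k_fold_splits(25, k=10))
for train, test in splits[:2]:
    print(len(train), test)
```

Every instance is used for testing exactly once, so the reported accuracy is based on predictions for instances the model did not see during training.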
1. Fill in the Mark Allocation Sheet completely and submit it with your assignment. Failing to do so may result in a deduction of 50% of the marks.
2. This assignment can be submitted in groups of 2 members. Assign a group leader and submit the assignment through the group
leader’s Moodle account. You have to submit just one zip/rar file of your project. The submission filename should read
A1_Sxxx_Syyy.zip or A1_Sxxx_Syyy.rar, where Sxxx and Syyy are the student ids of the group members. For example,
A1_S11003232_S01004488.zip or A1_S11003232_S01004488.rar. An incorrect submission will result in a high penalty.
3. 25 Marks are allocated for applying appropriate DM techniques and deriving the correct results in Question 2 (A to E: 5 marks each)
for the drafting and consolidation of the report.
After having discussed as a group, we recommend the following mark allocation to each group member based on contribution or lack of it
throughout the assignment.
Certification
Presentation
Assign 1
Assign 2
IS328
Test1
MST
Core Body of Knowledge
Complex Computing
ICT Professional Ethics M
Knowledge ✓ ✓
Professional expectations M
Teamwork concepts/issues M ✓ ✓
Communication M ✓ ✓
Societal Issues/Legal issues/Privacy M
Understanding the ICT profession
ICT Problem Solving: Abstraction
Design
Technology Resources Hardware and Software
Fundamentals
Data and Information Management M ✓ ✓
Networking
Technology Building Human Factors
Programming
Systems Development / Acquisition
ICT Management IT Governance and organisational issues
IT Project management
Service management
Security management