AI32 Guide to Weka

Andrew Roberts http://www.comp.leeds.ac.uk/andyr 1st March 2005

Introduction

Weka is an excellent system for learning about machine learning techniques. Of course, it is a generic toolkit, which means it provides many more features than you will require for your AI32 coursework. Therefore, this guide was written to ensure you know all you require to complete your tasks.

ARFF format

For Weka to analyse datasets, it needs a format that users can input so that it understands the structure of the data. The ARFF format is what Weka uses and is very simple. There are only a few tags to be aware of, which all begin with the @ symbol. Lines beginning with % are for comments.

2.1 Header

The header section of an ARFF file is very simple and merely defines the name of the dataset along with the set of attributes and their associated types.

@relation
The @relation <name> tag is declared at the beginning of the file, where <name> is the name of the relation you wish to use.

@attribute
Attributes are defined in the following format: @attribute <attribute-name> <type>. Currently, an attribute can be one of four types:

numeric: a real or integer value.
nominal-specification: the value must be from a pre-defined set of possible values.
string: textual values.
date [<date-format>]: for storing dates. The optional date format argument instructs Weka on how to parse and print the dates. However, this type will not be very useful for NLP tasks.

Example header:

    @relation employees

    @attribute empName string
    @attribute empSalary numeric
    @attribute empGender {male, female}
    @attribute empDob date "yyyy-MM-dd"

Note also that attributes are case-sensitive, so the label Name is different to name. It's also worth being aware that Weka can act weirdly when there is a value within the data that is the same as an attribute name. Therefore, it is recommended not to use names that are normal English words to avoid this problem (as you can see in the example, I've added a small prefix to each name).

2.2 Data

The second half of an ARFF file is the data itself. The @data tag signifies to Weka that the instance data is about to commence. Each line after the tag is a set of values (separated by commas) that represent a single instance. Weka expects the values to appear in the same order in which the attributes were declared. Values that contain spaces must be quoted.

    @data
    'Andrew Roberts', 50000, male, 1980-11-09
    'Phil Space', 20000, male, 1976-04-12

Note that string values and nominal values are case-sensitive.
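As a rough illustration of how the header and data sections fit together, the following Python sketch builds the text of the employees example from a list of attributes and rows. The function names `quote` and `to_arff` are made up for this guide and are not part of Weka. The sketch wraps any value containing a space or comma in single quotes, on the assumption that the reader expects such values to be quoted, and it does not attempt to handle values that themselves contain quote characters.

```python
def quote(value):
    """Wrap a value in single quotes if it contains a space or comma.

    Assumption: embedded quote characters are not handled in this sketch.
    """
    text = str(value)
    if " " in text or "," in text:
        return "'" + text + "'"
    return text


def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and rows of values."""
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data"]
    for row in rows:
        # Values must follow the order in which the attributes were declared.
        if len(row) != len(attributes):
            raise ValueError("row length does not match the attribute count")
        lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("empName", "string"),
    ("empSalary", "numeric"),
    ("empGender", "{male,female}"),
    ("empDob", 'date "yyyy-MM-dd"'),
]
rows = [
    ("Andrew Roberts", 50000, "male", "1980-11-09"),
    ("Phil Space", 20000, "male", "1976-04-12"),
]

arff_text = to_arff("employees", attributes, rows)
print(arff_text)
```

The resulting text can be saved with a .arff extension and opened from the Explorer.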

Getting Weka up and running

1. To load Weka:
   Linux: java -jar /usr/local/weka-3-4/weka.jar
   Windows: I assume there's an icon somewhere!
2. Select Explorer on the window that pops up.

3.1 Loading files

By default, the Preprocess tab of the Explorer will appear. To load your ARFF file, click Open File and use the dialog box to locate your file. If all goes well you should see the window fill up with information about all the attributes in the data file, plus various statistics.

3.2 Selecting features

Depending on the task at hand, you may wish to focus on only a subset of the available attributes. On the left-hand side of the Preprocess window you will see all the attributes. To discard unwanted attributes, tick them and then click the Remove button towards the bottom left of the window. Alternatively, it may be quicker to tick the attributes you want to keep, click Invert, and then click Remove.

3.3 How to classify

Once you are happy with your data and attributes, you can begin experimenting with classifiers. Click the Classify tab towards the top of the window. At the top left of that window is a button labelled Choose, accompanied by a text box that contains ZeroR. ZeroR is the currently selected classifier, and clicking the button will present a screen which will enable you to select one of the many others available.

ZeroR is not a very useful classifier. Imagine you have a set of instances that can be classified into the classes Yes or No. ZeroR scans the training data and finds the most frequent class. All future data will be classified according to that most frequent class. Its lack of sophistication, however, makes it a good baseline to compare against. By comparing the accuracy of your (much more sophisticated) classifier against the ZeroR baseline, you can measure the relative improvement. Of course, if you perform worse than ZeroR, you probably ought to be concerned! It is left as an exercise for the reader to investigate which other classifiers Weka has available, and what they all do.
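To make the baseline idea concrete, here is a minimal Python sketch of a majority-class classifier in the spirit of ZeroR as described above. It is not Weka's implementation; the function name `train_zero_r` and the toy labels are invented for illustration.

```python
from collections import Counter


def train_zero_r(labels):
    """Return the most frequent class label seen in the training data."""
    if not labels:
        raise ValueError("need at least one training label")
    return Counter(labels).most_common(1)[0][0]


# Toy training data (hypothetical): three "yes" instances and two "no".
training_labels = ["yes", "yes", "no", "yes", "no"]
majority = train_zero_r(training_labels)

# Every unseen instance receives the same prediction, whatever its attributes.
test_labels = ["yes", "no", "yes", "yes"]
predictions = [majority] * len(test_labels)

baseline_accuracy = sum(
    p == a for p, a in zip(predictions, test_labels)
) / len(test_labels)
print(majority, baseline_accuracy)
```

Any classifier worth using on this data should score above `baseline_accuracy`.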

Cross-validation

Classifiers rely on being trained before they can reliably be used on new data. It stands to reason that the more instances the classifier is exposed to during the training phase, the more reliable it will be, as it has more experience. However, once trained, we would like to test the classifier too, so that we are confident that it works successfully. For this, yet more unseen instances are required. A problem which often occurs is the lack of readily available training/test data. These instances must be pre-classified, which is typically time-consuming (hence the reason we are trying to automate it with a software classifier!).

A nice method to circumvent this issue is known as cross-validation. It works as follows:

1. Separate the data into a fixed number of partitions (or folds).
2. Select the first fold for testing, whilst the remaining folds are used for training.
3. Perform classification and obtain performance metrics.
4. Select the next partition for testing and use the rest as training data.
5. Repeat classification until each partition has been used as the test set.
6. Calculate an average performance from the individual experiments.

The experience of many machine learning experiments suggests that using 10 partitions (tenfold cross-validation) often yields the same error rate as if the entire data set had been used for training.
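The procedure above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "classifier" is a majority-class baseline, folds are formed by dealing instance indices out in turn, and the names `make_folds` and `cross_validate` are invented here. Weka's own fold assignment may differ (for instance, it may shuffle or stratify the data first).

```python
from collections import Counter


def make_folds(n_instances, k):
    """Deal instance indices into k folds of near-equal size."""
    if k < 2 or k > n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    return [list(range(i, n_instances, k)) for i in range(k)]


def cross_validate(labels, k):
    """Average the accuracy of a majority-class baseline over k folds."""
    scores = []
    for test_idx in make_folds(len(labels), k):
        held_out = set(test_idx)
        # Train on every instance that is not in the current test fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        # Test on the held-out fold only.
        correct = sum(labels[i] == majority for i in test_idx)
        scores.append(correct / len(test_idx))
    # Average the per-fold scores.
    return sum(scores) / len(scores)


# Hypothetical data set: 14 "yes" instances followed by 6 "no" instances.
labels = ["yes"] * 14 + ["no"] * 6
mean_accuracy = cross_validate(labels, k=10)
print(round(mean_accuracy, 3))
```

Each instance is used for testing exactly once and for training in the other k - 1 rounds, which is what lets a small data set serve both purposes.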

Understanding the output

Weka spews all sorts of information after completing the classification task. Some parts are fairly self-explanatory, such as Correctly Classified Instances, whereas confusion matrices are not necessarily intuitive at first glance and require some practice to interpret.

5.1 Accuracy

Nothing really difficult here, especially as it's given in the Weka output. It gives a measure for the overall accuracy of the classifier:

\[ \text{accuracy} = \frac{\text{number of correctly classified instances}}{\text{number of instances}} \]

5.2 Precision and recall

With respect to classifiers:

\[ \text{precision}(X) = \frac{\text{number of correctly classified instances of class } X}{\text{number of instances classified as belonging to class } X} \]

\[ \text{recall}(X) = \frac{\text{number of correctly classified instances of class } X}{\text{number of instances in class } X} \]
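As a small worked illustration of the precision and recall definitions in this section, the Python sketch below computes both for one class from parallel lists of actual and predicted labels. The function name and the toy labels are invented for this example, and returning 0.0 when a denominator is zero is merely one convention chosen here.

```python
def precision_recall(actual, predicted, target):
    """Return (precision, recall) for the class `target`."""
    true_pos = sum(
        a == target and p == target for a, p in zip(actual, predicted)
    )
    classified_as_target = sum(p == target for p in predicted)
    in_target_class = sum(a == target for a in actual)
    # Assumption: report 0.0 rather than fail when a denominator is zero.
    precision = true_pos / classified_as_target if classified_as_target else 0.0
    recall = true_pos / in_target_class if in_target_class else 0.0
    return precision, recall


# Hypothetical labels for six instances.
actual = ["yes", "yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "yes", "yes", "yes"]

precision, recall = precision_recall(actual, predicted, "yes")
print(precision, recall)
```

Here three of the five instances labelled yes really are yes (precision 3/5), while three of the four true yes instances were found (recall 3/4).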

5.3 Confusion matrix

Confusion matrices are very useful for evaluating classifiers, as they provide an efficient snapshot of a classifier's performance by displaying the distribution of correct and incorrect instances. Typical Weka output contains the following:

    === Confusion Matrix ===

     a  b   <-- classified as
     7  2 |  a = yes
     3  2 |  b = no

Weka was trying to classify instances into two possible classes: yes or no. For the sake of simplicity, Weka substitutes a for yes, and b for no. The columns represent the instances that were classified as that class. So, the first column shows that in total 10 instances were classified as a by Weka, and the second that 4 were classified as b. The rows represent the actual instances that belong to that class. So, what this now tells us is the number of times a given class is correctly or incorrectly classified.

From the matrix, we can observe that 7 of the instances that should have been classed as a were in fact correctly identified. Similarly, 2 b's were classified correctly. However, we can also see that 2 a's were incorrectly classified as b, whereas 3 b's were classed as a.

This fine-grained perspective can provide interesting insights. It also allows you to assess the suitability of a particular classifier. Imagine the following classifier that uses a set of attributes to decide whether a patient has cancer or not.

    === Confusion Matrix ===

     a  b   <-- classified as
    10  3 |  a = cancer
     1  6 |  b = no cancer

This classifier successfully classifies 16 out of the 20 cases presented. However, an alarming 3 patients will be given the all clear when they did in fact have cancer. The one patient who was told they had cancer when they did not will also not be happy in the short term, but you can clearly see that this outcome is much more favourable.

Now, the classifier was updated, and when fed the same data, it resulted in the following matrix:

    === Confusion Matrix ===

     a  b   <-- classified as
    10  0 |  a = cancer
     6  4 |  b = no cancer

This time the classifier does a worse job overall, only correctly classifying 14 of the 20 cases. Yet it finds no false negatives. Of course, this is because the system is over-cautious and instead produces many false positives, but that is probably preferable in this "better to be safe than sorry" scenario. Hopefully you can now see why confusion matrices are useful for evaluating classifiers beyond a straightforward accuracy score.
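The comparison between the two cancer classifiers can be reproduced with a short Python sketch. It rebuilds a confusion matrix from label lists (rows are actual classes, columns are predicted classes, matching the Weka layout described above) and then reads off the accuracy and the number of false negatives. The function names and the reconstructed label lists are invented for illustration; only the matrix counts come from the text.

```python
def confusion_matrix(actual, predicted, classes):
    """Count instances per (actual, predicted) pair; rows are actual classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def summarise(matrix):
    """Return (accuracy, false negatives) for a 2x2 matrix.

    Assumption: the first class (row 0, column 0) is the positive one.
    """
    total = sum(sum(row) for row in matrix)
    correct = matrix[0][0] + matrix[1][1]
    false_negatives = matrix[0][1]
    return correct / total, false_negatives


classes = ["cancer", "no cancer"]

# Label lists reconstructed to match the counts of the first matrix.
actual = ["cancer"] * 13 + ["no cancer"] * 7
predicted = (["cancer"] * 10 + ["no cancer"] * 3
             + ["cancer"] * 1 + ["no cancer"] * 6)
first = confusion_matrix(actual, predicted, classes)

# The second matrix, entered directly from the text.
second = [[10, 0], [6, 4]]

first_summary = summarise(first)
second_summary = summarise(second)
print(first, first_summary)
print(second, second_summary)
```

The second classifier has the lower accuracy but zero false negatives, which is the trade-off the discussion above is drawing attention to.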

Additional resources
Weka project page
http://www.cs.waikato.ac.nz/ml/weka/

MLnet, the Machine Learning Network Online Information Service
http://www.mlnet.org/

Why using large feature sets can cause problems
http://en.wikipedia.org/wiki/Curse_of_dimensionality

Machine Learning links from the Google Directory
http://directory.google.com/Top/Computers/Artificial_Intelligence/Machine_Learning
