University of Waikato: Data Mining With Weka
WEEK 1
The glass data
I’m going to look at a different dataset. I’m going to look at the “glass” dataset,
which is a rather more extensive dataset. It’s a real world dataset, not a terribly big
one. Let’s open it. Here we’ve got 214 instances and 10 attributes. Here are the 10
attributes; it’s not clear what they are. Let’s look at the “class”, which by default is the last
attribute shown. There are seven values for the class, and the labels of these values
give you some indication of what this dataset is about. We have “headlamps”,
“tableware” (starting from the bottom), “containers”. Then we have “building” and
“vehicle” windows, both “float” and “non-float”. You may not know this, but there
are different ways of making glass, and the float process is one of them.
These are seven different kinds of glass.
What are the attribute values? I don’t know what you remember about physics, and
I guess it doesn’t matter if you don’t remember. RI stands for the refractive index.
It’s always a good idea to check for reasonableness when you’re looking at
datasets. It’s really important to get down and dirty with your data. Here we’re
looking at the values of the refractive index—a minimum of 1.511, a maximum of
1.534. It’s good to think about whether these are reasonable values for refractive
index. If you go to the web and have a look around, you’ll find that these are good
values for the refractive index.
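
A check like this can also be scripted. The following is a minimal sketch in Python, not something Weka provides: the function name `check_range` and the plausible bounds of 1.4 and 1.6 are assumptions chosen for illustration, while 1.511 and 1.534 are the minimum and maximum read off the Weka interface.

```python
def check_range(observed_min, observed_max, plausible_low, plausible_high):
    """Return True if the observed range lies inside the plausible range."""
    return plausible_low <= observed_min and observed_max <= plausible_high


# Minimum and maximum of RI as reported in the Weka interface.
ri_min, ri_max = 1.511, 1.534

# Assumed plausible bounds for the refractive index of glass (illustrative only).
print(check_range(ri_min, ri_max, 1.4, 1.6))
```

The bounds would normally come from whatever you know, or can look up, about the quantity being measured.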
Na. If you did chemistry, you’ll recognize Na as sodium. Here, it looks like these are
percentages: the percentages of sodium (Na), magnesium (Mg), and so on. We
would expect silicon (Si) to make up the majority of glass. It varies between 69.81%
and 75.41%. These are percentages of different elements in the glass.
We can confirm our guesses here by looking at the data file itself. Let me just find
the “glass” data. It’s in Weka datasets, and it’s glass.arff. This is the ARFF file
format. It starts with a bunch of comments about the glass database. These lines
beginning with percentage signs (%) are comments. You can read about this. We
don’t have time to read it now.
FutureLearn 1
You can see information about the attributes, and it does say that the attributes are refractive
index, sodium, magnesium, and so on. And the type of glass, just like I said, is about
windows, containers, and tableware, and so on. We get down to the end of the
comments, and here we have stuff for Weka. This is the ARFF format. The relation
has a name; you’ll see it printed in the interface when you look. The attributes are
defined; they are real-valued, numeric attributes. The “type” attribute is
nominal, and the different values of type are enumerated here in quotes.
That defines the relation and the attributes. Then we have an ‘@data’ line, and
following that in the ARFF format, are simply the instances, one after the other,
with the attribute values all on one line, ending with the class by default. This is the
class value for the first instance. I think there are 214 instances here. There’s the
last one. That’s the ARFF format. It is a very simple, textual file format.
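
Because the format is so simple, a small reader for it fits in a few lines. The sketch below is an illustration in Python and not Weka’s own loader: the function name `parse_arff`, the relation name `toy-glass`, and the two data rows are made up for the example, and the code assumes attribute names without spaces and values without embedded commas.

```python
def parse_arff(text):
    """Parse a small, simple ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # list of (name, type) pairs, in declaration order
    rows = []        # each instance is a list of string values
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        if in_data:
            # One instance per line, attribute values separated by commas.
            rows.append([v.strip().strip("'\"") for v in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name.strip("'\""), kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Made-up file in the style of glass.arff; the values are illustrative only.
SAMPLE = """\
% Comment lines start with a percentage sign
@relation toy-glass
@attribute RI numeric
@attribute Si numeric
@attribute Type {headlamps, tableware}
@data
1.52, 72.5, headlamps
1.51, 73.0, tableware
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

Each row comes back as a list of strings, with the class value last, just as it appears on the line in the file.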
Now we’ve confirmed our guesses about these numbers being percentages and
different elements. We can think about this some more. It’s important then, that
these numbers are reasonable. If they went negative, for example, that would
indicate some kind of corrupted value—you can’t have a negative percentage. We’re
expecting silicon to be the majority component; we’re expecting the refractive index
to be in this kind of range. It’s always a good idea when you get a dataset to just
click around in the Weka interface and make sure things look real. There are rather small
amounts of aluminum in glass, which I guess is not surprising, though I don’t know very much
about glass myself. We’re just checking for reasonableness here, which is a very good thing
to do. That’s it then.
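
The two checks mentioned here, no negative percentages and silicon as the largest component, can be written down as a short script. This is a minimal sketch in Python under stated assumptions: the function name `sanity_check` is hypothetical, and the two instances are made-up values, not rows from glass.arff.

```python
def sanity_check(instance, majority="Si"):
    """Return a list of problems found in one instance (a dict of percentages)."""
    problems = []
    for name, value in instance.items():
        if value < 0 or value > 100:
            problems.append(f"{name}={value} is not a valid percentage")
    # The attribute with the largest value should be the expected majority one.
    if instance and max(instance, key=instance.get) != majority:
        problems.append(f"{majority} is not the largest component")
    return problems


# Made-up instances for illustration; the second has a corrupted Mg value.
good = {"Na": 13.0, "Mg": 3.5, "Al": 1.2, "Si": 72.5}
bad = {"Na": 13.0, "Mg": -3.5, "Al": 1.2, "Si": 72.5}

print(sanity_check(good))
print(sanity_check(bad))
```

An empty list means the instance passed both checks; anything else is worth a closer look in the data file.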
In this lesson, we’ve looked at the classification problem. We’ve looked at the
nominal weather data and the numeric weather data. We’ve talked about nominal
versus numeric attributes, and we’ve talked about the ARFF file format. We’ve
looked at the glass.arff dataset, and I’ve talked about sanity checking of attributes,
and the importance of getting down and dirty with your data.