HW 1
Homework Assignment 1
Due date: Tue, September 27 (in class)
1 Introduction
This assignment has two main parts and goals. In the first part you will handle a data engineering task: given
data for a text classification application, you will prepare the data so that it is suitable for use by the
weka system. In particular, as described below, you will write a program to process the data and produce
a dataset file in the weka format. In the second part you will run the weka system, in particular using several
variations of the decision tree algorithms discussed in class. Thus you will try out and test these algorithms
when applied to the text classification problem.
2 Data Pre-Processing
The data for this assignment includes text documents containing articles from two different sources. The
task is to identify the source or type of each document. We have two datasets of similar structure but
different contents.
The first dataset includes articles that were posted on newsgroups comp.sys.mac.hardware and
comp.sys.ibm.pc.hardware. Each document is a message posted to one of the groups. Our class labels Yes
and No capture membership in these groups (Yes corresponds to ibm). Similarly the second dataset includes
articles that were posted on newsgroups rec.sport.baseball and rec.sport.hockey. Our class labels are
again Yes and No (Yes corresponds to baseball).
Our goal is to learn a classifier that, when given a new article, assigns it to the correct group. (The data is
a subset of the “20-newsgroups” dataset that has been widely used in text learning studies.)
The Data
For each dataset, we have roughly 1200 examples and each example resides in a separate text file. The
datasets are in the directories /comp/150ML/files/hw1/ibmmac/ and /comp/150ML/files/hw1/sport/. To
help you in your task I have “cleaned” the examples (removed non-alphabet characters), so each example
comes in two versions, e.g., the file 23 and the file 23.clean. You can choose to use either of these.
The class label of each example is indicated in a separate “index” file. Since 1200 examples may be too many
to start with, I created a small index, index.Short, which includes only 100 examples, 50 of each class. The
index for the full dataset is in the file index.Full. Here are a few lines from the index file to illustrate the
syntax.
589|Yes|
590|Yes|
591|Yes|
592|No|
593|No|
589, 590, etc. are the file names of examples, and Yes and No are the labels. I suggest that you start your work
with the Short index (to save run/wait time). Once things are reasonably stable you can and should work
with all the examples.
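To make the index-file syntax concrete, here is a minimal sketch of a parser for it (the function name is an illustrative choice; your own program can organize this however you like):

```python
def read_index(index_path):
    """Parse a '|'-delimited index file into (filename, label) pairs."""
    examples = []
    with open(index_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line looks like "589|Yes|": file name, label, trailing separator.
            name, label = line.split("|")[:2]
            examples.append((name, label))
    return examples
```

Each returned file name can then be joined with the dataset directory to locate the corresponding example file.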
Your Task
Write a program that converts the data into an example file appropriate for weka. Please note - your program
should not depend in any way on the domain itself. That is, it should work without any modification for
any other text classification task whose examples are formatted in the same manner (each example in a text
file and the list of examples and their labels in an index file). Hopefully, if your program works for the two
different datasets it should be sufficiently general to handle other domains as well, but please make sure not
to “cheat” by using domain knowledge in the design of your program.
I have not given you any training in converting text for use in machine learning but I would still like you
to take a stab at doing this well. The basic facts are that the data you have represents examples as plain
text and weka uses an attribute-value representation. This means that you will need to decide what features
to use for the dataset and base these on the text. Since examples are given by plain text, the features can
depend on occurrence of words in example documents. E.g. “does word . . . appear in the example?” or
“how many times does word . . . appear in the example?”. Or you can choose more complex features. It is
up to you to decide what kind of features to choose. It seems that any reasonable scheme (e.g. one based
on words) can produce a very large number of such features, possibly too many for our learning algorithms.
Thus we also want a method to pick a small number of “best” features.
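As one illustration of such a method (this is only a sketch, not a required scheme): rank candidate word features by document frequency and keep the top N. Information gain or any other reasonable score would work just as well.

```python
from collections import Counter

def top_n_words(documents, n):
    """Rank candidate word features by document frequency (the number of
    documents a word appears in) and keep the n highest-ranked words.
    Document frequency is one simple ranking; information gain or other
    scores are equally reasonable choices."""
    df = Counter()
    for text in documents:
        df.update(set(text.lower().split()))  # count each word once per document
    return [word for word, _ in df.most_common(n)]
```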
It should be clear that different schemes will yield different levels of performance from the learning system,
but there is no “right answer” here and you don’t have to optimize. Spend some time thinking about a
reasonable scheme, and choose one; after that just stick with your scheme and run the experiments. If you
like you can experiment with several schemes but this is not required.
To summarize, you should decide on the type of features you use and how to rank features based on the data.
Then your program can pick the N best such features. Your program will read the index file, all example
text-files mentioned in the index file and produce a weka dataset from these. Your program should be able
to take N as an argument and produce different datasets. In particular, you should use your program to
produce 4 or 5 weka datasets with 10, 50, 100, 500, and (if things are going well) 1000 features for use in
the next part. This should be done for each of the two domains (original datasets).
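The output step can be sketched as follows, assuming binary word-presence features (one possible choice, not the required one). Note this sketch assumes the chosen words are already legal ARFF attribute names; words containing special characters would need quoting or sanitizing.

```python
def write_arff(path, relation, words, examples):
    """Write a weka ARFF dataset using binary word-presence features.
    examples: list of (text, label) pairs, with labels "Yes" or "No"."""
    with open(path, "w") as out:
        out.write("@relation %s\n\n" % relation)
        for w in words:
            out.write("@attribute %s {0,1}\n" % w)
        out.write("@attribute class {Yes,No}\n\n@data\n")
        for text, label in examples:
            present = set(text.lower().split())
            row = ["1" if w in present else "0" for w in words]
            out.write(",".join(row + [label]) + "\n")
```

Varying the words argument (the N best features) then yields the different datasets.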
You can write the program in any language. However, please make sure that your code runs on our sun
systems.
3 Running Weka
1. Use the weka system to run classifiers ZeroR (majority vote), OneR (one level decision tree), and j48.J48
(full decision tree algorithm; gain ratio heuristic) on the data. We will rely on weka’s cross-validation
result as a measure of classification accuracy.
Report the results you get with each method. How does the decision tree algorithm scale (run
time and accuracy) as the number of features is increased?
2. For one fixed number of features pick random subsets of examples of increasing size (say 100, 300, 600,
800, 1200) and draw a learning curve for this setting. (NB In the original data, all examples of the Yes
class appear before examples of the No class so we must pick a random subset appropriately.)
3. Experiment with the variations of the decision tree learning algorithm. What is the effect of each
of the following? (1) Performing more/less strict pruning. (2) Running with no pruning (-U). (3) Comparing
default pruning to REP. (4) Running classifier Id3 (decision tree algorithm; info gain; no pruning; requires
nominal features). How does the number of features affect the behavior in these scenarios?
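For item 2 above, the random-subset step can be sketched as follows; shuffling once and taking prefixes of increasing size also gives nested subsets, which makes the learning curve smoother. The function name and fixed seed are illustrative choices, not part of the assignment.

```python
import random

def random_subsets(examples, sizes, seed=0):
    """Shuffle the example list once so Yes and No examples are interleaved,
    then take prefixes of increasing size as nested random subsets for the
    learning curve."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return [shuffled[:s] for s in sorted(sizes)]
```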
2. Source code of the program(s) that perform the data conversion. If it is not obvious, please give instructions
on how to run your code. Please also make sure that your code is well documented and written in
good style.