HW 1
Homework Assignment 1
Due date: Tue, September 27 (in class)
1 Introduction
This assignment has two main parts and goals. In the first part you will handle a data engineering task: given
data for a text classification application, you will prepare the data so that it is suitable for use by the
weka system. In particular, as described below, you will write a program to process the data and produce
a dataset file in the weka format. In the second part you will run the weka system, in particular using several
variations of the decision tree algorithms discussed in class. Thus you will try out and test these algorithms
when applied to the text classification problem.
2 Data Pre-Processing
The data for this assignment includes text documents containing articles from two different sources. The
task is to identify the source or type of each document. We have two datasets of similar structure but
different contents.
The first dataset includes articles that were posted on newsgroups comp.sys.mac.hardware and
comp.sys.ibm.pc.hardware. Each document is a message posted to one of the groups. Our class labels Yes
and No capture membership in these groups (Yes corresponds to ibm). Similarly the second dataset includes
articles that were posted on newsgroups rec.sport.baseball and rec.sport.hockey. Our class labels are
again Yes and No (Yes corresponds to baseball).
Our goal is to learn a classifier that, when given a new article, assigns it to the correct group. (The data is
a subset of the “20-newsgroups” dataset that has been widely used in text learning studies.)
The Data
For each dataset, we have roughly 1200 examples and each example resides in a separate text file. The
datasets are in the directories /comp/150ML/files/hw1/ibmmac/ and /comp/150ML/files/hw1/sport/. To
help you in your task I have “cleaned” the examples (removed non-alphabet characters), so each example
comes in two versions, e.g., the file 23 and the file 23.clean. You can choose to use either of these.
The class label of each example is indicated in a separate “index” file. Since 1200 examples may be too many
to start with, I created a small index, index.Short, which includes only 100 examples, 50 of each class. The
index for the full dataset is in the file index.Full. Here are a few lines from the index file to illustrate the
syntax.
589|Yes|
590|Yes|
591|Yes|
592|No|
593|No|
589, 590, etc. are the file names of examples, and Yes and No are the labels. I suggest that you start your work
with the Short index (to save run/wait time). Once things are reasonably stable you can and should work
with all the examples.
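To make the index-file syntax concrete, here is a minimal sketch of a parser for it (the function name is an illustrative choice; your own program can organize this however you like):

```python
def read_index(index_path):
    """Parse a '|'-delimited index file into (filename, label) pairs."""
    examples = []
    with open(index_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line looks like "589|Yes|": file name, label, trailing separator.
            name, label = line.split("|")[:2]
            examples.append((name, label))
    return examples
```

Each returned file name can then be joined with the dataset directory to locate the corresponding example file.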
Your Task
Write a program that converts the data into an example file appropriate for weka. Please note - your program
should not depend in any way on the domain itself. That is, it should work without any modification for
any other text classification task whose examples are formatted in the same manner (each example in a text
file and the list of examples and their labels in an index file). Hopefully, if your program works for the two
different datasets it should be sufficiently general to handle other domains as well, but please make sure not
to “cheat” by using domain knowledge in the design of your program.
I have not given you any training in converting text for use in machine learning but I would still like you
to take a stab at doing this well. The basic facts are that the data you have represents examples as plain
text and weka uses an attribute-value representation. This means that you will need to decide what features
to use for the dataset and base these on the text. Since examples are given by plain text, the features can
depend on occurrence of words in example documents. E.g. “does word . . . appear in the example?” or
“how many times does word . . . appear in the example?”. Or you can choose more complex features. It is
up to you to decide what kind of features to choose. It seems that any reasonable scheme (e.g. one based
on words) can produce a very large number of such features, possibly too many for our learning algorithms.
Thus we also want a method to pick a small number of “best” features.
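As one illustration of such a method (this is only a sketch, not a required scheme): rank candidate word features by document frequency and keep the top N. Information gain or any other reasonable score would work just as well.

```python
from collections import Counter

def top_n_words(documents, n):
    """Rank candidate word features by document frequency (the number of
    documents a word appears in) and keep the n highest-ranked words.
    Document frequency is one simple ranking; information gain or other
    scores are equally reasonable choices."""
    df = Counter()
    for text in documents:
        df.update(set(text.lower().split()))  # count each word once per document
    return [word for word, _ in df.most_common(n)]
```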
It should be clear that different schemes will yield different levels of performance from the learning system,
but there is no “right answer” here and you don’t have to optimize. Spend some time thinking about a
reasonable scheme, and choose one; after that just stick with your scheme and run the experiments. If you
like you can experiment with several schemes but this is not required.
To summarize, you should decide on the type of features you use and how to rank features based on the data.
Then your program can pick the N best such features. Your program will read the index file, all example
text-files mentioned in the index file and produce a weka dataset from these. Your program should be able
to take N as an argument and produce different datasets. In particular, you should use your program to
produce 4 or 5 weka datasets with 10, 50, 100, 500, and (if things are going well) 1000 features for use in
the next part. This should be done for each of the two domains (original datasets).
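The output step can be sketched as follows, assuming binary word-presence features (one possible choice, not the required one). Note this sketch assumes the chosen words are already legal ARFF attribute names; words containing special characters would need quoting or sanitizing.

```python
def write_arff(path, relation, words, examples):
    """Write a weka ARFF dataset using binary word-presence features.
    examples: list of (text, label) pairs, with labels "Yes" or "No"."""
    with open(path, "w") as out:
        out.write("@relation %s\n\n" % relation)
        for w in words:
            out.write("@attribute %s {0,1}\n" % w)
        out.write("@attribute class {Yes,No}\n\n@data\n")
        for text, label in examples:
            present = set(text.lower().split())
            row = ["1" if w in present else "0" for w in words]
            out.write(",".join(row + [label]) + "\n")
```

Varying the words argument (the N best features) then yields the different datasets.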
You can write the program in any language. However, please make sure that your code runs on our sun
systems.
3 Running Weka
1. Use the weka system to run classifiers ZeroR (majority vote), OneR (one level decision tree), and j48.J48
(full decision tree algorithm; gain ratio heuristic) on the data. We will rely on weka’s cross-validation
result as a measure of classification accuracy.
Report the results you get with each method. How does the decision tree algorithm scale (run
time and accuracy) as the number of features is increased?
2. For one fixed number of features pick random subsets of examples of increasing size (say 100, 300, 600,
800, 1200) and draw a learning curve for this setting. (NB In the original data, all examples of the Yes
class appear before examples of the No class so we must pick a random subset appropriately.)
3. Experiment with the variations of the decision tree learning algorithm. What is the effect of each
of the following? (1) Performing more/less strict pruning. (2) Running with no pruning (-U). (3) Comparing
default pruning to REP. (4) Running classifier Id3 (decision tree algorithm; info gain; no pruning; requires
nominal features). How does the number of features affect the behavior in these scenarios?
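For item 2 above, the random-subset step can be sketched as follows; shuffling once and taking prefixes of increasing size also gives nested subsets, which makes the learning curve smoother. The function name and fixed seed are illustrative choices, not part of the assignment.

```python
import random

def random_subsets(examples, sizes, seed=0):
    """Shuffle the example list once so Yes and No examples are interleaved,
    then take prefixes of increasing size as nested random subsets for the
    learning curve."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return [shuffled[:s] for s in sorted(sizes)]
```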
2. Source code of the program(s) that perform the data conversion. If it is not obvious, please give instructions
on how to run your code. Please also make sure that your code is well documented and written in
good style.