
Automatic Genre-Specific Text Classification

BACKGROUND

There has been recent interest in collecting and studying the syllabus genre. A small set of digital library course syllabi was manually collected and carefully analyzed, especially with respect to their reading lists, in order to define the digital library curriculum [Pomerantz, Oh, Yang, Fox, & Wildemuth, 2006]. In the MIT OpenCourseWare project, 1,400 MIT course syllabi were manually collected and made publicly available, which required a lot of work by students and faculty.

Some efforts have already been devoted to automating the syllabus collection process. A syllabus acquisition approach similar to ours is described in [Matsunaga, Yamada, Ito, & Hirokaw, 2003]. However, their work differs from ours in the way syllabi are identified. They crawled Web pages from Japanese universities and sifted through them using a thesaurus of words that commonly occur in syllabi. A decision tree was used to classify syllabus pages and entry pages (for example, a page containing links to all the syllabi of a particular course over time). Similarly, [Thompson, Smarr, Nguyen, & Manning, 2003] used a classification approach to classify education resources, especially syllabi, assignments, exams, and tutorials. Using the word features of each document, the authors were able to achieve very good performance (F1 score: 0.98). However, this result is based on their relatively clean data set, which includes only these four kinds of education resources and still took effort to collect. We, on the other hand, test and report our approach on Web search results for syllabi, so that it applies better to a variety of data domains.

In addition, our genre feature selection work is inspired by research on genre classification, which aims to classify data according to genre types by selecting features that distinguish one genre from another, e.g., identifying home pages in sets of web pages [Kennedy & Shepherd, 2005].

MAIN FOCUS

A text classification task usually can be accomplished by defining classes, selecting features, preparing a training corpus, and building a classifier. In order to quickly build an initial collection of CS syllabi, we obtained more than 8000 possible syllabus pages by programmatically searching using Google [Tungare et al., 2007]. After randomly examining the result set, we found it to contain many documents that were not truly syllabi; we refer to these as noise. To help with the task of properly identifying true syllabi, we defined true syllabi and false syllabi, and then selected features specific to the syllabus genre. We randomly sampled the collection to prepare a training corpus of size 1020. All 1020 files were in one of the following formats: HTML, PDF, PostScript, or Text. Finally, we applied Naïve Bayes, Support Vector Machines, and their variants to learn classifiers to produce the syllabus repository.
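
As a concrete illustration of this last step, the sketch below shows, in Python with scikit-learn, how Naïve Bayes and linear SVM classifiers could be trained on such a labeled corpus. The toy page texts, labels, and vectorizer settings are assumptions made for illustration; they are not the authors' actual pipeline.

# Minimal sketch (illustrative, not the authors' implementation) of learning
# syllabus classifiers from labeled page texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: plain text extracted from sampled pages, labeled
# 1 for a true syllabus and 0 for a false syllabus (noise).
texts = [
    "CS 1044 Introduction to Programming. MWF 10:10. Prerequisites: none. Grading: ...",
    "Dr. Smith's homepage, with links to the syllabi of the courses I teach.",
]
labels = [1, 0]

for name, classifier in [("naive_bayes", MultinomialNB()), ("linear_svm", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), classifier)  # word features
    model.fit(texts, labels)                              # learn from the sample set
    print(name, model.predict(["CS 2604 Data Structures syllabus: schedule, textbook, grading policy"]))

In practice, the two toy pages above would be replaced by the text extracted from the 1020 labeled sample files.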

Class Definition

A syllabus component is one of the following: course code, title, class time, class location, offering institute, teaching staff, course description, objectives, web site, prerequisite, textbook, grading policy, schedule, assignment, exam, or resource. A true syllabus is a page that describes a course by including most of these syllabus components, which can be located in the current page or be obtained by following outgoing links. A false syllabus (or noise) is a page that serves some other purpose (such as an instructor's homepage that merely links to the syllabi of his/her courses) rather than describing a course.

The two class labels were assigned to the 1020 samples by three team members, with unanimous agreement. A skewed class distribution was observed in the sample set, with 707 true syllabus pages and 313 false syllabus pages. We used this sample set as our training corpus.
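
Because a true syllabus is defined by the presence of most of these components, one simple way to make the definition operational is to count component cues in a page. The Python sketch below does this with keyword patterns; the cue list and the threshold of four components are assumptions for illustration, not the rubric the labelers applied.

# Illustrative sketch: count syllabus-component cues in a page's text.
# The cue patterns and the "most components" threshold are assumptions.
import re

COMPONENT_CUES = {
    "course_code": r"\b[A-Z]{2,4}\s?\d{3,4}\b",
    "prerequisite": r"\bprerequisites?\b",
    "textbook": r"\btext\s?books?\b|\brequired reading\b",
    "grading": r"\bgrading\b",
    "schedule": r"\bschedule\b|\bweek\s+\d+\b",
    "exam": r"\b(midterm|final)\s+exam\b|\bexams?\b",
    "teaching_staff": r"\binstructor\b|\boffice hours\b",
}

def component_count(text: str) -> int:
    """Number of distinct syllabus components whose cues appear in the text."""
    return sum(1 for pattern in COMPONENT_CUES.values()
               if re.search(pattern, text, flags=re.IGNORECASE))

def looks_like_true_syllabus(text: str, threshold: int = 4) -> bool:
    # "Most of these syllabus components" is approximated by a fixed threshold.
    return component_count(text) >= threshold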

Feature Selection

In a text classification task, a document is represented as a vector of features, usually drawn from a high-dimensional space consisting of the unique words that occur in the documents. A good feature selection method reduces the feature space to a size that most learning algorithms can handle, and it contributes to high classification accuracy. We applied three feature selection methods in our study: general feature selection, genre-specific feature selection, and a hybrid of the two.
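
The sketch below illustrates how such feature sets might be assembled in Python with scikit-learn: document-frequency pruning stands in for the general features (DF is discussed under item 1 below), a small hand-picked syllabus vocabulary stands in for the genre-specific features, and their union forms the hybrid. The DF cutoff and the term list are assumptions for illustration, not the settings used in the study.

# Illustrative sketch of DF-based pruning of general word features, plus a
# hybrid feature set that adds genre-specific terms; all settings are assumed.
from sklearn.feature_extraction.text import CountVectorizer

# Toy document collection; the real input would be the 1020-page training corpus.
documents = [
    "course syllabus schedule grading prerequisites textbook",
    "homepage publications research students and teaching",
    "syllabus exams assignments schedule office hours",
]

# General features: keep only words whose document frequency (DF) is at least 2.
general = CountVectorizer(min_df=2)
general.fit(documents)

# Genre-specific features: a hand-picked syllabus vocabulary (assumed here).
genre_terms = {"syllabus", "prerequisite", "textbook", "grading", "schedule", "exam"}

# Hybrid: the union of the DF-selected words and the genre vocabulary.
hybrid_vocabulary = sorted(set(general.get_feature_names_out()) | genre_terms)
hybrid = CountVectorizer(vocabulary=hybrid_vocabulary)

print("general features:", list(general.get_feature_names_out()))
print("hybrid matrix:", hybrid.fit_transform(documents).shape)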

1. General Features - In a study of feature selection methods for text categorization tasks [Yang & Pedersen, 1997], the authors concluded that Document Frequency (DF) is a good choice since


