
Automatic Genre-Specific Text Classification

BACKGROUND

There has been recent interest in collecting and studying the syllabus genre. A small set of digital library course syllabi was manually collected and carefully analyzed, especially with respect to their reading lists, in order to define the digital library curriculum [Pomerantz, Oh, Yang, Fox, & Wildemuth, 2006]. In the MIT OpenCourseWare project, 1,400 MIT course syllabi were manually collected and made publicly available, which required a lot of work by students and faculty.

Some efforts have already been devoted to automating the syllabus collection process. A syllabus acquisition approach similar to ours is described in [Matsunaga, Yamada, Ito, & Hirokaw, 2003]. However, their work differs from ours in the way syllabi are identified. They crawled Web pages from Japanese universities and sifted through them using a thesaurus of words that commonly occur in syllabi. A decision tree was used to classify syllabus pages and entry pages (for example, a page containing links to all the syllabi of a particular course over time). Similarly, [Thompson, Smarr, Nguyen, & Manning, 2003] used a classification approach to classify education resources, especially syllabi, assignments, exams, and tutorials. Using the word features of each document, the authors were able to achieve very good performance (F1 score: 0.98). However, this result is based on their relatively clean data set, which includes only these four kinds of education resources and still took effort to collect. We, on the other hand, test and report our approach on Web search results for syllabi, so that it applies better to a variety of data domains.

In addition, our genre feature selection work is inspired by research on genre classification, which aims to classify data according to genre types by selecting features that distinguish one genre from another, e.g., identifying home pages in sets of web pages [Kennedy & Shepherd, 2005].

MAIN FOCUS

A text classification task usually can be accomplished by defining classes, selecting features, preparing a training corpus, and building a classifier. In order to quickly build an initial collection of CS syllabi, we obtained more than 8000 possible syllabus pages by programmatically searching using Google [Tungare et al., 2007]. After randomly examining the result set, we found it to contain many documents that were not truly syllabi; we refer to these as noise. To help with the task of properly identifying true syllabi, we defined true syllabi and false syllabi, and then selected features specific to the syllabus genre. We randomly sampled the collection to prepare a training corpus of size 1020. All 1020 files were in one of the following formats: HTML, PDF, PostScript, or Text. Finally, we applied Naïve Bayes, Support Vector Machines, and their variants to learn classifiers to produce the syllabus repository.
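
As a concrete illustration of this last step, the sketch below shows, in Python with scikit-learn, how Naïve Bayes and linear SVM classifiers could be trained on such a labeled corpus. The toy page texts, labels, and vectorizer settings are assumptions made for illustration; they are not the authors' actual pipeline.

# Minimal sketch (illustrative, not the authors' implementation) of learning
# syllabus classifiers from labeled page texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: plain text extracted from sampled pages, labeled
# 1 for a true syllabus and 0 for a false syllabus (noise).
texts = [
    "CS 1044 Introduction to Programming. MWF 10:10. Prerequisites: none. Grading: ...",
    "Dr. Smith's homepage, with links to the syllabi of the courses I teach.",
]
labels = [1, 0]

for name, classifier in [("naive_bayes", MultinomialNB()), ("linear_svm", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), classifier)  # word features
    model.fit(texts, labels)                              # learn from the sample set
    print(name, model.predict(["CS 2604 Data Structures syllabus: schedule, textbook, grading policy"]))

In practice, the two toy pages above would be replaced by the text extracted from the 1020 labeled sample files.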

Class Definition

A syllabus component is one of the following: course code, title, class time, class location, offering institute, teaching staff, course description, objectives, web site, prerequisite, textbook, grading policy, schedule, assignment, exam, or resource. A true syllabus is a page that describes a course by including most of these syllabus components, which can be located in the current page or be obtained by following outgoing links. A false syllabus (or noise) is a page that serves some other purpose (such as an instructor's homepage that merely links to the syllabi of his/her courses) rather than describing a course.

The two class labels were assigned to the 1020 samples by three team members, with unanimous agreement. A skewed class distribution was observed in the sample set, with 707 true syllabus pages and 313 false syllabus pages. We used this sample set as our training corpus.
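
Because a true syllabus is defined by the presence of most of these components, one simple way to make the definition operational is to count component cues in a page. The Python sketch below does this with keyword patterns; the cue list and the threshold of four components are assumptions for illustration, not the rubric the labelers applied.

# Illustrative sketch: count syllabus-component cues in a page's text.
# The cue patterns and the "most components" threshold are assumptions.
import re

COMPONENT_CUES = {
    "course_code": r"\b[A-Z]{2,4}\s?\d{3,4}\b",
    "prerequisite": r"\bprerequisites?\b",
    "textbook": r"\btext\s?books?\b|\brequired reading\b",
    "grading": r"\bgrading\b",
    "schedule": r"\bschedule\b|\bweek\s+\d+\b",
    "exam": r"\b(midterm|final)\s+exam\b|\bexams?\b",
    "teaching_staff": r"\binstructor\b|\boffice hours\b",
}

def component_count(text: str) -> int:
    """Number of distinct syllabus components whose cues appear in the text."""
    return sum(1 for pattern in COMPONENT_CUES.values()
               if re.search(pattern, text, flags=re.IGNORECASE))

def looks_like_true_syllabus(text: str, threshold: int = 4) -> bool:
    # "Most of these syllabus components" is approximated by a fixed threshold.
    return component_count(text) >= threshold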

Feature Selection

In a text classification task, a document is represented as a vector of features, usually drawn from a high-dimensional space consisting of the unique words that occur in the documents. A good feature selection method reduces the feature space to a size that most learning algorithms can handle, and it contributes to high classification accuracy. We applied three feature selection methods in our study: general feature selection, genre-specific feature selection, and a hybrid of the two.
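
The sketch below illustrates how such feature sets might be assembled in Python with scikit-learn: document-frequency pruning stands in for the general features (DF is discussed under item 1 below), a small hand-picked syllabus vocabulary stands in for the genre-specific features, and their union forms the hybrid. The DF cutoff and the term list are assumptions for illustration, not the settings used in the study.

# Illustrative sketch of DF-based pruning of general word features, plus a
# hybrid feature set that adds genre-specific terms; all settings are assumed.
from sklearn.feature_extraction.text import CountVectorizer

# Toy document collection; the real input would be the 1020-page training corpus.
documents = [
    "course syllabus schedule grading prerequisites textbook",
    "homepage publications research students and teaching",
    "syllabus exams assignments schedule office hours",
]

# General features: keep only words whose document frequency (DF) is at least 2.
general = CountVectorizer(min_df=2)
general.fit(documents)

# Genre-specific features: a hand-picked syllabus vocabulary (assumed here).
genre_terms = {"syllabus", "prerequisite", "textbook", "grading", "schedule", "exam"}

# Hybrid: the union of the DF-selected words and the genre vocabulary.
hybrid_vocabulary = sorted(set(general.get_feature_names_out()) | genre_terms)
hybrid = CountVectorizer(vocabulary=hybrid_vocabulary)

print("general features:", list(general.get_feature_names_out()))
print("hybrid matrix:", hybrid.fit_transform(documents).shape)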

1. General Features - In a study of feature selection methods for text categorization tasks [Yang & Pedersen, 1997], the authors concluded that Document Frequency (DF) is a good choice since


