
Article in International Journal of Computer Applications, May 2015. DOI: 10.5120/20864-3578


International Journal of Computer Applications (0975 – 8887)
Volume 118 – No.20, May 2015

Machine Learning Approach to Document Classification using Concept based Features

C. Saranya Jothi
Department of Computer Science and Engineering
SSN College of Engineering
Chennai, India

D. Thenmozhi
Department of Computer Science and Engineering
SSN College of Engineering
Chennai, India

ABSTRACT
Text mining is the process of deriving high-quality information from text; text processing involves searching and manipulating text in electronic form. A number of approaches have been developed to represent and classify text documents, and most of them try to attain good classification performance while representing a document only by its words. We propose a concept based methodology that uses concepts instead of terms: it captures the meaning of the text and thereby reduces the number of features. The Support Vector Machine (SVM) algorithm is applied for document classification, and classification performance with the original features is compared against performance with the concept based features. The proposed methodology improves document classification accuracy.

Keywords
Text classification, Support Vector Machine, Feature Selection.

1. INTRODUCTION
Text mining, also referred to as text data mining, involves structuring the input text and deriving high-quality information from it, where high quality refers to the integrity of the extracted words. Text mining tasks include text classification and clustering.

In document classification the bag-of-words model is commonly used. Each document is represented as a vector of numerical values: each word is a feature, and each value indicates the frequency of occurrence of that word in the document. Generally, the document space is sparse, i.e., most of the values in the vector are zero. This high dimensionality of the feature space is a major problem in text categorization (TC).

Text categorization is the task of assigning predefined categories to text documents. It can provide conceptual views of document collections and has important applications in the real world. Feature selection for TC helps to reduce the dimensionality of the feature space and to improve classification effectiveness; it is the process of selecting a subset of relevant features of the documents. In this paper, we use a concept based feature selection method that represents the meaning of the text and reduces the high dimensional feature space to a lower dimensional one. Documents are then classified using an SVM classifier, and classification performance using the original features is compared with that using the reduced features.

The rest of the paper is organized as follows. Section 2 discusses related work on various classification techniques. Section 3 describes the proposed system. Section 4 describes the implementation, experimental results are discussed in Section 5, and Section 6 concludes the paper.

2. RELATED WORKS
Work by other researchers related to feature extraction and different classification techniques is presented here; each technique attempts to demonstrate the effectiveness of its approach.

2.1 Feature Extraction
Lin et al. [3] introduced the SMTP similarity measure, which measures the similarity between two sets of documents with respect to a feature. The effectiveness of the measure is evaluated on several real-world data sets for text classification and clustering problems. For classification they used a k-NN classifier with different similarity measures, and they proved that the proposed measure between two sets of documents is symmetric. The results show that its performance is better than that of other measures; however, the feature space is sparse and high dimensional, so the time taken for processing is large.

Peng et al. [4] proposed a Chinese text processing algorithm based on concept similarity between words or sentences. They applied the algorithm to text classification on a web news dataset, and its results were better than those of a k-NN method based on the vector space model. Basu and Murthy [1] proposed a supervised feature selection approach that develops a similarity between a term and a class: each term is scored on its similarity with all the classes, and the terms are then ranked accordingly. Results are presented on several TREC and Reuters data sets using a k-NN classifier.

In these feature selection approaches the similarity between a term and a class is used, and the dimensionality of the feature vector remains very large. The work presented here is concept based feature extraction: it represents the meaning of the texts and maps the high dimensional space to a lower dimensional one.

2.2 Classification
Gayathri K. and Marimuthu A. [2] used k-NN and SVM classifiers on the Reuters-21578 dataset to classify documents based on their contents. Both k-NN and SVM classifiers have been widely implemented in many real world applications; they showed that even a well-performing k-NN classifier may be less accurate than an SVM classifier. Wang et al. have applied an optimal SVM algorithm for text classification, to
determine to which of the predefined categories a given document belongs [5]. SVM is a powerful supervised learning model. A large number of techniques have been developed for text classification, including Naive Bayes, nearest neighbor, neural networks, regression, rule induction, and Support Vector Machines; among them, SVM has been recognized as one of the most effective text classification methods.

Among the compared classifiers, SVM is identified as the best: the time taken for each processing step is lower than that of the other classifiers.

3. PROPOSED FRAMEWORK
A concept based feature selection method for text classification is presented here to improve the performance of the classifier. The system architecture is described in Figure 1. The proposed model aims to obtain highly consistent results by applying a classification algorithm.

Figure 1: System Architecture

In this architecture, the input is text in the form of a collection of webpages. Here the text data is structured, and the web pages are collected from the World Wide Knowledge Base. In the input data, each line represents a document. Features are extracted from the documents, and the concept based methodology is used to reduce the dimensionality of the feature space spanned by these extracted features. The SVM algorithm is then applied for document classification. Finally, classification accuracy with the original features is compared against that with the concept based features.

3.1 Feature Extraction
Feature selection is usually employed to reduce the size of the feature space. After removing the stop words, unique features are extracted; the extracted features are used to categorize a document. Each document is then represented as a vector in which each attribute holds the value of the corresponding feature, here its term frequency. To reduce the dimensionality of the feature space, the concept based method is proposed.

Example: After removing all stop words, unique features are extracted. Assume that we use word counts as feature values, and consider two documents:

d1:
Health is good for children.
Youngster will play with kid.
Kid is well.

d2:
Apple is good for health.
Wellness for good.

We consider 9 features: Apple, children, good, health, kid, play, well, wellness, youngster.

Feature Vector:

      Apple  children  good  health  kid  play  well  wellness  youngster
d1:   0      1         1     1       2    1     1     0         1
d2:   1      0         2     1       0    0     0     1         0

d1 = (0,1,1,1,2,1,1,0,1)
d2 = (1,0,2,1,0,0,0,1,0)

This is the original feature vector; note that some of the features in the documents are semantically the same.

3.2 Concept Based Feature Extraction
This step represents the meaning of texts in a high-dimensional space of concepts derived from WordNet [8]. By using the concepts, the data in the high-dimensional space is transformed to a lower-dimensional one. An example of concept based feature extraction is shown in Figure 2: here kid and youngster are mapped to the concept children, and well is mapped to the concept good.

Figure 2: Concept Based Feature Extraction example

Original Feature Vector:

      Apple  children  good  health  kid  play  well  wellness  youngster
d1:   0      1         1     1       2    1     1     0         1

Reduced Feature Vector:

      Apple  children  good  health  play  wellness
d1:   0      4         2     1       1     0

d1 = (0,4,2,1,1,0)

Here the dimensionality of the vector space is reduced from 9 columns to 6 columns by using the concepts.
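The bag-of-words construction in the example above can be sketched in a few lines. The stop-word list and the simple lowercase tokenisation below are illustrative assumptions, not part of the paper:

```python
from collections import Counter

# Term-frequency (bag-of-words) vectors for the Section 3.1 example.
# FEATURES follows the paper's feature list; STOP_WORDS is an assumption.
FEATURES = ["apple", "children", "good", "health", "kid",
            "play", "well", "wellness", "youngster"]
STOP_WORDS = {"is", "for", "will", "with"}

def bow_vector(text, features=FEATURES):
    """Count how often each feature word occurs in the text."""
    tokens = [t.strip(".").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [counts[f] for f in features]

d1 = "Health is good for children. Youngster will play with kid. Kid is well."
d2 = "Apple is good for health. Wellness for good."

print(bow_vector(d1))  # [0, 1, 1, 1, 2, 1, 1, 0, 1]
print(bow_vector(d2))  # [1, 0, 2, 1, 0, 0, 0, 1, 0]
```

Most entries are zero even in this toy example, which is the sparsity problem the concept based reduction addresses.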

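The concept based reduction can likewise be sketched. The word-to-concept mapping below is read off the example (kid and youngster merge into children, well into good); in the actual system such mappings are derived from WordNet:

```python
# Concept based reduction from Section 3.2: words sharing a concept are
# merged into one feature. CONCEPT_OF is inferred from the paper's
# example, not taken from WordNet itself.
CONCEPT_OF = {"kid": "children", "youngster": "children", "well": "good"}

ORIGINAL = ["apple", "children", "good", "health", "kid",
            "play", "well", "wellness", "youngster"]
REDUCED = ["apple", "children", "good", "health", "play", "wellness"]

def reduce_vector(vec):
    """Sum the counts of each original feature into its concept."""
    totals = {f: 0 for f in REDUCED}
    for feature, count in zip(ORIGINAL, vec):
        totals[CONCEPT_OF.get(feature, feature)] += count
    return [totals[f] for f in REDUCED]

d1 = [0, 1, 1, 1, 2, 1, 1, 0, 1]  # 9-dimensional original vector
print(reduce_vector(d1))          # [0, 4, 2, 1, 1, 0], 6 dimensions
```

The reduction never loses counts; it only pools semantically equivalent words, so the total term frequency of each document is preserved.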
4. IMPLEMENTATION
The publicly available WebKB dataset, obtained from [6], is used to evaluate our document classification methodology. The dataset description is given in Table 1. The documents in the WebKB data set are webpages collected by the World Wide Knowledge Base project; they are manually classified into four classes, and the number of features involved is 7786.

Table 1: Data set

Class     Training documents   Testing documents   Subtotal of documents
Project   336                  168                 504
Course    620                  310                 930
Faculty   750                  374                 1124
Student   1097                 544                 1641
Total     2803                 1396                4199

From the dataset, features are extracted from each document. Each unique word in a document is a feature, and the feature value indicates the number of occurrences of that word in the document. Each document is represented as a vector whose attributes hold the values of the corresponding features; together these vectors span the feature space.

4.1 Text classification with SVM
Support Vector Machine is a powerful supervised learning model that analyses data for classification. It is a maximum margin classifier: the large margin gives better generalization ability and less over-fitting. Because it controls complexity and over-fitting, it works well on a wide range of practical problems, and it is fast in training and testing, which makes it a feasible option for large data sets. SVMs natively perform only binary classification, so multi-class problems are handled by combining binary classifiers; the multi-class setting with its class labels is shown in Figure 3.

Figure 3: Multi-class classification using SVM

4.1.1 Data Pre-processing
The publicly available WebKB datasets are downloaded and converted into .ARFF (Attribute-Relation File Format) files [7]. Each file is loaded into the WEKA data mining tool, after which the correctness of the data is verified.

4.1.2 Train and Test a Classifier with Cross Validation
Cross validation is a commonly used method to tune a classifier, and it avoids overlapping test sets. The data is split into k subsets of equal size and the number of folds is set; in each fold, 10% of the data is used for testing and the remainder for training. The classifier is trained using the SVM algorithm, and after the parameters are found, the same classifier is applied to the test set.

5. EXPERIMENTAL RESULTS
The comparative performance, in terms of classification accuracy, of the original data sets and the reduced data sets is tabulated in Table 2.

Table 2: Result Table

                        Accuracy by Original   Accuracy by Reduced
                        data using SVM         data using SVM
Training and Test set   0.611                  0.644
Cross-Validation        0.681                  0.767

Classification accuracy is reported through the True Positive (TP) rate (instances correctly classified as a given class) and the False Positive (FP) rate (instances falsely classified as a given class). A confusion matrix contains information about the actual and predicted classifications produced by a classification system, and the accuracy reported here corresponds to the true positive rate. The result table shows that the accuracy on the reduced data using SVM is better than the accuracy on the original data.
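As a rough illustration of how TP and FP rates are obtained from a confusion matrix (the class names reuse the WebKB classes, but the counts below are made up for illustration, not results from the paper):

```python
# TP and FP rates for one class from a confusion matrix.
# Keys are (actual, predicted) pairs; the counts are illustrative.
matrix = {
    ("course", "course"): 290, ("course", "student"): 20,
    ("student", "course"): 30, ("student", "student"): 510,
}

def tp_rate(cls):
    """Fraction of actual `cls` instances predicted as `cls`."""
    actual = sum(n for (a, _), n in matrix.items() if a == cls)
    return matrix[(cls, cls)] / actual

def fp_rate(cls):
    """Fraction of non-`cls` instances wrongly predicted as `cls`."""
    wrong = sum(n for (a, p), n in matrix.items() if p == cls and a != cls)
    other = sum(n for (a, _), n in matrix.items() if a != cls)
    return wrong / other

print(round(tp_rate("course"), 3))  # 0.935
print(round(fp_rate("course"), 3))  # 0.056
```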

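The cross-validation protocol of Section 4.1.2, with non-overlapping test folds of roughly 10% each, can be sketched with plain index arithmetic (a minimal sketch; WEKA performs this split internally):

```python
def k_fold_indices(n_docs, k=10):
    """Yield (train, test) index lists; each test fold holds about 1/k of
    the data, and the test folds never overlap."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Every document appears in exactly one test fold.
splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```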

Figure 4: Analysis (accuracies of the original and reduced features on the test set and under cross validation, as in Table 2)

As the analysis in Figure 4 shows, cross validation on both the original data and the reduced data performs considerably better than the train/test split. The time taken for the reduced data is also less than that for the original data.

6. CONCLUSION
In the proposed work, concept based features are extracted from the original features by using WordNet. The high dimensional feature space is reduced to a lower dimensional one, which improves the classification accuracy. The Support Vector Machine (SVM) algorithm is applied for document classification, and the performance with the original features is compared against that with the reduced features. The results show that the performance obtained with the reduced data is better than that with the original data. As further enhancements, the SMTP similarity measure [3] and different classification approaches may be tried to improve classification efficiency.

7. REFERENCES
[1] Basu T. and Murthy C. (2012). Effective text classification by a supervised feature selection approach. In Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, pages 918–925. IEEE.

[2] Gayathri K. and Marimuthu A. (2013). Text document pre-processing with the KNN for classification using the SVM. In Intelligent Systems and Control (ISCO), 2013 7th International Conference on, pages 453–457. IEEE.

[3] Lin Y.S., Jiang J.Y., and Lee S.J. (2013). A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering.

[4] Peng J., Yang D.Q., Tang S.W., Gao J., Zhang P.Y., and Fu Y. (2007). A concept similarity based text classification algorithm. In Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery, Volume 01, pages 535–539. IEEE Computer Society.

[5] Wang Z.Q., Sun X., Zhang D.X., and Li X. (2006). An optimal SVM based text classification algorithm. In 2005 International Conference on Machine Learning and Cybernetics, pages 1378–1381.

[6] Datasets for single-label text categorization: http://web.ist.utl.pt/~acardoso/datasets/

[7] WEKA classpath: http://weka.wikispaces.com/classpath

[8] WordNet 2.1: http://www.brothersoft.com/wordnet-236667.html

IJCATM : www.ijcaonline.org
