Machine Learning Approach to Document Classification
D. Thenmozhi, Sri Sivasubramaniya Nadar College of Engineering
Text categorization is the task of assigning predefined categories to text documents. It can provide theoretical views of document collections and has important applications in the real world. Feature selection for text categorization helps reduce the dimensionality of the feature space and improve classification effectiveness. Feature selection is the process of selecting a subset of relevant features in the document. In this paper, we use a concept-based feature selection method that reduces the high-dimensional feature space to a lower-dimensional one and represents the meaning of the text in the document, which improves classification accuracy. Documents are classified using an SVM classifier, and the performance with the original features is then compared against that with the reduced features.

In feature selection, the similarity between a term and a class is used, and the dimensionality of the feature vector is very large. The work presented here is concept-based feature extraction: it maps the meaning of the texts from a high-dimensional space to a lower-dimensional one.

2.2 Classification
Gayathri K. and Marimuthu A. used k-NN and SVM classifiers on the Reuters-21578 dataset to classify documents based on their contents [2]. Both the k-NN and SVM classifiers have been widely implemented in many real-world applications; their results show that even a well-performing k-NN classifier may be less accurate than the SVM classifier. Wang et al. applied an optimal SVM algorithm for text classification [5].
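The k-NN side of such a comparison can be sketched in a few lines of plain Python. This is only a toy illustration over hypothetical count vectors and labels, not the Reuters-21578 setup used in [2]:

```python
import math

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k=1):
    """Predict the label of `query` by majority vote over its k nearest
    (most cosine-similar) training vectors."""
    ranked = sorted(train, key=lambda vl: cosine(query, vl[0]), reverse=True)
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)

# Hypothetical training data: count vectors over a tiny vocabulary.
train = [([3, 0, 1], "sports"), ([2, 1, 0], "sports"), ([0, 4, 2], "health")]

print(knn_predict([1, 0, 0], train))  # sports
print(knn_predict([0, 2, 1], train))  # health
```

Cosine similarity is a common choice for count vectors because it ignores document length; an SVM, by contrast, learns a separating hyperplane over the same vectors rather than comparing them pairwise.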
International Journal of Computer Applications (0975 – 8887)
Volume 118 – No.20, May 2015
After removing all the stop words, the unique features are extracted: apple, children, good, health, kid, play, well, wellness, youngster. Assume that we use word counts as feature values. The two documents then become:

d1 = (0, 1, 1, 0, 1, 1, 1, 0, 1)
d2 = (1, 0, 2, 1, 0, 0, 0, 1, 0)
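These count vectors can be reproduced with a short script. The token lists below are hypothetical stand-ins chosen to yield the same counts, since the original documents themselves are not shown:

```python
from collections import Counter

# Vocabulary of unique features after stop-word removal (from the example).
features = ["apple", "children", "good", "health", "kid",
            "play", "well", "wellness", "youngster"]

def count_vector(tokens, vocab):
    """Map a list of (stop-word-free) tokens to a word-count feature vector."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Hypothetical token lists chosen to reproduce the vectors in the text.
d1_tokens = ["children", "good", "kid", "play", "well", "youngster"]
d2_tokens = ["apple", "good", "good", "health", "wellness"]

print(count_vector(d1_tokens, features))  # [0, 1, 1, 0, 1, 1, 1, 0, 1]
print(count_vector(d2_tokens, features))  # [1, 0, 2, 1, 0, 0, 0, 1, 0]
```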
After mapping synonymous words to shared WordNet concepts (e.g., children, kid, and youngster to one concept, and good and well to another), the nine word features reduce to six concept features:

d1 = (0, 3, 2, 0, 1, 0)
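A minimal sketch of this concept-based reduction, assuming a hand-written synonym-to-concept map that stands in for the WordNet lookup used in the paper:

```python
# Hypothetical synonym-to-concept map; words not listed are their own concept.
concept_of = {"children": "child", "kid": "child", "youngster": "child",
              "well": "good"}

features = ["apple", "children", "good", "health", "kid",
            "play", "well", "wellness", "youngster"]

def reduce_vector(vec, features, concept_of):
    """Sum the word counts of features that share a concept,
    yielding a lower-dimensional concept-count vector."""
    concepts = sorted({concept_of.get(w, w) for w in features})
    reduced = dict.fromkeys(concepts, 0)
    for word, count in zip(features, vec):
        reduced[concept_of.get(word, word)] += count
    return [reduced[c] for c in concepts]

d1 = [0, 1, 1, 0, 1, 1, 1, 0, 1]
print(reduce_vector(d1, features, concept_of))  # [0, 3, 2, 0, 1, 0]
```

Note that the reduction only regroups counts, so the total word count of the document is preserved while the dimensionality drops from nine to six.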
[Figure 4: Analysis — classification accuracy with original features versus reduced features, evaluated on the test set and under cross validation (reported values: 0.611, 0.644, 0.681, 0.767)]

As shown in Figure 4, the cross-validation results on both the original and the reduced data are far better than the test-set results. The time taken for the reduced data is also less than for the original data.

6. CONCLUSION
The proposed work is a feature selection method in which concept-based features are extracted from the original features using WordNet. The high-dimensional feature space is reduced to a lower-dimensional one while improving classification accuracy. The Support Vector Machine (SVM) algorithm is applied for document classification, and the performance with the original features is compared against that with the reduced features. The results show that the performance obtained with the reduced data is better than that with the original data. To improve classification efficiency further, the SMTP similarity measure [3] and other classification approaches may be tried in future enhancements.

7. REFERENCES
[1] Basu T. and Murthy C. (2012). Effective text classification by a supervised feature selection approach. In Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, pages 918–925. IEEE.
[2] Gayathri K. and Marimuthu A. (2013). Text document pre-processing with the KNN for classification using the SVM. In Intelligent Systems and Control (ISCO), 2013 7th International Conference on, pages 453–457. IEEE.
[3] Lin Y.S., Jiang J.Y., and Lee S.J. (2013). A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering, page 1.
[4] Peng J., Yang D.Q., Tang S.W., Gao J., Zhang P.Y., and Fu Y. (2007). A concept similarity based text classification algorithm. In Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 01, pages 535–539. IEEE Computer Society.
[5] Wang Z.Q., Sun X., Zhang D.X., and Li X. (2006). An optimal SVM based text classification algorithm. In 2005 International Conference on Machine Learning and Cybernetics, pages 1378–1381.
[6] Datasets for single-label text categorization: http://web.ist.utl.pt/~acardoso/datasets/
[7] WEKA, classpath: http://weka.wikispaces.com/classpath
[8] WordNet 2.1: http://www.brothersoft.com/wordnet-236667.html