Different Type of Feature Selection For Text Classification
2. Mutual Information (MI)
The mutual information of two random
variables is a quantity that measures the mutual
dependence of the two random variables. MI
measures how much information the
presence/absence of a term contributes to making the
correct classification decision.
MI(F, C_k) = \sum_{v_f \in \{0,1\}} \sum_{v_c \in \{0,1\}} P(F = v_f, C_k = v_c) \, \ln \frac{P(F = v_f, C_k = v_c)}{P(F = v_f) \, P(C_k = v_c)}
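As a concrete illustration (not from the paper), the MI score can be computed directly from the 2x2 term/class contingency counts; this sketch assumes natural logarithms and document-level presence/absence counts:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a term F and a class C_k from 2x2 contingency counts:
    n11 = docs in C_k containing F, n10 = docs outside C_k containing F,
    n01 = docs in C_k without F,   n00 = docs outside C_k without F."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for joint, pf, pc in [
        (n11, n11 + n10, n11 + n01),   # F present, class member
        (n10, n11 + n10, n10 + n00),   # F present, non-member
        (n01, n01 + n00, n11 + n01),   # F absent,  class member
        (n00, n01 + n00, n10 + n00),   # F absent,  non-member
    ]:
        if joint > 0:
            # P(F=v_f, C=v_c) * ln( P(F,C) / (P(F) P(C)) )
            mi += (joint / n) * math.log((joint * n) / (pf * pc))
    return mi
```

A term that is independent of the class scores 0, while a term perfectly correlated with a balanced class scores ln 2.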
3. Information Gain (IG)
Here both class membership and the presence/absence of a particular term are seen as random variables, and one computes how much information about the class membership is gained by knowing the presence/absence statistics (as is used in decision tree induction). It is frequently used as a term goodness criterion in machine learning: it measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. It is defined by the following expression:

IG(F) = H(C) - P(F) \, H(C \mid F) - P(\bar F) \, H(C \mid \bar F)

where H denotes entropy and \bar F the absence of the term.
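A hedged sketch of the gain computation in bits, from 2x2 document counts (n11 = in-class documents containing the term, n10 = out-of-class documents containing it, n01/n00 their counterparts without it):

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(n11, n10, n01, n00):
    """IG of term F for a binary class C_k:
    H(C) - P(F) H(C|F) - P(~F) H(C|~F)."""
    n = n11 + n10 + n01 + n00
    h_class = entropy([(n11 + n01) / n, (n10 + n00) / n])      # H(C)
    p_f = (n11 + n10) / n                                      # P(F present)
    h_f = entropy([n11 / (n11 + n10), n10 / (n11 + n10)]) if n11 + n10 else 0.0
    h_not_f = entropy([n01 / (n01 + n00), n00 / (n01 + n00)]) if n01 + n00 else 0.0
    return h_class - p_f * h_f - (1 - p_f) * h_not_f
```

A term that perfectly predicts a balanced binary class yields 1 bit; an independent term yields 0.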
4. χ² Statistic (CHI)
Feature selection by Chi-square testing is based on Pearson's χ² (chi-square) test. The Chi-square test of independence helps to find out whether two variables X and Y are related to or independent of each other. In feature selection, the Chi-square test measures the independence of a feature and a category. The null hypothesis here is that the feature and the category are completely independent. It is defined by,
\chi^2(F, C_k) = \frac{N \, (N_{F,C_k} N_{\bar F,\bar C_k} - N_{F,\bar C_k} N_{\bar F,C_k})^2}{N_F \, N_{\bar F} \, N_{C_k} \, N_{\bar C_k}}

where N is the total number of documents, N_{F,C_k} is the number of documents in class C_k that contain feature F, and \bar F, \bar C_k denote absence of the feature and non-membership in the class.
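A minimal sketch of the statistic from the four contingency counts (conventions in the docstring):

```python
def chi_square(n11, n10, n01, n00):
    """Pearson chi-square between term F and class C_k, where
    n11 = N_{F,Ck}, n10 = N_{F,~Ck}, n01 = N_{~F,Ck}, n00 = N_{~F,~Ck}."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0
```

Under the null hypothesis of independence the score is 0; a perfectly class-predictive term attains the maximum value N.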
5. Ngl Coefficient
The NGL coefficient is a variant of the Chi-square metric. It was originally named a `correlation coefficient'; it is called the `NGL coefficient' after the last names of its inventors Ng, Goh, and Low. The NGL coefficient looks only for evidence of positive class membership, while the Chi-square metric also selects evidence of negative class membership.
NGL(F, C_k) = \frac{\sqrt{N} \, (N_{F,C_k} N_{\bar F,\bar C_k} - N_{F,\bar C_k} N_{\bar F,C_k})}{\sqrt{N_F \, N_{\bar F} \, N_{C_k} \, N_{\bar C_k}}}
6. Term Frequency Document
Frequency
The tfidf weight is a method based on the term frequency combined with a document frequency threshold; the underlying weight is defined as,

tfidf(F, d) = tf(F, d) \cdot \log \frac{N}{df(F)}

where tf(F, d) is the frequency of term F in document d, df(F) is the number of documents containing F, and N is the size of the collection.
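As an illustrative sketch (the exact TFDF combination is implementation-specific; shown here is the standard tf-idf weight the measure builds on, with natural log assumed):

```python
import math

def tfidf(tf, df, n_docs):
    """Standard tf-idf weight: raw term frequency tf in the document,
    scaled by the log inverse document frequency (df = number of
    documents containing the term, n_docs = collection size)."""
    return tf * math.log(n_docs / df)
```

A term appearing in every document gets weight 0, so frequent but uninformative words are suppressed.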
7. GSS Coefficient
The GSS coefficient was originally presented in
[GSS00] as a `simplified chi square function'. We
International Journal of Computer Trends and Technology (IJCTT) volume 10 number 2 Apr 2014
ISSN: 2231-2803 http://www.ijcttjournal.org Page105
follow [Seb02] and name it GSS after the last names of the inventors Galavotti, Sebastiani, and Simi.

GSS(F, C_k) = N_{F,C_k} \, N_{\bar F,\bar C_k} - N_{F,\bar C_k} \, N_{\bar F,C_k}
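Both the NGL and GSS scores can be sketched from the same 2x2 counts used for chi-square (conventions in the docstrings):

```python
import math

def ngl(n11, n10, n01, n00):
    """NGL coefficient: a signed square root of the chi-square statistic.
    n11 = N_{F,Ck}, n10 = N_{F,~Ck}, n01 = N_{~F,Ck}, n00 = N_{~F,~Ck}."""
    n = n11 + n10 + n01 + n00
    den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return math.sqrt(n) * (n11 * n00 - n10 * n01) / den if den else 0.0

def gss(n11, n10, n01, n00):
    """GSS coefficient: the cross-product difference in counts
    (often divided by N^2 when expressed in probabilities)."""
    return n11 * n00 - n10 * n01
```

Unlike chi-square, NGL is negative for terms that indicate non-membership, which is how it keeps only evidence of positive class membership when the largest scores are selected.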
IV. TEXT CATEGORIZATION
With the rapid growth of online information,
there is a growing need for tools that help in finding,
filtering and managing the high dimensional data.
Automated text categorization is a supervised
learning task, defined as assigning category labels to
new documents based on likelihood suggested by a
training set of labeled documents. Real-world
applications of text categorization often require a
system to deal with tens of thousands of categories
defined over a large taxonomy. Since building these
text classifiers by hand is time consuming and costly,
automated text categorization has gained importance
over the years.
1. K-Nearest Neighbor Classifier Algorithm
K-Nearest Neighbor is one of the most popular algorithms for text categorization. The k-nearest neighbor algorithm (kNN) classifies objects based on the closest training examples in the feature space. It works as follows: given a test document to be classified, the kNN algorithm searches for its nearest neighbors among the pre-classified training documents. The k nearest neighbors are ranked by similarity scores, calculated with a similarity measure such as the Euclidean distance. The categories of the test document are then predicted from the ranked scores: the classification for the input pattern is the class with the highest confidence. The performance of each learning model is tracked using cross validation, which validates predetermined metrics such as performance and accuracy.
While using the kNN algorithm, after the k nearest neighbors are found, several strategies can be used to predict the category of a test document based on them. A fixed k value is usually used for all classes in these methods, regardless of their different distributions. Equations (1) and (2) below are two commonly used strategies of this kind:

f_1(d_i, c_k) = \sum_{x_j \in kNN(d_i)} y(x_j, c_k)    (1)

f_2(d_i, c_k) = \sum_{x_j \in kNN(d_i)} sim(d_i, x_j) \, y(x_j, c_k)    (2)
where d_i is a test document, x_j is one of the neighbors in the training set, y(x_j, c_k) indicates whether x_j belongs to class c_k, and sim(d_i, x_j) is the similarity function for d_i and x_j. Equation (1) means that the prediction will be the class that has the largest number of members among the k nearest neighbors, whereas equation (2) means the class with the maximal sum of similarity will be the winner.
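A minimal sketch of the similarity-sum strategy of equation (2), assuming document vectors as plain tuples and the hypothetical similarity sim = 1/(1 + Euclidean distance):

```python
import math
from collections import defaultdict

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(test_vec, train, k=3):
    """train: list of (feature_vector, label) pairs. Scores each class
    by the summed similarity of its members among the k nearest
    neighbors (equation (2)); sim = 1/(1+dist) is an assumed choice."""
    neighbors = sorted(train, key=lambda t: euclidean(test_vec, t[0]))[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += 1.0 / (1.0 + euclidean(test_vec, vec))
    return max(scores, key=scores.get)
```

Replacing the similarity with a constant 1 per neighbor recovers the majority-vote strategy of equation (1).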
2. Naive Bayesian Classifier Algorithm
The Naive Bayes classifier is a simple Bayesian classification algorithm that has proven very effective for text categorization. In the text categorization problem, a document d ∈ D corresponds to a data instance, where D denotes the training document set. The document d can be represented as a bag of words, where each word w ∈ d comes from a set W of all feature words. Each document d is associated with a class label c ∈ C, where C denotes the class label set. The Naive Bayes classifier estimates the conditional probability P(c|d), the probability that a document d belongs to a class c. Using the Bayes rule, we have

P(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)}

The key assumption of Naive Bayes classifiers is that the words in the documents are conditionally independent given the class value, so that

P(d \mid c) = \prod_{w \in d} P(w \mid c)

A popular way to estimate P(w|c) is through Laplacian smoothing:

P(w \mid c) = \frac{1 + n(w, c)}{|W| + n(c)}
The Euclidean distance between two document vectors X and Y, used by the kNN classifier above, is

Dist(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}
where n(w, c) is the number of the word positions
that are occupied by w in all training examples whose
class value is c. n(c) is the number of word positions
whose class value is c. Finally, |W| is the total
number of distinct words in the training set.
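Putting the pieces together, a compact sketch of training and classification with Laplacian smoothing (the toy corpus and helper names are illustrative, not from the paper):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (word_list, class_label). Returns a classifier that
    scores classes by log P(c) + sum_w log P(w|c), with
    P(w|c) = (1 + n(w,c)) / (|W| + n(c))."""
    class_freq = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)               # n(w, c)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    v = len(vocab)                                   # |W|
    def classify(words):
        best, best_lp = None, -math.inf
        for c in class_freq:
            n_c = sum(word_counts[c].values())       # n(c)
            lp = math.log(class_freq[c] / len(docs)) # log prior P(c)
            for w in words:
                lp += math.log((1 + word_counts[c][w]) / (v + n_c))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
    return classify
```

Working in log space avoids numerical underflow from multiplying many small probabilities, and the +1 smoothing keeps unseen words from zeroing out a class.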
V. PERFORMANCE METRIC
The evaluation of a classifier is done using the precision and recall measures. To derive a robust measure of the effectiveness of the classifier, one can calculate the breakeven point, the 11-point precision, and the average precision. The classification is evaluated for a threshold ranging from 0 (recall = 1) up to a value where the precision equals 1 and the recall equals 0, incrementing the threshold with a given step size. The breakeven point is the point where recall meets precision, and the eleven-point precision is the average of the precision values at the points where recall equals the eleven values 0.0, 0.1, 0.2, ..., 0.9, 1.0. Average precision refines the eleven-point precision, as it approximates the area below the precision/recall curve.
The 11-point average precision is another measure for representing performance with a single value. For every category, the CSV_i threshold is repeatedly tuned to allow the recall to take the values 0.0, 0.1, ..., 1.0. At every point the precision is calculated, and at the end the average over these eleven values is returned [Sebastiani02]. The retrieval system must support a ranking policy.
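A sketch of the 11-point interpolated average precision over a ranked result list (1 = relevant, 0 = not), assuming at least one relevant document:

```python
def eleven_point_avg_precision(ranked_relevance):
    """Average of the interpolated precision at recall levels
    0.0, 0.1, ..., 1.0 for a ranked list of 0/1 relevance flags."""
    total_rel = sum(ranked_relevance)
    points, rel_so_far = [], 0
    for i, rel in enumerate(ranked_relevance, start=1):
        rel_so_far += rel
        points.append((rel_so_far / total_rel, rel_so_far / i))  # (recall, precision)
    # interpolated precision at level r: max precision at any recall >= r
    levels = [i / 10 for i in range(11)]
    interp = [max((p for r, p in points if r >= lvl), default=0.0) for lvl in levels]
    return sum(interp) / 11
```

A ranking that places all relevant documents first scores 1.0; the score degrades as relevant documents slide down the ranking.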
VI. EXPERIMENTAL RESULTS
Data Set 1: Self-Made
For development we used a small self-made corpus that contains standard categories such as Science, Business, Sports, Health, Education, Travel, and Movies. It contains around 150 documents in the above-mentioned categories.
Data Set 2: The Reuters-21578 corpus
The second corpus included for the development is the Reuters-21578 corpus, which is freely available on the internet (Lewis 1997). Since the system uses an XML parser, it was necessary to convert the 22 SGML documents to XML using the freely available tool SX (Clark 2001). After the conversion, we deleted some single characters that were rejected by the validating XML parser because they had decimal values below 30. This does not affect the results, since these characters would have been treated as whitespace anyway.
Table 1: Performances of two classification algorithms

Data Set     | kNN Micro-Avg | kNN Macro-Avg       | SVM Micro-Avg | SVM Macro-Avg
             | Pre=Rec=F1    | Pre    Rec    F1    | Pre=Rec=F1    | Pre    Rec    F1
-------------|---------------|---------------------|---------------|---------------------
Reuters      | 71.69         | 67.10  66.57  66.02 | 72.35         | 70.74  77.24  77.96
20newsgroup  | 70.12         | 67.88  67.56  66.29 | 75.24         | 75.17  78.49  78.11
Avg.         | 70.90         | 67.49  66.43  66.15 | 73.79         | 72.95  77.86  78.03
To evaluate the effectiveness of category assignments by classifiers to documents, the standard precision, recall, and F1 measures are used here. Precision is defined as the ratio of correct assignments by the system to the total number of the system's assignments. Recall is the ratio of correct assignments by the system to the total number of correct assignments.
These scores can be computed for the binary
decisions on each individual category first and then
be averaged over categories. Or, they can be
computed globally over all the n*m binary decisions
where n is the number of total test documents, and m
is the number of categories in consideration. The
former way is called macro-averaging and the latter
micro-averaging. It is understood that the micro-averaged scores (recall, precision, and F1) tend to be dominated by the classifier's performance on common categories, and that the macro-averaged scores are more influenced by the performance on rare categories.
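The two averaging schemes can be sketched as follows, starting from per-category contingency counts (true positives, false positives, false negatives):

```python
def micro_macro_f1(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category.
    Returns (micro-averaged F1, macro-averaged F1)."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0
    # macro: average the per-category F1 scores
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    # micro: pool all binary decisions, then compute one F1
    tp, fp, fn = (sum(c[i] for c in per_category) for i in range(3))
    return f1(tp, fp, fn), macro
```

A rare category with poor F1 drags the macro average down sharply while barely moving the micro average, which matches the observation above.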
VII. CONCLUSION
We analyzed text classification using the Naive Bayesian and K-Nearest Neighbor classifiers. The methods are favorable in terms of their effectiveness and efficiency when compared with other classifiers such as SVM. The advantage of the proposed approach is that the classification algorithm learns the importance of attributes and utilizes them in the similarity measure. In the future, a classification model can be built that analyzes terms at the sentence and document level.
REFERENCES
[1] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification, IEEE Trans. Knowledge and Data Eng., vol. 23, no. 3, Mar. 2011.
[2] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang,
W. Xi,and Z. Chen, Effective and Efficient Dimensionality
Reduction for Large-Scale and Streaming Data Preprocessing,
IEEE Trans.Knowledge and Data Eng., vol. 18, no. 3, pp. 320-333,
Mar. 2006.
[3] H. Li, T. Jiang, and K. Zang, Efficient and Robust Feature
Extraction by Maximum Margin Criterion, T. Sebastian,
S.Lawrence, and S. Bernhard eds. Advances in Neural Information
Processing System, pp. 97-104, Springer, 2004.
[4] D.D. Lewis, Feature Selection and Feature Extraction for Text
Categorization, Proc. Workshop Speech and Natural
Language,pp. 212-217, 1992.
[5]http://kdd.ics.uci.edu/databases/
reuters21578/reuters21578.html. 2010.
[6] H. Kim, P. Howland, and H. Park, Dimension Reduction in
Text Classification with Support Vector Machines, J. Machine
Learning Research, vol. 6, pp. 37-53, 2005.
[7] F. Sebastiani, Machine Learning in Automated Text
Categorization, ACM Computing Surveys, vol. 34, no. 1, pp. 1-
47, 2002.
[8] H. Park, M. Jeon, and J. Rosen, Lower Dimensional
Representation of Text Data Based on Centroids and Least
Squares, BIT Numerical Math, vol. 43, pp. 427-448, 2003.