Is Naïve Bayes a Good Classifier for Document Classification?

International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011
1. Introduction
With the explosive growth of textual information from electronic documents and the World Wide Web, properly classifying this enormous volume of information according to our needs is a critical step towards business success. Numerous research activities have recently been conducted in the field of document classification, particularly in spam filtering [1-3], email categorization [4], website classification [5], the formation of knowledge repositories [6], and ontology mapping [7].
However, reading and correctly categorizing articles manually is time-consuming and labor-intensive. To address this challenge, automatic document classification has recently attracted growing interest in text mining research. Consequently, an increasing number of approaches have been developed for this purpose, including k-nearest-neighbor (KNN) classification [8], Naïve Bayes classification [9], support vector machines (SVM) [10], decision trees (DT) [11], neural networks (NN) [12], and maximum entropy [13].
Among these approaches, the Naïve Bayes text classifier has been widely used because of its simplicity in both the training and classifying stages [10]. Although it is less accurate than other discriminative methods (such as SVM), numerous researchers have shown that it is effective enough to classify text in many domains [14]. Naïve Bayes models allow each attribute to contribute towards the final decision equally and independently of the other attributes, which makes them more computationally efficient than other text classifiers. Thus, the present study employs the Naïve Bayes approach as the text classifier for document classification and evaluates its classification performance against other classifiers.
The remainder of the paper is organized as follows: Section 2 describes the Naïve Bayes classification algorithm and the document classification and evaluation methodology. Section 3 presents the data used for the evaluation, the experimental setup (i.e., comparing the Naïve Bayes results with SVM, NN, and DT), the evaluation method, and the results. Finally, Section 4 draws the conclusions and summarizes the lessons learnt.
2. Research Methodology
Figure 1 depicts the methodology for generating a document classifier model, from data preprocessing through model evaluation. Some data are useless (i.e., removing them, such as stop words, does not affect the classification result) and some terms carry similar meanings (e.g., 'bank' and 'banks'); therefore, a preprocessing phase is conducted first. In this way, the dataset becomes more precise.
After the data preprocessing phase, critical attributes have to be selected. In this study, "critical" refers to the importance of an attribute with respect to the solution class. For example, if the term 'bank' has the highest term frequency score in the 'business' class, then 'bank' is deemed one of the critical attributes for representing the documents that fall in the 'business' class. Less important features can thus be removed, so that the computational time is improved significantly.
As for the classification phase, different classifiers (such as SVM, NN, and DT) can be employed to generate the model; however, this study focuses on using Naïve Bayes to classify the documents. Given the probabilistic nature of Naïve Bayes, each training document is vectorized, and the trained Naïve Bayes classifier calculates the posterior probability value for each existing class. Further explanation can be found in Section 2.3, Adoption of Document Classifier – Naïve Bayes.
Finally, the model is evaluated with a set of testing data. To test the classification ability of the model, several evaluation measures (precision, recall, and F-measure) are adopted. Furthermore, to determine whether Naïve Bayes is the best classifier to use, its testing results are compared with those of the other classifiers.
It is common to find that several attributes are useless (such as the words 'a', 'the', etc.). Thus, a stopword removal algorithm is applied. To initialize the algorithm, a set of stopwords (such as a, a's, able, about, above, according, accordingly, and across) is prepared manually beforehand and stored in a text file. The model then simply matches the attributes against this preset stopword list.
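As a minimal illustration of this matching step (a Python sketch, not the authors' WEKA configuration; the file name stopwords.txt is an assumption):

```python
# Minimal stopword-removal sketch; "stopwords.txt" is an assumed
# file name holding one preset stopword per line.
def load_stopwords(path="stopwords.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    # Keep only the tokens that do not match the preset list.
    return [t for t in tokens if t.lower() not in stopwords]

stopwords = load_stopwords()
print(remove_stopwords("the bank is above the river".split(), stopwords))
```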
After the stopword algorithm, a missing-data checking algorithm is adopted. This algorithm identifies any missing data and imputes a value for it, since data mining cannot be performed when data are missing.
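A sketch of such a check in Python/pandas is given below; filling missing entries with zero is an assumption (treating an absent term frequency as zero), as the paper does not specify how the imputed value is derived:

```python
# Illustrative missing-value check and imputation; the zero-fill
# policy is an assumption, not the paper's stated method.
import pandas as pd

df = pd.DataFrame({"bank": [3.0, None, 1.0], "travel": [0.0, 2.0, None]})
print(df.isna().sum())   # number of missing entries per attribute
df = df.fillna(0.0)      # impute: treat an absent term frequency as 0
```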
The third algorithm applied in the preprocessing phase is stemming. Since some words carry similar meanings but appear in different grammatical forms (such as "bank" and "banks"), they need to be merged into one attribute. In this way, the documents show a better representation (with stronger correlations) of these terms, and the dataset can even be reduced, achieving faster processing times.
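For illustration, this merging can be approximated with a Porter stemmer via NLTK; the paper does not name a specific stemming algorithm, so this choice is an assumption:

```python
# Illustrative stemming step; the Porter stemmer is an assumed
# choice, as the paper does not name a specific algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
terms = ["bank", "banks", "banking", "travel", "travelled"]
print({t: stemmer.stem(t) for t in terms})
# "bank" and "banks" both map to the single stem "bank"
```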
Feature selection is one of the most important preprocessing steps in data mining. It is an effective dimensionality reduction technique for removing noisy features. In general, the basic idea of a feature selection algorithm is to search through all possible combinations of attributes in the data to find the subset of features that works best for prediction. Thus, the attribute vectors can be reduced in number: the most meaningful ones are kept, while the irrelevant or redundant ones are removed.
In this study, all the documents in the training data are categorized into four different categories, so the model can simply compute which terms occur frequently in each category. In this way, some useless or irrelevant attributes can be filtered out. As discussed in [5], the Cfs Subset Evaluator is the best method for obtaining the final feature set, whereas rank search or random search is suggested for producing a good feature set. Thus, in this study, the Cfs Subset Evaluator and rank search are applied as the feature selection algorithm. For example, Table 1 summarizes the features selected by applying the Cfs Subset Evaluator and rank search (with the Gain Ratio metric).
Table 1. Features Selected Using the Cfs Subset Evaluator and Rank Search
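Outside WEKA there is no direct counterpart of the Cfs Subset Evaluator with rank search; as a rough stand-in, the sketch below ranks features with a univariate chi-square score in scikit-learn (an approximation, not the paper's method):

```python
# Rough stand-in for the paper's feature selection: attributes are
# ranked by a chi-square score and only the top k are kept.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((40, 10))            # 40 documents, 10 TF-IDF attributes
y = rng.integers(0, 4, size=40)     # four document categories

selector = SelectKBest(chi2, k=5).fit(X, y)
print("kept attribute indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)   # keep only the top-ranked attributes
```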
After the preprocessing and feature selection phases, the number of attributes is significantly reduced, and the remaining attributes are more precise for use in building the classification model. For the classification phase, Naïve Bayes is used as the classifier because of its simplicity and good performance in document and text classification, as reported and discussed by Chakrabarti et al. [10].
The Naïve Bayes classifier is the simplest instance of a probabilistic classifier. The output Pr(C|d) of a probabilistic classifier is the probability that a document d belongs to a class C. Each document contains terms that are assigned probabilities based on their number of occurrences within that particular document. With supervised training, Naïve Bayes learns by examining a set of training documents that have been well categorized, comparing the contents across all categories, and building a list of words together with their occurrences. This list of word occurrences can then be used to classify new documents into their correct categories, according to the highest posterior probability.
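Under the usual conditional independence assumption, this decision rule can be written in the standard multinomial form below (an illustrative formulation; the paper does not print this equation), where the t_i are the distinct terms of document d and n_i(d) their occurrence counts:

$$\Pr(C \mid d) \;\propto\; \Pr(C) \prod_{i} \Pr(t_i \mid C)^{\,n_i(d)}, \qquad \hat{C} = \arg\max_{C} \Pr(C \mid d)$$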
To test and evaluate the model, 70% of the dataset is used. Instances are extracted and then serve as a benchmarking dataset for the machine learning problem. By comparing the actual class of each instance with the predicted one (i.e., the class generated by the classification model), the system performance can be measured in terms of recall, precision, and F-measure. These can be mathematically defined as below.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for a category, respectively.
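For illustration, Eqs. (1)-(3) can be computed and macro-averaged over the categories with scikit-learn; the label vectors below are invented examples:

```python
# Illustrative computation of precision, recall, and F-measure;
# the true/predicted labels here are made-up examples.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["business", "sport", "travel", "politics", "sport"]
y_pred = ["business", "sport", "business", "politics", "sport"]

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F-measure:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```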
In order to further evaluate the performance of the proposed preprocessing stage, the results with and without preprocessing are compared. If the results are worse than those obtained without the preprocessing phase (i.e., the classification model is not good enough), the parameters are adjusted and fine-tuned (e.g., by modifying the technique used in feature selection) and the model is rebuilt. This step is repeated until a good classification result is obtained. Furthermore, the Naïve Bayes classifier is tested against other classifiers (DT, SVM, and NN) to determine whether Naïve Bayes is the best classifier among them.
3. Performance Evaluation
3.1. Data Description
The goal of this study is to correctly classify the given experimental dataset into four categories (business, politics, sport, and travel). To start with, 1000 documents are given for each category to serve as the dataset for generating the classification model. To build and evaluate the classification model, the 4000 documents in total are split into two datasets, namely a training set and a testing set, in which 30% of the documents go to the training set while the remaining 70% go to the testing set.
For the representation of these documents, they have been vectorized into 1311 attributes (numerical values) and 1 solution attribute (a nominal value). There is no missing data among the attributes, and all the numeric attributes are expressed as term frequency/inverse document frequency (TF-IDF) values. Figure 2 presents an example of the data, and Table 2 summarizes the descriptions of both the training and testing sets.
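A minimal sketch of this representation and split in Python/scikit-learn follows; the toy corpus stands in for the paper's 4000 documents and 1311-term vocabulary:

```python
# Illustrative TF-IDF vectorization and 30/70 train/test split;
# the tiny corpus below is a stand-in for the real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = ["the bank raised interest rates", "the team won the match",
        "parliament passed the bill", "a cheap flight to the islands"] * 10
labels = ["business", "sport", "politics", "travel"] * 10

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.3, random_state=42)  # 30% train, 70% test
print(X_train.shape, X_test.shape)
```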
The objective of this evaluation is twofold. First, it determines whether the preprocessing phase is useful for achieving better classification accuracy and performance compared to the situation in which the data have not been preprocessed. Second, it compares the classification accuracy and performance when different classifiers are applied.
A dataset of 4000 documents classified into four different categories is used for the evaluation. The selected dataset contains four categories of documents: business, politics, sport, and travel. All four categories are easily differentiated. 30% of the data (i.e., 1200 documents) are extracted randomly to build the training dataset for the classifier. The other 2800 documents are used as the testing dataset to test the classifier.
The model is built with the "Naïve Bayes" classifier developed in Weka [15]. Table 3 summarizes the results of using the Naïve Bayes classifier to classify the documents. Surprisingly, however, the results on the preprocessed dataset (95.5%) are worse than those on the dataset that has not been preprocessed (96.9%). Therefore, the preprocessed model has to be adjusted in order to achieve a better result. Since the preprocessing phase is common to all cases, the adjustment is made in the feature selection phase.
As mentioned above, the Cfs Subset Evaluator and rank search (with the Gain Ratio metric) were used for the feature selection. Therefore, another ranking technique has been tried: this time, chi-square feature selection has been adopted, and 89 attributes are selected (Figure 3).
Rather than the 75 attributes selected previously, 89 attributes are now used as input. The result improves after applying chi-square feature selection, as shown in Table 4 and Figure 4. The accuracy improves by 0.1%. Although the improvement is insignificant, it shows that preprocessing and feature selection are useful for achieving better classification results. It is believed that different search techniques can yield different classification results under different situations, so many trials, and much time, would be needed to generalize the best solution. Due to time constraints, however, this study only concludes that preprocessing and feature selection can achieve better classification results. Furthermore, another critical observation is that the time taken to build the model improves significantly, from 9.66 seconds to around 0.19 seconds, after the number of features is greatly reduced (Table 5).
After discussing the importance of preprocessing and feature selection, the next experiment tests whether Naïve Bayes is the best classifier among the alternatives. For this purpose, three other classifiers are applied: SVM (the "SMO" function in WEKA), NN (the lazy "IBk"), and DT (the tree "J48"). In this experiment, the preprocessed dataset (with 90 attributes) is used for evaluation. Table 6 summarizes all the accuracy results together with precision, recall, and F-measure. As shown in the table, the accuracy of Naïve Bayes is the best among these classifiers. Although SVM achieves results similar to those of Naïve Bayes, the time taken to build the model is unsatisfactory: compared with the 0.19 seconds used to build the Naïve Bayes classifier, SVM requires 2.69 seconds, about 14 times as long, as shown in Table 7. As a result, Naïve Bayes is reported to be the best text classifier.
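A hedged sketch of such a comparison outside WEKA is shown below; scikit-learn's LinearSVC, KNeighborsClassifier, and DecisionTreeClassifier stand in for WEKA's SMO, IBk, and J48 (an approximation, not the paper's exact configuration), and the toy data means the numbers will not match Tables 6 and 7:

```python
# Illustrative comparison of the four classifier families with
# accuracy and model-build time; the corpus is a made-up stand-in.
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

docs = ["the bank raised rates", "the team won the match",
        "parliament passed the bill", "a cheap flight abroad"] * 25
labels = ["business", "sport", "politics", "travel"] * 25
X = TfidfVectorizer().fit_transform(docs)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, train_size=0.3,
                                          random_state=42)

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=1)),
                  ("DT", DecisionTreeClassifier())]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)                     # measure model-build time
    print(f"{name}: accuracy={clf.score(X_te, y_te):.3f}, "
          f"build time={time.perf_counter() - t0:.4f}s")
```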
Figure 3. Feature selection result using the Cfs Subset Evaluator and rank search (with chi-square feature selection)
4. Conclusions
In this study, the Naïve Bayes classifier has been shown to be the best document classifier among those tested, which agrees with the results reported in the literature. Through the implementation of the different feature selection methods and classifiers available in WEKA, it is demonstrated that preprocessing and feature selection are two important steps for improving mining quality. Documents contain many words, so thousands of terms are found when the terms are captured from these documents. However, some terms are useless and uninteresting with respect to the results, so it is important to discover and identify which features are useful and critical. Since numerous search and selection techniques are available, it is encouraged to apply all of these techniques and then select the best one for preprocessing the data as well as for building the model. Furthermore, the performance of the mining result is directly affected by the quality of the data. Hence, the preprocessing phase is important for making the data more precise (so as to achieve a better classification result) and even improves the time needed to train and generate the model, as demonstrated in the experiment section.
Table 5. Time taken to build the Naïve Bayes classifier (using the dataset with and without preprocessing)
Acknowledgement
The authors would like to express their sincere thanks to the Research Committee of The Hong Kong Polytechnic University for providing financial support for this research work.
References
[1] S.J. Delany, P. Cunningham, and L. Coyle, "An assessment of case-based reasoning for spam filtering", Artificial Intelligence Review, Vol. 24, No. 3-4, 2005, pp. 359-378.
[2] P. Cunningham, N. Nowlan, S.J. Delany, and M. Haahr, "A case-based approach to spam filtering that can track concept drift", In Proceedings: The ICCBR'03 Workshop on Long-Lived CBR Systems, Trondheim, Norway, 2003.
[3] K. Wei, A Naïve Bayes Spam Filter, Faculty of Computer Science, University of California, Berkeley, 2003.
[4] B. Kamens, Bayesian Filtering: Beyond Binary Classification, Fog Creek Software, Inc., 2005.
[5] M.I. Devi, R. Rajaram, and K. Selvakuberan, "Generating best features for web page classification", Webology, Vol. 5, No. 1, 2008, Article 52.
[6] M. Hartley, D. Isa, V.P. Kallimani, and L.H. Lee, "A domain knowledge preserving in process engineering using self-organizing concept", In Proceedings: ICAIET 2006, Kota Kinabalu, Sabah, Malaysia, 2006.
[7] X. Su, A Text Categorization Perspective for Ontology Mapping, Department of Computer and Information Science, Norwegian University of Science and Technology, Norway, 2002.
[8] E.H. Han, G. Karypis, and V. Kumar, Text Categorization Using Weight-Adjusted k-Nearest Neighbor Classification, Department of Computer Science and Engineering, Army HPC Research Center, University of Minnesota, 1999.
[9] A. McCallum and K. Nigam, "A comparison of event models for naïve Bayes text classification", Journal of Machine Learning Research, Vol. 3, 2003, pp. 1265-1287.
[10] S. Chakrabarti, S. Roy, and M.V. Soundalgekar, "Fast and accurate text classification via multiple linear discriminant projection", The VLDB Journal, 2003, pp. 170-185.
[11] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1993.
[12] S. Wermter, "Neural network agents for learning semantic text classification", Information Retrieval, Vol. 3, No. 2, 2004, pp. 87-103.
[13] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification", In Proceedings: IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999, pp. 61-67.
[14] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features", In Proceedings: ECML-98, 10th European Conference on Machine Learning, 1998, pp. 137-142.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten, "The WEKA data mining software: an update", SIGKDD Explorations, Vol. 11, No. 1, 2009, pp. 10-18.