Is Naïve Bayes a Good Classifier for Document Classification?

International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011
1. Introduction
With the explosive growth of textual information from electronic documents and the World Wide Web, properly classifying this enormous volume of information according to our needs is a critical step towards business success. Numerous research activities have recently been conducted in the field of document classification, particularly in spam filtering [1-3], email categorization [4], website classification [5], the formation of knowledge repositories [6], and ontology mapping [7].
However, reading and correctly categorizing articles manually is time-consuming and labor-intensive. To address this challenge, automatic document classification has recently attracted growing interest in text mining research. Consequently, an increasing number of approaches have been developed for this purpose, including k-nearest-neighbor (KNN) classification [8], Naïve Bayes classification [9], support vector machines (SVM) [10], decision trees (DT) [11], neural networks (NN) [12], and maximum entropy [13].
Among these approaches, the Naïve Bayes text classifier has been widely used because of its simplicity in both the training and classifying stages [10]. Although it is less accurate than other discriminative methods (such as SVM), numerous researchers have shown that it is effective enough to classify text in many domains [14]. Naïve Bayes models allow each attribute to contribute towards the final decision equally and independently of the other attributes, which makes them more computationally efficient than other text classifiers. Thus, the present study employs the Naïve Bayes approach as the text classifier for document classification and evaluates its classification performance against other classifiers.
The remainder of the paper is organized as follows: Section 2 describes the Naïve Bayes classification algorithm and the document classification and evaluation methodology. Section 3 presents the data used for the evaluation, the experimental setup (i.e., comparing the Naïve Bayes results with SVM, NN, and DT), the evaluation method, and the results. Finally, Section 4 draws the conclusions and summarizes the lessons learnt.
2. Research Methodology
Figure 1 depicts the methodology for generating a document classifier model, from data preprocessing through model evaluation. Some data are useless (i.e., removing them, such as stop words, does not affect the classification result) and some terms carry similar meanings (e.g., 'bank' and 'banks'); therefore, a preprocessing phase is conducted first. In this way, the dataset becomes more precise.
After the data preprocessing phase, critical attributes have to be selected. In this study, "critical" refers to the importance of an attribute with respect to the solution class. For example, if the term 'bank' has the highest term frequency score in the 'business' class, then 'bank' is deemed one of the critical attributes for representing the documents that fall in the 'business' class. Less important features can thus be removed, so that the computational time is improved significantly.
As for the classification phase, different classifiers (such as SVM, NN, and DT) can be employed to generate the model; however, this study focuses on using Naïve Bayes to classify the documents. Given the probabilistic nature of Naïve Bayes, each training document is vectorized, and the trained Naïve Bayes classifier calculates the posterior probability value for each existing class. Further explanation can be found in Section 2.3, Adoption of Document Classifier – Naïve Bayes.
Finally, the model is evaluated with a set of testing data. To test the classification ability of the model, several evaluation measures (precision, recall, and F-measure) are adopted. Furthermore, to determine whether Naïve Bayes is the best classifier to use, its testing results are compared with those of the other classifiers.
It is common to find that several attributes are useless (such as the words 'a', 'the', etc.). Thus, a stopword removal algorithm is applied. To initialize the algorithm, a set of stopwords (such as a, a's, able, about, above, according, accordingly, and across) is prepared manually beforehand and stored in a text file. The model then simply matches the attributes against this preset stopword list.
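As a minimal illustration of this matching step (a Python sketch, not the authors' WEKA configuration; the file name stopwords.txt is an assumption):

```python
# Minimal stopword-removal sketch; "stopwords.txt" is an assumed
# file name holding one preset stopword per line.
def load_stopwords(path="stopwords.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    # Keep only the tokens that do not match the preset list.
    return [t for t in tokens if t.lower() not in stopwords]

stopwords = load_stopwords()
print(remove_stopwords("the bank is above the river".split(), stopwords))
```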
After the stopword algorithm, a missing-data checking algorithm is adopted. This algorithm identifies any missing data and imputes a value for it, since data mining cannot be performed when data are missing.
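A sketch of such a check in Python/pandas is given below; filling missing entries with zero is an assumption (treating an absent term frequency as zero), as the paper does not specify how the imputed value is derived:

```python
# Illustrative missing-value check and imputation; the zero-fill
# policy is an assumption, not the paper's stated method.
import pandas as pd

df = pd.DataFrame({"bank": [3.0, None, 1.0], "travel": [0.0, 2.0, None]})
print(df.isna().sum())   # number of missing entries per attribute
df = df.fillna(0.0)      # impute: treat an absent term frequency as 0
```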
The third algorithm applied in the preprocessing phase is stemming. Since some words carry similar meanings but appear in different grammatical forms (such as "bank" and "banks"), they need to be merged into one attribute. In this way, the documents show a better representation (with stronger correlations) of these terms, and the dataset can even be reduced, achieving faster processing times.
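For illustration, this merging can be approximated with a Porter stemmer via NLTK; the paper does not name a specific stemming algorithm, so this choice is an assumption:

```python
# Illustrative stemming step; the Porter stemmer is an assumed
# choice, as the paper does not name a specific algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
terms = ["bank", "banks", "banking", "travel", "travelled"]
print({t: stemmer.stem(t) for t in terms})
# "bank" and "banks" both map to the single stem "bank"
```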
Feature selection is one of the most important preprocessing steps in data mining. It is an effective dimensionality reduction technique for removing noisy features. In general, the basic idea of a feature selection algorithm is to search through all possible combinations of attributes in the data to find the subset of features that works best for prediction. Thus, the attribute vectors can be reduced in number: the most meaningful ones are kept, while the irrelevant or redundant ones are removed.
In this study, all the documents in the training data are categorized into four different categories, so the model can simply compute which terms occur frequently in each category. In this way, some useless or irrelevant attributes can be filtered out. As discussed in [5], the Cfs Subset Evaluator is the best method for obtaining the final feature set, whereas rank search or random search is suggested for producing a good feature set. Thus, in this study, the Cfs Subset Evaluator and rank search are applied as the feature selection algorithm. For example, Table 1 summarizes the features selected by applying the Cfs Subset Evaluator and rank search (with the Gain Ratio metric).
Table 1. Features Selected Using the Cfs Subset Evaluator and Rank Search
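Outside WEKA there is no direct counterpart of the Cfs Subset Evaluator with rank search; as a rough stand-in, the sketch below ranks features with a univariate chi-square score in scikit-learn (an approximation, not the paper's method):

```python
# Rough stand-in for the paper's feature selection: attributes are
# ranked by a chi-square score and only the top k are kept.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((40, 10))            # 40 documents, 10 TF-IDF attributes
y = rng.integers(0, 4, size=40)     # four document categories

selector = SelectKBest(chi2, k=5).fit(X, y)
print("kept attribute indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)   # keep only the top-ranked attributes
```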
After the preprocessing and feature selection phases, the number of attributes is significantly reduced, and the remaining attributes are more precise for use in building the classification model. For the classification phase, Naïve Bayes is used as the classifier because of its simplicity and good performance in document and text classification, as reported and discussed by Chakrabarti et al. [10].
The Naïve Bayes classifier is the simplest instance of a probabilistic classifier. The output Pr(C|d) of a probabilistic classifier is the probability that a document d belongs to a class C. Each document contains terms that are assigned probabilities based on their number of occurrences within that particular document. With supervised training, Naïve Bayes learns by examining a set of training documents that have been well categorized, comparing the contents across all categories, and building a list of words together with their occurrences. This list of word occurrences can then be used to classify new documents into their correct categories, according to the highest posterior probability.
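Under the usual conditional independence assumption, this decision rule can be written in the standard multinomial form below (an illustrative formulation; the paper does not print this equation), where the t_i are the distinct terms of document d and n_i(d) their occurrence counts:

$$\Pr(C \mid d) \;\propto\; \Pr(C) \prod_{i} \Pr(t_i \mid C)^{\,n_i(d)}, \qquad \hat{C} = \arg\max_{C} \Pr(C \mid d)$$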
To test and evaluate the model, 70% of the dataset is used. Instances are extracted and then serve as a benchmarking dataset for the machine learning problem. By comparing the actual class of each instance with the predicted one (i.e., the class generated by the classification model), the system performance can be measured in terms of recall, precision, and F-measure. These can be mathematically defined as below.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for a category, respectively.
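For illustration, Eqs. (1)-(3) can be computed and macro-averaged over the categories with scikit-learn; the label vectors below are invented examples:

```python
# Illustrative computation of precision, recall, and F-measure;
# the true/predicted labels here are made-up examples.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["business", "sport", "travel", "politics", "sport"]
y_pred = ["business", "sport", "business", "politics", "sport"]

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F-measure:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```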
In order to further evaluate the performance of the proposed preprocessing stage, the results with and without preprocessing are compared. If the results are worse than those obtained without the preprocessing phase (i.e., the classification model is not good enough), the parameters are adjusted and fine-tuned (e.g., by modifying the technique used in feature selection) and the model is rebuilt. This step is repeated until a good classification result is obtained. Furthermore, the Naïve Bayes classifier is tested against other classifiers (DT, SVM, and NN) to determine whether Naïve Bayes is the best classifier among them.
3. Performance Evaluation
3.1. Data Description
The goal of this study is to correctly classify the given experimental dataset into four categories (business, politics, sport, and travel). To start with, 1000 documents are given for each category to serve as the dataset for generating the classification model. To build and evaluate the classification model, the 4000 documents in total are split into two datasets, namely a training set and a testing set, in which 30% of the documents go to the training set while the remaining 70% go to the testing set.
For the representation of these documents, they have been vectorized into 1311 attributes (numerical values) and 1 solution attribute (a nominal value). There is no missing data among the attributes, and all the numeric attributes are expressed as term frequency/inverse document frequency (TF-IDF) values. Figure 2 presents an example of the data, and Table 2 summarizes the descriptions of both the training and testing sets.
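A minimal sketch of this representation and split in Python/scikit-learn follows; the toy corpus stands in for the paper's 4000 documents and 1311-term vocabulary:

```python
# Illustrative TF-IDF vectorization and 30/70 train/test split;
# the tiny corpus below is a stand-in for the real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = ["the bank raised interest rates", "the team won the match",
        "parliament passed the bill", "a cheap flight to the islands"] * 10
labels = ["business", "sport", "politics", "travel"] * 10

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.3, random_state=42)  # 30% train, 70% test
print(X_train.shape, X_test.shape)
```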
The objective of this evaluation is twofold. First, it determines whether the preprocessing phase is useful for achieving better classification accuracy and performance compared to the situation in which the data have not been preprocessed. Second, it compares the classification accuracy and performance when different classifiers are applied.
A dataset of 4000 documents classified into four different categories is used for the evaluation. The selected dataset contains four categories of documents: business, politics, sport, and travel. All four categories are easily differentiated. 30% of the data (i.e., 1200 documents) are extracted randomly to build the training dataset for the classifier. The other 2800 documents are used as the testing dataset to test the classifier.
The model is built with the "Naïve Bayes" classifier developed in Weka [15]. Table 3 summarizes the results of using the Naïve Bayes classifier to classify the documents. Surprisingly, however, the results on the preprocessed dataset (95.5%) are worse than those on the dataset that has not been preprocessed (96.9%). Therefore, the preprocessed model has to be adjusted in order to achieve a better result. Since the preprocessing phase is common to all cases, the adjustment is made in the feature selection phase.
As mentioned above, the Cfs Subset Evaluator and rank search (with the Gain Ratio metric) were used for the feature selection. Therefore, another ranking technique has been tried: this time, chi-square feature selection has been adopted, and 89 attributes are selected (Figure 3).
Rather than the 75 attributes selected previously, 89 attributes are now used as input. The result improves after applying chi-square feature selection, as shown in Table 4 and Figure 4. The accuracy improves by 0.1%. Although the improvement is insignificant, it shows that preprocessing and feature selection are useful for achieving better classification results. It is believed that different search techniques can yield different classification results under different situations, so many trials, and much time, would be needed to generalize the best solution. Due to time constraints, however, this study only concludes that preprocessing and feature selection can achieve better classification results. Furthermore, another critical observation is that the time taken to build the model improves significantly, from 9.66 seconds to around 0.19 seconds, after the number of features is greatly reduced (Table 5).
After discussing the importance of preprocessing and feature selection, the next experiment tests whether Naïve Bayes is the best classifier among the alternatives. For this purpose, three other classifiers are applied: SVM (the "SMO" function in WEKA), NN (the lazy "IBk"), and DT (the tree "J48"). In this experiment, the preprocessed dataset (with 90 attributes) is used for evaluation. Table 6 summarizes all the accuracy results together with precision, recall, and F-measure. As shown in the table, the accuracy of Naïve Bayes is the best among these classifiers. Although SVM achieves results similar to those of Naïve Bayes, the time taken to build the model is unsatisfactory: compared with the 0.19 seconds used to build the Naïve Bayes classifier, SVM requires 2.69 seconds, about 14 times as long, as shown in Table 7. As a result, Naïve Bayes is reported to be the best text classifier.
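A hedged sketch of such a comparison outside WEKA is shown below; scikit-learn's LinearSVC, KNeighborsClassifier, and DecisionTreeClassifier stand in for WEKA's SMO, IBk, and J48 (an approximation, not the paper's exact configuration), and the toy data means the numbers will not match Tables 6 and 7:

```python
# Illustrative comparison of the four classifier families with
# accuracy and model-build time; the corpus is a made-up stand-in.
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

docs = ["the bank raised rates", "the team won the match",
        "parliament passed the bill", "a cheap flight abroad"] * 25
labels = ["business", "sport", "politics", "travel"] * 25
X = TfidfVectorizer().fit_transform(docs)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, train_size=0.3,
                                          random_state=42)

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=1)),
                  ("DT", DecisionTreeClassifier())]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)                     # measure model-build time
    print(f"{name}: accuracy={clf.score(X_te, y_te):.3f}, "
          f"build time={time.perf_counter() - t0:.4f}s")
```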
Figure 3. Feature selection result using the Cfs Subset Evaluator and rank search (with chi-square feature selection)
4. Conclusions
In this study, the Naïve Bayes classifier has been shown to be the best document classifier among those tested, which agrees with the results reported in the literature. Through the implementation of the different feature selection methods and classifiers available in WEKA, it is demonstrated that preprocessing and feature selection are two important steps for improving mining quality. Documents contain many words, so thousands of terms are found when the terms are captured from these documents. However, some terms are useless and uninteresting with respect to the results, so it is important to discover and identify which features are useful and critical. Since numerous search and selection techniques are available, it is encouraged to apply all of these techniques and then select the best one for preprocessing the data as well as for building the model. Furthermore, the performance of the mining result is directly affected by the quality of the data. Hence, the preprocessing phase is important for making the data more precise (so as to achieve a better classification result) and even improves the time needed to train and generate the model, as demonstrated in the experiment section.
Table 5. Time taken to build the Naïve Bayes classifier (using the dataset with and without preprocessing)
Acknowledgement
The authors would like to express their sincere thanks to the Research Committee of The Hong Kong Polytechnic University for providing financial support for this research work.
References
[1] S.J. Delany, P. Cunningham, and L. Coyle, "An assessment of case-based reasoning for spam filtering", Artificial Intelligence Review, Vol. 24, No. 3-4, 2005, pp. 359-378.
[2] P. Cunningham, N. Nowlan, S.J. Delany, and M. Haahr, "A case-based approach to spam filtering that can track concept drift", In Proceedings: The ICCBR'03 Workshop on Long-Lived CBR Systems, Trondheim, Norway, 2003.
[3] K. Wei, A Naïve Bayes Spam Filter, Faculty of Computer Science, University of California, Berkeley, 2003.
[4] B. Kamens, Bayesian Filtering: Beyond Binary Classification, Fog Creek Software, Inc., 2005.
[5] M.I. Devi, R. Rajaram, and K. Selvakuberan, "Generating best features for web page classification", Webology, Vol. 5, No. 1, 2008, Article 52.
[6] M. Hartley, D. Isa, V.P. Kallimani, and L.H. Lee, "A domain knowledge preserving in process engineering using self-organizing concept", In Proceedings: ICAIET 2006, Kota Kinabalu, Sabah, Malaysia, 2006.
[7] X. Su, A Text Categorization Perspective for Ontology Mapping, Department of Computer and Information Science, Norwegian University of Science and Technology, Norway, 2002.
[8] E.H. Han, G. Karypis, and V. Kumar, Text Categorization Using Weight-Adjusted k-Nearest Neighbor Classification, Department of Computer Science and Engineering, Army HPC Research Center, University of Minnesota, 1999.
[9] A. McCallum and K. Nigam, "A comparison of event models for naïve Bayes text classification", Journal of Machine Learning Research, Vol. 3, 2003, pp. 1265-1287.
[10] S. Chakrabarti, S. Roy, and M.V. Soundalgekar, "Fast and accurate text classification via multiple linear discriminant projection", The VLDB Journal, 2003, pp. 170-185.
[11] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1993.
[12] S. Wermter, "Neural network agents for learning semantic text classification", Information Retrieval, Vol. 3, No. 2, 2004, pp. 87-103.
[13] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification", In Proceedings: IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999, pp. 61-67.
[14] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features", In Proceedings: ECML-98, 10th European Conference on Machine Learning, 1998, pp. 137-142.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten, "The WEKA data mining software: an update", SIGKDD Explorations, Vol. 11, No. 1, 2009, pp. 10-18.