
Hindawi

Wireless Communications and Mobile Computing


Volume 2023, Article ID 9796719, 1 page
https://doi.org/10.1155/2023/9796719

Retraction
Retracted: Text Classification Based on Machine Learning and
Natural Language Processing Algorithms

Wireless Communications and Mobile Computing


Received 17 October 2023; Accepted 17 October 2023; Published 18 October 2023

Copyright © 2023 Wireless Communications and Mobile Computing. This is an open access article distributed under the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.

This article has been retracted by Hindawi following an investigation undertaken by the publisher [1]. This investigation has uncovered evidence of one or more of the following indicators of systematic manipulation of the publication process:
(1) Discrepancies in scope
(2) Discrepancies in the description of the research reported
(3) Discrepancies between the availability of data and the
research described
(4) Inappropriate citations
(5) Incoherent, meaningless and/or irrelevant content
included in the article
(6) Peer-review manipulation

The presence of these indicators undermines our confidence in the integrity of the article's content, and we cannot, therefore, vouch for its reliability. Please note that this notice is intended solely to alert readers that the content of this article is unreliable. We have not investigated whether the authors were aware of or involved in the systematic manipulation of the publication process.

Wiley and Hindawi regret that the usual quality checks did not identify these issues before publication and have since put additional measures in place to safeguard research integrity.
We wish to credit our own Research Integrity and Research Publishing teams and anonymous and named external researchers and research integrity experts for contributing to this investigation.

The corresponding author, as the representative of all authors, has been given the opportunity to register their agreement or disagreement to this retraction. We have kept a record of any response received.

References

[1] H. Li and Z. Li, "Text Classification Based on Machine Learning and Natural Language Processing Algorithms," Wireless Communications and Mobile Computing, vol. 2022, Article ID 3915491, 12 pages, 2022.
Hindawi
Wireless Communications and Mobile Computing
Volume 2022, Article ID 3915491, 12 pages
https://doi.org/10.1155/2022/3915491

Research Article
Text Classification Based on Machine Learning and Natural Language Processing Algorithms

Hui Li¹,² and Zeming Li²

¹School of Computer and Information Engineering, Harbin University of Commerce, Harbin, 150028 Heilongjiang, China
²Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin, 150028 Heilongjiang, China

Correspondence should be addressed to Zeming Li; [email protected]

Received 27 April 2022; Revised 30 May 2022; Accepted 24 June 2022; Published 19 July 2022

Academic Editor: Chia-Huei Wu

Copyright © 2022 Hui Li and Zeming Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Nowadays, with the development of media technology, people receive more and more information, but current classification methods suffer from low classification efficiency and an inability to handle multiple languages. In view of this, this paper aims to improve text classification methods by using machine learning (ML) and natural language processing (NLP) technology. For text classification technology, this paper combines the technical requirements and application scenarios of text classification with ML to optimize the classification. For the application of NLP technology in text classification, this paper puts forward the Trusted Platform Module (TPM) text classification algorithm. In an experiment distinguishing spam from legitimate mail by text recognition, all performance indexes of the TPM algorithm are superior to those of the other algorithms, and the accuracy of the TPM algorithm on different datasets is above 95%.

1. Introduction

Although the representation of information is getting richer and richer, so far the main representation of information is still text. On the one hand, because text is the most natural form of information representation, it is easily accepted by people. On the other hand, due to the low cost of text representation, and driven by the advocacy of the paperless office, a large number of electronic publications, digital libraries, e-commerce platforms, etc. have appeared in the form of text. In addition, with the rapid development of the global Internet in recent years, a large number of social networking sites, mobile Internet services, and other industries have emerged.

From a global perspective, the number of websites will continue to grow, which will inevitably generate an even greater amount of information. Because the amount of text data is so large, while providing people with more usable information, it also makes it more difficult for people to find the information that interests them most. That is to say, information explosion leads to information trek. Therefore, how to dig out important information from massive information has very high research value and practical significance. Due to the different needs of users, how to excavate the characteristics of different users and find exclusive information for them has become the main problem to be solved in current information processing. Text classification technology using artificial intelligence algorithms can perform classification tasks automatically and efficiently, greatly reducing cost. It plays an important role in many fields such as sentiment analysis, public opinion analysis, domain recognition, and intent recognition.

In this paper, the first chapter briefly describes the current situation of natural language processing and machine learning. The second chapter reviews related work, summarizing the advantages and disadvantages of other scholars' natural language processing algorithms. The third chapter describes the text classification algorithm in detail, paving the way for the subsequent algorithm. In Chapter 4, drawing on adaptive algorithms from deep learning and intelligent learning technology, the existing natural language algorithm is improved, and the TPM algorithm is proposed and introduced. The fifth chapter analyzes the performance of the proposed algorithm. Finally, the full text is summarized.
Figure 1: Description of the classification problem. A learning system builds a model from the labeled training pairs (x1, y1), ..., (xN, yN); the taxonomy (classification) system then uses the model to assign the label yN+1 to a new instance xN+1.
2. Related Work

So far, text classification technology has been widely used in information filtering, mail classification, search engines, query intent prediction, topic tracking, text corpus construction, and other fields. It can help users accurately classify messy data to obtain classified text information and solve the problem of rapidly locating the information they need. A large number of researchers in both academia and industry have begun to pay attention to this direction, which not only promotes academic development but also promotes the R&D and promotion of corresponding products.

Mohamed et al. proposed a novel active learning method for text classification in order to solve the problem of manually labeling data samples during the training phase. The experimental results show that the proposed active learning method significantly reduces the labeling workload while improving the classification accuracy [1]. Mironczuk and Protasiewicz designed a semisupervised learning Universum algorithm based on boosting technology, mainly for the case of only a small number of labeled samples. Their experiments used four datasets and several combinations, and the results show that the algorithm can benefit from Universum samples and outperform several other methods, especially when the labeled samples are insufficient [2]. The purpose of Liu et al. was to extract state-of-the-art features for text classification; they believed that their study would help readers obtain the necessary information about these elements and their associated technologies [3]. Pavlinek and Podgorelec proposed a topic model-based approach for semisupervised text classification. The proposed method includes a self-training and model-based semisupervised text classification algorithm that determines parameter settings for any new collection of documents [4]. Kobayashi et al. are very interested in data mining technology; they believe that text classification technology can help in data mining [5]. He believes that today's athletes cannot avoid injuries during training, so he hopes to model the sports training of athletes and reduce their sports injuries [6]. Shah et al. considered an improved version of a semisupervised learning algorithm for graph-structured data to address the problem of scaling deep learning methods to represent graph data [7]. Anoual and Zeroual's research on Arabic is very in-depth. They believe that Arabic is more complex than other languages on the Internet and not so easy to accurately display and translate, so they study Arabic text classification technology and conduct research on Arabic word combinations [8]. However, related research also shows that although text is widely used by technology, it lacks real optimization, and in the era of big data it is seldom combined with ML.

3. Text Classification Technology

The classification problem includes two processes: learning and classification. The goal of the learning process is to build a classification model based on the known training data to obtain a classifier. The task of the classification process is to use the learned classifier to predict the class label of a new data instance. Figure 1 is a descriptive diagram of the classification problem.

In the figure, (x_1, y_1), ⋯, (x_N, y_N) represents the training dataset that has been labeled with classes, x_i represents a data instance, and y_i represents the class label corresponding to x_i. The learning system learns a classifier P(Y ∣ X) or Y = f(X) from the training data. The classification system then classifies a new input instance x_{N+1} with the learned classifier to predict its output class label y_{N+1} [9, 10].
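To make the learning/classification split concrete, the following is a minimal sketch (scikit-learn, with toy documents and labels invented purely for illustration); it is not the configuration used later in this paper.

```python
# Minimal sketch of the two processes: a learning system fits a classifier on
# labeled pairs (x_i, y_i); the classification system predicts y_{N+1} for x_{N+1}.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stock prices rise", "team wins final",
        "market crash fears", "coach praises squad"]   # toy training texts
labels = ["finance", "sports", "finance", "sports"]    # toy category labels

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # training instances x_1 .. x_N
clf = MultinomialNB().fit(X, labels)        # learning system: builds the model

x_new = vec.transform(["prices fall on market fears"])  # new instance x_{N+1}
print(clf.predict(x_new))                   # classification system outputs y_{N+1}
```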

Figure 2: Text classification framework. The training process runs pretreatment, text representation, feature dimension reduction, classifier construction, and rating of merit on the training text; the classification process runs pretreatment and text representation on the text to be classified and outputs the classification result.
A text classification problem is a guided learning process where the object is text and the task is to automatically classify new input text into one or more predefined categories; each text object may belong to one or more categories. There is an unknown mapping function Φ : D × C → {T, F} between the text set and the category set, where D = {d_1, d_2, ⋯, d_{|D|}} represents the document set to be classified and C = {c_1, c_2, ⋯, c_{|C|}} represents the predefined category set. For each given data pair ⟨d_j, c_i⟩, there are two possible values: T indicates that document d_j belongs to category c_i, and F indicates that d_j does not belong to c_i. That is to say, the goal of the text classification task is to obtain, through the learning process, the optimal estimate of this target mapping function; this estimate is also called the classifier.

The text classification framework is shown in Figure 2, which includes the basic problems that need to be solved. As shown in Figure 2, the main functional modules of the text classification system are briefly described as follows. Preprocessing: in order to improve the quality of text representation and facilitate subsequent processing, preprocessing operations such as formatting are required for the original text corpus. Text representation: the problems that need to be solved here include, first, which language elements should be selected as text features (most often words or phrases) and, second, which model to use to quantify text objects. Feature dimensionality reduction: in order to achieve text classification, it is necessary to select the features from the text that best reflect the subject of the document. Building a classifier: how to design a text classifier is the main research content of text classification methods. First, text that can represent each category in the classification system is selected as the training set, the classifier is learned from the training set, and the classification of new objects is realized. Performance evaluation: the purpose of this step is to evaluate the pros and cons of the classification method and the system performance. Different evaluation parameters can be used for different classification problems; for example, single-label and multilabel classification problems use different parameters. Text classifier performance evaluation methods include recall rate, accuracy rate, F-value, microaverage, and macroaverage, which guide improvement of the classification system.
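As a small illustration of the evaluation measures just listed, the sketch below computes precision, recall, and micro/macro-averaged F-values with scikit-learn; the three-class setup and the labels are toy assumptions for illustration only.

```python
# Minimal sketch of the evaluation measures: precision, recall, F-value,
# and micro/macro averaging over a toy 3-class prediction.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]   # toy gold labels
y_pred = [0, 1, 1, 1, 2, 0]   # toy predictions

print(precision_score(y_true, y_pred, average="macro"))  # mean of per-class precision
print(recall_score(y_true, y_pred, average="macro"))     # mean of per-class recall
print(f1_score(y_true, y_pred, average="micro"))         # F from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average="macro"))         # unweighted mean of per-class F
```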
4. Improved Text Classification Algorithm Based on ML and NLP Algorithms

4.1. TPM Text Classification Algorithm. LSTM and CNN models are among the most commonly used neural network models, and their combination opens up many possibilities [11, 12]. Based on this, this paper proposes the text classification algorithm model shown in Figure 3.

As shown in Figure 3, the vector X can be obtained at the embedding layer:

X = \{x_0, \cdots, x_i, \cdots, x_{n-1}\} = \mathrm{Embedding}(a_0, \cdots, a_{i+1}, \cdots, a_n). \quad (1)

The second step is to transmit the data of the input layer downward, as in the following formula:

H = \left[\overrightarrow{\mathrm{LSTM}}(X), \overleftarrow{\mathrm{LSTM}}(X)\right]. \quad (2)

Figure 3: MS-KNN model classification algorithm (embedding layer, bidirectional LSTM layers, convolution layer, pooling layer, and output layer).

Figure 4: TPM Chinese word segmentation model (embedding layer, BERT encoder, private and shared Bi-GRU layers, and CRF inference layer).

The third step is to extract the corresponding features, as in the following equations:

c_i = \mathrm{Conv}(W_i, H), \quad (3)

p_i = \mathrm{pooling}(c_i). \quad (4)

The final output is given by the following equation:

y_k = \mathrm{softmax}(W_k P + b_k). \quad (5)

Here, W_k represents the linear parameters of the fully connected layer, b_k represents the bias, and y_k represents the probability that the text belongs to a certain class. The specific model structure is shown in Figure 4.

As can be seen from Figure 4, the input of the TPM Chinese word segmentation model is still a piece of preprocessed Chinese text [13, 14]. First, through the embedding layer of the model, the natural language is converted into a text vector that the computer can recognize. Then, the powerful semantic feature extraction ability of the BERT model is used to extract semantic features, which is equivalent to reencoding the text according to its context semantics. Next, according to the original dataset where the input data is located, the semantic feature vector is inputted into the corresponding Bi-GRU model of the private layer, which extracts the features of that dataset that are unique compared to other datasets. At the same time, the semantic feature vector is inputted into the Bi-GRU model of the shared layer, which extracts features common to multiple datasets. Finally, the private features of the data are combined with the public features and put into the corresponding CRF model of the inference layer to obtain the label of each character in the text. According to the label of each character, the input text is then divided into a sequence of words and output, completing the word segmentation operation.
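Returning to the classification model of equations (1)–(5), the following is a minimal PyTorch sketch of that embedding → bidirectional LSTM → convolution → pooling → softmax pipeline. The layer sizes, kernel width, and binary output are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of equations (1)-(5): embedding -> BiLSTM -> convolution
# -> max-over-time pooling -> fully connected softmax output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMCNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64,
                 num_filters=100, kernel_size=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # eq. (1)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                         # eq. (2)
        self.conv = nn.Conv1d(2 * hidden_dim, num_filters, kernel_size) # eq. (3)
        self.fc = nn.Linear(num_filters, num_classes)                   # eq. (5)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)              # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                        # concatenated forward/backward states
        c = F.relu(self.conv(h.transpose(1, 2)))   # (batch, filters, seq_len - k + 1)
        p = F.max_pool1d(c, c.size(2)).squeeze(2)  # eq. (4): pooling over time
        return F.softmax(self.fc(p), dim=1)        # eq. (5): class probabilities

model = BiLSTMCNNClassifier(vocab_size=10000)
probs = model(torch.randint(0, 10000, (4, 32)))    # 4 mock sentences of 32 tokens
```

The bidirectional LSTM supplies context-aware token states, and the convolution/pooling pair then condenses them into a fixed-length vector regardless of sentence length, which is the property the combined model exploits.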

Figure 5: Schematic diagram of the anchor map-based label propagation method (a web-scale database is subsampled to million-scale data, an anchor graph is constructed over anchor points, and seed labels are propagated).
The text classification algorithm includes an active learning stage, for which there are several mainstream active learning methods. Among the pool-based active learning methods, uncertainty sampling is one of the simplest and most commonly used query frameworks. Typical uncertainty sampling methods include least confident (LC), margin sampling (MS), entropy sampling (ES), and centroid sampling (CS). In this paper, margin sampling is chosen as the active learning algorithm because of its excellent performance in mail classification; a small sketch follows.
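A minimal sketch of margin sampling follows (NumPy; the probability matrix and the query size are toy assumptions): the samples whose top two class probabilities are closest are the ones queried for labeling.

```python
# Minimal sketch of margin sampling (MS): query the samples on which the
# classifier is least decisive, i.e., with the smallest top-1/top-2 margin.
import numpy as np

def margin_sampling(proba, k=2):
    """proba: (n_samples, n_classes) predicted probabilities.
    Returns the indices of the k samples with the smallest margin."""
    part = np.sort(proba, axis=1)
    margins = part[:, -1] - part[:, -2]   # best minus second-best probability
    return np.argsort(margins)[:k]        # smallest margin = most uncertain

proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.5, 0.5], [0.8, 0.2]])
print(margin_sampling(proba))             # -> the two most uncertain samples
```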
The second stage is the text classification stage. Traditional text classification algorithms include naive Bayes, k-nearest neighbor, and support vector machine. In this paper, k-nearest neighbor (KNN) is selected for comparison with the support vector machine (SVM). KNN is simple and intuitive, with no explicit learning process and no offline training of classification models. Its basic idea is as follows: given a training set in which the data categories have been determined, when a new sample to be classified is inputted, a similarity measure is used to determine the similarity between the new sample and the training data; the nearest K samples are then found in the training set, and the prediction is made by majority voting. The support vector machine is a widely used text classification method based on statistical learning theory. It was first proposed for binary classification problems; for multiclass problems, multiple classifiers must be built. When constructing a binary SVM classifier, the core task is to find an optimal hyperplane, also called the decision plane, among countless classification interfaces. It best distinguishes the samples of the two categories, and the distance between the categories and this plane is the largest. From a geometric point of view, this hyperplane divides the input space into positive and negative parts; it is a line in two-dimensional space and a plane in three-dimensional space.
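The two baseline classifiers just described can be sketched as follows (scikit-learn, with toy mail snippets invented for illustration): KNN predicts by majority vote over the nearest neighbors, while the linear SVM learns a maximum-margin decision plane.

```python
# Minimal sketch contrasting KNN majority voting with a linear SVM decision
# plane, both over simple TF-IDF features of toy mail texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

train_texts = ["win a free prize now", "meeting agenda attached",
               "cheap loans click here", "lunch tomorrow at noon"]
train_labels = ["spam", "ham", "spam", "ham"]       # toy labels

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# KNN: no explicit training; prediction = majority vote of the K nearest samples.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, train_labels)

# SVM: learns a maximum-margin hyperplane separating the two categories.
svm = LinearSVC().fit(X_train, train_labels)

X_new = vectorizer.transform(["free prize meeting"])
print(knn.predict(X_new), svm.predict(X_new))
```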

Figure 6: Schematic diagram of the loss functions: (a) hinge loss, (b) ramp loss, (c) smoothed loss, (d) the combined optimization objective.
The TPM algorithm in this paper is applied to the research of Chinese word segmentation in multitask learning. In order to speed up training and further enhance the extraction of text semantic information, the original multistandard word segmentation model is improved into a new multistandard word segmentation model, TPM, based on POS_LSTM (particle swarm optimization combined with the long short-term memory network). In the active learning stage and the text classification stage, the TPM algorithm and the boundary (margin) sampling method proposed in this paper are compared with the MS_KNN and MS_SVM algorithms, which combine margin sampling with k-nearest neighbor and support vector machine, respectively.

4.2. Application of NLP in Text Classification. Graph-based semisupervised learning algorithms build a graph of all data samples (labeled and unlabeled) based on their similarity, and each point on the graph represents a data sample. The edge between two nodes is generally defined by some similarity measure, reflecting the connection between samples. There are usually two ways to define similarity: K-adjacency and the Gaussian kernel.

When building a graph, the similarity between two vertices can be defined freely; here it can be assumed that the Gaussian kernel of formula (6) is used. In the process of label transmission, a probability matrix of label transmission needs to be established; its size is (L + U) × (L + U), as shown in the following formula:

T_{ij} = P(i \to j) = \frac{w_{ij}}{\sum_{k=1}^{L+U} w_{ik}}. \quad (6)

The time complexity of this algorithm is high. The label propagation algorithm is a transductive learning method: every time the test set changes, the algorithm must be run again. Therefore, for large-scale tasks, time overhead is a key factor for application promotion. Figure 5 is a schematic diagram of the anchor map-based label propagation method.
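A minimal sketch of the transition matrix construction in formula (6) follows (NumPy); zeroing the diagonal and the bandwidth t = 1 are illustrative assumptions, not choices stated in the paper.

```python
# Minimal sketch of formula (6): build the (L+U) x (L+U) label transition
# matrix by row-normalizing Gaussian-kernel similarities.
import numpy as np

def transition_matrix(X, t=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise D^2(x_i, x_j)
    W = np.exp(-d2 / t)                                  # Gaussian weights w_ij
    np.fill_diagonal(W, 0.0)                             # assumption: no self-transitions
    return W / W.sum(axis=1, keepdims=True)              # T_ij = w_ij / sum_k w_ik

X = np.random.rand(6, 4)            # 6 samples (labeled + unlabeled), 4 features
T = transition_matrix(X)
assert np.allclose(T.sum(axis=1), 1.0)                   # each row is a distribution
```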
The calculation steps of the algorithm are as follows; a sketch of steps (1) and (2) is given after the list.

(1) Use the K-means algorithm to select m anchor points.

(2) Use the following formula to calculate the data-to-anchor (data2anchor) mapping matrix Z(x) between sample points and anchor points:

Z(x) = \frac{\left[\delta_1 \exp\left(-D^2(x, u_1)/t\right), \cdots, \delta_m \exp\left(-D^2(x, u_m)/t\right)\right]^T}{\sum_{j=1}^{m} \delta_j \exp\left(-D^2(x, u_j)/t\right)}. \quad (7)

(3) Using the AGR algorithm, solve for the soft label matrix A^* of the anchor points with the following formula:

A^* = \left(Z_1^T Z_1 + \gamma Z^T Z - \gamma Z^T Z \Lambda^{-1} Z^T Z\right)^{-1} Z_1^T Y_1. \quad (8)

(4) According to the decision function, calculate the label of each unlabeled sample:

\hat{y}_i = \arg\max_{j \in \{1, \cdots, C\}} \frac{Z_i a_j}{\lambda_j}, \quad i = 1, \cdots, n, \quad (9)

where δ is an indicator value, δ ∈ (0, 1), and D is the distance function, which can be defined by the user.
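The following is a minimal sketch of steps (1) and (2) (NumPy and scikit-learn); restricting each sample to its s nearest anchors (i.e., setting δ_j = 1 only for those anchors) and the bandwidth t are illustrative assumptions.

```python
# Minimal sketch of steps (1)-(2): pick m anchor points with K-means, then
# build the data-to-anchor matrix Z of formula (7).
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph_Z(X, m=50, s=3, t=1.0):
    anchors = KMeans(n_clusters=m, n_init=10).fit(X).cluster_centers_  # step (1)
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # squared distances D^2
    Z = np.zeros((X.shape[0], m))
    for i in range(X.shape[0]):
        nearest = np.argsort(d2[i])[:s]   # delta_j = 1 only for the s nearest anchors
        w = np.exp(-d2[i, nearest] / t)   # Gaussian kernel weights
        Z[i, nearest] = w / w.sum()       # row-normalize as in formula (7)
    return Z, anchors

X = np.random.rand(200, 10)               # mock feature vectors
Z, anchors = anchor_graph_Z(X)
```

Steps (3) and (4) then reduce to solving the small m × m linear system of formula (8) for the anchor labels and taking the argmax of formula (9) for each sample, which is what makes the anchor-graph variant scale linearly in the number of samples.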

Although these algorithms reduce the time complexity of graph-based algorithms to linear, the problem of data sparseness has not been properly solved. Therefore, the algorithm has so far achieved application progress only in the field of image classification. We believe that if the sparsity of the task is solved, the anchor graph-based label propagation algorithm can be extended to the field of natural language processing. We take the part-of-speech tagging task as an example and try to generalize the algorithm to NLP [15, 16].

For labeled data, according to traditional support vector machine (SVM) theory, the loss function is the hinge loss of formula (10), shown in Figure 6(a):

H_1(t) = \max(0, 1 - t). \quad (10)

Part of the improvement proposes a smoother version of the loss function, such as formula (11), shown in Figure 6(c):

S(t) = \exp(-3t^2). \quad (11)

In the subsequent experiments, we use the hinge loss, which offers the best overall balance of efficiency and accuracy, as the loss function for labeled data, and the ramp loss R_S, shown in Figure 6(b), as the loss function for unlabeled data. Then, in the following chapters [17, 18], our optimization objective is formula (12), shown in Figure 6(d):

\min_{w,b} \frac{1}{2}\|w\|^2 + e\sum_{i=1}^{L} H_1(y_i f_\theta(x_i)) + e\sum_{i=L+1}^{L+U} R_S(|f_\theta(x_i)|). \quad (12)
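The losses of formulas (10)–(12) can be sketched as follows (NumPy); the ramp-loss parameterization R_s(t) = H_1(t) − max(0, s − t) with s = −1 and the shared weight e are illustrative assumptions, since the paper does not spell them out.

```python
# Minimal sketch of the losses in formulas (10)-(12): hinge loss for labeled
# data, a smoothed variant, and a ramp-style loss for unlabeled data.
import numpy as np

def hinge(t):                 # formula (10): H1(t) = max(0, 1 - t)
    return np.maximum(0.0, 1.0 - t)

def smooth(t):                # formula (11): S(t) = exp(-3 t^2)
    return np.exp(-3.0 * t ** 2)

def ramp(t, s=-1.0):          # assumed ramp: hinge clipped from above at 1 - s
    return hinge(t) - np.maximum(0.0, s - t)

def objective(w, margins_labeled, scores_unlabeled, e=1.0):
    # In the spirit of formula (12): hinge on labeled margins y_i * f(x_i),
    # ramp on unlabeled absolute scores |f(x_i)|, plus the regularizer.
    return (0.5 * np.dot(w, w)
            + e * hinge(margins_labeled).sum()
            + e * ramp(np.abs(scores_unlabeled)).sum())

w = np.array([0.5, -0.2])
print(objective(w, np.array([0.8, 1.2, -0.3]), np.array([0.1, 2.0])))
```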

Figure 8: Fa and Tca values for the TR07 and ES datasets when ∇ ranges over [0, 100] (x-axis: delt; left axis: Tca (ms); right axis: Fa).

5. Text Classification Performance Test

5.1. Experimental Results and Analysis. Figure 7 is a schematic flow chart of the method for testing the text recognition and classification of emails.

Figure 7: Schematic diagram of the email recognition process.

To verify the effectiveness of our method, we conduct experiments on two benchmark datasets: TREC2007 (TR07) and Enron-spam (ES) [19, 20]. They contain spam and legitimate mail, as shown in Table 1.

Table 1: TR07 and ES datasets.

Dataset                    TR07     ES
Spam quantity              50199    17171
Legitimate mail quantity   25220    16545
Total                      75419    33716

5.1.1. Selection of Threshold ∇. The time overhead required for classification is related to the value of the parameter ∇. To obtain an optimized ∇, we conduct the corresponding statistical experiments with the value of ∇ varying between 0 and 100 [21, 22]. Statistical experiments were performed on the TR07 and ES datasets, and the corresponding calculated values of Fa and Tca are shown in Figure 8, where ∇ is denoted as delt.

Figure 9: Tca values for the TR07 and ES datasets when using different methods (MS+KNN, MS+SVM, TPM); x-axis: n, y-axis: Tca (ms).

From Figure 8, we can see that on dataset TR07, when the value of parameter ∇ varies in the interval [0, 30], the value of Fa grows rapidly; as the value of ∇ increases further, the value of Fa stabilizes. On dataset ES, when the value of ∇ varies in the interval [0, 40], the value of Fa increases rapidly, and as ∇ increases further, Fa tends to be stable. Therefore, in order to reduce the time overhead of the sample classification process as much as possible, we set ∇ = 30 when using the TR07 dataset and ∇ = 40 when using the ES dataset.

5.1.2. Comparative Analysis of Time Cost. Suppose that |A0| = 200; the value of ns is set to 100, 200, 300, 400, and 500, and the value of |Si| is set to 60 and 120 [23, 24]. Figure 9 shows the values of the time cost Tca obtained when experiments are performed on the TR07 and ES datasets. The upper part of Figure 9 corresponds to dataset TR07, while the lower part corresponds to dataset ES.

Figure 10: Fa values for the TR07 and ES datasets when using different methods (MS+KNN, MS+SVM, TPM); x-axis: n, y-axis: F value.

When the value of ns varies within the interval [100, 500], the value of Tca produced by the MS+KNN and ES+KNN methods combined with the KNN classification algorithm increases significantly. At the same time, we also notice that the MS+SVM and ES+SVM methods combined with the SVM classifier perform better in terms of computational complexity than the methods combined with the KNN classification algorithm [25, 26]. Likewise, in Figure 9, we can observe that the MS+NB and ES+NB methods combined with the NB classifier have smaller Tca values than the methods combined with the SVM classifier. This is because the computational complexity of the NB classifier is related only to the vector dimension of the feature space. Compared with the MS+NB and ES+NB methods combined with the NB classifier, when the value of ns is greater than 300, the method in this paper clearly has the best performance. This is mainly because the word frequency-based user interest set method proposed in this chapter avoids the direct use of the SVM classifier. Therefore, it effectively reduces the average time overhead of sample classification during the classification process.

5.1.3. Accuracy Comparison between Different Methods. Figure 10 shows the average values of Fa obtained by the methods when experiments are performed on the TR07 dataset and the ES dataset.

It can be seen from Figure 10 that, compared with the other methods, the method combined with the KNN classifier performs the worst. During active learning and classification, as the value of n increases, the Fa value of TPM gets closer and closer to those of the MS+SVM and ES+SVM methods combined with the SVM classifier, and its value is significantly higher than those of the other methods.

5.1.4. Comparative Analysis of Sample Labeling Burden. In order to facilitate the calculation, the initialization parameters for sample labeling are given: |A0| is set to 300, and |Si| is set to 300. For dataset TR07 and dataset ES, the maximum value achieved by F1 in the experiment is defined as FM [27, 28], as shown in Table 2.

Table 2: Corresponding FM and n values when using different methods.

Dataset      ES              TR07
Algorithm    FM      n       FM      n
MS+KNN       0.965   1825    0.967   1682
MS+SVM       0.976   1156    0.981   1215
TPM          0.979   563     0.983   697

From the experimental results in Table 2, it can be seen that when using dataset TR07 and dataset ES, the minimum FM values produced by all methods on these two datasets are 0.961 and 0.964, respectively. For each FM value, the calculated total number of samples recommended to users for labeling is given as n in Table 2. From Table 2, we can also find that when the value of FM is not less than 0.96, the n value of our method is relatively low compared with the other methods [29].

6. Conclusion

This paper is an optimization and improvement study of text classification algorithms. The datasets used in the experiments are the TREC2007 and Enron-spam datasets, and the classification process adopts the support vector machine, the naive Bayes classifier, and the k-nearest neighbor classifier. The experimental results show that, for the TREC2007 and Enron-spam datasets, under the premise of a lower sample annotation burden, the proposed method shows relatively better performance than the other methods when the F1 value is used for evaluation.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This research is supported by the Natural Science Foundation of Heilongjiang Province of China (No. YQ2020G002), the University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (No. UNPYSCT-2020212), and the Science Foundation of Harbin Commerce University (No. 2019CX22 and No. 18XN064).

References

[1] M. Goudjil, M. Koudil, M. Bedda, and N. Ghoggali, "A novel active learning method using SVM for text classification," International Journal of Automation and Computing, vol. 15, no. 3, pp. 290–298, 2018.
[2] M. M. Mironczuk and J. Protasiewicz, "A recent overview of the state-of-the-art elements of text classification," Expert Systems with Applications, vol. 106, pp. 36–54, 2018.
[3] C. L. Liu, W. H. Hsaio, C. H. Lee, T. H. Chang, and T. H. Kuo, "Semi-supervised text classification with Universum learning," IEEE Transactions on Cybernetics, vol. 46, no. 2, pp. 462–473, 2016.
[4] M. Pavlinek and V. Podgorelec, "Text classification method based on self-training and LDA topic models," Expert Systems with Applications, vol. 80, pp. 83–93, 2017.
[5] V. B. Kobayashi, S. T. Mol, H. A. Berkers, G. Kismihók, and D. N. den Hartog, "Text classification for organizational researchers: a tutorial," Organizational Research Methods, vol. 21, no. 3, pp. 766–799, 2018.
[6] K. He, "Prediction model of juvenile football players' sports injury based on text classification technology of ML," Mobile Information Systems, vol. 2021, Article ID 2955215, 10 pages, 2021.
[7] S. M. Shah, H. Ge, S. A. Haider et al., "A quantum spatial graph convolutional network for text classification," Computer Systems Science and Engineering, vol. 36, no. 2, pp. 369–382, 2021.
[8] E. K. Anoual and I. Zeroual, "The effects of pre-processing techniques on Arabic text classification," International Journal of Advanced Trends in Computer Science and Engineering, vol. 10, no. 1, pp. 41–48, 2021.
[9] J. Atwan, M. Wedyan, Q. Bsoul, A. Hamadeen, R. Alturki, and M. Ikram, "The effect of using light stemming for Arabic text classification," International Journal of Advanced Computer Science and Applications, vol. 12, no. 5, pp. 768–773, 2021.
[10] H. Amazal and M. Kissi, "A new big data feature selection approach for text classification," Scientific Programming, vol. 2021, no. 2, 10 pages, 2021.
[11] Q. Wang, W. Li, and Z. Jin, "Review of text classification in deep learning," Open Access Library Journal, vol. 8, no. 3, pp. 1–8, 2021.
[12] X. Luo, "Efficient English text classification using selected machine learning techniques," Alexandria Engineering Journal, vol. 60, no. 3, pp. 3401–3409, 2021.
[13] T. Salles, M. Goncalves, V. Rodrigues, and L. Rocha, "Improving random forests by neighborhood projection for effective text classification," Information Systems, vol. 77, pp. 1–21, 2018.

[14] S. F. Yin, H. Zheng, S. H. Xu, H. Rong, and N. Zhang, "A text classification algorithm based on feature library projection," Journal of Central South University, vol. 48, no. 7, pp. 1782–1789, 2017.
[15] F. A. Wenando, T. B. Adji, and I. Ardiyanto, "Text classification to detect student level of understanding in prior knowledge activation process," Advanced Science Letters, vol. 23, no. 3, pp. 2285–2287, 2017.
[16] W. Cao, A. Song, and J. Hu, "Stacked residual recurrent neural network with word weight for text classification," IAENG International Journal of Computer Science, vol. 44, no. 3, pp. 277–284, 2017.
[17] S. Bahassine, A. Madani, and M. Kissi, "Arabic text classification using new stemmer for feature selection and decision trees," Journal of Engineering Science and Technology, vol. 12, no. 126, pp. 1475–1487, 2017.
[18] S. Yu, D. Liu, W. Zhu, Y. Zhang, and S. Zhao, "Attention-based LSTM, GRU and CNN for short text classification," Journal of Intelligent and Fuzzy Systems, vol. 39, no. 1, pp. 333–340, 2020.
[19] T. Hernandez-Boussard, P. Kourdis, R. Dulal et al., "A natural language processing algorithm to measure quality prostate cancer care," Journal of Clinical Oncology, vol. 35, Supplement 8, pp. 232–232, 2017.
[20] Z. Kong, C. Yue, Y. Shi, J. Yu, C. Xie, and L. Xie, "Entity extraction of electrical equipment malfunction text by a hybrid natural language processing algorithm," IEEE Access, vol. 9, no. 99, pp. 40216–40226, 2021.
[21] Y. Gong, N. Lu, and J. Zhang, "Application of deep learning fusion algorithm in natural language processing in emotional semantic analysis," Concurrency and Computation: Practice and Experience, vol. 31, no. 10, pp. e4779.1–e4779.9, 2019.
[22] W. H. Weng, K. B. Wagholikar, A. T. McCray, P. Szolovits, and H. C. Chueh, "Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach," BMC Medical Informatics and Decision Making, vol. 17, no. 1, pp. 155–167, 2017.
[23] D. G. Morgan, K. Chorneyko, D. Swain, B. Bowes, V. Lee, and J. Tinmouth, "279 - Validation of a natural language processing algorithm to identify colonic adenomas across a health system," Gastroenterology, vol. 156, no. 6, p. S-56, 2019.
[24] J. M. Ehrenfeld, K. G. Gottlieb, L. B. Beach, S. E. Monahan, and D. Fabbri, "Development of a natural language processing algorithm to identify and evaluate transgender patients in electronic health record systems," Ethnicity & Disease, vol. 29, Supplement 2, pp. 441–450, 2019.
[25] C. L. Wi, S. Sohn, M. C. Rolfes et al., "Application of a natural language processing algorithm to asthma ascertainment: an automated chart review," American Journal of Respiratory and Critical Care Medicine, vol. 196, no. 4, pp. 430–437, 2017.
[26] S. Triputra and F. Atqiya, "Implementation of natural language processing in seller-bot for SMEs," Journal of Physics: Conference Series, vol. 1764, no. 1, pp. 012069–012075, 2021.
[27] J. S. Kim, V. Arvind, J. T. Schwartz et al., "P72. Natural language processing of operative note dictations to automatically generate CPT codes for billing," The Spine Journal, vol. 20, no. 9, pp. S181–S182, 2020.
[28] R. W. Chang, L. Y. Tucker, K. A. Rothenberg et al., "Establishing a carotid artery stenosis disease cohort for comparative effectiveness research using natural language processing," Journal of Vascular Surgery, vol. 68, no. 3, pp. e32–e33, 2018.
[29] N. Afzal, V. P. Mallipeddi, S. Sohn et al., "Natural language processing of clinical notes for identification of critical limb ischemia," International Journal of Medical Informatics, vol. 111, pp. 83–89, 2018.
