Research of Integrated Algorithm: Establishment of A Spam Detection System


2015 4th International Conference on Computer Science and Network Technology (ICCSNT 2015)

Research of Integrated Algorithm


Establishment of a Spam Detection System

Ruxi Yin
Information and Engineering College
Capital Normal University
Beijing, China

Hanshi Wang, Lizhen Liu
Information and Engineering College
Capital Normal University
Beijing, China

Abstract-- More and more people take part in building the Internet, consciously or not, by posting their individual comments on it. In today's big data era, mining customers' opinions has become one of the most effective ways to make full use of this large amount of information. Opinion mining, a new branch of unstructured information mining, mainly covers sentiment analysis, feature mining and subjective comment recognition. It is also an important part of knowledge discovery, often used to extract hidden information from unstructured or semi-structured data. Among the key algorithms for opinion mining and integration, an opinion integration algorithm is a calculating method that ignores the non-significant parts of the comments: it skips the minor issues in users' comments, focuses on the useful information, and sums it up into conclusions of practical value. Research on opinion integration algorithms consists of four parts, namely opinion spam detection, opinion summarization, opinion visualization and opinion assessment. This paper focuses on opinion spam detection methods. Spam refers to fake user reviews, that is, well-designed fake comments written by an individual or an organization to promote or damage a specific product. Identifying spam comments is therefore an important task for improving the authenticity and accuracy of opinion mining. We regard this task as a classification problem. Using web crawlers, a word segmentation system and manual labeling, we acquired a large number of online comments. By training on these data and selecting the relevant features, we build a classifier. The experimental results show that the proposed methods achieve the purpose of preliminary comment spam detection.

Keywords-- Sentiment Analysis; Opinion Mining; Opinion Spam Detection

I. INTRODUCTION

With the rapid development of the Web 2.0 era, people have become both creators and disseminators of information. The Internet now contains a vast amount of text, and this text needs to be analyzed in depth and assessed well. Opinion mining has recently become one of the most active areas in computer science. At the same time, electronic commerce, also known as e-commerce, is growing quickly, which leads to fast growth in the number of user comments. These comments do influence other buyers' final choices. Therefore, making good use of the comments will realize their practical value. Opinion mining can operate at the level of whole texts as well as individual sentences. Opinion mining and sentiment analysis involve opinion integration algorithms, conflicting opinion analysis algorithms, etc.

This paper focuses on the opinion integration algorithm and implements a comment spam detection system based on an evidence classifier. As said before, the development of the Internet promotes the development of the economy and technology, and online shopping is becoming more and more popular. When making their final decisions, users tend to rely on online comments. However, some information is posted with a purpose rather than according to the facts, and is useless to users. Such comments should be regarded as spam. If they are not detected and deleted in time, they waste the time users spend making their decisions. A good way to solve this problem is to establish an opinion spam detection system.

The rest of the article is organized as follows. In Section II, we briefly review the related work. Sections III and IV introduce the design and the establishment of the spam detection system. Our experimental results and analysis are in Section V. The final part, Section VI, concludes the paper.

II. RELATED WORK

In this section, we briefly review three kinds of methods that are related to our framework.

Some websites that realize opinion integration, also known as visualized evaluation systems, have already been established, for example the Google search engine and the Bing search engine. Both present products with pictures, brief descriptions and test items. In a word, they present the same kind of product details. However, the Google search engine chooses the conclusive report of representative products, while the Bing search engine integrates individual evaluations of the products. That is to say, differences between users' demands cause small differences in algorithm design. The developers analyze users' comments in order to produce an accurate result. As for spam detection systems, there are also many examples for us to use for reference.

WordPress is a blog platform built with the PHP language. Users can build their own websites on servers that support PHP and MySQL. WordPress can also be used as a content management system. WordPress and spam comments

978-1-4673-8173-4/15/$31.00 ©2015 IEEE, Harbin, China


are inseparable. Blogs at home and abroad all receive a large number of spam comments. Opinion spam has harmful effects, so preventive measures are needed to keep a website efficient. WordPress provides comment filters, which ask users to log in before posting their comments. In this way, comment spam can be blocked. Users must pass a review before their comment is published. If they are on the blacklist, the system automatically records their comments as spam. This method targets logged-in users. The process is cumbersome and may lower users' enthusiasm, but at the same time it significantly increases the number of registered users. WordPress officially recommends an anti-spam plugin called Akismet. Users' comments are submitted to the server via this software. The server uses a specific algorithm to judge whether a comment is spam according to the user's comment history.

This method can effectively block comments posted by marked users, even when they are using another blog address. More than 90 percent of spam comments are generated by robots [1], so blocking spamming robots is an effective approach. myQaptcha, Fancy Captcha and SI CAPTCHA are plugins for screening out spam comments made by robots. Users are asked to perform a manual action in order to prove that their comments are not machine-generated. The process provides a good user experience, since it only requires users to move the mouse to complete the verification. However, plugins may cause incompatibility issues, which add extra trouble. WordPress also offers two code-based methods: one directly blocks the spam, and the other builds a queue to store suspected comments. Generally, the default choice is the former. The code-based method solves the compatibility problem, but it may cause some false positives.

Overall, WordPress, equipped with anti-spam features and an additional pure code-based method, is very effective and offers a good user experience. In addition, there is an LDA model-based blog spam discovery algorithm. This algorithm targets not only the user but also the comment itself, by extracting the implicit features of spam comments and then using an SVM to classify the comments. The method divides spam comments into two categories: explicit and implicit. The first category is identified with rule-based methods, and for the second an LDA model is used to extract the implicit topics. Topic-based feature selection and retrieval model-based methods are then used to analyze and find spam.

Based on the descriptions above, the goal of our project is to build a spam detection system that lets users analyze online comments and measure their credibility. Therefore, the system should have two characteristics: first, a strong distinguishing feature that sets it apart from the existing systems; second, an accurate assessment of specific comments.

Currently, the most significant part of constructing a spam detection system is choosing a classification method. The candidate methods include the decision tree algorithm, the logistic regression algorithm, the neural network algorithm and the Naïve Bayes algorithm. Classification means deriving a function on the basis of the data and then constructing a classification model that can map the data in the database into a specific category. It can also be applied to data prediction. [2] The main classifier algorithms are as follows:

A. Decision Tree Classifier

This method takes a set of attributes and makes a series of decisions based on them to classify data. The whole process amounts to determining one thing according to its characteristics. This kind of classifier can be used to determine the creditworthiness of a person; for example, a decision tree may conclude that "a man with a family, a car valued between $15,000 and $23,000, and two kids" has good credit. A decision tree classifier generates decision trees from a "training set".

B. Selection Tree Classifier

A selection tree classifier uses a technique similar to that of the decision tree classifier for classifying data. The difference is that the selection tree contains special nodes, and these nodes have more than one branch. For example, if we use a tree to distinguish the origins of cars, we could select a node that contains the horsepower, the number of cylinders, and some other information such as vehicle weight. In decision trees, a node can take at most one attribute into consideration, while in selection trees we can consider more possibilities. Selection trees are usually more accurate than decision trees, but they are much larger. The two kinds of classifiers use the same algorithm to generate trees from the training sets.

C. Evidence Classifier

The evidence classifier classifies data by checking the probability of a specific result given a particular attribute. For example, it may judge that a person who owns a fancy car is likely to have good credit. The classification is based on a simple probabilistic model, using the maximum probability value to predict the data. Like the decision tree classifier, this classifier is generated from the training set.

III. THE SPAM DETECTION SYSTEM

A. Introduction of the System

Currently, the study of spam detection systems has not been perfected, but because several classic algorithms already characterize the machine learning process, we focus on the evaluation of the existing algorithms and derive a new solution that can be used for generating a spam detection system. The related work is as follows:

• Acquisition of comments: A total of 800 comments were gathered, mainly from Jingdong.com and Amazon.com, through a web crawler. We then mark the comments with tags: spam comments are marked as 0, and non-spam comments are marked as 1. Meaningless or useless comments, advertisements and extreme comments that distort the facts are marked with 0 according to our criteria. At the same time, we filter the unrelated words by using the method of probability

and statistics. This method has high accuracy and feasibility. Finally, we tag the nouns, adjectives, verbs and strings.

• Construction of the classifier: The classification process mainly contains the following four steps: a) selecting samples, dividing them into positive and negative specimens, and generating a testing group; b) extracting characteristic words from the samples and building a classification model; c) generating results for the testing group according to the classification model; d) calculating the evaluation measures and assessing the performance of the classification model. Of these, the most important step is feature extraction and classification, which will be elaborated in the following parts.

• System test and evaluation: Use a web crawler to grab product comments about mobile phones from Jingdong.com and Amazon.com. Mark the spam and non-spam comments. Train on and test the data and generate a classifier. Finally, output the results. The system evaluation criteria are the corpus size, the speed of importing data and the accuracy of spam detection.

B. The User Interface Module

The user interface module is the interface for interaction between the user and the system. It displays several kinds of content. Users can enter the information that they want assessed. The main interface includes two parts: feature extraction and comment calculation. Users can open manually tagged data from disk files or directly use a linked database as training samples. Finally, users may enter their comments and get a generated score.

C. System Development Environment

This system is written in the Java language and developed in the MyEclipse 8.5 environment. MyEclipse 8.5 is an enterprise-class integrated development environment built as a plug-in on top of Eclipse, mainly used for Java, Java EE and mobile application development. It is powerful and extensive, and in particular supports various open source products. [4]

At the same time, the system is connected to the MySQL database management system, making it more convenient for users to add, delete, modify and query the corpora. This feature gives the system better encapsulation; users can view the contents of a corpus on their own, analyze them and use the data efficiently.

D. Innovation System Outlines

This article starts from the words that make up the comments and introduces a concept of quantitative evaluation. The system recognizes the evaluative words in users' comments, and uses the marked tags and the words' co-occurrence counts to measure the quality of the comments. It extracts features that are associated with the comment and its emotional tendencies by matching them against knowledge of normal comment syntax and parts of speech, and finally gives a comprehensive assessment and a reasonable score.

Structural characteristics of the text are used to identify whether there are too many irrelevant words or too much advertising information in the comments, and the emotional characteristics are used to identify whether the comments contain too much praise or criticism.

IV. THE ESTABLISHMENT

A. Access for Online Comments

Web crawl software: We use the Jsoup-1.7.2 development kit to crawl information from Web pages, analyze the HTML source code and preserve the comments. Jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a convenient API for extracting and manipulating data through DOM, CSS and jQuery-like methods. [5]

B. Preprocessing of Online Comments

• Word segmentation and tagging: Unlike English, in which words are separated by spaces, Chinese text must first be segmented into words before further processing. Segmentation algorithms can be divided into three types: (1) Rules-based: the rules should be based on normal segmentation methods. We translate these methods into machine language and then put the algorithm into code. In this way, computers can work out how to make segmentations by using correct rules. (2) Statistics-based: statistics relies on large amounts of data. We can obtain some laws and analysis from the useful data. (3) Rules-statistics-based: major segmentation methods include the simple pattern-matching method, the maximum-matching method, the reverse maximum-matching method and the bidirectional-matching method.

• Statistical methods: such as statistical-model segmentation, automatic segmentation and dictionary-free segmentation. In this article, we use the ICTCLAS segmentation tool, which was developed by the Institute of Computing Technology of the Chinese Academy of Sciences. ICTCLAS uses a Hierarchical Hidden Markov Model. It integrates all aspects of Chinese lexical analysis into a comprehensive theoretical framework, and obtains the best overall results. [6]

• Manual annotation of spam comments: we use a manual annotation method to mark 800 comments with polarities. Spam comments (reverse comments) are marked with 0, and non-spam comments (positive comments) are marked with 1. Reason for manual annotation: we can calculate accurately on the comments according to the manual tags. Reasons for tagging with Arabic numerals: they are easy to distinguish from characters and easy to handle in the following processes. Arabic numerals are also easy to compare, which makes it easier to calculate and generate the final result.

• Filter irrelevant words: After word segmentation and tagging, some meaningless function words need to be filtered out to improve efficiency. In this article, we use a probability and statistics method to filter the unrelated words. In probability and statistics, a phenomenon whose outcome is uncertain in an individual experiment but shows regularity over a large number of experiments is called a random phenomenon; probability and statistics theory studies and reveals this statistical regularity. In this article, we treat the polarity of a word and the appearance of a word in positive and negative comments as random events. Let A and B be two events in test E. If P(A)>0, the conditional probability P(B|A) is defined. In general, the occurrence of A affects that of B, which means that P(B|A)≠P(B). Only when it does not do we have P(B|A)=P(B), and then P(AB)=P(B|A)P(A)=P(A)P(B). If two events A and B satisfy this equation, we say that A and B are independent of each other.

TABLE I. THE MEANINGS OF THE SIGNS

Signs | Meanings | Practical meanings
P(A) | The probability that event A occurs. | The appearance probability of word A.
P(B) | The probability that event B occurs. | The appearance probability of non-spam comments.
P(C) | The probability that event C occurs. | The appearance probability of spam comments.
P(AB) | The probability that events A and B occur at the same time. | The probability that word A appears in non-spam comments.
P(AC) | The probability that events A and C occur at the same time. | The probability that word A appears in spam comments.

When independence is judged from estimated probabilities, the equation is rarely satisfied exactly. So, in the implementation, we allow an error tolerance, which we set to 0.01 after numerous experiments.

C. Generation of the Classifier

If X is a discrete random variable taking values in R, with probability distribution p(x) = P(X = x), x ∈ R, then the entropy H(X) is defined as:

$$H(X) = -\sum_{x \in R} p(x) \log_2 p(x) \qquad (1)$$

And we define 0log0=0. Entropy is also known as self-information. It can be regarded as a number that describes the uncertainty of a variable. Greater entropy means greater uncertainty.

According to the chain rule:

$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y) \qquad (2)$$

So: H(X)-H(X|Y) = H(Y)-H(Y|X). This difference is called the mutual information of X and Y, denoted by I(X;Y). Equivalently: if (X, Y) ~ p(x, y), then the mutual information between X and Y is I(X;Y) = H(X)-H(X|Y). Mutual information is a symmetric and nonnegative measure. I(X;Y) reflects the reduction in the uncertainty of X after knowing Y.

Fig. 1. The relationship of entropy and mutual information

D. Tests for the Classifier

From the equation, we know that:

$$
\begin{aligned}
I(X;Y) &= H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) \\
&= \sum_{x} p(x) \log_2 \frac{1}{p(x)} + \sum_{y} p(y) \log_2 \frac{1}{p(y)} + \sum_{x,y} p(x,y) \log_2 p(x,y) \\
&= \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} \qquad (3)
\end{aligned}
$$

Because:

$$H(X \mid X) = 0$$

So:

$$H(X) = H(X) - H(X \mid X) = I(X;X) \qquad (4)$$

This explains why entropy is also known as self-information. It also illustrates that the mutual information between two variables is not a constant, but depends on their entropies. In fact, mutual information reflects the dependency between the two variables. If I(X;Y)>>0, X and Y are highly related; if I(X;Y)=0, X and Y are independent of each other. Since I(X;Y) itself is nonnegative, a negative value can arise only for the pointwise term log2[p(x,y)/(p(x)p(y))] of a single pair (x, y); a strongly negative pointwise value indicates that x and y tend not to occur together. [7]

In this article, mutual information is applied in computing the co-occurrence probabilities of testing words and feature words. In the earlier steps, we have already gathered a group of polar words; in this step, we score the sentences according to the polar words and determine the tendency of each sentence.
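The quantities used in this section can be illustrated with a short Java sketch. It is not the system's actual code: the class name `InformationMeasures`, its method names and the joint-probability table layout are assumptions made for illustration. Only the formulas for entropy and mutual information, the convention 0log0=0, the independence condition P(AB)=P(A)P(B) and the 0.01 tolerance come from the text.

```java
/** Illustrative helpers (hypothetical names) for entropy, mutual information and the independence check. */
class InformationMeasures {

    /** Base-2 logarithm. */
    public static double log2(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    /** H(X) = -sum_x p(x) log2 p(x), using the convention 0 log 0 = 0. */
    public static double entropy(double[] p) {
        double h = 0.0;
        for (double px : p) {
            if (px > 0.0) {
                h -= px * log2(px);
            }
        }
        return h;
    }

    /** I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ) for a joint table joint[x][y]. */
    public static double mutualInformation(double[][] joint) {
        int nx = joint.length;
        int ny = joint[0].length;
        double[] px = new double[nx]; // marginal distribution of X
        double[] py = new double[ny]; // marginal distribution of Y
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                px[x] += joint[x][y];
                py[y] += joint[x][y];
            }
        }
        double mi = 0.0;
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                double pxy = joint[x][y];
                if (pxy > 0.0) { // zero cells contribute nothing (0 log 0 = 0)
                    mi += pxy * log2(pxy / (px[x] * py[y]));
                }
            }
        }
        return mi;
    }

    /** Treats A and B as independent when |P(AB) - P(A)P(B)| is within the tolerance. */
    public static boolean isIndependent(double pA, double pB, double pAB, double tolerance) {
        return Math.abs(pAB - pA * pB) <= tolerance;
    }

    public static void main(String[] args) {
        double[][] independent = {{0.25, 0.25}, {0.25, 0.25}};
        double[][] dependent = {{0.5, 0.0}, {0.0, 0.5}};
        System.out.println(entropy(new double[] {0.5, 0.5}));
        System.out.println(mutualInformation(independent));
        System.out.println(mutualInformation(dependent));
        // Hypothetical estimates: a word seen in 30% of comments, 60% non-spam, 18.5% both.
        System.out.println(isIndependent(0.3, 0.6, 0.185, 0.01));
    }
}
```

Under these assumptions, a word whose occurrence is independent of the comment label within the tolerance carries little evidence about spam and is a candidate for filtering, while a word with high mutual information with the label is a candidate feature word.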

V. THE EXPERIMENT AND ANALYSIS

A. The Experimental Environment

Fig. 2. The application's main page

Fig. 3. Feature Word extraction interface

• Feature extraction: This part implements training on the manually annotated statements, including segmentation and feature extraction. Users can click on the "import files" button to import the comments; the time cost of this process depends on the size of the file. Once imported, the words in the file are automatically saved in the local database and can be called at any time. If the user closes the file by accident at this point, the process can be run again later and the feature words extracted directly. The manual criteria for deciding whether comments are spam are: a. Repeated comments: this kind of comment spam seems normal, but different users post completely or largely similar comments, known as duplicate comments. Such comments are not about a specific product. b. Unrelated comments: comments that have nothing to do with the product are useless. Most of them are produced by machines, and they often contain a large number of hyperlinks. c. Comments with no evaluation of specific attributes but extreme reviews in other areas. Excessive praise or criticism is designed to promote or discredit the product being evaluated and misleads consumers, so it should be classified as spam.

Fig. 4. Calculation interface

• Calculate the tendency of a sentence: This part calculates the polarity of the testing sentence. After entering a comment, the user clicks on the "calculate sentence sentiment" button. The system automatically outputs the score and the final conclusion.

B. Analysis for experimental results

• When the input statement is "The mobile is good and genuine", the positive tendency is bigger than the reverse tendency, and the sentence is classified as a positive one, which means it is not a spam comment. When the test statement is "The seller is a liar, his credibility is bullshit", the positive tendency is smaller than the reverse tendency, so the sentence is classified as a spam comment. According to common sense, the classifications of the two comments are accurate.

• When viewing the internal database with SQLyog, we find that the number of reverse words is much smaller than that of the positive words. This is because most of the comments crawled from the websites are positive, that is, non-spam comments. The lack of reverse words may influence the accuracy of the final decision. The features extracted by the system described in this article may not be perfect, but they are statistically valid; individual examples are not significant. There is no existing method that builds a perfect distinguishing feature between the two kinds of comments.
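The comparison of positive and reverse tendency described in this section can be sketched in Java as a small evidence (naive Bayes style) classifier. The paper does not give its scoring formula, so everything here is an assumption made for illustration: the class name `EvidenceClassifierSketch`, the add-one smoothing, the log-probability scores and the toy training data. Only the label convention (0 for spam, 1 for non-spam) and the rule "classify by the larger tendency" are taken from the text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal evidence-style classifier over already segmented comments (illustrative only). */
class EvidenceClassifierSketch {
    public static final int SPAM = 0;     // label convention from the paper
    public static final int NON_SPAM = 1;

    // word -> {occurrences in spam comments, occurrences in non-spam comments}
    private final Map<String, int[]> wordCounts = new HashMap<>();
    private final int[] totalWords = new int[2];
    private final int[] commentCounts = new int[2];

    /** Adds one manually labelled, segmented comment to the model. */
    public void train(List<String> words, int label) {
        commentCounts[label]++;
        for (String w : words) {
            wordCounts.computeIfAbsent(w, k -> new int[2])[label]++;
            totalWords[label]++;
        }
    }

    /** Log-score of a class: log P(class) + sum of log P(word | class), with add-one smoothing. */
    public double tendency(List<String> words, int label) {
        int vocabulary = wordCounts.size();
        int totalComments = commentCounts[SPAM] + commentCounts[NON_SPAM];
        double score = Math.log((commentCounts[label] + 1.0) / (totalComments + 2.0));
        for (String w : words) {
            int[] counts = wordCounts.get(w);
            int c = (counts == null) ? 0 : counts[label];
            score += Math.log((c + 1.0) / (totalWords[label] + vocabulary + 1.0));
        }
        return score;
    }

    /** Returns NON_SPAM when the positive tendency is at least the reverse tendency, else SPAM. */
    public int classify(List<String> words) {
        return tendency(words, NON_SPAM) >= tendency(words, SPAM) ? NON_SPAM : SPAM;
    }

    public static void main(String[] args) {
        EvidenceClassifierSketch classifier = new EvidenceClassifierSketch();
        // Toy training data, invented for this sketch.
        classifier.train(Arrays.asList("mobile", "good", "genuine"), NON_SPAM);
        classifier.train(Arrays.asList("fast", "delivery", "good"), NON_SPAM);
        classifier.train(Arrays.asList("seller", "liar"), SPAM);
        classifier.train(Arrays.asList("click", "link", "cheap"), SPAM);
        System.out.println(classifier.classify(Arrays.asList("mobile", "good")));  // 1 (non-spam)
        System.out.println(classifier.classify(Arrays.asList("seller", "liar")));  // 0 (spam)
    }
}
```

With a training set as unbalanced as the one described above (far fewer reverse words than positive words), the smoothing term and the class prior dominate the reverse score, which is one way the lack of reverse words can affect the final decision.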

VI. CONCLUSIONS

Our topic is "Research of Integrated Algorithm: Establishment of a Spam Detection System". In this work, we reviewed and summarized the integration algorithms and design methods. Because comment spam detection is the first part of the integrated algorithm, we built a spam detection system. The program design, construction method and experimental procedure are elaborated. We add some innovative modifications to enhance the efficiency and accuracy of the classifier. Our experiments show that the comment spam detection system gives an accurate credibility assessment of users' product reviews.

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation of China under Grants No. 61303105 and 61402304; the Humanity & Social Science General Project of the Ministry of Education under Grant No. 14YJAZH046; the Beijing Educational Committee Science and Technology Development Plan under Grant No. KM201410028017; the Academic Degree Graduate Courses group projects; and the Beijing Key Disciplines of Computer Application Technology.

REFERENCES

[1] Jindal N, Liu B, "Opinion Spam and Analysis," Proceeding of International Conference on Web Search and Web Data Mining. NY USA: ACM, 2008:210-229
[2] J. Ge, Y. Qiu, C. Wu, and G. Pu, "Summary of genetic algorithms research," Application Research of Computers, vol. 25, pp. 2911-2916, 2008.
[3] Zhisong Pan, Bin Chen, "The Research on One-Class classifier," Electronic, vol. 3, p. 87, 2009
[4] Wenjing Zhao, "Research on Product Description Words," Journal of University of Posts and Telecommunications, vol. 3, p. 67, 2010
[5] Shichuan Li, "Solution for Chinese Disorderly Codes," Network Administrator World, vol. 3, p. 2012
[6] Xiao Li, Shengchun Ding, "Identification of waste product review information [J]," Library and information technology, 2013:63-68
[7] Zhenqing Tian, Yue Zhou, "Basic properties of entropy [J]," Journal of Inner Mongolia Normal University, vol. 4, p. 56, 2012
