Research of Integrated Algorithm: Establishment of A Spam Detection System
Abstract-- Nowadays, more and more people take part in building the Internet, consciously or not, by posting their individual comments on it. In today's big data era, opinion mining on customers' opinions has become one of the most effective ways to make full use of this great amount of information. Opinion mining, a new branch of unstructured information mining, mainly involves sentiment analysis, feature mining and subjective comment recognition. It is also an important part of knowledge discovery, often used to extract hidden information from unstructured or semi-structured data. Among the key algorithms for opinion mining and integration, an opinion integration algorithm is a calculation method that ignores the non-significant parts of the comments. That is, it skips the minor issues in the users' comments, focuses on the useful information, and then sums up valuable conclusions for practical application. Research on opinion integration algorithms consists of four parts, namely opinion spam detection, opinion summarization, opinion visualization and opinion assessment. This paper focuses on opinion spam detection methods. Spam refers to fake user reviews, that is, well-designed fake comments written by an individual or an organization to promote or damage a specific product. Therefore, identifying spam comments is an important task for improving the authenticity and accuracy of opinion mining. We regard this task as a classification problem. With the use of web crawlers, a segmentation system and manual labeling, we acquired a large number of online comments. By training on these data and selecting the relevant features, we finally build a classifier. The experimental results show that the methods provided herein achieve the purpose of preliminary comment spam detection.

Keywords-- Sentiment Analysis; Opinion Mining; Opinion Spam Detection

I. INTRODUCTION

With the rapid development of the Web 2.0 era, people have become both the creators and the disseminators of information. Nowadays, the Internet contains a vast amount of text messages, and these messages need to be deeply analyzed and well evaluated. Opinion mining has recently become one of the most active areas in computer science. At the same time, electronic commerce, also known as e-commerce, is growing quickly, which leads to a fast growth in the number of users' comments. These comments do influence other buyers' final choices. Therefore, making good use of the comments will realize their practical value. Opinion mining can be carried out at the level of whole texts and at the level of sentences as well. Opinion mining and sentiment analysis involve opinion integration algorithms, conflicting opinion analysis algorithms, etc.

This paper focuses on the opinion integration algorithm and implements a comment spam detection system based on an evidence classifier. As mentioned before, the development of the Internet promotes the development of the economy and technology, and online shopping is getting more and more popular. When making their final decisions, users tend to rely on online comments. However, some information is posted with a hidden purpose rather than according to the facts, and is useless for users. Such comments should be regarded as spam. If they are not detected and deleted in time, they waste the time users spend making their decisions. A good way to solve this problem is to establish an opinion spam detection system.

The rest of the article is organized as follows. In Section 2, we briefly review the related work. Then in Section 3, the process of establishing the spam detection system is introduced. Our experimental results and analysis are in Section 4. The final part, Section 5, concludes the paper.

II. RELATED WORK

In this section, we briefly review three kinds of methods that are related to our framework.

Nowadays, some websites that realize opinion integration, also known as visualized evaluation systems, have already been established, for example the Google search engine and the Bing search engine. Both present products with pictures, brief descriptions and test items. In a word, they present the same kind of product details. However, the Google search engine chooses the conclusive report of the representative products, while the Bing search engine integrates the individual evaluations of the products. That is to say, differences between users' demands cause small differences in algorithm design. The developers analyze users' comments in order to produce an accurate result. As for spam detection systems, there are also a great number of examples for us to use for reference.

WordPress is a blog platform written in PHP. Users can build their own websites on servers that support PHP and MySQL. WordPress can also be used as a content management system. WordPress and spam comments
and statistics. This method has a high accuracy and feasibility. Finally, we mark the nouns, adjectives, verbs and strings.

Construction of the classifier: The classification process mainly contains the following four steps: a) selecting samples, dividing them into positive and negative samples, and generating a test set; b) extracting characteristic words from the samples and building a classification model; c) generating results for the test set according to the classification model; d) calculating the evaluation measures and assessing the performance of the classification model. Among these, the most important steps are feature extraction and classification, which are elaborated in the following parts.

System test and evaluation: We use a web crawler to collect product comments about mobile phones from Jingdong.com and Amazon.com, mark the spam and non-spam comments, train and test on the data to generate a classifier, and finally report the results. The system evaluation standards are the corpus size, the speed of importing data and the accuracy of spam detection.

B. The User Interface Module

The user interface module is the interface for interaction between the user and the system. It displays several kinds of content. Users can enter their own information that needs assessment. The main interface includes two parts: feature extraction and comment calculation. Users can open manually tagged data from disk files or directly use a linked database as training samples. Finally, users may enter their comments and get a generated score.

C. System Development Environment

This system is written in Java and developed in the MyEclipse 8.5 environment. MyEclipse 8.5 is an enterprise-class integrated development environment built as a set of plug-ins on top of Eclipse, mainly used for Java, Java EE and mobile application development. It is feature-rich and supports a wide range of open source products. [4]

At the same time, it is connected to the MySQL database management system, making it more convenient for users to add, delete, modify and query the corpora. This feature gives the system better encapsulation; users can view the contents of a corpus on their own, analyze them and use the data efficiently.

D. Innovation System Outlines

This article starts from the words that make up the comments and brings up a concept of quantitative evaluation. The system recognizes the evaluative words in users' comments, and uses the marked tags and the words' co-occurrence counts to measure the quality of the comments. It extracts features associated with the comment and its emotional tendency according to knowledge of normal comment syntax and part-of-speech matching, and finally gives a comprehensive evaluation and a reasonable score.

Structural characteristics of the text are used to identify whether there are too many irrelevant words or too much advertising information in the comments, and the emotional characteristics are used to identify whether the comments contain too much praise or criticism.

IV. THE ESTABLISHMENT

A. Access for Online Comments

Web crawl software: We use the Jsoup-1.7.2 development kit to crawl information from Web pages, analyze the HTML source code and preserve the comments. Jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a convenient API for extracting and manipulating data through DOM, CSS and jQuery-like methods. [5]

B. Access for Online Comments

Word segmentation and tagging: Unlike English, Chinese is not written with a space between each word, so the text must first be segmented into words for further processing. Segmentation algorithms can be divided into three types: (1) Rules-based: the rules should be derived from normal segmentation practice. We translate these rules into a formal language and then implement the algorithm in code. In this way, computers can segment text by applying the rules. (2) Statistics-based: statistics relies on large amounts of data, from which we can obtain regularities and analysis. (3) Rules-and-statistics-based. Major segmentation methods include the simple pattern-matching method, the maximum-matching method, the reverse maximum-matching method and the bidirectional matching method.

Statistical methods: these include statistical-model segmentation, automatic segmentation and dictionary-free segmentation. In this article, we use the ICTCLAS segmentation tool, developed by the Institute of Computing Technology, Chinese Academy of Sciences. ICTCLAS uses a Hierarchical Hidden Markov Model. It integrates all aspects of Chinese lexical analysis into one comprehensive theoretical framework and obtains the best overall results. [6]

Manual annotation of spam comments: we manually annotate 800 comments with polarities. Spam comments (negative samples) are marked with 0 and non-spam comments (positive samples) are marked with 1. The reason for manual annotation is that the manual tags give an accurate basis for the calculations on the comments. The reasons for using Arabic numerals as tags are that they are easy to distinguish from characters, easy to handle in the following processes, and easy to compare, which makes calculating and generating the final result easier.
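As an illustration of how the 0/1 annotation can be used, the following minimal Java sketch compares the manual tags with a classifier's output and computes the accuracy, together with precision and recall for the spam class. Precision and recall are common evaluation measures that we assume here; the class name SpamEvaluation, its method names and the sample arrays are hypothetical and are not part of the described system.

```java
import java.util.Locale;

/** Minimal sketch: evaluation measures for comments tagged 0 (spam) or 1 (non-spam). */
final class SpamEvaluation {
    static final int SPAM = 0; // tag used for spam in the manual annotation

    /** Returns {tp, fp, fn, tn}, treating spam as the class of interest. */
    static int[] confusion(int[] gold, int[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < gold.length; i++) {
            boolean isSpam = gold[i] == SPAM;
            boolean saysSpam = predicted[i] == SPAM;
            if (isSpam && saysSpam) tp++;        // spam correctly detected
            else if (!isSpam && saysSpam) fp++;  // non-spam wrongly flagged
            else if (isSpam) fn++;               // spam missed
            else tn++;                           // non-spam correctly kept
        }
        return new int[] {tp, fp, fn, tn};
    }

    static double accuracy(int[] c) {
        int total = c[0] + c[1] + c[2] + c[3];
        return total == 0 ? 0.0 : (double) (c[0] + c[3]) / total;
    }

    static double precision(int[] c) {
        int flagged = c[0] + c[1];
        return flagged == 0 ? 0.0 : (double) c[0] / flagged;
    }

    static double recall(int[] c) {
        int actualSpam = c[0] + c[2];
        return actualSpam == 0 ? 0.0 : (double) c[0] / actualSpam;
    }

    public static void main(String[] args) {
        int[] gold = {0, 0, 0, 1, 1, 1, 1, 1}; // hypothetical manual tags
        int[] pred = {0, 0, 1, 1, 1, 1, 0, 1}; // hypothetical classifier output
        int[] c = confusion(gold, pred);
        System.out.printf(Locale.ROOT, "accuracy=%.3f precision=%.3f recall=%.3f%n",
                accuracy(c), precision(c), recall(c));
    }
}
```

For the sample arrays in main, two of the three spam comments are detected and one non-spam comment is flagged by mistake, so the accuracy is 6/8 and both precision and recall are 2/3.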
Filter irrelevant words: After word segmentation and tagging, some meaningless function words need to be filtered out in order to improve efficiency. In this article, we use a probability and statistics method to filter the unrelated words. In probability and statistics, a phenomenon whose result is uncertain in an individual experiment but shows regularity over a large number of experiments is called a random phenomenon. Probability and statistics theory studies and reveals the statistical regularity of random phenomena. In this article, we treat the polarity of a word and the occurrence of a word in positive and negative comments as random events. Let A and B be two events in a test E. If P(A)>0, the conditional probability P(B|A) can be defined. In general, the occurrence of A affects that of B, which means that P(B|A)≠P(B). Only when there is no such influence do we have P(B|A)=P(B), and then P(AB)=P(B|A)P(A)=P(A)P(B). If two events A and B satisfy this equation, we say that A and B are independent of each other.

defined as: if (X, Y) ~ p(x, y), then the mutual information between X and Y is I(X; Y) = H(X) - H(X|Y). Mutual information is a symmetric and nonnegative measure. I(X; Y) reflects the reduction in the uncertainty of X after knowing Y.
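The independence condition P(AB)=P(A)P(B) suggests one possible filtering rule, sketched below in Java under our own assumptions: event A is "the word occurs in a comment", event B is "the comment carries a given tag", the probabilities are estimated by relative frequencies over the annotated comments, and a word is treated as irrelevant when |P(AB)-P(A)P(B)| does not exceed a tolerance. The class IndependenceFilter, the tolerance value and the English sample tokens are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal sketch: flag words whose occurrence looks independent of the comment tag. */
final class IndependenceFilter {

    /** |P(AB) - P(A)P(B)| with A = word occurs in a comment, B = comment has the given tag. */
    static double dependenceGap(List<String[]> comments, int[] tags, String word, int tag) {
        int n = comments.size();
        if (n == 0 || n != tags.length) {
            throw new IllegalArgumentException("need one tag per comment and at least one comment");
        }
        int a = 0, b = 0, ab = 0;
        for (int i = 0; i < n; i++) {
            boolean hasWord = Arrays.asList(comments.get(i)).contains(word);
            boolean hasTag = tags[i] == tag;
            if (hasWord) a++;
            if (hasTag) b++;
            if (hasWord && hasTag) ab++;
        }
        double pA = (double) a / n;
        double pB = (double) b / n;
        double pAB = (double) ab / n;
        return Math.abs(pAB - pA * pB);
    }

    /** Assumed rule: a word is irrelevant if it is (nearly) independent of the tag. */
    static boolean isIrrelevant(List<String[]> comments, int[] tags, String word, int tag,
                                double tolerance) {
        return dependenceGap(comments, tags, word, tag) <= tolerance;
    }

    public static void main(String[] args) {
        // Hypothetical segmented comments; tag 1 = non-spam, tag 0 = spam.
        List<String[]> comments = Arrays.asList(
                new String[] {"the", "screen", "is", "great"},
                new String[] {"the", "battery", "is", "fine"},
                new String[] {"the", "best", "buy", "now"},
                new String[] {"the", "cheap", "buy", "now"});
        int[] tags = {1, 1, 0, 0};
        System.out.println("the: " + isIrrelevant(comments, tags, "the", 1, 0.05));
        System.out.println("buy: " + isIrrelevant(comments, tags, "buy", 1, 0.05));
    }
}
```

In the sample data, "the" occurs in every comment, so its gap is 0 and it is filtered, whereas "buy" occurs only in spam comments, giving a gap of 0.25, so it is kept as a feature candidate.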
Fig. 1. The relationship of entropy and mutual information
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
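The entropy relations can be checked numerically. The Java sketch below, with hypothetical class and method names, takes a joint distribution table p(x, y), obtains H(X|Y) from the chain rule H(X, Y) = H(Y) + H(X|Y), and then computes I(X; Y) = H(X) - H(X|Y). Logarithms are taken to base 2, which is our assumption; the text does not fix the unit. The table is assumed to be rectangular and non-empty.

```java
/** Minimal sketch: mutual information of two discrete variables from their joint distribution. */
final class MutualInformation {

    private static double log2(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    /** Sum of -p*log2(p) over the given cells; zero-probability cells contribute nothing. */
    static double entropy(double[] p) {
        double h = 0.0;
        for (double v : p) {
            if (v > 0.0) h -= v * log2(v);
        }
        return h;
    }

    /** H(X, Y) for joint[x][y] = p(x, y). */
    static double jointEntropy(double[][] joint) {
        double h = 0.0;
        for (double[] row : joint) h += entropy(row);
        return h;
    }

    /** p(x): sum of each row. */
    static double[] marginalX(double[][] joint) {
        double[] px = new double[joint.length];
        for (int x = 0; x < joint.length; x++) {
            for (double v : joint[x]) px[x] += v;
        }
        return px;
    }

    /** p(y): sum of each column. */
    static double[] marginalY(double[][] joint) {
        double[] py = new double[joint[0].length];
        for (double[] row : joint) {
            for (int y = 0; y < row.length; y++) py[y] += row[y];
        }
        return py;
    }

    /** H(X|Y) = H(X, Y) - H(Y), from the chain rule. */
    static double conditionalEntropyXGivenY(double[][] joint) {
        return jointEntropy(joint) - entropy(marginalY(joint));
    }

    /** I(X; Y) = H(X) - H(X|Y). */
    static double mutualInformation(double[][] joint) {
        return entropy(marginalX(joint)) - conditionalEntropyXGivenY(joint);
    }

    public static void main(String[] args) {
        double[][] independent = {{0.25, 0.25}, {0.25, 0.25}}; // X and Y independent
        double[][] identical = {{0.5, 0.0}, {0.0, 0.5}};       // Y determines X
        System.out.println("independent: " + mutualInformation(independent));
        System.out.println("identical:   " + mutualInformation(identical));
    }
}
```

For the independent table, knowing Y does not reduce the uncertainty of X and the mutual information is 0 bits; for the second table, Y determines X completely and the mutual information equals H(X), which is 1 bit.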
V. THE EXPERIMENT AND ANALYSIS

A. The Experimental Environment

with no evaluation on specific attributes but have extreme reviews in other areas. Meanwhile, excessive praise is designed to discredit the product being evaluated, misleading the consumer audience, so it should be classified as spam.
VI. CONCLUSIONS

Our topic is "Research of Integrated Algorithm: Establishment of a Spam Detection System". In this work, we reviewed and summarized the integration algorithms and design methods. Because comment spam detection is the first part of the integrated algorithm, we completed a spam detection system. The program design, construction method and experimental procedure are elaborated. We added some innovative modifications to enhance the efficiency and accuracy of the classifier. Our experiments showed that the comment spam detection system provides an accurate credibility assessment for the analysis of users' product reviews.

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation of China under Grants No. 61303105 and 61402304; the Humanity & Social Science General Project of the Ministry of Education under Grant No. 14YJAZH046; the Beijing Educational Committee Science and Technology Development Planned under Grant No. KM201410028017; Academic Degree Graduate Courses group projects and the Beijing Key Disciplines of Computer Application Technology.

REFERENCES

[1] N. Jindal and B. Liu, "Opinion Spam and Analysis," Proceedings of the International Conference on Web Search and Web Data Mining. NY, USA: ACM, 2008, pp. 210-229.
[2] J. Ge, Y. Qiu, C. Wu, and G. Pu, "Summary of genetic algorithms research," Application Research of Computers, vol. 25, pp. 2911-2916, 2008.
[3] Zhisong Pan and Bin Chen, "The Research on One-Class classifier," Electronic, vol. 3, p. 87, 2009.
[4] Wenjing Zhao, "Research on Product Description Words," Journal of University of Posts and Telecommunications, vol. 3, p. 67, 2010.
[5] Shichuan Li, "Solution for Chinese Disorderly Codes," Network Administrator World, vol. 3, p. 2012
[6] Xiao Li and Shengchun Ding, "Identification of waste product review information [J]," Library and Information Technology, 2013, pp. 63-68.
[7] Zhenqing Tian and Yue Zhou, "Basic properties of entropy [J]," Journal of Inner Mongolia Normal University, vol. 4, p. 56, 2012.