Article28

The document presents an improved parallel PageRank algorithm designed for spam filtering, which addresses the limitations of the traditional PageRank by implementing a non-uniform distribution of PageRank values among outgoing links. This new method prioritizes important pages while reducing the PageRank of spam pages, thereby enhancing the relevance of search results. The algorithm has been implemented on NVIDIA GPU architecture using CUDA, demonstrating improved computational efficiency in filtering spam compared to existing methods.


Indian Journal of Science and Technology, Vol 9(38), DOI: 10.17485/ijst/2016/v9i38/90410, October 2016
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Improved Parallel PageRank Algorithm for Spam Filtering
Hema Dubey1*, Nilay Khare1, K. K. Appu Kuttan1 and Shreyas Bhatia2
1 Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, Bhopal - 462003, Madhya Pradesh, India; [email protected], [email protected], [email protected]
2 Adobe Systems, Noida - 201304, India; [email protected]

Abstract
Background/Objectives: PageRank is a well-known link-based technique introduced by Google for indexing its web pages. The algorithm works on the linking structure of web pages, that is, the inbound and outbound links of pages. The existing PageRank algorithm follows an equal distribution law; that is, it distributes the PageRank of a web page evenly among all of its outgoing links. The problem with this uniform distribution is that uninteresting pages sometimes receive high PageRank values. Methods/Statistical Analysis: This paper proposes an improved parallel PageRank algorithm that distributes the PageRank values non-uniformly among all the outgoing links. The proposed work has been implemented on the NVIDIA Quadro 2000 GPU architecture using the CUDA programming language. Findings: The proposed algorithm mitigates spam and provides better results in terms of computational time as compared to Parallel PageRank, because it assigns higher priority to important pages and lower priority to less important web pages. Assigning values in this fashion increases the PageRank value of important pages and decreases the PageRank value of unrelated pages, that is, spam pages. Application: The proposed work performs spam filtering by classifying important as well as irrelevant web pages.

Keywords: CUDA, GPU, Non-Uniform Distribution, Parallel Page Rank, Spam Pages

*Author for correspondence

1. Introduction

With the rapidly growing World Wide Web (WWW), search engines play an imperative role in extracting relevant information from the internet. The search engine is an important part of any Information Retrieval (IR) system. A search engine creates an index and matches queries against it. The index includes the words present in each of the documents, plus pointers to the locations of those words within the documents; this is known as an Inverted File. There are four essential modules present in the search engine [1]:

• A Document Processor.
• A Query Processor.
• A Searching and Matching Function.
• A Ranker.

Figure 1 depicts that the Document Processor performs pre-processing, recognizes probable indexable items in a document, removes stop words, stems words, extracts index entries, assigns weights to terms and then finally creates the index. The Query Processor carries out several steps, among them tokenizing the query expression into comprehensible segments, parsing, removing stop words, stemming words, creating the query representation, expanding the query expression and weighting the query expression [1].

Figure 1. Mechanism of search engine.

When a user fires a query in a search engine such as Google, Yahoo, Baidu or Bing, the search engine returns millions of web pages. Some pages are relevant, important and useful for the user, but some are less important. It therefore becomes necessary to rank web pages in accordance with their relevance and pertinence so as to display significant web pages at the top of the search results list. The relevance and popularity of web pages are generally computed by examining the hyperlink structure of the web graph. In 1998 the co-founders of Google, Sergey Brin and Larry Page, developed the PageRank algorithm, based on link analysis, to calculate the prominence scores of web pages [2]; it has since been used by Google as a part of its popular search engine.

The PageRank score of a web page is computed from its in-links. The PageRank algorithm is based on the principle that if a web page x is pointed to by many other important web pages, then web page x is also important. As Google became successful, the superiority of PageRank over other link analysis algorithms has been proved [3–5].

PageRank calculation is a challenging task. The first challenge is the extremely large volume of the World Wide Web, whose size is estimated at trillions [6] of web pages; therefore, a lot of computing effort is required. The second challenge results from the dynamic nature of the World Wide Web. Every day pages are added to or removed from the WWW, so pages and their content are continually changing, which changes the web structure [4]. Despite these challenges, the PageRank values must be up to date at all times, and hence the computation of PageRank scores must be performed as fast as possible. Given these challenges, it becomes vital to run the PageRank algorithm on High-Performance Machines. The search also returns a lot of spam pages, to which the original PageRank is blind [7]. The proposed algorithm assigns a non-uniform distribution of PageRank values to web pages, that is, it gives more importance to more important web pages and less importance to less important web pages.

At present, three main categories of web spam exist: the first is link spam, the second is content spam and the third is cloaking. Link spam creates links between web pages to increase the ranking of the spammer's websites. Content spam alters the contents of web pages by stuffing in extra keywords associated with popular query terms. Cloaking is the practice of presenting altogether different content to the user's browser as compared to the content presented to the search engine spider [8,9].

The rest of this paper is organized as follows: Section 2 introduces related work done in the field of PageRank computation; Section 3 describes GPU and CUDA; Section 4 discusses the traditional PageRank algorithm; Section 5 discusses the methodology used to parallelize the PageRank algorithm using non-uniform PageRank values; Section 6 describes the experimental work and results in detail; Section 7 summarizes the proposed research work in the form of a conclusion and future prospects. Finally, references are presented.

2. Related Work

Much research has been done on ranking web pages. The authors of [10] put forward a method to calculate review relevance by considering not only the similarity and correlation but also the votes for each review, and showed that their method works well in terms of effectiveness and accuracy. The authors of [11] proposed a Link-Click-Concept based ranking algorithm based on user profile construction. The authors of [12] proposed a method to monitor Simple Mail Transfer Protocol (SMTP) sessions and email addresses for detecting spam messages. There has been an earnest effort to reduce the computational time of the PageRank algorithm. The authors of [13] presented Parallel Relaxed and Extrapolated algorithms based on the Power method for accelerating the PageRank computation. The authors of [14] proposed PageRank computation in a distributed environment using an iterative aggregation and disaggregation technique to effectively speed up the PageRank computation. The authors of [15] presented a parallel computation of PageRank on a Gigabit PC cluster and showed major improvement. The authors of [16] applied the Gauss-Seidel iterative method in a parallel computing environment for improving convergence. The authors of [17] used site-based partitioning and repartitioning methods to balance load and minimize the communication overhead for parallel computations of PageRank values. The authors of [18] performed experiments on a cluster of 70 nodes and found that the PageRank computation on linear systems is faster than the power method. The authors of [19] concentrated on
comparing the running time of the PageRank algorithm on both GPU and CPU clusters. Other research work in this regard is by [20], who implemented the power method for parallel computation of the PageRank algorithm on AMD GPUs using the OpenCL programming language. Our research work is an attempt to reduce the PageRank values of spam pages so that important web pages are shown at the top of the Google search results. We have implemented this work on the NVIDIA Quadro 2000 GPU architecture using the CUDA programming language.

3. GPU and CUDA

GPUs are processors conventionally designed for graphics processing in computer games. A GPU is a multicore processor with extremely parallel processing ability and high-speed memories. GP-GPU stands for general purpose computing on graphics processing units, that is, using the GPU for computational jobs that are typically handled by the CPU [3]. The architecture of the GPU is shown in Figure 2. The GPU evolved due to the voracious market demand for high-quality, real-time graphics. Today the GPU is a processor with a plenitude of cores, which supports high parallelism and multithreading with high memory bandwidth. The GPU runs on the SIMD (Single Instruction Multiple Data) model: each core within a multiprocessor executes the identical instruction, but on different data. A GPU contains one or more scalable multithreaded Streaming Multiprocessors (SMs). The SMs are designed to execute a plenitude of threads concurrently.

CUDA (Compute Unified Device Architecture) allows programmers to define kernels that will execute on the GPU. The programmer does not need to be aware of the GPU hardware; instead they see a number of threads which are organized into blocks. GPUs using CUDA can be utilized for General Purpose Computing. Unlike the CPU, the GPU has a parallel throughput architecture which helps in executing a plethora of threads concurrently [21,22]. A grid is assigned to a GPU for execution, and each grid is further divided into blocks. A block is assigned to a streaming multiprocessor (SM), and all the threads within a block use the processing elements of only its assigned multiprocessor. If the number of SMs is less than the number of blocks, then multiple blocks can be assigned to a single SM at a time, depending on the availability of memory in the SM. Threads within all the blocks assigned to the same SM execute simultaneously. In an SM, the threads within a block are split into groups of 32 threads; such a group is called a warp. All threads of a warp execute in parallel and execute the same instruction at a time. An SM can execute 768 threads simultaneously, that is, 24 SIMD warps of 32 threads. Maximum parallelism is achieved if all warps have 32 threads [21,22].

4. PageRank Algorithm

The famous search engine Google uses the PageRank algorithm. This algorithm assigns each web page a numeric weight so as to measure the relative importance of each page within a hyperlinked structure of web pages. PageRank is based on the random surfer model [23], where a random surfer (a user) switches to a random page after clicking on several links. The PageRank value determines the chance that the random surfer will arrive at that page by clicking on a link. This can be viewed as analogous to a Markov chain in which the pages are states and the equally probable transitions are the links between the pages.

The Google algorithm as described in [2] is given as
follows:

$$PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(B_i)}{C(B_i)}$$

where PR(A) is the PageRank of page A, B_1, ..., B_n are the pages that link to page A, PR(B_i) is the PageRank of page B_i, C(B_i) is the number of outbound links on page B_i, and d is a damping factor which can be set between 0 and 1.
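
As a rough illustration of the update described above, assuming the standard form from [2], PR(A) = (1 - d) + d * (sum over pages B_i linking to A of PR(B_i) divided by the number of outbound links of B_i), the following Python sketch applies it repeatedly to a tiny hand-made graph. The function name `pagerank_uniform`, the input format, the toy graph, the damping value d = 0.85, the step cap and the tolerance are all assumptions chosen for this example and do not come from the paper. The loop is sequential; it is not the CUDA implementation the authors describe.

```python
def pagerank_uniform(out_links, d=0.85, max_steps=200, tol=1e-8):
    """Iterate PR(A) = (1 - d) + d * sum(PR(B) / C(B)) over pages B linking to A.

    out_links is a hypothetical input format: it maps every page to the list
    of pages it links to, and every page must appear as a key.
    """
    pages = list(out_links)
    # Build incoming-link lists once so each update visits only the pages B that link to A.
    in_links = {p: [] for p in pages}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    prev = {p: 1.0 for p in pages}  # every page starts with a score of 1.0
    for _ in range(max_steps):
        curr = {}
        for a in pages:
            # Uniform distribution: B splits its score evenly over its C(B) outgoing links.
            share = sum(prev[b] / len(out_links[b]) for b in in_links[a])
            curr[a] = (1 - d) + d * share
        # Stop once no score moved by more than tol since the previous step.
        if max(abs(curr[p] - prev[p]) for p in pages) < tol:
            return curr
        prev = curr
    return prev


# Toy graph, assumed for illustration: A and C both link to B, and B links back to both.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = pagerank_uniform(toy_graph)
```

On this toy graph, page B, which receives links from both A and C, ends up with the highest score. Because B divides its score evenly between its two outgoing links, A and C receive identical scores, which is exactly the uniform behaviour the text describes.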
Figure 2. Architecture of GPU.

The algorithm uses a uniform distribution of PageRank among a page's outgoing links. A page that is pointed to by several other pages becomes important. It does not consider the fact that some links are more important than
others and hence should receive more PageRank value than other links. Furthermore, business firms may create spam web pages that add hyperlinks to their home pages for promotional and advertising purposes.

5. Improved Parallel PageRank using Non-Uniform Distribution

The proposed PageRank algorithm introduces a function f(x) in the calculation of the PageRank scores of web pages. The function f(x) can be a polynomial, logarithmic or exponential function. The purpose of f(x) is to provide a non-uniform distribution among all the given PageRanks. The idea behind a non-uniform distribution is that more important pages have a higher PageRank and thereby become more important relative to less important pages; so any single page with a link pointing to an important page should pass on a larger amount of PageRank to it than to less important pages. The original PageRank [2,3] uses a uniform PageRank distribution for the calculation of PageRank.

The formula used in our implementation is as follows:


5.1 Proposed Parallel PageRank Pseudo Code

G := set of pages
threshold := convergence factor
function f(p)   // non-uniform distribution function

for each page p in G do in parallel
    p.prev_PageRank := 1.0   // p.PageRank is the PageRank score of page p

function PageRank(G)
    for step from 1 to k do   // run the algorithm for k steps
        for each page p in G do in parallel
            p.PageRank := 0
            for each page q in p.incomingNeighbors do
                sum := 0.0
                for each page r in q.outgoingNeighbors do
                    sum += f(r.prev_PageRank)
                p.PageRank +=
            p.PageRank := (1 - d) + p.PageRank * d   // d is the damping factor
        mx := 0   // to find the maximum difference
        for each page p in G do   // find the maximum difference
            mx := max(|p.PageRank - p.prev_PageRank|, mx)
        if mx is less than threshold do   // convergence reached
            stop   // stop the program
        for each page p in G do in parallel
            p.prev_PageRank := p.PageRank

6. Experimental Results

The experiment aims to confirm that using a non-uniform distribution of PageRank improves the PageRank of relevant pages while reducing the PageRank of irrelevant pages, which are usually spam links. This method suppresses spam by giving higher PageRank values to more important pages, whereas the spam pages, which have relatively low PageRank values, are given an even lower value than other pages. Hence the value of the spam pages is suppressed. The proposed work uses different functions for the calculation of PageRank values under non-uniform distribution; the functions used are: Exponential Function: [...] - 1, Logarithmic Function: [...] and Square Function: [...].

We have implemented both the existing PageRank algorithm and the proposed PageRank algorithm on the NVIDIA Quadro 2000 GPU architecture using the CUDA programming language. The PageRank values obtained from the existing and proposed PageRank algorithms are then compared in Tables 2, 3 and 4.

We have used CUDA on a 64-bit Windows platform for the parallel implementation of PageRank. The CPU is an Intel Xeon processor clocked at 2.30 GHz with 4 GB of RAM. The GPU used is NVIDIA Tesla C2075 with compute capability 2.0.

The experiment is performed on three datasets taken from the Stanford Large Network Dataset Collection [24]. These
are the Amazon Product Co-Purchasing Network, an Internet Peer-to-Peer Network and a Web graph. The details of the datasets are shown in Table 1. The input data were sorted according to the in-degree of the graph for better performance.

Table 1. Properties of the data sets [20] used in the experiments

Dataset Name    Type      Nodes    Edges      Description
Amazon0302      Directed  262,111  1,234,877  Amazon product co-purchasing network from March 2 2003
p2p-Gnutella31  Directed  62,586   147,892    Gnutella peer to peer network from August 31 2002
Web-NotreDame   Directed  325,729  1,497,134  Web graph of Notre Dame

The Amazon Product Co-Purchasing Network dataset, "Amazon0302", consists of 262,111 nodes and 1,234,877 edges. Only a few nodes are shown in Table 2 so that the results can be visualized properly. Table 2 shows how the value for each node changes according to the function. Node number 10 shows an increment in PageRank value in this case. Only the relevant or important pages show a high-valued local maximum in Table 2. Other nodes show mitigation in PageRank values; these nodes are considered to be spam or irrelevant pages.

Table 2. PageRank values of some nodes of dataset 1: Amazon's Product Co-Purchasing Network - "Amazon0302"

                              Proposed Parallel PageRank Algorithm
Node No.  Parallel PageRank   Exponential Function  Logarithmic Function  Square Function
4         0.187242542         0.150228              0.152202              0.150006
5         2.653685497         3.09225               2.56103               3.27858
6         32.39687142         22.0483               51.2055               21.9181
7         45.6258104          28.3825               87.9957               28.4655
8         49.54418857         0.533287              89.6108               0.5795
9         223.3873659         1884.57               617.616               2199.66
10        41.28655134         0.309471              71.9561               0.195143
11        13.17856673         6.20527               14.9291               2.23216
12        6.628396633         6.97554               7.36729               7.47488
13        6.768893738         27.6338               10.9088               29.7742
14        103.6555688         0.82321               294.436               2.72674

The proposed algorithm is implemented on the dataset "p2p-Gnutella31", which consists of 62,586 nodes and 147,892 edges. We have shown only a few nodes in Table 3 for better visualization. A couple of local maxima can be seen in Table 3, which correspond to relevant pages. The PageRank values of relevant pages are enhanced by using the Exponential, Logarithmic and Square functions, as can be seen from Table 3. Nodes whose PageRank values are suppressed are considered to be spam pages.

Table 3. PageRank values of some nodes of dataset 2: Gnutella Peer to Peer Network - "p2p-Gnutella31"

                              Proposed Parallel PageRank Algorithm
Node No.  Parallel PageRank   Exponential Function  Logarithmic Function  Square Function
1         0.541431794         1.06606               1.052                 1.05132
2         0.742006559         2.61749               2.08039               7.98779
3         0.349563626         0.267324              0.378557              0.163827
4         0.963083037         4.33025               2.60139               10.2946
5         0.252753979         0.175646              0.198987              0.150852
6         0.37433706          0.210706              0.292558              0.150535
7         0.654885792         1.93907               1.75213               3.09926
8         0.380693168         0.214274              0.290963              0.151892
9         0.41766649          0.299011              0.424158              0.166012

Table 4. PageRank values of some nodes of dataset 3: Web graph - "web-NotreDame"

                              Proposed Parallel PageRank Algorithm
Node No.  Parallel PageRank   Exponential Function  Logarithmic Function  Square Function
51        2.75361792          0.398098              6.36853               0.398025
52        3.94637185          0.205065              23.901                0.156454
53        2.13233258          0.166166              6.52146               0.15188
54        5.26112282          1.17416               10.3334               1.3005
55        6.0156624           0.319871              10.1728               0.357065
56        42.2198826          6.81718               70.2482               12.1533
57        1.99740612          0.15                  1.19805               0.15
58        2.22523246          0.15                  1.33911               0.15
59        33.1424986          15.1747               64.4203               23.1498
60        12.2011788          0.213027              28.28                 0.150216

Table 4 shows the results of the existing and proposed algorithms on the dataset "web-NotreDame", which consists of 325,729 nodes and 1,497,134 edges. We have shown only
some of the nodes in Table 4; this table clearly depicts how the PageRank values are enhanced for the relevant pages. The important pages remain important, but with greater PageRank values. As seen from the table, the proposed algorithm produces higher peak values for the relevant web pages.

7. Conclusion

This paper proposed an improved PageRank algorithm based on the non-uniform distribution of PageRank scores to calculate the PageRank of web pages. The proposed work performs spam filtering by classifying important as well as irrelevant web pages. This filtering is based on the hyperlink structure of the web, that is, on the outbound and inbound links of a web page. The apparent importance of a web page is determined by its PageRank value. Important pages show higher values whereas spam pages show lower values. Further improvements to spam filtering can be made using web content mining. We performed the experiments on CPU and GPU using the CUDA programming language with different standard datasets.

8. References

1. Article title. 2015. Available from: http://www.infotoday.com/searcher/may01/liddy.htm
2. Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Computer Network and ISDN Systems. 1998; 30(1-7):107–17.
3. Duong NT, Nguyen QAP, Nguyen AT, Nguyen HD. Parallel PageRank computation using GPUs. Proceedings of the Third Symposium on Information and Communication Technology ACM; New York, USA. 2012 Aug. p. 223–30.
4. Dubey H, Roy BN. An improved PageRank algorithm based on optimized normalization technique. IJCSIT. 2011; 2(5):2183–8.
5. Tarun K, Parikshit S, Ankush M. Parallelization of PageRank on multicore processors. Distributed Computing and Internet Technology. Proceedings of 8th International Conference, ICDCIT 2012; Springer Berlin Heidelberg: Bhubaneswar, India. 2012 Feb; 7154:129–40.
6. Article title. 2015. Available from: https://www.google.co.in/insidesearch/howsearchworks/crawling-indexing.html
7. Pu BY, Huang TZ, Wen C. An improved PageRank algorithm: Immune to spam. IEEE Fourth International Conference on Network and System Security; Melbourne. 2010 Sep. p. 425–9.
8. Hua J, Huaxiang Z. Analysis on the content features and their correlation of web pages for spam detection. IEEE Transactions on Communications. China. 2015 Mar; 12(3):84–94.
9. Almomani A, Obeidat A, Alsaedi K, Obaida MAH, Al-Betar M. Spam E-mail filtering using ECOS algorithms. Indian Journal of Science and Technology. 2015 May; 8(S9). DOI: 10.17485/ijst/2015/v8iS9/55320.
10. Shri JMR, Subramaniyaswamy V. An effective approach to rank reviews based on relevance by weighting method. Indian Journal of Science and Technology. 2015 Jun; 8(11). DOI: 10.17485/ijst/2015/v8i11/61768.
11. Geetha Rani IS, Sorana Mageswari M. A link-click-concept based ranking algorithm for ranking search results. Indian Journal of Science and Technology. 2014 Jan; 7(10). DOI: 10.17485/ijst/2014/v7i10/50682.
12. Anbazhagu UV, Praveen JS, Soundarapandian R, Manoharan N. Efficacious spam filtering and detection in social networks. Indian Journal of Science and Technology. 2014 Nov; 7(S7). DOI: 10.17485/ijst/2014/v7iS7/61956.
13. Arnal J, Migallon H, Migallon V, Palomino JA, Penades J. Parallel relaxed and extrapolated algorithms for computing PageRank. The Journal of Supercomputing. 2014 Nov; 70(2):637–48.
14. Zhu Y, Ye S, Li X. Distributed PageRank computation based on iterative aggregation-disaggregation methods. Proceedings of the 14th ACM International Conference on Information and Knowledge Management; New York, USA. 2005 Oct. p. 578–85.
15. Manaskasemsak B, Rungsawang A. Parallel PageRank computation on gigabit PC Cluster. Proceedings of 18th IEEE International Conference on Advanced Information Networking and Applications AINA. 2004 Mar; 1:273–7.
16. Kohlschütter C, Chirita PA, Nejdl W. Efficient parallel computation of PageRank. Advances in Information Retrieval. 28th European Conference on IR Research, ECIR 2006, Springer Berlin Heidelberg; London, UK. 2006 Apr; 3936:241–52.
17. Cevahir A, Aykanat C, Turk A, Barla Cambazoglu B. Site-based partitioning and repartitioning techniques for parallel PageRank computation. IEEE Transactions on Parallel and Distributed Systems. 2011 May; 22(5):786–802.
18. Gleich D, Zhukov L, Berkhin P. Fast parallel PageRank: A linear system approach. Technical Report; 2004.
19. Cevahir A, Aykanat C, Turk A, Barla Cambazoglu B, Nukada A, Matsuoka S. Efficient PageRank on GPU clusters. IPSJ SIG Technical Report. 2010; 2010(21):HPC-128.
20. Wu T, Wang B, Shan Y, Yan F, Wang Y, Xu N. Efficient PageRank and SpMV computation on AMD GPUs. IEEE International Conference on Parallel Processing; San Diego. 2010 Sep. p. 81–9.
21. Article title. 2015. Available from: http://docs.nvidia.com/cuda/cuda-c-programming-guide
22. Bhatia S, Tolpadi M, Rasool A. Importance of GPGPUs in efficiency improvement of real world applications. IEEE Students' Conference on Electrical, Electronics and Computer Science (SCEECS); Bhopal. 2014 Mar. p. 1–6.
23. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing Order to the Web. Stanford University: Technical report; 1999.
24. Article title. 2015. Available from: https://snap.stanford.edu/data