Article28
Indian Journal of Science and Technology, Vol 9(38), DOI: 10.17485/ijst/2016/v9i38/90410, October 2016 ISSN (Online) : 0974-5645
Abstract
Background/Objectives: PageRank is a well-known link-based technique introduced by Google for ranking the web pages
it indexes. The algorithm works on the link structure of web pages, that is, their inbound and outbound links.
The existing PageRank algorithm follows an equal-distribution rule: it distributes the PageRank of a web page evenly
among all of its outgoing links. The problem with this uniform distribution is that uninteresting pages sometimes
receive high PageRank values. Methods/Statistical Analysis: This paper proposes an improved parallel PageRank
algorithm that distributes PageRank values non-uniformly among the outgoing links. The proposed work has been
implemented on the NVIDIA Quadro 2000 GPU architecture using the CUDA programming language. Findings: The proposed
algorithm mitigates spam and gives better computational time than Parallel PageRank,
because it assigns higher priority to important pages and lower priority to less important web pages. With values
assigned in this fashion, important pages show an increase in PageRank value, and unrelated pages, that is, spam
pages, show a decrease in PageRank value. Application: The proposed work performs spam filtering by distinguishing
important web pages from irrelevant ones.
Keywords: CUDA, GPU, Non-Uniform Distribution, Parallel Page Rank, Spam Pages
Hema Dubey, Nilay Khare, K. K. Appu Kuttan and Shreyas Bhatia
comparing the running time of the PageRank algorithm on both GPU and CPU clusters. Other research work in this regard is by [20], who implemented the power method for parallel computation of the PageRank algorithm on AMD GPUs using the OpenCL programming language. Our research work is an attempt to reduce the PageRank values of spam pages so that important web pages are shown at the top of Google search results. We have implemented this work on the NVIDIA Quadro 2000 GPU architecture using the CUDA programming language.

3. GPU and CUDA
GPUs are processors conventionally designed for graphics processing in computer games. A GPU is a multicore processor with highly parallel processing ability and high-speed memories. GP-GPU stands for general-purpose computing on graphics processing units, that is, using the GPU for computational jobs that are typically handled by the CPU [3]. The architecture of the GPU is shown in Figure 2. The GPU evolved due to the strong market demand for high-quality, real-time graphics. Today the GPU is a processor with many cores, which supports high parallelism and multithreading with high memory bandwidth. The GPU runs on the SIMD (Single Instruction Multiple Data) model: each core within a multiprocessor executes the identical instruction, but on different data. A GPU contains one or more scalable multithreaded Streaming Multiprocessors (SMs). The SMs are designed to execute a large number of threads concurrently.
CUDA (Compute Unified Device Architecture) allows programmers to define kernels that will execute on the GPU. A programmer does not need to know the hardware of the GPU; instead they see a number of threads, which are organized into blocks. GPUs using CUDA can be utilized for general-purpose computing. Unlike the CPU, the GPU has a parallel throughput architecture, which helps in executing a large number of threads concurrently [21, 22]. A grid is assigned to a GPU for execution, and each grid is further divided into blocks. A block is assigned to a multiprocessor (SM), and all the threads within a block use the processing elements of only its assigned multiprocessor. If the number of SMs is smaller than the number of blocks, multiple blocks can be assigned to a single SM at a time, depending on the availability of memory in the SM. Threads within all the blocks assigned to the same SM execute simultaneously. In an SM, the threads within a block are split into groups of 32 threads; such a set of threads is called a warp. All threads of a warp execute in parallel and execute the same instruction at a time. An SM can execute 768 threads simultaneously, that is, 24 SIMD warps of 32 threads. Maximum parallelism can be achieved if all warps have 32 threads [21, 22].

4. PageRank Algorithm
The famous search engine Google uses the PageRank algorithm. This algorithm assigns each web page a numeric weight that measures the relative importance of the page within a hyperlinked structure of web pages. PageRank is based on the random surfer model [23], in which a random surfer (a user) switches to a random page after clicking on several links. The PageRank value of a page determines the chance that the random surfer will arrive on that page by clicking on a link. This can be viewed as analogous to a Markov chain in which the pages are states and the links between the pages are equally probable transitions.
Google's algorithm as described in [2] is given as follows:
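For reference, the formula stated by Brin and Page in [2] can be written as below. The notation is that of [2] and is an assumption relative to the present text: $T_1, \dots, T_n$ are the pages that link to page $A$, $C(T_i)$ is the number of outgoing links of $T_i$, and $d$ is the damping factor, which [2] usually sets to 0.85.

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

Each linking page $T_i$ passes the same share $PR(T_i)/C(T_i)$ to every page it links to, which is the uniform distribution discussed in the abstract. With $d = 0.85$ the constant term is 0.15, which would be consistent with the smallest values that appear in Table 4.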
Improved Parallel Page Rank Algorithm for Spam Filtering
others and hence they should have a higher PageRank value than other links. Furthermore, business firms may also create spam web pages to increase the number of hyperlinks to their home page for promotional and advertising purposes.

    for each page r in q.outgoingNeighbors do
        sum += r.prev_PageRank)
    p.PageRank +=
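The loop fragment above is cut off before the update rule, so the exact rule cannot be read from this text. The sketch below shows one plausible reading, offered as an assumption only: each linking page q divides its previous rank among its out-links in proportion to a weight function of each target's previous rank, instead of evenly. The function name `nonuniform_pagerank`, the `weight` parameter, the damping factor 0.85 and the four-page graph are all hypothetical and do not come from the paper.

```python
import math

def nonuniform_pagerank(out_links, weight, d=0.85, iterations=30):
    """Sketch of a non-uniform PageRank update (assumed rule, not the paper's).

    out_links: dict mapping each page to the list of pages it links to.
    weight:    positive function of a page's previous rank (for example
               math.exp or a square); it decides each target's share.
    """
    pages = list(out_links)
    # Build incoming-neighbour lists so each page can pull from its linkers.
    in_links = {p: [] for p in pages}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append(q)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        prev = rank
        rank = {}
        for p in pages:
            new_rank = 1.0 - d  # constant term, as in Brin and Page [2]
            for q in in_links[p]:
                # Sum of weights over every page q links to; this mirrors the
                # inner loop over q.outgoingNeighbors in the fragment.
                total = sum(weight(prev[r]) for r in out_links[q])
                # ASSUMPTION: p's share of q's rank is weight(p) / total.
                new_rank += d * prev[q] * weight(prev[p]) / total
            rank[p] = new_rank
    return rank


# Hypothetical four-page graph; "s" links to "a" but nothing links to "s".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "s": ["a"]}
uniform = nonuniform_pagerank(graph, weight=lambda x: 1.0)
squared = nonuniform_pagerank(graph, weight=lambda x: x * x)
exponential = nonuniform_pagerank(graph, weight=math.exp)
```

With a constant weight the rule reduces to the even split of the original algorithm. With the square weight, the lower-ranked target `b` receives a smaller share of `a`'s rank than under the even split. A logarithmic weight would need an offset such as log(1 + x) to stay positive; that offset is also an assumption.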
are Amazon's Product Co-Purchasing Network, the Internet Peer-to-Peer Network and Web graphs. The details of the datasets are shown in Table 1. The input data were sorted according to the in-degree of the graph for better performance.
The dataset Amazon's Product Co-Purchasing Network, "Amazon0302", consists of 262,111 nodes and 1,234,877 edges. Only a few nodes are shown in Table 2 so that the results can be visualized properly. Table 2 shows how the value for each node changes according to the function. Node number 10 shows an increment in PageRank value in this case. Only the relevant or important pages show a high-valued local maximum in Table 2. Other nodes show a reduction in PageRank values; these nodes are considered to be spam or irrelevant pages.
The proposed algorithm is also applied to the dataset "p2p-Gnutella31", which consists of 62,586 nodes and 147,892 edges. Only a few nodes are shown in Table 3 for better visualization. A couple of local maxima can be seen in Table 3, which correspond to relevant pages. The PageRank values for relevant pages are enhanced by the Exponential, Log and Square functions, as can be seen from Table 3. Nodes whose PageRank values are suppressed are considered to be spam pages.
Table 4 shows the results of the existing and proposed algorithms on the dataset "web-NotreDame", which consists of 325,729 nodes and 1,497,134 edges. We have shown only
Table 1. Properties of the data sets [20] used in the experiments
Dataset Name    Type      Nodes    Edges      Description
Amazon0302      Directed  262,111  1,234,877  Amazon product co-purchasing network from March 2 2003
p2p-Gnutella31  Directed  62,586   147,892    Gnutella peer to peer network from August 31 2002
Web-NotreDame   Directed  325,729  1,497,134  Web graph of Notre Dame

Table 2. PageRank values of some nodes of dataset 1: Amazon's Product Co-Purchasing Network, "Amazon0302"
Node No.  Parallel PageRank  Proposed Parallel PageRank Algorithm
                             Exponential Function  Logarithmic Function  Square Function
4         0.187242542        0.150228              0.152202              0.150006
5         2.653685497        3.09225               2.56103               3.27858
6         32.39687142        22.0483               51.2055               21.9181
7         45.6258104         28.3825               87.9957               28.4655
8         49.54418857        0.533287              89.6108               0.5795
9         223.3873659        1884.57               617.616               2199.66
10        41.28655134        0.309471              71.9561               0.195143
11        13.17856673        6.20527               14.9291               2.23216
12        6.628396633        6.97554               7.36729               7.47488
13        6.768893738        27.6338               10.9088               29.7742
14        103.6555688        0.82321               294.436               2.72674

Table 3. PageRank values of some nodes of dataset 2: Gnutella Peer to Peer Network, "p2p-Gnutella31"
Node No.  Parallel PageRank  Proposed Parallel PageRank Algorithm
                             Exponential Function  Logarithmic Function  Square Function
1         0.541431794        1.06606               1.052                 1.05132
2         0.742006559        2.61749               2.08039               7.98779
3         0.349563626        0.267324              0.378557              0.163827
4         0.963083037        4.33025               2.60139               10.2946
5         0.252753979        0.175646              0.198987              0.150852
6         0.37433706         0.210706              0.292558              0.150535
7         0.654885792        1.93907               1.75213               3.09926
8         0.380693168        0.214274              0.290963              0.151892
9         0.41766649         0.299011              0.424158              0.166012

Table 4. PageRank values of some nodes of dataset 3: Web graph, "web-NotreDame"
Node No.  Parallel PageRank  Proposed Parallel PageRank Algorithm
                             Exponential Function  Logarithmic Function  Square Function
51        2.75361792         0.398098              6.36853               0.398025
52        3.94637185         0.205065              23.901                0.156454
53        2.13233258         0.166166              6.52146               0.15188
54        5.26112282         1.17416               10.3334               1.3005
55        6.0156624          0.319871              10.1728               0.357065
56        42.2198826         6.81718               70.2482               12.1533
57        1.99740612         0.15                  1.19805               0.15
58        2.22523246         0.15                  1.33911               0.15
59        33.1424986         15.1747               64.4203               23.1498
60        12.2011788         0.213027              28.28                 0.150216
some of the nodes in Table 4; this table clearly depicts how the PageRank values are enhanced for the relevant pages. The important pages remain important, but with greater PageRank values. As seen from the table, the proposed algorithm produces higher peak values for the relevant web pages.

7. Conclusion
This paper proposed an improved PageRank algorithm based on the non-uniform distribution of PageRank scores to calculate the PageRank of web pages. The proposed work performs spam filtering by distinguishing important web pages from irrelevant ones. This filtering is based on the hyperlink structure of the web, that is, the outbound and inbound links of a web page. The apparent importance of a web page is determined by its PageRank value. Important pages show higher values, whereas spam pages show lower values. Spam filtering could be improved further by using web content mining. We performed the experiments on CPU and GPU using the CUDA programming language with different standard datasets.

8. References
1. Article title. 2015. Available from: http://www.infotoday.com/searcher/may01/liddy.htm
2. Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems. 1998; 30(1-7):107–17.
3. Duong NT, Nguyen QAP, Nguyen AT, Nguyen HD. Parallel PageRank computation using GPUs. Proceedings of the Third Symposium on Information and Communication Technology ACM; New York, USA. 2012 Aug. p. 223–30.
4. Dubey H, Roy BN. An improved PageRank algorithm based on optimized normalization technique. IJCSIT. 2011; 2(5):2183–8.
5. Tarun K, Parikshit S, Ankush M. Parallelization of PageRank on multicore processors. Distributed Computing and Internet Technology. Proceedings of 8th International Conference, ICDCIT 2012; Springer Berlin Heidelberg: Bhubaneswar, India. 2012 Feb; 7154:129–40.
6. Article title. 2015. Available from: https://www.google.co.in/insidesearch/howsearchworks/crawling-indexing.html
7. Pu BY, Huang TZ, Wen C. An improved PageRank algorithm: Immune to spam. IEEE Fourth International Conference on Network and System Security; Melbourne. 2010 Sep. p. 425–9.
8. Hua J, Huaxiang Z. Analysis on the content features and their correlation of web pages for spam detection. IEEE Transactions on Communications. China. 2015 Mar; 12(3):84–94.
9. Almomani A, Obeidat A, Alsaedi K, Obaida MAH, Al-Betar M. Spam E-mail filtering using ECOS algorithms. Indian Journal of Science and Technology. 2015 May; 8(S9). DOI: 10.17485/ijst/2015/v8iS9/55320.
10. Shri JMR, Subramaniyaswamy V. An effective approach to rank reviews based on relevance by weighting method. Indian Journal of Science and Technology. 2015 Jun; 8(11). DOI: 10.17485/ijst/2015/v8i11/61768.
11. Geetha Rani IS, Sorana Mageswari M. A link-click-concept based ranking algorithm for ranking search results. Indian Journal of Science and Technology. 2014 Jan; 7(10). DOI: 10.17485/ijst/2014/v7i10/50682.
12. Anbazhagu UV, Praveen JS, Soundarapandian R, Manoharan N. Efficacious spam filtering and detection in social networks. Indian Journal of Science and Technology. 2014 Nov; 7(S7). DOI: 10.17485/ijst/2014/v7iS7/61956.
13. Arnal J, Migallon H, Migallon V, Palomino JA, Penades J. Parallel relaxed and extrapolated algorithms for computing PageRank. The Journal of Supercomputing. 2014 Nov; 70(2):637–48.
14. Zhu Y, Ye S, Li X. Distributed PageRank computation based on iterative aggregation-disaggregation methods. Proceedings of the 14th ACM International Conference on Information and Knowledge Management; New York, USA. 2005 Oct. p. 578–85.
15. Manaskasemsak B, Rungsawang A. Parallel PageRank computation on gigabit PC Cluster. Proceedings of 18th IEEE International Conference on Advanced Information Networking and Applications AINA. 2004 Mar; 1:273–7.
16. Kohlschütter C, Chirita PA, Nejdl W. Efficient parallel computation of PageRank. Advances in Information Retrieval. 28th European Conference on IR Research, ECIR 2006, Springer Berlin Heidelberg; London, UK. 2006 Apr; 3936:241–52.
17. Cevahir A, Aykanat C, Turk A, Barla Cambazoglu B. Site-based partitioning and repartitioning techniques for parallel PageRank computation. IEEE Transactions on Parallel and Distributed Systems. 2011 May; 22(5):786–802.
18. Gleich D, Zhukov L, Berkhin P. Fast parallel PageRank: A linear system approach. Technical Report; 2004.
19. Cevahir A, Aykanat C, Turk A, Barla Cambazoglu B, Nukada A, Matsuoka S. Efficient PageRank on GPU clusters. IPSJ SIG Technical Report. 2010; 2010(21):HPC-128.
20. Wu T, Wang B, Shan Y, Yan F, Wang Y, Xu N. Efficient PageRank and SpMV computation on AMD GPUs. IEEE International Conference on Parallel Processing; San Diego. 2010 Sep. p. 81–9.
21. Article title. 2015. Available from: http://docs.nvidia.com/cuda/cuda-c-programming-guide
22. Bhatia S, Tolpadi M, Rasool A. Importance of GPGPUs in efficiency improvement of real world applications. IEEE Students' Conference on Electrical, Electronics and Computer Science (SCEECS); Bhopal. 2014 Mar. p. 1–6.
23. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing Order to the Web. Stanford University: Technical report; 1999.
24. Article title. 2015. Available from: https://snap.stanford.edu/data