
Parallel Page Rank Algorithms: A Survey

Paper Title: Parallel Page Rank Algorithms: A Survey
Authors: Atul kumar Srivastava, Mitali Srivastava, Rakhi Garg, P. K. Mishra
Abstract: The PageRank method is a basic and important component of effective web search; it computes the rank score of each page. The exponential growth of the Internet makes it a crucial challenge for search engines to return up-to-date and relevant results for a user's query within an acceptable time. PageRank is computed over a huge number of web pages, which makes it a computation-intensive task. In this paper, we present the basic concept of the PageRank method and discuss some parallel PageRank methods. We also compare these algorithms with respect to parallel algorithmic concepts such as load balancing, distributed vs. shared memory, and data layout.
Keywords: PageRank; Power Method; Numerical Method; Parallel PageRank; MPI.
Citation/Export
MLA: Atul kumar Srivastava, Mitali Srivastava, Rakhi Garg, P. K. Mishra, “Parallel Page Rank Algorithms: A Survey”, May 17 Volume 5 Issue 5, International Journal on Recent and Innovation Trends in Computing and Communication (IJRITCC), ISSN: 2321-8169, PP: 470 – 473
APA: Atul kumar Srivastava, Mitali Srivastava, Rakhi Garg, P. K. Mishra, May 17 Volume 5 Issue 5, “Parallel Page Rank Algorithms: A Survey”, International Journal on Recent and Innovation Trends in Computing and Communication (IJRITCC), ISSN: 2321-8169, PP: 470 – 473


International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169

Volume: 5 Issue: 5 470 – 473


_______________________________________________________________________________________________
Parallel PageRank Algorithms: A Survey
Atul kumar Srivastava
Department of Computer Science, Institute of Science, Banaras Hindu University, Varanasi, India

Mitali Srivastava
Department of Computer Science, Institute of Science, Banaras Hindu University, Varanasi, India

Rakhi Garg
Computer Science Section, MMV, Banaras Hindu University, Varanasi, India

P. K. Mishra
Department of Computer Science, Institute of Science, Banaras Hindu University, Varanasi, India

Corresponding author: [email protected]

Abstract: The PageRank method is a basic and important component of effective web search; it computes the rank score of each page. The exponential growth of the Internet makes it a crucial challenge for search engines to return up-to-date and relevant results for a user's query within an acceptable time. PageRank is computed over a huge number of web pages, which makes it a computation-intensive task. In this paper, we present the basic concept of the PageRank method and discuss some parallel PageRank methods. We also compare these algorithms with respect to parallel algorithmic concepts such as load balancing, distributed vs. shared memory, and data layout.

Keywords: PageRank; Power Method; Numerical Method; Parallel PageRank; MPI.


__________________________________________________*****_________________________________________________

I. INTRODUCTION

The information and knowledge resources available on the Internet continue to grow at an exponential rate, which calls for efficient web search technology. For a given query keyword, a web search engine returns millions of result pages [1]. An efficient and effective ranking methodology is required to determine which results are important and relevant. PageRank is known as the most effective and basic method for ranking web pages by relevancy score [2]. It uses the hyperlink structure of the web graph to calculate the relevancy score of a page. Because the hyperlink structure data is usually huge, PageRank computation is still a challenging task for today's web search engines. The PageRank of a web page depends on the PageRank of the pages that point to it: a page gets a higher rank if it is pointed to by many high-rank pages [3, 4].

With the tremendous increase in the number of web pages, PageRank computation takes a lot of time, sometimes several days, and consumes many computing resources. A faster and more efficient computation method is required because web pages are updated regularly, which forces frequent re-computation of PageRank to keep the rank values current [5, 6].

Many recent studies have focused on accelerating the PageRank computation. Acceleration techniques have been proposed in terms of numerical methods, I/O-efficient techniques, and architectural approaches [8]. Haveliwala [9] and Chen et al. [8] used memory-efficient and disk-based computation. Arasu et al. [10] accelerated the convergence rate of PageRank by using the Gauss-Seidel method and structural properties of web graphs. Kamvar et al. [11, 12] sped up the computation by periodically subtracting off estimates of the non-principal eigenvectors from the current iterate. Boldi et al. [13] computed the PageRank vector in main memory by compressing the web graph data. Rungsawang et al. [14] used a PC cluster to compute PageRank.

In this paper we review parallel PageRank algorithms on the basis of architecture setup, platform, and data distribution. The paper is organized as follows: Section II describes the basic concept of the PageRank method. Section III elaborates various parallel PageRank algorithms and how they are implemented. Section IV gives a comparative study of these algorithms based on different attributes. Finally, Section V concludes the paper.

II. PAGERANK METHOD

The basic definition of PageRank is that a page is important if it is pointed to by other important pages. Theoretically, the PageRank score of a page is obtained by summing the scores of all pages that point to it. Let the transition probability matrix \(M\) of size \(n \times n\) be defined as follows [1, 5]:

\[ M_{ij} = \begin{cases} \dfrac{1}{\mathrm{outdeg}(i)}, & \text{if page } i \text{ links to page } j \\ 0, & \text{otherwise} \end{cases} \tag{1} \]

In equation (1), \(\mathrm{outdeg}(i)\) denotes the out-degree of page \(i\). If a row of \(M\) contains only zero entries, i.e. the out-degree of the page is zero, the page is called a dangling page. To deal with dangling pages, the transition probability matrix \(M\) is changed into a new matrix \(M'\). Let \(v\) be the uniform teleportation row vector and \(n\) the number of pages in the web graph; then \(M'\) is defined as:

\[ M' = M + d\,v \tag{2} \]

In equation (2), \(d\) is the \(n\)-dimensional column vector that marks the dangling pages, defined as:

\[ d_i = \begin{cases} 1, & \text{if } \mathrm{outdeg}(i) = 0 \\ 0, & \text{otherwise} \end{cases} \tag{3} \]
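As a small illustration of equations (1) to (3), the sketch below builds the matrix M, the dangling indicator d, and the corrected matrix M' = M + d v for a tiny graph. The four-page graph, the function name `build_matrices`, and the 0-based page numbering are assumptions made purely for illustration; they do not come from the surveyed papers. The sketch also assumes that each page lists a given target at most once.

```python
from fractions import Fraction


def build_matrices(links, n):
    """Return (M, d, M_prime) for a graph given as {page: [pages it links to]}."""
    # Equation (1): M[i][j] = 1/outdeg(i) if page i links to page j, else 0.
    M = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        targets = links.get(i, [])
        for j in targets:
            M[i][j] = Fraction(1, len(targets))
    # Equation (3): d[i] = 1 if page i has no outgoing links (dangling page).
    d = [1 if not links.get(i) else 0 for i in range(n)]
    # Uniform teleportation row vector v, every entry equal to 1/n.
    v = [Fraction(1, n)] * n
    # Equation (2): M' = M + d v replaces each all-zero row of M with v.
    M_prime = [[M[i][j] + d[i] * v[j] for j in range(n)] for i in range(n)]
    return M, d, M_prime


# Hypothetical 4-page graph; page 3 is dangling (no outgoing links).
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
M, d, M_prime = build_matrices(links, 4)
print(d)                               # dangling indicator vector
print([sum(row) for row in M_prime])   # row sums of M'
```

Exact fractions are used so that the row sums of M' can be checked without rounding error: after the correction every row sums to one, which is the property the dangling-page fix is meant to restore.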
IJRITCC | May 2017, Available @ http://www.ijritcc.org
\[ v = \left[ \frac{1}{n} \right]_{1 \times n} \tag{4} \]

In equation (2), the second term on the right side, \(d\,v\), expresses that a user who visits a dangling page may jump at random to any other page with probability \(1/n\) [2, 3]. The uniqueness of the PageRank vector is guaranteed when the matrix \(M'\) is further transformed into \(M''\), on which the Power method is applied, by the following equation:

\[ M'' = \alpha M' + (1 - \alpha)\, e\, v \tag{5} \]

In equation (5), \(\alpha\) is the damping factor and \(e\) is the \(n\)-dimensional column vector of ones. The modified transition matrix \(M''\) is used to compute the PageRank vector. Let \(R^{(k)}\) be the \(n\)-dimensional column PageRank vector at iteration \(k\); the Power method then computes the PageRank of the web graph by the following equation, where the transpose appears because \(M''\) is row-stochastic while \(R^{(k)}\) is a column vector:

\[ R^{(k+1)} = (M'')^{T} R^{(k)} \tag{6} \]

Equation (6) is iterated until it converges to within a given tolerance. Because of the huge size of the web graph, the Power method can take several days to compute the PageRank scores, so a parallel platform such as MPI, MapReduce, or Spark is needed to compute the PageRank vector efficiently and effectively. In the next section we discuss some parallel PageRank algorithms that run on such platforms and perform better than the basic PageRank algorithm.

III. PARALLEL PAGERANK ALGORITHMS

Efficient computation of PageRank is required because of the huge size of the web graph. Recent work discusses algorithmic and architectural approaches to speed up the computation. Here we review some of these parallel PageRank algorithms and focus on their pros and cons, platforms, software, etc. To compute the PageRank vector on a parallel platform, the hyperlink matrix is first stored in a binary link structure file that is partitioned into p chunks, where p is the number of systems in the cluster setup [14, 15]. Suppose the binary link structure file contains T records; the file is structured as follows:

dest_id   src_id             out_deg   in_deg
1         1029               1         1
2         101 23             5         2
3         276 34 111         3         3
4         21 476 234 165     4         4
5         234                1         1
6         563 876            7         2
file B (dest_id, src_id)     file O    file I

Figure 1: Structure of the files used for parallel PageRank computation

In Figure 1, file B has two attributes, dest_id and src_id, which hold the id of a page and the ids of the pages that point to it; e.g. page 1 is pointed to by page 1029, and page 2 is pointed to by pages 101 and 23. Files O and I contain the number of outgoing and incoming links of each page of file B.

A. Parallel PageRank algorithm on PC Cluster

Rungsawang et al. [14] proposed a parallel PageRank algorithm for a cluster of processors connected via a fast Ethernet network. To speed up the PageRank computation, the binary link structure file is partitioned into p parts B_0, B_1, B_2, …, B_(p−1). The proposed PageRank algorithm is given below:

Algorithm 1: Parallel PageRank Algorithm [14]
1. Initialize each processor with MPI process id 0, 1, 2, …, p−1.
2. start = id × (n/p) + 1, and end = (id + 1) × (n/p)
3. for i = 1, …, n: R[i] = 1/n, α = 0.85 // assign rank of every web page to 1/n
4. Repeat until convergence
5.    Repeat till end of file B_id
6.       i = B_id.dest_id // to obtain index
7.       R_new[i] = (1 − α)/n + α × Σ_j R[B_id.src_id_j] / O[B_id.src_id_j]
8.    end
9. end

Algorithm 1 iterates until convergence is reached, and all processors synchronize their blocks of PageRank values in \(2\log_2 p\) communication steps [14]. The experimental results show that the speed-up curve gets close to the ideal speed-up for large data sizes.

B. Parallel Adaptive PageRank algorithm

Rungsawang and Manaskasemsak [15] proposed a parallel PageRank computation on a PC cluster. Kamvar et al. [16] found that the PageRank values of about 2/3 of the pages converge much earlier than those of the other pages. Based on this observation they proposed an adaptive PageRank method that avoids the overhead of recomputing pages that have already converged and updates only the remaining pages, using the following equation:

\[ R_i^{(k+1)} = \begin{cases} R_i^{(k)}, & \text{if page } i \text{ has converged} \\ \dfrac{1 - \alpha}{n} + \alpha \displaystyle\sum_{j \to i} \dfrac{R_j^{(k)}}{\mathrm{outdeg}(j)}, & \text{otherwise} \end{cases} \tag{7} \]

The authors of [15] proposed a parallel PageRank algorithm and combined it with the adaptive PageRank computation. The parallel adaptive PageRank algorithm is given below:

Algorithm 2: Parallel Adaptive PageRank algorithm [15]
1. R = 1/n // initialize rank of all pages
2. initialize (C, N) // initialize the sets of converged and non-converged pages
3. while not end of non-converged file N
4.    while B_i is not end of file
5.       if B_i.dest_id ∈ C // check whether page is in set of converged ranks
6.          R_new[B_i.dest_id] = R[B_i.dest_id]
7.       else
8.          for every j = 1, 2, …, in_deg
9.             Score += R[B_i.src_id_j] / O[B_i.src_id_j]
10.         R_new[B_i.dest_id] = α × Score + (1 − α)/n
11.   end while
12. end while
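The per-page update shared by Algorithms 1 and 2 can be sketched on a single machine as follows. This is a minimal sequential sketch under stated assumptions, not the MPI implementation of [14] or [15]: the function name `pagerank`, the in-link dictionary standing in for file B, the list standing in for file O, the tolerance values, and the rule that freezes a page once its relative change falls below `freeze_tol` are all chosen here for illustration. Dangling pages are handled by spreading their rank uniformly, in the spirit of equation (2).

```python
def pagerank(in_links, out_deg, alpha=0.85, tol=1e-10,
             adaptive=False, freeze_tol=1e-9, max_iter=1000):
    """Sequential sketch of the rank update used in Algorithms 1 and 2.

    in_links[i] lists the pages that point to page i (role of file B);
    out_deg[j] is the out-degree of page j (role of file O).
    """
    n = len(out_deg)
    rank = [1.0 / n] * n          # every page starts with rank 1/n
    converged = set()             # pages frozen by the adaptive variant
    for _ in range(max_iter):
        # Rank held by dangling pages is spread uniformly over all pages.
        dangling = sum(rank[j] for j in range(n) if out_deg[j] == 0)
        new_rank = list(rank)
        for i in range(n):
            if adaptive and i in converged:
                continue          # keep the stored rank, skip recomputation
            score = sum(rank[j] / out_deg[j] for j in in_links.get(i, []))
            new_rank[i] = alpha * (score + dangling / n) + (1 - alpha) / n
        change = [abs(a - b) for a, b in zip(new_rank, rank)]
        if adaptive:
            # Assumed freezing rule: relative change below freeze_tol.
            converged.update(i for i in range(n)
                             if change[i] < freeze_tol * new_rank[i])
        rank = new_rank
        if sum(change) < tol:     # L1 change between successive iterates
            break
    return rank


# Hypothetical 4-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 is dangling.
in_links = {0: [2], 1: [0], 2: [0, 1]}
out_deg = [2, 1, 1, 0]
plain = pagerank(in_links, out_deg)
fast = pagerank(in_links, out_deg, adaptive=True)
print(plain)
print(fast)
```

In a parallel setting, each process would run the inner loop only over its own block of destination pages and then exchange its block of `new_rank` with the other processes; the sketch leaves that communication step out.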
Their experimental analysis showed that adaptive PageRank converges faster by eliminating the redundant computation for already converged pages, and that the algorithm also scales to large datasets.

C. Parallel PageRank computation by using Gauss-Seidel method

Kohlschütter et al. [17] accelerated the PageRank computation by applying the Gauss-Seidel method rather than the Power method in parallel PageRank. They presented a method that combines the Gauss-Seidel and Jacobi methods and improves the convergence of the PageRank algorithm. The combined method is simplified to the following equation (8) and implemented on a parallel platform:

\[ R^{(m)}(i) = 1 - \alpha + \alpha \sum_{(j,i)} \frac{R^{(m-1,m)}(j)}{|O(j)|} \tag{8} \]

In equation (8), \(m\) denotes the iteration number, \((j, i)\) ranges over the links from web page \(j\) to web page \(i\), \(|O(j)|\) is the number of outgoing links of page \(j\), and \(R^{(m-1,m)}\) is an intermediate PageRank vector that combines the previous and current iteration vectors: it holds the current-iteration value for pages that have already been updated and the previous-iteration value for the rest.

Through experiments they showed that their approach and the original method produce the same rank scores, and that the combined method is more scalable than other parallel PageRank methods because the Gauss-Seidel method is parallelized [20].

D. PPR-MT algorithm

Manaskasemsak et al. [18] proposed an efficient parallel PageRank method based on a multi-threaded model with MPI. They executed the parallel adaptive PageRank algorithm on two SMP cluster systems. In this algorithm, MPI is used as the communication model between processes on different systems, and an inter-thread method is used for communication between threads. They partitioned the binary link structure file to distribute it equally among the cluster nodes: files B and I are partitioned and distributed among the cluster nodes, while file O is not partitioned. They proposed the following PPR-MT algorithm [18]:

Algorithm 3: PPR-MT Algorithm [18]
1. R = 1/n // assign rank score to every web page
2. Create threads
3. while not converged
4.    for every page assigned to the thread
5.       score = 0
6.       if B.dest_id has converged then
7.          R_new[B.dest_id] = R[B.dest_id]
8.       else
9.          compute all score += R[B.src_id] / O[B.src_id]
10.         R_new[B.dest_id] = α × score + (1 − α)/n
11.   endfor
12. endwhile

On executing Algorithm 3, it was observed that the algorithm runs about 2–3 times faster on four machines than the basic PageRank Power method on a single machine.

E. Parallel PageRank on CUDA

Kumar et al. [19] presented a parallel PageRank algorithm on the Cell-BE processor and CUDA, and showed that their algorithm runs almost 3 times faster than on a Xeon dual-core 3.0 GHz processor. To get better performance they used the SPEs (Synergistic Processing Elements) for all calculations. They designed the following data structures to store the binary link structure file:

Node Array [n1 | in_deg1 | s1 | s2 | s3 | … | n2 | in_deg2 | s1 | s2 | s3 | …], which contains the first node followed by its in-degree, followed by its source nodes, then the second node followed by its in-degree, and so on.

Degree Array [n1 | in_deg1 | d1 | d2 | d3 | … | n2 | in_deg2 | d1 | d2 | d3 | …], which contains the first node followed by its in-degree, followed by its destination nodes.

Two arrays V1 and V2, which keep the ranks of the nodes at iterations k and (k + 1).

Based on these data structures they presented the following algorithm [19]:

Algorithm 4: Parallel PageRank on CUDA [19]
1. Initialize PageRank of every page = 1
2. for every iteration
3.    Degree Array and V1 are divided equally among the SPEs, and the rank is calculated on the PageRank array.
4.    After the calculation, V2 is updated with V1.

IV. COMPARISON OF PARALLEL PAGERANK ALGORITHMS

In this section, we categorize the parallel PageRank algorithms developed so far on the basis of the load balancing technique used, the type of parallelism, and the data layout used. The details are given in Table 1.

V. CONCLUSION

From the above review, it is clear that many parallel algorithms for computing the PageRank vector have been proposed so far. As Table 1 shows, the first, second, and fifth methods (PC cluster [14], adaptive [15], and CUDA [19]) use no load balancing technique and distribute data horizontally to every node, whereas the third and fourth methods (Gauss-Seidel [17] and PPR-MT [18]) distribute data to every node based on size. Earlier, MPI was a popular platform for parallel PageRank computation, but it requires the user to take care of every task, such as load balancing and data layout. Achieving all of these aspects, i.e. load balancing, type of parallelism, and data layout, efficiently is a complex task in MPI. Today other platforms such as Spark are available that offer advantages over MPI. There is considerable scope for PageRank computation on platforms like Spark to achieve better efficiency and scale-up.

Table 1: Comparison of various parallel PageRank algorithms based on different parameters

                                      Load        Memory Used            Type of Parallelism         Data Layout
Algorithm                             Balancing   Distributed   Shared   Data   Task                 Horizontal   Vertical
Parallel PR on PC Cluster [14]        No          Yes           No       Yes    No                   Yes          No
Parallel Adaptive PR [15]             No          Yes           No       Yes    No                   Yes          No
Parallel PR using Gauss-Seidel [17]   Yes         Yes           No       Yes    No                   Yes          No
PPR MT algorithm [18]                 Yes         Yes           Yes      Yes    No                   Yes          No
Parallel PR on CUDA [19]              No          Yes           No       Yes    Yes (thread level)   Yes          No

REFERENCES
[1] S. Brin, L. Page (1998), "The Anatomy of a Large-scale Hypertextual Web Search Engine", Proceedings of the Seventh International World Wide Web Conference, Page(s): 107-117.
[2] Langville, A. N., and Meyer, C. D. (2004), "Deeper inside PageRank", Internet Mathematics, Vol. 1 No. 3, pp. 335-380.
[3] Pavel Berkhin (2005), "A survey on PageRank computing", Internet Mathematics, Vol. 2 No. 1, Page(s): 73-120.
[4] P. Desikan, J. Srivastava, Vipin Kumar, and Pang-Ning Tan (2002), "Hyperlink Analysis: Techniques and Applications", Page(s): 1-42.
[5] Pretto, L.: A theoretical analysis of Google's PageRank. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 131-144. Springer, Heidelberg (2002).
[6] Langville, A.N., Meyer, C.D.: Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton (2006).
[7] Langville, A. N., & Meyer, C. D. (2004, May). Updating pagerank with iterative aggregation. In Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters (pp. 392-393). ACM.
[8] Chen, Y. Y., Gan, Q., & Suel, T. (2002, November). I/O-efficient techniques for computing PageRank. In Proceedings of the eleventh international conference on Information and knowledge management (pp. 549-557). ACM.
[9] Haveliwala, T. (1999). Efficient computation of PageRank. Stanford.
[10] Arasu, Arvind, et al. "PageRank computation and the structure of the web: Experiments and algorithms." Proceedings of the Eleventh International World Wide Web Conference, Poster Track. 2002.
[11] Kamvar, Sepandar, et al. "Exploiting the block structure of the web for computing pagerank." Stanford University Technical Report (2003).
[12] Kamvar, S. D., Haveliwala, T. H., Manning, C. D., & Golub, G. H. (2003, May). Extrapolation methods for accelerating PageRank computations. In Proceedings of the 12th international conference on World Wide Web (pp. 261-270). ACM.
[13] Boldi, Paolo, and Sebastiano Vigna. "The webgraph framework I: compression techniques." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.
[14] Rungsawang, A., & Manaskasemsak, B. (2003, September). PageRank computation using PC cluster. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting (pp. 152-159). Springer Berlin Heidelberg.
[15] Rungsawang, A., & Manaskasemsak, B. (2006, February). Parallel adaptive technique for computing PageRank. In Parallel, Distributed, and Network-Based Processing, 2006. PDP 2006. 14th Euromicro International Conference on (pp. 6-pp). IEEE.
[16] Kamvar, S., Haveliwala, T., & Golub, G. (2004). Adaptive methods for the computation of PageRank. Linear Algebra and its Applications, 386, 51-65.
[17] Kohlschütter, C., Chirita, P. A., & Nejdl, W. (2006, April). Efficient parallel computation of pagerank. In European Conference on Information Retrieval (pp. 241-252). Springer Berlin Heidelberg.
[18] Manaskasemsak, Bundit, Putchong Uthayopas, and Arnon Rungsawang. "A mixed MPI-thread approach for parallel page ranking computation." On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE (2006): 1223-1233.
[19] Kumar, T., Sondhi, P., & Mittal, A. (2012, February). Parallelization of PageRank on multicore processors. In International Conference on Distributed Computing and Internet Technology (pp. 129-140). Springer Berlin Heidelberg.