Parallel BFS On Graphs Using GPGPU
Soumasish Goswami
Introduction
Graph representations are common in many scientific and engineering applications. Some of these problems map to very large graphs, often involving millions of vertices: VLSI chip layout, phylogeny reconstruction, data mining, and network analysis all routinely require graphs of this scale. Running serial implementations of Breadth-First Search traversal on such large data sets can be a time-consuming task. This report summarizes observations on parallelizing this task using CUDA. Owing to the hardware and environment limitations at the Centre for Computational Research (CCR), the project has been tested in its entirety on a personal computer (with above-average specs). However, the observations can easily be generalized to a larger data set. The actual tests on the serial algorithm run significantly longer than the timings plotted in the report, primarily because the subroutine that traverses a list of edges and adds them to the graph is painfully slow. The relevant tables in the observations section only summarize the time taken to run the actual Breadth-First Search (hereafter referred to as BFS) algorithm.
GPGPU
General purpose programming on Graphics Processing Units (GPGPU) tries to solve a problem by posing it as a graphics rendering problem, restricting the range of solutions that can be ported to the GPU. A GPGPU solution is designed to follow the general flow of the graphics pipeline (consisting of vertex, geometry and pixel processors), with each iteration of the solution being one rendering pass. Since the GPU memory layout is also optimized for graphics rendering, an optimal data structure for a GPGPU solution may not be available. Creating efficient data structures using the GPU memory model is a challenge in itself. Memory size on the GPU is another restricting factor: a single data structure on the GPU cannot be larger than the maximum texture size supported by it.
The CUDA Programming Model
For the programmer, the CUDA model is a collection of threads running in parallel. A warp is a collection of threads that can run simultaneously on a multiprocessor. The warp size is fixed for a specific GPU. The programmer decides the number of threads to be executed. If the number of threads is more than the warp size, they are time-shared internally on the multiprocessor. A collection of threads (called a block) runs on a multiprocessor at a given time. Multiple blocks can be assigned to a single multiprocessor and their execution is time-shared. A single execution on the device generates a number of blocks; the collection of all blocks in a single execution is called a grid. All threads of all blocks executing on a single multiprocessor divide its resources equally amongst themselves. Each thread and block is given a unique ID that can be accessed within the thread during its execution. Each thread executes a single instruction set called the kernel.
The kernel is the core code to be executed on each thread. Using the thread and block IDs, each thread can perform the kernel task on a different set of data. Since the device memory is available to all the threads, any thread can access any memory location. The CUDA programming interface provides a Parallel Random Access Machine (PRAM) architecture, if one uses the device memory alone. However, the multiprocessors follow a SIMD model, so performance improves with the use of shared memory, which can be accessed faster than the device memory. The hardware architecture allows multiple instruction sets to be executed on different multiprocessors. The current CUDA programming model, however, cannot assign different kernels to different multiprocessors.
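As an illustrative sketch of this model (not code from the project; the kernel and variable names here are hypothetical), each thread derives a global index from its block and thread IDs and applies the same kernel code to a different element:

__global__ void scaleKernel(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    if (idx < n)                                       // guard: the grid may have extra threads
        data[idx] *= factor;
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
// scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);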
With CUDA, the GPU can be viewed as a massively parallel SIMD processor, limited only
by the amount of memory available on the graphics hardware. The GT 750M graphics
card has 2048 MB memory. Large graphs can reside in this memory, given a suitable
representation.
Device Specs
GeForce GT 750M
CUDA Driver Version / Runtime Version: 7.5 / 7.5
CUDA Capability: 3.0
Total global memory: 2048 MBytes
GPU clock rate: 926 MHz
Memory clock rate: 2508 MHz
Memory bus width: 128-bit
L2 cache size: 262144 bytes
Total constant memory: 65536 bytes
Warp size: 32
Maximum threads per multiprocessor: 2048
Maximum threads per block: 1024
Thus the device limit is a total of 786,432 threads that could be run at the same time, owing to the device capability of 384 cores with 2048 threads each (384 × 2048 = 786,432).
Data Set
The data set taken in as an argument by both implementations is a vector/array of edge lists. To this extent it is a rough parallel to the Erdős–Rényi model of graph generation; a brief reference can be found in reference 2 below. To generate the graph we use a C++11 random number generator, seeded as follows:
#include <random>

std::random_device rd;                                     // non-deterministic seed
std::mt19937 gen(rd());                                    // Mersenne Twister engine
std::uniform_int_distribution<> dis(0, NUM_VERTICES - 1);  // uniform vertex IDs
Key assumptions
For the sake of simplicity we assume each vertex to be an integer. An edge is thus a pair
of integers representing from and to. An edge list is an array of such pairs. All edges are
bi-directional and have equal weights.
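As a rough sketch of these assumptions (the function and variable names are illustrative, not the project's exact generator), an edge list can be built by drawing random vertex pairs with the distribution shown above:

#include <random>
#include <utility>
#include <vector>

// Build numEdges random (from, to) pairs over vertices 0 .. numVertices-1.
std::vector<std::pair<int, int>> generateEdges(int numEdges, int numVertices)
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, numVertices - 1);

    std::vector<std::pair<int, int>> edges;
    edges.reserve(numEdges);
    for (int i = 0; i < numEdges; ++i)
        edges.emplace_back(dis(gen), dis(gen));   // each edge is an integer pair
    return edges;
}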
Core Serial Algorithm
The serial implementation uses a simple C++ STL queue/vector based graph. The Graph class maintains a vector of vectors, listing for each vertex the edges connected to it. The BFS algorithm uses a queue to traverse across each frontier/level, touching every vertex reachable from the start node.
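A minimal sketch of such a queue-based BFS follows (illustrative names, not the exact project code):

#include <queue>
#include <vector>

// adj[v] lists the vertices adjacent to v; returns the BFS level of every vertex,
// with -1 for vertices not reachable from the start vertex.
std::vector<int> serialBFS(const std::vector<std::vector<int>>& adj, int start)
{
    std::vector<int> level(adj.size(), -1);
    std::queue<int> q;
    level[start] = 0;
    q.push(start);
    while (!q.empty()) {
        int v = q.front();
        q.pop();
        for (int u : adj[v]) {
            if (level[u] == -1) {          // first visit to u
                level[u] = level[v] + 1;
                q.push(u);
            }
        }
    }
    return level;
}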
Parallelization of BFS
Since BFS lends itself well to parallelization, there are two common yet distinct strategies followed in the parallel execution of BFS:
1. The level-synchronous algorithm
2. The fixed-point algorithm
The level-synchronous algorithm uses the following approach: it manages three sets of nodes - the visited set V, the current-level set C, and the next-level set N. Iteratively, the algorithm visits (in parallel) all the nodes in set C and transfers them to set V. C is then populated with the nodes from set N, and N is cleared for the new iteration. This iterative process continues until the next-level set is empty. The level-synchronous algorithm effectively visits in parallel all nodes in each BFS level, with the parallel execution synchronizing at the end of each level iteration.
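A host-side sketch of this level-synchronous scheme is shown below; a boolean visited array plays the role of set V, and the per-node work in the inner loop is what a parallel implementation would distribute across threads (names are illustrative, not the project's code):

#include <vector>

void levelSynchronousBFS(const std::vector<std::vector<int>>& adj, int start)
{
    std::vector<bool> visited(adj.size(), false);   // set V
    std::vector<int> C = {start};                   // current-level set
    std::vector<int> N;                             // next-level set
    visited[start] = true;

    while (!C.empty()) {
        // In a parallel version each node of C is handled by its own thread,
        // and the test-and-set on 'visited' would need to be atomic.
        for (int v : C) {
            for (int u : adj[v]) {
                if (!visited[u]) {
                    visited[u] = true;
                    N.push_back(u);
                }
            }
        }
        C.swap(N);                                  // N becomes the new current level
        N.clear();
        // Parallel execution synchronizes here, at the end of each level.
    }
}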
The fixed-point algorithm, on the other hand, continuously updates the BFS level of every node, based on the BFS levels of all neighboring nodes, until no more updates are made. This method is at times sub-optimal because of the lack of communication between neighboring nodes in parallel environments. For the purpose of this project I've based my implementation on this approach.
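A rough CUDA sketch of the fixed-point idea is given below: one thread per vertex repeatedly lowers its level based on its neighbours' levels, and the host relaunches the kernel until no thread reports a change. The adjacency layout (offset/list arrays) and all names are illustrative assumptions, not necessarily the project's exact structures:

// level[] holds -1 for unvisited vertices and 0 for the start vertex.
// adjOffsets/adjList store the graph in compressed adjacency form.
__global__ void bfsFixedPointKernel(const int *adjOffsets, const int *adjList,
                                    int *level, int numVertices, int *changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVertices) return;
    for (int e = adjOffsets[v]; e < adjOffsets[v + 1]; ++e) {
        int u = adjList[e];
        if (level[u] != -1 && (level[v] == -1 || level[u] + 1 < level[v])) {
            level[v] = level[u] + 1;   // benign race: some valid level wins
            *changed = 1;              // signal the host to launch another pass
        }
    }
}

// Host loop (sketch): relaunch the kernel until 'changed' stays 0 after a full pass.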
Observations & Speed Up
Starting from the base data set of 1024 vertices and an equal number of edges, the data set has been scaled up progressively in multiples of 2. The details of the scale-up in the vertex-to-edge ratio and the time taken to run BFS are enumerated in the tables below.
Base testing data set - 1024 vertices

Vertices   Edges     Serial Runtime (s)   Parallel Runtime (s)   Speed Up
1024       1024      0.000012             0.001332               -11000%
1024       2048      0.000011             0.000944               -8481.8%
1024       4096      0.000554             0.000877               -58.3%
1024       8192      0.001467             0.001084               26.1%
1024       16,384    0.005263             0.001106               99.9%
1024       32,768    0.018046             0.001338               92.5%
1024       65,536    0.064879             0.001344               97.9%
[Chart: serial vs. parallel runtime for the 1024-vertex data set, edge scale 1x to 64x]
Vertices   Edges     Serial Runtime (s)   Parallel Runtime (s)   Speed Up
65536      32,768    0.000034             0.000974               -2764.7%
65536      65,536    0.000031             0.003547               -11341.9%
[Chart: serial vs. parallel runtime for the 65536-vertex data set, edge scale 1x and 1/2x]
Scale up - 8X
Vertices   Edges     Serial Runtime (s)   Parallel Runtime (s)   Speed Up
8192       8192      0.000026             0.000235               -803.8%
8192       16384     0.001461             0.001979               -35.4%
8192       32762     0.004600             0.006388               -38.8%
8192       65536     0.013723             0.003123               77.2%
8192       131072    0.042567             0.002640               93.7%
8192       262096    0.155348             0.002776               98.2%
8192       524288    0.567887             0.001600               99.7%
[Chart: serial vs. parallel runtime for the 8192-vertex data set, edge scale 1x to 64x]
Scale up - 16X
Vertices   Edges     Serial Runtime (s)   Parallel Runtime (s)   Speed Up
65536      65536     0.000181             0.014693               -8017.6%
65536      131072    0.015386             0.014237               7.4%
65536      262096    0.051042             0.016766               67.1%
65536      524288    0.134016             0.023420               82.5%
[Chart: serial vs. parallel runtime for the 65536-vertex data set, edge scale 1x to 8x]
Analysis
The following trends emerge from plotting the two algorithms run on the same data sets.
1. The serial algorithm tends to work better on extremely sparse graphs with a 1:1 vertex-to-edge ratio.
2. The parallel algorithm works better on moderately sparse graphs and shows a significant runtime improvement as the edge-to-vertex ratio goes up.
3. As the edge-to-vertex ratio drops below 1x, to 1/2x or 1/4x, the parallel algorithm performs progressively worse, primarily owing to the multiple serial calls made to launch the kernel. This is the serial bottleneck of this algorithm.
4. As the data set grows there is a gradual fall in the performance difference between the two algorithms on graphs with a 1:1 or 1:2 vertex-to-edge ratio. Thus, from 1024 to 65536 vertices, the speed-up at a 1:2 vertex-to-edge ratio goes from -8481% to 7%. It can safely be assumed that as the data set grows, the relative advantage of the serial algorithm on a sparse graph will continue to shrink, and the parallel algorithm will keep performing proportionately better over larger and larger data sets.
5. The catch-up is also quicker as the data set grows. For the base data set of 1024 nodes, the edge count had to increase to 8x the vertex count for a yield of 27%. However, for a graph with 65536 vertices an edge increase of only 4x yielded a 67% speed-up. Clearly, the more data there is, the better the speed-up - in both relative and absolute terms.
6. Aside from the timings noted here, each algorithm spends a non-trivial amount of time on initialization. The serial algorithm has to read an array of edge lists and compose the graph before it can start BFS; that in effect doubles the execution time of the program. In the case of the parallel algorithm, the frontier array has to be initialized to -1 for every vertex except the start vertex (a sketch of this initialization follows this list). For a large data set this might have to be done in batches, which could add non-trivial overhead to the execution of the program.
7. There was some variance observed when running the same data on CUDA multiple times. Though not significant for a data set of fewer than 500,000 nodes, this is something that needs to be observed on a bigger data set of a billion nodes. In all likelihood the difference in runtime is attributable to other processes using the GPU on the machine.
8. One key observation while running the tests was that the level of connectivity of a node played a key role in determining the runtime. For the same data set, different start points had significantly different running times based on their level of connectivity. Thus it was of paramount importance to use the same start vertex for both algorithms.
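The initialization mentioned in point 6 can be sketched as follows (illustrative names; error checking omitted; not the project's exact code):

#include <vector>
#include <cuda_runtime.h>

// Every vertex starts at level -1 (unvisited) except the chosen start vertex,
// then the array is copied to the device once before the first kernel launch.
int *initDeviceLevels(int numVertices, int startVertex)
{
    std::vector<int> level(numVertices, -1);
    level[startVertex] = 0;

    int *d_level = nullptr;
    cudaMalloc((void **)&d_level, numVertices * sizeof(int));
    cudaMemcpy(d_level, level.data(), numVertices * sizeof(int),
               cudaMemcpyHostToDevice);
    return d_level;
}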
Limitations of the Algorithm
It is not a surprise that this algorithm works well with a multi-connected graph, where each vertex is connected to many edges but the traversal is limited to a certain number of levels. This is because for every level a fresh kernel call is made, and the kernel, though launched with threads equal to the number of vertices, only writes to those indices of the array which are connected at that level. This is wasteful in nature, and the continuous copying of data from host to device is a bottleneck in the algorithm. Also, in the case of a graph in which all vertices are connected linearly (a linked list), the serial and the parallel execution times would be the same.
However, the beauty of the algorithm lies in the fact that it easily lends itself to complete parallelization. On the current data set the algorithm performs as well as one could expect.
Thus the algorithm is best recommended for moderately sparse graphs. It also scales proportionately with the size of the data.
Recommendations
The algorithm's speed-up only begins to improve as the number of edges goes up relative to the number of vertices. This is a minor concern because most large-scale real-world networks are sparse: the number of edges is much smaller than the maximum number of possible edges. In that sense this implementation will find more use with streaming data, where the graph is more connected.
Code Repositories
Data Set Generation: https://github.com/soumasish/GenerateDataSet
Serial BFS: https://github.com/soumasish/SerialBFS
Parallel BFS: https://github.com/soumasish/ParallelBreadthFirstSearch
Conclusion
The size of the device memory limits the size of the graphs that can be handled on a single GPU. The CUDA programming model provides an interface to use multiple GPUs in parallel using multi-GPU bridges. Up to two synchronized GPUs can be combined using the SLI interface. NVIDIA QuadroPlex is a CUDA-enabled graphics solution containing two Quadro 5600 cards. Two such systems can be supported by a single CPU to give even better performance than the GT 750M. NVIDIA has announced its Tesla S870 range, with up to four GPUs and 6 GB of memory, targeted at high-performance computing. Further research is required on partitioning the problem and streaming the data from the CPU to the GPU to handle even larger problems. External memory approaches can also be adapted to GPUs for this purpose.
References
1. http://impact.crhc.illinois.edu/shared/papers/effective2010.pdf
2. https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model
3. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
4. https://www.sci.utah.edu/publications/Fu2014a/UUSCI-2014-002.pdf
5. https://research.nvidia.com/publication/scalable-gpu-graph-traversal