
From CPU to GPU with CUDA C language

Michele Tuttafesta
PhD programme in Physics (Dottorato di Ricerca in Fisica), 25th cycle
Università degli Studi di Bari

September 2011


Overview

1. GPUs and CUDA
2. Parallel Programming in CUDA C
3. Our needs, attempts and failures
4. Replacing original model and code
5. Positive results and outlooks
6. Bibliographical research
7. References


GPUs and CUDA


In the last few years, the rapid development of graphics processing units (GPUs) has made them more powerful in performance and more programmable in functionality.

In raw computational power, GPUs exceed CPUs by orders of magnitude.

The theoretical peak performance of the consumer graphics card NVIDIA GeForce GTX 295 (with two GPUs) is 1788.48 GFLOPS (floating-point operations per second) per GPU in single precision, while a CPU (Core 2 Quad Q9650 @ 3.0 GHz) gives a peak performance of around 96 GFLOPS in single precision.

The release of the Compute Unified Device Architecture (CUDA) hardware and software architecture is the culmination of such development.


GPUs and CUDA

Illustrated History of Parallel Computing


GPUs and CUDA

The Compute Unified Device Architecture (CUDA) was introduced by NVIDIA as a general-purpose parallel computing architecture, which includes the GPU hardware architecture as well as software components (the CUDA compiler, system drivers and libraries).

The CUDA programming model consists of functions, called kernels, which can be executed simultaneously by a large number of lightweight threads on the GPU.

These threads are grouped into one-, two-, or three-dimensional thread blocks, which are further organized into one- or two-dimensional grids.


The execution model


GPUs and CUDA


Only threads in the same block can share data and synchronize with each other during execution.

Thread blocks are independent of each other and can be executed in any order.

A CUDA-capable graphics card such as the GT200 consists of 30 streaming multiprocessors (SMs); each multiprocessor contains 8 streaming processors (SPs), providing a total of 240 SPs.

Threads are grouped into batches of 32, called warps, each executed independently in single-instruction, multiple-data (SIMD) fashion: threads within a warp execute a common instruction at a time.
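Because of the warp size, the launch configuration is usually chosen so that the number of threads per block is a multiple of 32. A minimal sketch of this choice, not from the original slides (someKernel, dev_data and the problem size N are hypothetical placeholders):

__global__ void someKernel(float *data, int n);   // hypothetical kernel

void launch_example(float *dev_data)
{
    int N = 100000;                       // problem size (arbitrary assumption)
    int threadsPerBlock = 128;            // a multiple of the 32-thread warp size
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // enough blocks to cover all N elements
    someKernel<<<blocksPerGrid, threadsPerBlock>>>(dev_data, N);
}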


GPUs and CUDA

Four types of memory: global, constant, texture, shared.

Global memory has a separate address space; data is obtained from the host CPU's main memory through the PCIe bus, whose bandwidth is about 8 GB/s for the GT200 GPU. Any value stored in global memory can be accessed by all SMs via load and store instructions. Constant memory and texture memory are cached, read-only and shared between SPs. Constants that are kept unchanged during kernel execution may be stored in constant memory. Built-in linear interpolation is available in texture memory. Shared memory is limited (16 KB for the GT200 GPU) and shared between all SPs in a multiprocessor.
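As a minimal sketch of how constant memory might be used (the coefficient table, its size and the scale kernel below are illustrative assumptions, not part of the original slides):

#include <cuda_runtime.h>

__constant__ float coeff[16];                  // cached, read-only, visible to every thread

__global__ void scale(float *x, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] *= coeff[i % 16];          // all threads read the same cached constant table
}

void setup_constants(void)
{
    float h_coeff[16];
    for (int k = 0; k < 16; k++) h_coeff[k] = 1.0f / (k + 1);
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // copy the host table into constant memory
}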


GPUs and CUDA: Shared memory

The CUDA C compiler treats variables in shared memory differently from typical variables: it creates a copy of the variable for each block that you launch on the GPU.

Every thread in a block shares that memory, but threads cannot see or modify the copy of the variable seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared-memory buffers reside physically on the GPU, as opposed to residing in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.
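A minimal sketch of this per-block collaboration, in the style of a dot-product reduction (the block size of 256, the cache buffer name and the partial-sum output are illustrative assumptions):

#define THREADS 256                       // threads per block, a power of two

__global__ void dot(const float *a, const float *b, float *partial, int n)
{
    __shared__ float cache[THREADS];      // one copy of this buffer per block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0.0f;
    while (tid < n) {                     // grid-stride loop over the input
        sum += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = sum;
    __syncthreads();                      // wait until every thread has written its value

    for (int s = blockDim.x / 2; s > 0; s /= 2) {   // tree reduction in shared memory
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   // one partial sum per block, to be finished on the host
}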


Summing vectors: CPU vector sums

void add( int *a, int *b, int *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}
//======================
int main( void ) {
    ...
    add( a, b, c );
    ...
}


Summing vectors: GPU vector sums

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;            // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
//======================
int main( void ) {
    ...
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<N,1>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    ...
}


Imagine four blocks, all running through the same copy of the device code but having different values for the variable blockIdx.x.


Summing vectors: GPU vector sums

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;           // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
//======================
int main( void ) {
    ...
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<1,N>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    ...
}


Summing vectors: GPU vector sums

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}
//======================
int main( void ) {
    ...
    add<<<BlocksPerGrid,ThreadsPerBlock>>>( dev_a, dev_b, dev_c );
    ...
}
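The slides elide the allocation and cleanup steps; a possible complete host side around this grid-stride kernel (the array size, initialization and launch sizes are illustrative assumptions, and error checking is omitted) might look like:

#define N (33 * 1024)

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **)&dev_a, N * sizeof(int));      // device buffers in global memory
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<128, 128>>>(dev_a, dev_b, dev_c);            // the grid-stride kernel shown above

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}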


Using whole device

__global__ void kernel( ... ) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tz = threadIdx.z;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    ...
}
//======================
int main( void ) {
    dim3 grids(32,18);
    dim3 threads(16,8,8);        // at most 1024 threads per block in total
    kernel<<<grids , threads>>>(...);
    ...
}
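With two-dimensional grids and blocks like these, each thread usually reconstructs a unique (i, j) pair. A sketch of that mapping (kernel2d, dev_field and the width/height parameters are assumptions, not part of the slides):

__global__ void kernel2d(float *field, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (i < width && j < height)
        field[j * width + i] = 0.0f;                 // one matrix element per thread
}

void launch2d(float *dev_field, int width, int height)
{
    dim3 threads(16, 16);                                    // 256 threads per block
    dim3 grids((width + 15) / 16, (height + 15) / 16);       // enough blocks to cover the whole field
    kernel2d<<<grids, threads>>>(dev_field, width, height);
}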


Program eulero2d: rectangle test

Constant inlet, initial conditions:
  Pin = 1 bar, uin = 0 m/s, Tin = 8000 K

Free stream outlet, initial conditions:
  Pout = 1050 bar, uout = 0 m/s, Tout = 300 K

[Figure (Model C): contour maps of P, T, |u|, rot(u) and Mach number M (gray); a Mach reflection is visible.]


Program eulero2d: rectangle test

Machine Time (MT) = 20212 s ≈ 5.6 h

We may be in serious trouble adding chemical kinetics, for instance...

WE NEED SPEED-UP STRATEGIES!


Recalling eulero2d computational model


Rectangular domain - Cartesian reference frame

Mass:      \partial_t \rho + \partial_x (\rho u) + \partial_y (\rho v) = 0

Momentum:  \partial_t (\rho u) + \partial_x (\rho u^2 + P) + \partial_y (\rho u v) = 0
           \partial_t (\rho v) + \partial_x (\rho u v) + \partial_y (\rho v^2 + P) = 0

Energy:    \partial_t E + \partial_x [u (E + P)] + \partial_y [v (E + P)] = 0


Recalling eulero2d computational model


Operator Splitting

M. Capitelli, A. R. Casavola, G. Colonna. Kinetic model of titanium laser induced plasma expansion in nitrogen environment. Plasma Sources Science and Technology, 18, 2009.

The Euler equations are solved for a given time interval:

...firstly along the x direction for every point of the y grid, neglecting the derivatives with respect to y,

...and then along the y axis for all the x grid points, neglecting the x derivatives.
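A schematic, host-side view of one split time step under this scheme (solve_1d_x and solve_1d_y are hypothetical names standing for the one-dimensional Euler solvers; the array layout is an assumption):

/* one time step dt of the split 2D solver (schematic sketch) */
void advance_one_step(double *U, int nx, int ny, double dt)
{
    /* x sweep: one 1D problem per y grid line, y-derivatives neglected */
    for (int j = 0; j < ny; j++)
        solve_1d_x(U, nx, j, dt);

    /* y sweep: one 1D problem per x grid line, x-derivatives neglected */
    for (int i = 0; i < nx; i++)
        solve_1d_y(U, ny, i, dt);
}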


Recalling eulero2d computational model

Basically, in every Δt we split the 2D flow: first into mplt 1D flows along the x-direction and then (splitting) into nplt 1D flows along the y-direction.

Every 1D flow implies the solution of 4 tridiagonal linear systems, one for each of the conserved quantities ρ, ρu, ρv, E.

Therefore, in every Δt we have to solve 4·mplt + 4·nplt tridiagonal systems.

In our computations mplt or nplt ranges from 20 up to 200.
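Each of those tridiagonal systems can be solved, for instance, with the classical Thomas algorithm; a compact sketch (the array names and the in-place convention are assumptions, not the code actually used here):

/* Thomas algorithm: tridiagonal system with sub-diagonal a, diagonal b,
   super-diagonal c and right-hand side d (all of length n).
   b and d are overwritten; the solution is returned in d. */
void thomas(const double *a, double *b, const double *c, double *d, int n)
{
    for (int i = 1; i < n; i++) {              /* forward elimination */
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    d[n - 1] /= b[n - 1];                      /* back substitution */
    for (int i = n - 2; i >= 0; i--)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}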


The first idea

Since our 4·mplt + 4·nplt tridiagonal systems can be solved independently, our first idea was to parallelize their calculation. So we bought this NVIDIA CUDA graphics card (about 100 euro):

General Information
  Name: GeForce GTS 450
  Compute capability: 2.1
  Clock rate: 1566000 (kHz)
  Device copy overlap: Enabled
  Kernel execution timeout: Enabled

Memory Information
  Total global mem: 1072889856
  Total constant mem: 65536
  Max mem pitch: 2147483647
  Texture alignment: 512

MP Information
  Multiprocessor count: 4
  Shared mem per MP: 49152
  Registers per MP: 32768
  Threads in warp: 32
  Max threads per block: 1024
  Max thread dimensions: (1024, 1024, 64)
  Max grid dimensions: (65535, 65535, 1)

While our CPU is: Pentium(R) Dual-Core CPU E5500 @ 2.80 GHz; RAM = 4.00 GB

...and we spent little time converting our code to CUDA C.
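The card data above comes from querying the device; a sketch of such a query with the CUDA runtime API, printing only a few of the listed fields, is:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Name: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Clock rate: %d kHz\n", prop.clockRate);
    printf("Total global mem: %lu\n", (unsigned long)prop.totalGlobalMem);
    printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}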


The first failure

Our first attempt was almost frustrating. The converted code runs some kernels with at most 4·200 = 800 threads. In such a case the ratio of the program-execution Machine Times CPU/GPU is about 0.6 (and it decreases for fewer than 800 threads).

Through further investigation we realized that the goal CPU/GPU > 1 requires kernels running many thousands of threads.
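The GPU-side Machine Times quoted here and below can be measured with CUDA events; a minimal timing sketch (someKernel and the launch sizes are hypothetical placeholders):

__global__ void someKernel(void);                  // hypothetical kernel under test

float time_kernel(int blocks, int threads)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    someKernel<<<blocks, threads>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                    // wait until the kernel has finished

    cudaEventElapsedTime(&elapsed_ms, start, stop);    // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}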


Changing computational model

The code provided to us by Prof. Giuseppe Pascazio of Politecnico di Bari solves a 2D inviscid flow using a flexible structured grid.


Changing computational model

Prof. G. Pascazio's code, technical specifications:

  Finite volume approach.
  Flux calculation at cell interfaces by Roe flux-differencing.
  MUSCL extrapolation of physical variables at cell interfaces.
  Available limiters: minmod, van Albada.
  Spatial discretization: first order fully upwind; second order fully upwind; second order upwind biased; third order upwind biased.
  Time discretization: second/fourth order Runge-Kutta.
  Explicit scheme, summarizing in 4·IMAX·JMAX independently executable little routines (threads).
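In a GPU version, each of those little routines naturally maps to one CUDA thread. A schematic per-cell kernel for one Runge-Kutta stage (the residual array, the alpha_dt factor, the IMAX/JMAX sizes and the update formula are placeholders, not the actual scheme):

__global__ void rk_stage(double *U, const double *residual, double alpha_dt,
                         int imax, int jmax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // cell index in x
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // cell index in y
    if (i < imax && j < jmax) {
        int cell = j * imax + i;
        for (int v = 0; v < 4; v++)                  // the 4 conserved variables per cell
            U[4 * cell + v] -= alpha_dt * residual[4 * cell + v];
    }
}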


eulero-fds (CPU) vs eulero2d

Pascazio's code was written in FORTRAN, so we first made a translation (with modifications) into C and named the resulting program eulero-fds (Flux Difference Splitting).

1D Riemann problem validation test

Left state:  ρL = 0.042 kg/m³, PL = 1 bar, uL = 0 m/s (TL = 8000 K)
Right state: ρR = ρL/8, PR = PL/10, uR = 0 m/s (TR = 6400 K)

[Figure: density, pressure and velocity profiles.]


eulero-fds (CPU) vs eulero2d: rectangle test

Constant inlet, initial conditions:     Pin = 1 bar, uin = 0 m/s, Tin = 8000 K
Free stream outlet, initial conditions: Pout = 1050 bar, uout = 0 m/s, Tout = 300 K

[Figure: density and Mach number fields.]


eulero-fds (CPU) vs eulero2d: rectangle test

Simply by adjusting the CFL parameter in the new model, we get our first gain in Machine Time.


eulero-fds CPU vs GPU: rectangle test

Summarizing Machine Times and ratios R:

  eulero2d:           5.6 h
  eulero-fds (CPU):   8.9 min    (R = 37.8 with respect to eulero2d)
  eulero-fds (GPU):   2 min      (R = 4.45 with respect to eulero-fds CPU)

  Total speed-up ratio = 168


eulero-fds CPU vs GPU: nozzle test

Subsonic inlet, initial conditions:   Pin = 0.77 Pa, Min = 0.2, ρin = 0.8297 kg/m³
Subsonic outlet, initial conditions:  Pout = Pin, Mout = Min, ρout = ρin

[Figure: MUSCL second-order density field on a 200x200 grid.]


Computers & Fluids


October 2010, F. Kuo, R. Smith et al.: GPU acceleration for general conservation equations and its application to several engineering problems.
  Physical system: several benchmark gas and shallow-water flow engineering problems.
  Numerical approach: Harten, Lax and van Leer method applied to the Euler equations and the shallow water equations.
  GPU/CPU: NVIDIA C1060 / Intel Xeon 3.0 GHz, 32 MB cache.
  Speed-up CPU/GPU: over 67 (in 2D simulations with multi-million cell counts).


Computer Physics Communications


November 2010 (preprint), H. Wong et al.: Efficient magnetohydrodynamic simulations on graphics processing units with CUDA.
  Physical system: one-dimensional problems (Brio-Wu shock tube, MHD shock tube); two-dimensional problems (Orszag-Tang problem, two-dimensional blast wave problem, MHD rotor problem); three-dimensional blast wave problem.
  Numerical approach: ideal MHD equations with magnetic permeability equal to 1, represented as a hyperbolic system of conservation laws. The magnetic field is held fixed first and then the fluid variables are updated; a reverse procedure is then performed to complete one time step. The three-dimensional problem is split into one-dimensional subproblems by a Strang-type directional splitting.
  GPU/CPU: GTX 295 (GTX 480) with 1.75 GB (1.5 GB) video memory / Intel Core i7 965 3.20 GHz, 6 GB main memory.
  Speed-up CPU/GPU: up to 80 in 1D; up to 600 in 2D; up to 250 in 3D.


Journal of Computational Physics


August 2008, E. Elsen et al.: Large calculation of the flow over a hypersonic vehicle using a GPU.
  Physical system: hypersonic vehicle configuration with detailed geometry and accurate boundary conditions.
  Numerical approach: compressible Euler equations; Navier-Stokes Stanford University Solver (NSSUS), a multi-block structured code using a vertex-based finite-difference method.
  GPU/CPU: 8800 GTX / Intel Core 2 Duo.
  Speed-up CPU/GPU: over 40x for simple test geometries and 20x for complex ones.


Journal of Computational Physics

August 2011 (preprint), W. Ran, W. Cheng et al.: GPU accelerated CESE method for 1D shock tube problems.
  Physical system: condensation problem in a 1D shock tube.
  Numerical approach: compressible Euler equations with a source term related to condensation.
  GPU/CPU: GeForce 9800 GT, Tesla C2050 / Intel CPU E7300.
  Speed-up CPU/GPU: up to 40x.


Comput. Methods Appl. Mech. Engrg.

January 2011, M. Papadrakakis et al.: A new era in scientific computing: Domain decomposition methods in hybrid CPU-GPU architectures.
  Physical system: 3D linear elasticity problems with a cubic geometry.
  Numerical approach: a domain decomposition method (DDM) called FETI (Finite Element Tearing and Interconnecting), described by C. Farhat and F. Roux in 1991.
  GPU/CPU: GTX 285 with 1 GB GDDR3 memory / Intel Core 2 Quad Q6600 2.4 GHz.
  Speed-up CPU/GPU: up to 170x in hybrid configurations.


CUDA GPUs features and specifications


The Compute Capability describes the features supported by CUDA hardware.

http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf


GPUs comparison (costs in euro)

http://developer.nvidia.com/cuda-gpus
http://www.trovaprezzi.it/prezzo_schede-grafiche_nvidia_tesla.aspx
http://en.wikipedia.org/wiki/CUDA

Name               Compute Capability   Mem.      Clock rate   MP x Cores    Cost
GeForce GTS 450    2.1                  1 GB      1.57 GHz     4x48 = 192    100
GeForce GTX 285    1.3                  1 GB      1476 MHz     30x8 = 240    300-430
GeForce GTX 295    1.3                  1792 MB   1242 MHz     2x240 = 480   490
GeForce GTX 480    2.0                  1536 MB   1401 MHz     15x32 = 480   280
GeForce 8800 GTX   1.0                  768 MB    575 MHz      16x8 = 128    420
GeForce 9800 GT    1.0                  512 MB    1500 MHz     14x8 = 112    150
Tesla C1060        1.3                  4 GB      1.3 GHz      240           920
Tesla C2050        2.0                  3 GB      1.15 GHz     448           2100-2800

References

J. Sanders, E. Kandrot. CUDA by Example. Addison-Wesley, 2010.

http://www.nvidia.com/object/computational_fluid_dynamics.html
