
From CPU to GPU with CUDA C language

Michele Tuttafesta
PhD programme in Physics (Dottorato di Ricerca in Fisica), 25th cycle
Università degli Studi di Bari

September 2011


Overview

1. GPUs and CUDA
2. Parallel Programming in CUDA C
3. Our needs, attempts and failures
4. Replacing original model and code
5. Positive results and outlooks
6. Bibliographical research
7. References


GPUs and CUDA


In the last few years, the rapid development of graphics processing units (GPUs) has made them more powerful in performance and more programmable in functionality.

In raw computational power, GPUs exceed CPUs by orders of magnitude.

The theoretical peak performance of the consumer graphics card NVIDIA GeForce GTX 295 (with two GPUs) is 1788.48 GFLOPS (floating-point operations per second) per GPU in single precision, while a CPU (Core 2 Quad Q9650 @ 3.0 GHz) gives a peak performance of around 96 GFLOPS in single precision.

The release of the Compute Unified Device Architecture (CUDA) hardware and software architecture is the culmination of such development.


GPUs and CUDA

Illustrated History of Parallel Computing


GPUs and CUDA

The Compute Unified Device Architecture (CUDA) was introduced by NVIDIA as a general-purpose parallel computing architecture, which includes the GPU hardware architecture as well as software components (the CUDA compiler, system drivers and libraries).

The CUDA programming model consists of functions, called kernels, which can be executed simultaneously by a large number of lightweight threads on the GPU.

These threads are grouped into one-, two-, or three-dimensional thread blocks, which are further organized into one- or two-dimensional grids.


The execution model


GPUs and CUDA


Only threads in the same block can share data and synchronize with each other during execution.

Thread blocks are independent of each other and can be executed in any order.

A CUDA-capable graphics card such as the GT200 consists of 30 streaming multiprocessors (SMs); each multiprocessor contains 8 streaming processors (SPs), providing a total of 240 SPs.

Threads are grouped into batches of 32, called warps, each executed independently in single-instruction, multiple-data (SIMD) fashion: threads within a warp execute a common instruction at a time.
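Because of the warp size, the launch configuration is usually chosen so that the number of threads per block is a multiple of 32. A minimal sketch of this choice, not from the original slides (someKernel, dev_data and the problem size N are hypothetical placeholders):

__global__ void someKernel(float *data, int n);   // hypothetical kernel

void launch_example(float *dev_data)
{
    int N = 100000;                       // problem size (arbitrary assumption)
    int threadsPerBlock = 128;            // a multiple of the 32-thread warp size
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // enough blocks to cover all N elements
    someKernel<<<blocksPerGrid, threadsPerBlock>>>(dev_data, N);
}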


GPUs and CUDA

Four types of memory: global, constant, texture, shared.

Global memory has a separate address space; data is obtained from the host CPU's main memory through the PCIe bus, whose bandwidth is about 8 GB/s for the GT200 GPU. Any value stored in global memory can be accessed by all SMs via load and store instructions. Constant memory and texture memory are cached, read-only and shared between SPs. Constants that are kept unchanged during kernel execution may be stored in constant memory. Built-in linear interpolation is available in texture memory. Shared memory is limited (16 KB for the GT200 GPU) and shared between all SPs in a multiprocessor.
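As a minimal sketch of how constant memory might be used (the coefficient table, its size and the scale kernel below are illustrative assumptions, not part of the original slides):

#include <cuda_runtime.h>

__constant__ float coeff[16];                  // cached, read-only, visible to every thread

__global__ void scale(float *x, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] *= coeff[i % 16];          // all threads read the same cached constant table
}

void setup_constants(void)
{
    float h_coeff[16];
    for (int k = 0; k < 16; k++) h_coeff[k] = 1.0f / (k + 1);
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // copy the host table into constant memory
}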


GPUs and CUDA: Shared memory

The CUDA C compiler treats variables in shared memory differently from typical variables: it creates a copy of the variable for each block that you launch on the GPU.

Every thread in a block shares that memory, but threads cannot see or modify the copy of the variable seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared-memory buffers reside physically on the GPU, as opposed to residing in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.
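A minimal sketch of this per-block collaboration, in the style of a dot-product reduction (the block size of 256, the cache buffer name and the partial-sum output are illustrative assumptions):

#define THREADS 256                       // threads per block, a power of two

__global__ void dot(const float *a, const float *b, float *partial, int n)
{
    __shared__ float cache[THREADS];      // one copy of this buffer per block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0.0f;
    while (tid < n) {                     // grid-stride loop over the input
        sum += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = sum;
    __syncthreads();                      // wait until every thread has written its value

    for (int s = blockDim.x / 2; s > 0; s /= 2) {   // tree reduction in shared memory
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   // one partial sum per block, to be finished on the host
}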


Summing vectors: CPU vector sums

void add( int *a, int *b, int *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}
//======================
int main( void ) {
    ...
    add( a, b, c );
    ...
}


Summing vectors: GPU vector sums

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;            // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
//======================
int main( void ) {
    ...
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<N,1>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    ...
}


Imagine four blocks, all running through the same copy of the device code but having different values for the variable blockIdx.x.


Summing vectors: GPU vector sums

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;           // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
//======================
int main( void ) {
    ...
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
    add<<<1,N>>>( dev_a, dev_b, dev_c );
    cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
    ...
}


Summing vectors: GPU vector sums

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}
//======================
int main( void ) {
    ...
    add<<<BlocksPerGrid,ThreadsPerBlock>>>( dev_a, dev_b, dev_c );
    ...
}
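The slides elide the allocation and cleanup steps; a possible complete host side around this grid-stride kernel (the array size, initialization and launch sizes are illustrative assumptions, and error checking is omitted) might look like:

#define N (33 * 1024)

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **)&dev_a, N * sizeof(int));      // device buffers in global memory
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<128, 128>>>(dev_a, dev_b, dev_c);            // the grid-stride kernel shown above

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}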


Using whole device

__global__ void kernel( ... ) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tz = threadIdx.z;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    ...
}
//======================
int main( void ) {
    dim3 grids(32,18);
    dim3 threads(16,8,8);        // at most 1024 threads per block in total
    kernel<<<grids , threads>>>(...);
    ...
}
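With two-dimensional grids and blocks like these, each thread usually reconstructs a unique (i, j) pair. A sketch of that mapping (kernel2d, dev_field and the width/height parameters are assumptions, not part of the slides):

__global__ void kernel2d(float *field, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (i < width && j < height)
        field[j * width + i] = 0.0f;                 // one matrix element per thread
}

void launch2d(float *dev_field, int width, int height)
{
    dim3 threads(16, 16);                                    // 256 threads per block
    dim3 grids((width + 15) / 16, (height + 15) / 16);       // enough blocks to cover the whole field
    kernel2d<<<grids, threads>>>(dev_field, width, height);
}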


Program eulero2d: rectangle test

Constant inlet, initial conditions:
  Pin = 1 bar, uin = 0 m/s, Tin = 8000 K

Free stream outlet, initial conditions:
  Pout = 1050 bar, uout = 0 m/s, Tout = 300 K

[Figure (Model C): contour maps of P, T, |u|, rot(u) and Mach number M (gray); a Mach reflection is visible.]


Program eulero2d: rectangle test

Machine Time (MT) = 20212 s ≈ 5.6 h

We may be in serious trouble adding chemical kinetics, for instance...

WE NEED SPEED-UP STRATEGIES!


Recalling eulero2d computational model


Rectangular domain - Cartesian reference frame

Mass:      \partial_t \rho + \partial_x (\rho u) + \partial_y (\rho v) = 0

Momentum:  \partial_t (\rho u) + \partial_x (\rho u^2 + P) + \partial_y (\rho u v) = 0
           \partial_t (\rho v) + \partial_x (\rho u v) + \partial_y (\rho v^2 + P) = 0

Energy:    \partial_t E + \partial_x [u (E + P)] + \partial_y [v (E + P)] = 0


Recalling eulero2d computational model


Operator Splitting

M. Capitelli, A. R. Casavola, G. Colonna. Kinetic model of titanium laser induced plasma expansion in nitrogen environment. Plasma Sources Science and Technology, 18, 2009.

The Euler equations are solved for a given time interval:

...firstly along the x direction for every point of the y grid, neglecting the derivatives with respect to y,

...and then along the y axis for all the x grid points, neglecting the x derivatives.
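A schematic, host-side view of one split time step under this scheme (solve_1d_x and solve_1d_y are hypothetical names standing for the one-dimensional Euler solvers; the array layout is an assumption):

/* one time step dt of the split 2D solver (schematic sketch) */
void advance_one_step(double *U, int nx, int ny, double dt)
{
    /* x sweep: one 1D problem per y grid line, y-derivatives neglected */
    for (int j = 0; j < ny; j++)
        solve_1d_x(U, nx, j, dt);

    /* y sweep: one 1D problem per x grid line, x-derivatives neglected */
    for (int i = 0; i < nx; i++)
        solve_1d_y(U, ny, i, dt);
}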


Recalling eulero2d computational model

Basically, in every Δt we split the 2D flow: first into mplt 1D flows along the x-direction and then (splitting) into nplt 1D flows along the y-direction.

Every 1D flow implies the solution of 4 tridiagonal linear systems, one for each of the conserved quantities ρ, ρu, ρv, E.

Therefore, in every Δt we have to solve 4·mplt + 4·nplt tridiagonal systems.

In our computations mplt or nplt ranges from 20 up to 200.
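Each of those tridiagonal systems can be solved, for instance, with the classical Thomas algorithm; a compact sketch (the array names and the in-place convention are assumptions, not the code actually used here):

/* Thomas algorithm: tridiagonal system with sub-diagonal a, diagonal b,
   super-diagonal c and right-hand side d (all of length n).
   b and d are overwritten; the solution is returned in d. */
void thomas(const double *a, double *b, const double *c, double *d, int n)
{
    for (int i = 1; i < n; i++) {              /* forward elimination */
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    d[n - 1] /= b[n - 1];                      /* back substitution */
    for (int i = n - 2; i >= 0; i--)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}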


The first idea

Since our 4·mplt + 4·nplt tridiagonal systems can be solved independently, our first idea was to parallelize their calculation. So we bought this NVIDIA CUDA graphics card (about 100 euro):

General Information
  Name: GeForce GTS 450
  Compute capability: 2.1
  Clock rate: 1566000 (kHz)
  Device copy overlap: Enabled
  Kernel execution timeout: Enabled

Memory Information
  Total global mem: 1072889856
  Total constant mem: 65536
  Max mem pitch: 2147483647
  Texture alignment: 512

MP Information
  Multiprocessor count: 4
  Shared mem per MP: 49152
  Registers per MP: 32768
  Threads in warp: 32
  Max threads per block: 1024
  Max thread dimensions: (1024, 1024, 64)
  Max grid dimensions: (65535, 65535, 1)

While our CPU is: Pentium(R) Dual-Core CPU E5500 @ 2.80 GHz; RAM = 4.00 GB

...and we spent little time converting our code to CUDA C.
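The card data above comes from querying the device; a sketch of such a query with the CUDA runtime API, printing only a few of the listed fields, is:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Name: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Clock rate: %d kHz\n", prop.clockRate);
    printf("Total global mem: %lu\n", (unsigned long)prop.totalGlobalMem);
    printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}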


The first failure

Our first attempt was almost frustrating. The converted code runs some kernels with at most 4·200 = 800 threads. In such a case the ratio of the program-execution Machine Times CPU/GPU is about 0.6 (and it decreases for fewer than 800 threads).

Through further investigation we realized that the goal CPU/GPU > 1 requires kernels running many thousands of threads.
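The GPU-side Machine Times quoted here and below can be measured with CUDA events; a minimal timing sketch (someKernel and the launch sizes are hypothetical placeholders):

__global__ void someKernel(void);                  // hypothetical kernel under test

float time_kernel(int blocks, int threads)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    someKernel<<<blocks, threads>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                    // wait until the kernel has finished

    cudaEventElapsedTime(&elapsed_ms, start, stop);    // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}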


Changing computational model

The code provided to us by Prof. Giuseppe Pascazio of Politecnico di Bari solves a 2D inviscid flow using a flexible structured grid.


Changing computational model

Prof. G. Pascazio's code, technical specifications:

  Finite volume approach.
  Flux calculation at cell interfaces by Roe flux-differencing.
  MUSCL extrapolation of physical variables at cell interfaces.
  Available limiters: minmod, van Albada.
  Spatial discretization: first order fully upwind; second order fully upwind; second order upwind biased; third order upwind biased.
  Time discretization: second/fourth order Runge-Kutta.
  Explicit scheme, summarizing in 4·IMAX·JMAX independently executable little routines (threads).
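In a GPU version, each of those little routines naturally maps to one CUDA thread. A schematic per-cell kernel for one Runge-Kutta stage (the residual array, the alpha_dt factor, the IMAX/JMAX sizes and the update formula are placeholders, not the actual scheme):

__global__ void rk_stage(double *U, const double *residual, double alpha_dt,
                         int imax, int jmax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // cell index in x
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // cell index in y
    if (i < imax && j < jmax) {
        int cell = j * imax + i;
        for (int v = 0; v < 4; v++)                  // the 4 conserved variables per cell
            U[4 * cell + v] -= alpha_dt * residual[4 * cell + v];
    }
}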


eulero-fds (CPU) vs eulero2d

Pascazio's code was written in FORTRAN, so we first made a translation (with modifications) into C and named the resulting program eulero-fds (Flux Difference Splitting).

1D Riemann problem validation test

Left state:  ρL = 0.042 kg/m³, PL = 1 bar, uL = 0 m/s (TL = 8000 K)
Right state: ρR = ρL/8, PR = PL/10, uR = 0 m/s (TR = 6400 K)

[Figure: density, pressure and velocity profiles.]


eulero-fds (CPU) vs eulero2d: rectangle test

Constant inlet, initial conditions:     Pin = 1 bar, uin = 0 m/s, Tin = 8000 K
Free stream outlet, initial conditions: Pout = 1050 bar, uout = 0 m/s, Tout = 300 K

[Figure: density and Mach number fields.]


eulero-fds (CPU) vs eulero2d: rectangle test

Simply by adjusting the CFL parameter in the new model, we get our first gain in Machine Time.


eulero-fds CPU vs GPU: rectangle test

Summarizing Machine Times and ratios R:

  eulero2d:           5.6 h
  eulero-fds (CPU):   8.9 min    (R = 37.8 with respect to eulero2d)
  eulero-fds (GPU):   2 min      (R = 4.45 with respect to eulero-fds CPU)

  Total speed-up ratio = 168


eulero-fds CPU vs GPU: nozzle test

Subsonic inlet, initial conditions:   Pin = 0.77 Pa, Min = 0.2, ρin = 0.8297 kg/m³
Subsonic outlet, initial conditions:  Pout = Pin, Mout = Min, ρout = ρin

[Figure: MUSCL second-order density field on a 200x200 grid.]


Computers & Fluids


October 2010, F. Kuo, R. Smith et al.: GPU acceleration for general conservation equations and its application to several engineering problems.
  Physical system: several benchmark gas and shallow-water flow engineering problems.
  Numerical approach: Harten, Lax and van Leer method applied to the Euler equations and the shallow water equations.
  GPU/CPU: NVIDIA C1060 / Intel Xeon 3.0 GHz, 32 MB cache.
  Speed-up CPU/GPU: over 67 (in 2D simulations with multi-million cell counts).


Computer Physics Communications


November 2010 (preprint), H. Wong et al.: Efficient magnetohydrodynamic simulations on graphics processing units with CUDA.
  Physical system: one-dimensional problems (Brio-Wu shock tube, MHD shock tube); two-dimensional problems (Orszag-Tang problem, two-dimensional blast wave problem, MHD rotor problem); three-dimensional blast wave problem.
  Numerical approach: ideal MHD equations with magnetic permeability equal to 1, represented as a hyperbolic system of conservation laws. The magnetic field is held fixed first and then the fluid variables are updated; a reverse procedure is then performed to complete one time step. The three-dimensional problem is split into one-dimensional subproblems by a Strang-type directional splitting.
  GPU/CPU: GTX 295 (GTX 480) with 1.75 GB (1.5 GB) video memory / Intel Core i7 965 3.20 GHz, 6 GB main memory.
  Speed-up CPU/GPU: up to 80 in 1D; up to 600 in 2D; up to 250 in 3D.


Journal of Computational Physics


August 2008, E. Elsen et al.: Large calculation of the flow over a hypersonic vehicle using a GPU.
  Physical system: hypersonic vehicle configuration with detailed geometry and accurate boundary conditions.
  Numerical approach: compressible Euler equations; Navier-Stokes Stanford University Solver (NSSUS), a multi-block structured code using a vertex-based finite-difference method.
  GPU/CPU: 8800 GTX / Intel Core 2 Duo.
  Speed-up CPU/GPU: over 40x for simple test geometries and 20x for complex ones.


Journal of Computational Physics

August 2011 (preprint), W. Ran, W. Cheng et al.: GPU accelerated CESE method for 1D shock tube problems.
  Physical system: condensation problem in a 1D shock tube.
  Numerical approach: compressible Euler equations with a source term related to condensation.
  GPU/CPU: GeForce 9800 GT, Tesla C2050 / Intel CPU E7300.
  Speed-up CPU/GPU: up to 40x.


Comput. Methods Appl. Mech. Engrg.

January 2011, M. Papadrakakis et al.: A new era in scientific computing: Domain decomposition methods in hybrid CPU-GPU architectures.
  Physical system: 3D linear elasticity problems with a cubic geometry.
  Numerical approach: a domain decomposition method (DDM) called FETI (Finite Element Tearing and Interconnecting), described by C. Farhat and F. Roux in 1991.
  GPU/CPU: GTX 285 with 1 GB GDDR3 memory / Intel Core 2 Quad Q6600 2.4 GHz.
  Speed-up CPU/GPU: up to 170x in hybrid configurations.


CUDA GPUs features and specifications


The Compute Capability describes the features supported by CUDA hardware.

http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf


GPUs comparison (costs in euro)

http://developer.nvidia.com/cuda-gpus
http://www.trovaprezzi.it/prezzo_schede-grafiche_nvidia_tesla.aspx
http://en.wikipedia.org/wiki/CUDA

Name               Compute Capability   Mem.      Clock rate   MP x Cores    Cost
GeForce GTS 450    2.1                  1 GB      1.57 GHz     4x48 = 192    100
GeForce GTX 285    1.3                  1 GB      1476 MHz     30x8 = 240    300-430
GeForce GTX 295    1.3                  1792 MB   1242 MHz     2x240 = 480   490
GeForce GTX 480    2.0                  1536 MB   1401 MHz     15x32 = 480   280
GeForce 8800 GTX   1.0                  768 MB    575 MHz      16x8 = 128    420
GeForce 9800 GT    1.0                  512 MB    1500 MHz     14x8 = 112    150
Tesla C1060        1.3                  4 GB      1.3 GHz      240           920
Tesla C2050        2.0                  3 GB      1.15 GHz     448           2100-2800

References

J. Sanders, E. Kandrot. CUDA by Example. Addison-Wesley, 2010.

http://www.nvidia.com/object/computational_fluid_dynamics.html
