From CPU To GPU With CUDA C Language
Michele Tuttafesta
Dottorato di Ricerca in Fisica, 25° Ciclo
Overview
1. GPUs and CUDA
2. Parallel Programming in CUDA C
3. Our needs, attempts and failures
4. Replacing original model and code
5. Positive results and outlooks
6. Bibliographical research
7. References
A comparison of computational power shows that GPUs exceed CPUs by orders of magnitude.
The theoretical peak performance of the current consumer graphics card NVIDIA GeForce GTX 295 (with two GPUs) is 1788.48 GFLOPS (giga floating-point operations per second) per GPU in single precision, while a CPU (Core 2 Quad Q9650, 3.0 GHz) gives a peak performance of around 96 GFLOPS in single precision.
The release of the Compute Unified Device Architecture (CUDA) hardware and software architecture is the culmination of such development.
The Compute Unified Device Architecture (CUDA) was introduced by NVIDIA as a general-purpose parallel computing architecture, which includes the GPU hardware architecture as well as software components (the CUDA compiler and the system drivers and libraries).
The CUDA programming model consists of functions, called kernels, which can be executed simultaneously by a large number of lightweight threads on the GPU.
These threads are grouped into one-, two-, or three-dimensional thread blocks, which are further organized into one- or two-dimensional grids.
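For illustration, a minimal sketch of this hierarchy in CUDA C (the kernel name and launch sizes below are hypothetical examples, not taken from the original slides):

#include <stdio.h>

// Each thread reports its position in the block/grid hierarchy.
// Device-side printf requires compute capability >= 2.0.
__global__ void hello(void)
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // a 1D grid of 2 blocks, 4 threads each
    cudaDeviceSynchronize();    // wait for the kernel (and its output)
    return 0;
}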
Thread blocks are independent of each other and can be executed in any order.
A graphics card that supports CUDA, for example the GT200, consists of 30 streaming multiprocessors (SMs). Each multiprocessor consists of 8 streaming processors (SPs), providing a total of 240 SPs.
Threads are grouped into batches of 32 called warps, which are executed independently in single instruction multiple data (SIMD) fashion. Threads within a warp execute a common instruction at a time.
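Since a warp executes one common instruction at a time, a data-dependent branch inside a warp forces the hardware to serialize the two paths. A sketch of this effect (a hypothetical kernel, not from the original slides):

__global__ void divergent(int *out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Half of each warp takes each branch: the two paths are executed
    // one after the other, with the non-participating threads masked off.
    if (tid % 32 < 16)
        out[tid] = 2 * tid;
    else
        out[tid] = tid + 1;
}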
Global memory has a separate address space; data are obtained from the host CPU's main memory through the PCIe bus, at about 8 GB/sec on the GT200 GPU. Any value stored in global memory can be accessed by all SMs via load and store instructions.
Constant memory and texture memory are cached, read-only and shared between SPs. Constants that are kept unchanged during kernel execution may be stored in constant memory. Built-in linear interpolation is available in texture memory.
Shared memory is limited (16 KB on the GT200 GPU) and shared between all SPs in an SM.
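As an example of how constant memory is declared and filled in CUDA C (the array name and size are illustrative, not from the original slides):

// Read-only table, cached on chip and visible to every thread.
__constant__ float d_coeffs[16];

__global__ void scale(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] *= d_coeffs[tid % 16];   // cached constant-memory read
}

// Host side, before the kernel launch:
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));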
The CUDA C compiler treats variables in shared memory differently than typical variables. It creates a copy of the variable for each block that you launch on the GPU.
Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared memory buffers reside physically on the GPU as opposed to residing in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.
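A minimal sketch of this scratchpad use, in the spirit of the dot-product example in CUDA by Example (the names and block size are illustrative):

#define THREADS_PER_BLOCK 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[THREADS_PER_BLOCK];   // one copy per block

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                 // all loads finished before reducing

    // Tree reduction carried out entirely in on-chip shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];  // one partial sum per block
}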
void add(int *a, int *b, int *c)
{
    int tid = 0;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;
    }
}
//======================
int main(void)
{
    ...
    add(a, b, c);
    ...
}
__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
//======================
int main(void)
{
    ...
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    add<<<N,1>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    ...
}
Imagine four blocks, all running through the same copy of the device code but having different values for the variable blockIdx.x.
__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x;   // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
//======================
int main(void)
{
    ...
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    add<<<1,N>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    ...
}
__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;   // stride over the whole grid
    }
}
//======================
int main(void)
{
    ...
    add<<<BlocksPerGrid,ThreadsPerBlock>>>(dev_a, dev_b, dev_c);
    ...
}
__global__ void kernel(...)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tz = threadIdx.z;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    ...
}
//======================
int main(void)
{
    dim3 grids(32,18);          // 32 x 18 grid of blocks
    dim3 threads(16,128,64);    // note: the product of the block dimensions
                                // must not exceed the device limit on
                                // threads per block (1024 on CC 2.x)
    kernel<<<grids, threads>>>(...);
    ...
}
[Figure: Mach reflection test case; contours of velocity magnitude |u|, Mach number M shown in gray.]
Mass:
$$\frac{\partial \rho}{\partial t}+\frac{\partial(\rho u)}{\partial x}+\frac{\partial(\rho v)}{\partial y}=0$$

Momentum:
$$\frac{\partial(\rho u)}{\partial t}+\frac{\partial(\rho u^{2}+P)}{\partial x}+\frac{\partial(\rho u v)}{\partial y}=0$$
$$\frac{\partial(\rho v)}{\partial t}+\frac{\partial(\rho u v)}{\partial x}+\frac{\partial(\rho v^{2}+P)}{\partial y}=0$$

Energy:
$$\frac{\partial E}{\partial t}+\frac{\partial\left[u(E+P)\right]}{\partial x}+\frac{\partial\left[v(E+P)\right]}{\partial y}=0$$
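Equivalently, as a compact restatement (not shown on the original slide), the system is the 2D Euler equations in conservative form:

$$\frac{\partial U}{\partial t}+\frac{\partial F(U)}{\partial x}+\frac{\partial G(U)}{\partial y}=0,\qquad
U=\begin{pmatrix}\rho\\ \rho u\\ \rho v\\ E\end{pmatrix},\quad
F=\begin{pmatrix}\rho u\\ \rho u^{2}+P\\ \rho u v\\ u(E+P)\end{pmatrix},\quad
G=\begin{pmatrix}\rho v\\ \rho u v\\ \rho v^{2}+P\\ v(E+P)\end{pmatrix}$$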
[Figure: ... nitrogen environment.]

The 2D system is solved in two one-dimensional sweeps:
...firstly along the x direction for any point of the y grid, neglecting the derivatives with respect to y
...and then along the y axis for all the x grid points, neglecting the x derivatives
Basically, in a Δt we divide the 2D flow: first into mplt 1D flows along the x-direction and then (splitting) into nplt 1D flows along the y-direction.
Every 1D flow implies the resolution of 4 tridiagonal linear systems, one for each of the conserved quantities ρ, ρu, ρv, E.
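For reference, a generic sketch of the Thomas algorithm for such tridiagonal systems (a textbook routine, not necessarily the exact solver used in our code):

// Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1,
// with a[0] = c[n-1] = 0.  Overwrites c and d; the solution goes in x.
void thomas(int n, const double *a, const double *b,
            double *c, double *d, double *x)
{
    c[0] /= b[0];                        // forward elimination
    d[0] /= b[0];
    for (int i = 1; i < n; i++) {
        double m = 1.0 / (b[i] - a[i] * c[i - 1]);
        c[i] *= m;
        d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    x[n - 1] = d[n - 1];                 // back substitution
    for (int i = n - 2; i >= 0; i--)
        x[i] = d[i] - c[i] * x[i + 1];
}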
Our GPU:

General Information
  Name: GeForce GTS 450
  Compute capability: 2.1
  Clock rate: 1566000 kHz
  Device copy overlap: Enabled
  Kernel execution timeout: Enabled

Memory Information
  Total global mem: 1072889856
  Total constant mem: 65536
  Max mem pitch: 2147483647
  Texture alignment: 512

MP Information
  Multiprocessor count: 4
  Shared mem per MP: 49152
  Registers per MP: 32768
  Threads in warp: 32
  Max threads per block: 1024
  Max thread dimensions: (1024, 1024, 64)
  Max grid dimensions: (65535, 65535, 1)

While our CPU is:
Our first attempt was almost frustrating. The converted code runs some kernels with at most 4 × 200 = 800 threads. In such a case the ratio of program execution Machine Time CPU/GPU is about 0.6 (decreasing if threads < 800).
Through further investigations we realized that the goal CPU/GPU > 1 requires kernels running thousands of threads.
The code provided to us by Prof. Giuseppe Pascazio of Politecnico di Bari solves a 2D inviscid flow using a flexible structured grid.
Prof. G. Pascazio's code: technical specifications
Finite volume approach.
Flux calculation at cell interfaces by Roe flux-differencing.
MUSCL extrapolation of physical variables at cell interfaces.
Available limiters: minmod, van Albada (a sketch of minmod is given below).
Spatial discretization:
  first order fully upwind
  second order fully upwind
  second order upwind biased
  third order upwind biased
4 × IMAX × JMAX unknowns (4 conserved variables per cell).
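As an illustration of the limiter idea, a generic textbook minmod (not necessarily the exact form used in Prof. Pascazio's code):

#include <math.h>

// Minmod slope limiter: zero when the two one-sided slopes disagree in
// sign, otherwise the one of smaller magnitude.  Used in the MUSCL
// extrapolation to keep the reconstruction non-oscillatory.
double minmod(double a, double b)
{
    if (a * b <= 0.0)
        return 0.0;                        // opposite signs: flatten
    return (fabs(a) < fabs(b)) ? a : b;    // same sign: smaller slope
}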
Pascazio's code was written in FORTRAN, so we first made a translation/modification in C and named the resulting program eulero-fds (Flux Difference Splitting).
By simply adjusting the CFL parameter in the new model we get our first gain in Machine Time.
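For context, the time step is tied to the grid spacing through the usual CFL stability condition (a standard restatement; the exact form used in the code is not shown on the slide):

$$\Delta t \le \mathrm{CFL}\cdot\min_{i,j}\left(\frac{\Delta x}{|u|+c},\ \frac{\Delta y}{|v|+c}\right)$$

where $c$ is the local sound speed; pushing the CFL number toward its stability limit reduces the number of time steps and hence the machine time.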
[Timing comparison: eulero-fds (GPU) 2 min; ratios R = 37.8 and R = 4.45.]
ρ_in = 0.8297 kg/m³
Sources:
http://developer.nvidia.com/cuda-gpus
http://www.trovaprezzi.it/prezzo_schede-grafiche_nvidia_tesla.aspx
http://en.wikipedia.org/wiki/CUDA

Name               CC¹    Clock rate   MP x Cores   Mem.    Cost
GeForce GTX 285    1.3    1476 MHz     -            -       -
GeForce GTX 295    1.3    1242 MHz     -            -       -
GeForce GTX 480    2.0    1401 MHz     -            -       -
GeForce 8800 GTX   1.0    575 MHz      -            -       -
GeForce 9800 GT    1.0    1500 MHz     -            -       -
Tesla C1060        1.3    1.3 GHz      -            -       -
Tesla C2050        2.0    1.15 GHz     -            -       -
GeForce GTS 450    2.1    1.57 GHz     4x48=192     1 GB    100

¹ CC = Compute Capability
References

J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2010.
http://www.nvidia.com/object/computational_fluid_dynamics.html