Parallel Distributed Computing
GeForce 7800 GTX
Board Details
• SLI connector
• Single-slot cooling
• sVideo TV out
• DVI x 2
• 256MB/256-bit DDR3, 600 MHz, 8 pieces of 8Mx32
• 16x PCI-Express
GPU Details
302 million transistors
430 MHz core clock
256-bit memory interface
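The memory figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes the 600 MHz memory clock transfers data on both clock edges (double data rate) and that "8Mx32" means 8M words of 32 bits per chip; the function names are illustrative only.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, memory_clock_hz, transfers_per_clock=2):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) of a DDR memory interface."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * memory_clock_hz * transfers_per_clock / 1e9


def capacity_mb(chips, mega_words_per_chip, bits_per_word):
    """Total capacity in MB of `chips` devices, each mega_words x bits_per_word."""
    total_mbit = chips * mega_words_per_chip * bits_per_word
    return total_mbit / 8


# 256-bit interface at 600 MHz, double data rate (assumed)
print(peak_bandwidth_gb_per_s(256, 600e6))
# 8 pieces of 8Mx32
print(capacity_mb(8, 8, 32))
```

Under these assumptions the eight chips add up to the 256MB stated on the slide, and the interface peaks at roughly 38 GB/s.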
Notable Functionality
• Non-power-of-two textures with mipmaps
• Floating-point (fp16) blending and filtering
• sRGB color space texture filtering and frame buffer blending
• Vertex textures
• 16x anisotropic texture filtering
• Dynamic vertex and fragment branching
• Double-rate depth/stencil-only rendering
• Early depth/stencil culling
• Transparency antialiasing
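To make "non-power-of-two textures with mipmaps" concrete, here is a minimal sketch of how the mipmap chain sizes fall out for an arbitrary base size. It assumes the round-down convention that OpenGL uses for non-power-of-two textures; `mip_chain` is a hypothetical helper, not part of any API.

```python
def mip_chain(width, height):
    """Return the (width, height) of every mipmap level, base level first.

    Each level halves the previous one, rounding down, and never goes
    below 1 texel in either dimension.
    """
    levels = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


for level, size in enumerate(mip_chain(640, 480)):
    print(level, size)
```

With power-of-two sizes every level halves exactly; with sizes such as 640 x 480 odd dimensions appear partway down the chain, which is why hardware support for filtering these levels is listed as a feature.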
GeForce 7800 GTX Parallelism
8 Vertex Engines
GeForce Graphics Pipeline
Vertex Engine
Vertex pulling
Vector floating-point instructions
Dynamic branching
Vertex texture
Vertex stream frequency
Setup
Prepare triangle for rasterization
215M triangles/sec setup
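The setup rate can be related to the 430 MHz core clock quoted earlier. The sketch below assumes the setup unit runs at the core clock, which the slides do not state; `cycles_per_triangle` is an illustrative name.

```python
def cycles_per_triangle(core_clock_hz, triangles_per_second):
    """Average number of clock cycles available per triangle set up."""
    return core_clock_hz / triangles_per_second


# 430 MHz core clock, 215M triangles/sec setup
print(cycles_per_triangle(430e6, 215e6))
```

Under that assumption the quoted rate corresponds to one triangle set up every two core clocks.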
Raster
Compute coverage
Points, lines, and triangles
Rotated grid multisampling
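Coverage computation with multisampling can be sketched with edge functions: each sample position inside a pixel is tested against the three triangle edges. The four sample offsets below are a commonly used rotated-grid pattern chosen for illustration, not the actual positions used by the hardware, and the inclusive `>= 0` test ignores the tie-breaking fill rules a real rasterizer applies.

```python
# Hypothetical 4-sample rotated-grid offsets inside a pixel, each in [0, 1).
ROTATED_GRID_4X = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]


def edge(a, b, p):
    """Edge function: positive when p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covered_samples(tri, px, py, offsets=ROTATED_GRID_4X):
    """Count the samples of pixel (px, py) inside a counter-clockwise triangle."""
    v0, v1, v2 = tri
    count = 0
    for ox, oy in offsets:
        p = (px + ox, py + oy)
        if edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0 and edge(v2, v0, p) >= 0:
            count += 1
    return count


triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
print(covered_samples(triangle, 0, 0))  # pixel well inside the triangle
print(covered_samples(triangle, 1, 2))  # pixel crossed by the diagonal edge
print(covered_samples(triangle, 5, 5))  # pixel outside the triangle
```

A partially covered pixel reports a fraction of its samples, which is what lets multisampling smooth triangle edges without shading every sample separately.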
Z Cull
Fragment Shader
User-programmed fragment coloring
Dynamic branching
Long shaders
Multiple render targets
fp16 and fp32 vectors
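The difference between fp16 and fp32 values can be seen by rounding the same number to each format. This is a small sketch using Python's `struct` module, assuming the GPU's fp16 matches the IEEE 754 half-precision layout (1 sign, 5 exponent, 10 mantissa bits); the helper names are illustrative.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest single-precision (fp32) value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


print(to_fp16(0.1), to_fp32(0.1))        # fp16 keeps far fewer digits
print(to_fp16(2049.0), to_fp32(2049.0))  # fp16 cannot represent every integer here
```

fp16 halves storage and bandwidth per component, which suits colors and normals, while fp32 is the safer choice for positions and texture coordinates that need more precision.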
Texture
fp16 and sRGB filtering
16x anisotropic filtering
Non-power-of-two mipmapping
Shadow maps, cube maps, and 3D
Floating-point textures
Raster Operations
2x and 4x multisampling
fp16 and sRGB blending
Multiple render targets
Color and depth compression
Double-speed depth/stencil only
Single GeForce 7800 Vertex Unit
Vertex Processing Engine
• MIMD architecture
• Dual issue
• Low-penalty branching
• Shader Model 3.0
• 32 vector registers
• 512 static instructions per program
• Indexed input and output registers
Vertex Texture Fetch
• Non-stalling
• Up to 4 texture units
• Unlimited fetches
• Mipmapping, no filtering
[Diagram labels: Primitive Assembly + Attribute Processing, Vertex Texture Fetch, FP32 Scalar Unit, FP32 Vector Unit, Branch Unit, Texture Cache, Viewport Processing, To Setup]
Vertex Texturing Example
Vertex
Program
Images used with permission from Pacific Fighters. © 2004 Developed by 1C:Maddox Games.
All rights reserved. © 2004 Ubi Soft Entertainment.
Vertex Textures to Drive Particle Systems
◼ Render-to-texture
Simulation runs in floating-point frame buffer, also usable as texture
◼ Vertex textures
Determines particle location with vertex texture fetch
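The two steps of this technique can be mimicked on the CPU. In the sketch below a nested list stands in for the floating-point texture; `simulation_pass` plays the role of the render-to-texture step and `vertex_texture_fetch` the lookup a vertex program would perform. All names, and the simple Euler update, are assumptions made for illustration.

```python
def simulation_pass(positions, velocities, dt):
    """Render-to-texture step: produce a new position texture from the old one."""
    return [
        [tuple(p + v * dt for p, v in zip(pos, vel)) for pos, vel in zip(prow, vrow)]
        for prow, vrow in zip(positions, velocities)
    ]


def vertex_texture_fetch(texture, s, t):
    """Unfiltered (nearest texel) fetch at normalized coordinates (s, t)."""
    height, width = len(texture), len(texture[0])
    x = min(int(s * width), width - 1)
    y = min(int(t * height), height - 1)
    return texture[y][x]


def particle_vertices(position_texture):
    """One vertex per texel: each vertex looks up its own particle position."""
    height, width = len(position_texture), len(position_texture[0])
    return [
        vertex_texture_fetch(position_texture, (x + 0.5) / width, (y + 0.5) / height)
        for y in range(height)
        for x in range(width)
    ]


positions = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
velocities = [[(0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]]
positions = simulation_pass(positions, velocities, 0.5)
print(particle_vertices(positions))
```

The point of the design is that particle state never leaves the GPU: the output of one pass is bound as a texture for the next, so no readback to the CPU is needed.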
Single GeForce 7800 Fragment Shader Pipeline
Texture Processor
• 16 texture units
• 1 texture fetch at full speed
• Bilinear or tri-linear filtering
• 16x anisotropic filtering
• Floating-point (fp16) texture filtering
Shader Unit 1
• 4 MULs + RCP
• Dual issue
• Texture address calculation
• Fast fp16 normalize
• Free: negate, abs, condition codes
Shader Unit 2
• 4 MADs or DP4
• Dual issue
• Free: negate, abs, condition codes
Fog Unit
• Fixed-function
[Diagram labels: Input Fragment Data, Texture Data, Texture Processor, Texture Cache, FP32 Shader Unit 1, FP32 Shader Unit 2, Branch Processor, Fog Unit, Output Shaded Fragments]
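The vector operations named for the shader units are easy to state precisely. The sketch below writes out a 4-component multiply (MUL), multiply-add (MAD), and dot product (DP4) on plain tuples; it only illustrates what each operation computes, not how the hardware schedules it, and the function names are illustrative.

```python
def mul4(a, b):
    """MUL: component-wise product of two 4-vectors."""
    return tuple(x * y for x, y in zip(a, b))


def mad4(a, b, c):
    """MAD: component-wise a * b + c on 4-vectors."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def dp4(a, b):
    """DP4: 4-component dot product, producing a single scalar."""
    return sum(x * y for x, y in zip(a, b))


color = (1.0, 2.0, 3.0, 4.0)
scale = (2.0, 2.0, 2.0, 2.0)
bias = (1.0, 1.0, 1.0, 1.0)
print(mul4(color, scale))
print(mad4(color, scale, bias))
print(dp4(color, bias))
```

A MAD counts as two floating-point operations per component, which is why shader throughput figures are often quoted in MADs.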
Operations Per Fragment Shader Pass
Shader Unit 2
4 Components
1 Op / component
4 Ops / fragment per pass
[Figure labels: Operation 3, Operation 4]
Single GeForce 7800 Raster Operations Pipeline
Functionality
• OpenEXR floating-point blending
• sRGB blending
• 4x rotated grid multisampling
• Lossless color and depth compression
• Multiple render targets
[Diagram labels: Input Shaded Fragment Data, Pixel Crossbar Interconnect, Multisample Antialiasing, Depth Compression, Color Compression, Depth Raster Operations, Color Raster Operations, Memory, Frame Buffer Partition]
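sRGB blending means the stored, gamma-encoded values are converted to linear light, blended, and re-encoded. The sketch below uses the standard sRGB transfer function on a single channel in the range 0 to 1; the function names and the simple alpha blend are illustrative and say nothing about how the hardware implements the conversion.

```python
def srgb_to_linear(c):
    """Decode one sRGB-encoded channel value in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(l):
    """Encode one linear-light channel value in [0, 1] as sRGB."""
    return 12.92 * l if l <= 0.0031308 else 1.055 * l ** (1 / 2.4) - 0.055


def srgb_blend(src, dst, alpha):
    """Alpha-blend two sRGB-encoded values, doing the arithmetic in linear space."""
    linear = alpha * srgb_to_linear(src) + (1 - alpha) * srgb_to_linear(dst)
    return linear_to_srgb(linear)


# Blending white over black at 50% in linear space vs. naively on encoded values
print(srgb_blend(1.0, 0.0, 0.5))
print(0.5 * 1.0 + 0.5 * 0.0)
```

The two printed values differ noticeably, which is the visible error that blending directly on encoded values introduces.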
GeForce 7800
Transparency Antialiasing
[Diagram labels: SLI Connector, GPU Front End, Vertex Assembly, Primitive Assembly, Clipping, Setup, and Rasterization, Raster Operations, Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access, Memory Interface]
Dual Warp Scheduling
32 threads launch!
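A warp is a group of 32 threads that launch and execute together, and dual warp scheduling issues from two warps at a time. The sketch below groups thread indices into warps and pairs warps up for issue. Assigning even-numbered warps to one scheduler and odd-numbered warps to the other is an assumption made for illustration, and the function names are hypothetical.

```python
WARP_SIZE = 32


def split_into_warps(num_threads, warp_size=WARP_SIZE):
    """Group consecutive thread indices into warps; the last warp may be partial."""
    return [
        list(range(start, min(start + warp_size, num_threads)))
        for start in range(0, num_threads, warp_size)
    ]


def dual_issue_schedule(num_warps):
    """Pair warps so that two are issued per step (assumed even/odd split).

    Returns a list of (even_warp, odd_warp) pairs; a slot is None when a
    scheduler has no warp left to issue.
    """
    even = list(range(0, num_warps, 2))
    odd = list(range(1, num_warps, 2))
    steps = []
    for i in range(len(even)):
        steps.append((even[i], odd[i] if i < len(odd) else None))
    return steps


warps = split_into_warps(70)
print([len(w) for w in warps])
print(dual_issue_schedule(len(warps)))
```

A partial last warp still occupies a full warp slot, which is one reason block sizes are usually chosen as multiples of 32.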
Shader or CUDA Core, Same Unit but Two Personalities
◼ Execution unit
Scalar floating-point
Scalar integer
Levels of Caching in Fermi GPU
◼ 12 KB L1 Texture cache
Per texture unit
◼ 64 KB of on-chip memory per SM
Split between load/store cache and shared memory
Load/store cache: 16 KB or 48 KB
Shared memory: 48 KB or 16 KB
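The two supported splits of the 64 KB per-SM memory can be written down as a tiny helper. This is only a sketch of the arithmetic; `split_sm_memory` is a hypothetical name, not an API call.

```python
SM_ON_CHIP_KB = 64


def split_sm_memory(l1_kb):
    """Return (load/store cache KB, shared memory KB) for a supported split."""
    if l1_kb not in (16, 48):
        raise ValueError("supported load/store cache sizes are 16 KB or 48 KB")
    return l1_kb, SM_ON_CHIP_KB - l1_kb


print(split_sm_memory(16))  # favors shared memory
print(split_sm_memory(48))  # favors the load/store cache
```

In CUDA the preference is requested per kernel with `cudaFuncSetCacheConfig`; kernels that stage data explicitly tend to favor the larger shared memory, while kernels with irregular access patterns tend to favor the larger cache.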
◼ Programming APIs
CUDA, OpenCL, DirectCompute
◼ APIs + language = parallel processing system
OpenGL or Direct3D through shaders
◼ Cg, HLSL, GLSL