GPU Architecture
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
- We describe the motivation for talking about underlying device architecture, because device architecture is often avoided in conventional programming courses
- Contrast conventional multicore CPU architecture with a high-level view of AMD and Nvidia GPU architecture
- This lecture starts with a high-level architectural view of all GPUs, discusses each vendor's architecture, and then converges back to the OpenCL spec
- Stress the difference between the AMD VLIW architecture and the Nvidia scalar architecture
- Also discuss the different memory architectures
- A brief discussion of the ICD and the compilation flow of OpenCL provides a lead-in to Lecture 5, where the first complete OpenCL program is written
Topics
- Mapping the OpenCL spec to many-core hardware
  - AMD GPU Architecture
  - Nvidia GPU Architecture
  - Cell Broadband Engine
- OpenCL-specific topics
  - OpenCL Compilation System
  - Installable Client Driver (ICD)
Motivation
- Why are we discussing vendor-specific hardware if OpenCL is platform independent?
- To gain intuition for how a program's loops and data need to map to OpenCL kernels in order to obtain performance
- To observe similarities and differences between Nvidia and AMD hardware
- Understanding the hardware will allow for platform-specific tuning of code in later lectures
Conventional CPU Architecture
- Space is devoted to control logic instead of ALUs; CPUs are optimized to minimize the latency of a single thread
- Can efficiently handle control-flow-intensive workloads
- Multi-level caches are used to hide latency
- Limited number of registers, due to the smaller number of active threads
- Control logic reorders execution, provides ILP, and minimizes pipeline stalls
- A present-day multicore CPU can have more than one ALU (typically < 32), and some of the cache hierarchy is usually shared across cores
[Figure: conventional CPU block diagram showing control logic, an ALU, L1/L2/L3 caches, and a ~25 GB/s bus to system memory]
Modern GPGPU Architecture
- Generic many-core GPU: less space devoted to control logic and caches
- Large register files to support multiple thread contexts
- Low-latency, hardware-managed thread switching
- Large number of ALUs per "core," with a small user-managed cache per core
- Memory bus optimized for bandwidth: ~150 GB/s allows a large number of ALUs to be serviced simultaneously
[Figure: generic GPU block diagram showing simple ALUs with a small cache per core and a high-bandwidth bus to on-board system memory]
AMD GPU Hardware Architecture
- AMD 5870 ("Cypress")
- 20 SIMD engines, each with 16 stream cores
- 5 multiply-adds per stream core (VLIW processing)
- 2.72 teraflops single precision, 544 gigaflops double precision
Source: Introductory OpenCL, SAAHPC 2010, Benedict R. Gaster
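As a quick sanity check on the single-precision figure (the 850 MHz engine clock is not on the slide; it is the published clock of the Radeon HD 5870): 20 SIMD engines × 16 stream cores × 5 ALUs = 1600 ALUs, and 1600 ALUs × 2 FLOPs per multiply-add × 0.85 GHz = 2720 GFLOPS ≈ 2.72 teraflops. The quoted double-precision rate is one fifth of that: 544 gigaflops.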
SIMD Engine
- A SIMD engine consists of a set of "stream cores"
- Stream cores are arranged as five-way Very Long Instruction Word (VLIW) processors: up to five scalar operations can be issued in one VLIW instruction
- Scalar operations are executed on each processing element
- Stream cores within a compute unit execute the same VLIW instruction
- The block of work-items that are executed together is called a wavefront: 64 work-items on the 5870
[Figure: one SIMD engine and one stream core, showing the processing elements, T-processing element, branch execution unit, general-purpose registers, and instruction and control flow]
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
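What the five-way VLIW means for kernel code: the compiler must find independent scalar operations within one work-item to fill the slots. A hedged sketch (hypothetical kernels; writing with float4, a common tuning tip for this architecture, hands the VLIW packer four independent multiplies per work-item, though actual packing is up to the compiler):

    // Scalar version: one multiply per work-item; the compiler must find
    // other independent operations to fill the remaining VLIW slots.
    __kernel void scale_scalar(__global const float* in,
                               __global float* out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * in[i];
    }

    // Vectorized version: four independent multiplies per work-item.
    __kernel void scale_vec4(__global const float4* in,
                             __global float4* out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * in[i];
    }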
AMD Platform as Seen in OpenCL
- Individual work-items execute on a single processing element; a processing element refers to a single VLIW core
- Multiple work-groups execute on a compute unit; a compute unit refers to a SIMD engine
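To make the mapping concrete, a minimal vector-addition kernel (a sketch; the course writes its first complete program in Lecture 5). Each value of get_global_id identifies one work-item, which runs on one processing element, and the work-group containing it is scheduled onto one compute unit:

    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* c)
    {
        size_t i = get_global_id(0);  // this work-item's index in the NDRange
        c[i] = a[i] + b[i];           // executed on one processing element
    }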
AMD GPU Memory Architecture
- Memory per compute unit: the local data store (on-chip), registers, and an L1 cache (8 KB on the 5870)
- L2 cache shared between compute units (512 KB on the 5870)
- Fast path for 32-bit operations only; complete path for atomics and < 32-bit operations
[Figure: SIMD engines with LDS and registers, L1 caches, the compute-unit-to-memory crossbar, L2 cache, write cache, and atomic path]
AMD Memory Model in OpenCL
- A subset of the hardware memory is exposed in OpenCL
- The Local Data Share (LDS) is exposed as local memory: designed to share data between work-items of a work-group to increase performance, with high-bandwidth access per SIMD engine
- Private memory utilizes registers per work-item
- Constant memory: __constant declarations utilize the L1 cache (see the sketch below)
[Figure: OpenCL view of the device, showing private memory per work-item, local memory per compute unit, the global/constant memory data cache, and global memory in compute device memory]
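A minimal sketch of how these address spaces appear in kernel code (hypothetical kernel; a barrier is required before other work-items can read the staged values):

    __constant float coeff[4] = {0.1f, 0.2f, 0.3f, 0.4f};  // served by the L1 cache

    __kernel void neighbour_avg(__global const float* in,
                                __global float* out,
                                __local float* tile)   // LDS, shared per work-group
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int n   = get_local_size(0);
        float x = in[gid];              // private: lives in registers
        tile[lid] = x;                  // stage into local memory
        barrier(CLK_LOCAL_MEM_FENCE);   // make the tile visible to the group
        float nb = tile[(lid + 1) % n]; // read a neighbour's value from the LDS
        out[gid] = (x + nb) * coeff[0];
    }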
AMD Constant Memory Usage
- Constant memory declarations for AMD GPUs are only beneficial for the following access patterns:
  - Direct-addressing patterns: for non-array constant values where the address is known up front
  - Same-index patterns: when all work-items reference the same constant address
  - Globally scoped constant arrays: arrays that are initialized and globally scoped can use the cache if smaller than 16 KB
- Cases where each work-item accesses different indices are not cached, and deliver the same performance as a global memory read (contrasted in the sketch below)
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
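A hedged illustration of the distinction (hypothetical kernels; per the guide cited above, only the first pattern benefits from the constant cache):

    __constant float lut[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // initialized, globally scoped, < 16 KB

    // Same-index pattern: every work-item reads lut[k], served by the cache.
    __kernel void same_index(__global float* out, int k)
    {
        int i = get_global_id(0);
        out[i] = lut[k];
    }

    // Per-item-index pattern: each work-item reads a different element;
    // this performs like an ordinary global memory read.
    __kernel void per_item_index(__global const int* idx, __global float* out)
    {
        int i = get_global_id(0);
        out[i] = lut[idx[i]];
    }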
Nvidia GPUs – Fermi Architecture
- GTX 480: compute capability 2.0
- 15 cores, or Streaming Multiprocessors (SMs); each SM features 32 CUDA processors, for 480 CUDA processors in total
- Global memory with ECC
[Figure: Fermi SM block diagram showing the instruction cache; two warp schedulers, each with a dispatch unit; a 32768 x 32-bit register file; 32 cores (each a CUDA core with dispatch port, operand collector, FP unit, integer unit, and result queue); 16 load/store (LD/ST) units; 4 special function units (SFUs); the interconnect; 64 KB of L1 cache / shared memory; and the L2 cache]
Source: NVIDIA's Next Generation CUDA Architecture Whitepaper
Nvidia GPUs – Fermi Architecture (continued)
- An SM executes threads in groups of 32 called warps; there are two warp issue units per SM
- Concurrent kernel execution: multiple kernels execute simultaneously to improve efficiency
- A CUDA core consists of a single ALU and floating-point unit (FPU)
[Figure: the same Fermi SM block diagram as on the previous slide]
Source: NVIDIA's Next Generation CUDA Compute Architecture Whitepaper
SIMT and SIMD
- SIMT denotes scalar instructions and multiple threads sharing an instruction stream
  - Hardware determines instruction-stream sharing across ALUs
  - E.g., NVIDIA GeForce ("SIMT" warps) and AMD Radeon architectures ("wavefronts"), where all the threads in a warp/wavefront proceed in lockstep
  - Divergence between threads is handled using predication
- SIMT instructions specify the execution and branching behavior of a single thread
- SIMD instructions expose the vector width; e.g., explicit vector instructions like x86 SSE
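A small sketch of divergence under SIMT (hypothetical kernel): when threads of one warp/wavefront take different sides of the branch, the hardware predicates and issues both sides, so a data-dependent branch like this can cost roughly the sum of both paths:

    __kernel void divergent(__global const int* v, __global int* out)
    {
        int i = get_global_id(0);
        // Lanes of the same warp/wavefront may disagree here; each side is
        // executed with the non-participating lanes masked off.
        if (v[i] > 0)
            out[i] = v[i] * 2;
        else
            out[i] = -v[i];
    }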
SIMT Execution Model
- SIMD execution can be combined with pipelining
- The ALUs all execute the same instruction
- Pipelining is used to break an instruction into phases
- When the first instruction completes (4 cycles here), the next instruction is ready to execute
[Figure: a wavefront issuing Add/Mul instruction phases across the SIMD width, cycle by cycle]
Nvidia Memory Hierarchy
- The L1 cache per SM is configurable to support shared memory and caching of global memory: 48 KB shared / 16 KB L1, or 16 KB shared / 48 KB L1
- Data is shared between work-items of a group using shared memory
- Each SM has a bank of 32 K registers (the 32768 x 32-bit register file shown on the Fermi slide)
- The L2 cache (768 KB) services all operations (load, store, and texture); unified path to global memory for loads and stores
[Figure: a thread block using registers, the L1 cache and shared memory inside the SM, then the L2 cache and global memory]
Nvidia Memory Model in OpenCL
- Like AMD, a subset of the hardware memory is exposed in OpenCL
- The configurable shared memory is usable as local memory: local memory is used to share data between work-items of a work-group at lower latency than global memory (see the reduction sketch below)
- Private memory utilizes registers per work-item
[Figure: the same OpenCL device memory diagram as on the AMD memory model slide]
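As a concrete use of this lower-latency sharing, a work-group sum reduction (a sketch; it assumes the work-group size is a power of two and that the host sizes scratch to one float per work-item):

    __kernel void group_sum(__global const float* in,
                            __global float* partial,  // one result per work-group
                            __local float* scratch)
    {
        int lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction in local memory: log2(group size) rounds.
        for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }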
Cell Broadband Engine
- Developed by Sony, Toshiba, and IBM
- Transitioned from embedded platforms into HPC via the PlayStation 3
- OpenCL drivers available for Cell BladeCenter servers
- Consists of a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs)
- Uses the IBM XL C for OpenCL compiler
[Figure: Cell block diagram showing the PowerPC PPE with L1 and L2 caches, four SPEs each containing an SPU and a 256 KB local store (LS), the memory and interrupt controller, a ~200 GB/s element interconnect, and 25 GB/s links]
Source: http://www.alphaworks.ibm.com/tech/opencl
Cell BE and OpenCL
- The Cell Power/VMX CPU is used as a CL_DEVICE_TYPE_CPU device
- The Cell SPUs form a CL_DEVICE_TYPE_ACCELERATOR device
  - The number of compute units on the SPU accelerator device is <= 16
  - Local memory size is <= 256 KB: the 256 KB of local storage is divided among the OpenCL kernel, an 8 KB global data cache, and local, constant, and private variables
- The OpenCL accelerator devices and the OpenCL CPU device share a common memory bus
- Provides extensions like "Device Fission" and "Migrate Objects" to specify where an object resides (discussed in Lecture 10)
- No support for OpenCL images, sampler objects, atomics, or byte-addressable memory
Source: http://www.alphaworks.ibm.com/tech/opencl
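Selecting the SPU device from the host is ordinary OpenCL (a minimal sketch; error handling is elided and it assumes the IBM platform is the first one returned):

    #include <CL/cl.h>

    cl_device_id get_spu_device(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        // On this implementation the SPUs appear as an ACCELERATOR device.
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
        return device;
    }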
An Optimal GPGPU Kernel
From the discussion on hardware, we see that an ideal kernel for a GPU:
- Has thousands of independent pieces of work
  - Uses all available compute units
  - Allows interleaving for latency hiding
- Is amenable to instruction stream sharing
  - Maps to SIMD execution by preventing divergence between work-items
- Has high arithmetic intensity
  - The ratio of math operations to memory accesses is high
  - Not limited by memory bandwidth
Note that these criteria apply to all GPUs (a kernel shaped by them is sketched below)
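A hedged sketch of a kernel shaped by these criteria (hypothetical: one independent work-item per element, a uniform trip count so work-items never diverge, and arithmetic intensity that grows with the polynomial degree):

    // Evaluate a degree-N polynomial at each point: one load and one store
    // support roughly 2*N floating-point operations.
    __kernel void poly_eval(__global const float* x,
                            __global float* y,
                            __constant float* c,  // coefficients: same-index access
                            int degree)
    {
        int i = get_global_id(0);             // thousands of independent work-items
        float xi  = x[i];
        float acc = c[degree];
        for (int k = degree - 1; k >= 0; k--) // uniform trip count: no divergence
            acc = acc * xi + c[k];            // Horner's rule: one multiply-add per term
        y[i] = acc;
    }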
OpenCL Compilation System
- LLVM: Low Level Virtual Machine
  - Kernels are compiled to LLVM IR
  - Open-source compiler platform, OS independent
  - Multiple back ends: http://llvm.org
[Figure: compilation flow from an OpenCL compute program through the LLVM front end to LLVM IR, then to back ends for AMD CAL IL, x86, and Nvidia PTX]
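From the host side, runtime compilation is a short sequence (a minimal sketch; the kernel source here is a stand-in and error handling is elided):

    #include <CL/cl.h>

    cl_kernel build_scale_kernel(cl_context context, cl_device_id device)
    {
        const char* src =
            "__kernel void scale(__global float* x, float k) {"
            "    x[get_global_id(0)] *= k;"
            "}";
        cl_int err;
        // The source is compiled at runtime by the vendor implementation,
        // lowered through an IR (e.g. LLVM IR) to the device ISA.
        cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
        err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        return clCreateKernel(program, "scale", &err);
    }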
Installable Client Driver
- The ICD allows multiple implementations to co-exist
- Code links only against libOpenCL.so; the application selects an implementation at runtime
- The current GPU driver model does not easily allow multiple devices across manufacturers
- clGetPlatformIDs() and clGetPlatformInfo() examine the list of available implementations and select a suitable one (see the sketch below)
[Figure: the application linked against libOpenCL.so, which dispatches to vendor libraries such as atiocl.so and Nvidia's OpenCL library]
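Platform selection through the ICD, as a minimal sketch (hypothetical helper that picks the first platform whose name contains a requested substring; error handling is elided):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <string.h>

    cl_platform_id pick_platform(const char* wanted)
    {
        cl_platform_id platforms[8];
        cl_uint n = 0;
        clGetPlatformIDs(8, platforms, &n);  // enumerate all ICD-registered vendors
        for (cl_uint i = 0; i < n; i++) {
            char name[256];
            clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                              sizeof(name), name, NULL);
            printf("platform %u: %s\n", i, name);
            if (strstr(name, wanted))
                return platforms[i];
        }
        return n ? platforms[0] : NULL;      // fall back to the first platform
    }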
Summary
- We have examined different many-core platforms and how they map onto the OpenCL spec
- An important take-away is that even though vendors have implemented the spec differently, the underlying ideas a programmer uses to obtain performance remain consistent
- We have looked at the runtime compilation model of OpenCL to understand how programs and kernels for compute devices are created at runtime
- We have looked at the ICD to understand how an OpenCL application can choose an implementation at runtime
- Next lecture: moving data to a compute device, and some simple but complete OpenCL examples

Editor's Notes

  • #5: This point is important because this would be a common question while teaching an open, platform-agnostic programming model
  • #6: Basic computer architecture points for multicore CPUs
  • #7: High-level, 10,000-foot view of what a GPU looks like, irrespective of whether it is AMD's or Nvidia's
  • #8: Very AMD-specific discussion of low-level GPU architecture
  • #9: Very AMD-specific discussion of low-level GPU architecture. Discusses a single SIMD engine and the stream cores
  • #10: We converge back to the OpenCL terminology to understand how the AMD GPU maps onto the OpenCL processing elements
  • #11: A brief overview of the AMD GPU memory architecture (as per the Evergreen series)
  • #12: The mapping of the AMD GPU memory components to the OpenCL terminology. Architecturally this is similar for AMD and Nvidia, except that each has its own vendor-specific names. Similar types of memory are mapped to local memory for both AMD and Nvidia
  • #13: Summary of the usage of constant memory. Important because there is a restricted set of cases where there is a hardware-provided performance boost when using constant memory. This will have greater context with a complete example, which comes later. This slide is included in case someone is reading this while optimizing an application and needs device-specific details
  • #14: Nvidia "Fermi" architecture, high-level overview
  • #15: Architectural highlights of an SM in a Fermi GPU. Mention the scalar nature of a CUDA core, unlike AMD's VLIW architecture
  • #16: The SIMT execution model of GPU threads. SIMD specifies a vector width, as in SSE. However, the SIMT execution model doesn't necessarily need to know the number of threads in a warp for an OpenCL program. The concept of a warp/wavefront is not within OpenCL
  • #17: The SIMT execution model, which shows how different threads execute the same instruction
  • #18: Nvidia-specific GPU memory architecture. The main highlight is the configurable L1:shared size ratio. L2 is not exposed in the OpenCL specification
  • #19: Similar to AMD in the sense that the low-latency memory, which is the shared memory, becomes OpenCL local memory
  • #20: Brief introduction to the Cell
  • #21: Brief overview of how the Cell's memory architecture maps to OpenCL. For usage of the Cell in specific applications a high-level view is given, and Lecture 10 discusses its special extensions. Optimizations in Lectures 6-8 do not apply to the Cell because of its very different architecture
  • #22: Discusses an optimal kernel to show how, irrespective of the different underlying architectures, an optimal program for both AMD and Nvidia would have similar characteristics
  • #23: Explains how platform-agnostic OpenCL code is mapped to a device-specific instruction set architecture
  • #24: The ICD is included in order to explain how we can interface different OpenCL implementations with a similar compilation toolchain