GPU Architecture
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
- We describe the motivation for talking about underlying device architecture, because device architecture is often avoided in conventional programming courses
- Contrast conventional multicore CPU architecture with a high-level view of AMD and Nvidia GPU architecture
- This lecture starts with a high-level architectural view of all GPUs, discusses each vendor's architecture, and then converges back to the OpenCL spec
- Stress the difference between the AMD VLIW architecture and the Nvidia scalar architecture
- Also discuss the different memory architectures
- A brief discussion of the ICD and the compilation flow of OpenCL provides a lead-in to Lecture 5, where the first complete OpenCL program is written
Topics
- Mapping the OpenCL spec to many-core hardware
  - AMD GPU Architecture
  - Nvidia GPU Architecture
  - Cell Broadband Engine
- OpenCL-specific topics
  - OpenCL Compilation System
  - Installable Client Driver (ICD)
Motivation
- Why are we discussing vendor-specific hardware if OpenCL is platform independent?
- To gain intuition for how a program's loops and data need to map to OpenCL kernels in order to obtain performance
- To observe similarities and differences between Nvidia and AMD hardware
- Understanding the hardware will allow for platform-specific tuning of code in later lectures
Conventional CPU Architecture
- Space is devoted to control logic instead of ALUs; CPUs are optimized to minimize the latency of a single thread
- Can efficiently handle control-flow-intensive workloads
- Multi-level caches are used to hide latency
- Limited number of registers, due to the smaller number of active threads
- Control logic reorders execution, provides ILP, and minimizes pipeline stalls
- A present-day multicore CPU can have more than one ALU (typically < 32), and some of the cache hierarchy is usually shared across cores
[Figure: conventional CPU block diagram showing control logic, an ALU, L1/L2/L3 caches, and a ~25 GB/s bus to system memory]
Modern GPGPU Architecture
- Generic many-core GPU: less space devoted to control logic and caches
- Large register files to support multiple thread contexts
- Low-latency, hardware-managed thread switching
- Large number of ALUs per "core," with a small user-managed cache per core
- Memory bus optimized for bandwidth: ~150 GB/s allows a large number of ALUs to be serviced simultaneously
[Figure: generic GPU block diagram showing simple ALUs with a small cache per core and a high-bandwidth bus to on-board system memory]
AMD GPU Hardware Architecture
- AMD 5870 ("Cypress")
- 20 SIMD engines, each with 16 stream cores
- 5 multiply-adds per stream core (VLIW processing)
- 2.72 teraflops single precision, 544 gigaflops double precision
Source: Introductory OpenCL, SAAHPC 2010, Benedict R. Gaster
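As a quick sanity check on the single-precision figure (the 850 MHz engine clock is not on the slide; it is the published clock of the Radeon HD 5870): 20 SIMD engines × 16 stream cores × 5 ALUs = 1600 ALUs, and 1600 ALUs × 2 FLOPs per multiply-add × 0.85 GHz = 2720 GFLOPS ≈ 2.72 teraflops. The quoted double-precision rate is one fifth of that: 544 gigaflops.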
SIMD Engine
- A SIMD engine consists of a set of "stream cores"
- Stream cores are arranged as five-way Very Long Instruction Word (VLIW) processors: up to five scalar operations can be issued in one VLIW instruction
- Scalar operations are executed on each processing element
- Stream cores within a compute unit execute the same VLIW instruction
- The block of work-items that are executed together is called a wavefront: 64 work-items on the 5870
[Figure: one SIMD engine and one stream core, showing the processing elements, T-processing element, branch execution unit, general-purpose registers, and instruction and control flow]
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
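What the five-way VLIW means for kernel code: the compiler must find independent scalar operations within one work-item to fill the slots. A hedged sketch (hypothetical kernels; writing with float4, a common tuning tip for this architecture, hands the VLIW packer four independent multiplies per work-item, though actual packing is up to the compiler):

    // Scalar version: one multiply per work-item; the compiler must find
    // other independent operations to fill the remaining VLIW slots.
    __kernel void scale_scalar(__global const float* in,
                               __global float* out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * in[i];
    }

    // Vectorized version: four independent multiplies per work-item.
    __kernel void scale_vec4(__global const float4* in,
                             __global float4* out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * in[i];
    }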
AMD Platform as Seen in OpenCL
- Individual work-items execute on a single processing element; a processing element refers to a single VLIW core
- Multiple work-groups execute on a compute unit; a compute unit refers to a SIMD engine
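To make the mapping concrete, a minimal vector-addition kernel (a sketch; the course writes its first complete program in Lecture 5). Each value of get_global_id identifies one work-item, which runs on one processing element, and the work-group containing it is scheduled onto one compute unit:

    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* c)
    {
        size_t i = get_global_id(0);  // this work-item's index in the NDRange
        c[i] = a[i] + b[i];           // executed on one processing element
    }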
AMD GPU Memory Architecture
- Memory per compute unit: the local data store (on-chip), registers, and an L1 cache (8 KB on the 5870)
- L2 cache shared between compute units (512 KB on the 5870)
- Fast path for 32-bit operations only; complete path for atomics and < 32-bit operations
[Figure: SIMD engines with LDS and registers, L1 caches, the compute-unit-to-memory crossbar, L2 cache, write cache, and atomic path]
AMD Memory Model in OpenCL
- A subset of the hardware memory is exposed in OpenCL
- The Local Data Share (LDS) is exposed as local memory: designed to share data between work-items of a work-group to increase performance, with high-bandwidth access per SIMD engine
- Private memory utilizes registers per work-item
- Constant memory: __constant declarations utilize the L1 cache (see the sketch below)
[Figure: OpenCL view of the device, showing private memory per work-item, local memory per compute unit, the global/constant memory data cache, and global memory in compute device memory]
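A minimal sketch of how these address spaces appear in kernel code (hypothetical kernel; a barrier is required before other work-items can read the staged values):

    __constant float coeff[4] = {0.1f, 0.2f, 0.3f, 0.4f};  // served by the L1 cache

    __kernel void neighbour_avg(__global const float* in,
                                __global float* out,
                                __local float* tile)   // LDS, shared per work-group
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int n   = get_local_size(0);
        float x = in[gid];              // private: lives in registers
        tile[lid] = x;                  // stage into local memory
        barrier(CLK_LOCAL_MEM_FENCE);   // make the tile visible to the group
        float nb = tile[(lid + 1) % n]; // read a neighbour's value from the LDS
        out[gid] = (x + nb) * coeff[0];
    }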
AMD Constant Memory Usage
- Constant memory declarations for AMD GPUs are only beneficial for the following access patterns:
  - Direct-addressing patterns: for non-array constant values where the address is known up front
  - Same-index patterns: when all work-items reference the same constant address
  - Globally scoped constant arrays: arrays that are initialized and globally scoped can use the cache if smaller than 16 KB
- Cases where each work-item accesses different indices are not cached, and deliver the same performance as a global memory read (contrasted in the sketch below)
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
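A hedged illustration of the distinction (hypothetical kernels; per the guide cited above, only the first pattern benefits from the constant cache):

    __constant float lut[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // initialized, globally scoped, < 16 KB

    // Same-index pattern: every work-item reads lut[k], served by the cache.
    __kernel void same_index(__global float* out, int k)
    {
        int i = get_global_id(0);
        out[i] = lut[k];
    }

    // Per-item-index pattern: each work-item reads a different element;
    // this performs like an ordinary global memory read.
    __kernel void per_item_index(__global const int* idx, __global float* out)
    {
        int i = get_global_id(0);
        out[i] = lut[idx[i]];
    }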
Nvidia GPUs – Fermi Architecture
- GTX 480: compute capability 2.0
- 15 cores, or Streaming Multiprocessors (SMs); each SM features 32 CUDA processors, for 480 CUDA processors in total
- Global memory with ECC
[Figure: Fermi SM block diagram showing the instruction cache; two warp schedulers, each with a dispatch unit; a 32768 x 32-bit register file; 32 cores (each a CUDA core with dispatch port, operand collector, FP unit, integer unit, and result queue); 16 load/store (LD/ST) units; 4 special function units (SFUs); the interconnect; 64 KB of L1 cache / shared memory; and the L2 cache]
Source: NVIDIA's Next Generation CUDA Architecture Whitepaper
Nvidia GPUs – Fermi Architecture (continued)
- An SM executes threads in groups of 32 called warps; there are two warp issue units per SM
- Concurrent kernel execution: multiple kernels execute simultaneously to improve efficiency
- A CUDA core consists of a single ALU and floating-point unit (FPU)
[Figure: the same Fermi SM block diagram as on the previous slide]
Source: NVIDIA's Next Generation CUDA Compute Architecture Whitepaper
SIMT and SIMD
- SIMT denotes scalar instructions and multiple threads sharing an instruction stream
  - Hardware determines instruction-stream sharing across ALUs
  - E.g., NVIDIA GeForce ("SIMT" warps) and AMD Radeon architectures ("wavefronts"), where all the threads in a warp/wavefront proceed in lockstep
  - Divergence between threads is handled using predication
- SIMT instructions specify the execution and branching behavior of a single thread
- SIMD instructions expose the vector width; e.g., explicit vector instructions like x86 SSE
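A small sketch of divergence under SIMT (hypothetical kernel): when threads of one warp/wavefront take different sides of the branch, the hardware predicates and issues both sides, so a data-dependent branch like this can cost roughly the sum of both paths:

    __kernel void divergent(__global const int* v, __global int* out)
    {
        int i = get_global_id(0);
        // Lanes of the same warp/wavefront may disagree here; each side is
        // executed with the non-participating lanes masked off.
        if (v[i] > 0)
            out[i] = v[i] * 2;
        else
            out[i] = -v[i];
    }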
SIMT Execution Model
- SIMD execution can be combined with pipelining
- The ALUs all execute the same instruction
- Pipelining is used to break an instruction into phases
- When the first instruction completes (4 cycles here), the next instruction is ready to execute
[Figure: a wavefront issuing Add/Mul instruction phases across the SIMD width, cycle by cycle]
Nvidia Memory Hierarchy
- The L1 cache per SM is configurable to support shared memory and caching of global memory: 48 KB shared / 16 KB L1, or 16 KB shared / 48 KB L1
- Data is shared between work-items of a group using shared memory
- Each SM has a bank of 32 K registers (the 32768 x 32-bit register file shown on the Fermi slide)
- The L2 cache (768 KB) services all operations (load, store, and texture); unified path to global memory for loads and stores
[Figure: a thread block using registers, the L1 cache and shared memory inside the SM, then the L2 cache and global memory]
Nvidia Memory Model in OpenCL
- Like AMD, a subset of the hardware memory is exposed in OpenCL
- The configurable shared memory is usable as local memory: local memory is used to share data between work-items of a work-group at lower latency than global memory (see the reduction sketch below)
- Private memory utilizes registers per work-item
[Figure: the same OpenCL device memory diagram as on the AMD memory model slide]
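As a concrete use of this lower-latency sharing, a work-group sum reduction (a sketch; it assumes the work-group size is a power of two and that the host sizes scratch to one float per work-item):

    __kernel void group_sum(__global const float* in,
                            __global float* partial,  // one result per work-group
                            __local float* scratch)
    {
        int lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction in local memory: log2(group size) rounds.
        for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }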
Cell Broadband Engine
- Developed by Sony, Toshiba, and IBM
- Transitioned from embedded platforms into HPC via the PlayStation 3
- OpenCL drivers available for Cell BladeCenter servers
- Consists of a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs)
- Uses the IBM XL C for OpenCL compiler
[Figure: Cell block diagram showing the PowerPC PPE with L1 and L2 caches, four SPEs each containing an SPU and a 256 KB local store (LS), the memory and interrupt controller, a ~200 GB/s element interconnect, and 25 GB/s links]
Source: http://www.alphaworks.ibm.com/tech/opencl
Cell BE and OpenCL
- The Cell Power/VMX CPU is used as a CL_DEVICE_TYPE_CPU device
- The Cell SPUs form a CL_DEVICE_TYPE_ACCELERATOR device
  - The number of compute units on the SPU accelerator device is <= 16
  - Local memory size is <= 256 KB: the 256 KB of local storage is divided among the OpenCL kernel, an 8 KB global data cache, and local, constant, and private variables
- The OpenCL accelerator devices and the OpenCL CPU device share a common memory bus
- Provides extensions like "Device Fission" and "Migrate Objects" to specify where an object resides (discussed in Lecture 10)
- No support for OpenCL images, sampler objects, atomics, or byte-addressable memory
Source: http://www.alphaworks.ibm.com/tech/opencl
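Selecting the SPU device from the host is ordinary OpenCL (a minimal sketch; error handling is elided and it assumes the IBM platform is the first one returned):

    #include <CL/cl.h>

    cl_device_id get_spu_device(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        // On this implementation the SPUs appear as an ACCELERATOR device.
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
        return device;
    }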
An Optimal GPGPU Kernel
From the discussion on hardware, we see that an ideal kernel for a GPU:
- Has thousands of independent pieces of work
  - Uses all available compute units
  - Allows interleaving for latency hiding
- Is amenable to instruction stream sharing
  - Maps to SIMD execution by preventing divergence between work-items
- Has high arithmetic intensity
  - The ratio of math operations to memory accesses is high
  - Not limited by memory bandwidth
Note that these criteria apply to all GPUs (a kernel shaped by them is sketched below)
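A hedged sketch of a kernel shaped by these criteria (hypothetical: one independent work-item per element, a uniform trip count so work-items never diverge, and arithmetic intensity that grows with the polynomial degree):

    // Evaluate a degree-N polynomial at each point: one load and one store
    // support roughly 2*N floating-point operations.
    __kernel void poly_eval(__global const float* x,
                            __global float* y,
                            __constant float* c,  // coefficients: same-index access
                            int degree)
    {
        int i = get_global_id(0);             // thousands of independent work-items
        float xi  = x[i];
        float acc = c[degree];
        for (int k = degree - 1; k >= 0; k--) // uniform trip count: no divergence
            acc = acc * xi + c[k];            // Horner's rule: one multiply-add per term
        y[i] = acc;
    }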
OpenCL Compilation System
- LLVM: Low Level Virtual Machine
  - Kernels are compiled to LLVM IR
  - Open-source compiler platform, OS independent
  - Multiple back ends: http://llvm.org
[Figure: compilation flow from an OpenCL compute program through the LLVM front end to LLVM IR, then to back ends for AMD CAL IL, x86, and Nvidia PTX]
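From the host side, runtime compilation is a short sequence (a minimal sketch; the kernel source here is a stand-in and error handling is elided):

    #include <CL/cl.h>

    cl_kernel build_scale_kernel(cl_context context, cl_device_id device)
    {
        const char* src =
            "__kernel void scale(__global float* x, float k) {"
            "    x[get_global_id(0)] *= k;"
            "}";
        cl_int err;
        // The source is compiled at runtime by the vendor implementation,
        // lowered through an IR (e.g. LLVM IR) to the device ISA.
        cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
        err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        return clCreateKernel(program, "scale", &err);
    }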
Installable Client Driver
- The ICD allows multiple implementations to co-exist
- Code links only against libOpenCL.so; the application selects an implementation at runtime
- The current GPU driver model does not easily allow multiple devices across manufacturers
- clGetPlatformIDs() and clGetPlatformInfo() examine the list of available implementations and select a suitable one (see the sketch below)
[Figure: the application linked against libOpenCL.so, which dispatches to vendor libraries such as atiocl.so and Nvidia's OpenCL library]
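Platform selection through the ICD, as a minimal sketch (hypothetical helper that picks the first platform whose name contains a requested substring; error handling is elided):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <string.h>

    cl_platform_id pick_platform(const char* wanted)
    {
        cl_platform_id platforms[8];
        cl_uint n = 0;
        clGetPlatformIDs(8, platforms, &n);  // enumerate all ICD-registered vendors
        for (cl_uint i = 0; i < n; i++) {
            char name[256];
            clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                              sizeof(name), name, NULL);
            printf("platform %u: %s\n", i, name);
            if (strstr(name, wanted))
                return platforms[i];
        }
        return n ? platforms[0] : NULL;      // fall back to the first platform
    }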
Summary
- We have examined different many-core platforms and how they map onto the OpenCL spec
- An important take-away is that even though vendors have implemented the spec differently, the underlying ideas a programmer uses to obtain performance remain consistent
- We have looked at the runtime compilation model of OpenCL to understand how programs and kernels for compute devices are created at runtime
- We have looked at the ICD to understand how an OpenCL application can choose an implementation at runtime
- Next lecture: moving data to a compute device, and some simple but complete OpenCL examples

Editor's Notes

  • #5: This point is important because this would be a common question while teaching an open, platform-agnostic programming model
  • #6: Basic computer architecture points for multicore CPUs
  • #7: High-level, 10,000-foot view of what a GPU looks like, irrespective of whether it is AMD's or Nvidia's
  • #8: Very AMD-specific discussion of low-level GPU architecture
  • #9: Very AMD-specific discussion of low-level GPU architecture. Discusses a single SIMD engine and the stream cores
  • #10: We converge back to the OpenCL terminology to understand how the AMD GPU maps onto the OpenCL processing elements
  • #11: A brief overview of the AMD GPU memory architecture (as per the Evergreen series)
  • #12: The mapping of the AMD GPU memory components to the OpenCL terminology. Architecturally this is similar for AMD and Nvidia, except that each has its own vendor-specific names. Similar types of memory are mapped to local memory for both AMD and Nvidia
  • #13: Summary of the usage of constant memory. Important because there is a restricted set of cases where there is a hardware-provided performance boost when using constant memory. This will have greater context with a complete example, which comes later. This slide is included in case someone is reading this while optimizing an application and needs device-specific details
  • #14: Nvidia "Fermi" architecture, high-level overview
  • #15: Architectural highlights of an SM in a Fermi GPU. Mention the scalar nature of a CUDA core, unlike AMD's VLIW architecture
  • #16: The SIMT execution model of GPU threads. SIMD specifies a vector width, as in SSE. However, the SIMT execution model doesn't necessarily need to know the number of threads in a warp for an OpenCL program. The concept of a warp/wavefront is not within OpenCL
  • #17: The SIMT execution model, which shows how different threads execute the same instruction
  • #18: Nvidia-specific GPU memory architecture. The main highlight is the configurable L1:shared size ratio. L2 is not exposed in the OpenCL specification
  • #19: Similar to AMD in the sense that the low-latency memory, which is the shared memory, becomes OpenCL local memory
  • #20: Brief introduction to the Cell
  • #21: Brief overview of how the Cell's memory architecture maps to OpenCL. For usage of the Cell in specific applications a high-level view is given, and Lecture 10 discusses its special extensions. Optimizations in Lectures 6-8 do not apply to the Cell because of its very different architecture
  • #22: Discusses an optimal kernel to show how, irrespective of the different underlying architectures, an optimal program for both AMD and Nvidia would have similar characteristics
  • #23: Explains how platform-agnostic OpenCL code is mapped to a device-specific instruction set architecture
  • #24: The ICD is included in order to explain how we can interface different OpenCL implementations with a similar compilation toolchain