GPUs vs CPUs For
Parallel Processing
Mohammed Billoo
MAB Labs, LLC
Outline
● Overview
● Processor Trends
● CPU Design
● GPU Design
● Comparison
● Summary/Follow-Up
Overview
● GPUs can far outperform CPUs in raw processing throughput for a specific class of problems
○ “Specific”: The computation can be reduced to a single algorithm that is applied across a large dataset
● Why?
○ The original purpose of CPUs (and their resulting design) limits them for this class of problems
○ The original purpose of GPUs (and their resulting design) makes them well suited to it
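The problem shape described above can be sketched in plain Python: one small function applied independently to every element of a dataset. The function name `scale_and_offset` and the sample values are illustrative assumptions and do not come from the slides.

```python
# Illustrative sketch: one algorithm applied independently across a dataset.
# `scale_and_offset` and the sample values are hypothetical.

def scale_and_offset(x, scale=2.0, offset=1.0):
    """The single algorithm: depends only on its own input element."""
    return scale * x + offset

def apply_to_dataset(data):
    # No element depends on any other, so each call could in principle run
    # on its own core. Here they simply run one after another.
    return [scale_and_offset(x) for x in data]

dataset = [0.0, 1.0, 2.0, 3.0]
result = apply_to_dataset(dataset)
print(result)
```

Because no call reads another call's result, the work can be divided among any number of cores without changing the answer.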
Processor Trends
● For about 20 years, the driving factor in processor design was performance
○ Processor design was targeted at providing more features and functionality to users
○ Processor design was driven by increasing clock rates
■ “CPU arms race”
○ Fundamental CPU architecture was developed to maximize responsiveness (i.e. minimize latency) of a single application run by a single user
● Since about 2003, the shrinking size of computing devices has shifted focus from raw processing to energy consumption and heat dissipation
○ Battery life!
○ Vendors shifted focus from pure clock rate to the number of “cores” in a processor
■ Core = processing element
CPU Design
● Traditionally, most CPU software was written to run sequentially
○ Before the advent of multiple cores that can operate in true parallel fashion, either:
■ SW played tricks to make it seem that multiple applications were executing in parallel (relying on increasing CPU clock rates)
■ HW enhancements made sequential processing “look” parallel (e.g. pipelining)
● With an increasing number of cores that can truly run in parallel on a single CPU silicon die, SW developers have had to rethink app development
○ Emphasis has been placed on parallel programs
○ But parallel development is not new!
○ Programs that truly run in parallel have been developed for decades
■ High performance computing applications
■ Run on expensive, dedicated HW
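The software "tricks" mentioned above are commonly a form of time slicing: a single core runs a small piece of each task in turn, so the tasks appear to progress together. The sketch below shows the idea with Python generators. The task names, step counts, and the `round_robin` helper are hypothetical, and a real operating system scheduler is far more involved.

```python
# Illustrative sketch of time slicing on a single core.
# Task names and step counts are hypothetical.
from collections import deque

def task(name, steps):
    """A sequential task that hands back control after each unit of work."""
    for i in range(steps):
        yield f"{name}:{i}"

def round_robin(tasks):
    """Run one step of each task in turn until every task has finished."""
    queue = deque(tasks)
    trace = []
    while queue:
        current = queue.popleft()
        try:
            trace.append(next(current))
            queue.append(current)  # not finished: go to the back of the line
        except StopIteration:
            pass                   # finished: drop the task
    return trace

trace = round_robin([task("A", 3), task("B", 2)])
print(trace)
```

Only one step ever executes at a time, so nothing here is truly parallel. The interleaving only makes it look that way.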
CPU Design
● The fundamental architecture of CPUs has limited the number of cores that can fit on a single silicon die
○ The original premise of CPU architecture was to optimize responsiveness of a single application executed by a single user
○ HW support for true parallel behavior has had to be “shoehorned” in, limiting the number of cores that is attainable
■ Typical maximum number of CPU cores ⇒ ~10ish to 20-ish
● The nature of original CPU architectures has required additions to support efficient floating point operations
○ Again, because there was no original need to perform floating point operations efficiently
○ Required additions to the Instruction Set Architecture (ISA), and in turn modifications to the underlying HW
○ An alternative was to add a dedicated unit in the processor for floating point operations (i.e. an FPU)
CPU Design
● Why can’t more cores be easily included in CPU designs?
○ Over the past ~17 years (since 2003), the number of HW cores in CPU designs has increased from 1 to roughly 10 or 20
○ Limited by the original CPU architecture, since more silicon “real estate” is devoted to:
■ The control logic that transfers instructions and data to the cores
■ The processor cache, which avoids repeatedly fetching frequently used instructions and data
■ The goal was (and still is) to keep instruction and data access latencies to a minimum
○ As a result, there is less real estate available for the actual processing cores
● Transfer of data has been another issue
○ Again, because the original problem that CPUs were meant to solve didn’t involve a significant amount of data
○ Data transfer speed is another bottleneck for faster parallel processing
(Simplified) CPU Design
[Diagram: a Control block, a Cache block, and four Cores]
* Fewer resources devoted to “actual” processing (i.e. cores)
GPU Design
● GPUs were originally (and still are) designed for graphics-intensive applications
● Graphics applications are inherently parallel
○ Each pixel is (usually) independent of every other pixel
○ The same operations are (usually) performed on each pixel
○ Each frame usually consists of 100k to 1M or more pixels
● Because of the nature of the problem that GPUs were originally meant to solve, they have become ideal candidates for highly parallel, non-graphics applications
○ Machine Learning
○ Artificial Intelligence
○ Data Science
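The per-pixel independence described above can be sketched as follows: the same small function is applied to every pixel, and no pixel's result depends on any other pixel. The tiny 2x2 frame and the simple averaging formula in `to_gray` are hypothetical choices made for illustration.

```python
# Illustrative sketch: the same operation applied to every pixel of a frame.
# The 2x2 frame and the averaging formula are hypothetical choices.

def to_gray(pixel):
    """Per-pixel operation: uses only this pixel's own (r, g, b) values."""
    r, g, b = pixel
    return (r + g + b) // 3

def process_frame(frame):
    # Each pixel is independent of every other pixel, so the iterations of
    # these loops could run in any order, or all at the same time.
    return [[to_gray(p) for p in row] for row in frame]

frame = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (30, 60, 90)],
]
gray = process_frame(frame)
print(gray)
```

On a real frame the loops would cover hundreds of thousands of pixels or more, which is exactly the kind of workload that benefits from many cores.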
GPU Design
● The fundamental problem that GPUs were meant to solve has allowed many more cores to be added over the years
○ What matters is overall execution throughput rather than the responsiveness of a single application
■ A gamer doesn’t care how long it takes to render a particular pixel, but rather an entire frame
■ A video editor doesn’t care how long it takes to process a particular pixel (or even frame), but rather an entire video
○ “Manycore” computing device vs CPU-based “multi-core” computing device
■ Manycore: thousands to tens of thousands of cores
■ Multi-core: single- or double-digit number of cores
● The nature of graphics applications resulted in native support for fast floating point operations in GPUs
○ Ray tracing and 2D/3D graphics are inherently done using floating point numbers
○ HW was designed to support fast floating point operations
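The difference between per-pixel latency and whole-frame throughput can be made concrete with a small calculation. All numbers below are hypothetical and chosen only to show the arithmetic. They are not measurements of any real CPU or GPU, and the model ignores memory and scheduling costs.

```python
# Illustrative arithmetic: latency of one pixel vs. time for a whole frame.
# Core counts and per-pixel times are hypothetical, not measurements.
import math

def frame_time(num_pixels, num_cores, time_per_pixel):
    """Time to finish a frame when pixels are processed in waves of num_cores."""
    waves = math.ceil(num_pixels / num_cores)
    return waves * time_per_pixel

PIXELS = 1_000_000

# A few fast cores: each pixel takes 1 time unit.
few_fast = frame_time(PIXELS, num_cores=8, time_per_pixel=1)

# Many slow cores: each pixel takes 20 time units, 20x worse latency.
many_slow = frame_time(PIXELS, num_cores=10_000, time_per_pixel=20)

print(few_fast, many_slow, few_fast / many_slow)
```

Under these assumptions, the device with far worse per-pixel latency still finishes the frame much sooner, which is the trade-off the bullets above describe.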
GPU Design
● Due to the original problem that GPUs were meant to solve, adding more cores is much easier than on a CPU
○ The number of cores in a GPU has grown by orders of magnitude over the years
○ GPU architecture allows fast execution of the same instructions on a large dataset in parallel
○ More silicon “real estate” is devoted to the processing cores themselves than to the control logic that transfers instructions and data
● GPU architecture was developed to allow transfer of large datasets
○ Graphics processing involves transferring a large amount of data at once (e.g. an individual frame of pixels)
○ Memory was optimized to NOT be a bottleneck
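One way to see why moving a whole frame at once matters is a simple cost model in which every transfer pays a fixed overhead plus a size-dependent cost. The model and every number in it (`OVERHEAD_S`, `BANDWIDTH`, `FRAME_BYTES`, `CHUNKS`) are assumptions for illustration and do not describe any particular bus or GPU.

```python
# Illustrative cost model: one large transfer vs. many small transfers.
# All constants are hypothetical, not measurements of real hardware.

def transfer_time(num_bytes, overhead_s, bandwidth_bytes_per_s):
    """Fixed per-transfer overhead plus time proportional to size."""
    return overhead_s + num_bytes / bandwidth_bytes_per_s

FRAME_BYTES = 8_000_000   # hypothetical frame size
OVERHEAD_S = 1e-5         # hypothetical fixed cost per transfer
BANDWIDTH = 1e10          # hypothetical bytes per second
CHUNKS = 1000

# Move the whole frame in a single transfer.
one_large = transfer_time(FRAME_BYTES, OVERHEAD_S, BANDWIDTH)

# Move the same frame as 1000 separate small transfers.
many_small = sum(
    transfer_time(FRAME_BYTES // CHUNKS, OVERHEAD_S, BANDWIDTH)
    for _ in range(CHUNKS)
)

print(one_large, many_small)
```

In this model the fixed overhead is paid once for the large transfer and a thousand times for the small ones, so the batched transfer comes out well ahead.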
(Simplified) GPU Design
[Diagram: four rows, each with its own Control block and Cache block followed by many Cores]
* More processor resources devoted to cores
Comparison
Category | CPU | GPU
Number of cores | Few (10s, possibly 100s) | Many (thousands to tens of thousands)
Capability of each core | Can perform more complex operations | Can perform simpler operations
Floating Point Support | Added later (either via modifications to the ISA or with a dedicated FPU) | Native support in computation core
Memory Transfer | Slower and much more frequent (can use cache to alleviate this) | Faster and much less frequent (usually transfer a large dataset between system memory and GPU memory “at once”)
SW Development Effort | Simpler | Complex (requires the dataset to be structured a certain way and the SW to be written a particular way to leverage the HW)
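The slides do not say which data layout they have in mind. One common example of structuring a dataset for data-parallel hardware is converting an "array of structures" into a "structure of arrays", so that each field sits in one flat list. The sketch below assumes that interpretation. The `particles_*` names and the `to_soa` helper are hypothetical.

```python
# Illustrative sketch: array of structures (AoS) vs. structure of arrays (SoA).
# The particle records and helper name are hypothetical.

# AoS: one record per item, natural for sequential CPU code.
particles_aos = [
    {"x": 1.0, "y": 2.0},
    {"x": 3.0, "y": 4.0},
    {"x": 5.0, "y": 6.0},
]

def to_soa(records):
    """Regroup records so that each field becomes one flat list."""
    return {
        "x": [r["x"] for r in records],
        "y": [r["y"] for r in records],
    }

particles_soa = to_soa(particles_aos)

# The same operation on every x value now walks a single flat list, which is
# the shape data-parallel hardware typically prefers.
shifted_x = [x + 10.0 for x in particles_soa["x"]]

print(particles_soa)
print(shifted_x)
```

The restructuring step is part of the extra development effort noted in the last row of the table.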
Summary/Follow-Up
● CPUs are the optimal choice for one set of problems and GPUs are
the optimal choice for another set of problems
● Can’t use a single processor type
● Need to use both in a complete system
○ Even in a GPU-based system, file transfer, network operations, etc. are still needed, and these are ideally suited to a CPU
● Follow-Up
○ How to implement a simple algorithm on an Nvidia GPU using CUDA C
■ Discuss the challenges that are usually associated with such a task
● Data structure
● Core interactions
● Data transfer from system memory to GPU memory
■ CUDA C ⇒ Extension of the C language to support optimal operations on
an Nvidia GPU
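As a preview of the three challenges listed above, the usual workflow can be emulated in plain Python: copy inputs to the device, launch a kernel in which each thread handles one element, and copy the result back. Nothing below calls CUDA. The helpers `copy_to_device`, `copy_to_host`, and `launch` are hypothetical stand-ins, and only the index calculation mirrors the common CUDA idiom of block index times block size plus thread index.

```python
# Plain-Python emulation of a host/device workflow. This is NOT CUDA code;
# every helper here is a hypothetical stand-in used for illustration.

def copy_to_device(host_data):
    """Stand-in for a host-to-device copy: returns a separate buffer."""
    return list(host_data)

def copy_to_host(device_data):
    """Stand-in for a device-to-host copy."""
    return list(device_data)

def add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Work done by one thread: handle exactly one element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(out):                        # guard the partly filled last block
        out[i] = a[i] + b[i]

def launch(kernel, num_blocks, block_dim, *args):
    """Emulate a kernel launch by visiting every (block, thread) pair."""
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

host_a = [1, 2, 3, 4, 5]
host_b = [10, 20, 30, 40, 50]
n = len(host_a)
block_dim = 4
num_blocks = (n + block_dim - 1) // block_dim  # round up to cover all elements

dev_a = copy_to_device(host_a)
dev_b = copy_to_device(host_b)
dev_out = [0] * n
launch(add_kernel, num_blocks, block_dim, dev_a, dev_b, dev_out)
host_out = copy_to_host(dev_out)
print(host_out)
```

On real hardware the two loops in `launch` would not exist. The threads would run concurrently, which is why each one must work only on its own element.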
GPUs vs CPUs for Parallel Processing

  • 1. GPUs vs CPUs For Parallel Processing Mohammed Billoo MAB Labs, LLC
  • 2. Outline ● Overview ● Processor Trends ● CPU Design ● GPU Design ● Comparison ● Summary/Follow-Up 2
  • 3. Overview ● GPUs blow CPUs out of the water when it comes to raw processing horsepower of a specific problem set ○ “Specific”: Where computation can be whittled down to a singular algorithm that can be applied across a wide dataset ● Why? ○ The original purpose of CPUs (and their resulting design) has led to this limitation ○ The original purpose of GPUs (and their resulting design) has made them ideal for use in this particular application 3
  • 4. Processor Trends ● Previously, for about 20 years, the driving factor in processor design has been performance ○ Processor design had been targeted to provide more features and functionality to users ○ Processor design had been driven by increased clock rate ■ “CPU arms race” ○ Fundamental CPU architecture has been developed to minimize responsiveness of a single application run by a single user ● Since 2003, reduction in size of computational devices has shifted focus from raw processing to energy consumption and heat dissipation ○ Battery life! ○ Resulted in vendors shifting focus from pure clock rate to the number of “cores” in a processor ■ Core = processing element 4
  • 5. CPU Design ● Traditionally, most CPU software was developed to behave in a sequential manner ○ Before the advent of multiple cores that can operate in true parallel fashion, either: ■ SW had to play tricks to make it seem that multiple applications were being executed in parallel (relying on increasing CPU clock rates) ■ HW enhancements to make sequential processing “look” parallel (i.e. pipelining) ● With increasing number of cores that can truly run in parallel on a single CPU silicon die, SW developers have had to rethink app development ○ Emphasis has been placed on parallel programs ○ But parallel development is not new! ○ Programs that truly run in parallel have been developed for decades ■ High performance computing applications ■ Run on expensive, dedicated HW 5
  • 6. CPU Design ● The fundamental architecture of CPUs has limited the number of cores that can exist on a single silicon die ○ Premise of CPU architecture was to (originally) optimize responsiveness of a single application executed by a single user ○ HW design to support true parallel behavior has had to be “shoehorned”, limiting the number of cores that is attainable ■ Maximum number of CPU cores ⇒ ~10ish ● Nature of original CPU architectures has required additions to support efficient floating point operations ○ Again, because there was no original need to perform floating operations efficiently ○ Required additions to the Instruction Set Architecture (ISA), and in turn modifications to the underlying HW ○ Another alternative was to add a dedicated controller in the processor for floating point operations (i.e. FPU) 6
  • 7. CPU Design ● Why can’t more cores be easily included in CPU designs? ○ Over the past ~17 years (since 2003), number of HW cores has increased from 1 → 10ish, 20-ish in CPU designs ○ Limited by the original CPU architecture, since more silicon “real estate” was devoted to: ■ The control logic to transfer instructions and data to the core ■ The processor cache to avoid having to fetch instructions that are frequently used ■ Goal was/has been to keep instruction and data access latencies to a minimum ○ Unfortunately, there is less real estate available for the actual processing cores ● Transfer of data has been another issue ○ Again, because the original problem that CPUs were meant to solve didn’t involve a significant amount of data ○ Data transfer speeds is another issue/bottleneck for faster parallel processing 7
  • 8. (Simplified) CPU Design
    [Diagram: a single Control block and a single Cache block alongside four Core blocks]
    * Fewer resources devoted to “actual” processing (i.e. core)
  • 9. GPU Design
    ● GPUs were originally (and still are) designed for graphics-intensive applications
    ● Graphics applications are inherently parallel in nature
      ○ Each pixel is (usually) independent of every other pixel
      ○ The same operations are (usually) performed on each pixel
      ○ Each frame usually consists of 100k to 1M pixels
    ● Because of the nature of the problem that GPUs were originally meant to solve, they have become ideal candidates for highly parallel, non-graphics applications
      ○ Machine Learning
      ○ Artificial Intelligence
      ○ Data Science
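The pixel-independence point can be illustrated with a small sketch in plain Python. The frame values and the `brighten` function are made up for illustration; the only claim is that a per-pixel operation gives the same result no matter which order (or which core) handles each pixel.

```python
import random

def brighten(pixel, amount=40):
    """Per-pixel 'kernel': the result depends only on this one input pixel."""
    return min(255, pixel + amount)

# Hypothetical 4x4 grayscale frame, flattened into one list of intensities.
frame = [0, 10, 20, 30, 100, 110, 120, 130,
         200, 210, 220, 230, 240, 250, 254, 255]

# Sequential pass, the way a single CPU core would walk the frame.
in_order = [brighten(p) for p in frame]

# Same kernel, pixels visited in a scrambled order. No pixel reads another
# pixel's result, so the visiting order does not matter.
indices = list(range(len(frame)))
random.Random(0).shuffle(indices)
scrambled = [None] * len(frame)
for i in indices:
    scrambled[i] = brighten(frame[i])

print(in_order == scrambled)
```

Because the order is irrelevant, each pixel could in principle be given to its own core, which is the property GPUs are built around.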
  • 10. GPU Design
    ● The fundamental problem that GPUs were meant to solve has allowed for many more cores to be easily added over the years
      ○ Don’t really care about responsiveness of a single application but rather the overall execution throughput
        ■ A gamer doesn’t care about how long it takes for a particular pixel to be rendered, but rather an entire frame
        ■ A video editor doesn’t care about how long it takes for a particular pixel (or even frame) to be processed, but rather an entire video
      ○ “Manycore” computing device vs CPU-based “multi-core” computing device
        ■ Manycore: 10k, 100k, 1M cores
        ■ Multi-core: Single, double-digit cores
    ● Nature of graphics applications resulted in native support for fast floating point operations in GPUs
      ○ Ray-tracing, 2D, 3D graphics inherently must be done using floating point numbers
      ○ HW was designed to support optimal floating point operations
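The latency vs throughput trade-off on this slide can be made concrete with a toy model. All numbers below are assumptions chosen for easy arithmetic (10 fast cores vs 10,000 cores that are each 20x slower per item); they are not measurements of any real device.

```python
import math

def total_time(items, cores, time_per_item):
    """Time to finish `items` independent work items, processed in waves of `cores`."""
    return math.ceil(items / cores) * time_per_item

# Hypothetical devices (time in arbitrary units per work item).
CPU = {"cores": 10, "time_per_item": 1}
GPU = {"cores": 10_000, "time_per_item": 20}

single_cpu = total_time(1, **CPU)          # one pixel on the CPU
single_gpu = total_time(1, **GPU)          # one pixel on the GPU
frame_cpu = total_time(1_000_000, **CPU)   # a 1M-pixel frame on the CPU
frame_gpu = total_time(1_000_000, **GPU)   # a 1M-pixel frame on the GPU

print(single_cpu, single_gpu)
print(frame_cpu, frame_gpu)
```

Under these assumptions the CPU wins on a single item (lower latency) while the GPU finishes the whole frame far sooner (higher throughput), which is the trade the slide describes.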
  • 11. GPU Design
    ● Due to the original problem that GPUs were meant to solve, adding more cores is much easier than on a CPU
      ○ Increase in the number of cores in a GPU is by orders of magnitude year-over-year (e.g. 10x)
      ○ GPU architecture allows for fast execution of instructions on a large dataset in parallel
      ○ More silicon “real estate” devoted to the processing cores themselves vs control logic to transfer instructions and data
    ● GPU architecture was developed to allow for transfer of large datasets
      ○ Graphics processing involves transferring a ton of data at once (e.g. an individual frame of pixels)
      ○ Memory was optimized to NOT be a bottleneck
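One way to see why moving a whole frame “at once” matters is a simple cost model in which every transfer pays a fixed setup overhead plus a per-byte cost. The overhead and bandwidth figures below are invented for illustration and do not describe any real bus or GPU.

```python
def transfer_time(num_bytes, overhead_us=10, bytes_per_us=1000):
    """Hypothetical cost of one transfer: fixed setup overhead plus payload time."""
    return overhead_us + num_bytes / bytes_per_us

FRAME_BYTES = 1_000_000

# One bulk transfer of the whole frame.
bulk = transfer_time(FRAME_BYTES)

# The same frame sent as 1000 separate 1000-byte transfers.
chunks = 1000
piecemeal = chunks * transfer_time(FRAME_BYTES // chunks)

print(bulk, piecemeal)
```

In this model the payload time is identical in both cases; the piecemeal version is slower only because it pays the fixed overhead 1000 times instead of once.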
  • 12. (Simplified) GPU Design
    [Diagram: four rows, each with its own Control block and Cache block followed by a long run of Core blocks]
    * More processor resources devoted to cores
  • 13. Comparison
    Category | CPU | GPU
    Number of cores | Few (10s, 100s (maybe?)) | Many (10k, 100k, 1M)
    Capability of each core | Can perform more complex operations | Can perform simpler operations
    Floating Point Support | Added later (either via modifications to the ISA or with a dedicated FPU) | Native support in computation core
    Memory Transfer | Slower and much more frequent (can use cache to alleviate this) | Faster and much less frequent (usually transfer large dataset between system memory and GPU memory “at once”)
    SW Development Effort | Simpler | Complex (requires dataset to be structured a certain way and have to write SW a particular way to leverage HW)
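The “dataset structured a certain way” entry in the last row often comes down to regrouping records so that each field sits in its own contiguous array (commonly called structure of arrays). The sketch below is plain Python with made-up particle data; `to_soa` is a hypothetical helper, and the slides themselves do not name this particular layout.

```python
# Array of structures: natural for CPU-style code that handles one record at a time.
particles_aos = [
    {"x": 1.0, "y": 2.0, "mass": 0.5},
    {"x": 3.0, "y": 4.0, "mass": 1.5},
    {"x": 5.0, "y": 6.0, "mass": 2.5},
]

def to_soa(records):
    """Regroup a non-empty list of records into one list per field."""
    return {field: [r[field] for r in records] for field in records[0]}

particles_soa = to_soa(particles_aos)

# One uniform operation over an entire column: the same instruction applied
# to every element, which is the shape a data-parallel device works best with.
scaled_x = [2.0 * x for x in particles_soa["x"]]

print(particles_soa)
print(scaled_x)
```

The extra restructuring step is part of the added SW development effort the table attributes to GPUs.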
  • 14. Summary/Follow-Up
    ● CPUs are the optimal choice for one set of problems and GPUs are the optimal choice for another set of problems
    ● Can’t rely on a single processor type
    ● Need to use both in a complete system
      ○ Even a GPU-based system needs file transfer, network operations, etc., which are ideally suited for a CPU
    ● Follow-Up
      ○ How to implement a simple algorithm on an Nvidia GPU using CUDA C
        ■ Discuss the challenges that are usually associated with such a task
          ● Data structure
          ● Core interactions
          ● Data transfer from system memory to GPU memory
        ■ CUDA C ⇒ Extension of the C language to support optimal operations on an Nvidia GPU
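As a preview of the follow-up topic, the sketch below emulates in plain Python how a simple vector addition is usually split across blocks and threads. It is not CUDA C and does not run on a GPU; `launch` and `vector_add_kernel` are hypothetical names, and the loops run sequentially. The index computation mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
import math

def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body that every emulated 'thread' runs on exactly one element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(out):                        # guard threads past the end of the data
        out[i] = a[i] + b[i]

def launch(kernel, n, block_dim, *args):
    """Emulated launch: call the kernel once per (block, thread) pair, sequentially."""
    grid_dim = math.ceil(n / block_dim)     # enough blocks to cover n elements
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(a)

# 2 blocks of 4 threads cover 5 elements; the last 3 threads do nothing.
launch(vector_add_kernel, len(a), 4, a, b, out)
print(out)
```

On a real GPU the per-thread bodies would run in parallel, and the inputs would first have to be copied from system memory to GPU memory, which is one of the challenges listed on the slide.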