This document provides an outline of manycore GPU architectures and programming. It introduces GPU architectures, the GPGPU concept, and CUDA programming. It discusses the GPU execution model, CUDA programming model, and how to work with different memory types in CUDA like global, shared and constant memory. It also covers streams and concurrency, CUDA intrinsics and libraries, performance profiling and debugging. Finally, it mentions directive-based programming models like OpenACC and OpenMP.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
Stream processing is a computer programming paradigm that allows for parallel processing of data streams. It involves applying the same kernel function to each element in a stream. Stream processing is suitable for applications involving large datasets where each data element can be processed independently, such as audio, video, and signal processing. Modern GPUs use a stream processing approach to achieve high performance by running kernels on multiple data elements simultaneously.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
This lecture discusses manycore GPU architectures and programming, focusing on the CUDA programming model. It covers GPU execution models, CUDA programming concepts like threads and blocks, and how to manage GPU memory including different memory types like global and shared memory. It also discusses optimizing memory access patterns for global memory and profiling CUDA programs.
C for Cuda - Small Introduction to GPU computing (IPALab)
In this talk, we present a short introduction to CUDA and GPU computing to help anyone who reads it get started with this technology.
First, we introduce the GPU from the hardware point of view: what is it? How is it built? Why use it for general-purpose computing (GPGPU)? How does it differ from the CPU?
The second part of the presentation deals with the software abstraction and the use of CUDA to implement parallel computing. The software architecture, the kernels, and the different types of memory are covered in this part.
Finally, to illustrate what has been presented, code examples are given. These examples also highlight the issues that may occur when using parallel computing.
The document provides an introduction to GPU programming using CUDA. It outlines GPU and CPU architectures, the CUDA programming model involving threads, blocks and grids, and CUDA C language extensions. It also discusses compilation with NVCC, memory hierarchies, profiling code with Valgrind/Callgrind, and Amdahl's law in the context of parallelization. A simple CUDA program example is provided to demonstrate basic concepts like kernel launches and data transfers between host and device memory.
Cuda Without a PhD - A practical quick start (LloydMoore)
NVIDIA CUDA is a toolkit for the development of GPU-accelerated applications. For specific types of applications and computational patterns, the GPU allows you to deploy thousands of cores for processing in a very cost-effective manner.
While getting the full benefit of GPU acceleration can take a considerable amount of knowledge and effort, significant speedups can be achieved with minimal program changes.
This talk provides an overview of what CUDA is, where it can be effective, and then does a deep dive to convert a simple, sequential data processing loop running as a single thread on the CPU into a massively parallel operation running on the GPU.
Monte Carlo simulation is well-suited for GPU acceleration due to its highly parallel nature. GPUs provide lower cost and higher performance than CPUs for Monte Carlo applications. Numerical libraries for GPUs allow developers to focus on their models rather than reimplementing basic components. NAG has developed GPU libraries including random number generators and is working with financial institutions to apply Monte Carlo simulations to problems in finance.
Graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a specialized microprocessor that offloads and accelerates graphics rendering from the central (micro)processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip does computation, whereas the GPU devotes more transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
An Introduction to CUDA-OpenCL - University.pptx (AnirudhGarg35)
This document provides an introduction to CUDA and OpenCL for graphics processors. It discusses how GPUs are optimized for throughput rather than latency via parallel processing. The CUDA programming model exposes thread-level parallelism through blocks of cooperative threads and SIMD parallelism. OpenCL is inspired by CUDA but is hardware-vendor neutral. Both support features like shared memory, synchronization, and memory copies between host and device. Efficient CUDA coding requires exposing abundant fine-grained parallelism and minimizing execution and memory divergence.
This document provides an overview of parallel and distributed computing using GPUs. It discusses GPU architecture and how GPUs are designed for massively parallel processing using hundreds of smaller cores compared to CPUs which use 4-8 larger cores. The document also covers GPU memory hierarchy, programming GPUs using OpenCL, and key concepts like work items, work groups, and occupancy which is keeping GPU compute units busy with work to process.
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA... (Stefano Di Carlo)
These slides have been presented by Dr. Alessandro Vallero at the IEEE VLSI Test Symposium, San Francisco, CA, USA (April 22-25, 2018).
General-purpose computing on graphics processing units offers a remarkable speedup for data-parallel workloads by leveraging the computational power of GPUs. However, unlike graphics computing, it requires highly reliable operation in most application domains.
This presentation talks about a “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”. The work is the outcome of a collaboration between the TestGroup of Politecnico di Torino (http://www.testgroup.polito.it) and the Computer Architecture Lab of the University of Athens (dscal.di.uoa.gr) started under the FP7 Clereco Project (http://www.clereco.eu). It presents an extended study based on a consolidated workflow for evaluating reliability in correlation with performance for four GPU architectures and corresponding chips: AMD Southern Islands and NVIDIA G80/GT200/Fermi. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE analysis based on microarchitecture-level simulators. Apart from the reliability-only and performance-only measurements, we propose combined metrics for performance and reliability (to quantify instruction throughput or task execution throughput between failures) that assist comparisons for the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip.
Watch the presentation at: https://youtu.be/GV5xRDgfCw4
Paper Information:
Alessandro Vallero§ , Sotiris Tselonis, Dimitris Gizopoulos* and Stefano Di Carlo§, “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”, IEEE VLSI Test Symposium 2018 (VTS 2018), San Francisco, CA (USA), April 22-25, 2018.
∗Politecnico di Torino, Italy. Email: stefano.dicarlo,[email protected] †University of Athens, Greece Email: [email protected]
The document provides an overview of GPU computing and CUDA programming. It discusses how GPUs enable massively parallel and affordable computing through their manycore architecture. The CUDA programming model allows developers to accelerate applications by launching parallel kernels on the GPU from their existing C/C++ code. Kernels contain many concurrent threads that execute the same code on different data. CUDA features a memory hierarchy and runtime for managing GPU memory and launching kernels. Overall, the document introduces GPU and CUDA concepts for general-purpose parallel programming on NVIDIA GPUs.
The document discusses challenges in GPU compilers. It begins with introductions and abbreviations. It then outlines the topics to be covered: a brief history of GPUs, what makes GPUs special, how to program GPUs, writing a GPU compiler including front-end, middle-end, and back-end aspects, and a few words about graphics. Key points are that GPUs are massively data-parallel, execute instructions in lockstep, and require supporting new language features like OpenCL as well as optimizing for and mapping to the GPU hardware architecture.
Monte Carlo simulation is one of the most important numerical methods in financial derivative pricing and risk management. Due to the increasing sophistication of exotic derivative models, Monte Carlo becomes the method of choice for numerical implementations because of its flexibility in high-dimensional problems. However, the method of discretization of the underlying stochastic differential equation (SDE) has a significant effect on convergence. In addition the choice of computing platform and the exploitation of parallelism offers further efficiency gains. We consider here the effect of higher order discretization methods together with the possibilities opened up by the advent of programmable graphics processing units (GPUs) on the overall performance of Monte Carlo and quasi-Monte Carlo methods.
A beginner’s guide to programming GPUs with CUDA (Piyush Mittal)
This document provides an overview of GPU programming with CUDA. It defines what a GPU is, that it has many compute cores for graphics processing. It explains that CUDA extends C to access GPU capabilities, allowing for parallel execution across GPU threads. It provides examples of CUDA code structure and keywords to specify where code runs and launch kernels. Performance considerations include data storage, shared memory, and efficient thread scheduling.
This document provides an overview of CUDA (Compute Unified Device Architecture) and GPU programming. It begins with definitions of CUDA and GPU hardware architecture. The history of GPU development from basic graphics cards to modern programmable GPUs is discussed. The document then covers the CUDA programming model including the device model with multiprocessors and threads, and the execution model with grids, blocks and threads. It includes a code example to calculate squares on the GPU. Performance results are shown for different GPUs on a radix sort algorithm. The document concludes that GPU computing is powerful and will continue growing in importance for applications.
This document provides a high-level overview of GPU architecture, AMD and Nvidia GPU hardware, the OpenCL compilation system, and the installable client driver (ICD). It contrasts conventional CPU and GPU architectures, describes the SIMD and SIMT execution models, and examines key aspects of AMD's VLIW and Nvidia's scalar architectures like memory hierarchies and how they map to the OpenCL memory model. It stresses that understanding hardware can help optimize OpenCL code and provides guidelines for writing optimal GPU kernels.
GPUs have evolved from graphics cards to platforms for general purpose high performance computing. CUDA is a programming model that allows GPUs to execute programs written in C for general computing tasks using a single-instruction multiple-thread model. A basic CUDA program involves allocating memory on the GPU, copying data to the GPU, launching a kernel function that executes in parallel across threads on the GPU, copying results back to the CPU, and freeing GPU memory.
This document provides an introduction to accelerators such as GPUs and Intel Xeon Phi. It discusses the architecture and programming of GPUs using CUDA. GPUs are massively parallel many-core processors designed for graphics processing but now used for general purpose computing. They provide much higher floating point performance than CPUs. The document outlines GPU memory architecture and programming using CUDA. It also provides an overview of Intel Xeon Phi which contains over 50 simple CPU cores for highly parallel workloads.
This document discusses exploring GPGPU workloads. It provides an introduction to GPGPU and GPU architecture. It analyzes workloads using statistical methods like PCA and hierarchical clustering. The results show that branch divergence, instruction count, and memory usage are key factors affecting efficiency. Workloads can be classified based on their characteristics. Future trends include GPUs being used more for computing and evolving architectures.
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture (mohamedragabslideshare)
This document summarizes research on revisiting co-processing techniques for hash joins on coupled CPU-GPU architectures. It discusses three co-processing mechanisms: off-loading, data dividing, and pipelined execution. Off-loading involves assigning entire operators like joins to either the CPU or GPU. Data dividing partitions data between the processors. Pipelined execution aims to schedule workloads adaptively between the CPU and GPU to maximize efficiency on the coupled architecture. The researchers evaluate these approaches for hash join algorithms, which first partition, build hash tables, and probe tables on the input relations.
6. NVIDIA GeForce GTX 285 “core”
[Diagram: one GTX 285 “core”, showing the instruction stream decode unit, SIMD functional units whose control is shared across 8 units, multiply-add and multiply units, and execution context storage: 64 KB of storage for thread contexts (registers).]
Slide credit: Kayvon Fatahalian 6
7. NVIDIA GeForce GTX 285 “core”
[Diagram: the same core, highlighting its 64 KB of storage for thread contexts (registers).]
Groups of 32 threads share an instruction stream (each group is a warp)
Up to 32 warps are simultaneously interleaved
Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian 7
10. Recall: NVIDIA V100
NVIDIA-speak:
5120 stream processors
“SIMT execution”
Generic speak:
80 cores
64 SIMD functional units per core
Specialized Functional Units for Machine Learning (tensor “cores” in NVIDIA-speak)
10
11. Recall: NVIDIA V100 Block Diagram
80 cores on the V100
https://devblogs.nvidia.com/inside-volta/
11
12. Recall: NVIDIA V100 Core
15.7 TFLOPS Single Precision
7.8 TFLOPS Double Precision
125 TFLOPS for Deep Learning (Tensor ”cores”)
12
https://devblogs.nvidia.com/inside-volta/
13. Food for Thought
What is the main bottleneck in GPU programs?
“Tensor cores”:
Can you think of operations other than matrix multiplication?
What other applications could benefit from specialized cores?
Compare and contrast GPUs vs. other accelerators (e.g., systolic arrays)
Which one is better for machine learning?
Which one is better for image/vision processing?
What types of parallelism does each one exploit?
What are the tradeoffs?
13
14. Recall: Latency Hiding via Warp-Level FGMT
Warp: a set of threads that execute the same instruction (on different data elements)
Fine-grained multithreading (FGMT)
One instruction per thread in the pipeline at a time (no interlocking)
Interleave warp execution to hide latencies
Register values of all threads stay in the register file
FGMT enables long-latency tolerance
Millions of pixels
[Diagram: a SIMD pipeline (I-Fetch, Decode, register files, ALUs, D-Cache, Writeback); warps whose accesses all hit remain available for scheduling, while warps that miss wait on the memory hierarchy.]
Slide credit: Tor Aamodt
14
15. Recall: Warp Execution
A 32-thread warp executing ADD A[tid], B[tid] → C[tid]
[Diagram: with one pipelined functional unit, the 32 element-wise additions issue one per cycle over time; with four pipelined functional units, four additions issue per cycle, spreading the warp across space and time.]
Slide credit: Krste Asanovic
15
16. Recall: SIMD Execution Unit Structure
[Diagram: four lanes, each with a functional unit and registers for its threads; lane 0 holds registers for thread IDs 0, 4, 8, …, lane 1 for thread IDs 1, 5, 9, …, lane 2 for thread IDs 2, 6, 10, …, lane 3 for thread IDs 3, 7, 11, …; all lanes connect to the memory subsystem.]
Slide credit: Krste Asanovic
16
17. Recall: Warp Instruction Level Parallelism
Can overlap execution of multiple instructions
Example machine has 32 threads per warp and 8 lanes
Completes 24 operations/cycle while issuing 1 warp/cycle
[Diagram: warps W0–W5 issued over time to the Load Unit, Multiply Unit, and Add Unit, overlapping their execution.]
Slide credit: Krste Asanovic
17
18. Clarification of some GPU Terms
Vector length (NVIDIA: warp size; AMD: wavefront size): number of threads that run in parallel (lock-step) on a SIMD functional unit.
Pipelined functional unit / scalar pipeline (NVIDIA: streaming processor / CUDA core; AMD: no specific term): functional unit that executes instructions for one GPU thread.
SIMD functional unit / SIMD pipeline (NVIDIA: group of N streaming processors, e.g., N=8 in GTX 285, N=16 in Fermi; AMD: vector ALU): SIMD functional unit that executes instructions for an entire warp.
GPU core (NVIDIA: streaming multiprocessor; AMD: compute unit): contains one or more warp schedulers and one or several SIMD pipelines.
18
20. Recall: Vector Processor Disadvantages
-- Works (only) if parallelism is regular (data/SIMD parallelism)
++ Vector operations
-- Very inefficient if parallelism is irregular
-- How about searching for a key in a linked list?
20
Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
21. General Purpose Processing on GPU
Easier programming of SIMD processors with SPMD
GPUs have democratized High Performance Computing (HPC)
Great FLOPS/$, massively parallel chip on a commodity PC
Many workloads exhibit inherent parallelism
Matrices
Image processing
Deep neural networks
However, this is not for free
New programming model
Algorithms need to be re-implemented and rethought
Still some bottlenecks
CPU-GPU data transfers (PCIe, NVLINK)
DRAM memory bandwidth (GDDR5, GDDR6, HBM2)
Data layout
21
22. CPU vs. GPU
Different design philosophies
CPU: A few out-of-order cores
GPU: Many in-order FGMT cores
22
Slide credit: Hwu & Kirk
23. GPU Computing
Computation is offloaded to the GPU
Three steps
CPU-GPU data transfer (1)
GPU kernel execution (2)
GPU-CPU data transfer (3)
[Diagram: a matrix is copied from CPU memory to GPU memory (1), processed by the GPU cores (2), and the result is copied back to CPU memory (3).]
23
24. CPU threads and GPU kernels
Sequential or modestly parallel sections on CPU
Massively parallel sections on GPU
Traditional program structure:
Serial code (host)
Parallel kernel (device): KernelA<<< nBlk, nThr >>>(args);
Serial code (host)
Parallel kernel (device): KernelB<<< nBlk, nThr >>>(args);
24
Slide credit: Hwu & Kirk
25. Recall: SPMD
Single procedure/program, multiple data
This is a programming model rather than a computer organization
Each processing element executes the same procedure, except on different data elements
Procedures can synchronize at certain points in the program, e.g., barriers
Essentially, multiple instruction streams execute the same program
Each program/procedure 1) works on different data, 2) can execute a different control-flow path at run-time
Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors)
Modern GPUs are programmed in a similar way on SIMD hardware
25
26. CUDA/OpenCL Programming Model
SIMT or SPMD
Bulk synchronous programming
Global (coarse-grain) synchronization between kernels
The host (typically CPU) allocates memory, copies data, and launches kernels
The device (typically GPU) executes kernels
Grid (NDRange)
Block (work-group)
Within a block: shared memory and synchronization
Thread (work-item)
26
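As a minimal sketch of this programming model (not part of the original slides), the following CUDA C example launches a grid of blocks of threads to add two vectors; the names vecAdd, h_a/d_a, and the sizes are illustrative only.

#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread (work-item) handles one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index within the grid
  if (i < n)                                       // guard threads beyond the array end
    c[i] = a[i] + b[i];
}

int main() {
  const int N = 1 << 20;
  size_t bytes = N * sizeof(float);
  // Host (CPU) allocates and initializes data.
  float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
  for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }
  // Host allocates device memory and copies data to the device (GPU).
  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
  // Host launches the kernel: a grid (NDRange) of blocks (work-groups) of threads.
  int threadsPerBlock = 256;
  int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
  vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
  // Copy the result back and clean up.
  cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  free(h_a); free(h_b); free(h_c);
  return 0;
}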
27. Transparent Scalability
Hardware is free to schedule thread blocks
[Diagram: the same kernel grid of blocks 0–7 runs on a device that executes two blocks at a time and on a device that executes four blocks at a time.]
Each block can execute in any order relative to other blocks.
27
Slide credit: Hwu & Kirk
37. Brief Review of GPU Architecture (III)
Streaming Multiprocessors (SM) or Compute Units (CU)
SIMD pipelines
Streaming Processors (SP) or CUDA ”cores”
Vector lanes
Number of SMs x SPs across generations
Tesla (2007): 30 x 8
Fermi (2010): 16 x 32
Kepler (2012): 15 x 192
Maxwell (2014): 24 x 128
Pascal (2016): 56 x 64
Volta (2017): 80 x 64
37
39. Performance Considerations
Main bottlenecks
Global memory access
CPU-GPU data transfers
Memory access
Latency hiding
Occupancy
Memory coalescing
Data reuse
Shared memory usage
SIMD (Warp) Utilization: Divergence
Atomic operations: Serialization
Data transfers between CPU and GPU
Overlap of communication and computation
39
41. Latency Hiding
FGMT can hide long-latency operations (e.g., memory accesses)
Occupancy: ratio of active warps
[Diagram: warp execution timelines with 4 active warps vs. 2 active warps; more active warps provide more opportunities to hide memory latency.]
41
42. Occupancy
SM resources (typical values)
Maximum number of warps per SM (64)
Maximum number of blocks per SM (32)
Register usage (256KB)
Shared memory usage (64KB)
Occupancy calculation
Number of threads per block (defined by the programmer)
Registers per thread (known at compile time)
Shared memory per block (defined by the programmer)
42
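As a hedged sketch (not from the slides) of how these resource limits translate into occupancy, the CUDA runtime can report the maximum number of resident blocks for a given kernel and block size; myKernel and the chosen block size are placeholders.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only so the occupancy query has something to inspect.
__global__ void myKernel(float* data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  data[i] *= 2.0f;
}

void reportOccupancy() {
  int device = 0;
  cudaGetDevice(&device);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);

  int threadsPerBlock = 256;     // defined by the programmer
  int maxBlocksPerSM = 0;
  // Registers per thread and static shared memory come from the compiled kernel;
  // the last argument is the dynamic shared memory per block (0 here).
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel, threadsPerBlock, 0);

  int activeWarps = maxBlocksPerSM * threadsPerBlock / prop.warpSize;
  int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  printf("Occupancy: %d of %d warps per SM (%.0f%%)\n",
         activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
}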
43. Memory Coalescing
When accessing global memory, we want to make sure that concurrent threads access nearby memory locations
Peak bandwidth utilization occurs when all threads in a warp access one cache line
[Diagram: matrices Md and Nd of size WIDTH x WIDTH, with the access patterns of Thread 1 and Thread 2; the Md pattern is not coalesced, the Nd pattern is coalesced.]
43
Slide credit: Hwu & Kirk
45. Coalesced Memory Accesses
[Diagram: elements of a matrix M accessed by threads T1–T4 over successive time periods, showing how the access direction in the kernel code determines whether the accesses of a warp coalesce.]
45
Slide credit: Hwu & Kirk
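To make the two access patterns concrete, here is a small sketch (not from the slides): in the first kernel, consecutive threads of a warp read consecutive addresses and the accesses coalesce; in the second, consecutive threads read addresses width elements apart, so a warp touches many cache lines. Kernel names are illustrative, and width is assumed to be a multiple of the block size.

// Coalesced: thread t of a warp reads element t of the same row,
// so a warp touches one (or a few) contiguous cache lines.
__global__ void copyRowMajor(const float* M, float* out, int width) {
  int row = blockIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  out[row * width + col] = M[row * width + col];
}

// Not coalesced: consecutive threads read addresses that are width floats apart
// (a strided, column-wise read of M), so a warp touches many cache lines.
__global__ void copyColMajor(const float* M, float* out, int width) {
  int col = blockIdx.y;
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  out[col * width + row] = M[row * width + col];
}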
46. AoS vs. SoA
Array of Structures vs. Structure of Arrays
There are two main data layouts (AoS and SoA) and a newly proposed one (ASTA). ASTA allows converting one layout into the other more quickly and makes it easier to do the conversion in-place, saving memory. The three data layout alternatives are:
Array of Structures (AoS):
struct foo{
  float a;
  float b;
  float c;
  int d;
} A[8];
Array of Structure of Tiled Array (ASTA):
struct foo{
  float a[4];
  float b[4];
  float c[4];
  int d[4];
} A[2];
Structure of Arrays (SoA):
struct foo{
  float a[8];
  float b[8];
  float c[8];
  int d[8];
} A;
Layout Conversion and Transposition 46
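A brief sketch (not from the slides) of why the layout matters on a GPU; the kernels below assume generalized versions of the structs above with n elements, and the field names follow the slide.

struct FooAoS { float a, b, c; int d; };        // Array of Structures element
struct FooSoA { float *a, *b, *c; int *d; };    // Structure of Arrays (pointers to n-element arrays)

// AoS: thread i loads A[i].a; consecutive threads are sizeof(FooAoS) bytes apart,
// so the loads of a warp do not coalesce well.
__global__ void scaleAoS(FooAoS* A, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) A[i].a *= 2.0f;
}

// SoA: thread i loads A.a[i]; consecutive threads are 4 bytes apart,
// so the loads of a warp coalesce into few transactions.
__global__ void scaleSoA(FooSoA A, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) A.a[i] *= 2.0f;
}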
48. Data Reuse
Same memory locations accessed by neighboring threads
for (int i = 0; i < 3; i++){
  for (int j = 0; j < 3; j++){
    sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)];
  }
}
48
49. Data Reuse: Tiling
To take advantage of data reuse, we divide the input into tiles that can be loaded into shared memory
__shared__ int l_data[(L_SIZE+2)*(L_SIZE+2)];
…
// Load tile into shared memory
__syncthreads();
for (int i = 0; i < 3; i++){
  for (int j = 0; j < 3; j++){
    sum += gauss[i][j] * l_data[(i+l_row-1)*(L_SIZE+2)+j+l_col-1];
  }
}
49
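One possible way to fill in the elided "Load tile into shared memory" step, as a hedged sketch: it assumes a block of L_SIZE x L_SIZE threads, l_row = threadIdx.y + 1 and l_col = threadIdx.x + 1 as in the snippet above, and it omits the boundary checks a real kernel would need at the image edges.

// Cooperative load of the (L_SIZE+2) x (L_SIZE+2) tile, including the 1-pixel halo.
const int tile_w = L_SIZE + 2;
int tid = threadIdx.y * L_SIZE + threadIdx.x;          // linear thread index in the block
int base_row = blockIdx.y * L_SIZE - 1;                // top-left corner of the tile (with halo)
int base_col = blockIdx.x * L_SIZE - 1;
for (int k = tid; k < tile_w * tile_w; k += L_SIZE * L_SIZE) {
  int r = base_row + k / tile_w;                       // global row of this tile element
  int c = base_col + k % tile_w;                       // global column of this tile element
  l_data[k] = Image[r * width + c];                    // global -> shared copy
}
__syncthreads();                                       // tile is now visible to all threads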
50. Shared Memory
Shared memory is an interleaved (banked) memory
Each bank can service one address per cycle
Typically, 32 banks in NVIDIA GPUs
Successive 32-bit words are assigned to successive banks
Bank = Address % 32
Bank conflicts are only possible within a warp
No bank conflicts between different warps
50
51. Shared Memory Bank Conflicts (I)
Bank conflict free
[Diagram: threads 0–15 mapped to banks 0–15 in two conflict-free patterns: linear addressing with stride 1 (thread i accesses bank i) and a random 1:1 permutation of threads to banks.]
51
Slide credit: Hwu & Kirk
52. Shared Memory Bank Conflicts (II)
N-way bank conflicts
[Diagram: a 2-way bank conflict with stride 2 (pairs of threads map to the same bank) and an 8-way bank conflict with stride 8 (groups of 8 threads map to the same bank).]
52
Slide credit: Hwu & Kirk
53. Reducing Shared Memory Bank Conflicts
Bank conflicts are only possible within a warp
No bank conflicts between different warps
If strided accesses are needed, some optimization techniques can help
Padding
Randomized mapping
Rau, “Pseudo-randomly interleaved memory,” ISCA 1991
Hash functions
V.d.Braak+, “Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs,” IEEE TC, 2016
53
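A small sketch of the padding technique (not from the slides), using the classic shared-memory matrix transpose; the 32x32 block size and a square matrix whose side is a multiple of 32 are assumptions of this example.

// Assumes blockDim = (32, 32) and a square matrix of size width x width (width % 32 == 0).
__global__ void transposeTile(const float* in, float* out, int width) {
  // One padding column: row r starts at offset 33*r instead of 32*r, so the
  // column read below hits 32 different banks. Without the +1, it would be a
  // stride-32 pattern, i.e., a 32-way bank conflict.
  __shared__ float tile[32][32 + 1];

  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global load

  __syncthreads();

  x = blockIdx.y * 32 + threadIdx.x;                    // transposed block origin
  y = blockIdx.x * 32 + threadIdx.y;
  out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict free thanks to padding
}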
55. Control Flow Problem in GPUs/SIMT
A GPU uses a SIMD pipeline to save area on control logic
Groups scalar threads into warps
Branch divergence occurs when threads inside warps branch to different execution paths
[Diagram: a warp reaches a branch; some threads follow Path A and others Path B, so the two paths execute one after the other with parts of the warp masked off.]
Slide credit: Tor Aamodt
This is the same as conditional/predicated/masked execution.
Recall the Vector Mask and Masked Vector Operations?
55
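As a hedged illustration (not from the slides), the first kernel below diverges because even and odd lanes of a warp take Path A and Path B respectively, so the warp executes both paths with part of its lanes masked off; the second kernel branches on a condition that is uniform within each warp, so no intra-warp divergence occurs (it performs a different computation; only the branching pattern is the point).

// Divergent: lanes of the same warp take different paths.
__global__ void divergent(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (i % 2 == 0) data[i] = data[i] * 2.0f;   // Path A (even lanes)
  else            data[i] = data[i] + 1.0f;   // Path B (odd lanes)
}

// Warp-uniform branch: the condition depends only on blockIdx, so every lane of a
// warp takes the same path and the SIMD pipeline stays fully utilized.
__global__ void warpUniform(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (blockIdx.x % 2 == 0) data[i] = data[i] * 2.0f;
  else                     data[i] = data[i] + 1.0f;
}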
63. Shared Memory Atomic Operations
Atomic operations are needed when threads might update the same memory locations at the same time
CUDA: int atomicAdd(int*, int);
PTX: atom.shared.add.u32 %r25, [%rd14], %r24;
SASS on Tesla, Fermi, Kepler:
/*00a0*/ LDSLK P0, R9, [R8];
/*00a8*/ @P0 IADD R10, R9, R7;
/*00b0*/ @P0 STSCUL P1, [R8], R10;
/*00b8*/ @!P1 BRA 0xa0;
SASS on Maxwell, Pascal, Volta:
/*01f8*/ ATOMS.ADD RZ, [R7], R11;
Native atomic operations for 32-bit integer, and 32-bit and 64-bit atomicCAS
63
64. Atomic Conflicts
We define the intra-warp conflict degree as the number of threads in a warp that update the same memory position
The conflict degree can be between 1 and 32
[Diagram: with no atomic conflict, updates to shared memory proceed concurrently (tbase); with an atomic conflict, updates are serialized (tbase, tconflict).]
64
65. Histogram Calculation
Histograms count the number of data instances in disjoint categories (bins)
for (each pixel i in image I){
  Pixel = I[i]                    // Read pixel
  Pixel’ = Computation(Pixel)     // Optional computation
  Histogram[Pixel’]++             // Vote in histogram bin
}
Atomic additions
65
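A sketch of the histogram loop above in CUDA (not taken from the slides), using per-block bins in shared memory to reduce contention on the atomic additions before merging into the global histogram; NUM_BINS and the 8-bit pixel type are assumptions of this example.

#define NUM_BINS 256   // assumes 8-bit pixel values

__global__ void histogram(const unsigned char* image, int n, unsigned int* hist) {
  __shared__ unsigned int local_hist[NUM_BINS];
  for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
    local_hist[b] = 0;                                   // clear the per-block bins
  __syncthreads();

  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
    atomicAdd(&local_hist[image[i]], 1u);                // vote in a shared memory bin
  __syncthreads();

  for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
    atomicAdd(&hist[b], local_hist[b]);                  // merge into the global histogram
}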
69. Data Transfers
Synchronous and asynchronous transfers
Streams (Command queues)
Sequence of operations that are performed in order
CPU-GPU data transfer
Kernel execution
D input data instances, B blocks
GPU-CPU data transfer
Default stream
69
70. Asynchronous Transfers
Computation divided into nStreams
D input data instances, B blocks
Each of the nStreams streams processes D/nStreams data instances and B/nStreams blocks
Estimated total time:
tE + tT/nStreams, if tE >= tT (dominant kernel)
tT + tE/nStreams, if tT > tE (dominant transfers)
70
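A hedged sketch (not from the slides) of splitting the work across nStreams streams so that transfers and kernel execution overlap. It is a fragment, not a complete program: process is some __global__ kernel, h_in/h_out are assumed to be pinned host buffers (cudaHostAlloc) of D floats so the copies can be asynchronous, d_in/d_out are device buffers, and D is assumed divisible by nStreams.

const int nStreams = 4;
cudaStream_t streams[nStreams];
for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

int chunk = D / nStreams;                          // D/nStreams data instances per stream
for (int s = 0; s < nStreams; s++) {
  int offset = s * chunk;
  cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                  cudaMemcpyHostToDevice, streams[s]);                 // CPU-GPU transfer
  process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_out + offset, // B/nStreams blocks
                                                       d_in + offset, chunk);
  cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[s]);                 // GPU-CPU transfer
}
cudaDeviceSynchronize();                           // wait for all streams to finish
for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);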
71. Overlap of Communication and Computation
Applications with independent computation on different data instances can benefit from asynchronous transfers
For instance, video processing
71
Gomez-Luna+, “Performance models for asynchronous data transfers on consumer Graphics Processing Units,” JPDC, 2012.
72. Summary
GPU as an accelerator
Program structure
Bulk synchronous programming model
Memory hierarchy and memory management
Performance considerations
Memory access
Latency hiding: occupancy (TLP)
Memory coalescing
Data reuse: shared memory
SIMD utilization
Atomic operations
Data transfers
72
77. Case studies using CPU and GPU
Kernel launches are asynchronous
CPU can work while it waits for the GPU to finish
Traditionally, this is the most efficient way to exploit heterogeneity
// Allocate input
malloc(input, ...);
cudaMalloc(d_input, ...);
cudaMemcpy(d_input, input, ..., HostToDevice); // Copy to device memory
// Allocate output
malloc(output, ...);
cudaMalloc(d_output, ...);
// Launch GPU kernel
gpu_kernel<<<blocks, threads>>> (d_output, d_input, ...);
// CPU can do things here
// Synchronize
cudaDeviceSynchronize();
// Copy output to host memory
cudaMemcpy(output, d_output, ..., DeviceToHost);
Collaborative Computing Algorithms
77
78. Fine-Grained Heterogeneity
Fine-grain heterogeneity becomes possible with the Pascal/Volta architecture
Pascal/Volta Unified Memory
CPU-GPU memory coherence
System-wide atomic operations
// Allocate input
cudaMallocManaged(input, ...);
// Allocate output
cudaMallocManaged(output, ...);
// Launch GPU kernel
gpu_kernel<<<blocks, threads>>> (output, input, ...);
// CPU can do things here
output[x] = input[y];
output[x+1].fetch_add(1);
78
79. Since CUDA 8.0
Unified memory
cudaMallocManaged(&h_in, in_size);
System-wide atomics
old = atomicAdd_system(&h_out[x], inc);
79
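A small sketch (not from the slides) combining the two features: the GPU updates a managed counter with atomicAdd_system while the CPU updates it concurrently; this assumes a Pascal-or-later GPU on a platform with concurrentManagedAccess, and it uses the GCC/Clang builtin __sync_fetch_and_add on the host side.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_count(int* counter, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd_system(counter, 1);     // system-wide atomic, coherent with the CPU
}

int main() {
  int* counter;
  cudaMallocManaged(&counter, sizeof(int));    // unified memory, accessible from CPU and GPU
  *counter = 0;
  gpu_count<<<1024, 256>>>(counter, 1024 * 256);
  __sync_fetch_and_add(counter, 1);            // CPU update concurrent with the running kernel
  cudaDeviceSynchronize();
  printf("counter = %d\n", *counter);          // expected: 1024*256 + 1
  cudaFree(counter);
  return 0;
}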
93. Bézier Surfaces (VII)
Static vs. dynamic implementation
Pascal/Volta Unified Memory: system-wide atomic operations
while(true){
  if(threadIdx.x == 0)
    my_tile = atomicAdd_system(tile_num, 1); // my_tile in shared memory; tile_num in UM
  __syncthreads(); // Synchronization
  if(my_tile >= number_of_tiles) break; // Break when all tiles processed
  ...
}
93
94. Benefits of Collaboration
Data partitioning improves performance
AMD Kaveri (4 CPU cores + 8 GPU CUs)
[Chart: Bézier surface execution time (ms, logarithmic scale) for 12x12, 8x8, and 4x4 surfaces on a 300x300 grid; the best data-partitioned configuration is marked.]
Bézier Surfaces (up to 47% improvement over GPU only)
94
95. Padding (I)
Matrix padding
Memory alignment
Transposition of near-square matrices
Traditionally, it can only be performed out-of-place
Padding
95
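One generic way to pad matrix rows for alignment (a hedged sketch, not necessarily the padding scheme used in this case study) is cudaMallocPitch, which rounds each row up to an aligned pitch; h_M, rows, and cols are assumed to be defined elsewhere.

float* d_M;
size_t pitch;                                    // padded row size in bytes, chosen by the runtime
cudaMallocPitch(&d_M, &pitch, cols * sizeof(float), rows);
// Copy a tightly packed rows x cols host matrix into the padded device layout.
cudaMemcpy2D(d_M, pitch, h_M, cols * sizeof(float),
             cols * sizeof(float), rows, cudaMemcpyHostToDevice);
// Inside a kernel, row r then starts at (float*)((char*)d_M + r * pitch).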
104. Atomic-Based Block Synchronization (II)
Code (simplified)
// GPU kernel
const int gtid = blockIdx.x * blockDim.x + threadIdx.x;
while(frontier_size != 0){
  for(node = gtid; node < frontier_size; node += blockDim.x*gridDim.x){
    // Visit neighbors
    // Enqueue in output queue if needed (global or local queue)
  }
  // Update frontier_size
  // Global synchronization
}
104
105. Atomic-Based Block Synchronization (III)
Global synchronization (simplified), at the end of each iteration
const int tid = threadIdx.x;
const int gtid = blockIdx.x * blockDim.x + threadIdx.x;
atomicExch(ptr_threads_run, 0);
atomicExch(ptr_threads_end, 0);
int frontier = 0;
...
frontier++;
if(tid == 0){
  atomicAdd(ptr_threads_end, 1); // Thread block finishes iteration
}
if(gtid == 0){
  while(atomicAdd(ptr_threads_end, 0) != gridDim.x){;} // Wait until all blocks finish
  atomicExch(ptr_threads_end, 0); // Reset
  atomicAdd(ptr_threads_run, 1); // Count iteration
}
if(tid == 0 && gtid != 0){
  while(atomicAdd(ptr_threads_run, 0) < frontier){;} // Wait until ptr_threads_run is updated
}
__syncthreads(); // Rest of threads wait here
...
105
106. Collaborative Implementation (I)
Motivation
Small-sized frontiers underutilize GPU resources
NVIDIA Jetson TX1 (4 ARMv8 CPUs + 2 SMXs)
New York City roads
106
107. Collaborative Implementation (II)
Choose the most appropriate device
[Diagram: small frontiers are processed on the CPU, large frontiers on the GPU.]
107
108. Collaborative Implementation (III)
Choose CPU or GPU depending on frontier size
CPU threads or GPU kernel keep running while the condition is satisfied
// Host code
while(frontier_size != 0){
  if(frontier_size < LIMIT){
    // Launch CPU threads
  }
  else{
    // Launch GPU kernel
  }
}
108
112. Collaborative Implementation (VII)
Pascal/Volta Unified Memory
CPU/GPU coherence
System-wide atomic operations
No need to re-launch kernel or CPU threads
Possibility of CPU and GPU working on the same frontier
112
113. Benefits of Collaboration
SSSP performs more computation than BFS
[Chart: execution time (ms, logarithmic scale) on the NE, NY, and UT road networks.]
Single Source Shortest Path (up to 22% improvement over GPU only)
113
114. Egomotion Compensation and Moving Objects Detection (I)
Hexapod robot OSCAR
Rescue scenarios
Strong egomotion on uneven terrains
Algorithm
Random Sample Consensus (RANSAC): F-o-F model
114
116. RANSAC (Fischler et al. 1981)
SISD and SIMD phases:
While (iteration < MAX_ITER){
  Fitting stage (Compute F-o-F model) // SISD phase
  Evaluation stage (Count outliers) // SIMD phase
  Comparison to best model // SISD phase
  Check if best model is good enough and iteration >= MIN_ITER // SISD phase
}
The fitting stage picks two flow vectors randomly
The evaluation stage generates motion vectors from the F-o-F model and compares them to the real flow vectors
116
120. We did not cover the following slides in lecture.
These are for your preparation for the next lecture.
121. Benefits of Unified Memory (I)
[Chart: kernel execution time (normalized) of discrete (D) and unified (U) versions of the benchmarks BS, CEDD, HSTI, HSTO, PAD, RSCD, SC, TRNS, RSCT, TQ, TQH, BFS, CEDT, and SSSP, grouped into fine-grain and coarse-grain data partitioning and task partitioning.]
Kernel times are comparable (same kernels; system-wide atomics make the Unified version sometimes slower)
Unified kernels can exploit more parallelism
Unified kernels avoid kernel launch overhead
121
122. Benefits of Unified Memory (II)
[Chart: execution time (normalized) of discrete (D) and unified (U) versions of the same benchmarks, broken down into Kernel, Copy Back & Merge, and Copy To Device.]
Unified versions avoid copy overhead
122
123. Benefits of Unified Memory (III)
[Chart: execution time (normalized) of discrete (D) and unified (U) versions of the same benchmarks, broken down into Kernel, Copy Back & Merge, Copy To Device, and Allocation.]
SVM allocation seems to take longer
123
124. Benefits of Collaboration on FPGA (I)
Case Study: Canny Edge Detection
[Chart: execution time (s) broken down into Idle, Copy, and Compute for CPU-only, FPGA-only, data-partitioned, and task-partitioned versions on Stratix V and Arria 10.]
Similar improvement from data and task partitioning
Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.
124
125. Benefits of Collaboration on FPGA (II)
Case Study: Random Sample Consensus
[Chart: execution time (ms) of data partitioning and task partitioning on Stratix V and Arria 10, across partitioning fractions from 0.0 to 1.0.]
Task partitioning exploits disparity in the nature of tasks
Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE’17 Vision Track.
125
126. Benefits of Collaboration on FPGA (III)
Sitao Huang, Li-Wen Chang, Izzat El Hajj, Simon Garcia De Gonzalo, Juan Gomez-Luna, Sai Rahul Chalamalasetti, Mohamed El-Hadedy, Dejan Milojicic, Onur Mutlu, Deming Chen, and Wen-mei Hwu, "Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures", Proceedings of the 10th ACM/SPEC International Conference on Performance Engineering (ICPE), Mumbai, India, April 2019.
[Slides (pptx) (pdf)]
[Chai CPU-FPGA Benchmark Suite]
126
127. Conclusions
Possibility of having CPU threads and GPU blocks collaborating on the same workload
Or having the most appropriate cores for each workload
Easier programming with Unified Memory or Shared Virtual Memory
System-wide atomic operations in NVIDIA Pascal/Volta and HSA
Fine-grain collaboration
127