ECMWF Advanced GPU Topics 1


Advanced GPU Topics #1

Jeremy Appleyard, September 2015


In this talk

• Profiling GPU applications
• Optimizing Data Movement
• Using MPI with GPUs

Ask questions at any point!

2
Profiling GPU Applications

3
Profiling Tools
Many options!

From NVIDIA
• nvprof
• NVIDIA Visual profiler
  • Standalone (nvvp)
  • Integrated into Nsight Eclipse Edition (nsight)
• Nsight Visual Studio Edition

Third Party
• System
• VampirTrace
• PAPI CUDA component
• HPC Toolkit
4
This talk

We will focus on nvprof and nvvp

nvprof => NVIDIA profiler
• Command line

nvvp => NVIDIA Visual Profiler
• GUI based

5
nvprof
Simple usage

• nvprof ./<executable>
• Example: vector addition from yesterday's talk

attributes(global) subroutine vecAdd_GPU(c, a, b, n)
  INTEGER, value :: n
  REAL, device, intent(in) :: a(n), b(n)
  REAL, device, intent(out) :: c(n)
  INTEGER :: i

  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

  if (i <= n) then
    c(i) = a(i) + b(i)
  end if
end subroutine vecAdd_GPU

6
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4

==34092== API calls:


Time(%) Time Calls Avg Min Max Name
95.29% 321.93ms 3 107.31ms 215.28us 321.48ms cudaMalloc
2.70% 9.1355ms 3 3.0452ms 2.1550ms 3.5234ms cudaMemcpy
1.47% 4.9710ms 498 9.9810us 177ns 498.93us cuDeviceGetAttribute
0.22% 758.16us 3 252.72us 234.62us 284.11us cudaFree
0.15% 519.49us 6 86.581us 82.801us 89.206us cuDeviceTotalMem
0.13% 449.63us 6 74.938us 70.892us 81.049us cuDeviceGetName
0.02% 69.908us 3 23.302us 11.485us 43.331us cudaLaunch
0.00% 12.300us 10 1.2300us 179ns 8.0460us cudaSetupArgument
0.00% 3.4430us 12 286ns 193ns 716ns cuDeviceGet
0.00% 2.8280us 2 1.4140us 285ns 2.5430us cuDeviceGetCount
0.00% 2.5750us 3 858ns 448ns 1.6120us cudaConfigureCall
7
nvprof ./vecAdd

• Top half of the profile is runtime measured from the GPU perspective
• What can we see?
  • [CUDA memcpy HtoD] and [CUDA memcpy DtoH] are memory copies to (host to device) and from (device to host) the GPU
  • The vector addition is only 1.19%!
  • This is valuable information!
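The split between copies and compute can be checked directly from the summary rows. The following is a minimal sketch that re-adds the four "Time" values from the nvprof output for ./vecAdd; the helper names `to_seconds` and `share_by_category` are made up for this illustration and are not part of nvprof.

```python
# Summary rows copied from the nvprof output for ./vecAdd: (time string, name).
ROWS = [
    ("3.8547ms", "[CUDA memcpy HtoD]"),
    ("1.6205ms", "[CUDA memcpy DtoH]"),
    ("66.178us", "kernel_vecadd_gpu_"),
    ("39.650us", "__pgi_dev_cumemset_4"),
]

# Unit suffixes that appear in this nvprof output, expressed in seconds.
UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a string such as '66.178us' to seconds."""
    # Two-letter suffixes are checked before the bare 's'.
    for suffix in ("ns", "us", "ms", "s"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognised time: {text!r}")


def share_by_category(rows):
    """Return (copy fraction, everything-else fraction) of total GPU time."""
    total = sum(to_seconds(t) for t, _ in rows)
    copies = sum(to_seconds(t) for t, name in rows if "memcpy" in name)
    return copies / total, (total - copies) / total


copy_share, other_share = share_by_category(ROWS)
print(f"memory copies: {copy_share:.1%}, kernel and memset: {other_share:.1%}")
```

With these rows the copies account for roughly 98% of the measured GPU time, which is why data movement is the first thing to look at for this example.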

8
nvprof ./vecAdd
• (PGI) OpenACC kernels will be named after the subroutine name and line number
  • e.g. vecadd_17_gpu
    • Subroutine vecadd
    • Line 17 in the file
    • Compiled for the GPU

9
• Bottom half of the profile is runtime measured from the CPU perspective
• What can we see?
  • First allocation is expensive (CUDA initialisation): one cudaMalloc call takes 321.48ms of the 321.93ms total

10
nvprof
More advanced options

• nvprof -h
  • There are quite a few options!
• Some useful ones:
  • -o: creates an output file which can be imported into nvvp
  • -m and -e: collect metrics or events
  • --analysis-metrics: collect all metrics for import into nvvp
  • --query-metrics and --query-events: query which metrics/events are available
11
nvprof
Events and Metrics

Most are quite in-depth; however, some are useful for quick analysis
In general, events are only for the expert and are rarely useful
(A few) useful metrics:
• dram_read_throughput: Main GPU memory read throughput
• dram_write_throughput: Main GPU memory write throughput
• flop_count_sp: Number of single precision floating point operations
• flop_count_dp: Number of double precision floating point operations
12
nvprof
Metrics Example

nvprof --metrics dram_read_throughput,dram_write_throughput,flop_count_sp,flop_count_dp ./vecAdd


==32928== NVPROF is profiling process 32928, command: ./vecAdd
==32928== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==32928== Profiling application: ./vecAdd
==32928== Profiling result:
==32928== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K40m (0)"
Kernel: kernel_vecadd_gpu_
1 dram_read_throughput Device Memory Read Throughput 136.55GB/s 136.55GB/s 136.55GB/s
1 dram_write_throughput Device Memory Write Throughput 70.062GB/s 70.062GB/s 70.062GB/s
1 flop_count_sp Floating Point Operations(Single Precisi 1000000 1000000 1000000
1 flop_count_dp Floating Point Operations(Double Precisi 0 0 0

13
nvprof
Metrics Example
dram_read_throughput 136.55GB/s
dram_write_throughput 70.062GB/s
flop_count_sp 1000000
flop_count_dp 0

Problem was addition of 1,000,000 element single precision vectors
• 1,000,000 single precision flop count is therefore expected!
• ~50 GFLOP/s
• About 1% peak FLOPs

Measured dram throughput is ~206GB/s
• Peak for this machine is 288 GB/s
• 72% peak bandwidth
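The bandwidth figures can be reproduced with a few lines of arithmetic. This is a small sketch using only the numbers quoted on the slide; the function name `fraction_of_peak` is invented for the illustration.

```python
# Throughput reported by nvprof for kernel_vecadd_gpu_ (values from the slide).
DRAM_READ_GBS = 136.55
DRAM_WRITE_GBS = 70.062
# Peak memory bandwidth quoted on the slide for this machine.
PEAK_GBS = 288.0


def fraction_of_peak(read_gbs, write_gbs, peak_gbs):
    """Return (total measured throughput, fraction of peak)."""
    total = read_gbs + write_gbs
    return total, total / peak_gbs


total_gbs, fraction = fraction_of_peak(DRAM_READ_GBS, DRAM_WRITE_GBS, PEAK_GBS)
# c = a + b reads two vectors and writes one, so reads should be about
# twice the writes.
read_write_ratio = DRAM_READ_GBS / DRAM_WRITE_GBS

print(f"measured: {total_gbs:.1f} GB/s ({fraction:.0%} of peak)")
print(f"read/write ratio: {read_write_ratio:.2f}")
```

The read/write ratio of roughly 2 is a useful sanity check: it is what a kernel that loads a(i) and b(i) and stores c(i) should produce.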
14
Visual Profiler
nvvp

• Either import the output of nvprof -o ...
  • Need to select metrics in advance
• Or run application through GUI
  • Re-run on-the-fly to compute new metrics based on requirements

15
Vecadd - timeline

16
NEMO - timeline

17
NEMO - details

18
Back to vecadd
More details!

• If we collected our profiling information with --analysis-metrics, or if we're running the application through nvvp directly, more information is available!
• --analysis-metrics may take some time to run
  • Run separately after generating timeline info

This is what we want!

19


20
• Matches our earlier calculations of ~72% peak bandwidth

21
Compute, Bandwidth or Latency Bound

• Other analysis options available
• The tool will guide you
• Becomes more useful the more experience you get

22
Profiling tips
• The nvprof output is a very good place to start
• The timeline is a good place to go next
• Only dig deep into a kernel if it's taking a significant amount of your time
• Where possible, try to match profiler output with theory
  • For example, if I know I'm moving 1GB, and my kernel takes 10ms, I expect the profiler to report 100GB/s.
  • Discrepancies are likely to mean your application isn't doing what you thought it was
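The "match profiler output with theory" check is simple enough to script. The sketch below reproduces the 1GB in 10ms example; the function names and the 60 GB/s "reported" value are hypothetical and chosen only to show what a discrepancy looks like.

```python
def expected_bandwidth_gbs(bytes_moved, seconds):
    """Bandwidth implied by the data volume and the kernel time, in GB/s."""
    return bytes_moved / seconds / 1e9


def relative_discrepancy(expected, reported):
    """How far the profiler's figure is from the expectation (0.0 = match)."""
    return abs(reported - expected) / expected


# The example from the slide: 1 GB moved by a kernel that takes 10 ms.
expected = expected_bandwidth_gbs(1e9, 10e-3)

# Hypothetical profiler reading, for illustration only.
reported = 60.0
gap = relative_discrepancy(expected, reported)

print(f"expected {expected:.0f} GB/s, profiler reports {reported:.0f} GB/s")
print(f"discrepancy: {gap:.0%}")
```

A large gap like the one in this made-up reading would suggest the kernel is moving a different amount of data than assumed, or that the timing covers something other than the kernel.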

23
Optimising Data Movement

24
Heterogeneous Computing

PCI Bus

25
Data movement
What are the costs?

Two costs:
• Bandwidth
• Latency

Large copies will be bandwidth bound.
• Peak PCI-E bandwidth is 16 GB/s

Small copies will be latency bound.
• Many small copies are expensive
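One common way to reason about the two costs is a simple model: time = latency + size / bandwidth. The sketch below uses the 16 GB/s peak from the slide; the 10 microsecond per-copy latency is an assumed, illustrative value, not a measurement, and the function names are invented here.

```python
PEAK_BANDWIDTH = 16e9  # bytes/s, peak PCI-E figure from the slide
LATENCY = 10e-6        # seconds per copy; ASSUMED for illustration only


def transfer_time(nbytes, latency=LATENCY, bandwidth=PEAK_BANDWIDTH):
    """Idealised time for one copy: fixed latency plus size over bandwidth."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency=LATENCY, bandwidth=PEAK_BANDWIDTH):
    """Bytes per second actually achieved once latency is included."""
    return nbytes / transfer_time(nbytes, latency, bandwidth)


for nbytes in (4 * 1024, 1024**2, 256 * 1024**2):
    achieved = effective_bandwidth(nbytes)
    print(f"{nbytes:>10} bytes: {achieved / 1e9:6.2f} GB/s "
          f"({achieved / PEAK_BANDWIDTH:.0%} of peak)")
```

Under these assumptions a 4 KiB copy achieves only a few percent of peak, while a 256 MiB copy is almost entirely bandwidth bound.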
26
Data movement
Several optimisation options

Several ways to optimise data movement:


1. Do less of it!
2. Fuse small copies into larger ones
3. Use pinned buffers
4. Overlap data movement with execution

Remember: profile before optimising!


27
Optimising Data Movement
1. Do less of it!

• Consider moving data movement outside of main loop
  • For example, an iterative solver would not want data movement within the iteration loop
  • Often, during the process of making an application run on a GPU these additional copies are unavoidable
  • If doing intermediate performance analysis, bear in mind these copies may disappear in the final application
• Consider if data can be passed as an argument (CUDA), or declared as private (OpenACC)

28
Optimising Data Movement
2. Fuse small copies into larger ones

• Small data transfers are much less efficient
  • If the opportunity arises, merging copies can be beneficial
  • Can be difficult to accomplish in some applications
  • Additional cost due to packing/unpacking may outweigh the savings
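Whether fusing pays off depends on how the saved latency compares with the packing cost. The following sketch compares the two under an idealised model; the latency, the packing bandwidth and the function names are all assumptions made for this illustration, and only the 16 GB/s figure comes from the slides.

```python
BANDWIDTH = 16e9       # bytes/s, peak PCI-E figure from the slides
LATENCY = 10e-6        # seconds per copy; ASSUMED
PACK_BANDWIDTH = 10e9  # bytes/s for packing or unpacking in memory; ASSUMED


def separate_copies(n_copies, bytes_each):
    """Time for n independent copies, each paying the latency once."""
    return n_copies * (LATENCY + bytes_each / BANDWIDTH)


def fused_copy(n_copies, bytes_each):
    """Time to pack, send one large copy, then unpack on the other side."""
    total = n_copies * bytes_each
    pack_and_unpack = 2 * total / PACK_BANDWIDTH
    return pack_and_unpack + LATENCY + total / BANDWIDTH


# Many small copies: latency dominates, so fusing helps.
small_separate = separate_copies(1000, 4 * 1024)
small_fused = fused_copy(1000, 4 * 1024)

# A few large copies: packing costs more than the latency it saves.
large_separate = separate_copies(4, 64 * 1024**2)
large_fused = fused_copy(4, 64 * 1024**2)

print(f"1000 x 4 KiB : separate {small_separate * 1e3:.2f} ms, "
      f"fused {small_fused * 1e3:.2f} ms")
print(f"4 x 64 MiB   : separate {large_separate * 1e3:.2f} ms, "
      f"fused {large_fused * 1e3:.2f} ms")
```

With these assumed numbers, fusing wins clearly for the thousand small copies and loses for the four large ones, which is the trade-off the last bullet warns about.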

29
Optimising Data Movement
3. Use pinned buffers

• Page-locked (or pinned) allocations generally have higher bandwidth
  • Can harm system performance if overused
• In CUDA:
  • cudaHostAlloc(): allocate page-locked memory
  • cudaHostRegister(): convert normal allocation to page-locked
  • Pinned attribute in CUDA Fortran
• In OpenACC:
  • (Hopefully) done automatically!
30
Optimising Data Movement
4. Overlap data movement with execution

• With pinned memory, PCI-E transfers can occur at the same time as kernel execution
• In CUDA:
  • cudaMemcpyAsync
• In OpenACC:
  • async clause on parallel or kernels directives
• This can greatly reduce the cost of memory copies, to almost nothing in the right case
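The benefit of overlap can be estimated with a two-stage pipeline model: split the work into chunks so that the copy of one chunk runs while the kernel processes the previous one. This is an idealised sketch that ignores per-chunk latency and launch overhead; the timings and function names are assumptions for illustration.

```python
def serial_time(copy_time, kernel_time):
    """No overlap: the kernel starts only after the whole copy has finished."""
    return copy_time + kernel_time


def pipelined_time(copy_time, kernel_time, n_chunks):
    """Two-stage pipeline over n equal chunks (copy in, then compute).

    The first chunk's copy and the last chunk's compute cannot be hidden;
    in between, each step costs the slower of the two stages.
    """
    copy_chunk = copy_time / n_chunks
    kernel_chunk = kernel_time / n_chunks
    return copy_chunk + (n_chunks - 1) * max(copy_chunk, kernel_chunk) + kernel_chunk


# ASSUMED timings: 8 ms of copies and 8 ms of kernel execution.
COPY, KERNEL = 8e-3, 8e-3

for chunks in (1, 4, 16):
    total = pipelined_time(COPY, KERNEL, chunks)
    print(f"{chunks:>2} chunks: {total * 1e3:.1f} ms "
          f"(serial: {serial_time(COPY, KERNEL) * 1e3:.1f} ms)")
```

In this model the total approaches the larger of the two times as the number of chunks grows, which is the "almost free copies" case. Real applications see diminishing returns once per-chunk overheads become significant.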

31
Optimising Data Movement
Summary

• There are good (and bad) ways of moving data between GPU and CPU
• For many applications this is not important
  • Data resides on the GPU for the application's lifetime
• The next-gen GPU, Pascal, will bring hardware improvements: NVLINK

32
NVLINK
Faster communications between processors

NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility

33
NVLink Unleashes Multi-GPU Performance
[Diagram: GPUs interconnected with NVLink, 5x faster than PCIe Gen3 x16; CPU, PCIe switch and two Tesla GPUs]
[Chart: over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe. Speedup vs PCIe based server (axis 1.00x to 2.25x) for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, 3D FFT]

34
Using MPI with GPUs

35
MPI+CUDA

[Diagram: Node 0, Node 1, ..., Node n-1. Each node has a GPU with GDDR5 memory and a CPU with system memory, connected via PCI-e to a network card]

36
MPI+CUDA

// MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);

// MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
38
Message Passing Interface
MPI

• Standard to exchange data between processes via messages
  • Defines an API to exchange messages
    • Point-to-point: e.g. MPI_Send, MPI_Recv
    • Collectives: e.g. MPI_Reduce
• Multiple implementations (open source and commercial)
  • Bindings for C/C++, Fortran, Python, …
  • E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, …

39
MPI
Compiling and launching

$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>

[Diagram: four myapp processes]

40
Launch MPI + CUDA/OpenACC programs
• Launch one process per GPU
  • MVAPICH: MV2_USE_CUDA
    $ MV2_USE_CUDA=1 mpirun -np ${np} ./myapp <args>
  • Open MPI: CUDA-aware features are enabled by default
  • Cray: MPICH_RDMA_ENABLED_CUDA
  • IBM Platform MPI: PMPI_GPU_AWARE

41
Unified Virtual Addressing

• One address space for all CPU and GPU memory
  • Determine physical memory location from a pointer value
  • Enable libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
• Support:
  • 64-bit applications on Linux
  • Windows using TCC mode

42
Unified Virtual Addressing
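[Diagram, left: No UVA, separate address spaces. System memory and GPU memory each have their own range 0x0000 to 0xFFFF; CPU and GPU connected via PCI-e]
[Diagram, right: UVA, single address space. System memory and GPU memory share one range 0x0000 to 0xFFFF; CPU and GPU connected via PCI-e]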
43
MPI + CUDA

With UVA and CUDA-aware MPI:

!MPI rank 0
call MPI_Send(s_buf_d,size,…)

!MPI rank n-1
call MPI_Recv(r_buf_d,size,…)

No UVA and regular MPI:

!MPI rank 0
s_buf_h = s_buf_d
call MPI_Send(s_buf_h,size,…)

!MPI rank n-1
call MPI_Recv(r_buf_h,size,…)
r_buf_d = r_buf_h

44
MPI + OpenACC

With UVA and CUDA-aware MPI:

!$acc host_data use_device (s_buf, r_buf)
!MPI rank 0
call MPI_Send(s_buf,size,…)

!MPI rank n-1
call MPI_Recv(r_buf,size,…)
!$acc end host_data

No UVA and regular MPI:

!$acc update host(s_buf)
!MPI rank 0
call MPI_Send(s_buf,size,…)

!MPI rank n-1
call MPI_Recv(r_buf,size,…)
!$acc update device(r_buf)

45
NVIDIA GPUDirect™
Peer to Peer transfers
[Diagram: GPU 1 and GPU 2, each with its own memory, and the CPU with system memory, all attached to the PCI-e chipset together with an InfiniBand (IB) adapter]
46
NVIDIA GPUDirect™
Support for RDMA
[Diagram: GPU 1 and GPU 2, each with its own memory, and the CPU with system memory, all attached to the PCI-e chipset together with an InfiniBand (IB) adapter]
48
Performance Results two Nodes
OpenMPI 1.8.4, MLNX FDR IB (4X), Tesla K40 @ 875 MHz
[Chart: bandwidth BW (MB/s, axis 0 to 7000) against message size (byte) for three configurations: regular MPI, GPU-aware MPI, and GPU-aware MPI with GPUDirect RDMA]
Latency (1 byte): 19.79 us, 17.97 us, 5.70 us

50
In this talk

• Profiling GPU applications
• Optimizing Data Movement
• Using MPI with GPUs

Any questions?

51
Bonus slides - Pascal!

52
GPU Roadmap

[Chart: SGEMM / W (normalized, axis 0 to 20) against year, 2008 to 2016. Tesla: CUDA. Fermi: FP64. Kepler: Dynamic Parallelism. Maxwell: DX12. Pascal: Unified Memory, 3D Memory, NVLink]
53
Pascal

• Faster and larger global memory


• Faster communication between processors
• More powerful unified memory
• Mixed precision computing

54
Stacked Memory
High performance global memory

3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity (up to 32GB)
• 4x More Energy Efficient per bit

55
[Diagram, 2014: Kepler GPU attached via PCIe to an x86, ARM64 or POWER CPU]
[Diagram, 2016: Pascal GPU with NVLink as a high-speed GPU interconnect, NVLink to a POWER CPU, and PCIe to an x86, ARM64 or POWER CPU]
57
Unified Memory: Simpler & Faster with NVLink

[Diagram: three developer views. Traditional developer view: separate system memory and GPU memory. Developer view with Unified Memory: one unified memory. Developer view with Pascal & NVLink: unified memory, NVLink at 80 GB/s]
• Share data structures at CPU memory speeds, not PCIe speeds
• Oversubscribe GPU memory

58
Mixed precision computing
IEEE 16-bit float support

• Halving precision results in:
  • Half the memory footprint
  • Half the bandwidth required
  • Double the computational throughput
• Obviously comes at an accuracy penalty
  • Application dependent
• 16/32/64 bit all supported
• Supported in software now
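The footprint and accuracy trade-off can be seen on the host with Python's standard `struct` module, which can pack IEEE 754 16-bit, 32-bit and 64-bit floats. This is only an illustration of the number formats, not of GPU behaviour; the helper names and the test value 0.1 are chosen for this sketch.

```python
import struct

# struct format codes: 'e' = IEEE 754 binary16, 'f' = binary32, 'd' = binary64.
# The '<' prefix selects standard sizes (2, 4 and 8 bytes).
FORMATS = {"16-bit": "<e", "32-bit": "<f", "64-bit": "<d"}


def footprint_bytes(n_elements, fmt):
    """Memory needed to store n_elements values in the given format."""
    return n_elements * struct.calcsize(fmt)


def round_trip(value, fmt):
    """Store a value in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


N = 1_000_000  # same vector length as the vecAdd example
VALUE = 0.1    # not exactly representable in binary floating point

for name, fmt in FORMATS.items():
    stored = round_trip(VALUE, fmt)
    print(f"{name}: {footprint_bytes(N, fmt) / 1e6:.0f} MB for {N} elements, "
          f"0.1 stored with error {abs(stored - VALUE):.2e}")
```

Each halving of the width halves the footprint, and therefore the bytes that have to cross the memory bus, while the representation error of 0.1 grows by several orders of magnitude. Whether that error is acceptable is application dependent.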


59
