ECMWF Advanced GPU Topics 1
2
Profiling GPU Applications
3
Profiling Tools
Many options!
From NVIDIA:
• nvprof
• NVIDIA Visual Profiler
  • Standalone (nvvp)
  • Integrated into Nsight Eclipse Edition (nsight)
• Nsight Visual Studio Edition

Third Party:
• TAU Performance System
• VampirTrace
• PAPI CUDA component
• HPC Toolkit
4
This talk
5
nvprof
Simple usage
attributes(global) subroutine vecAdd_GPU(a, b, c, n)
   ! kernel header and thread index assumed; the slide shows only the body
   i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
   if (i <= n) then
      c(i) = a(i) + b(i)
   end if
end subroutine vecAdd_GPU
6
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
• Top half of the profile is runtime measured from the GPU perspective
• What can we see?
• Memcpy HtoD/DtoH are memory copies to (host to device) and from (device to host) the GPU
8
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
• (PGI) OpenACC kernels will be named after the subroutine name and line number
• e.g. the kernel generated at line 17 of subroutine vecadd appears as vecadd_17_gpu
9
• Bottom half of the profile is runtime measured from the CPU perspective
• What can we see?
• First allocation is expensive (CUDA initialisation)
• nvprof -h
• There are quite a few options!
• Some useful ones:
• -o: creates an output file which can be imported into nvvp
• -m and -e: collect metrics or events
• --analysis-metrics: collect all metrics for import into nvvp
• --query-metrics and --query-events: query which metrics/events are available
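For example (output file names and the executable are illustrative), a basic profile and an nvvp-importable analysis could be collected with:

nvprof -o vecAdd.prof ./vecAdd
nvprof --analysis-metrics -o vecAdd.analysis ./vecAdd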
11
nvprof
Events and Metrics
Most are quite in-depth, but some are useful for quick analysis
In general, events are only for the expert and rarely useful
(A few) useful metrics:
dram_read_throughput: Main GPU memory read throughput
dram_write_throughput: Main GPU memory write throughput
flop_count_sp: Number of single precision floating point operations
flop_count_dp: Number of double precision floating point operations
12
nvprof
Metrics Example
13
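The metrics on the next slide could be collected with a command along these lines (executable name is illustrative):

nvprof -m dram_read_throughput,dram_write_throughput,flop_count_sp,flop_count_dp ./vecAdd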
nvprof
Metrics Example
• Problem was addition of 1,000,000-element single precision vectors
• 1,000,000 single precision flop count is therefore expected!

dram_read_throughput    136.55 GB/s
dram_write_throughput   70.062 GB/s
flop_count_sp           1000000
flop_count_dp           0

• ~50 GFLOP/s overall, about 1% of peak FLOPs
15
Vecadd - timeline
16
NEMO - timeline
17
NEMO - details
18
Back to vecadd
More details!
21
Compute, Bandwidth or Latency Bound
22
Profiling tips
The nvprof output is a very good place to start
The timeline is a good place to go next
Only dig deep into a kernel if it’s taking a significant amount of your time
Where possible, try to match profiler output with theory
For example, if I know I’m moving 1GB, and my kernel takes 10ms, I expect the profiler to report 100GB/s.
Discrepancies are likely to mean your application isn’t doing what you thought it was
23
Optimising Data Movement
24
Heterogeneous Computing
[Diagram: CPU and GPU connected by the PCI bus]
25
Data movement
What are the costs?
Two costs:
Bandwidth
Latency
28
Optimising Data Movement
2. Fuse small copies into larger ones
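As a rough illustration (field count, sizes and names are invented), packing many small host arrays into one buffer turns many latency-bound PCI-E copies into a single large one:

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

#define NFIELDS     64
#define FIELD_BYTES 4096

int main(void)
{
    char *h_fields[NFIELDS], *h_packed, *d_packed;

    for (int i = 0; i < NFIELDS; ++i)
        h_fields[i] = (char *)malloc(FIELD_BYTES);
    h_packed = (char *)malloc((size_t)NFIELDS * FIELD_BYTES);
    cudaMalloc((void **)&d_packed, (size_t)NFIELDS * FIELD_BYTES);

    /* Pack on the host: cheap compared with the per-copy PCI-E latency. */
    for (int i = 0; i < NFIELDS; ++i)
        memcpy(h_packed + (size_t)i * FIELD_BYTES, h_fields[i], FIELD_BYTES);

    /* One large transfer instead of NFIELDS small ones. */
    cudaMemcpy(d_packed, h_packed, (size_t)NFIELDS * FIELD_BYTES,
               cudaMemcpyHostToDevice);

    cudaFree(d_packed);
    return 0;
}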
29
Optimising Data Movement
3. Use pinned buffers
• In CUDA: cudaMallocHost, or cudaHostRegister for an existing buffer (see the sketch below)
• In OpenACC:
• (Hopefully) done automatically!
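A minimal CUDA sketch of pinned allocation (buffer size and names are illustrative):

#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 26;             /* 64 MB, illustrative */
    float *h_buf, *d_buf;

    cudaMallocHost((void **)&h_buf, bytes);   /* pinned (page-locked) host memory */
    cudaMalloc((void **)&d_buf, bytes);

    /* Copies from pinned memory run at full PCI-E bandwidth and may be asynchronous. */
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}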
30
Optimising Data Movement
4. Overlap data movement with execution
• With pinned memory, PCI-E transfers can occur at the same time as kernel execution
• In CUDA:
• cudaMemcpyAsync
• In OpenACC:
• async clause on parallel or kernels directives
• This can greatly reduce the cost of memory copies – almost to none in the right case (see the sketch below)
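A minimal CUDA sketch of copy/compute overlap, assuming pinned host memory and illustrative sizes; each half of the array is copied and processed in its own stream, so the second copy overlaps the first kernel:

#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20, half = n / 2;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));   /* pinned host buffer */
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    /* Copy and compute each half in its own stream. */
    for (int k = 0; k < 2; ++k) {
        float *dk = d + k * half, *hk = h + k * half;
        cudaMemcpyAsync(dk, hk, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(dk, half);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}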
31
Optimising Data Movement
Summary
• There are good (and bad) ways of moving data between GPU and CPU
• For many applications this is not important
• Data resides on the GPU for the application’s lifetime
32
NVLINK
Faster communications between processors
NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility
33
NVLink Unleashes Multi-GPU Performance
GPUs interconnected with NVLink: over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe
[Chart: speedup vs PCIe-based server, scale 1.25x–2.25x; diagram of two Tesla GPUs attached to the CPU through a PCIe switch]
34
Using MPI with GPUs
35
MPI+CUDA
[Diagram: a cluster of nodes, each pairing a GPU with a CPU, connected over a network]
36
MPI+CUDA
//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);
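With a CUDA-aware MPI the device buffer s_buf_d can be handed to MPI directly; the matching receive on rank n-1 would look something like this (r_buf_d and stat are assumed names):

//MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);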
39
MPI
Compiling and launching
40
Launch MPI + CUDA/OpenACC programs
Launch one process per GPU
MVAPICH2: set MV2_USE_CUDA=1
Cray: set MPICH_RDMA_ENABLED_CUDA=1
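A launch could then look something like this (rank counts, binding and the executable name are illustrative and depend on the system and scheduler):

export MV2_USE_CUDA=1                 # MVAPICH2
mpirun -np 4 ./mpi_vecAdd             # one rank per GPU

export MPICH_RDMA_ENABLED_CUDA=1      # Cray
aprun -n 4 -N 1 ./mpi_vecAdd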
41
Unified Virtual Addressing
• Support:
• 64-bit applications on Linux (CUDA 4.0 or later, Fermi-class or newer GPUs)
42
Unified Virtual Addressing
No UVA: separate address spaces for the CPU and each GPU
UVA: a single virtual address space spanning CPU and GPU memory (connected over PCI-e)
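A small sketch of what UVA enables (CUDA C, names illustrative): with a single address space the runtime can work out where a pointer lives, so cudaMemcpyDefault can replace an explicit direction and pointer attributes can be queried:

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    float *h_buf = (float *)malloc(bytes);
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);

    /* Under UVA the runtime infers host-to-device from the pointer values. */
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);

    /* The runtime can also report which space a pointer belongs to. */
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}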
43
MPI + CUDA
44
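A rough sketch of the two usual patterns (function, buffer and argument names are invented): staging through a host buffer with a plain MPI, versus handing the device pointer straight to a CUDA-aware MPI:

#include <mpi.h>
#include <cuda_runtime.h>

void send_to_peer(const float *d_buf, float *h_buf, int n, int peer, int cuda_aware)
{
    if (cuda_aware) {
        /* CUDA-aware MPI: pass the device pointer straight to MPI. */
        MPI_Send((void *)d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
    } else {
        /* Plain MPI: stage through a host buffer first. */
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
    }
}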
MPI + OpenACC
45
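For OpenACC the usual trick is the host_data use_device clause, which exposes the device copy of an array so a CUDA-aware MPI can send straight from GPU memory. A minimal C sketch (array and argument names are invented):

#include <mpi.h>

void send_field(float *field, int n, int peer)
{
    #pragma acc data copyin(field[0:n])
    {
        #pragma acc host_data use_device(field)
        {
            /* Inside host_data, 'field' refers to the device address. */
            MPI_Send(field, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
        }
    }
}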
NVIDIA GPUDirect™
Peer to Peer transfers
[Diagram: GPU1 and GPU2 with their own memories, the CPU with system memory, a PCI-e chipset and an IB adapter]
46
NVIDIA GPUDirect™
Support for RDMA
[Diagram: GPU1 and GPU2 with their own memories, the CPU with system memory, a PCI-e chipset and an IB adapter]
48
Performance Results, two Nodes
OpenMPI 1.8.4, MLNX FDR IB (4X), Tesla K40 @ 875 MHz
[Chart: bandwidth in MB/s]
Any questions?
51
Bonus slides - Pascal!
52
GPU Roadmap
[Chart: SGEMM/W (normalized) vs year, 2008–2016]
• Tesla: CUDA
• Fermi: FP64
• Kepler: Dynamic Parallelism
• Maxwell: DX12
• Pascal: Unified Memory, 3D Memory, NVLink
53
Pascal
54
Stacked Memory
High performance global memory
3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity (up to 32GB)
• 4x More Energy Efficient per bit
55
NVLINK
Faster communications between processors
NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility
56
NVLink: High-Speed GPU Interconnect
[Diagram: 2014, Kepler GPU connected over PCIe; 2016, Pascal GPU connected to the POWER CPU and to other GPUs over NVLink, alongside PCIe]
57
Unified Memory: Simpler & Faster with NVLink
[Diagram: CPU and GPU connected by NVLink at 80 GB/s]
58
Mixed precision computing
IEEE 16-bit float support
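As a taste of what IEEE half precision looks like in CUDA device code, a minimal kernel-only sketch (names are invented; native FP16 arithmetic assumes Pascal-class or newer hardware):

#include <cuda_fp16.h>

__global__ void haxpy(int n, __half a, const __half *x, __half *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* __hfma: half-precision fused multiply-add, y = a*x + y */
        y[i] = __hfma(a, x[i], y[i]);
    }
}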