Highlighted notes on NVIDIA Tesla V100 GPU Architecture Whitepaper
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
Here is my short summary of the NVIDIA Tesla GV100 (Volta) architecture from the whitepaper:
84 SMs, each with 64 FP32 and 64 INT32 cores on separate, independently scheduled datapaths.
Shared mem. size config. up to 96KB / SM.
8 512-bit mem. controllers (total 4096-bit).
Up to 6 bidirectional NVLink 2.0 links, 25 GB/s per direction per link (e.g. w/ IBM POWER9 CPUs).
4 dies / HBM stack, 4 stacks. 16 GB w/ 900 GB/s HBM2 (Samsung).
SECDED ECC (1-err. correcting, 2-err. detecting) on HBM2, register files, L1, L2; HBM2 provides native ECC, w/o the storage penalty of sideband ECC.
Each SM has 4 processing blocks (each w/ its own warp scheduler, issuing 1 warp of 32 threads per clock).
L1 data cache is combined w/ shared mem. into 128 KB / SM (explicit staging through shared mem. less imp. than before).
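A minimal host-side sketch of opting a kernel into the large shared-memory carveout via the CUDA runtime (the kernel name `myKernel` and launch config are hypothetical; the attributes are the standard CUDA 9+ ones):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *out) {
    extern __shared__ float smem[];          // dynamic shared memory
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    // Prefer shared memory over L1 in the unified 128 KB structure.
    cudaFuncSetAttribute(myKernel,
        cudaFuncAttributePreferredSharedMemoryCarveout, 100);
    // Opt in to >48 KB dynamic shared memory per block (up to 96 KB on Volta).
    cudaFuncSetAttribute(myKernel,
        cudaFuncAttributeMaxDynamicSharedMemorySize, 96 * 1024);

    float *out;
    cudaMalloc(&out, 32 * sizeof(float));
    myKernel<<<1, 32, 96 * 1024>>>(out);     // 96 KB dynamic shared mem.
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Without the second attribute, a launch requesting more than 48 KB of dynamic shared memory fails.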
Volta's L1 also caches writes (not just loads, unlike prev. arch.).
NVLink supports coherency allowing data reads from GPU mem. to be stored in CPU cache.
Addr. Translation Serv. (ATS) lets the GPU access CPU page tables directly (so plain malloc ptrs work on the GPU).
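A sketch of what ATS enables, assuming an ATS-capable system (e.g. POWER9 + NVLink; elsewhere the pointer would have to come from cudaMallocManaged instead):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2;
}

int main() {
    const int n = 1 << 20;
    // Plain system malloc, not cudaMalloc / cudaMallocManaged.
    int *data = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) data[i] = i;

    // With ATS the GPU walks the CPU page tables, so this pointer
    // is directly dereferenceable from the kernel.
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    free(data);
    return 0;
}
```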
Copy engines don't need pinned memory (that's why I saw ~no speedup w/ pinned mem. in PR).
Volta's per-thread PC and call stack allow interleaved exec. of a warp's threads, enabling fine-grained sync. (__syncwarp()).
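A minimal warp-level sum using the Volta-era synchronizing shuffle intrinsics, where the mask names the participating lanes explicitly (a single-warp launch, `warpSum<<<1, 32>>>(in, out)`, is assumed):

```cuda
#include <cuda_runtime.h>

__global__ void warpSum(const float *in, float *out) {
    float v = in[threadIdx.x];
    // Tree reduction across the 32 lanes; the _sync suffix is required
    // under Volta's independent thread scheduling, since lanes of a warp
    // are no longer guaranteed to execute in lockstep.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0) *out = v;   // lane 0 holds the warp total
}
```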
Cooperative groups enable sync. at sub-warp, cross-warp (block), grid-wide, and multi-GPU scopes.
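A sub-warp sketch with cooperative groups (the 16-thread tile size is an arbitrary choice here; grid-wide sync would additionally require launching via cudaLaunchCooperativeKernel):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tileSum(const float *in, float *out) {
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into 16-thread tiles; each tile
    // syncs and shuffles independently of its neighbors.
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    float v = in[block.thread_rank()];
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        out[block.thread_rank() / 16] = v;  // one partial sum per tile
}
```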