- The document discusses current R&D work on pre-Exascale HPC systems, including a PRACE 2011 prototype that delivers over 10 TFLOPS in a single rack using heterogeneous hardware with GPUs and achieves over 1.1 TFLOPS/kW efficiency.
- Performance debugging techniques are discussed for multi-socket, multi-chipset, multi-GPU systems, targeting issues such as bottlenecks arising from the cache hierarchy topology and imbalanced I/O. CPU affinity and memory binding are key to optimizing performance on such systems.
- Tools such as hwloc, available on both Linux and Windows, can be used to set CPU and GPU affinity and to bind memory. Keeping memory accesses local in this way improves data transfer rates between devices.
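
To make the affinity idea concrete, here is a minimal Linux-only sketch in Python. It does not use hwloc; it reads the NUMA topology from sysfs and calls the scheduler interface exposed by Python's `os` module. The function names (`parse_cpulist`, `cpus_of_numa_node`, `pin_to_numa_node`) are hypothetical helpers introduced for illustration, and the sysfs path is the standard Linux location, assumed to be present on the target system.

```python
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_of_numa_node(node, sysfs_root="/sys/devices/system/node"):
    """Return the CPU ids that belong to one NUMA node, read from sysfs (Linux only)."""
    path = os.path.join(sysfs_root, f"node{node}", "cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_to_numa_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only).

    Hypothetical helper: it sets CPU affinity only and does not bind memory.
    """
    cpus = cpus_of_numa_node(node)
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Calling `pin_to_numa_node(0)` before allocating host buffers keeps the process on the cores of node 0. Under Linux's default local allocation policy, memory touched afterwards is typically placed on that same node, but this is not an explicit memory binding. For explicit binding, or for locating the node closest to a given GPU, tools such as `hwloc-bind` or `numactl --cpunodebind=N --membind=N` are the usual route.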