FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) Wim Vanderbauwhede
This document provides a historical overview of the evolution of FPGA technology and programming approaches over several decades. It discusses early theoretical foundations in the 1930s-40s and the development of integrated circuits, hardware description languages, and high-level synthesis tools from the 1950s onwards. More recently, it describes the rise of heterogeneous computing using GPUs, FPGAs and other accelerators, and the ongoing challenges around programming such systems at a suitable level of abstraction.
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines - Intel® Software
Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where their evaluation historically takes as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use caches and the wide vector units of modern CPUs efficiently. We therefore present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput over a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way towards strong scaling of QMC simulations. Finally, we employ roofline performance analysis to model the impact of our optimizations.
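To make the layout transformation concrete, here is a minimal C++ sketch of the AoS-to-SoA idea on a toy particle type; the paper applies it to four-dimensional B-spline coefficient tables, so the struct, field names, and pragma here are illustrative assumptions only:

```cpp
#include <vector>
#include <cstddef>

// Array-of-structures layout: the fields of one element are interleaved in memory.
struct ParticleAoS { double x, y, z, w; };

// Structure-of-arrays layout: each field is stored contiguously, so a vector
// loop can load x[i..i+width) with unit stride.
struct ParticlesSoA {
    std::vector<double> x, y, z, w;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa.x[i] = aos[i].x; soa.y[i] = aos[i].y;
        soa.z[i] = aos[i].z; soa.w[i] = aos[i].w;
    }
    return soa;
}

// A unit-stride reduction over one field now vectorizes cleanly.
double sum_x(const ParticlesSoA& p) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (std::size_t i = 0; i < p.x.size(); ++i) s += p.x[i];
    return s;
}
```

Blocking then amounts to sizing each SoA object so that a block of all four arrays fits in cache.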
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark, and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project will be shown. Benefits of these stacks to accelerate deep learning frameworks (such as CaffeOnSpark and TensorFlowOnSpark) will be presented."
Watch the video: https://wp.me/p3RLHQ-iko
Learn more: http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
A Library for Emerging High-Performance Computing Clusters - Intel® Software
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
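As a generic illustration of the highly threaded usage pattern that multi-endpoint designs target, the sketch below is a minimal MPI+OpenMP hybrid requesting MPI_THREAD_MULTIPLE. It uses only standard MPI-3 and OpenMP calls; nothing here is MVAPICH2-specific:

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    // Ask for full thread support so several OpenMP threads may call MPI
    // concurrently -- the usage multi-endpoint runtimes aim to make efficient.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::printf("warning: runtime only provides thread level %d\n", provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    {
        // Each thread could drive its own communication here, e.g. tagged
        // point-to-point traffic keyed on omp_get_thread_num().
        #pragma omp critical
        std::printf("rank %d, thread %d of %d\n",
                    rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}
```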
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system with over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes written for traditional multicore architectures to these manycore processors. We also discuss highlights and lessons learned from the optimization process for 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
Dr. Robert Voigt from the Krell Institute presented this deck at the recent HPC Saudi conference.
"This talk will provide a historical perspective on the challenges of educating computational scientists based on my personal involvement over a number of years. Three decidedly different activities will be drawn on to indicate how one can successfully approach the challenge. The first is based on experiences at the Institute for Computer Applications in Science and Engineering at the NASA Langley Research Center where visiting students were exposed to multidisciplinary research driven by computer simulations. The second is the Predictive Science Academic Alliance Program funded by the National Nuclear Security Administration, a component of the US Department of Energy (DOE). The third is the Computational Science Graduate Fellowship program funded by the DOE. The latter two programs provide students with exposure to multidisciplinary research and perhaps more unique, require them to spend a three month period at one of the DOE national laboratories. My experience with these three efforts suggest that development of computational scientists require three key components: class room exposure to applied mathematics, computer science and a scientific or engineering discipline; exposure to teams conducting multidisciplinary research; and a significant internship at a major research facility."
Watch a conversation with Dr. Robert Voigt: http://wp.me/p3RLHQ-gBl
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Assisting User’s Transition to Titan’s Accelerated Architecture - inside-BigData.com
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. A system of this scale can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools for successfully porting applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors - Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi... - inside-BigData.com
In this deck from PASC18, Robert Searles from the University of Delaware presents: Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures.
"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models among other components in order to migrate large scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling.
We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif."
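For readers unfamiliar with the pattern, the sketch below shows the 2D essence of a wavefront sweep in C++ with OpenMP: each cell depends on its west and north neighbours, so cells on the same anti-diagonal are independent and can run in parallel. This is the generic motif only, not the Minisweep or KBA code:

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
    const int N = 6;
    std::vector<double> a(N * N, 1.0);   // row 0 and column 0 act as boundary values
    // Sweep anti-diagonals d = i + j in order; cells on one diagonal are independent.
    for (int d = 2; d <= 2 * (N - 1); ++d) {
        #pragma omp parallel for
        for (int i = std::max(1, d - (N - 1)); i <= std::min(d - 1, N - 1); ++i) {
            int j = d - i;
            a[i * N + j] = a[(i - 1) * N + j] + a[i * N + (j - 1)];
        }
    }
    std::printf("corner value: %g\n", a[(N - 1) * N + (N - 1)]);
    return 0;
}
```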
Watch the video: https://wp.me/p3RLHQ-iPU
Read the Full Paper: https://doi.org/10.1145/3218176.3218228
and
https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the GoingARM workshop at SC17, Filippo Mantovani describes the contributions of the Barcelona Supercomputing Center to the European Mont-Blanc project.
"Since 2011, Mont-Blanc has pushed the adoption of Arm technology in High Performance Computing, deploying Arm-based prototypes, enhancing system software ecosystem and projecting performance of current systems for developing new, more powerful and less power hungry HPC computing platforms based on Arm SoC. In this talk, Filippo introduces the last Mont-Blanc system, called Dibona, designed and integrated by the coordinator and industrial partner of the project, Bull/ATOS. He also talks about tests performed at BSC of the Arm software tools (HPC compiler and mathematical libraries) as well as the Dynamic Load Balancing (DLB) technique and the Multiscale Simulator Architecture (MUSA)."
Watch the video: https://wp.me/p3RLHQ-i6o
Learn more: http://www.goingarm.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC Advisory Council Spain Conference, DK Panda from Ohio State University presents: Communication Frameworks for HPC and Big Data.
Watch the video presentation: http://insidehpc.com/2015/09/video-communication-frameworks-for-hpc-and-big-data/
Learn more: http://www.hpcadvisorycouncil.com/events/2015/spain-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Some experiences for porting application to Intel Xeon Phi - Maho Nakata
The document discusses experiences porting applications to Intel Xeon Phi. It provides tips for compiling applications with Intel Composer XE 2013 using the -mmic flag. While some applications like DGEMM require tuning to achieve peak performance, others like Gaussian09 and Povray require patches and multi-step configurations to build for Xeon Phi. There is also an effort underway to port the pkgsrc packaging system to help bring more software packages to the Xeon Phi.
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra... - Intel® Software
In this presentation, we focus on an alternative approach that uses nodes that contain Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. Programming models and the development tools are identical for these resources, greatly simplifying development. We discuss how the same models for vectorization and threading can be used across these compute resources to create software that performs well on them. We further propose an extension to the Intel® Threading Building Blocks (Intel® TBB) flow graph interface that enables intra-node distributed memory programming, simplifying communication, and load balancing between the processors and coprocessors. Finally, we validate this approach by presenting a benchmark of a risk analysis implementation that achieves record-setting performance.
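For context, this is what the stock Intel TBB flow graph interface looks like; the distributed-memory extension the presentation proposes is not part of standard TBB, so only the baseline API is sketched here:

```cpp
#include <tbb/flow_graph.h>
#include <cstdio>

int main() {
    tbb::flow::graph g;

    // A node that squares its input; 'unlimited' permits concurrent invocations.
    tbb::flow::function_node<int, int> square(
        g, tbb::flow::unlimited, [](int v) { return v * v; });

    // A serial sink that prints results (returns its input to satisfy the node type).
    tbb::flow::function_node<int, int> sink(
        g, tbb::flow::serial, [](int v) { std::printf("got %d\n", v); return v; });

    tbb::flow::make_edge(square, sink);
    for (int i = 0; i < 4; ++i) square.try_put(i);
    g.wait_for_all();   // block until all in-flight messages have drained
    return 0;
}
```

In the proposed extension, an edge like this could cross the host-coprocessor boundary, with the runtime handling communication and load balancing.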
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F... - Shinya Takamaeda-Y
This document describes PyCoRAM, a Python-based implementation of the CoRAM memory architecture for FPGA-based computing. PyCoRAM provides a high-level abstraction for memory management that decouples computing logic from memory access behaviors. It allows defining memory access patterns using Python control threads. PyCoRAM generates an IP core that integrates with standard IP cores on Xilinx FPGAs using the AMBA AXI4 interconnect. It supports parameterized RTL design and achieves high memory bandwidth utilization of over 84% on two FPGA boards in evaluations of an array summation application.
Performance Characterization and Optimization of In-Memory Data Analytics on ... - Ahsan Javed Awan
The document discusses performance optimization of Apache Spark on scale-up servers through near-data processing. It finds that Spark workloads have poor multi-core scalability and high I/O wait times on scale-up servers. It proposes exploiting near-data processing through in-storage processing and 2D-integrated processing-in-memory to reduce data movements and latency. The author evaluates these techniques through modeling and a programmable FPGA accelerator to improve the performance of Spark MLlib workloads by up to 9x. Challenges in hybrid CPU-FPGA design and attaining peak performance are also discussed.
This document discusses how HPC infrastructure is being transformed with AI. It summarizes that cognitive systems use distributed deep learning across HPC clusters to speed up training times. It also outlines IBM's hardware portfolio expansion for AI training, inference, and storage capabilities. The document discusses software stacks for AI like Watson Machine Learning Community Edition that use containers and universal base images to simplify deployment.
A CGRA-based Approach for Accelerating Convolutional Neural Networks - Shinya Takamaeda-Y
The document presents an approach for accelerating convolutional neural networks (CNNs) using a coarse-grained reconfigurable array (CGRA) called EMAX. EMAX features processing elements with local memory to improve data locality and memory bandwidth utilization. CNN computations like convolutions are mapped to EMAX by assigning weight matrices to constant registers and performing numerous small matrix multiplications in parallel. Evaluation shows EMAX achieves better performance per memory bandwidth and area than GPUs for CNN workloads due to its optimization for small matrix operations.
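As a rough sketch of the general idea of lowering a convolution to small matrix products (illustrative only, not the actual EMAX mapping), consider this im2col-style rewrite in C++:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int H = 4, W = 4, K = 3, OH = H - K + 1, OW = W - K + 1;
    std::vector<float> img(H * W), ker(K * K, 1.0f / 9.0f);
    for (int i = 0; i < H * W; ++i) img[i] = float(i);

    // im2col: gather each KxK input patch into one row of a small matrix.
    std::vector<float> cols(OH * OW * K * K), out(OH * OW, 0.0f);
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    cols[((oy * OW + ox) * K + ky) * K + kx] =
                        img[(oy + ky) * W + (ox + kx)];

    // The convolution is now one small matrix-vector product.
    for (int r = 0; r < OH * OW; ++r)
        for (int c = 0; c < K * K; ++c)
            out[r] += cols[r * K * K + c] * ker[c];

    std::printf("out[0] = %g (mean of the top-left 3x3 patch)\n", out[0]);
    return 0;
}
```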
Co-simulation interfaces for connecting distributed real-time simulators - Steffen Vogel
The document discusses co-simulation interfaces for connecting distributed real-time simulators. It describes an ongoing project using an OPAL-RT and RTDS co-simulation interface with an FPGA board to connect the two simulators digitally in real-time. It also discusses using the VILLAS framework to enable geographically distributed real-time simulation across multiple labs separated by large distances. The framework uses VPN technology to securely connect simulators like OPAL-RT and RTDS at different university research centers for collaborative simulation.
In this deck from the HPC User Forum at Argonne, Andrew Siegel from Argonne presents: ECP Application Development.
"The Exascale Computing Project is accelerating delivery of a capable exascale computing ecosystem for breakthroughs in scientific discovery, energy assurance, economic competitiveness, and national security. ECP is chartered with accelerating delivery of a capable exascale computing ecosystem to provide breakthrough modeling and simulation solutions to address the most critical challenges in scientific discovery, energy assurance, economic competitiveness, and national security. This role goes far beyond the limited scope of a physical computing system. ECP’s work encompasses the development of an entire exascale ecosystem: applications, system software, hardware technologies and architectures, along with critical workforce development."
Watch the video: https://wp.me/p3RLHQ-kSL
Learn more: https://www.exascaleproject.org
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the 2017 MVAPICH User Group, DK Panda from Ohio State University presents: Overview of the MVAPICH Project and Future Roadmap.
"This talk will provide an overview of the MVAPICH project (past, present and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, MVAPICH2-EA and MVAPICH2-MIC) will be presented. Current status and future plans for OSU INAM, OEMT and OMB will also be presented."
Watch the video: https://www.youtube.com/watch?v=wF7t-oH7wi4
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Real time machine learning proposers day v3 - mustafa sarac
This document discusses DARPA's Real Time Machine Learning (RTML) program. The objective is to develop hardware generators and compilers that can automatically create application-specific integrated circuits for machine learning from high-level code. This would allow no-human-in-the-loop creation of efficient neural network hardware. The program has two phases: phase 1 develops an ML hardware compiler, and phase 2 demonstrates RTML systems for applications like wireless communication and image processing. Key goals are high performance, low power consumption, and support for a variety of neural network architectures and machine learning techniques.
Challenges and Opportunities for HPC Interconnects and MPI - inside-BigData.com
In this video from the 2017 MVAPICH User Group, Ron Brightwell from Sandia presents: Challenges and Opportunities for HPC Interconnects and MPI.
"This talk will reflect on prior analysis of the challenges facing high-performance interconnect technologies intended to support extreme-scale scientific computing systems, how some of these challenges have been addressed, and what new challenges lay ahead. Many of these challenges can be attributed to the complexity created by hardware diversity, which has a direct impact on interconnect technology, but new challenges are also arising indirectly as reactions to other aspects of high-performance computing, such as alternative parallel programming models and more complex system usage models. We will describe some near-term research on proposed extensions to MPI to better support massive multithreading and implementation optimizations aimed at reducing the overhead of MPI tag matching. We will also describe a new portable programming model to offload simple packet processing functions to a network interface that is based on the current Portals data movement layer. We believe this capability will offer significant performance improvements to applications and services relevant to high-performance computing as well as data analytics."
Watch the video: https://wp.me/p3RLHQ-hhK
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses implementing the Vector, Signal and Image Processing Library (VSIPL) on FPGA-based reconfigurable computers. It examines the challenges of doing so given VSIPL's boundless vectors and extensive scope. It presents an implementation of a single-precision floating-point boundless convolver as a proof of concept. The convolver achieved a 10x speed increase over software. Three potential architectures for broader FPGA VSIPL implementations are outlined.
The document discusses optimizing and deploying PyTorch models for production use at scale. It covers techniques like quantization, distillation, and conversion to TorchScript to optimize models for low latency inference. It also discusses deploying optimized models using TorchServe, including packaging models with MAR files and writing custom handlers. Key lessons were that a distilled and quantized BERT model could meet latency SLAs of <40ms on CPU and <10ms on GPU, and support throughputs of 1500 requests per second.
Iaetsd multioperand redundant adders on FPGAs - Iaetsd Iaetsd
This paper presents efficient implementations of redundant multi-operand adders on FPGAs. Previous work avoided redundant adders on FPGAs due to the efficient carry propagate adders (CPAs) and area overhead of redundant adders. The paper proposes carry-save compressor tree approaches that achieve fast critical paths independent of bit width with little to no area overhead compared to CPA trees. It presents a classic carry-save compressor tree and a novel linear array structure that efficiently uses fast carry chains. Compared to binary and ternary CPA trees, the approaches achieve speedups of up to 3.81 times for 64-bit width additions.
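The building block of such a compressor tree is the 3:2 compressor (a full-adder array), which reduces three addends to a sum/carry pair with no carry propagation; only the final two operands need a carry-propagate add. A minimal C++ model of the logic, with illustrative operand values:

```cpp
#include <cstdint>
#include <cstdio>

struct CarrySave { std::uint64_t sum, carry; };

// 3:2 compression: XOR gives the per-bit sum, the majority function gives the
// per-bit carry, which is shifted left one position. No carries ripple.
CarrySave compress3to2(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    return { a ^ b ^ c,
             ((a & b) | (a & c) | (b & c)) << 1 };
}

int main() {
    std::uint64_t ops[4] = { 11, 22, 33, 44 };
    CarrySave cs = compress3to2(ops[0], ops[1], ops[2]);
    cs = compress3to2(cs.sum, cs.carry, ops[3]);          // fold in the fourth operand
    std::printf("%llu\n", (unsigned long long)(cs.sum + cs.carry));  // final CPA: prints 110
    return 0;
}
```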
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and Tensorflow for hyperparameter tuning
* Leveraging Spark and Tensorflow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and deep learning, and for Spark with and without GPUs for deep learning
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T... - Hideyuki Tanaka
This document summarizes research on optimizing an explicit finite-difference scheme for fluid dynamics simulations to achieve high performance on many-core systems like the PEZY-SC2 processor. The researchers developed a code generation framework that uses temporal blocking to optimize for low memory bandwidth. On a PEZY-SC2 system with 16 million cores, they achieved 4.78 PFlops and 21.5% efficiency, comparable to other works on higher bandwidth machines. Temporal blocking reduced the required memory bandwidth and allowed good weak scaling to larger core counts.
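As a generic illustration of temporal blocking (not the paper's code generator or its PEZY-SC2 kernels), the 1D overlapped-tiling sketch below carries each tile, widened by a halo of width T, through T time steps in fast memory before writing the interior back, trading a little recomputation for main-memory bandwidth:

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

// One step of a radius-1, 3-point averaging stencil on a local buffer;
// the endpoints are held fixed as boundary values.
static void step(const std::vector<double>& in, std::vector<double>& out) {
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    out.front() = in.front();
    out.back()  = in.back();
}

int main() {
    const int N = 1 << 16, T = 8, TILE = 1024;
    std::vector<double> u(N, 0.0), v(N);
    u[N / 2] = 1.0;
    for (int s = 0; s < N; s += TILE) {
        // Copy the tile plus a halo of width T (clipped at the domain boundary).
        int lo = std::max(0, s - T), hi = std::min(N, s + TILE + T);
        std::vector<double> a(u.begin() + lo, u.begin() + hi), b(a.size());
        for (int t = 0; t < T; ++t) { step(a, b); a.swap(b); }
        // Only the tile interior is still valid after T steps; write it back.
        int len = std::min(TILE, N - s);
        std::copy(a.begin() + (s - lo), a.begin() + (s - lo) + len, v.begin() + s);
    }
    std::printf("v[N/2] = %g\n", v[N / 2]);
    return 0;
}
```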
Do Theoretical FLOPs Matter For Real Application's Performance? KAUST 2012 - Joshua Mora
The document discusses how theoretical FLOPs per clock do not necessarily correlate with real application performance. It uses an AMD processor called "Fangio" that has its floating point capability capped to 2 FLOPs/clock compared to 4 FLOPs/clock normally. Despite having only half the theoretical FLOPs, Fangio delivers similar performance to the normal processor on many applications. This shows that FLOPs alone do not determine performance, and that code vectorization and algorithm design are also important factors.
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture - mohamedragabslideshare
This document summarizes research on revisiting co-processing techniques for hash joins on coupled CPU-GPU architectures. It discusses three co-processing mechanisms: off-loading, data dividing, and pipelined execution. Off-loading involves assigning entire operators like joins to either the CPU or GPU. Data dividing partitions data between the processors. Pipelined execution aims to schedule workloads adaptively between the CPU and GPU to maximize efficiency on the coupled architecture. The researchers evaluate these approaches for hash join algorithms, which first partition, build hash tables, and probe tables on the input relations.
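In their simplest single-threaded form, the build and probe phases that such schedulers divide between CPU and GPU look like the following C++ sketch (no partitioning pass; the row layout is a made-up placeholder):

```cpp
#include <unordered_map>
#include <vector>
#include <utility>
#include <cstdio>

struct Row { int key; int payload; };

std::vector<std::pair<int, int>> hash_join(const std::vector<Row>& build_side,
                                           const std::vector<Row>& probe_side) {
    // Build phase: hash every row of the (usually smaller) build relation.
    std::unordered_multimap<int, int> ht;
    ht.reserve(build_side.size());
    for (const Row& r : build_side) ht.emplace(r.key, r.payload);

    // Probe phase: look up each row of the other relation.
    std::vector<std::pair<int, int>> out;
    for (const Row& r : probe_side) {
        auto [lo, hi] = ht.equal_range(r.key);
        for (auto it = lo; it != hi; ++it)
            out.emplace_back(it->second, r.payload);
    }
    return out;
}

int main() {
    for (auto [b, p] : hash_join({{1, 10}, {2, 20}}, {{2, 200}, {3, 300}}))
        std::printf("%d %d\n", b, p);   // prints the single match: 20 200
    return 0;
}
```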
Deep Learning with Apache Spark and GPUs with Pierce Spitler - Databricks
Apache Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This session will cover:
– How to leverage Spark and TensorFlow for hyperparameter tuning and for deploying trained models
– DeepLearning4J, CaffeOnSpark, IBM’s SystemML and Intel’s BigDL
– Sidecar GPU cluster architecture and Spark-GPU data reading patterns
– The pros, cons and performance characteristics of various approaches
You’ll leave the session better informed about the available architectures for Spark and deep learning, and Spark with and without GPUs for deep learning. You’ll also learn about the pros and cons of deep learning software frameworks for various use cases, and discover a practical, applied methodology and technical examples for tackling big data deep learning.
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model - Koichi Shirahata
This document analyzes the performance of lattice quantum chromodynamics (QCD) simulations using the asynchronous partitioned global address space (APGAS) programming model on GPUs. It implements lattice QCD in X10 CUDA and compares performance to other implementations. Results show a 19.4x speedup from using X10 CUDA on 32 nodes of the TSUBAME 2.5 supercomputer compared to the original X10 implementation. Optimizations like data layout transformation and communication overlapping contributed to this acceleration.
FPGAs are a special form of programmable logic devices (PLDs) with higher densities as compared to custom ICs, capable of implementing functionality in a short period of time using computer-aided design (CAD) software. ...by [email protected]
FPGAs for Supercomputing: The Why and How - DESMOND YUEN
Excellent presentation by Hal Finkel ([email protected]), Kazutomo Yoshii, and Franck Cappello as to why FPGAs are a competitive HPC accelerator technology.
A short survey of the current state of field-programmable gate array (FPGA) usage in deep learning by companies such as Intel (Nervana) and of Google's TPU (tensor processing unit), versus GPU usage, in terms of energy consumption and performance.
Team 6 comprises 5 members: Sourabh Ketkale, Sahil Kaw, Siddhi Pai, Goutham Nekkalapu, and Prince Jacob Chandy. The document discusses several techniques for optimizing neural network performance on different hardware, including 8-bit quantization, the SSE3 and SSE4 instruction sets, batching, lazy evaluation, batched lazy evaluation, and implementing neural networks on the Xeon Phi processor using techniques such as data parallelism and task parallelism. It also discusses using FPGAs and distributed systems to achieve large-scale deep learning.
The document describes an IBM workshop on CAPI and OpenCAPI technologies. It provides an overview of FPGA acceleration using SNAP, including how SNAP simplifies FPGA programming using a C/C++ based approach. Examples of use cases for FPGA acceleration like video processing and machine learning inference are also presented.
Warp processing is a technique that dynamically optimizes software to improve performance and energy efficiency. It works by profiling an application to identify critical regions, then partitioning those regions to hardware using an FPGA. The binary is updated to execute the partitioned regions on the FPGA circuit while the rest continues in software. This allows applications to achieve speedups of 2-100x or more while using 20x less memory and reducing power consumption by 38-94%.
Fugaku, the Successes and the Lessons Learned - RCCSRENKEI
The document summarizes the successes and lessons learned from Fugaku, Japan's flagship supercomputer. Key points include:
- Fugaku achieved the top performance on all HPC benchmarks in 2020 and 2021, showing high performance across applications, not just traditional HPC workloads.
- While many applications achieved their target performance, some did not due to issues like insufficient parallelism, I/O scalability problems, and compiler vectorization failures.
- Lessons include the need for improved software stacks, application analysis, and adapting to modern applications beyond classic HPC.
- Looking ahead, sustained exascale performance will require data-centric architectures and corresponding system software and algorithms as transistor scaling slows.
FPGA implementation of truncated multiplier for array multiplication - Finalyear Projects
The document discusses designing a truncated multiplier for array multiplication on an FPGA. It proposes two improvements: 1) accumulating partial product bits in a carry-save format to reduce area and improve speed compared to other truncated array multipliers, and 2) a new pseudo-carry compensated truncation scheme with an adaptive compensation circuit and fixed bias to minimize truncation error for unsigned integer multiplication. The proposed truncated multiplier is expected to consume less power and area while improving truncation error efficiency compared to existing designs.
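A minimal C++ model of the truncation idea, using an assumed fixed compensation bias (the paper's pseudo-carry compensation circuit is adaptive and more refined than this):

```cpp
#include <cstdint>
#include <cstdio>

// Truncated 16x16 -> 16-bit multiply: partial-product bits that fall in the
// low half are never formed; a fixed bias stands in for their average carry.
uint16_t trunc_mul(uint16_t a, uint16_t b) {
    uint32_t high = 0;
    for (int i = 0; i < 16; ++i)
        if ((b >> i) & 1)
            high += (uint32_t(a) << i) >> 16;  // keep only bits landing in the high half
    const uint32_t bias = 1;                   // assumed compensation constant
    return uint16_t(high + bias);
}

int main() {
    uint16_t a = 51234, b = 46789;
    uint16_t exact = uint16_t((uint32_t(a) * uint32_t(b)) >> 16);  // true high half
    std::printf("truncated: %u, exact: %u\n", trunc_mul(a, b), exact);
    return 0;
}
```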
Exploring the Performance Impact of Virtualization on an HPC Cloud - Ryousei Takano
The document evaluates the performance impact of virtualization on high-performance computing (HPC) clouds. Experiments were conducted on the AIST Super Green Cloud, a 155-node HPC cluster. Benchmark results show that while PCI passthrough mitigates I/O overhead, virtualization still incurs performance penalties for MPI collectives as node counts increase. Application benchmarks demonstrate overhead is limited to around 5%. The study concludes HPC clouds are promising due to utilization improvements from virtualization, but further optimization of virtual machine placement and pass-through technologies could help reduce overhead.
This document summarizes an academic paper that proposes optimizing elliptic curve cryptography (ECC) through application-specific instruction set processor (ASIP) design. It applies pipelining techniques to the data path and uses complex instructions to reduce latency and the number of instructions needed for point multiplication. The paper describes applying different levels of pipelining to explore performance and find an optimal pipeline depth. It also develops a new combined algorithm to perform point doubling and addition using the specialized instructions. An FPGA implementation over GF(2163) is presented and shown to outperform previous work.
Haku, a toy functional language based on literary Japanese - Wim Vanderbauwhede
Haku is a natural-language functional programming language based on literary Japanese. This talk discusses the motivation behind Haku and explains the language by example. You don't need to know Japanese or to have read the Haku documentation.
https://codeberg.org/wimvanderbauwhede/haku
On the need for low-carbon and sustainable computing and the path towards zero-carbon computing.
See https://wimvanderbauwhede.github.io/articles/frugal-computing/ for the complete article with references.
* The problem:
The current emissions from computing are about 2% of the world total but are projected to rise steeply over the next two decades. By 2040 emissions from computing alone will be close to 80% of the emissions level acceptable to keep global warming below the safe limit of 1.5°C. This growth in computing emissions is unsustainable: it would make it virtually impossible to stay within the warming limit.
The emissions from production of computing devices far exceed the emissions from operating them, so even if devices are made more energy efficient, producing more of them will make the emissions problem worse. Therefore we must extend the useful life of our computing devices.
* The solution:
As a society we need to start treating computational resources as finite and precious, to be utilised only when necessary, and as effectively as possible. We need frugal computing: achieving the same results for less energy.
* The vision:
Imagine we can extend the useful life of our devices and even increase their capabilities without any increase in energy consumption.
Meanwhile, we will develop the technologies for the next generation of devices, designed for energy efficiency as well as long life.
Every subsequent cycle will last longer, until finally the world will have computing resources that last forever and hardly use any energy.
NOTE: there is a small mistake in the presentation, the safe limit for 2040 is 13 GtCO2e, not 23. This makes it even more important to embrace frugal computing.
As Slideshare does not allow re-uploads, please find the corrected slides at https://wimvanderbauwhede.github.io/presentation/Zero-Carbon-Computing.pdf
Many people working in academia find it difficult to achieve or maintain a good work-life balance. This talk goes into the reasons for this, the consequences of working too much, the benefits of having the right balance, and ways of achieving a better balance. The talk is very much based on my personal views and experiences, but I hope there is some interest in sharing these.
In this talk I introduce Perl 6 and some of its exciting new features, especially gradual typing, roles and some functional programming features like lazy lists.
This talk was given at the Scottish Programming Languages Seminar on 24th Feb 2016 at the School of Computing Science of Glasgow University.
Perl and Haskell: Can the Twain Ever Meet? (tl;dr: yes) - Wim Vanderbauwhede
This talk is about two Perl modules (Call::Haskell and Functional::Types) I developed to call Haskell functions as transparently as possible.
In general, the only way to guarantee the correctness of the types of the function arguments in Haskell is to ensure they are well-typed in Perl. So I ended up writing a Haskell-inspired type system for Perl. In this talk I will first discuss the approach I took to call Haskell from Perl, and then the reasons why a type system is needed, and the actual type system I developed. The type system is based on "prototypes", functions that create type descriptors, and a small API of functions to create type constructors and manipulate the types. The system is type checked at run time and supports sum types, product types, function types and polymorphism. The approach is not Perl-specific and suitable for other dynamic languages.
https://github.com/wimvanderbauwhede
These are the slides of the talk I gave at the Dyla'14 workshop (http://conferences.inf.ed.ac.uk/pldi2014/). It's about monads for languages like Perl, Ruby and LiveScript.
The source code is available at
https://github.com/wimvanderbauwhede/Perl-Parser-Combinators
https://github.com/wimvanderbauwhede/parser-combinators-ls
Don't be put off by the word monad or the maths. This is basically a very practical way for doing tasks such as parsing.
1) The author discusses how the late Scottish writer Iain Banks envisioned in his novels that advanced computing and availability of information could lead to a utopian society, as depicted in Banks' fictional culture.
2) The author notes that there are still many unsolved problems in their area of computer science research related to exploiting parallelism and heterogeneous systems.
3) The author wants their research to help address real-world problems like climate change, aging populations, energy security, and issues with the internet, such as by helping weather scientists improve severe weather predictions or linking rainfall to flooding simulations.
2025 Insilicogen Company Korean Brochure - Insilico Gen
Insilicogen is a company specializing in bioinformatics. Our company provides a platform to share and communicate various biological data analyses effectively.
2025 Insilicogen Company English Brochure - Insilico Gen
Insilicogen is a company specializing in bioinformatics. Our company provides a platform to share and communicate various biological data analyses effectively.
Infrastructure for Tracking Information Flow from Social Media to U.S. TV New... - Himarsha Jayanetti
This study examines the intersection between social media and mainstream television (TV) news with an aim to understand how social media content amplifies its impact through TV broadcasts. While many studies emphasize social media as a primary platform for information dissemination, they often underestimate its total influence by focusing solely on interactions within the platform. This research examines instances where social media posts gain prominence on TV broadcasts, reaching new audiences and prompting public discourse. By using TV news closed captions, on-screen text recognition, and social media logo detection, we analyze how social media is referenced in TV news.
Poultry require at least 38 dietary nutrients in appropriate concentrations for a balanced diet. A nutritional deficiency may be due to a nutrient being omitted from the diet, adverse interaction between nutrients in otherwise apparently well-fortified diets, or the overriding effect of specific anti-nutritional factors.
Major components of foods are – Protein, Fats, Carbohydrates, Minerals, Vitamins
Vitamins are: A - fat-soluble vitamins: A, D, E, and K; B - water-soluble vitamins: thiamin (B1), riboflavin (B2), nicotinic acid (niacin), pantothenic acid (B5), biotin, folic acid, pyridoxine, and choline.
Causes: Low levels of vitamin A in the feed, oxidation of vitamin A in the feed, errors in mixing, and intercurrent disease, e.g. coccidiosis or worm infestation.
Clinical signs: Lacrimation (ocular discharge). White cheesy exudates under the eyelids (conjunctivitis). Sticking of the eyelids and dryness of the eye (xerophthalmia). Keratoconjunctivitis.
Watery discharge from the nostrils. Sinusitis. Gasping and sneezing. Lack of yellow pigments.
Respiratory signs due to affection of the epithelium of the respiratory tract.
Lesions:
Pseudo-diphtheritic membranes in the digestive and respiratory systems (keratinized epithelia).
Nutritional roup: respiratory signs due to affection of the epithelium of the respiratory tract.
Pustule-like nodules in the upper digestive tract (buccal cavity, pharynx, esophagus).
Urate deposits may be found on other visceral organs.
Treatment:
Administer 3-5 times the recommended level of vitamin A (10,000 IU/kg of ration), through either water or feed.
On the Capability and Achievable Performance of FPGAs for HPC Applications
1. "On the Capability and Achievable
Performance of FPGAs for HPC
Applications"
Wim Vanderbauwhede
School of Computing Science, University of Glasgow, UK
2. Or in other words: "How Fast Can Those FPGA Thingies Really Go?"
3. Outline
Part 1: The Promise of FPGAs for HPC
  FPGAs
  FLOPS
  Performance Model
Part 2: How to Deliver this Promise
  Assumptions on Applications
  Computational Architecture
  Optimising the Performance
  A Matter of Programming
  Enter TyTra
Conclusions
6. FPGAs in a Nutshell
Field-Programmable Gate Array
Configurable logic: a matrix of look-up tables (LUTs) that can be configured into any N-input logic operation.
e.g. a 2-input LUT configured as XOR:
Address  Value
00       0
01       1
10       1
11       0
Combined with flip-flops to provide state.
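To make the LUT idea concrete, here is a toy sketch in Python: the "configuration" is just the stored truth table, and evaluating the LUT is a lookup addressed by the input bits.

```python
# Toy model of a 2-input LUT: the configuration is a 4-entry truth
# table addressed by the two input bits.
XOR_CONFIG = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def lut2(config, a, b):
    """Evaluate a 2-input LUT with the given configuration."""
    return config[(a, b)]

assert lut2(XOR_CONFIG, 0, 1) == 1
assert lut2(XOR_CONFIG, 1, 1) == 0
```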
7. FPGAs in a Nutshell
Communication fabric: island-style, i.e. a grid of wires with islands of LUTs; the wires have switch boxes that provide full connectivity.
There is also dedicated on-chip memory.
8. FPGAs in a Nutshell
Programming: by configuring the LUTs and their interconnection, one can create arbitrary circuits.
In practice, the circuit description is written in VHDL or Verilog and converted into a configuration file by the vendor tools.
Two major vendors: Xilinx and Altera.
Many "C-based" programming solutions have been proposed and are commercially available. They generate VHDL or Verilog.
Most recently, OpenCL is available for FPGA programming (specific Altera-based boards only).
9. The Promise
FPGAs have great potential for HPC:
Low power consumption
A massive amount of fine-grained parallelism (e.g. the Xilinx Virtex-6 has about 600,000 LUTs)
Huge (TB/s) internal memory bandwidth
Very high power efficiency (GFLOPS/W)
10. The Challenge
FPGA Computing Challenge:
The device clock speed is very low, many times lower than the memory clock.
There is no cache, so random memory access will kill the performance.
It requires a very different programming paradigm.
So, it's hard. But that shouldn't stop us.
11. Maximum Achievable Performance
The theoretical maximum computational performance is determined by:
Available memory bandwidth. Easy: read the datasheets!
Compute capacity. Hard: what is the relationship between logic gates and "FLOPS"?
12. FLOPS?
What is a FLOP, anyway? It is the ubiquitous measure of performance for HPC systems: floating-point operations per second.
Floating-point: single or double precision? And why that number format at all: floating-point or fixed-point?
Operations: which operations? Addition? Multiplication?
In fact, why floating point? It depends on the application.
13. FLOPS!
FLOPS on multicore CPUs and GPGPUs:
A fixed number of FPUs.
Historically, FP operations had a higher cost than integer operations; today, there is essentially no difference between integer and floating-point operations.
But scientific applications perform mostly FP operations. Hence, FLOPS as a measure of performance.
14. An Aside: the GPGPU Promise
Many papers report huge speed-ups: 20x/50x/100x/...
And the vendors promise the world.
However, theoretical FLOPS are comparable between same-complexity CPUs and GPGPUs:

                              #cores  vector size  clock (GHz)  GFLOPS
CPU: Intel Xeon E5-2640           24            8          2.5     480
GPU: Nvidia GeForce GTX 480       15           32          1.4     672
CPU: AMD Opteron 6176 SE          48            4          2.3     442
GPU: Nvidia Tesla C2070           14           32          1.1     493
FPGA: GiDEL PROCStar-IV            ?            ?          0.2      ??

The difference is no more than 1.5x.
15. The GPGPU Promise (Cont'd)
Memory bandwidth is usually higher for GPGPUs:

                              Memory BW (GB/s)
CPU: Intel Xeon E5-2640                   42.6
GPU: Nvidia GeForce GTX 480              177.4
CPU: AMD Opteron 6176 SE                  42.7
GPU: Nvidia Tesla C2070                  144
FPGA: GiDEL PROCStar-IV                   32

The difference is about 4.5x.
So where do the 20x/50x/100x figures come from? Unoptimised baselines!
16. FPGA Power Efficiency Model (1)
On FPGAs, different instructions (e.g. *, +, /) consume different amounts of resources (area and time), so FLOPS should be defined on a per-application basis.
We analyse the application code and compute the aggregated resource requirement from the count $n_{OP,i}$ and resource utilisation $r_{OP,i}$ of the required operations:
$r_{app} = \sum_{i=1}^{N_{instrs}} n_{OP,i} \, r_{OP,i}$
We take into account an area overhead $\alpha$ for control logic, I/O etc. Combined with the available resources on the board $r_{FPGA}$, the clock speed $f_{FPGA}$ and the power consumption $P_{FPGA}$, we can compute the power efficiency:
$\text{Power Efficiency} = (1 - \alpha) \, \frac{r_{FPGA}}{r_{app}} \, \frac{f_{FPGA}}{P_{FPGA}}$ GFLOPS/W
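As a sanity check of the model, here is a small Python sketch implementing the two formulas above. The per-operation slice counts are the illustrative values from the worked example later in the talk, and the board parameters (600,000 slices, 175 MHz, 30 W) are the conservative assumptions used in the slides, not measured figures; with this toy operation mix the absolute result is only indicative, whereas the slide's 30 GFLOPS/W figure is for the full FLEXPART kernel's operation mix.

```python
# Minimal sketch of the per-application power-efficiency model.
# Operation costs r_OP,i (slices) are the illustrative values used in
# the worked example later in this deck, not measured data.
OP_SLICES = {"add": 16, "mul": 400, "div": 2000}

def r_app(op_counts):
    """Aggregate resource requirement: sum over i of n_OP,i * r_OP,i."""
    return sum(n * OP_SLICES[op] for op, n in op_counts.items())

def power_efficiency(op_counts, r_fpga, f_fpga_hz, p_fpga_w, alpha=0.5):
    """(1 - alpha) * (r_FPGA / r_app) * f_FPGA / P_FPGA, in FLOPS/W."""
    return (1 - alpha) * (r_fpga / r_app(op_counts)) * f_fpga_hz / p_fpga_w

ops = {"add": 4, "mul": 2, "div": 1}
print(r_app(ops))                    # 2864 slices
eff = power_efficiency(ops, r_fpga=600_000, f_fpga_hz=175e6, p_fpga_w=30)
print(eff / 1e9, "GFLOPS/W")
```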
17. FPGA Power Efficiency Model (2)
Example: the convection kernel from the FLEXPART Lagrangian particle dispersion simulator; about 600 lines of Fortran 77. This would be a typical kernel for e.g. OpenCL or CUDA on a GPU.
Assuming a GiDEL PROCStar-IV board, $P_{FPGA}$ = 30 W.
Assume $\alpha$ = 0.5 (50% overhead, conservative) and a clock speed $f_{FPGA}$ = 175 MHz (again, conservative).
Resulting power efficiency: 30 GFLOPS/W. By comparison, a Tesla C2075 GPU delivers 4.5 GFLOPS/W.
If we only did multiplications and similar operations, it would be 15 GFLOPS/W; if we only did additions and similar operations, it would be 225 GFLOPS/W.
Depending on the application, the power efficiency can be up to 50x better on the FPGA!
21. Assumptions on Applications
Suitable for streaming computation, with data parallelism: if it works well in OpenCL or CUDA, it will work well on an FPGA.
Single-precision floating-point, integer or bit-level operations; doubles take too much space.
This is a suitable model for many scientific applications (esp. NWP), but also for data search, filtering and classification. So it is good for both HPC and data centres.
22. Computational Architecture
Essentially, a network of processors, but with "processors" defined very loosely: very different from e.g. an Intel CPU.
A streaming processor: minimal control flow, single-instruction, coarse-grained instructions.
The main challenge is the parallelisation: optimise memory throughput and optimise computational performance.
23. Example
A – somewhat contrived – example to illustrate our optimisation approach:
We assume we have an application that performs 4 additions, 2 multiplications and a division.
We assume that the relative areas of the operations are 16, 400 and 2000 slices respectively.
We assume that a multiplication requires 2 clock cycles and a division requires 8 clock cycles.
The processor area would be 4*16 + 2*400 + 1*2000 = 2864 slices.
The compute time: 4*1 + 2*2 + 1*8 = 16 cycles.
24. Lanes
The memory clock is several times higher than the FPGA clock: $f_{MEM} = n \cdot f_{FPGA}$.
Matching the memory bandwidth therefore requires at least n parallel lanes.
For the GiDEL board, n = 4, so the area requirement is about 11,500 slices (4 * 2864).
But the throughput is still only 1/16th of the memory bandwidth.
25. Threads
Typically, each lane needs to perform many operations on each item of data read from memory (16 cycles' worth in the example). So we need to parallelise the computational units per lane as well.
A common approach is to use data-parallel threads to achieve processing at memory rate.
In our example, this requires 16 threads per lane, so about 183,000 slices.
26. Pipelining
However, this approach is wasteful. Instead, create a pipeline of the operations: each stage in the pipeline only needs the logic for the operation that it executes.
In the example, this requires 4*16 + 2*400 + 1*2000 = 2864 slices per pipeline, at 8 cycles per datum (limited by the division).
It then requires only 8 parallel pipelines per lane to achieve memory bandwidth, so about 91,600 slices.
27. Balancing the Pipeline
This is still not optimal. As we assume a streaming mode, we can replicate pipeline stages to balance the pipeline; in this way, the pipeline achieves optimal throughput.
In the example, this requires 4*16 + 2*(2*400) + 1*(8*2000) = 17,664 slices to process at 1 cycle per datum. So the total resource utilisation is 17,664 * 4 = 70,656 slices.
To evaluate various trade-offs (e.g. lower clock speeds / smaller area / more cycles), we use the notion of "Effective Slice Count" (ESC) to express the number of slices required by an operation in order to achieve a balanced pipeline.
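The arithmetic of slides 23 to 27 can be condensed into a few lines. The sketch below uses the same assumed costs (areas 16/400/2000 slices, latencies 1/2/8 cycles, 4 lanes) and reproduces the slice totals quoted above.

```python
# Comparing the parallelisation strategies of the worked example.
# All costs are the assumed example values, not real FPGA figures.
ADD, MUL, DIV = (16, 1), (400, 2), (2000, 8)   # (slices, cycles)
OPS = [ADD] * 4 + [MUL] * 2 + [DIV]
LANES = 4                                      # f_MEM = 4 * f_FPGA

area = sum(s for s, _ in OPS)                  # 2864 slices per processor
cycles = sum(c for _, c in OPS)                # 16 cycles per datum

# Data-parallel threads: 16 processors per lane to hit memory rate.
threads = LANES * cycles * area                # ~183k slices

# Simple pipeline: one stage per op; throughput is limited by the
# 8-cycle divider, so 8 pipeline copies per lane reach memory rate.
slowest = max(c for _, c in OPS)
pipelines = LANES * slowest * area             # ~91.6k slices

# Balanced pipeline: replicate each stage by its latency, giving one
# datum per cycle from a single pipeline per lane.
balanced = LANES * sum(s * c for s, c in OPS)  # 70,656 slices

print(threads, pipelines, balanced)
```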
28. Coarse-Grained Operations
We can still do better though: by grouping fine-grained operations into coarser-grained ones, we reduce the overhead of the pipeline.
This is effective as long as the clock speed does not degrade.
Again, the ESC is used to evaluate the optimal grouping.
29. Preliminary Result
We applied our approach manually to a small part of the convection kernel.
The balanced pipeline results in 10 GFLOPS/W, without any optimisation in terms of number representation.
This is already better than a Tesla C2075 GPU.
30. Application Size
The approach we outlined leads to optimal performance if the circuit fits on the FPGA. What if the circuit is too large for the FPGA (and you can't buy a larger one)?
The only solution is to trade space for time, i.e. reduce throughput.
Our approach is to group operations into processors. Each processor instantiates the instructions required to perform all of its operations.
Because some instructions are executed frequently, there is an optimum ratio of operations to area. As the search space is small, we perform an exhaustive search for the optimal solution.
The throughput drops with the number of operations per processor; based on the theoretical model, for our example case it can still be worthwhile to use the FPGA with 4 to 8 operations per processor.
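A hypothetical illustration of that space/time trade-off: the sketch below serialises k operations per processor, so throughput scales roughly as 1/k while each processor pays for each distinct instruction only once, and it exhaustively picks the fastest grouping that fits a given slice budget. The cost model, the budget, and the contiguous grouping are simplifications for illustration, not the actual search described in the talk.

```python
# Hypothetical sketch of trading space for time by grouping operations
# into processors. The cost model is illustrative only.
OPS = [("add", 16)] * 4 + [("mul", 400)] * 2 + [("div", 2000)]
FPGA_SLICES = 2500   # assumed budget, too small for the full circuit

def area_of(group):
    # A processor instantiates each distinct instruction once.
    return sum({name: slices for name, slices in group}.values())

def design(k):
    """Split ops into processors of k ops each; return (area, throughput)."""
    groups = [OPS[i:i + k] for i in range(0, len(OPS), k)]
    return sum(area_of(g) for g in groups), 1.0 / k

# Exhaustive search: the fastest design that fits on the device.
feasible = [(1.0 / k, k) for k in range(1, len(OPS) + 1)
            if design(k)[0] <= FPGA_SLICES]
throughput, best_k = max(feasible)
print(best_k, throughput, design(best_k))
```

With these numbers, the fully parallel design (k = 1, 2864 slices) does not fit, so the search settles on two operations per processor at half the throughput.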
34. A Matter of Programming
In practice, scientists don't write "streaming multiple-lane balanced-pipeline" code. They write ordinary loop-nest code (the slide points to a typical Fortran listing).
And current high-level programming tools still require a lot of programmer know-how to get good performance, because essentially the only way is to follow a course like the one outlined in this talk.
So we need better programming tools, and specifically better compilers.
35. Enter the TyTra Project
A project between the universities of Glasgow, Heriot-Watt and Imperial College, funded by EPSRC.
The aim: to compile scientific code efficiently for heterogeneous platforms, including multicore/manycore CPUs, GPGPUs and FPGAs.
The approach: TYpe TRAnsformations. Infer the type of all communication in a program; transform the types using a formal, provably correct mechanism; use a cost model to identify the suitable transformations.
A five-year project, started in January 2014.
36. But Meanwhile
A practical recipe, given a legacy Fortran application and a high-level FPGA programming solution (e.g. Maxeler, Impulse-C, Vivado or Altera OpenCL):
Rewrite your code in a data-parallel fashion, e.g. in OpenCL. There are tools to help you: automated refactoring, Fortran-to-C translation. This will produce code suitable for streaming.
Now rewrite this code to be similar to the pipeline model described.
Finally, rewrite the code obtained in this way for Maxeler, Impulse-C, etc.; this is mainly a matter of syntax.
37. Conclusion
FPGAs are very promising for HPC.
We presented a model to estimate the maximum achievable performance on a per-application basis. Our conclusion is that the power efficiency can be up to 10x better compared to a GPU or multicore CPU.
We presented a methodology to achieve the best possible performance.
Better tools are needed, but already with today's tools very good performance is achievable.