Learn about opportunities and challenges for accelerating big data middleware on modern high-performance computing (HPC) clusters by exploiting HPC technologies.
Best Practices and Performance Studies for High-Performance Computing Clusters - Intel® Software
The document discusses best practices and a performance study for HPC clusters. It covers system configuration and tuning, building applications, Intel Xeon processors, efficient execution methods, tools for boosting performance, and application performance highlights using the HPL and HPCG benchmarks. It also includes agenda items, market-share data, typical BIOS settings, compiler flags, MPI usage, and performance results from single-node and cluster runs of the benchmarks.
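As a worked illustration of the kind of numbers such a study reports (the figures here are hypothetical, not taken from the deck): a two-socket node with 28-core processors at 2.5 GHz and 32 double-precision FLOPs per core per cycle has a theoretical peak of Rpeak = 2 × 28 × 2.5 GHz × 32 ≈ 4.48 TFLOP/s, so an HPL result of Rmax = 3.6 TFLOP/s corresponds to roughly 80% efficiency; HPCG, being memory-bandwidth bound, typically reaches only a few percent of peak on the same hardware.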
SDVIs and In-Situ Visualization on TACC's Stampede - Intel® Software
Speaker: Paul Navrátil, Texas Advanced Computing Center (TACC)
The design emphasis for supercomputing systems has moved from raw performance to performance-per-watt, and as a result, supercomputing architectures are converging on processors with wide vector units and many processing cores per chip. Such processors are capable of performant image rendering purely in software. This improved capability is fortuitous, since the prevailing homogeneous system designs lack dedicated, hardware-accelerated rendering subsystems for use in data visualization. Reliance on this “software-defined” rendering capability will grow in importance since, due to growing data sizes, visualizations must be performed on the same machine where the data is produced. Further, as data sizes outgrow disk I/O capacity, visualization will be increasingly incorporated into the simulation code itself (in situ visualization).
This talk presents recent work in high-fidelity visualization using the OSPRay ray tracing framework on TACC's local and remote visualization systems. We present work using OSPRay within the ParaView Catalyst in situ framework from Kitware, including opportunities to reduce the cost of data migrating through VTK filters for visualization. We highlight the performance opportunities and advantages of Intel® Advanced Vector Extensions 512, the memory-system improvements possible with Intel® Xeon Phi™ processor multi-channel DRAM (MCDRAM), and the Intel® Omni-Path Architecture interconnect.
"OpenHPC is a collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools, and a variety of scientific libraries. Packages provided by OpenHPC have been pre-built with HPC integration in mind with a goal to provide re-usable building blocks for the HPC community. Over time, the community also plans to identify and develop abstraction interfaces between key components to further enhance modularity and interchangeability. The community includes representation from a variety of sources including software vendors, equipment manufacturers, research institutions, supercomputing sites, and others."
Watch the video: http://wp.me/p3RLHQ-gKz
Learn more: http://openhpc.community/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Trends in Systems and How to Get Efficient Performance - inside-BigData.com
In this video from Switzerland HPC Conference, Martin Hilgeman from Dell presents: HPC Workload Efficiency and the Challenges for System Builders.
"With all the advances in massively parallel and multi-core computing with CPUs and accelerators it is often overlooked whether the computational work is being done in an efficient manner. This efficiency is largely being determined at the application level and therefore puts the responsibility of sustaining a certain performance trajectory into the hands of the user. It is observed that the adoption rate of new hardware capabilities is decreasing and lead to a feeling of diminishing returns. This presentation shows the well-known laws of parallel performance from the perspective of a system builder. It also covers through the use of real case studies, examples of how to program for energy efficient parallel application performance."
Watch the video: http://wp.me/p3RLHQ-gIS
Learn more: http://dell.com
and
http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Jean-Pierre Panziera from Atos presents: BXI - Bull eXascale Interconnect.
"Exascale entails an explosion of performance, of the number of nodes/cores, of data volume and data movement. At such a scale, optimizing the network that is the backbone of the system becomes a major contributor to global performance. The interconnect is going to be a key enabling technology for exascale systems. This is why one of the cornerstones of Bull’s exascale program is the development of our own new-generation interconnect. The Bull eXascale Interconnect or BXI introduces a paradigm shift in terms of performance, scalability, efficiency, reliability and quality of service for extreme workloads."
Watch the video: http://wp.me/p3RLHQ-gJa
Learn more: https://bull.com/bull-exascale-interconnect/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Using a Field Programmable Gate Array to Accelerate Application Performance - Odinot Stanislas
Intel is looking closely at FPGAs and the potential they bring when ISVs and developers have very specific needs in genomics, image processing, database processing, and even the cloud. In this document you will have the opportunity to learn more about our strategy and about a research program launched by Intel and Altera involving Xeon E5 processors equipped with... FPGAs.
Author(s):
P. K. Gupta, Director of Cloud Platform Technology, Intel Corporation
"Algorithmic processing performed in High Performance Computing environments impacts the lives of billions of people, and planning for exascale computing presents significant power challenges to the industry. ARM delivers the enabling technology behind HPC. The 64-bit design of the ARMv8-A architecture combined with Advanced SIMD vectorization are ideal to enable large scientific computing calculations to be executed efficiently on ARM HPC machines. In addition ARM and its partners are working to ensure that all the software tools and libraries, needed by both users and systems administrators, are provided in readily available, optimized packages."
Learn more: https://developer.arm.com/hpc
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
OpenPOWER Acceleration of HPCC Systems - HPCC Systems
JT Kellington, IBM and Allan Cantle, Nallatech present at the 2015 HPCC Systems Engineering Summit Community Day about porting HPCC Systems to the POWER8-based ppc64el architecture.
In this deck, Ronald P. Luijten from IBM Research in Zurich presents: DOME 64-bit μDataCenter.
"I like to call it a datacenter in a shoebox. With the combination of power and energy efficiency, we believe the microserver will be of interest beyond the DOME project, particularly for cloud data centers and Big Data analytics applications."
The microserver team has designed and demonstrated a prototype 64-bit microserver using a PowerPC-based chip from Freescale Semiconductor running Fedora Linux and IBM DB2. At 133 × 55 mm², the microserver contains all of the essential functions of today's servers, which are 4 to 10 times larger. Not only is the microserver compact, it is also very energy-efficient.
Watch the video: http://wp.me/p3RLHQ-gJM
Learn more: https://www.zurich.ibm.com/microserver/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, which implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime; manual instrumentation, which can be combined with the automatic approach, allows the code developer to annotate regions of particular interest."
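To make the region-annotation idea concrete, here is a hedged, self-contained C sketch; the region_start/region_stop stubs are hypothetical placeholders for illustration only and are not MERIC's actual API.

    #include <stdio.h>

    /* Hypothetical placeholders for region annotation -- real tools provide
     * their own API; these stubs only mark region entry/exit. */
    static void region_start(const char *name) { printf("enter %s: apply stored frequency setting\n", name); }
    static void region_stop (const char *name) { printf("leave %s\n", name); }

    /* Each annotated region corresponds to a runtime situation (RTS); at run
     * time the tuning model maps it to a CPU/uncore frequency configuration. */
    static void timestep(double *field, int n)
    {
        region_start("halo_exchange");    /* communication-bound: lower core frequency */
        /* ... an MPI halo exchange would go here ... */
        region_stop("halo_exchange");

        region_start("stencil_update");   /* compute-bound: higher core frequency */
        for (int i = 1; i < n - 1; i++)
            field[i] = 0.5 * (field[i - 1] + field[i + 1]);
        region_stop("stencil_update");
    }

    int main(void)
    {
        double field[8] = {0};
        field[0] = 1.0;
        timestep(field, 8);
        return 0;
    }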
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the HPC User Forum in Santa Fe, Sibendu Som from Argonne presents: HPC Accelerating Combustion Engine Design.
Sibendu Som is a mechanical engineer and principal investigator for developing predictive spray and combustion modeling capabilities for compression ignition engines. With the aid of high-performance computing, Sibendu focuses on developing robust models which can improve the performance and emission characteristics of a variety of bio-derived fuels. Predictive simulation capability can provide significant insights on how to improve the efficiency and emissions for different bio-derived fuels of interest.
Watch the video: http://wp.me/p3RLHQ-gHO
Learn more: http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
In the design of electronics and semiconductors, challenges are compounded by the integration of AI, multi-core, real-time software, network, connectivity, diagnostics, and security. Performance limits, battery life, and cost are adoption barriers. It is extremely important to have tools and processes that deliver efficiency throughout the design cycle.
Continuous verification from planning through development addresses the multi-discipline needs of hardware, software, and networks. This unique approach accelerates the design phase, defines the test effort, and finds defects during specification. Architecture modeling is required to meet timing deadlines, minimize power consumption, attain the highest quality of service, and optimize both the electronic design system and the design of custom components.
Luigi Brochard from Lenovo gave this talk at the Switzerland HPC Conference. "High performance computing is converging more and more with the big data topic and related infrastructure requirements in the field. Lenovo is investing in developing systems designed to resolve today's and future problems in a more efficient way and to respond to the demands of the industrial and research application landscape."
Watch the video: http://wp.me/p3RLHQ-gDC
Learn more: http://www3.lenovo.com/us/en/data-center/solutions/hpc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
OpenCAPI is an open standard interface that provides high bandwidth and low latency connections between processors, accelerators, memory and storage. It addresses the growing need for increased performance driven by workloads like AI and the limitations of Moore's Law. OpenCAPI supports a heterogeneous system architecture with technologies like FPGAs and different memory types. It uses a thin protocol stack and virtual addressing to minimize latency. The SNAP framework also makes programming accelerators using OpenCAPI easier by abstracting the hardware details.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses OpenCAPI acceleration using the OpenCAPI Acceleration Framework (oc-accel). It provides an overview of the oc-accel components and workflow, benchmarks the OC-Accel bandwidth and latency, and provides examples of how to fully utilize OC-Accel capabilities to accelerate functions on an FPGA. The document also outlines the OC-Accel development process and previews upcoming features like support for ODMA to port existing PCIe accelerators to OpenCAPI.
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021 - Deepak Shankar
The document discusses comparing the performance and power of ARM Cortex and RISC-V processors for AI applications. It outlines a methodology for modeling systems from the microarchitecture to SoC level using different instruction sets. Examples are provided to demonstrate how the methodology can be used to improve the accuracy of comparisons between architectures.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique has so far been limited by the performance of scientific instruments, computing performance has recently become a key limitation. In my presentation I will present the computing challenge of handling an 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experiences in applying conventional hardware to the task and why this attempt failed. I will then present how an IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advances in hardware development will enable better science for users of the Swiss Light Source.
SCFE 2020 OpenCAPI presentation as part of OpenPOWER Tutorial - Ganesan Narayanasamy
This document introduces hardware acceleration using FPGAs with OpenCAPI. It discusses how classic FPGA acceleration has issues like slow CPU-managed memory access and lack of data coherency. OpenCAPI allows FPGAs to directly access host memory, providing faster memory access and data coherency. It also introduces the OC-Accel framework that allows programming FPGAs using C/C++ instead of HDL languages, addressing issues like long development times. Example applications demonstrated significant performance improvements using this approach over CPU-only or classic FPGA acceleration methods.
ILP32 is a programming model that may be useful on AArch64 systems for performance and also for legacy code with 32-bit data size assumptions. We combined ILP32 support from upstream projects with the LEAP distribution to enable experimentation with this model. This talk discusses the relative benchmark performance of the LP64 and ILP32 programming models under AArch64.
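A small sketch of what actually changes between the two models: under LP64 both long and pointers are 64-bit, while under AArch64 ILP32 they shrink to 32 bits, reducing the footprint of pointer-heavy data structures. The program below simply reports the sizes for whichever model it was compiled for (an ILP32-capable toolchain is assumed for the 32-bit case).

    #include <stdio.h>

    int main(void)
    {
        /* LP64:  int=4, long=8, void*=8   (the usual AArch64 model)
         * ILP32: int=4, long=4, void*=4   (smaller pointers, smaller cache footprint) */
        printf("int    : %zu bytes\n", sizeof(int));
        printf("long   : %zu bytes\n", sizeof(long));
        printf("void * : %zu bytes\n", sizeof(void *));
        return 0;
    }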
Luigi Brochard from Lenovo presented this deck at the Switzerland HPC Conference.
"Lenovo has developed an open source HPC software stack for system management with GUI support. This enables customers to more efficiently manage their clusters by making it simple and easy for both the system administrator and end users.This talk will present this initiative, show a demo and present future evolutions."
Watch the video presentation:
https://www.youtube.com/watch?v=xqwLul_hA28
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
DPDK greatly improves packet processing performance and throughput by allowing applications to directly access hardware and bypass kernel involvement. It can improve performance by up to 10 times, allowing over 80 Mpps of throughput on a single CPU, or double that with two CPUs. This enables telecom and networking equipment manufacturers to develop products faster and with lower costs. DPDK achieves these gains through techniques like dedicated core affinity, userspace drivers, polling instead of interrupts, and lockless synchronization.
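A hedged sketch of the receive path described above: an endless user-space polling loop pulling bursts of packets straight off the NIC. Device, queue, and memory-pool setup (rte_eth_dev_configure() and friends) is omitted, and port 0 is assumed to have been configured already.

    #include <stdlib.h>
    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        /* Bind hugepages, PCI devices and worker cores to this process. */
        if (rte_eal_init(argc, argv) < 0)
            return EXIT_FAILURE;

        const uint16_t port_id = 0;          /* assumes port 0 is already configured */
        struct rte_mbuf *bufs[BURST_SIZE];

        /* Poll the NIC from user space: no interrupts, no kernel copies. */
        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... parse or forward the packet here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }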
This document discusses OpenCAPI, an open standard for high-performance input/output between processors and accelerators. It provides background on the industry drivers for developing such a standard, an overview of OpenCAPI technology and capabilities, examples of OpenCAPI-based systems from IBM and partners, and performance metrics. The document aims to promote OpenCAPI and growing an open ecosystem around it to support accelerated computing workloads.
The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video: https://youtu.be/iLTYkTandEA
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Hadoop, Spark, and Memcached with HPC Technologies - inside-BigData.com
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiBD Project - inside-BigData.com
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiBD Project.
"This talk will provide an overview of challenges in designing convergent HPC and BigData software stacks on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit HPC scheduler (SLURM), parallel file systems (Lustre) and NVM-based in-memory technology will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,950 organizations worldwide (in 85 countries). More than 518,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 14th, 17th, and 27th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 300 organizations in 35 countries. More than 28,900 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow framework are available from https://hidl.cse.ohio-state.edu.
Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Watch the video: https://youtu.be/1QEq0EUErKM
Learn more: http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 - MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems - inside-BigData.com
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges in designing runtime environments for MPI+X (PGAS-OpenSHMEM/UPC/CAF/UPC++, OpenMP and Cuda) programming models by taking into account support for multi-core systems (KNL and OpenPower), high networks, GPGPUs (including GPUDirect RDMA) and energy awareness. Features and sample performance numbers from MVAPICH2 libraries will be presented. For the Deep Learning domain, we will focus on popular Deep Learning framewords (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://youtu.be/i2I6XqOAh_I
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale Systems - inside-BigData.com
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: How to Design Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library and RDMA-Enabled Big Data stacks. Finally, we will outline the challenges in moving middleware to the Cloud environments."
Watch the video: https://youtu.be/hR8cnFVF8Zg
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
A Library for Emerging High-Performance Computing Clusters - Intel® Software
This document discusses the challenges of developing communication libraries for exascale systems using hybrid MPI+X programming models. It describes how current MPI+PGAS approaches use separate runtimes, which can lead to issues like deadlock. The document advocates for a unified runtime that can support multiple programming models simultaneously to avoid such issues and enable better performance. It also outlines MVAPICH2's work on designs like multi-endpoint that integrate MPI and OpenMP to efficiently support emerging highly threaded systems.
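A minimal sketch of the MPI+X hybrid model discussed here, with X = OpenMP: one MPI process per node and many threads inside it, which is exactly where a thread-aware communication runtime matters. Compile with something like mpicc -fopenmp; the output format is just for illustration.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Ask for an MPI library that tolerates concurrent calls from threads. */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Threads handle on-node parallelism; MPI handles inter-node communication. */
        #pragma omp parallel
        {
            #pragma omp critical
            printf("rank %d/%d, thread %d/%d (thread support level %d)\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads(), provided);
        }

        MPI_Finalize();
        return 0;
    }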
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science - Object Automation
This document discusses designing high-performance middleware for HPC, AI, and data science applications. It provides an overview of the MVAPICH2 project, which develops an open-source MPI library supporting modern HPC architectures and networking technologies. MVAPICH2 aims to provide a converged software stack for HPC, deep learning, and data science through libraries like MVAPICH2, HiDL, and HiBD. The document outlines challenges in communication library design for exascale systems and MVAPICH2's architecture supporting programming models across domains.
The document discusses OpenPOWER, an open ecosystem using the POWER architecture to share expertise, investment, and intellectual property. It outlines the goals of the OpenPOWER Foundation to serve evolving customer needs through collaborative innovation and solutions. Examples are provided of innovations developed through partnerships, such as accelerated databases, optimized flash storage, and high performance computing systems. The benefits of the OpenPOWER approach for customers are affirmed through adoption of Linux distributions and cloud deployments.
Designing HPC & Deep Learning Middleware for Exascale Systems - inside-BigData.com
DK Panda from Ohio State University presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"This talk will focus on challenges in designing runtime environments for exascale systems with millions of processors and accelerators to support various programming models. We will focus on MPI, PGAS (OpenSHMEM, CAF, UPC and UPC++) and Hybrid MPI+PGAS programming models by taking into account support for multi-core, high-performance networks, accelerators (GPGPUs and Intel MIC), virtualization technologies (KVM, Docker, and Singularity), and energy-awareness. Features and sample performance numbers from the MVAPICH2 libraries will be presented."
Watch the video: http://wp.me/p3RLHQ-glW
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing High Performance & Scalable Middleware for HPC - Object Automation
The document discusses the design of high-performance middleware for HPC, AI, and data science applications. It describes challenges including supporting various programming models, applications, network technologies, and architectures. The MVAPICH project is presented as an open-source MPI library that supports these domains and has been downloaded over 1.5 million times. It provides optimized communication through features like GPU-direct support and improved nested datatype transfers.
This document discusses how Mellanox technologies can accelerate big data solutions using RDMA. It summarizes that Mellanox provides end-to-end interconnect solutions including adapters, switches, and cables. It also discusses three key areas for acceleration: data analytics, storage, and distributed storage. The document presents the Unstructured Data Accelerator plugin which can double MapReduce performance using RDMA for efficient data shuffling. It also discusses using RDMA and SSDs to unlock higher throughput in HDFS and overcome bandwidth limitations of 1GbE and 10GbE networks.
Accelerating TensorFlow with RDMA for high-performance deep learning - DataWorks Summit
Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.
In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then we propose a unified way of achieving high performance through enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. Through our proposed RDMA-gRPC design, TensorFlow only needs to run over the gRPC channel and gets the optimal performance. Our design includes advanced features such as message pipelining, message coalescing, zero-copy transmission, etc. The performance evaluations show that our proposed design can significantly speed up gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 35% performance improvement for TensorFlow training with CNN models.
Speakers
Dhabaleswar K (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, The Ohio State University
HPE plans to deliver a DAOS storage solution targeting version 2.0 to enable initial customer deployments. The reference implementation will include HPE servers, Intel Ice Lake CPUs with DCPMM and NVMe SSDs, Mellanox or HPE Slingshot switches, and customized Cray management software. HPE will host potential customers in their CTO lab to run proofs of concept and collect feedback to inform full productization planned with Sapphire Rapids.
High-Performance and Scalable Designs of Programming Models for Exascale Systems - inside-BigData.com
- The document discusses programming models and challenges for exascale systems. It focuses on MPI and PGAS models like OpenSHMEM.
- Key challenges include supporting hybrid MPI+PGAS programming, efficient communication for multi-core and accelerator nodes, fault tolerance, and extreme low memory usage.
- The MVAPICH2 project aims to address these challenges through its high performance MPI and PGAS implementation and optimization of communication for technologies like InfiniBand.
UCX: An Open Source Framework for HPC Network APIs and Beyond - Ed Dodds
UCX is an open source framework for high performance computing (HPC) network APIs and beyond. It is a collaborative effort between industry, national laboratories, and academia to develop the next generation HPC communication framework. UCX aims to provide a unified communication API that supports multiple network architectures and HPC programming models through a performance-oriented and community-driven approach.
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark, and Memcached on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit NVM-based in-memory technology and parallel file systems (such as Lustre) will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project will be shown. Benefits of these stacks to accelerate deep learning frameworks (such as CaffeOnSpark and TensorFlowOnSpark) will be presented."
Watch the video: https://wp.me/p3RLHQ-iko
Learn more: http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systemsinside-BigData.com
In this deck from the 2018 Swiss HPC Conference, DK Panda from Ohio State University presents: Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems.
"This talk will focus on challenges in designing HPC, Deep Learning, and HPC Cloud middleware for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss about the challenges in designing runtime environments for MPI+X (PGAS - OpenSHMEM/UPC/CAF/UPC++, OpenMP, and CUDA) programming models by taking into account support for multi-core systems (KNL and OpenPower), high-performance networks, GPGPUs (including GPUDirect RDMA), and energy-awareness. Features, sample performance numbers and best practices of using MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu)will be presented.
For the Deep Learning domain, we will focus on popular Deep Learning frameworks (Caffe, CNTK, and TensorFlow) to extract performance and scalability with MVAPICH2-GDR MPI library. Finally, we will outline the challenges in moving these middleware to the Cloud environments."
Watch the video: https://wp.me/p3RLHQ-iyc
Learn more: http://www.cse.ohio-state.edu/~panda
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...OpenStack
Audience Level
Intermediate
Synopsis
High performance computing and cloud computing have traditionally been seen as separate solutions to separate problems, dealing with issues of performance and flexibility respectively. In a diverse research environment however, both sets of compute requirements can occur. In addition to the administrative benefits in combining both requirements into a single unified system, opportunities are provided for incremental expansion.
The deployment of the Spartan cloud-HPC hybrid system at the University of Melbourne last year is an example of such a design. Despite its small size, it has attracted international attention due to its design features. This presentation, in addition to providing a grounding on why one would wish to build an HPC-cloud hybrid system and the results of the deployment, provides a complete technical overview of the design from the ground up, as well as problems encountered and planned future developments.
Speaker Bio
Lev Lafayette is the HPC and Training Officer at the University of Melbourne. Prior to that he worked at the Victorian Partnership for Advanced Computing for several years in a similar role.
AI for All: Biology is eating the world & AI is eating Biology Intel® Software
Advances in cell biology and creation of an immense amount of data are converging with advances in Machine learning to analyze this data. Biology is experiencing its AI moment and driving the massive computation involved in understanding biological mechanisms and driving interventions. Learn about how cutting edge technologies such as Software Guard Extensions (SGX) in the latest Intel Xeon Processors and Open Federated Learning (OpenFL), an open framework for federated learning developed by Intel, are helping advance AI in gene therapy, drug design, disease identification and more.
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
Python is the number 1 language for data scientists, and Anaconda is the most popular python platform. Intel and Anaconda have partnered to bring scalability and near-native performance to Python with simple installations. Learn how data scientists can now access oneAPI-optimized Python packages such as NumPy, Scikit-Learn, Modin, Pandas, and XGBoost directly from the Anaconda repository through simple installation and minimal code changes.
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
Preprocess, visualize, and Build AI Faster at-Scale on Intel Architecture. Develop end-to-end AI pipelines for inferencing including data ingestion, preprocessing, and model inferencing with tabular, NLP, RecSys, video and image using Intel oneAPI AI Analytics Toolkit and other optimized libraries. Build at-scale performant pipelines with Databricks and end-to-end Xeon optimizations. Learn how to visualize with the OmniSci Immerse Platform and experience a live demonstration of the Intel Distribution of Modin and OmniSci.
AI for good: Scaling AI in science, healthcare, and more.Intel® Software
How do we scale AI to its full potential to enrich the lives of everyone on earth? Learn about AI hardware and software acceleration and how Intel AI technologies are being used to solve critical problems in high energy physics, cancer research, financial inclusion, and more. Get started on your AI Developer Journey @ software.intel.com/ai
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Intel® Software
Software AI Accelerators deliver orders of magnitude performance gain for AI across deep learning, classical machine learning, and graph analytics and are key to enabling AI Everywhere. Get started on your AI Developer Journey @ software.intel.com/ai.
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Intel® Software
Learn about the algorithms and associated implementations that power SigOpt, a platform for efficiently conducting model development and hyperparameter optimization. Get started on your AI Developer Journey @ software.intel.com/ai.
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Intel® Software
oneDNN Graph API extends oneDNN with a graph interface which reduces deep learning integration costs and maximizes compute efficiency across a variety of AI hardware including AI accelerators. Get started on your AI Developer Journey @ software.intel.com/ai.
AWS & Intel Webinar Series - Accelerating AI ResearchIntel® Software
Scale your research workloads faster with Intel on AWS. Learn how the performance and productivity of Intel Hardware and Software help bridge the gap between ideation and results in Data Science. Get started on your AI Developer Journey @ software.intel.com/ai.
Whether you are an AI, HPC, IoT, Graphics, Networking or Media developer, visit the Intel Developer Zone today to access the latest software products, resources, training, and support. Test-drive the latest Intel hardware and software products on DevCloud, our online development sandbox, and use DevMesh, our online collaboration portal, to meet and work with other innovators and product leaders. Get started by joining the Intel Developer Community @ software.intel.com.
The document outlines the agenda and code of conduct for an Intel AI Summit event. The agenda includes workshops on Intel's AI portfolio, lunch, more workshops, a break, presentations on applications of Intel AI and an Intel AI partner, and concludes with networking and appetizers. The code of conduct states that Intel aims to create a respectful environment and any disrespectful or harassing behavior will not be tolerated.
This document discusses Bodo Inc.'s product that aims to simplify and accelerate data science workflows. It highlights common problems in data science like complex and slow analytics, segregated development and production environments, and unused data. Bodo provides a unified development and production environment where the same code can run at any scale with automatic parallelization. It integrates an analytics engine and HPC architecture to optimize Python code for performance. Bodo is presented as offering more productive, accurate and cost-effective data science compared to traditional approaches.
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019Intel® Software
QuEST Global is a global engineering company that provides AI and digital transformation services using technologies like computer vision, machine learning, and deep learning. It has developed several AI solutions using Intel technologies like OpenVINO that provide accelerated inferencing on Intel CPUs. Some examples include a lung nodule detection solution to help detect early-stage lung cancer from CT scans and a vision analytics platform used for applications in retail, banking, and surveillance. The company leverages Intel's AI Builder program and ecosystem to develop, integrate, and deploy AI solutions globally.
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
Explore practical elements, such as performance profiling, debugging, and porting advice. Get an overview of advanced programming topics, like common design patterns, SIMD lane interoperability, data conversions, and more.
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Intel® Software
Explore how to build a unified framework based on FFmpeg and GStreamer to enable video analytics on all Intel® hardware, including CPUs, GPUs, VPUs, FPGAs, and in-circuit emulators.
Review state-of-the-art techniques that use neural networks to synthesize motion, such as mode-adaptive neural network and phase-functioned neural networks. See how next-generation CPUs with reinforcement learning can offer better performance.
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...Intel® Software
This talk focuses on the newest release in RenderMan* 22.5 and its adoption at Pixar Animation Studios* for rendering future movies. With native support for Intel® Advanced Vector Extensions, Intel® Advanced Vector Extensions 2, and Intel® Advanced Vector Extensions 512, it includes enhanced library features, debugging support, and an extensive test framework.
This document discusses Intel's hardware and software portfolio for artificial intelligence. It highlights Intel's move from multi-purpose to purpose-built AI compute solutions from the cloud to edge devices. It also discusses Intel's data-centric infrastructure including CPUs, accelerators, networking fabric and memory technologies. Finally, it provides examples of Intel optimizations that have increased AI performance on Intel Xeon scalable processors.
AIDC India - Intel Movidius / Open Vino SlidesIntel® Software
The document discusses a smart tollgate system that uses an Intel Movidius Myriad vision processing unit and the Intel Distribution of OpenVINO Toolkit. The system is able to identify vehicles in real-time and process toll payments automatically without needing to stop.
This document discusses AI vision and a hybrid approach using both edge and server-based analytics. It outlines some of the challenges of vision problems where data is analog, complex, and data-heavy. A hybrid approach is proposed that uses edge devices for initial analysis similar to the ventral stream, while also using servers for deeper correlation and inference like the dorsal stream. This combines the strengths of edge and server-based computing on platforms like Intel that support both CPUs and GPUs to efficiently solve real-world vision problems. Several case studies are provided as examples.
Accelerate Big Data Processing with High-Performance Computing Technologies
1. Exploiting HPC Technologies to Accelerate Big Data
Processing (Hadoop, Spark, and Memcached)
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at Intel HPC Developer Conference (SC ‘16)
by
Xiaoyi Lu
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~luxi
2. Intel HPC Dev Conf (SC ‘16) 2Network Based Computing Laboratory
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
Introduction to Big Data Applications and Analytics
3. Intel HPC Dev Conf (SC ‘16) 3Network Based Computing Laboratory
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g. MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Data Management and Processing on Modern Clusters
4. Intel HPC Dev Conf (SC ‘16) 4Network Based Computing Laboratory
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: Number of Clusters and Percentage of Clusters vs. Timeline; the percentage of commodity clusters reaches 85%]
5. Intel HPC Dev Conf (SC ‘16) 5Network Based Computing Laboratory
Drivers of Modern HPC Cluster Architectures
Tianhe – 2 Titan Stampede Tianhe – 1A
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
High Performance Interconnects (InfiniBand): <1 usec latency, 100 Gbps bandwidth
Multi-core Processors; SSD, NVMe-SSD, NVRAM
6. Intel HPC Dev Conf (SC ‘16) 6Network Based Computing Laboratory
• Advanced Interconnects and RDMA protocols
– InfiniBand
– 10-40 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Delivering excellent performance (Latency, Bandwidth and CPU Utilization)
• Has influenced re-designs of enhanced HPC middleware
– Message Passing Interface (MPI) and PGAS
– Parallel File Systems (Lustre, GPFS, ..)
• SSDs (SATA and NVMe)
• NVRAM and Burst Buffer
Trends in HPC Technologies
7. Intel HPC Dev Conf (SC ‘16) 7Network Based Computing Laboratory
Interconnects and Protocols in OpenFabrics Stack for HPC
(http://openfabrics.org)
[Diagram: communication paths in the OpenFabrics stack, from the application/middleware interface down to protocol, adapter, and switch]
• Sockets → TCP/IP (kernel space) → Ethernet adapter and switch (1/10/40/100 GigE)
• Sockets → TCP/IP with hardware offload → Ethernet adapter and switch (10/40 GigE-TOE)
• Sockets → IPoIB (kernel space) → InfiniBand adapter and switch
• Sockets → RSockets (user space) → InfiniBand adapter and switch
• Sockets → SDP (user space) → InfiniBand adapter and switch
• Verbs → iWARP (user space) → iWARP adapter, Ethernet switch
• Verbs → RDMA (user space) → RoCE adapter, Ethernet switch
• Verbs → RDMA (user space) → InfiniBand adapter and switch (IB Native)
8. Intel HPC Dev Conf (SC ‘16) 8Network Based Computing Laboratory
• 205 IB Clusters (41%) in the Jun’16 Top500 list
– (http://www.top500.org)
• Installations in the Top 50 (21 systems):
Large-scale InfiniBand Installations
220,800 cores (Pangea) in France (11th) 74,520 cores (Tsubame 2.5) at Japan/GSIC (31st)
462,462 cores (Stampede) at TACC (12th) 88,992 cores (Mistral) at DKRZ Germany (33rd)
185,344 cores (Pleiades) at NASA/Ames (15th) 194,616 cores (Cascade) at PNNL (34th)
72,800 cores Cray CS-Storm in US (19th) 76,032 cores (Makman-2) at Saudi Aramco (39th)
72,800 cores Cray CS-Storm in US (20th) 72,000 cores (Prolix) at Meteo France, France (40th)
124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (21st ) 42,688 cores (Lomonosov-2) at Russia/MSU (41st)
72,000 cores (HPC2) in Italy (22nd) 60,240 cores SGI ICE X at JAEA Japan (43rd)
152,692 cores (Thunder) at AFRL/USA (25th) 70,272 cores (Tera-1000-1) at CEA France (44th)
147,456 cores (SuperMUC) in Germany (27th) 54,432 cores (Marconi) at CINECA Italy (46th)
86,016 cores (SuperMUC Phase 2) in Germany (28th) and many more!
9. Intel HPC Dev Conf (SC ‘16) 9Network Based Computing Laboratory
• Introduced in Oct 2000
• High Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), High bandwidth (up to 12.5 GigaBytes/sec -> 100Gbps), and
low CPU utilization (5-10%)
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (very unique)
• high performance and scalable implementations of distributed locks, semaphores, collective
communication operations
• Leading to big changes in designing
– HPC clusters
– File systems
– Cloud computing systems
– Grid computing systems
Open Standard InfiniBand Networking Technology
10. Intel HPC Dev Conf (SC ‘16) 10Network Based Computing Laboratory
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a "convergent trajectory"!
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
11. Intel HPC Dev Conf (SC ‘16) 11Network Based Computing Laboratory
Designing Communication and I/O Libraries for Big Data Systems:
Challenges
[Stack diagram]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
• Programming Models (Sockets) — other protocols? Upper-level changes?
• Communication and I/O Library: point-to-point communication, threaded models and synchronization, QoS, fault-tolerance, I/O and file systems, virtualization, benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs); Commodity Computing System Architectures (multi- and many-core architectures and accelerators); Storage Technologies (HDD, SSD, and NVMe-SSD)
12. Intel HPC Dev Conf (SC ‘16) 12Network Based Computing Laboratory
• Sockets not designed for high-performance
– Stream semantics often mismatch for upper layers
– Zero-copy not available for non-blocking sockets
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Current design: Application → Sockets → 1/10/40/100 GigE network
• Our approach: Application → OSU Design → Verbs interface → 10/40/100 GigE or InfiniBand
13. Intel HPC Dev Conf (SC ‘16) 13Network Based Computing Laboratory
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, and HBase Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 195 organizations from 27 countries
• More than 18,500 downloads from the project site
• RDMA for Impala (upcoming)
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
14. Intel HPC Dev Conf (SC ‘16) 14Network Based Computing Laboratory
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and
RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Easily configurable for different running modes (HHH, HHH-M, HHH-L, HHH-L-BB, and MapReduce over Lustre) and different
protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 1.1.0
– Based on Apache Hadoop 2.7.3
– Compliant with Apache Hadoop 2.7.1, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
15. Intel HPC Dev Conf (SC ‘16) 15Network Based Computing Laboratory
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
16. Intel HPC Dev Conf (SC ‘16) 16Network Based Computing Laboratory
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache Spark 1.5.1
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
17. Intel HPC Dev Conf (SC ‘16) 17Network Based Computing Laboratory
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and
available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email [email protected] (reference Comet as the machine, and SDSC as the
site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an
appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
18. Intel HPC Dev Conf (SC ‘16) 18Network Based Computing Laboratory
• High-Performance Design of HBase over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for HBase
– Compliant with Apache HBase 1.1.2 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache HBase 1.1.2
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache HBase Distribution
19. Intel HPC Dev Conf (SC ‘16) 19Network Based Computing Laboratory
• High-Performance Design of Memcached over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and
libMemcached components
– High performance design of SSD-Assisted Hybrid Memory
– Non-Blocking Libmemcached Set/Get API extensions
– Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with
IPoIB)
• Current release: 0.9.5
– Based on Memcached 1.4.24 and libMemcached 1.0.18
– Compliant with libMemcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• SSD
– http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
20. Intel HPC Dev Conf (SC ‘16) 20Network Based Computing Laboratory
• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark,
Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark,
Sequential Read Throughput (SRT) Benchmark
– Support benchmarking of
• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of
Hadoop (CDH) HDFS
• Micro-benchmarks for Memcached
– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency
Benchmark, Hybrid Memory Latency Benchmark
• Micro-benchmarks for HBase
– Get Latency Benchmark, Put Latency Benchmark
• Current release: 0.9.1
• http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, and HBase
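The OHB Memcached Get/Set benchmarks measure round-trip latency for simple key-value operations. The sketch below is not the OHB code itself; it is a minimal, hedged illustration of what such a latency test does, written against the spymemcached Java client (the host name, port, value size, and iteration count are arbitrary assumptions).

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import net.spy.memcached.MemcachedClient;

public class MemcachedLatencySketch {
    public static void main(String[] args) throws Exception {
        // Assumed server location; a real run would point at the RDMA- or IPoIB-backed server.
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        byte[] value = new byte[4096];          // 4 KB payload (arbitrary message size)
        Arrays.fill(value, (byte) 'x');
        int iterations = 10000;

        // Store the key once so that Get measures a cache hit.
        client.set("ohb:key", 0, value).get();

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            client.get("ohb:key");              // blocking Get; one round trip per call
        }
        long avgUs = (System.nanoTime() - start) / iterations / 1000;
        System.out.println("Average Get latency: " + avgUs + " us");

        client.shutdown();
    }
}
```

Timing the set() call instead, or interleaving the two, gives the Set and Mixed Get/Set variants listed above.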
21. Intel HPC Dev Conf (SC ‘16) 21Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
22. Intel HPC Dev Conf (SC ‘16) 22Network Based Computing Laboratory
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based HDFS with communication library written in native code
Design Overview of HDFS with RDMA
[Architecture: Applications → HDFS; the write path goes through the OSU design via the Java Native Interface (JNI) to a native Verbs-based communication library over RDMA-capable networks (IB, iWARP, RoCE ..), while other operations use the Java socket interface over 1/10/40/100 GigE and IPoIB networks]
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS
replication
– Parallel replication support
– On-demand connection
setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS
over InfiniBand , Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
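Because the RDMA-based write path sits beneath the standard client API, applications do not change. As a point of reference, here is a hedged sketch of a plain HDFS write through the stock org.apache.hadoop.fs.FileSystem interface (the output path, buffer size, and replication factor are arbitrary); with an RDMA-enhanced HDFS underneath, the same call would be served by the verbs-based transport.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/tmp/hdfs-write-demo.bin");    // assumed output path
        byte[] block = new byte[1 << 20];                   // 1 MB buffer

        // create() drives the HDFS write/replication pipeline that the
        // RDMA-based write and parallel replication designs target.
        try (FSDataOutputStream stream = fs.create(out, (short) 3)) {
            for (int i = 0; i < 64; i++) {
                stream.write(block);                        // 64 MB total
            }
        }
        fs.close();
    }
}
```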
23. Intel HPC Dev Conf (SC ‘16) 23Network Based Computing Laboratory
Triple-H
Heterogeneous Storage
• Design Features
– Three modes
• Default (HHH)
• In-Memory (HHH-M)
• Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous
storage devices
• RAM, SSD, HDD, Lustre
– Eviction/Promotion based on data usage
pattern
– Hybrid Replication
– Lustre-Integrated mode:
• Lustre-based fault-tolerance
Enhanced HDFS with In-Memory and Heterogeneous Storage
[Diagram: Applications → Triple-H layer applying hybrid replication, data placement policies, and eviction/promotion across RAM disk, SSD, HDD, and Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
24. Intel HPC Dev Conf (SC ‘16) 24Network Based Computing Laboratory
Design Overview of MapReduce with RDMA
[Architecture: Applications → MapReduce (JobTracker, TaskTracker, Map, Reduce); the OSU design plugs in via the Java Native Interface (JNI) to a native Verbs-based communication library over RDMA-capable networks (IB, iWARP, RoCE ..), while the Java socket interface is retained for 1/10/40/100 GigE and IPoIB networks]
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• Design Features
– RDMA-based shuffle
– Prefetching and caching map output
– Efficient Shuffle Algorithms
– In-memory merge
– On-demand Shuffle Adjustment
– Advanced overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in
MapReduce over High Performance Interconnects, ICS, June 2014
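The shuffle that this design replaces sits between the map and reduce phases of an ordinary job; application code is untouched. Below is a hedged, self-contained WordCount sketch using the standard Hadoop 2.x MapReduce API (class names and input/output paths are arbitrary); its map-output transfer is exactly the traffic that the RDMA-based shuffle, prefetching, and in-memory merge target.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);   // map output: shuffled across the network to reducers
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```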
25. Intel HPC Dev Conf (SC ‘16) 25Network Based Computing Laboratory
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
[Charts: Execution Time (s) vs. Data Size (GB), IPoIB (EDR) vs. OSU-IB (EDR), for RandomWriter and TeraGen]
Cluster with 8 Nodes with a total of 64 maps
• RandomWriter: 3x improvement over IPoIB for 80-160 GB file size (execution time reduced by 3x)
• TeraGen: 4x improvement over IPoIB for 80-240 GB file size (execution time reduced by 4x)
26. Intel HPC Dev Conf (SC ‘16) 26Network Based Computing Laboratory
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
[Charts: Execution Time (s) vs. Data Size (GB), IPoIB (EDR) vs. OSU-IB (EDR), for Sort and TeraSort]
Cluster with 8 Nodes with a total of 64 maps and 32 reduces; Cluster with 8 Nodes with a total of 64 maps and 14 reduces
• Sort: 61% improvement over IPoIB for 80-160 GB data (execution time reduced by 61%)
• TeraSort: 18% improvement over IPoIB for 80-240 GB data (execution time reduced by 18%)
27. Intel HPC Dev Conf (SC ‘16) 27Network Based Computing Laboratory
Evaluation of HHH and HHH-L with Applications
[CloudBurst results: HDFS (FDR) 60.24 s vs. HHH (FDR) 48.3 s]
[Chart: MR-MSPolyGraph execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L; reduced by 79%]
• MR-MSPolygraph on OSU RI with 1,000 maps: HHH-L reduces the execution time by 79% over Lustre, 30% over HDFS
• CloudBurst on TACC Stampede: with HHH, 19% improvement over HDFS
28. Intel HPC Dev Conf (SC ‘16) 28Network Based Computing Laboratory
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200GB TeraGen on 32 nodes
– Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
– Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Charts: Execution Time (s) vs. Cluster Size : Data Size (GB) (8:50, 16:100, 32:200) for TeraGen and TeraSort, comparing IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort reduced by 25.2%]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File
Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
29. Intel HPC Dev Conf (SC ‘16) 29Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
30. Intel HPC Dev Conf (SC ‘16) 30Network Based Computing Laboratory
Performance Numbers of RDMA for Apache HBase – OHB in SDSC-Comet
[Charts: Put and Get latency (us) vs. message size (2 bytes to 512 KB), IPoIB vs. RDMA]
• Up to 8.6x and up to 5.3x latency improvement over IPoIB (reductions of 3.8x to 8.6x across message sizes)
Evaluation with OHB Put and Get Micro-Benchmarks (1 Server, 1 Client)
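The OHB Put and Get micro-benchmarks time the basic HBase client operations. The following is a hedged sketch of those two operations with the standard HBase 1.x client API (the table, column family, row key, and value size are arbitrary assumptions); the RDMA design accelerates the RPC transport underneath these calls.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("usertable"))) {

            // Put: write one cell (the operation timed by the OHB Put benchmark).
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"),
                          new byte[4096]);                  // 4 KB value (arbitrary size)
            table.put(put);

            // Get: read the cell back (the operation timed by the OHB Get benchmark).
            Get get = new Get(Bytes.toBytes("row-0001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"));
            System.out.println("Fetched " + value.length + " bytes");
        }
    }
}
```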
31. Intel HPC Dev Conf (SC ‘16) 31Network Based Computing Laboratory
Performance Numbers of RDMA for Apache HBase – YCSB in SDSC-Comet
[Charts: Throughput (operations/sec) vs. number of clients (4 to 128) for Workload A (50% read, 50% update) and Workload C (100% read), IPoIB vs. RDMA]
• Up to 2.4x throughput improvement over IPoIB (increased by 83%); up to 3.6x throughput improvement over IPoIB
Evaluation with YCSB Workloads A and C (4 Servers)
32. Intel HPC Dev Conf (SC ‘16) 32Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
33. Intel HPC Dev Conf (SC ‘16) 33Network Based Computing Laboratory
• Design Features
– RDMA based shuffle plugin
– SEDA-based architecture
– Dynamic connection
management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer
management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Scala based Spark with communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High
Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.
[Architecture: Apache Spark benchmarks/applications/libraries/frameworks → Spark Core → Shuffle Manager (Sort, Hash, Tungsten-Sort) → Block Transfer Service (Netty, NIO, RDMA-Plugin) with Netty, NIO, and RDMA servers and clients; the Java socket interface runs over 1/10/40/100 GigE and IPoIB networks, while the Java Native Interface (JNI) connects to the native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE ..)]
34. Intel HPC Dev Conf (SC ‘16) 34Network Based Computing Laboratory
• InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.
– SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – SortBy/GroupBy
64 Worker Nodes, 1536 cores, SortByTest Total Time 64 Worker Nodes, 1536 cores, GroupByTest Total Time
[Charts: SortByTest and GroupByTest total time (sec) vs. data size (64, 128, 256 GB), IPoIB vs. RDMA, on 64 worker nodes / 1536 cores; reduced by 80% and 74%, respectively]
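SortBy and GroupBy are shuffle-bound: every record crosses the network between map-side and reduce-side tasks, which is exactly where the RDMA plugin replaces the Netty/NIO block transfer. Below is a small, hedged Java sketch of the two operations using the public Spark API (dataset size, partition count, and application name are arbitrary); the benchmark programs run on Comet are larger but exercise the same wide dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ShuffleSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shuffle-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build a small random key/value dataset (sizes are arbitrary).
            List<Tuple2<Integer, Integer>> data = new ArrayList<>();
            Random rnd = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                data.add(new Tuple2<>(rnd.nextInt(10_000), i));
            }
            JavaPairRDD<Integer, Integer> pairs = sc.parallelizePairs(data, 64);

            // Both transformations introduce a wide dependency, so their results
            // are produced by an all-to-all shuffle across the workers.
            long sorted = pairs.sortByKey().count();     // SortBy-style workload
            long grouped = pairs.groupByKey().count();   // GroupBy-style workload

            System.out.println("sorted=" + sorted + " grouped=" + grouped);
        }
    }
}
```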
35. Intel HPC Dev Conf (SC ‘16) 35Network Based Computing Laboratory
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – HiBench PageRank
32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time
[Charts: PageRank total time (sec) vs. data size (Huge, BigData, Gigantic), IPoIB vs. RDMA, on 32 worker nodes / 768 cores and 64 worker nodes / 1536 cores; reduced by 37% and 43%, respectively]
36. Intel HPC Dev Conf (SC ‘16) 36Network Based Computing Laboratory
Performance Evaluation on SDSC Comet: Astronomy Application
• Kira Toolkit1: Distributed astronomy image
processing toolkit implemented using Apache Spark.
• Source extractor application, using a 65GB dataset
from the SDSS DR2 survey that comprises 11,150
image files.
• Compare RDMA Spark performance with the
standard apache implementation using IPoIB.
1. Z. Zhang, K. Barbary, F. A. Nothaft, E.R. Sparks, M.J. Franklin, D.A.
Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An
Astronomy Use Case. CoRR, vol: abs/1507.03325, Aug 2015.
[Chart: execution times (sec) for the Kira SE benchmark using a 65 GB dataset on 48 cores; RDMA Spark vs. Apache Spark (IPoIB), 21% improvement]
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
37. Intel HPC Dev Conf (SC ‘16) 37Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs and Studies
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
38. Intel HPC Dev Conf (SC ‘16) 38Network Based Computing Laboratory
Memcached Performance (FDR Interconnect)
[Charts: Memcached GET latency (us) vs. message size (1 byte to 4 KB), and Memcached throughput (thousands of transactions per second) vs. number of clients (16 to 4080), OSU-IB (FDR) vs. IPoIB]
• Memcached Get latency: 4 bytes OSU-IB: 2.84 us, IPoIB: 75.53 us; 2K bytes OSU-IB: 4.49 us, IPoIB: 123.42 us (latency reduced by nearly 20x)
• Memcached throughput (4 bytes): 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec (nearly 2x improvement in throughput)
Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High
Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid
Transport, CCGrid’12
39. Intel HPC Dev Conf (SC ‘16) 39Network Based Computing Laboratory
• Illustration with Read-Cache-Read access pattern using modified mysqlslap load testing
tool
• Memcached-RDMA can
- improve query latency by up to 66% over IPoIB (32Gbps)
- improve throughput by up to 69% over IPoIB (32Gbps)
Micro-benchmark Evaluation for OLDP workloads
[Charts: latency (sec) and throughput (Kq/s) vs. number of clients (64 to 400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps); latency reduced by 66%]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS'15
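The OLDP experiment models a read-cache-read pattern: the application checks Memcached first, falls back to MySQL on a miss, and then repopulates the cache. The sketch below is a hedged illustration of that pattern with spymemcached and plain JDBC (the connection string, table, key, and TTL are arbitrary assumptions), not the modified mysqlslap harness used in the study.

```java
import java.net.InetSocketAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import net.spy.memcached.MemcachedClient;

public class ReadCacheReadSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/testdb", "user", "password")) {

            String key = "item:42";
            Object value = cache.get(key);              // 1) try the cache first
            if (value == null) {                        // 2) miss: go to the database
                try (PreparedStatement ps = db.prepareStatement(
                        "SELECT payload FROM items WHERE id = ?")) {
                    ps.setInt(1, 42);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            value = rs.getString(1);
                            cache.set(key, 300, value); // 3) repopulate with a 5-minute TTL
                        }
                    }
                }
            }
            System.out.println("value = " + value);
        } finally {
            cache.shutdown();
        }
    }
}
```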
40. Intel HPC Dev Conf (SC ‘16) 40Network Based Computing Laboratory
• RDMA-based Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs and Studies
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
41. Intel HPC Dev Conf (SC ‘16) 41Network Based Computing Laboratory
• The current benchmarks capture some aspects of performance behavior
• However, do not provide any information to the designer/developer on:
– What is happening at the lower-layer?
– Where the benefits are coming from?
– Which design is leading to benefits or bottlenecks?
– Which component in the design needs to be changed and what will be its impact?
– Can performance gain/loss at the lower-layer be correlated to the performance
gain/loss observed at the upper layer?
Are the Current Benchmarks Sufficient for Big Data?
42. Intel HPC Dev Conf (SC ‘16) 42Network Based Computing Laboratory
Challenges in Benchmarking of RDMA-based Designs
[Stack diagram: Applications → Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) → Programming Models (Sockets), RDMA protocols, other protocols? → Communication and I/O Library (point-to-point communication, threaded models and synchronization, QoS, fault-tolerance, I/O and file systems, virtualization, benchmarks) → Networking Technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), Commodity Computing System Architectures (multi- and many-core architectures and accelerators), Storage Technologies (HDD, SSD, and NVMe-SSD); annotations: Current Benchmarks, No Benchmarks, Correlation?]
43. Intel HPC Dev Conf (SC ‘16) 43Network Based Computing Laboratory
Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications
[Same stack diagram as above, annotated with Applications-Level Benchmarks and Micro-Benchmarks]
44. Intel HPC Dev Conf (SC ‘16) 44Network Based Computing Laboratory
• HDFS Benchmarks
– Sequential Write Latency (SWL) Benchmark
– Sequential Read Latency (SRL) Benchmark
– Random Read Latency (RRL) Benchmark
– Sequential Write Throughput (SWT) Benchmark
– Sequential Read Throughput (SRT) Benchmark
• Memcached Benchmarks
– Get, Set and Mixed Get/Set Benchmarks
– Non-blocking and Hybrid Memory Latency Benchmarks
• HBase Benchmarks
– Get and Put Latency Benchmarks
• Available as a part of OHB 0.9.1
OSU HiBD Benchmarks (OHB)
N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)
X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013
To be Released
45. Intel HPC Dev Conf (SC ‘16) 45Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
46. Intel HPC Dev Conf (SC ‘16) 46Network Based Computing Laboratory
• Design Features
– Memcached-based burst-buffer
system
• Hides latency of parallel file
system access
• Read from local storage and
Memcached
– Data locality achieved by writing data
to local storage
– Different approaches of integration
with parallel file system to guarantee
fault-tolerance
Accelerating I/O Performance of Big Data Analytics
through RDMA-based Key-Value Store
[Architecture: Application → Map/Reduce Task with an I/O forwarding module → Memcached-based burst buffer system and local disk (DataNode) for data locality, with Lustre underneath for fault-tolerance]
47. Intel HPC Dev Conf (SC ‘16) 47Network Based Computing Laboratory
Evaluation with PUMA Workloads
Gains on OSU RI with our approach (Mem-bb) on 24 nodes
• SequenceCount: 34.5% over Lustre, 40% over HDFS
• RankedInvertedIndex: 27.3% over Lustre, 48.3% over HDFS
• HistogramRating: 17% over Lustre, 7% over HDFS
[Chart: execution time (s) for SeqCount, RankedInvIndex, and HistoRating workloads with HDFS (32Gbps), Lustre (32Gbps), and Mem-bb (32Gbps); annotated reductions of 48.3%, 40%, and 17%]
N. S. Islam, D. Shankar, X. Lu, M. W. Rahman, and D. K. Panda, Accelerating I/O Performance of Big Data Analytics with RDMA-based Key-Value Store, ICPP '15, September 2015
48. Intel HPC Dev Conf (SC ‘16) 48Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
49. Intel HPC Dev Conf (SC ‘16) 49Network Based Computing Laboratory
• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
- improve query latency by up to 70% over IPoIB (32Gbps)
- improve throughput by up to 2X over IPoIB (32Gbps)
- No overhead in using hybrid mode when all data can fit in memory
Performance Benefits of Hybrid Memcached (Memory + SSD) on
SDSC-Gordon
[Charts: throughput (million trans/sec) vs. number of clients (64 to 1024), and average latency (us) vs. message size (bytes), comparing IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); 2x throughput improvement]
50. Intel HPC Dev Conf (SC ‘16) 50Network Based Computing Laboratory
– Memcached latency test with Zipf distribution, server with 1 GB memory, 32 KB key-value pair size, total
size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
– When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem
– When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem
Performance Evaluation on IB FDR + SATA/NVMe SSDs (Hybrid Memory)
[Chart: Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA, and RDMA-Hybrid-NVMe, with data fitting in memory and data not fitting in memory; latency broken down into slab allocation (SSD write), cache check+load (SSD read), cache update, server response, client wait, and miss penalty]
51. Intel HPC Dev Conf (SC ‘16) 51Network Based Computing Laboratory
Performance Evaluation with Non-Blocking Memcached API
– Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
• >16x latency improvement vs. the blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
• >2.5x throughput improvement vs. the blocking API over default/optimized RDMA-Hybrid
– Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
(Chart: average Set/Get latency in us for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into slab allocation (w/ SSD write on out-of-memory), cache check+load (memory and/or SSD read), cache update, server response, client wait, and miss penalty (back-end DB access overhead))
H = Hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API w/ buffer re-use guarantee (the issue-then-wait usage pattern is sketched below)
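The OSU iset/iget and bset/bget extensions live in libmemcached (C), so they are not shown verbatim here. As a rough Java analogue of the same idea, the sketch below uses spymemcached's asynchronous futures: issue the Set/Get, keep computing, and wait only when the result is needed, which is exactly the overlap the non-blocking API is meant to expose.

```java
import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.GetFuture;
import net.spy.memcached.internal.OperationFuture;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

// Issue-then-wait pattern, analogous to the non-blocking iset/iget idea.
public class NonBlockingOverlapDemo {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("localhost", 11211));

        byte[] value = new byte[32 * 1024];                               // 32 KB value
        OperationFuture<Boolean> setDone = client.set("key-1", 0, value); // "iset"
        GetFuture<Object> getDone = client.asyncGet("key-0");             // "iget"

        doUsefulWork();                                  // overlap with communication

        setDone.get(5, TimeUnit.SECONDS);                // "wait" for set completion
        Object cached = getDone.get(5, TimeUnit.SECONDS);// may be null on a miss
        System.out.println("get returned: " + cached);
        client.shutdown();
    }

    private static void doUsefulWork() { /* application computation here */ }
}
```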
52. Intel HPC Dev Conf (SC ‘16) 52Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
53. Intel HPC Dev Conf (SC ‘16) 53Network Based Computing Laboratory
Efficient Data Access Strategies with Heterogeneous Storage
• Locality- and storage-type-aware data access
• Enhanced data access strategies for Hadoop and Spark on clusters with heterogeneous storage characteristics
– different nodes have different types of storage devices
• Re-design HDFS to incorporate the proposed strategies (a selection-policy sketch follows below)
– upper-level frameworks can transparently leverage the benefits
(Design components between Hadoop/Spark workloads and the DataNodes with SSDs and HDDs: Access Strategy Selector, Weight Distributor, Locality Detector, Storage Type Fetcher, DataNode Selector, Storage Monitor, and Connection Tracker)
N. Islam, M. W. Rahman, X. Lu, and D. K. Panda, Efficient Data Access Strategies for Hadoop and Spark on HPC Clusters with Heterogeneous Storage, accepted at BigData ’16, December 2016
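To make the "locality- and storage-type-aware" idea concrete, here is a minimal sketch of a replica-selection policy: prefer a local replica, otherwise pick the replica on the fastest storage tier. All class names, tiers, and the weighting rule are invented for illustration; the actual design modifies HDFS internals and is considerably richer.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical locality- and storage-type-aware replica selector.
public class StorageAwareSelector {
    enum StorageType { RAM_DISK, SSD, HDD }   // ordered fastest to slowest

    static class DataNodeInfo {
        final String host;
        final StorageType storage;
        DataNodeInfo(String host, StorageType storage) { this.host = host; this.storage = storage; }
    }

    /** Prefer a local replica; otherwise choose the replica on the fastest storage. */
    static DataNodeInfo select(List<DataNodeInfo> replicas, String clientHost) {
        Optional<DataNodeInfo> local = replicas.stream()
                .filter(r -> r.host.equals(clientHost))
                .findFirst();
        if (local.isPresent()) {
            return local.get();                     // locality wins when available
        }
        return replicas.stream()                    // otherwise weight by storage type
                .min(Comparator.comparingInt(r -> r.storage.ordinal()))
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
    }
}
```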
54. Intel HPC Dev Conf (SC ‘16) 54Network Based Computing Laboratory
Evaluation with Sort Benchmark (OSU RI2 EDR Cluster)
• Hadoop: 32% improvement for 320 GB data size on 16 nodes
• Spark: 18% improvement for 320 GB data size on 16 nodes
(Charts: Hadoop Sort and Spark Sort execution time in seconds vs. cluster size : data size (4:80, 8:160, 16:320 GB), HDFS vs. OSU-Design; execution time reduced by 32% for Hadoop and 18% for Spark)
55. Intel HPC Dev Conf (SC ‘16) 55Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
56. Intel HPC Dev Conf (SC ‘16) 56Network Based Computing Laboratory
Burst-Buffer Over Lustre for Accelerating Big Data I/O (Boldio)
• Hybrid and resilient key-value-store-based burst-buffer system over Lustre
• Overcomes limitations of local storage on HPC cluster nodes
• Light-weight, transparent interface to Hadoop/Spark applications (a configuration sketch follows below)
• Accelerates I/O-intensive Big Data workloads
– Non-blocking Memcached APIs to maximize overlap
– Client-based replication for resilience
– Asynchronous persistence to the Lustre parallel file system
(Architecture: Hadoop applications/benchmarks (e.g., MapReduce, Spark) run over the Hadoop FileSystem class abstraction; a BoldioFileSystem abstraction and Boldio libmemcached client with blocking/non-blocking APIs and an RDMA-enhanced communication engine talk to a burst-buffer Memcached cluster, whose servers combine an RDMA-enhanced communication engine, a hybrid-memory manager (RAM/SSD), and a persistence manager co-designed with the Lustre parallel file system (MDS/MDT, OSS/OST); a direct-over-Lustre path is shown for comparison)
D. Shankar, X. Lu, D. K. Panda, Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O, IEEE Big Data 2016.
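The "light-weight transparent interface" relies on Hadoop's standard FileSystem plug-in mechanism, where a scheme is mapped to an implementation class via configuration. The sketch below shows that mechanism; the `boldio://` scheme and the implementation class name are assumptions for illustration, and Boldio's actual packaging may differ.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

// Pointing a Hadoop job at a burst-buffer file system via fs.<scheme>.impl.
public class BoldioClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register a (hypothetical) FileSystem implementation for the scheme.
        conf.set("fs.boldio.impl", "edu.osu.hibd.boldio.BoldioFileSystem");
        // Jobs can then use boldio:// URIs for input/output paths unchanged.
        FileSystem fs = FileSystem.get(URI.create("boldio://bb-cluster/"), conf);
        fs.mkdirs(new Path("/benchmarks/dfsio"));
        System.out.println("Working dir: " + fs.getWorkingDirectory());
    }
}
```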
57. Intel HPC Dev Conf (SC ‘16) 57Network Based Computing Laboratory
Performance Evaluation with Boldio
• Based on RDMA-based Libmemcached/Memcached 0.9.3 and Hadoop 2.6.0
• InfiniBand QDR, 24 GB RAM + PCIe SSDs, 12 nodes, 32/48 Map/Reduce tasks, 4-node Memcached cluster
• Boldio can improve
– DFSIO throughput over Lustre by about 3x for writes and 7x for reads
– execution time of Hadoop benchmarks over Lustre, e.g., WordCount and CloudBurst, by >21%
• Contrasting with Alluxio (formerly Tachyon)
– performance degrades by about 15x when Alluxio cannot leverage local storage (Alluxio-Local vs. Alluxio-Remote)
– Boldio improves throughput over Alluxio with all remote workers by about 3.5x-8.8x (Alluxio-Remote vs. Boldio)
(Charts: Hadoop/Spark workload latency in seconds for WordCount, InvIndx, CloudBurst, and Spark TeraGen with Lustre-Direct, Alluxio-Remote, and Boldio, showing the 21% gain; DFSIO aggregate write/read throughput in MBps at 20 GB and 40 GB for Lustre-Direct, Alluxio-Local, Alluxio-Remote, and Boldio, showing ~3x write and ~7x read gains over Lustre)
58. Intel HPC Dev Conf (SC ‘16) 58Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
59. Intel HPC Dev Conf (SC ‘16) 59Network Based Computing Laboratory
Accelerating Indexing Techniques on HBase with RDMA
• Challenges
– Operations on a Distributed Ordered Table (DOT) with indexing techniques are network intensive
– Additional overhead of creating and maintaining secondary indices
– Can RDMA benefit indexing techniques (Apache Phoenix and CCIndex) on HBase?
• Results
– Evaluation with Apache Phoenix and CCIndex (a minimal Phoenix example follows below)
– Up to 2x improvement in query throughput
– Up to 35% reduction in application workload execution time
Collaboration with the Institute of Computing Technology, Chinese Academy of Sciences
S. Gugnani, X. Lu, L. Zha, and D. K. Panda, Characterizing and Accelerating Indexing Techniques on HBase for Distributed Ordered Table-based Queries and Applications (under review)
(Charts: TPC-H Query 1/2 throughput for HBase, RDMA-HBase, HBase-Phoenix, RDMA-HBase-Phoenix, HBase-CCIndex, and RDMA-HBase-CCIndex, increased by 2x; Ad Master application workloads 1-4 execution time for HBase-Phoenix, RDMA-HBase-Phoenix, HBase-CCIndex, and RDMA-HBase-CCIndex, reduced by 35%)
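For readers unfamiliar with Phoenix-style secondary indexing on HBase, the sketch below creates a table and a secondary index and runs a filtered query over JDBC. Table and column names are invented; the RDMA acceleration evaluated in the study sits in the HBase communication layer, so the client-side SQL shown here is unchanged by it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal Apache Phoenix secondary-index example over HBase.
public class PhoenixIndexExample {
    public static void main(String[] args) throws Exception {
        // "zk-host" is a placeholder for the HBase ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS ads (id BIGINT PRIMARY KEY, "
                       + "campaign VARCHAR, clicks BIGINT)");
            // Secondary index so queries filtering on 'campaign' avoid full scans.
            stmt.execute("CREATE INDEX IF NOT EXISTS ads_campaign_idx ON ads (campaign)");
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(*) FROM ads WHERE campaign = 'sc16'")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }
}
```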
60. Intel HPC Dev Conf (SC ‘16) 60Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
61. Intel HPC Dev Conf (SC ‘16) 61Network Based Computing Laboratory
Challenges of Tuning and Profiling
• MapReduce systems expose different configuration parameters depending on the underlying component that uses them
• The parameter files vary across different MapReduce stacks
• Proposed a generalized parameter space for HPC clusters
• Two broad dimensions, user space and system space; existing parameters can be categorized into the proposed spaces
62. Intel HPC Dev Conf (SC ‘16) 62Network Based Computing Laboratory
MR-Advisor Overview
• A generalized framework for Big Data processing engines to perform tuning, profiling, and prediction
• The current framework works with Hadoop, Spark, and RDMA MapReduce (OSU-MR)
• Can also provide tuning for different file systems (e.g., HDFS, Lustre, Tachyon), resource managers (e.g., YARN), and applications
63. Intel HPC Dev Conf (SC ‘16) 63Network Based Computing Laboratory
Execution Details in MR-Advisor
• The Workload Generator maps the input parameters to the appropriate configuration and generates a workload for each tuning test (a toy tuning-loop sketch follows below)
• The Job Submitter deploys small clusters for each experiment and runs the workload on these deployments
• The Job Tuning Tracker monitors failed jobs and re-launches them, if necessary
(Architecture: a Workload Preparation and Deployment Unit (Workload Generator, Job Submitter, Job Tuning Tracker) drives compute nodes over InfiniBand/Ethernet running combinations of App Master, Map/Reduce tasks, NodeManager, DataNode, RM/Spark Master, Tachyon Master/NameNode, Spark Worker, and Tachyon Worker, alongside a Lustre installation (MDS/OSS))
M. W. Rahman, N. S. Islam, X. Lu, D. Shankar, and D. K. Panda, MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers, SBAC-PAD, 2016.
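The sketch below is only a toy version of the tuning loop: enumerate candidate configurations, run the workload under each, and keep the best. It is not MR-Advisor code; the two Hadoop property names are ordinary examples, and the workload launch is a placeholder.

```java
import java.util.List;
import java.util.Map;

// Toy parameter-sweep driver illustrating the tune-run-track idea.
public class TinyTuningLoop {
    public static void main(String[] args) throws Exception {
        List<Map<String, String>> candidates = List.of(
            Map.of("mapreduce.task.io.sort.mb", "256", "mapreduce.job.reduces", "16"),
            Map.of("mapreduce.task.io.sort.mb", "512", "mapreduce.job.reduces", "32"));

        long best = Long.MAX_VALUE;
        Map<String, String> bestConf = null;
        for (Map<String, String> conf : candidates) {
            long start = System.nanoTime();
            runWorkload(conf);                         // one tuning test
            long elapsed = System.nanoTime() - start;
            if (elapsed < best) { best = elapsed; bestConf = conf; }
        }
        System.out.println("Best configuration: " + bestConf);
    }

    private static void runWorkload(Map<String, String> conf) throws Exception {
        // Placeholder: a real driver would pass -D options to a Hadoop/Spark
        // job submission and re-launch failed runs.
        ProcessBuilder pb = new ProcessBuilder("echo", "running with " + conf);
        pb.inheritIO().start().waitFor();
    }
}
```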
64. Intel HPC Dev Conf (SC ‘16) 64Network Based Computing Laboratory
Tuning Experiments with MR-Advisor (TACC Stampede)
Performance improvements compared to current best-practice values:
• Apache MR over HDFS: 23%; OSU-MR over HDFS: 17%; Apache Spark over HDFS: 34%
• Apache MR over Lustre: 46%; OSU-MR over Lustre: 58%; Apache Spark over Lustre: 28%
65. Intel HPC Dev Conf (SC ‘16) 65Network Based Computing Laboratory
• Basic Designs
– HDFS, MapReduce, and RPC
– HBase
– Spark
– Memcached
– OSU HiBD Benchmarks (OHB)
• Advanced Designs
– HDFS+Memcached-based Burst-Buffer
– Memcached with Hybrid Memory and Non-blocking APIs
– Data Access Strategies with Heterogeneous Storage (Hadoop and Spark)
– Accelerating Big Data I/O (Lustre + Burst-Buffer)
– Efficient Indexing with RDMA-HBase
– MR-Advisor
• BigData + HPC Cloud
Acceleration Case Studies and Performance Evaluation
66. Intel HPC Dev Conf (SC ‘16) 66Network Based Computing Laboratory
Performance Characterization of Hadoop Workloads on SR-IOV-enabled Clouds
• Motivation
– Performance attributes of Big Data workloads when using SR-IOV are not known
– Impact of VM subscription policies, data size, and type of workload on performance with SR-IOV has not been evaluated in a systematic manner
• Results
– Evaluation on Chameleon Cloud with RDMA-Hadoop
– Only 0.3-13% overhead with SR-IOV compared to native execution
– Best VM subscription policy depends on the type of workload
S. Gugnani, X. Lu, and D. K. Panda, Performance Characterization of Hadoop Workloads on SR-IOV-enabled Virtualized InfiniBand Clusters, accepted at BDCAT ’16, December 2016
(Charts: execution time of Hadoop workloads (TeraGen, WordCount at 40/60 GB) and applications (CloudBurst on sample data/1 GB, Self-join at 60/90 GB) for Native (1 DN, 1 NM), Native (2 DN, 2 NM), VM per node, and VM per socket)
67. Intel HPC Dev Conf (SC ‘16) 67Network Based Computing Laboratory
Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
• Challenges
– Existing designs in Hadoop are not virtualization-aware
– No support for automatic topology detection
• Design
– Automatic topology detection using a MapReduce-based utility
• Requires no user input
• Can detect topology changes during runtime without affecting running jobs
– Virtualization- and topology-aware communication through map task scheduling and YARN container allocation policy extensions (Hadoop's standard topology plug-in point is sketched below)
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom ’16, December 2016
(Charts: execution time of Hadoop benchmarks (Sort, WordCount, PageRank at 40/60 GB) and Hadoop applications (CloudBurst, Self-join in default and distributed modes) for RDMA-Hadoop vs. Hadoop-Virt, with annotated reductions of 55% and 34%)
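The paper's topology- and virtualization-aware scheduling is implemented inside Hadoop/YARN, which is not reproduced here. What the sketch below shows is Hadoop's generic plug-in point for feeding detected topology to the stack: a custom DNSToSwitchMapping registered via the `net.topology.node.switch.mapping.impl` property. The mapping logic (deriving a network location from a VM host name) is a made-up stand-in for the MapReduce-based automatic detection described on the slide.

```java
import org.apache.hadoop.net.DNSToSwitchMapping;
import java.util.ArrayList;
import java.util.List;

// Custom topology mapping, configured in core-site.xml via
// net.topology.node.switch.mapping.impl = <this class>.
public class VirtTopologyMapping implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        List<String> locations = new ArrayList<>(names.size());
        for (String name : names) {
            // e.g. VM "vm3-host12" -> "/rack1/host12"; a real detector would
            // derive physical-host placement from the discovered topology.
            String physicalHost = name.contains("-") ? name.split("-")[1] : name;
            locations.add("/rack1/" + physicalHost);
        }
        return locations;
    }

    @Override
    public void reloadCachedMappings() { /* re-run detection here */ }

    @Override
    public void reloadCachedMappings(List<String> names) { /* selective refresh */ }
}
```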
68. Intel HPC Dev Conf (SC ‘16) 68Network Based Computing Laboratory
• Upcoming Releases of RDMA-enhanced Packages will support
– Upgrades to the latest versions of Hadoop and Spark
– Streaming
– MR-Advisor
– Impala
• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce, RPC and Spark
• Advanced designs with upper-level changes and optimizations
– Boldio
– Efficient Indexing
On-going and Future Plans of the OSU High-Performance Big Data (HiBD) Project
69. Intel HPC Dev Conf (SC ‘16) 69Network Based Computing Laboratory
Concluding Remarks
• Discussed challenges in accelerating Big Data middleware with HPC technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase, Memcached, and Spark
• Results are promising
• Many other open issues still need to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
• Looking forward to collaboration with the community
70. Intel HPC Dev Conf (SC ‘16) 70Network Based Computing Laboratory
Funding Acknowledgments
Funding Support by
Equipment Support by
71. Intel HPC Dev Conf (SC ‘16) 71Network Based Computing Laboratory
Personnel Acknowledgments
Current Students
– A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists
– K. Hamidouche, X. Lu, H. Subramoni
Current Research Specialist
– J. Smith
Past Students
– A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Past Post-Docs
– D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers
– D. Bureddy, M. Arnold, J. Perkins
72. Intel HPC Dev Conf (SC ‘16) 72Network Based Computing Laboratory
The 3rd International Workshop on
High-Performance Big Data Computing (HPBDC)
HPBDC 2017 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), Orlando, Florida, USA, May 2017
Tentative submission deadlines
Abstract: January 10, 2017
Full submission: January 17, 2017
HPBDC 2016 was held in conjunction with IPDPS’16
Keynote Talk: Dr. Chaitanya Baru,
Senior Advisor for Data Science, National Science Foundation (NSF);
Distinguished Scientist, San Diego Supercomputer Center (SDSC)
Panel Moderator: Jianfeng Zhan (ICT/CAS)
Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques
Six Regular Research Papers and Two Short Research Papers
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
73. Intel HPC Dev Conf (SC ‘16) 73Network Based Computing Laboratory
• Three Conference Tutorials (IB+HSE, IB+HSE Advanced, Big Data)
• HP-CAST
• Technical Papers (SC main conference; Doctoral Showcase; Poster; PDSW-DISC, PAW, COMHPC, and ESPM2 Workshops)
• Booth Presentations (Mellanox, NVIDIA, NRL, PGAS)
• HPC Connection Workshop
• Will be stationed at Ohio Supercomputer Center/OH-TECH Booth (#1107)
– Multiple presentations and demos
• More Details from http://mvapich.cse.ohio-state.edu/talks/
OSU Team Will be Participating in Multiple Events at SC ‘16
74. Intel HPC Dev Conf (SC ‘16) 74Network Based Computing Laboratory
{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/