40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility - inside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing center in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to offer supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology to transfer to society. We are a Severo Ochoa Center of Excellence, a first-level member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Spanish Supercomputing Network (RES). As a research center, we count on more than 456 experts from 45 countries, organized into four major research areas: Computer Sciences, Life Sciences, Earth Sciences, and computational applications in science and engineering.
HPE and NVIDIA are delivering a leading portfolio of optimized AI solutions that transform business and industry, enabling deeper insights and helping solve the world's greatest challenges. Join this session to learn how the NVIDIA V100, the world's most powerful GPU, powers HPE 6500 systems, HPE's AI systems, to deliver new business insights and outcomes.
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...) - Fisnik Kraja
This document summarizes the results of performance analysis and optimizations of the STAR-CCM+ application run on different Intel CPU configurations. The analysis showed that the application's performance was highly dependent on CPU frequency (85-88%) and benefited from optimizations such as CPU binding, huge pages, and scatter task placement. Comparing CPU types showed the 12-core CPU to be 8-9% faster. Hyperthreading had minimal impact on performance. Turbo Boost was effective, but its benefits were reduced as fewer cores were utilized.
HPC Infrastructure To Solve The CFD Grand Challenge - Anand Haridass
This document summarizes Anand Haridass' presentation on using HPC infrastructure to solve computational fluid dynamics (CFD) grand challenges. It discusses how CFD utilizes physics, mathematics, computational geometry, and computer science. Solving CFD problems is bound by memory usage, computation needs, and network requirements. The presentation outlines IBM's POWER processor roadmap and how the POWER9 will have stronger cores, enhanced caches, and improved interfaces like NVLink and CAPI to accelerate workloads like CFD. Case studies demonstrate how IBM systems using GPUs and NVLink can provide faster performance for CFD codes and reservoir simulations.
Large-Scale Optimization Strategies for Typical HPC Workloads - inside-BigData.com
Large-scale optimization strategies for typical HPC workloads include:
1) Building a powerful profiling tool to analyze application performance and identify bottlenecks like inefficient instructions, memory bandwidth, and network utilization.
2) Harnessing state-of-the-art hardware like new CPU architectures, instruction sets, and accelerators to maximize application performance.
3) Leveraging the latest algorithms and computational models that are better suited for large-scale parallelization and new hardware.
This document discusses a lecture on hardware acceleration. It begins by providing background on Moore's law and how increasing transistor density led to issues with power consumption and thermal constraints. This motivated the evolution of specialized hardware acceleration to improve performance. The lecture then covers topics like coprocessors vs accelerators, common acceleration techniques, and examples of hardware acceleration. It also discusses challenges like debugging and coherency when designing accelerated systems.
This slide presents a detailed view of the Summit supercomputer's hardware architecture, including its CPUs, GPUs, interconnect network, and the applications it runs.
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E... - RISC-V International
The document summarizes the Klessydra-T architecture for designing vector coprocessors for multi-threaded edge-computing cores. It describes the interleaved multi-threading baseline and parameterized vector-acceleration schemes built on the Klessydra vector intrinsic functions. Performance results show up to 3x speedup over a baseline core for benchmarks such as convolution, FFT, and matrix multiplication on FPGA implementations with different configurations of vector lanes, functional units, and scratchpad memories.
Building efficient 5G NR base stations with Intel® Xeon® Scalable Processors - Michelle Holley
Speaker: Daniel Towner, System Architect for Wireless Access, Intel Corporation
5G brings many new capabilities over 4G, including higher bandwidths, lower latencies, and more efficient use of radio spectrum. However, these improvements require a large increase in computing power in the base station. Fortunately, the Xeon Scalable Processor series (Skylake-SP) recently introduced by Intel has a new high-performance instruction set called Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which is capable of delivering the compute needed to support the exciting new world of 5G.
In his talk, Daniel will give an overview of the new capabilities of the Intel AVX-512 instruction set and show why they are so beneficial to supporting 5G efficiently. The most obvious difference is that Intel AVX-512 has double the compute performance of previous generations of instruction sets. Perhaps surprisingly, though, it is the addition of brand-new instructions that can make the biggest improvements. The new instructions mean that software algorithms can become more efficient, thereby enabling even more effective use of the improvements in computing performance and leading to very high performance 5G NR software implementations.
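To make the instruction-set point concrete, here is a minimal AVX-512 sketch (my illustration, not code from the talk): a fused multiply-add loop that processes 16 single-precision floats per 512-bit vector, the style of inner loop a 5G NR signal-processing kernel would lean on.

```cpp
// Minimal AVX-512 sketch: y[i] += a * x[i], 16 floats per 512-bit vector.
// Compile with e.g. g++ -O2 -mavx512f on a Skylake-SP or later CPU.
#include <immintrin.h>
#include <cstddef>

void saxpy_avx512(float a, const float* x, float* y, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);          // broadcast the scalar a
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);       // load 16 floats
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);         // vy = va*vx + vy in one op
        _mm512_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i) y[i] += a * x[i];          // scalar remainder
}
```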
Fast datastacks - fast and flexible NFV solution stacks leveraging fd.io - OPNFV
This document discusses using Vector Packet Processor (VPP) to provide fast and flexible networking capabilities for NFV solution stacks. It introduces VPP as a high-performance virtual switch that can achieve high throughput even at large scale. VPP offers features like IPv4 and IPv6 routing, Layer 2 switching, and VXLAN tunneling with linear performance scaling across multiple CPU cores. The FastDataStacks project aims to integrate VPP into OpenStack-based NFV solution stacks to provide enhanced networking functions.
Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
“Accelerated computing is transforming the data center that delivers unprecedented throughput, enabling new discoveries and services for end users. This talk will give an overview about the NVIDIA Tesla accelerated computing platform including the latest developments in hardware and software. In addition it will be shown how deep learning on GPUs is changing how we use computers to understand data.”
In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.
Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/
See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter
Jensen Huang, founder and CEO of NVIDIA, discusses the rise of GPU computing and artificial intelligence. He outlines how GPUs have enabled massive performance increases for deep learning workloads. NVIDIA is introducing new products like the Tesla V100 GPU and DGX-1 server to further accelerate AI research and commercial applications. These announcements position NVIDIA to power continued growth in AI and deep learning.
The document discusses the emergence of computation for interdisciplinary large data analysis. It notes that exponential increases in computational power and data are driving changes in science and engineering. Computational modeling is becoming a third pillar of science alongside theory and experimentation. However, continued increases in clock speeds are no longer feasible due to power constraints, necessitating the use of multi-core processors and parallelism. This is driving changes in software design to expose parallelism.
Fast data in times of crisis with GPU accelerated database QikkDB | Business ... - Matej Misik
Graphics cards (GPUs) open up new ways of processing and analyzing big data, delivering millisecond selections over billions of rows, as well as telling stories about data. #QikkDB
How do you present data so that everyone understands it? Data analysis is for scientists, but data storytelling is for everyone: managers, product owners, sales teams, the general public. #TellStory
Learn about high-performance computing with GPUs and how to present data, with a rich COVID-19 data story example, in the upcoming webinar.
1) The document discusses implementing and evaluating deep neural networks (DNNs) on mainstream heterogeneous systems like CPUs, GPUs, and APUs.
2) Preliminary results show that an APU achieves the highest performance per watt compared to CPUs and GPUs for DNN models like MLP and autoencoders.
3) Data transfers between the CPU and GPU are identified as a bottleneck, but APUs can help avoid this issue through efficient data sharing and zero-copy techniques between the CPU and GPU.
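As an illustration of the zero-copy point, here is a minimal OpenCL sketch (my choice of API for illustration; the document does not name one). On an APU, CL_MEM_ALLOC_HOST_PTR can place a buffer in memory visible to both CPU and GPU, so the CPU writes through a mapped pointer and a GPU kernel can read the same allocation without a separate transfer.

```cpp
// Hedged zero-copy sketch for an APU using OpenCL.
#include <CL/cl.h>

int main() {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    const size_t n = 1 << 20;
    // Host-visible allocation: on an APU this enables zero-copy sharing.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, n * sizeof(float),
                                nullptr, &err);
    // CPU writes through a mapped pointer instead of staging a copy.
    float* p = (float*)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0,
                                          n * sizeof(float), 0, nullptr, nullptr, &err);
    for (size_t i = 0; i < n; ++i) p[i] = 1.0f;
    clEnqueueUnmapMemObject(q, buf, p, 0, nullptr, nullptr);
    // ... enqueue a DNN kernel that reads buf directly; no clEnqueueWriteBuffer ...
    clReleaseMemObject(buf); clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```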
In this deck from the 2018 Swiss HPC Conference, Axel Koehler from NVIDIA presents: The Convergence of HPC and Deep Learning.
"The intersection of AI and HPC is extending the reach of science and accelerating the pace of scientific innovation like never before. The technology originally developed for HPC has enabled deep learning, and deep learning is enabling many usages in science. Deep learning is also helping deliver real-time results with models that used to take days or months to simulate. The presentation will give an overview about the latest hard- and software developments for HPC and Deep Learning from NVIDIA and will show some examples that Deep Learning can be combined with traditional large scale simulations."
Watch the video: https://wp.me/p3RLHQ-ijM
Learn more: http://nvidia.com
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Summit 16: Deploying Virtualized Mobile Infrastructures on Openstack - OPNFV
Service providers are evolving and competing with leaner over-the-top (OTT) providers such as Google and Amazon to provide mobile services. The future SP network has to be agile, resilient, and auto-scalable. SPs are leaning towards using COTS infrastructure, open networking (OPNFV, ONOS), and VNFs to run routers, switches, mobile gateways, firewall, NAT, and DPI functions. The session covers the design and deployment of virtualized mobile infrastructure such as the Virtual Evolved Packet Core, GiLAN, and VoLTE, as well as the 5G core. We will also cover performance fine-tuning using DPDK, SR-IOV, etc. We will present a case study using Cisco (VNF Manager and NFVO), Red Hat (NFVI), OpenStack, and block storage using Ceph technology. Participants will be able to understand the complexities of the mobile packet core, the evolution of NFV-based solutions, and an architecture framework for the 5G mobile packet core.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system with over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional multicore architectures to these manycore processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
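A minimal sketch of the kind of optimization NESAP-style porting targets (illustrative only, not NERSC code): expose thread parallelism across the many cores and vector parallelism in a unit-stride inner loop.

```cpp
// Illustrative manycore pattern: OpenMP threads across rows, SIMD within a row.
#include <omp.h>
#include <vector>

void stencil(const std::vector<float>& in, std::vector<float>& out, int nx, int ny) {
    #pragma omp parallel for            // spread rows across the many cores
    for (int j = 1; j < ny - 1; ++j) {
        #pragma omp simd                // vectorize the unit-stride inner loop
        for (int i = 1; i < nx - 1; ++i) {
            out[j * nx + i] = 0.25f * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                       in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
        }
    }
}
```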
This document proposes CNNECST, an automated framework for hardware acceleration of Convolutional Neural Networks (CNNs) on FPGAs. The framework bridges the gap between high-level ML frameworks and FPGA design. It features a modular dataflow architecture and integration with ML frameworks for specifying, training, and exporting CNN models to the FPGA. Experimental results show that CNNECST achieves significant speedups and energy efficiency gains compared to a CPU for two CNNs and datasets. Challenges include supporting more layer types and reduced precision data formats.
In this deck from the NVIDIA GPU Technology Conference, Axel Koehler presents: Inside the Volta GPU Architecture and CUDA 9.
"The presentation will give an overview about the new NVIDIA Volta GPU architecture and the latest CUDA 9 release. The NVIDIA Volta architecture powers the worlds most advanced data center GPU for AI, HPC, and Graphics. Volta features a new Streaming Multiprocessor (SM) architecture and includes enhanced features like NVLINK2 and the Multi-Process Service (MPS) that delivers major improvements in performance, energy efficiency, and ease of programmability. New features like Independent Thread Scheduling and the Tensor Cores enable Volta to simultaneously deliver the fastest and most accessible performance. CUDA is NVIDIA''s parallel computing platform and programming model. You''ll learn about new programming model enhancements and performance improvements in the latest CUDA9 release."
Watch the video: https://wp.me/p3RLHQ-iB7
Learn more: https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier - inside-BigData.com
In this deck from the Stanford Colloquium on Computer Systems Seminar, Brian Boucher from Maxeler Technologies presents: Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier.
"Maxeler Multiscale Dataflow computing is at the leading edge of energy-efficient high performance computing, providing competitive advantage in industries from energy to finance to defense. Maxeler builds the computer around the problem to maximize performance density, eliminating the elaborate caching and decoding machinery occupying most silicon in a standard processor. This talk will explain the motivation behind dataflow computing to escape the end of frequency scaling in the push to exascale machines, introduce the Maxeler dataflow ecosystem including MaxJ code and DFE hardware, and demonstrate the application of dataflow principles to a specific HPC software package (Quantum ESPRESSO)."
Watch the video: https://wp.me/p3RLHQ-hq1
Learn more: http://maxeler.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
High Performance Communication for Oracle using InfiniBand - webhostingguy
The document discusses how InfiniBand provides benefits for Oracle databases by enabling higher performance communication within Oracle Real Application Clusters (RAC). InfiniBand allows for faster block transfers, lower CPU utilization, and higher throughput compared to Gigabit Ethernet. It also supports features like remote direct memory access that improve performance of Oracle RAC operations like locking and parallel queries.
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip.
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
The document discusses parallel computing and multicore processors. It notes that Berkeley researchers believe multicore is the future of computing. It also discusses building an academic "manycore" research system using FPGAs to allow researchers to experiment with parallel algorithms, compilers, and programming models on thousands of processor cores. This would help drive innovation and avoid long waits between hardware and software iterations.
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi... - Databricks
In this talk, we will present how we analyze, predict, and visualize network quality data, as a Spark AI use case at a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea, with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, totaling 60 TB and 120 billion records per day.
To address earlier problems with Spark on HDFS, we developed a new data store for SparkSQL, consisting of Redis and RocksDB, that allows us to distribute and store these data in real time and analyze them right away. Not satisfied with analyzing network quality in real time, we also tried to predict network quality in the near future, in order to quickly detect and recover from network device failures, by designing a network-signal-pattern-aware DNN model and a new in-memory data pipeline from Spark to TensorFlow.
In addition, by integrating Apache Livy and MapboxGL with SparkSQL and our new store, we built a geospatial visualization system that shows the current population and signal strength of 300,000 cells on a map in real time.
Circuit models of transmission lines are required if they are to be used in a circuit simulator. RF and microwave engineering uses two types of simulators. Spice-like simulators use lumped-element transmission line models in which an RLGC model of a short segment of line is replicated for the length of the line. If the ground plane is treated as a universal ground, then the model of a segment of length Δz is as shown.
The IEEE WIE FUE Student Branch Affinity Group was established in January 2022 with six members and has since grown to over 30 members. The group is led by a board including a chairwoman, vice chairwoman, and heads of media, treasury, organization, and secretary. Some of the group's accomplishments include organizing a virtual event with WIE Egypt Section, an online climate change awareness campaign, two recruitment events, involvement in WIE Africa celebrations and an IEEE YP Egypt event.
Analysis of reinforced concrete deep beams is based on simplified approximate methods due to the complexity of exact analysis, which stems from the number of parameters affecting their response. To evaluate some of these parameters, a finite element study of the structural behavior of reinforced self-compacting concrete deep beams was carried out using the Abaqus finite element modeling tool. The model was validated against experimental data from the literature. The parametric effects of varying concrete compressive strength, vertical web reinforcement ratio, and horizontal web reinforcement ratio were tested on eight (8) different specimens under four-point loads. The validation results showed good agreement with the experimental studies. The parametric study revealed that concrete compressive strength most significantly influenced the specimens' response, with averages of 41.1% and 49% increases in the diagonal cracking and ultimate loads, respectively, from doubling the concrete compressive strength. Although increasing the horizontal web reinforcement ratio from 0.31% to 0.63% led to an average 6.24% increase in the diagonal cracking load, it did not influence the ultimate strength or the load-deflection response of the beams. Similar variation in the vertical web reinforcement ratio led to averages of 2.4% and 15% increases in cracking and ultimate loads, respectively, with no appreciable effect on the load-deflection response.
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...Infopitaara
A Boiler Feed Pump (BFP) is a critical component in thermal power plants. It supplies high-pressure water (feedwater) to the boiler, ensuring continuous steam generation.
⚙️ How a Boiler Feed Pump Works
Water Collection:
Feedwater is collected from the deaerator or feedwater tank.
Pressurization:
The pump increases water pressure using multiple impellers/stages in centrifugal types.
Discharge to Boiler:
Pressurized water is then supplied to the boiler drum or economizer section, depending on design.
🌀 Types of Boiler Feed Pumps
Centrifugal Pumps (most common):
Multistage for higher pressure.
Used in large thermal power stations.
Positive Displacement Pumps (less common):
For smaller or specific applications.
Precise flow control but less efficient for large volumes.
🛠️ Key Operations and Controls
Recirculation Line: Protects the pump from overheating at low flow.
Throttle Valve: Regulates flow based on boiler demand.
Control System: Often automated via DCS/PLC for variable load conditions.
Sealing & Cooling Systems: Prevent leakage and maintain pump health.
⚠️ Common BFP Issues
Cavitation due to low NPSH (Net Positive Suction Head).
Seal or bearing failure.
Overheating from improper flow or recirculation.
In the tube drawing process, a tube is pulled through a die and over a plug to reduce its diameter and thickness as per the requirement. Dimensional accuracy of cold-drawn tubes plays a vital role in the quality of end products and in controlling rejection in the manufacturing processes of these end products. The springback phenomenon, the elastic strain recovery after removal of forming loads, causes geometrical inaccuracies in drawn tubes, which makes close dimensional tolerances difficult to achieve. In the present work, springback of EN 8 D tube material is studied for various cold drawing parameters. The process parameters in this work include die semi-angle, land width, and drawing speed. The experimentation is done using Taguchi's L36 orthogonal array, and optimization is then done in the data analysis software Minitab 17. The ANOVA results show that a 15-degree die semi-angle, 5 mm land width, and 6 m/min drawing speed yield the least springback. Furthermore, the optimization algorithms Particle Swarm Optimization (PSO), Simulated Annealing (SA), and Genetic Algorithm (GA) are applied, showing that a 15-degree die semi-angle, 10 mm land width, and 8 m/min drawing speed result in minimal springback, with almost 10.5% improvement. Finally, the experimental results are validated with the finite element analysis technique using ANSYS.
Electronics Boards & Product Testing_Shiju.pdf - Shiju Jacob
This presentation provides a high-level insight into DFT analysis and test-coverage calculation, finalizing the test strategy, and the types of tests at different levels of the product.
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY - ijscai
With the increased use of Artificial Intelligence (AI) in malware analysis, there is also an increased need to understand the decisions models make when identifying malicious artifacts. Explainable AI (XAI) becomes the answer to interpreting the decision-making process that AI malware analysis models use to classify samples as malicious or benign, in order to gain trust that, in a production environment, the system is able to catch malware. Any cyber innovation brings a new set of challenges, and literature soon emerged describing XAI as a new attack vector. Adversarial XAI (AdvXAI) is a relatively new concept, but with AI applications in many sectors it is crucial to respond quickly to the attack surface it creates. This paper seeks to conceptualize a theoretical framework for addressing AdvXAI in malware analysis in an effort to balance explainability with security. Following this framework, designing a machine with an AI malware detection and analysis model will ensure that it can effectively analyze malware, explain how it came to its decision, and be built securely to avoid adversarial attacks and manipulations. The framework focuses on choosing malware datasets to train the model, choosing the AI model, choosing an XAI technique, implementing AdvXAI defensive measures, and continually evaluating the model. This framework will significantly contribute to automated malware detection and XAI efforts, allowing for secure systems that are resilient to adversarial attacks.
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf - MohamedAbdelkader115
Glad to be one of only 14 members inside Kuwait to hold this credential.
Please check the members inside Kuwait via this link:
https://www.rics.org/networking/find-a-member.html?firstname=&lastname=&town=&country=Kuwait&member_grade=(AssocRICS)&expert_witness=&accrediation=&page=1
Raish Khanji GTU 8th sem Internship Report.pdf - RaishKhanji
This report details the practical experiences gained during an internship at Indo German Tool Room, Ahmedabad. The internship provided hands-on training in various manufacturing technologies, encompassing both conventional and advanced techniques. Significant emphasis was placed on machining processes, including the operation and fundamental understanding of lathe and milling machines. Furthermore, the internship incorporated modern welding technology, notably through the application of an Augmented Reality (AR) simulator, offering a safe and effective environment for skill development. Exposure to industrial automation was achieved through practical exercises in Programmable Logic Controllers (PLCs) using Siemens TIA software and direct operation of industrial robots utilizing teach pendants. The principles and practical aspects of Computer Numerical Control (CNC) technology were also explored. Complementing these manufacturing processes, the internship included extensive application of SolidWorks software for design and modeling tasks. This comprehensive practical training has provided a foundational understanding of key aspects of modern manufacturing and design, enhancing technical proficiency and readiness for future engineering endeavors.
Lidar for Autonomous Driving, LiDAR Mapping for Driverless Cars.pptx - RishavKumar530754
This deck is all about Artificial Intelligence (AI) and Machine Learning, at an introductory rather than advanced level; you can study it before an exam or check it for information on AI for a project.
1. Deep Learning Hardware: Past, Present, and Future
CRA Snowbird Conference
July 23, 2024
Bill Dally
Chief Scientist and SVP of Research, NVIDIA Corporation
Adjunct Professor of CS and EE, Stanford
11. Blackwell B200
The Two Largest Dies Possible—Unified as One GPU
10 PetaFLOPS FP8 | 20 PetaFLOPS FP4
192GB HBM3e | 8 TB/sec HBM Bandwidth | 1.8TB/s NVLink
2 reticle-limited dies operate as One Unified CUDA GPU
NV-HBI 10 TB/s High-Bandwidth Interface. Full performance, no compromises.
4X Training | 30X Inference | 25X Energy Efficiency & TCO
Fast Memory
192GB HBM3e
12. 3D Parallelism
It takes 20 GPUs to hold one copy of the GPT-4 model parameters, so the model is split across GPUs with tensor parallelism and pipeline parallelism.
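A rough sanity check of that 20-GPU figure (my arithmetic; the ~1.8-trillion-parameter model size and 2 bytes per parameter are illustrative assumptions, not numbers from the deck): 1.8e12 params x 2 bytes ≈ 3.6 TB of weights, and 3.6 TB / 192 GB of HBM per GPU ≈ 19, i.e. about 20 GPUs just to hold one copy of the parameters.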
13. GB200 NVL72 Delivers a New Unit of Compute
36 Grace CPUs and 72 Blackwell GPUs in a fully connected NVLink switch rack.
GB200 NVL72: Training 720 PFLOPS | Inference 1.4 EFLOPS | NVL model size 27T params | Multi-node all-to-all 130 TB/s | Multi-node all-reduce 260 TB/s
14. Scale-up via NVLink and NVSwitch to 256 GPUs; scale-out via InfiniBand to 10,000s of GPUs.
Collectives double effective network bandwidth (AllReduce).
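One way to read the AllReduce claim (my gloss; the slide does not spell out the mechanism): a ring all-reduce makes each GPU transmit about 2*(N-1)/N*D ≈ 2D bytes for a D-byte buffer, while in-switch reduction lets each GPU inject its D bytes once and receive the reduced D bytes once, halving per-GPU network traffic and therefore doubling effective bandwidth for the collective.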
15. [Chart: training compute for landmark models (petaFLOPs, log scale) vs. year, 2012-2024, spanning the Kepler, Pascal, Volta, Ampere (TF32 Tensor Cores), and Hopper (Transformer Engine) generations, with system scaling via NVLink, HGX, and HBM; marked systems include Saturn V (0.6 PF), Selene (2.8 EF), and Eos (43 EF). Headline: 70,000x in 5 years.]
23. Number representations and where they are used. Candidate formats: int8 (1 sign + 7 magnitude bits), fp16 (1 sign + 5 exponent + 10 mantissa bits), log8 (1 sign + 7 exponent bits), sym8 (an 8-bit symbol), spike, and analog. A representation must serve the weight buffer, the activation buffer, storage, transport, and the multiply-accumulate operation. Attributes to weigh: cost (operation energy, movement energy) and accuracy (dynamic range, precision/error).
24. The same formats compared on dynamic range: int8 (1 sign + 7 magnitude bits), fp16 (1 sign + 5 exponent + 10 mantissa bits), a log format with a split exponent (1 sign bit, 4-bit integer exponent EI, 3-bit fractional exponent EF), sym8, spike, and analog.
25. Symbol Representation (Codebook): weights are clustered and each weight is stored as a small index into the learned codebook. [Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015] A sketch of the idea follows.
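A minimal sketch of that codebook idea (my reconstruction of trained quantization, simplified to 1-D k-means with a crude initialization; assumes k >= 2 and k <= 256):

```cpp
// Cluster weights with 1-D k-means; store each weight as a codebook index.
#include <vector>
#include <cmath>
#include <cstdint>

struct Codebook { std::vector<float> centroids; std::vector<uint8_t> indices; };

Codebook cluster_weights(const std::vector<float>& w, int k, int iters = 20) {
    Codebook cb;
    cb.centroids.resize(k);
    for (int c = 0; c < k; ++c)                       // crude linear init in [-1, 1]
        cb.centroids[c] = -1.0f + 2.0f * c / (k - 1);
    cb.indices.resize(w.size());
    for (int it = 0; it < iters; ++it) {
        std::vector<double> sum(k, 0.0); std::vector<int> cnt(k, 0);
        for (size_t i = 0; i < w.size(); ++i) {       // assign to nearest centroid
            int best = 0; float bestd = std::fabs(w[i] - cb.centroids[0]);
            for (int c = 1; c < k; ++c) {
                float d = std::fabs(w[i] - cb.centroids[c]);
                if (d < bestd) { bestd = d; best = c; }
            }
            cb.indices[i] = (uint8_t)best; sum[best] += w[i]; cnt[best]++;
        }
        for (int c = 0; c < k; ++c)                   // recompute centroids
            if (cnt[c]) cb.centroids[c] = (float)(sum[c] / cnt[c]);
    }
    return cb;  // weight i is then approximated by centroids[indices[i]]
}
```

With k = 16 each weight becomes a 4-bit symbol, which is the flavor of "sym" format the earlier slide lists.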
29. Log4.3 (1 sign bit, 4-bit integer exponent EI, 3-bit fractional exponent EF):
• Dynamic range 10^5, worst-case accuracy 4%
• Vs. Int8: dynamic range 10^2, worst-case accuracy 33%
• Vs. FP8 (E4M3): dynamic range 10^5, worst-case accuracy 6%
30. [Plots: closest representable value vs. actual value over 1-15 for a 4-bit integer representation (Int4, max error 33%) and a 4-bit log representation (L2.2, max error 9%).]
31. [Plots: closest representable value vs. actual value over 1-15 for the 4-bit log representation (L2.2, max error 9%) and a 4-bit float representation (FP2.2, max error 13%).]
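Those worst-case figures can be reproduced in a few lines (a sketch under my reconstruction of the formats: Int4 as the integers 1-15, L2.2 as 2^(k/4) for k = 0..15):

```cpp
// Worst-case relative rounding error for Int4 vs. a 4-bit log format.
#include <cmath>
#include <cstdio>
#include <vector>

// The worst case sits midway between neighboring representable points.
double worst_rel_error(const std::vector<double>& pts) {
    double worst = 0.0;
    for (size_t i = 0; i + 1 < pts.size(); ++i) {
        double mid = 0.5 * (pts[i] + pts[i + 1]);   // hardest input to round
        double err = (mid - pts[i]) / mid;          // distance to nearest point
        if (err > worst) worst = err;
    }
    return worst;
}

int main() {
    std::vector<double> int4, log22;
    for (int k = 1; k <= 15; ++k) int4.push_back(k);
    for (int k = 0; k <= 15; ++k) log22.push_back(std::pow(2.0, k / 4.0));
    std::printf("Int4 worst-case error: %.0f%%\n", 100 * worst_rel_error(int4));  // ~33%
    std::printf("L2.2 worst-case error: %.0f%%\n", 100 * worst_rel_error(log22)); // ~9%
}
```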
32. • Log Numbers
• Multiplies are cheap – just an add
• Adds are hard – convert to integer, add, convert back
• The fractional part of the log is a lookup
• The integer part of the log is a shift
• The lookup can be factored outside the summation
• Only convert back after the summation (and the nonlinear function)
(Format: 1 sign bit, 4-bit integer exponent EI, 3-bit fractional exponent EF; a sketch of the factoring trick follows.)
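A minimal sketch of that factoring trick (my reconstruction; exponents are assumed nonnegative and the sign bit is omitted for clarity). With a 2-bit fractional exponent there are only four lookup constants 2^(f/4), so partial sums can be accumulated per fraction and each lookup applied once after the summation:

```cpp
// Log-domain MAC: multiply = exponent add; sums accumulate per fractional bin.
#include <cmath>

// x[i] = 2^(ex[i]/4), w[i] = 2^(ew[i]/4); ex, ew hold 4x the log2 value.
double log_mac(const int* ex, const int* ew, int n) {
    long long bins[4] = {0, 0, 0, 0};        // one integer accumulator per fraction
    for (int i = 0; i < n; ++i) {
        int e = ex[i] + ew[i];               // log-domain multiply = add exponents
        bins[e & 3] += 1LL << (e >> 2);      // integer part: shift; fraction: bin
    }
    double sum = 0.0;
    for (int f = 0; f < 4; ++f)              // lookup 2^(f/4) applied once per bin
        sum += std::pow(2.0, f / 4.0) * (double)bins[f];
    return sum;                              // convert back to linear only here
}
```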
33. Datapath for the log format (sign S, 4-bit integer exponent EI, 3-bit fractional exponent EF). Patent Application US2021/0056446A1.
39. VS-Quant: per-vector scaled quantization for low-precision inference [Dai et al., MLSYS 2021]. Fine-grained scale factors per vector, with a modified vector MAC unit for VS-Quant. Works with either post-training quantization or quantization-aware retraining!
40. Traditional Quantization vs. VSQ: traditional quantization uses one scale factor per matrix, set by the matrix-wide min/max values, which yields high quantization noise. VSQ uses two scale factors - one per 64-element input vector plus one per input matrix - so scaling tracks each vector's own min/max and quantization noise is reduced. [Diagram: an FP32 data distribution mapped onto the INT4 range -8..7, and an MxK by KxN matrix multiply with one scale factor for each 64-element input vector and a second scale factor for each input matrix.] A sketch of the per-vector step follows.
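A minimal sketch of the per-vector scaling step (my reconstruction of the idea, not the paper's implementation; the 64-element vector size follows the slide):

```cpp
// Per-vector scaled INT4 quantization: each 64-element vector gets its own
// scale, so outliers in one vector don't inflate the whole matrix's step size.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct VsqVector { float scale; int8_t q[64]; };   // int4 values held in int8

VsqVector quantize_vector(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < 64; ++i) amax = std::max(amax, std::fabs(x[i]));
    VsqVector v;
    v.scale = (amax > 0.0f) ? amax / 7.0f : 1.0f;  // map [-amax, amax] to [-7, 7]
    for (int i = 0; i < 64; ++i) {
        int q = (int)std::lround(x[i] / v.scale);
        v.q[i] = (int8_t)std::clamp(q, -8, 7);     // 4-bit signed range
    }
    return v;  // dequantize as x[i] ~= scale * q[i]
}

// A second, per-matrix scale factor can then make the per-vector scales
// themselves low-precision, as the slide describes.
```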
46. Accelerators Employ:
• Special data types and operations – do in 1 cycle what normally takes 10s or 100s of cycles, a 10-1000x efficiency gain
• Massive parallelism (>1,000x, not 16x) with locality – this gives performance, not efficiency
• Optimized memory – high bandwidth (and low energy) for specific data structures and operations
• Reduced or amortized overhead – up to 10,000x efficiency gain for simple operations
• Algorithm-architecture co-design
47. Fast Accelerators since 1985
• Mossim Simulation Engine: Dally, W.J. and Bryant, R.E., 1985. A hardware architecture for switch-level simulation. IEEE Trans. CAD, 4(3), pp. 239-250.
• MARS Accelerator: Agrawal, P. and Dally, W.J., 1990. A hardware logic simulation system. IEEE Trans. CAD, 9(1), pp. 19-29.
• Reconfigurable Arithmetic Processor: Fiske, S. and Dally, W.J., 1988. The reconfigurable arithmetic processor. ISCA 1988.
• Imagine: Kapasi, U.J., Rixner, S., Dally, W.J., Khailany, B., Ahn, J.H., Mattson, P. and Owens, J.D., 2003. Programmable stream processors. Computer, 36(8), pp. 54-62.
• ELM: Dally, W.J., Balfour, J., Black-Shaffer, D., Chen, J., Harting, R.C., Parikh, V., Park, J. and Sheffield, D., 2008. Efficient embedded computing. Computer, 41(7).
• EIE: Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016. EIE: efficient inference engine on compressed deep neural network. ISCA 2016.
• SCNN: Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W. and Dally, W.J., 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ISCA 2017.
• Darwin: Turakhia, Bejerano, and Dally, "Darwin: A Genomics Co-processor provides up to 15,000x acceleration on long read assembly," ASPLOS 2018.
• SATiN: Zhuo, Rucker, Wang, and Dally, "Hardware for Boolean Satisfiability Inference,"
48. Eliminating Instruction Overhead
An OOO CPU instruction costs ~250 pJ (99.99% overhead, ARM A-15), while a 16b integer add costs 32 fJ. Area is proportional to energy (all 28nm).
[Evangelos Vasilakis. 2015. An Instruction Level Energy Characterization of Arm Processors. Foundation of Research and Technology Hellas, Inst. of Computer Science, Tech. Rep. FORTH-ICS/TR-450 (2015)]
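The arithmetic behind the overhead figure: 250 pJ per instruction / 0.032 pJ per 16-bit add ≈ 7,800, so the useful add accounts for roughly 0.013% of the instruction's energy; the remaining ~99.99% goes to fetch, decode, rename, and scheduling.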
49. Cost of Operations

Operation              Energy (pJ)   Area (um^2)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)," ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare Library. [Chart: relative energy and area costs on log scales from 1 to 10,000.]
50. The Importance of Staying Local
LPDDR DRAM (GB): 640 pJ/word
On-Chip SRAM (MB): 50 pJ/word
Local SRAM (KB): 5 pJ/word
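A back-of-envelope check using those figures (a sketch; the 1M-MAC dot product and two-operand-words-per-MAC assumption are mine):

```cpp
// Energy to stream two n-word operand vectors for a dot product,
// depending on where the operands live (pJ/word figures from the slide).
#include <cstdio>

int main() {
    const double kDram = 640.0, kSram = 50.0, kLocal = 5.0;  // pJ per 32b word
    const double n = 1e6, words = 2 * n;                     // two operands per MAC
    std::printf("LPDDR DRAM:  %.0f uJ\n", words * kDram * 1e-6);   // 1280 uJ
    std::printf("On-chip SRAM: %.0f uJ\n", words * kSram * 1e-6);  // 100 uJ
    std::printf("Local SRAM:   %.0f uJ\n", words * kLocal * 1e-6); // 10 uJ
    // The 1e6 MACs themselves cost ~3.7 uJ at 3.7 pJ per 32b FP multiply,
    // so feeding them from DRAM dominates by ~350x - hence "stay local".
}
```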
52. Energy-efficient DL inference accelerator: Transformers, VS-Quant INT4, TSMC 5nm
• Efficient architecture: used MAGNet [Venkatesan et al., ICCAD 2019] to design a low-precision DL inference accelerator for Transformers, with a multi-level dataflow to improve data reuse and energy efficiency
• Low-precision data format: VS-Quant INT4, with hardware-software techniques to tolerate quantization error; enables low-cost multiply-accumulate (MAC) operations, reduces storage and data movement; special function units
• TSMC 5nm: 1024 4-bit MACs/cycle (512 8-bit); 0.153 mm2 chip; voltage range 0.46V-1.05V; frequency range 152 MHz-1760 MHz
• 95.6 TOPS/W with 50%-dense 4-bit input matrices with VSQ enabled at 0.46V; 0.8% energy overhead from VSQ support with 50%-dense inputs at 0.67V
[Keller, Venkatesan, et al., "A 95.6-TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization in 5nm," JSSC 2023]
55. Conclusion
• Deep learning was enabled by hardware and its progress is limited by hardware: 1000x in the last 10 years via number representation, complex ops, and sparsity
• Logarithmic numbers: lowest worst-case error for a given number of bits; can 'factor out' the hard parts of an add
• Optimum clipping: minimize MSE by trading quantization noise for clipping noise (see the sketch below)
• VS-Quant: separate scale factor for each small vector – 16 to 64 scalars
• Accelerators are testbeds for GPU 'cores': the test chip validates concepts and measures efficiency – 95.6 TOPS/W on BERT with negligible accuracy loss
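A minimal sketch of optimum clipping (my reconstruction: a brute-force sweep of the clip threshold alpha that minimizes quantization MSE on sample data; a tighter alpha gives finer steps but more clipping noise on outliers):

```cpp
// Pick the clip threshold that minimizes MSE of clip-then-quantize.
#include <algorithm>
#include <cmath>
#include <vector>

double quant_mse(const std::vector<float>& x, double alpha, int levels = 16) {
    double step = 2.0 * alpha / (levels - 1), mse = 0.0;   // e.g. 4-bit: 16 levels
    for (float v : x) {
        double c = std::clamp((double)v, -alpha, alpha);   // clipping noise
        double q = std::round(c / step) * step;            // quantization noise
        q = std::clamp(q, -alpha, alpha);                  // stay on the grid ends
        mse += (q - v) * (q - v);
    }
    return mse / x.size();
}

double best_alpha(const std::vector<float>& x) {
    double amax = 0.0;
    for (float v : x) amax = std::max(amax, (double)std::fabs(v));
    if (amax == 0.0) return 0.0;                           // degenerate input
    double best = amax, bestMse = quant_mse(x, amax);
    for (double a = 0.05 * amax; a < amax; a += 0.05 * amax) {
        double m = quant_mse(x, a);
        if (m < bestMse) { bestMse = m; best = a; }
    }
    return best;   // clip to [-best, +best] before quantizing
}
```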
[Chart: single-chip int8 inference performance, 317x in 8 years – 3.94, 6.84, 21.20, 125, 261, and 1248 TOPS between April 2012 and October 2021.]
[Diagram: a clipped number format (sign S, integer exponent EI, fractional exponent EF) with clip thresholds at −α and +α.]