We ran a 50k-GPU multi-cloud simulation in support of IceCube science. This talk provides an overview of what happened to the associated data.
Presented at the Internet2 booth at SC19.
NRP Engagement webinar - Running a 51k GPU multi-cloud burst for MMA with Ic... (Igor Sfiligoi)
NRP Engagement webinar: Description of the 380 PFLOP32s, 51k GPU multi-cloud burst using HTCondor to run the IceCube photon propagation simulation.
Presented January 27th, 2020.
This document discusses a large-scale GPU-based cloud burst simulation run by the IceCube collaboration to calibrate simulations of natural ice. The simulation was data-intensive, producing over 130 TB of data and exceeding 10 Gbps of egress bandwidth. Internet2 Cloud Connect service was used to provision over 20 dedicated network links between collaborators' institutions and cloud providers to enable high-throughput data transfer at a lower cost than commercial routes. Careful planning was required to smoothly ramp up the burst and avoid overloading individual network links.
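For context on why multiple dedicated links mattered, here is a rough back-of-the-envelope estimate of ideal transfer times for the 130 TB produced; the sustained rates other than the 10 Gbps figure are illustrative assumptions, not numbers from the talk.

```python
# Back-of-the-envelope transfer-time estimate; only the 130 TB total comes from
# the summary above, the sustained rates are illustrative assumptions.
TOTAL_BYTES = 130e12  # ~130 TB produced by the burst

def hours_to_move(total_bytes, gbps):
    """Ideal transfer time at a sustained rate, ignoring protocol overhead."""
    return total_bytes * 8 / (gbps * 1e9) / 3600

for rate in (10, 100, 200):  # one 10 Gbps path vs. aggregates over many links
    print(f"{rate:>3} Gbps sustained -> {hours_to_move(TOTAL_BYTES, rate):5.1f} hours")
```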
"Building and running the cloud GPU vacuum cleaner"Frank Wuerthwein
This talk, describing the "Largest Cloud Simulation in History" (Jensen Huang at SC19), was given at the MAGIC meeting on Dec. 4th 2019. MAGIC stands for "Middleware and Grid Interagency Cooperation", and is a group within NITRD. Current federal agencies that are members of MAGIC include DOC, DOD, DOE, HHS, NASA, and NSF.
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ... (Igor Sfiligoi)
- IceCube is a neutrino observatory that detects high-energy neutrinos from astrophysical sources to study violent cosmic events. It uses over 5000 optical sensors buried in Antarctic ice to detect neutrinos.
- A cloud burst was performed using over 50,000 GPUs across multiple cloud providers worldwide to simulate photon propagation through ice for IceCube data analysis. This was the largest cloud simulation to date and demonstrated the ability to burst compute at pre-exascale scale.
- The simulation helped improve IceCube's neutrino detection and pointing resolution to identify the first known source of high-energy neutrinos, a blazar, demonstrating IceCube's potential for multi-messenger astrophysics.
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie... (Igor Sfiligoi)
Presented at PEARC20.
This talk presents the expansion of IceCube’s production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained for a whole workday about 15k GPUs, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
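As a quick sanity check of those headline numbers (the eight-hour workday is an assumption on my part; the 170 PFLOP32s and $60k figures come from the abstract above):

```python
# Rough consistency check of the abstract's numbers; the 8-hour workday is an
# assumption, the other figures are taken from the text above.
pflops_sustained = 170      # fp32 PFLOPS sustained
hours = 8                   # assumed length of "a whole workday"
cost_usd = 60_000

eflop32_hours = pflops_sustained * hours / 1000
print(f"Integrated compute: ~{eflop32_hours:.2f} EFLOP32 hours")  # indeed over one
print(f"Effective price: ~${cost_usd / eflop32_hours:,.0f} per EFLOP32 hour")
```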
This document introduces SkyhookDM, a system that offloads computation from clients to storage nodes. It does this by embedding Apache Arrow data access libraries inside Ceph object storage devices (OSDs). This allows large Parquet files to be scanned and processed directly on the OSDs without needing to move all the data to clients. Experiments show SkyhookDM reduces latency, CPU usage, and network traffic compared to traditional approaches. It has also been integrated with the Coffea analysis framework. Ongoing work involves optimizing Arrow serialization for network transfers.
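To illustrate the kind of projection and predicate pushdown that SkyhookDM moves into the Ceph OSDs, here is a minimal sketch using stock PyArrow on a local file; the file name, columns, and filter are hypothetical, and this is not the SkyhookDM API itself.

```python
# Sketch of the projection and predicate pushdown that SkyhookDM performs on the
# Ceph OSDs; here it runs locally with stock PyArrow, so the file name, columns
# and filter are hypothetical and this is NOT the SkyhookDM API itself.
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("events.parquet", format="parquet")  # hypothetical Parquet file
table = dataset.to_table(
    columns=["run", "event", "energy"],   # projection: read only these columns
    filter=pc.field("energy") > 100.0,    # predicate: evaluated during the scan
)
print(table.num_rows)
```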
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ... (Frank Wuerthwein)
- The document describes running a GPU burst simulation for IceCube astrophysics research across 50,000 NVIDIA GPUs in multiple cloud platforms globally, achieving 350 petaflops for 2 hours.
- IceCube detects high-energy neutrinos to study violent astrophysical events by observing the interactions of neutrinos within a cubic kilometer of Antarctic ice instrumented with sensors.
- The GPU burst simulation campaign helped improve IceCube's ability to reconstruct neutrino direction and energy and identify astrophysical sources through multi-messenger astrophysics.
Using A100 MIG to Scale Astronomy Scientific Output (Igor Sfiligoi)
The document discusses how Nvidia's A100 GPU with Multi-Instance GPU (MIG) capability can help scale up scientific output for astronomy projects like IceCube and LIGO. The A100 is much faster than previous GPUs, but MIG allows it to be partitioned so multiple jobs or processes can leverage the GPU simultaneously. This results in 200-600% higher throughput compared to using a single GPU, by better utilizing the massive parallelism of the A100. MIG makes the powerful A100 GPU practical for these CPU-bound scientific workloads.
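A simple way to see where the quoted gains come from is the sharing arithmetic below; the per-slice relative speed is an assumed example value, not a measurement from the talk.

```python
# Illustrative MIG sharing arithmetic; the per-slice relative speed is an assumed
# example value, not a measurement from the talk.
mig_slices = 7                   # an A100 can be partitioned into up to 7 instances
per_slice_relative_speed = 0.6   # assume a job on a 1/7 slice runs at 60% of the
                                 # speed it reaches when it (under)uses the whole GPU

aggregate_gain = mig_slices * per_slice_relative_speed
print(f"Aggregate throughput vs. one job per whole A100: ~{aggregate_gain:.1f}x")
# ~4.2x here, i.e. within the 200-600% improvement range quoted above
```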
Using commercial Clouds to process IceCube jobs (Igor Sfiligoi)
Presented at EDUCAUSE CCCG March 2021.
The IceCube Neutrino Observatory is the world’s premier facility to detect neutrinos.
Built at the South Pole in natural ice, it requires extensive and expensive calibration to properly track the neutrinos.
Most of the required compute power comes from on-prem resources through the Open Science Grid,
but IceCube can easily harness Cloud compute at any scale, too, as demonstrated by a series of Cloud bursts.
This talk provides both details of the performed Cloud bursts and some insight into the science itself.
Managing Cloud networking costs for data-intensive applications by provisioni... (Igor Sfiligoi)
Presented at PEARC21.
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to lower the networking costs, but they do add complexity. In this paper we provide a description of a 100 fp32 PFLOPS Cloud burst in support of IceCube production compute that used the Internet2 Cloud Connect service to provision several logically-dedicated network links from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and Google Cloud Platform, which in aggregate enabled approximately 100 Gbps egress capability to on-prem storage. It provides technical details about the provisioning process, the benefits and limitations of such a setup and an analysis of the costs incurred.
This is the keynote talk fkw gave at CloudNet 2020. It covers all three cloud bursts we did. As of early 2021, slides 26ff are still the most detailed documentation of the 3rd cloud burst. This material will be covered in a future conference paper.
Analyzing Larger Raster Data in a Jupyter Notebook with GeoPySpark on AWS - FO... (Rob Emanuele)
This document outlines a presentation on analyzing large raster data in a Jupyter notebook with GeoPySpark on AWS. The presentation covers introductory material, exercises on working with land cover and Landsat imagery data, combining data layers to detect crop cycles, and combining different data types to create maps. It discusses where the notebooks are running, data sources, and GeoPySpark capabilities like working with space-time raster data. Attendees are encouraged to tweet maps created during the exercises.
The OpenStack Cloud at CERN - OpenStack Nordic (Tim Bell)
The document discusses the CERN OpenStack cloud, which provides compute resources for the Large Hadron Collider experiment at CERN. It details the scale of the cloud, including over 6,700 hypervisors, 190,000 cores, and 20,000 VMs. It also describes the various use cases served, wide range of hardware, and operations of the cloud, including a retirement campaign and network migration to Neutron.
inGeneoS: Intercontinental Genetic sequencing over trans-Pacific networks and... (Andrew Howard)
This document summarizes an international collaboration between the National Computational Infrastructure (NCI) in Australia and A*Star in Singapore to accelerate DNA analysis. The collaboration utilizes trans-Pacific extended InfiniBand networks and supercomputers to:
1) Transfer large genetic sequence datasets from NCI in Canberra to A*Star in Singapore for analysis on the A*Star Aurora system and return results.
2) Utilize NCI's InfiniCloud HPC system for visualization of genetic data results produced by Aurora.
3) Demonstrate long distance high-speed data transfers between Australia and Singapore leveraging extended InfiniBand networks.
Tim Bell from CERN gave a presentation on "Understanding the Universe through Clouds" at OpenStack UK Days on September 26th, 2017. Some key points:
- CERN operates one of the world's largest private OpenStack clouds to support the Large Hadron Collider, with over 8000 hypervisors and 33,000 VMs.
- The Worldwide LHC Computing Grid distributes and analyzes LHC data across 600 PB of storage and 750k CPU cores at 170 sites in 42 countries.
- CERN has been an early adopter of OpenStack technologies like Nova, Glance, Horizon, and Neutron since 2011 and contributes code back to the community.
- New services like Mag
The document discusses OpenStack at CERN. It provides details on:
- OpenStack has been in production at CERN for 3 years, managing over 190,000 cores and 7,000 hypervisors.
- Major cultural and technology changes were required and have been successfully addressed to transition to OpenStack.
- Contributing back to the upstream OpenStack community has led to sustainable tools and effective technology transfer.
This document summarizes Tim Bell's presentation on OpenStack at CERN. It discusses how CERN adopted OpenStack in 2011 to manage its growing computing infrastructure needs for processing massive data sets from the Large Hadron Collider. OpenStack has since been scaled up to manage over 300,000 CPU cores and 500,000 physics jobs per day across CERN's private cloud. The document also briefly outlines CERN's use of other open source technologies like Ceph and Kubernetes.
This document discusses OpenStack cloud computing at CERN. It notes that CERN has 4 OpenStack clouds with over 120,000 cores total, and is migrating to the Kilo release of OpenStack. It then describes OpenStack components like Keystone for authentication, Glance for images, Nova for compute, and Cinder for block storage. The document outlines how OpenStack supports federated identity through options like Active Directory, OpenID Connect, and SAML. It provides examples of how federation could allow access to external clouds and shares experiences in deploying federated OpenStack.
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB (InfluxData)
European XFEL are the creators of the strongest x-ray beam in the world. Their 3.4-km long X-ray free-electron laser underground tunnel is used by researchers from around the world. Scientists use their facilities to map atomic details of viruses, film chemical reactions, and study the processes in the interior of planets. Discover how European XFEL uses InfluxDB to monitor their scientific experiments and research.
In this webinar, Alessandro Silenzi will dive into:
- European XFEL’s approach to empowering the worldwide community to push the boundaries of science
- The evolution of their data management solution — from homegrown to InfluxDB
- How a time series platform is used to analyze and validate experiment data
CERN operates the largest particle physics laboratory in the world. It manages over 8,000 servers to support its research. In 2012, CERN recognized limits with its existing infrastructure management tools and formed a team to define a new "Agile Infrastructure Project." The project goals were to improve resource provisioning time, enable cloud interfaces, improve monitoring and accounting, and boost efficiency. The team adopted open source tools like OpenStack, Puppet, and Ceph to create a new cloud service spanning two data centers. This allowed on-demand provisioning in minutes versus months and helped CERN better support its expanding computing needs for research.
CERN operates the largest machine on Earth, the Large Hadron Collider (LHC), which produces over 1 billion collisions per second and records over 0.5 petabytes of data per day. CERN relies heavily on OpenStack, with over 190,000 CPU cores and 5,000 VMs under OpenStack management, accounting for over 90% of CERN's computing resources. CERN plans to add over 100,000 more CPU cores in the next 6 months and explores using public clouds and containers to help process the massive amount of data generated by the LHC.
The document discusses the evolution of Ceilometer, an OpenStack project that collects measurements from deployed clouds and persists the data for later retrieval and analysis. It describes how Ceilometer has scaled out its data collection capabilities over time by adding agents, partitioning workloads, and integrating with Gnocchi to provide more efficient time-series storage. The document also provides best practices for Ceilometer deployment and configuration to optimize data collection, storage and querying.
This document describes XeMPUPiL, a performance-aware power capping orchestrator for the Xen hypervisor. It aims to maximize performance under a power cap using a hybrid approach. The key challenges addressed are instrumentation-free workload monitoring and balancing hardware and software power management techniques. Experimental results show XeMPUPiL outperforms a pure hardware approach for I/O, memory, and mixed workloads by better balancing efficiency and timeliness. Future work includes integrating the orchestrator logic into the scheduler and exploring new resource assignment policies.
CERN OpenStack Cloud Control Plane - From VMs to K8s (Belmiro Moreira)
CERN is the home of the Large Hadron Collider (LHC), a 27km circular proton accelerator that generates petabytes of physics data every year. To process all this data, CERN runs an OpenStack Cloud (>300K cores) that helps scientists all around the world to unveil the mysteries of the Universe. The Infrastructure is also used to run all the IT services of the Organization.
Delivering these services, with high performance and reliable service levels has been one of the major challenges for the CERN Cloud engineering team. We have been constantly iterating the architecture and deployment model of the Cloud control plane.
In this presentation we will describe the different control plane architecture models that we relied on over the years. Finally, we will describe all the work done to move the OpenStack Cloud control plane from VMs into a Kubernetes cluster. We will report on our experience running this architecture at scale, its advantages and challenges.
The document summarizes Dr. Larry Smarr's presentation on the Pacific Research Platform (PRP) and its role in working toward a national research platform. It describes how PRP has connected research teams and devices across multiple UC campuses for over 15 years. It also details PRP's innovations like Flash I/O Network Appliances (FIONAs) and use of Kubernetes to manage distributed resources. Finally, it outlines opportunities to further integrate PRP with the Open Science Grid and expand the platform internationally through partnerships.
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...InfluxData
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB Helps Vera C. Rubin Observatory Make the Deepest, Widest Image of the Universe | InfluxDays Virtual Experience NA 2020
Toward a Global Interactive Earth Observing Cyberinfrastructure (Larry Smarr)
The document discusses the need for a new generation of cyberinfrastructure to support interactive global earth observation. It outlines several prototyping projects that are building examples of systems enabling real-time control of remote instruments, remote data access and analysis. These projects are driving the development of an emerging cyber-architecture using web and grid services to link distributed data repositories and simulations.
Detecting solar farms with deep learning (Jason Brown)
Talk delivered at Free and Open Source Software for Geo North America 2019 (FOSS4GNA)
Large scale solar arrays or farms have been installed globally faster than can be reliably tracked by interested stakeholders. We have built a deep learning model with Sentinel 2 satellite imagery that allows us to create accurate, timely global maps of solar farms.
Accelerating Astronomical Discoveries with Apache Spark (Databricks)
Our research group is investigating how to leverage Apache Spark (batch, streaming & real-time) to analyse current and future data sets in astronomy. Among the future large experiments, the Large Synoptic Survey Telescope (LSST) will soon start collecting terabytes of data per observation night, and the efficient processing and analysis of both real-time and historical data remains a major challenge. In this talk we will present the main challenges and explore the latest developments tailored for big data problems in astronomy.
On the one hand, we designed a new Data Source API extension to natively manipulate telescope images and astronomical tables within Apache Spark. We then extended the functionalities of the Apache Spark SQL module to ease the manipulation of 3D data sets and perform efficient queries: partitioning, data set joins and cross-matching, nearest neighbors search, spatial queries, and more.
On the other hand, we are using the new possibilities offered by Structured Streaming APIs in recent Apache Spark versions to enable real-time decisions by rapidly accessing and analysing the alerts sent by telescopes every night. Given the unprecedented precision of the next generation of telescopes, the alert streams will carry millions of alerts per night, and relying on Structured Streaming is a guarantee of not missing the latest Black Hole event in a sea of data! We will also share active learning developments used on top to improve real-time event selection and classification for the LSST telescope.
You will walk away with an understanding of modern challenges in astronomy, an appreciation of some beautiful night skies, and a sense of how Apache Spark can help push the frontiers of science further!
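As a generic illustration of the cell-based cross-match idea mentioned above (not the actual Data Source API or SQL extension from the talk), a plain PySpark sketch might look like this; the column names and the coarse one-degree cell index are assumptions standing in for a proper spatial index such as HEALPix.

```python
# Minimal cell-based cross-match sketch in plain PySpark; this is NOT the Data
# Source API / SQL extension described above. Column names and the coarse
# 1-degree cell index are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("crossmatch-sketch").getOrCreate()

cat_a = spark.createDataFrame([(1, 10.2, -5.7), (2, 210.9, 33.1)], ["id", "ra", "dec"])
cat_b = spark.createDataFrame([(7, 10.3, -5.6), (9, 180.0, 12.0)], ["id", "ra", "dec"])

def with_cell(df):
    # Bucket sources into 1 deg x 1 deg cells so candidate pairs meet in a join.
    return df.withColumn("cell", F.concat_ws("_", F.floor("ra"), F.floor("dec")))

matches = (with_cell(cat_a).alias("a")
           .join(with_cell(cat_b).alias("b"), on="cell")
           .select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b")))
matches.show()
```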
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic... (Databricks)
The physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics with increased interactivity. The physics data itself is stored in CERN’s mass storage system, EOS, and CERN’s IT department runs an on-premise private cloud based on OpenStack as a way to provide on-demand compute resources to physicists. This provides both opportunities and challenges for the Big Data team at CERN in providing elastic, scalable, reliable spark-as-a-service on OpenStack.
The talk focuses on the design choices made and challenges faced while developing spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of the users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis and accounting of Spark applications, and support for stateful Spark streaming applications. We will also share results from running large scale sustained workloads over terabytes of physics data.
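For flavor, a minimal PySpark-on-Kubernetes session configuration might look like the sketch below; the master URL, container image, namespace, and bucket path are placeholders, and this is not CERN's actual service tooling.

```python
# Generic PySpark-on-Kubernetes configuration sketch; the master URL, image,
# namespace and bucket path are placeholders, not CERN's actual service tooling.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("physics-analysis")
    .master("k8s://https://kubernetes.example.org:6443")  # cluster API endpoint
    .config("spark.kubernetes.container.image", "registry.example.org/spark-py:3.5.0")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "8")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

df = spark.read.parquet("s3a://physics-bucket/events/")  # e.g. data staged out of mass storage
print(df.count())
spark.stop()
```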
NASA Advanced Computing Environment for Science & Engineering (inside-BigData.com)
In this deck from the 2017 Argonne Training Program on Extreme-Scale Computing, Rupak Biswas from NASA presents: NASA Advanced Computing Environment for Science & Engineering.
""High performance computing is now integral to NASA’s portfolio of missions to pioneer the future of space exploration, accelerate scientific discovery, and enable aeronautics research. Anchored by the Pleiades supercomputer at NASA Ames Research Center, the High End Computing Capability (HECC) Project provides a fully integrated environment to satisfy NASA’s diverse modeling, simulation, and analysis needs. In addition, HECC serves as the agency’s expert source for evaluating emerging HPC technologies and maturing the most appropriate ones into the production environment. This includes investigating advanced IT technologies such as accelerators, cloud computing, collaborative environments, big data analytics, and adiabatic quantum computing. The overall goal is to provide a consolidated bleeding-edge environment to support NASA's computational and analysis requirements for science and engineering applications."
Dr. Rupak Biswas is currently the Director of Exploration Technology at NASA Ames Research Center, Moffett Field, Calif., and has held this Senior Executive Service (SES) position since January 2016. In this role, he is in charge of planning, directing, and coordinating the technology development and operational activities of the organization, which comprises advanced supercomputing, human systems integration, intelligent systems, and entry systems technology. The directorate consists of approximately 700 employees with an annual budget of $160 million, and includes two of NASA’s critical and consolidated infrastructures: the arc jet testing facility and the supercomputing facility. He is also the Manager of the NASA-wide High End Computing Capability Project that provides a full range of advanced computational resources and services to numerous programs across the agency. In addition, he leads the emerging quantum computing effort for NASA.
Watch the video: https://wp.me/p3RLHQ-hua
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How HPC and large-scale data analytics are transforming experimental science (inside-BigData.com)
In this deck from DataTech19, Debbie Bard from NERSC presents: Supercomputing and the scientist: How HPC and large-scale data analytics are transforming experimental science.
"Debbie Bard leads the Data Science Engagement Group NERSC. NERSC is the mission supercomputing center for the USA Department of Energy, and supports over 7000 scientists and 700 projects with supercomputing needs. A native of the UK, her career spans research in particle physics, cosmology and computing on both sides of the Atlantic. She obtained her PhD at Edinburgh University, and has worked at Imperial College London as well as the Stanford Linear Accelerator Center (SLAC) in the USA, before joining the Data Department at NERSC, where she focuses on data-intensive computing and research, including supercomputing for experimental science and machine learning at scale."
Watch the video: https://wp.me/p3RLHQ-kLV
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Science and Cyberinfrastructure in the Data-Dominated Era (Larry Smarr)
10.02.22
Invited talk
Symposium #1610, How Computational Science Is Tackling the Grand Challenges Facing Science and Society
Title: Science and Cyberinfrastructure in the Data-Dominated Era
San Diego, CA
The Pacific Research Platform Two Years In (Larry Smarr)
This document provides an overview of the Pacific Research Platform (PRP) after two years of operation. It describes several science drivers that are using the PRP, including biomedical research on cancer genomics and microbiomes, earth sciences like earthquake modeling, and astronomy. It highlights how the PRP is connecting sites like UC San Diego, UC Santa Cruz, and UC Berkeley to share and analyze large datasets using high-speed networks. The PRP is expanding to support new areas like deep learning, cultural heritage projects, and connecting additional UC campuses through network upgrades.
Statistical estimation and inference for large data sets require computationally efficient optimization methods. Remote sensing retrievals are, in fact, estimates of the underlying true state, and their optimization routines must necessarily make compromises in order to keep up with large data volumes. A sub-group of the Remote Sensing Working Group of the SAMSI Program on Mathematical and Statistical Methods for Climate and the Earth System is investigating how optimization in Bayesian-inspired retrievals and off-line statistical methods could be made more computationally efficient. We will report on discussions held to-date and describe how progress in the theory of data systems research can positively impact optimization methodologies.
The NASA Nebula Project provides a cloud computing platform that addresses NASA's challenge of a fragmented and inefficient IT environment. Nebula offers scalable computing resources that researchers can access easily to perform data processing and analysis. This overcomes limitations of local servers and supercomputers. Early users report being able to accomplish more data-intensive work faster using Nebula. The platform is based on OpenStack, an open source cloud software project.
The document discusses the CERN OpenStack cloud, which provides compute resources for the Large Hadron Collider experiment. Some key points:
- CERN operates a large OpenStack cloud with over 200,000 cores across 4 clouds to provide resources for particle physics experiments like the LHC.
- The LHC is the largest machine on Earth, spanning 27km and containing over 9,600 magnets. It produces enormous amounts of data, with a need for over 400,000 HS06 cores of computing by Run 4.
- CERN's OpenStack cloud has grown significantly over the years to help meet this computing need, now providing over 200,000 cores across more than 5,800 hypervisors. It is a
LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks a... (Larry Smarr)
05.02.04
Invited Talk to the NASA Jet Propulsion Laboratory
Title: LambdaGrids--Earth and Planetary Sciences Driving High Performance Networks and High Resolution Visualizations
Pasadena, CA
The Academic and R&D Sectors' Current and Future Broadband and Fiber Access N...Larry Smarr
05.02.23
Invited Access Grid Talk
MSCMC FORUM Series
Examining the National Vision for Global Peace and Prosperity
Title: The Academic and R&D Sectors' Current and Future Broadband and Fiber Access Needs for US Global Competitiveness
Arlington, VA
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While... (Databricks)
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * C... (Larry Smarr)
09.11.03
Report to the
Dept. of Energy Advanced Scientific Computing Advisory Committee
Title: Project StarGate An End-to-End 10Gbps HPC to User Cyberinfrastructure ANL * Calit2 * LBNL * NICS * ORNL * SDSC
Oak Ridge, TN
In this video from ChefConf 2014 in San Francisco, Cycle Computing CEO Jason Stowe outlines the biggest challenge facing us today, Climate Change, and suggests how Cloud HPC can help find a solution, including ideas around Climate Engineering and Renewable Energy.
"As proof points, Jason uses three use cases from Cycle Computing customers, including from companies like HGST (a Western Digital Company), Aerospace Corporation, Novartis, and the University of Southern California. It’s clear that with these new tools that leverage both Cloud Computing, and HPC – the power of Cloud HPC enables researchers, and designers to ask the right questions, to help them find better answers, faster. This all delivers a more powerful future, and means to solving these really difficult problems."
Watch the video presentation: http://insidehpc.com/2014/09/video-hpc-cluster-computing-64-156000-cores/
ESCAPE Kick-off meeting - KM3Net, Opening a new window on our universe (Feb 2... (ESCAPE EU)
KM3NeT is a neutrino research infrastructure located in the deep Mediterranean Sea consisting of two detectors, ORCA and ARCA. The document discusses KM3NeT's physics motivations in studying neutrino oscillations, supernovae, dark matter, and cosmic neutrinos. It describes the detector design using optical sensors on vertical strings to detect Cherenkov radiation from neutrinos. A phased construction approach is outlined. The large data volumes require advanced data management, including processing, storage, and open access policies following FAIR principles. Participation in ESCAPE could help address KM3NeT's computing and data challenges for its lifetime scale.
Preparing Fusion codes for Perlmutter - CGYRO (Igor Sfiligoi)
The document discusses the CGYRO simulation tool, which is used for fusion plasma turbulence simulations. CGYRO is optimized for multi-scale simulations and is both memory and compute intensive. It is inherently parallel and uses OpenMP, OpenACC, and MPI for parallelization across CPU and GPU cores. While initial runs on Perlmutter had communication bottlenecks, improved networking with Slingshot 11 has helped increase performance, though it can interfere with MPS. Overall, CGYRO users are pleased with the transition from Cori to Perlmutter, finding it much faster for equivalent hardware.
Comparing single-node and multi-node performance of an important fusion HPC c... (Igor Sfiligoi)
Fusion simulations have traditionally required the use of leadership scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing for some problems that once required a multi-node setup to be also solvable on a single node. When possible, the increased interconnect bandwidth can result in order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSWITCH does however promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535130
The anachronism of whole-GPU accounting (Igor Sfiligoi)
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order of magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as being equal is not reflecting compute output anymore. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for the CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both HTCondor and Kubernetes infrastructure level.
Presented at PEARC22.
Document DOI: https://doi.org/10.1145/3491418.3535125
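A worked example of the proposed GPU-core-hour convention, assuming NVIDIA's published CUDA core counts and made-up wall-clock usage figures:

```python
# Worked example of GPU-core-hour accounting; the CUDA core counts are NVIDIA's
# published specs, while the wall-clock usage figures are made-up illustrations.
cuda_cores = {"V100": 5120, "A100": 6912, "T4": 2560}
usage_hours = {"V100": 100.0, "A100": 40.0, "T4": 250.0}  # hypothetical whole-GPU hours

whole_gpu_hours = sum(usage_hours.values())
gpu_core_hours = sum(cuda_cores[gpu] * h for gpu, h in usage_hours.items())

print(f"Whole-GPU hours: {whole_gpu_hours:,.0f}")
print(f"GPU-core hours:  {gpu_core_hours:,.0f}")
# An A100 hour counts 6912/5120 ~ 1.35x a V100 hour, rather than being treated as equal.
```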
Auto-scaling HTCondor pools using Kubernetes compute resources (Igor Sfiligoi)
HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535123
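A minimal sketch of the demand-driven idea, assuming a worker Deployment named condor-worker, a one-pod-per-idle-job policy, and the standard htcondor and kubernetes Python clients; this is not the actual system described in the paper.

```python
# Minimal demand-driven scaling sketch: count idle HTCondor jobs and resize a
# Kubernetes Deployment of worker (glidein-style) pods to match. The deployment
# name, namespace and 1-job-per-pod policy are assumptions, not the paper's setup.
import htcondor
from kubernetes import client, config

MAX_PODS = 50

def desired_workers():
    schedd = htcondor.Schedd()
    idle = schedd.query(constraint="JobStatus == 1")  # 1 == idle in HTCondor
    return min(len(idle), MAX_PODS)

def scale_deployment(replicas, name="condor-worker", namespace="osg"):
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": replicas}})

if __name__ == "__main__":
    n = desired_workers()
    print(f"Scaling worker deployment to {n} pods")
    scale_deployment(n)
```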
Performance Optimization of CGYRO for Multiscale Turbulence Simulations (Igor Sfiligoi)
Overview of the recent performance optimization of CGYRO, an Eulerian gyrokinetic fusion plasma solver, with emphasis on multiscale turbulence simulations.
Presented at the joint US-Japan Workshop on Exascale Computing Collaboration and the 6th workshop of the US-Japan Joint Institute for Fusion Theory (JIFT) program (Jan 18th, 2022).
Comparing GPU effectiveness for Unifrac distance compute (Igor Sfiligoi)
Poster presented at PEARC21.
The poster contains the complete scaling plots for both unweighted and weighted normalized Unifrac compute for sample sizes ranging from 1k to 307k on both GPUs and CPUs.
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access (Igor Sfiligoi)
Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python-based NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU’s compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
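The cache-blocking principle described above can be illustrated with a generic NumPy example (this is not the actual scikit-bio code): process the data in tiles that fit in cache and fuse the multiply and reduce steps instead of materializing a full-size temporary.

```python
# Generic illustration of the cache-blocking idea described above: process the
# matrix in tiles small enough to stay in cache instead of materializing every
# intermediate over the full array. This is not the actual scikit-bio code.
import numpy as np

def row_sums_of_squares_naive(x):
    # Allocates a full-size temporary (x * x), so it is memory-bandwidth bound.
    return (x * x).sum(axis=1)

def row_sums_of_squares_blocked(x, block=256):
    # One fused pass per tile; the x[i:i+block] working set can stay in cache.
    out = np.empty(x.shape[0])
    for i in range(0, x.shape[0], block):
        chunk = x[i:i + block]
        out[i:i + block] = np.einsum("ij,ij->i", chunk, chunk)  # fused multiply+sum
    return out

x = np.random.rand(10_000, 2_000)
assert np.allclose(row_sums_of_squares_naive(x), row_sums_of_squares_blocked(x))
```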
Fusion simulations have traditionally required the use of leadership scale HPC resources in order to produce advances in physics. One such package is CGYRO, a premier tool for multi-scale plasma turbulence simulation. CGYRO is a typical HPC application that will not fit into a single node, as it requires several TeraBytes of memory and O(100) TFLOPS compute capability for cutting-edge simulations. CGYRO also requires high-throughput and low-latency networking, due to its reliance on global FFT computations. While in the past such compute may have required hundreds, or even thousands of nodes, recent advances in hardware capabilities allow for just tens of nodes to deliver the necessary compute power. We explored the feasibility of running CGYRO on Cloud resources provided by Microsoft on their Azure platform, using the infiniband-connected HPC resources in spot mode. We observed both that CPU-only resources were very efficient, and that running in spot mode was doable, with minimal side effects. The GPU-enabled resources were less cost effective but allowed for higher scaling.
Scheduling a Kubernetes Federation with Admiralty (Igor Sfiligoi)
This document discusses using Admiralty to federate the Pacific Research Platform (PRP) Kubernetes cluster, called Nautilus, with other clusters. The key points are:
1) PRP/Nautilus has been growing and now has nodes in multiple regions, requiring federation to integrate resources.
2) Admiralty provides a native Kubernetes solution for federation without centralized control. It allows clusters to participate in multiple federations.
3) Installing Admiralty on PRP/Nautilus and other clusters being federated was straightforward using Helm. Pods can be scheduled across clusters automatically.
4) Initial federation is working well between PRP/Nautilus and other clusters for expanded resource sharing
Accelerating microbiome research with OpenACC (Igor Sfiligoi)
Presented at OpenACC Summit 2020.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. Computing UniFrac on modest sample sizes used to take a workday on a server class CPU-only node, while modern datasets would require a large compute cluster to be feasible. After porting to GPUs using OpenACC, the compute of the same modest sample size now takes only a few minutes on a single NVIDIA V100 GPU, while modern datasets can be processed on a single GPU in hours. The OpenACC programming model made the porting of the code to GPUs extremely simple; the first prototype was completed in just over a day. Getting full performance did however take much longer, since proper memory access is fundamental for this application.
Porting and optimizing UniFrac for GPUs (Igor Sfiligoi)
Poster presented at PEARC20.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another (“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems and exhibits near linear scaling. In this poster we describe steps undertaken in porting and optimizing Striped Unifrac to GPUs. We reduced the run time of computing UniFrac on the published Earth Microbiome Project dataset from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla V100 GPU, and to about one hour on a laptop with NVIDIA GTX 1050 (with minor loss in precision). Computing UniFrac on a larger dataset containing 113k samples reduced the run time from over one month on the CPU to less than 2 hours on the V100 and 9 hours on an NVIDIA RTX 2080TI GPU (with minor loss in precision). This was achieved by using OpenACC for generating the GPU offload code and by improving the memory access patterns. A BSD-licensed implementation is available, which produces a C shared library linkable by any programming language.
Demonstrating 100 Gbps in and out of the public Clouds (Igor Sfiligoi)
Poster presented at PEARC20.
There is increased awareness and recognition that public Cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver. The value of elastic scaling is however tightly coupled to the capabilities of the networks that connect all involved resources, both in the public Clouds and at the various research institutions. This poster presents results of measurements involving file transfers inside public Cloud providers, fetching data from on-prem resources into public Cloud instances and fetching data from public Cloud storage into on-prem nodes. The networking of the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform, has been benchmarked. The on-prem nodes were either managed by the Pacific Research Platform or located at the University of Wisconsin – Madison. The observed sustained throughput was of the order of 100 Gbps in all the tests moving data in and out of the public Clouds, with throughput reaching into the Tbps range for data movements inside the public Cloud providers themselves. All the tests used HTTP as the transfer protocol.
TransAtlantic Networking using Cloud links (Igor Sfiligoi)
Scientific communities have only a limited amount of bandwidth available for transferring data between the US and the EU.
We know Cloud providers have plenty of bandwidth available, but at what cost?
Bursting into the public Cloud - Sharing my experience doing it at large scal... (Igor Sfiligoi)
When compute workflow needs spike well in excess of the capacity of a local compute resource, capacity should be temporarily provisioned from somewhere else to both meet deadlines and to increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice. I have recently helped IceCube expand their resource pool by a few orders of magnitude, first to 380 PFLOP32s for a few hours and later to 170 PFLOP32s for a whole workday. In the process we moved O(50 TB) of data to and from the clouds, showing that networking is not a limiting factor, either. While there was a non-negligible dollar cost involved with each, the effort involved was quite modest. In this session I will explain what was done and how, alongside an overview of why IceCube needs so much compute.
Demonstrating 100 Gbps in and out of the Clouds (Igor Sfiligoi)
In this presentation, prepared for the cancelled CENIC 2020 Annual Conference, I present an overview of what is possible to achieve in terms of networking inside the Clouds and when exchanging data between cloud resources and on-prem equipment, with an emphasis on research-hosted hardware.
There is increased awareness and recognition that public cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver, and funding agencies are taking an increasingly positive stance toward public clouds.
The value of elastic scaling is, however, tightly coupled to the capabilities of the networks that connect all involved resources, both in the public clouds and at the various research institutions.
This presentation tries to shed some light on what is possible today.
Serving HTC Users in Kubernetes by Leveraging HTCondor (Igor Sfiligoi)
In this KubeCon 2019 presentation, I show how HTCondor can make Kubernetes-managed resources easier to use for High Throughput Computing.
Presented in San Diego at KubeCon 2019:
https://sched.co/UadF
Characterizing network paths in and out of the Clouds (Igor Sfiligoi)
Cloud computing is becoming mainstream, with funding agencies moving beyond prototyping and starting to fund production campaigns, too. An important aspect of any production computing campaign is data movement, both incoming and outgoing. And while the performance and cost of VMs is relatively well understood, the network performance and cost is not.
We thus embarked on a network characterization campaign, documenting traceroutes, latency and throughput in various regions of Amazon AWS, Microsoft Azure and Google GCP Clouds, both between Cloud resources and major DTNs in the Pacific Research Platform, including OSG data federation caches in the network backbone, and inside the clouds themselves. We also documented the incurred cost while doing so.
Presented at CHEP 2019.
An overview of how IceCube and LIGO make use of the PRP/TNRP Nautilus distributed Kubernetes cluster.
Presented at GRP'19 http://grp-workshop-2019.ucsd.edu
Burst data retrieval after 50k GPU Cloud run
1. Burst retrieval of data
from multiple Cloud regions for
Multi-Messenger Astrophysics
with IceCube
Igor Sfiligoi
UCSD/SDSC
2. Jensen Huang keynote
yesterday
2
The Largest Cloud Simulation in History
50k NVIDIA GPUs in the Cloud
350 Petaflops for 2 hours
Distributed across US, Europe & Asia
On Saturday morning we bought all GPU capacity that was for sale in
Amazon Web Services, Microsoft Azure, and Google Cloud Platform worldwide
3. Jensen Huang keynote
yesterday
3
The Largest Cloud Simulation in History
50k NVIDIA GPUs in the Cloud
350 Petaflops for 2 hours
Distributed across US, Europe & Asia
On Saturday morning we bought all GPU capacity that was for sale in
Amazon Web Services, Microsoft Azure, and Google Cloud Platform worldwide
About 20 TBytes
of data produced
in the process
5. IceCube
5
A cubic kilometer of ice at the
south pole is instrumented
with 5160 optical sensors.
Astrophysics:
• Discovery of astrophysical neutrinos
• First evidence of neutrino point source (TXS)
• Cosmic rays with surface detector
Particle Physics:
• Atmospheric neutrino oscillation
• Neutrino cross sections at TeV scale
• New physics searches at highest energies
Earth Science:
• Glaciology
• Earth tomography
A facility with very
diverse science goals
Restrict this talk to
high energy Astrophysics
6. High Energy Astrophysics
Science case for IceCube
6
Universe is opaque to light
at highest energies and
distances.
Only gravitational waves
and neutrinos can pinpoint
most violent events in
universe.
Fortunately, highest energy
neutrinos are of cosmic origin.
Effectively “background free” as long
as energy is measured correctly.
7. High energy neutrinos from
outside the solar system
7
First 28 very high energy neutrinos from outside the solar system
Red curve is the photon flux
spectrum measured with the
Fermi satellite.
Black points show the
corresponding high energy
neutrino flux spectrum
measured by IceCube.
This demonstrates both the opaqueness of the universe to high energy
photons, and the ability of IceCube to detect neutrinos above the maximum
energy we can see light due to this opaqueness.
Science 342 (2013). DOI:
10.1126/science.1242856
8. Understanding the Origin
8
We now know high energy events happen in the universe. What are they?
p + γ → Δ⁺ → p + π⁰ → p + γ + γ
p + γ → Δ⁺ → n + π⁺ → n + μ⁺ + ν_μ
Aya Ishihara
The hypothesis:
The same cosmic events produce
neutrinos and photons
We detect the electrons or muons from neutrinos that interact in the ice.
Neutrinos interact very weakly => need a very large array of instrumented ice
to maximize chances that a cosmic neutrino interacts inside the detector.
Need pointing accuracy to point back to origin of neutrino.
Telescopes the world over then try to identify the source in the direction
IceCube is pointing to for the neutrino.
Multi-messenger Astrophysics
9. The ν detection challenge
9
[Cropped slide excerpt, credit Aya Ishihara: optical properties of the ice and ongoing ice-model development]
Ice properties change with
depth and wavelength
Observed pointing resolution at high
energies is systematics limited.
Central value moves
for different ice models
Improved e and τ reconstruction
⇒ increased neutrino flux
detection
⇒ more observations
Photon propagation through
ice runs efficiently on single
precision GPU.
Detailed simulation campaigns
to improve pointing resolution
by improving ice model.
Improvement in reconstruction with
better ice model near the detectors
10. First evidence of an origin
10
First location of a source of very high energy neutrinos.
A neutrino produced a high energy muon
near IceCube. The muon produced light as it
traversed the IceCube volume. The light was
detected by IceCube's array of phototubes.
IceCube alerted the astronomy community of the
observation of a single high energy neutrino on
September 22 2017.
A blazar designated by astronomers as TXS
0506+056 was subsequently identified as most likely
source in the direction IceCube was pointing. Multiple
telescopes saw light from TXS at the same time
IceCube saw the neutrino.
Science 361, 147-151
(2018). DOI:10.1126/science.aat2890
11. IceCube’s Future Plans
11
[Timeline figure: The IceCube-Gen2 Facility, preliminary timeline 2016–2032, covering MeV- to EeV-scale physics; IC86 today, IceCube Upgrade (R&D, design & approval, construction, deployment), followed by the Gen2 high-energy array, surface array, radio array and PINGU. Credit: IceCube Upgrade and Gen2, Summer Blot, TeVPA 2018]
Near term:
• Add more phototubes to deep core to increase granularity of measurements.
Longer term:
• Extend instrumented volume at smaller granularity.
• Extend even smaller granularity deep core volume.
• Add surface array.
Improve detector for low & high energy neutrinos
13. The Idea
• Integrate all GPUs available for sale
worldwide into a single HTCondor pool.
- use 28 regions across AWS, Azure, and Google
Cloud for a burst of a couple of hours or so.
• IceCube submits their photon propagation
workflow to this HTCondor pool.
- we handle the input, the jobs on the GPUs, and
the output as a single globally distributed system.
13
Run a GPU burst relevant in scale
for future Exascale HPC systems.
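The submission side stayed plain HTCondor. Below is a minimal sketch of what an IceCube-style GPU job submission could look like using the HTCondor Python bindings; the wrapper script name, resource requests and job count are hypothetical, not the values used in the demo.
```python
# Sketch only: submit a batch of GPU photon-propagation jobs to one of the
# dedicated schedds feeding the global pool. Names and numbers are hypothetical.
import htcondor

submit = htcondor.Submit({
    "executable": "run_photon_prop.sh",   # hypothetical job wrapper
    "arguments": "$(Process)",            # each job works on its own chunk
    "request_gpus": "1",
    "request_cpus": "1",
    "request_memory": "4GB",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "logs/prop_$(Cluster)_$(Process).out",
    "error":  "logs/prop_$(Cluster)_$(Process).err",
    "log":    "logs/prop_$(Cluster).log",
})

schedd = htcondor.Schedd()                 # one of the dedicated demo schedds
result = schedd.submit(submit, count=1000) # 1000 jobs in this illustrative batch
print("Submitted cluster", result.cluster())
```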
14. A global HTCondor pool
• IceCube, like all OSG user communities, relies on
HTCondor for resource orchestration
- This demo used the standard tools
• Dedicated HW setup
- Avoid disruption of OSG production system
- Optimize HTCondor setup for the spiky nature of the demo
▪ multiple schedds for IceCube to submit to
▪ collecting resources in each cloud region, then collecting from all
regions into global pool
14
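A tiered pool like this can be watched filling up by querying the top-level collector. A minimal sketch with the HTCondor Python bindings follows; the collector hostname and the CloudRegion machine attribute are illustrative assumptions.
```python
# Sketch only: count GPU slots per cloud region as seen by the global collector.
# "gpu-pool.example.org" and the "CloudRegion" attribute are assumptions.
from collections import Counter
import htcondor

coll = htcondor.Collector("gpu-pool.example.org")
ads = coll.query(
    htcondor.AdTypes.Startd,
    constraint="TotalGpus > 0",
    projection=["Machine", "CloudRegion", "TotalGpus"],
)

per_region = Counter()
for ad in ads:
    per_region[ad.get("CloudRegion", "unknown")] += int(ad.get("TotalGpus", 0))

for region, gpus in per_region.most_common():
    print(f"{region:20s} {gpus:6d} GPUs")
```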
16. Using native Cloud storage
• Input data pre-staged into native Cloud storage
- Each file in one-to-few Cloud regions
▪ some replication to deal with limited predictability of resources per region
- Local to Compute for large regions for maximum throughput
- Reading from “close” region for smaller ones to minimize ops
• Output staged back to region-local Cloud storage
- To be transferred back asynchronously after the compute is done
• Deployed simple wrappers around Cloud native file
transfer tools
- IceCube jobs do not need to customize for different Clouds
- They just need to know where input data is available
(pretty standard OSG operation mode)
16
17. Using native Cloud storage
• Input data pre-staged into native Cloud storage
- Each file in one-to-few Cloud regions
▪ some replication to deal with limited predictability of resources per region
- Local to Compute for large regions for maximum throughput
- Reading from “close” region for smaller ones to minimize ops
• Output staged back to region-local Cloud storage
- To be transferred back asynchronously after the compute is done
• Deployed simple wrappers around Cloud native file
transfer tools
- IceCube jobs do not need to customize for different Clouds
- They just need to know where input data is available
(pretty standard OSG operation mode)
17
Done at a
leisurely pace
18. Using native Cloud storage
• Input data pre-staged into native Cloud storage
- Each file in one-to-few Cloud regions
▪ some replication to deal with limited predictability of resources per region
- Local to Compute for large regions for maximum throughput
- Reading from “close” region for smaller ones to minimize ops
• Output staged back to region-local Cloud storage
- To be transferred back asynchronously after the compute is done
• Deployed simple wrappers around Cloud native file
transfer tools
- IceCube jobs do not need to customize for different Clouds
- They just need to know where input data is available
(pretty standard OSG operation mode)
18
The focus
of this talk
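The wrappers mentioned above were deliberately thin. A minimal sketch of the idea, dispatching to the respective Cloud-native CLI based on the storage URL; the URL handling and any paths are illustrative assumptions, not the actual IceCube tooling.
```python
# Sketch only: pick the appropriate Cloud-native CLI from the URL scheme,
# so jobs never need Cloud-specific logic. Paths/bucket names are illustrative.
import subprocess
import sys

def fetch(url: str, dest: str) -> None:
    if url.startswith("s3://"):
        cmd = ["aws", "s3", "cp", url, dest]            # AWS S3
    elif url.startswith("gs://"):
        cmd = ["gsutil", "cp", url, dest]               # Google Cloud Storage
    elif url.startswith("https://") and ".blob.core.windows.net/" in url:
        cmd = ["azcopy", "copy", url, dest]             # Azure Blob storage
    else:
        raise ValueError(f"Unsupported storage URL: {url}")
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # e.g. fetch("s3://some-bucket/input.tar", ".")  -- illustrative only
    fetch(sys.argv[1], sys.argv[2])
```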
19. Science with 50k GPUs
achieved as peak performance
19
[Plot: number of provisioned GPUs vs. time in minutes]
Each color is a different
cloud region in US, EU, or Asia.
Total of 28 Regions in use.
Peaked at about 50k GPUs
~350 Petaflops of fp32
8 generations of NVIDIA GPUs used.
20. A Heterogeneous Resource Pool
20
28 cloud Regions across 4 world regions
providing us with 8 GPU generations.
No one region or GPU type dominates!
21. Science Produced
21
Distributed High-Throughput
Computing (dHTC) paradigm
implemented via HTCondor provides
global resource aggregation.
Largest cloud region provided 10.8% of the total
dHTC paradigm can aggregate
on-prem anywhere
HPC at any scale
and multiple clouds
22. Data Produced
22
Size of the data created
was proportional
to the events processed
Largest cloud region provided 10.8% of the total
Just as distributed as
the compute has been
About 20 TB total
24. Timeline
• IceCube is actually in no hurry to get the
data out of the Clouds
- Sooner is of course better
- But not time critical
• But Cloud great for urgent computing
- And there, getting the data out promptly
would be as important as getting
the compute done in the first place
24
25. LIGO example
• LIGO is the other MMA experiment that
can be used to detect large Cosmic events
and point other Astronomy observations
• They are currently limited by compute on
how accurate their pointing is
- More compute would mean better pointing
- But must be prompt
25
26. LIGO example
• LIGO is the other MMA experiment that
can be used to detect large Cosmic events
and point other Astronomy observations
• They are currently limited by compute on
how accurate their pointing is
- More compute would mean better pointing
- But must be prompt
26
20k GPUs for 30 mins with a 30min ramp-
up gets us into the regime where we can
reasonably run a multi-approximant/multi-
EOS analysis to dramatically improve
confidence in probability of an EM counterpart
in ~1 hour, so that classifications are
as accurate as they're going to get before
an optical counterpart fades
James Clark, LIGO
27. Demonstrating a Burst Transfer
• We thus decided to move
~10 TB of the data
back from the Clouds
in a short burst
- 10 TB dictated by the available storage options
• Trying two options
- Directly to UW using many commodity nodes
- Stage to an Internet2 DTN
27
28. UW commodity setup
• We fully expected to be disk I/O bound
- Single spinning disk per node
• We managed to secure 30 nodes
for the purpose
28
30. UW commodity setup
• About 16 Gbps aggregate bandwidth
- But huge variations between Cloud regions
- 3.5 Gbps from the best, <0.5 Gbps from the worst
30
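For scale, a quick back-of-the-envelope estimate (rounded figures, not a measurement) of what ~16 Gbps means for the ~10 TB burst:
```python
# Back-of-the-envelope only: how long ~10 TB takes at the observed ~16 Gbps.
data_tb = 10                       # terabytes to move
rate_gbps = 16                     # observed aggregate throughput

data_bits = data_tb * 1e12 * 8     # TB -> bits
seconds = data_bits / (rate_gbps * 1e9)
print(f"~{seconds/60:.0f} minutes at {rate_gbps} Gbps")   # ~83 minutes
```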
31. Internet2 DTN
• Wanted to see how a single high-end node
with flash-based storage would fare
• We also had previous network
measurements that suggested we might
be able to beat the 30-node UW setup
- See my CHEP19 talk, if interested
http://chep2019.org
31
36. Internet2 DTN
• Peaked at slightly less than 10 Gbps
- Likely limited by the storage
• Again, huge differences in performance
between Cloud regions
36
37. Summary
• Large scale cloud computing is feasible
- We almost matched Summit in FLOP32s
- And can be ramped up very fast
• Getting data between on-prem and Cloud
not a big deal either
- We exceeded 10 Gbps while going
to virtually all Cloud regions
- But needs adequate on-prem capabilities
37
38. Acknowledgements
• Internet2 was the main network provider for
this activity.
• This work was partially sponsored by
NSF grants OAC-1941481,
MPS-1148698, OAC-1841530 and
OAC-1826967.
38