AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio

For more Alluxio Events: https://www.alluxio.io/events/

Speaker:
- Bin Fan (VP of Technology, Founding Engineer @ Alluxio)

In the rapidly evolving landscape of AI and machine learning, infra teams face critical challenges in managing large-scale data. Performance bottlenecks, cost inefficiencies, and management complexities all weigh on AI platform teams that support large-scale model training and serving.

In this talk, Bin Fan will discuss how I/O stalls lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.

What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- How to leverage GPUs anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
- Strategies for reducing cloud costs and simplifying management of AI infrastructure at scale