Alluxio Bay Area Meetup @ Galvanize | SF
Aug 20, 2019
Interactive Analytics in the Cloud with Presto and Alluxio
Speaker:
Bin Fan, Founding Engineer, Alluxio
Building Cloud Native Analytical Pipelines on AWS | Alluxio, Inc.
Alluxio Bay Area Meetup @ Galvanize | SF
Aug 20, 2019
Interactive Analytics in the Cloud with Presto and Alluxio
Speaker:
Irene Cai, Software Engineer, Google
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Presto on Alluxio Hands-On Lab
Speakers:
Bin Fan, Alluxio
Zac Blanco, Alluxio
Kamil Bajda-Pawlikowski, Starburst, Presto Company
Martin Traverso, Presto Software Foundation
For more Alluxio events: https://www.alluxio.io/events/
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 | Alluxio, Inc.
Adit Madan from Alluxio presented on using Alluxio to accelerate analytics on data stored in Ceph object storage. Alluxio acts as a virtual distributed file system that caches data in memory to provide faster access to data across different storage systems. For repeated Spark jobs on a 60GB dataset in Ceph, it was shown to run up to 20x faster than the same jobs without Alluxio. Details are provided in Alluxio's whitepaper on accelerating analytics on Ceph with Alluxio.
Securely Enhancing Data Access in Hybrid Cloud with Alluxio | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Michael Fagan & Prashant Khanolkar, Comcast
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud... | Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Enterprise Distributed Query Service powered by Presto & Alluxio across clouds at WalmartLabs
Speaker:
Ashish Tadose, WalmartLabs
For more Alluxio events: https://www.alluxio.io/events/
Building a high-performance data lake analytics engine at Alibaba Cloud with ... | Alluxio, Inc.
This document discusses optimizations made to Alibaba Cloud's Data Lake Analytics (DLA) engine, which uses Presto, to improve performance when querying data stored in Object Storage Service (OSS). The optimizations included decreasing OSS API request counts, implementing an Alluxio data cache using local disks on Presto workers, and improving disk throughput by utilizing multiple ultra disks. These changes increased cache hit ratios and query performance for workloads involving large scans of data stored in OSS. Future plans include supporting an Alluxio cluster shared by multiple users and additional caching techniques.
Best Practices for Using Alluxio with Spark | Alluxio, Inc.
Gene Pang presented on best practices for using Alluxio with Spark. Alluxio is a memory-centric distributed storage system that can improve Spark performance by enabling data to be accessed at memory speed. Using Alluxio between Spark and storage systems allows data to be shared between Spark's storage and execution engines at memory speed without requiring multiple copies. Alluxio also provides data resilience during crashes since data is not lost from memory. Experiments showed Alluxio providing a 6-8x speedup over reading cached Parquet dataframes from S3.
Alluxio+Presto: An Architecture for Fast SQL in the Cloud | Alluxio, Inc.
Alluxio is a virtual distributed file system that serves as a data access layer between applications and storage systems. It provides a unified interface, improved performance through caching, and enables transparent migration between storage systems. Alluxio deployed with Presto on cloud storage like S3 can provide 5x faster query performance through caching query data in Alluxio workers located with compute. Case studies show how Alluxio improved response times for analytics workloads at large companies by eliminating remote data access and enabling data locality.
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017 | Alluxio, Inc.
- Alluxio (formerly Tachyon) provides unified, memory-speed data access across compute frameworks like Spark and Presto, and storage systems like S3, HDFS, and NFS.
- It started as an open source project at UC Berkeley in 2012 and is now rapidly growing with over 500 contributors from 100+ organizations.
- By keeping frequently used data in memory, Alluxio can accelerate data access by 30x or more for companies like Baidu, Barclays, and Qunar by enabling workflows that were previously impossible.
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017 | Alluxio, Inc.
This document discusses using Alluxio with Spark to improve performance. Alluxio consolidates data in memory across distributed systems to enable faster data sharing between Spark jobs and frameworks. Tests show Alluxio can accelerate Spark workloads by up to 30x when reading from remote storage like S3 by serving data at memory speed. Alluxio also provides data resilience during failures and allows sharing data across jobs more easily.
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016 | Alluxio, Inc.
This document discusses the rise of intermediary APIs like Apache Beam and Alluxio that allow users to write data processing jobs and express storage lifecycles independently of physical constraints. Intermediary APIs provide portability across frameworks and unified access to multiple storage systems. Alluxio in particular provides an in-memory filesystem that can cache data from various storage sources, while Beam allows processing jobs to run on different execution engines. These intermediary APIs create a path for easy technology adoption and focus on features over connectivity.
Ultra-fast SQL Analytics using PAS (Presto on Alluxio Stack) | Alluxio, Inc.
Presto Meetup @ Uber
Nov 21, 2019
Speakers:
Haoyuan (H.Y.) Li, Founder and CTO | Alluxio
Bin Fan, Founding engineer and VP of Open Source | Alluxio
For more Alluxio events: https://www.alluxio.io/events/
Alluxio-FUSE as a data access layer for Dask | Alluxio, Inc.
This document discusses integrating Alluxio with Dask for processing large mass spectrometry imaging data. Alluxio is used as a distributed caching layer via its FUSE POSIX API to provide standardized access to datasets from Dask. This allows Dask to process data in parallel across compute nodes without needing to load full datasets into memory. Initial results found a 10x speedup when reading cached data from Alluxio versus directly from S3 storage each time.
Presto: Query Anything - Data Engineer’s perspective | Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Presto: Query Anything - Data Engineer’s perspective
Speakers:
Kamil Bajda-Pawlikowski, Starburst, Presto Company
Martin Traverso, Presto Software Foundation
For more Alluxio events: https://www.alluxio.io/events/
Alluxio Use Cases and Future Directions | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
The Practice of Presto & Alluxio in E-Commerce Big Data Platform | Alluxio, Inc.
This document discusses JD.com's use of Presto and Alluxio in their big data platform (BDP) architecture. It provides an introduction to Presto and how JD.com uses it in their BDP, including scaling Presto on YARN and using PowerServer for operations and maintenance. It also discusses how Presto and Alluxio are used together to improve query performance through caching and eliminating network traffic. Finally, it outlines ongoing explorations around improving Presto and Alluxio, such as load balancing, resource isolation, supporting larger clusters, and porting HDFS authentication to Alluxio.
How to Build a new under filesystem in Alluxio: Apache Ozone as an example | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
How to Build a new under filesystem in Alluxio: Apache Ozone as an example
Baolong Mao, Sr. System Engineer (Tencent)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Hybrid data lake on google cloud with alluxio and dataproc | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Hybrid Data Lake on Google Cloud with Alluxio and Dataproc
Roderick Yao, Strategic Cloud Engineer (Google Cloud)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
ApacheCon 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Lu Qiu
Bin Fan
Alluxio’s capabilities as a Data Orchestration framework have encouraged users to onboard more of their data-driven applications to an Alluxio-powered data access layer. Driven by strong interest from our open-source community, the core team of Alluxio started to redesign an efficient and transparent way for users to leverage data orchestration through the POSIX interface. This effort has made a lot of progress in collaboration with engineers from Microsoft, Alibaba and Tencent. In particular, we have introduced a new JNI-based FUSE implementation to support POSIX data access, created a more efficient way to integrate Alluxio with the FUSE service, and made many improvements to related data operations, such as a more efficient distributedLoad and optimizations for listing or calculating directories with a massive number of files, which are common in model training. We will also share our engineering lessons and our roadmap for supporting Machine Learning applications in future releases.
This document discusses deploying the Alluxio distributed file system on Mesosphere DC/OS. It begins with an overview of the SMACK and SMAACK data stacks that include Apache Spark, Kafka, Cassandra and Akka. It then summarizes the benefits of Alluxio in providing unified access to data across storage systems at memory speed. The document demonstrates deploying Alluxio on DC/OS, noting how this provides on-demand provisioning, simplified operations and an elastic data infrastructure. It concludes by recommending users get started with Alluxio on DC/OS to process data from multiple storage systems faster.
Deep Learning and Gene Computing Acceleration with Alluxio in Kubernetes | Alluxio, Inc.
Eric Li, Senior Architect of Alibaba Cloud, presented on using Alluxio on Kubernetes. He discussed:
1. The challenges of deploying Alluxio on Kubernetes, including how to deploy it in a Kubernetes-native way, how applications can access data without changes, and how to achieve best Alluxio performance.
2. Optimizations made to Alluxio, including a Helm chart for one-click installation, optimizations to the OSS SDK for data loading speed, and the use of FUSE and short-circuiting for performance.
3. Best practices for using Alluxio on Kubernetes for different workloads like deep learning and genomic computing.
Accelerate Analytics and ML in the Hybrid Cloud Era | Alluxio, Inc.
Alluxio Community Office Hour
February 23, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, as well as machine learning workloads, experience sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
The Missing Piece of On-Demand Clusters | Alluxio, Inc.
The Missing Piece of On-Demand Clusters
Presented by Calvin Jia, Alluxio
Introduction to Alluxio Meetup at Princeton
http://www.meetup.com/futureofdata-princeton/events/232927731/
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ... | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration between Presto & Alluxio
Ke Wang, Software Engineer (Facebook)
Bin Fan, Founding Engineer, VP Of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics | Alluxio, Inc.
Alluxio Austin Meetup
Aug 15, 2019
Speaker: Bin Fan
Apache Spark and Alluxio are cousin open source projects that originated from UC Berkeley’s AMPLab. Running Spark with Alluxio is a popular stack particularly for hybrid environments. In this session, I will briefly introduce Apache Spark and Alluxio, share the top ten tips for performance tuning for real-world workloads, and demo Alluxio with Spark.
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More | Alluxio, Inc.
Alluxio - Data Orchestration for Analytics and AI in the Cloud
Oct 8, 2019
Speakers:
Haoyuan Li & Bin Fan, Alluxio
Visit https://www.alluxio.io/events/ for more Alluxio events.
Achieving compute and storage independence for data-driven workloads | Alluxio, Inc.
Alluxio provides a unified interface to access data across multiple storage systems, allowing compute and storage to scale independently for data-driven applications. It uses a virtual unified file system with a global namespace and server-side API translation to abstract data location and access. Alluxio intelligently manages data placement across memory, SSDs and HDDs using multi-tier caching for local performance on remote data. This allows flexible deployment of compute like Spark on any cloud while keeping data fully controlled on-premises. Alluxio is seeing wide adoption with many large production deployments handling thousands of nodes. Upcoming features include POSIX API support and preview of version 2.0.
Open Source Data Orchestration for AI, Big Data, and Cloud | Alluxio, Inc.
- Alluxio is an open source data orchestration platform that allows data to be accessed closer to compute across cloud, on-premise, and hybrid environments.
- It provides a unified namespace and API to access data located in various storage systems like HDFS, S3, and more.
- Alluxio intelligently manages data placement across memory, SSDs, and HDDs for fast data access and supports popular frameworks like Spark, Presto, and Hive.
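The unified namespace described above can be pictured as a mount table that maps logical paths to locations in different storage systems. The following is a minimal sketch of that idea only; the `MountTable` class, its method names, and the example URIs are hypothetical and are not Alluxio's actual API or implementation.

```python
from typing import Dict, Optional


class MountTable:
    """Toy unified namespace: maps logical path prefixes to storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize so "/logs/" and "/logs" name the same mount point.
        key = "/" + logical_prefix.strip("/")
        self._mounts[key] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> Optional[str]:
        """Return the backing URI for a logical path (longest-prefix match)."""
        path = "/" + logical_path.strip("/")
        best = None
        for prefix in self._mounts:
            base = prefix.rstrip("/")
            # Only match on whole path components, so "/logs" does not
            # capture "/logsdir".
            at_boundary = path == prefix or path.startswith(base + "/")
            if at_boundary and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            return None
        return self._mounts[best] + path[len(best.rstrip("/")):]
```

With a root mount on one (made-up) HDFS location and `/logs` mounted on a made-up S3 bucket, `resolve("/logs/2020/01.parquet")` is routed to the bucket, while any other path falls through to the root mount. The caller only ever sees the single logical tree.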
Alluxio can be deployed on Kubernetes to provide data orchestration for analytics frameworks like Spark. Alluxio abstracts data sources and provides a unified namespace, enabling elastic scaling of compute and independent data. It can be deployed with the Alluxio master and workers in separate pods or together with compute frameworks like Spark. A demo was shown of running Spark jobs on Alluxio to get data locality benefits within Kubernetes.
Over the past two decades, the Big Data stack has reshaped and evolved quickly with numerous innovations driven by the rise of many different open source projects and communities. In this meetup, speakers from Uber, Alibaba, and Alluxio will share best practices for addressing the challenges and opportunities in the developing data architectures using new and emerging open source building blocks. Topics include data format (ORC) optimization, storage security (HDFS), data format (Parquet) layers, and unified data access (Alluxio) layers.
Building a Cloud Native Stack with EMR Spark, Alluxio, and S3 | Alluxio, Inc.
This document summarizes a presentation about building a cloud native stack with EMR Spark, Alluxio, and S3. It discusses using Alluxio to provide better performance than S3 by adding a caching tier and keeping data local to applications like Spark. Alluxio provides familiar file system semantics and can mount multiple data sources. The document demonstrates Alluxio's architecture and how it provides memory speed access to data. It also covers integrating Alluxio with EMR using bootstrap actions and upcoming features in Alluxio 2.0 and 2.1.
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads | Alluxio, Inc.
Alluxio provides a data orchestration platform that allows applications to access data closer to compute across different storage systems through a unified namespace. Key features include intelligent multi-tier caching that provides local performance for remote data, API translation that enables popular frameworks to access different storages without changes, and data elasticity through a global namespace. Alluxio powers analytics and AI workloads in hybrid cloud environments.
Speeding up I/O for Machine Learning ft Apple Case Study using TensorFlow, N... | Alluxio, Inc.
Alluxio Online Meetup
January 15, 2019
Speakers:
Bill Zhao, Apple
Bin Fan, Alluxio
For more Alluxio events: https://www.alluxio.io/events/
Enabling Ultra-fast Presto in the Cloud with Alluxio | Alluxio, Inc.
Alluxio is an open source data orchestration system that enables ultra-fast Presto in the cloud. It provides a Presto Alluxio Stack that caches data in Alluxio for faster Presto queries, with benefits like lower latency, more consistent performance, and reduced data transfer. Alluxio's new structured data service provides deeper integration with SQL engines like Presto through features like a catalog service and transformation service. This enables schema-aware optimizations and compute-optimized data formats for further accelerating Presto performance.
Accelerating Analytics with EMR on your S3 Data Lake | Alluxio, Inc.
- Alluxio provides a data caching layer for analytics frameworks like Spark running on AWS EMR, addressing challenges of using S3 directly like inconsistent performance and expensive metadata operations.
- It mounts S3 as a unified filesystem and caches frequently used data in memory across workers for faster queries while continuously syncing data to S3.
- Alluxio's multi-tier storage enables data to be accessed locally from remote locations like S3 using intelligent policies to promote and demote data between memory, SSDs and disks.
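The promote/demote behavior across tiers can be illustrated with a small sketch. It assumes a simple least-recently-used policy and counts capacity in blocks; the `TieredCache` class and the `fetch_remote` callback are hypothetical names for illustration, and real Alluxio policies and APIs may differ.

```python
from collections import OrderedDict
from typing import Callable, List, Optional


class TieredCache:
    """Toy multi-tier cache: tier 0 is fastest (e.g. memory), last is slowest."""

    def __init__(self, capacities: List[int],
                 fetch_remote: Callable[[str], bytes]) -> None:
        # One LRU-ordered dict per tier; capacities are counted in blocks.
        self._tiers = [OrderedDict() for _ in capacities]
        self._capacities = capacities
        self._fetch_remote = fetch_remote

    def tier_of(self, key: str) -> Optional[int]:
        """Return the index of the tier holding key, or None if not cached."""
        for index, tier in enumerate(self._tiers):
            if key in tier:
                return index
        return None

    def _put(self, index: int, key: str, value: bytes) -> None:
        tier = self._tiers[index]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self._capacities[index]:
            # Demote the least recently used block to the next tier,
            # or drop it if this is already the slowest tier.
            old_key, old_value = tier.popitem(last=False)
            if index + 1 < len(self._tiers):
                self._put(index + 1, old_key, old_value)

    def read(self, key: str) -> bytes:
        index = self.tier_of(key)
        if index is None:
            value = self._fetch_remote(key)  # miss: go to remote storage
        else:
            value = self._tiers[index].pop(key)  # hit: remove, then promote
        self._put(0, key, value)
        return value
```

Every read places the block in the fastest tier, so hot data migrates upward while cold data is pushed down and eventually evicted. Only a miss in all tiers triggers a call to the remote store.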
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud | Alluxio, Inc.
Alluxio Tech Talk
Mar 12, 2019
Speakers:
Bin Fan, Alluxio
Matt Fuller, Starburst
As the explosion of data has increased analytic needs, the speed of analytics and the interactivity of queries have become dramatically more important.
In this tech talk, we will introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments.
You’ll learn about:
- The architecture of Presto, an open source distributed SQL engine, as well as innovations by Starburst such as its cost-based optimizer
- How Presto can query data from cloud object storage like S3 at high performance and cost-effectively with Alluxio
- How to achieve data locality and cross-job caching with Alluxio no matter where the data is persisted and reduce egress costs
In addition, we’ll present some real world architectures & use cases from internet companies like JD.com and NetEase.com running the Presto and Alluxio stack at the scale of hundreds of nodes.
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack | Alluxio, Inc.
Alluxio Tech Talk
January 21, 2020
Speakers:
Matt Fuller, Starburst
Dipti Borkar, Alluxio
With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexible and higher-performance approaches to analyze their structured data.
Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:
- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted
Alluxio Data Orchestration Platform for the Cloud | Shubham Tagra
Alluxio originated as an open source project at UC Berkeley to orchestrate data for cloud applications by providing a unified namespace and intelligent data caching across multiple data sources. It provides consistent high performance for analytics and AI workloads running on object stores by caching frequently accessed data in memory and tiering data to flash/disk based on policies. Alluxio can also enable hybrid cloud environments by allowing on-premises workloads to burst to public clouds without data movement through "zero-copy" access to remote data.
Achieving Separation of Compute and Storage in a Cloud World Alluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage. While this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, how do you unify data across multiple remote clouds, how do you keep storage and I/O service costs down, and many more.
Enter Alluxio, a virtual unified file system that sits between compute and storage and allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
This document discusses accelerating Spark workloads on Amazon S3 using Alluxio. It describes the challenges of running Spark interactively on S3 due to its eventual consistency and expensive metadata operations. Alluxio provides a data caching layer that offers strong consistency, faster performance, and API compatibility with HDFS and S3. It also allows data outside of S3 to be analyzed. The document demonstrates how to bootstrap Alluxio on an AWS EMR cluster to accelerate Spark workloads running on S3.
Unified Data API for Distributed Cloud Analytics and AI Alluxio, Inc.
Alluxio Day x APAC Modern Data Stack
September 22, 2022
For more on Alluxio Day: https://www.alluxio.io/alluxio-day/
For more Alluxio events: https://alluxio.io/events/
Speaker: Bin Fan (Founding Member & VP of Open Source, Alluxio)
Alluxio (www.alluxio.io) is an open-source virtual distributed file system that provides a unified data access layer for hybrid and multi-cloud deployments. It enables distributed compute engines like Spark and Presto, or machine learning frameworks like TensorFlow, to transparently access different persistent storage systems (including HDFS, S3, Azure, etc.) while actively leveraging an in-memory cache to accelerate data access. Originally developed at UC Berkeley AMPLab as the research project “Tachyon”, Alluxio has more than 1200 contributors and is used by over 100 companies worldwide, with the largest production deployment exceeding 1000 nodes.
This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Increasingly popular cloud object storage systems provide more cost-effective and scalable storage, but they also have different semantics and performance implications compared to HDFS. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when retrieving data from cloud object storage. Deploying Alluxio in front of cloud storage addresses these problems because data is retrieved once and cached in Alluxio rather than fetched repeatedly from the underlying cloud or object storage.
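The cross-job caching idea in the description above can be illustrated with a minimal read-through cache. This is only a sketch of the general technique, not Alluxio's API: `ReadThroughCache`, `slow_object_store`, and the `s3://bucket/...` path are hypothetical names chosen for illustration.

```python
class ReadThroughCache:
    """Hypothetical cache layer sitting between jobs and an object store."""

    def __init__(self, fetch_from_store):
        self._fetch = fetch_from_store   # callable that reads from the backing store
        self._cache = {}                 # path -> bytes already fetched
        self.remote_reads = 0            # how many times the backing store was hit

    def read(self, path):
        # Only go to the backing store on a cache miss.
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


def slow_object_store(path):
    # Stand-in for a remote object store read (assumption: returns bytes).
    return f"contents of {path}".encode()


cache = ReadThroughCache(slow_object_store)
job_a = cache.read("s3://bucket/table/part-0")   # first job: fetched remotely
job_b = cache.read("s3://bucket/table/part-0")   # second job: served from cache
```

Both jobs see the same bytes, but the backing store is read only once, which is the effect the description attributes to placing a cache in front of object storage.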
This document provides an overview of Alluxio, a unified data solution that allows applications to access data closer to the computation. It summarizes Alluxio's key innovations including providing a unified namespace, translating between different storage APIs, and using an intelligent caching system. The document also outlines several use cases where Alluxio has helped customers including accelerating machine learning and analytics workloads.
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio... Alluxio, Inc.
Alluxio Webinar
Feb. 25, 2025
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
Tom Luckenbach (Solutions Engineering Manager, Alluxio)
Join us to learn about the latest release of Alluxio Enterprise AI. In this webinar, we’ll provide an overview of the new features and capabilities of Alluxio Enterprise AI, built to accelerate AI workloads and maximize GPU utilization.
Key highlights include:
- New caching mode accelerates AI checkpoints
- Advanced cache eviction policies provide fine-grained control
- Python SDK integrations enhance AI framework compatibility
- A demo of Alluxio accelerating AI training workloads in AWS
How Coupang Leverages Distributed Cache to Accelerate ML Model Training Alluxio, Inc.
Alluxio Tech Talk Webinar
Apr. 22, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Hyun Jung Baek (Staff Backend Engineer @ Coupang)
Description
Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in annual revenue. Coupang's AI platform team builds and manages a large-scale AI platform in AWS for machine learning engineers to train models that enhance and customize product search results and product recommendations for its 100+ million customers.
As the search and recommendation models evolve, optimizing the underlying infrastructure for AI/ML workloads is essential for the e-commerce business. Coupang's platform team actively sought to improve their model training pipeline to boost machine learning engineers' productivity, publish models to production faster, and reduce operational costs.
Coupang focused on addressing several key areas:
- Shortening data preparation and model training time
- Improving GPU utilization in training clusters in different regions
- Reducing S3 API and egress costs incurred from copying large training datasets across regions
- Simplifying the operational complexity of storage system management
In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for leveraging Alluxio to power search and recommendation model training infrastructure.
Hyun will discuss:
- How Coupang builds a world-class large-scale AI platform for machine learning engineers to deliver better search and recommendation models
- How adding distributed caching to their multi-region AI infrastructure improves GPU utilization, accelerates end-to-end training time, and significantly reduces cross-region data transfer costs.
- How to simplify platform operations and to easily deploy the same architecture to new GPU clusters.
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu... Alluxio, Inc.
Alluxio Webinar
Apr 1, 2025
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Stephen Pu (Staff Software Engineer @ Alluxio)
Deepseek’s recent announcement of the Fire-flyer File System (3FS) has sparked excitement across the AI infra community, promising a breakthrough in how machine learning models access and process data.
In this webinar, an expert in distributed systems and AI infrastructure will take you inside Deepseek 3FS, the purpose-built file system for handling large files and high-bandwidth workloads. We’ll break down how 3FS optimizes data access and speeds up AI workloads as well as the design tradeoffs made to maximize throughput for AI workloads.
In this webinar, you’ll learn how 3FS works under the hood, including:
✅ The system architecture
✅ Core software components
✅ Read/write flows
✅ Data distribution/placement algorithms
✅ Cluster/node management and disaster recovery
Whether you’re an AI researcher, ML engineer, or infrastructure architect, this deep dive will give you the technical insights you need to determine if 3FS is the right solution for you.
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat... Alluxio, Inc.
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Xu Ning (Director of Engineering, AI Platform @ Snap)
In this talk, Xu Ning from Snap provides a comprehensive overview of the unique challenges in building and scaling recommendation systems compared to LLM applications.
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune Alluxio, Inc.
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Chongxiao Cao (Senior SWE @ Uber)
Chongxiao Cao from Uber's Michelangelo training team shared valuable insights into Uber's approach to optimizing LLM training and fine-tuning workflows.
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ... Alluxio, Inc.
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology @ Alluxio)
In this talk, Bin Fan shares his insights on data access challenges in ML applications, with particular emphasis on how Alluxio's distributed caching helps bridge the gap between storage and compute in preprocessing, pretraining and inference.
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale Alluxio, Inc.
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Sean Po (Staff SWE @ Uber)
- Tse-Chi Wang (Senior SWE @ Uber)
This talk provided a deep dive into how Uber manages its Generative AI Gateway, which powers all generative AI applications across the company.
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack Alluxio, Inc.
AI/ML Infra Meetup
Jan. 23, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor @ University of Chicago)
LLM inference can be costly, particularly with long contexts. In this on-demand video, Junchen Jiang, Assistant Professor at University of Chicago, presents a 10x solution for long-context inference: an easy-to-deploy stack over multiple vLLM engines with a tailored KV-cache backend.
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU... Alluxio, Inc.
AI/ML Infra Meetup
Jan. 23, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology @ Alluxio)
Ready to optimize your AI infra strategy? Watch this on-demand video, where Bin Fan, VP of Technology at Alluxio, will guide you through how to balance cost & performance for GPU/CPU workloads.
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit... Alluxio, Inc.
AI/ML Infra Meetup
Jan. 23, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Robert Nishihara (Co-Founder @ Anyscale)
You won't want to miss this talk presented by Robert Nishihara, Co-Founder of Anyscale, which is packed with insights on using Ray to conquer the last-mile challenges in AI deployment.
Alluxio Webinar | Accelerate AI: Alluxio 101 Alluxio, Inc.
Alluxio Webinar
Dec. 3, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
In the rapidly evolving landscape of AI and machine learning, Platform and Data Infrastructure Teams face critical challenges in building and managing large-scale AI platforms. Performance bottlenecks, scalability of the platform, and scarcity of GPUs pose significant challenges in supporting large-scale model training and serving.
In this talk, we will introduce how Alluxio helps Platform and Data Infrastructure teams deliver faster, more scalable platforms to ML Engineering teams developing and training AI models. Alluxio’s highly-distributed cache accelerates AI workloads by eliminating data loading bottlenecks and maximizing GPU utilization. Customers report up to 4x faster training performance with high-speed access to petabytes of data spread across billions of files regardless of persistent storage type or proximity to GPU clusters. Alluxio’s architecture lowers data infrastructure costs, increases GPU utilization, and enables workload portability for navigating GPU scarcity challenges.
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Zhe Zhang (Distinguished Engineer @ NVIDIA)
In this talk, Zhe Zhang (NVIDIA, ex-Anyscale) introduced Ray and its applications in the LLM and multi-modal AI era. He shared his perspective on ML infrastructure, noting that it presents more unstructured challenges, and recommended using Ray and Alluxio as solutions for increasingly data-intensive multi-modal AI workloads.
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi... Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (Founding Engineer, VP of Technology @ Alluxio)
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) are becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
AI/ML Infra Meetup | Big Data and AI, Zoom Developers Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Sandeep Manchem (ML Platform Engineering Manager @ Zoom)
In this talk, Sandeep Manchem (Zoom) discussed big data and AI, covering typical platform architecture and data challenges. We had engaging discussions about ensuring data safety and compliance in Big Data and AI applications.
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product... Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tianyu Liu (Research Scientist @ Meta)
TorchTitan is a proof of concept for large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu... Alluxio, Inc.
Alluxio Webinar
Oct. 15, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tom Luckenbach (Solutions Engineering Manager, Alluxio)
AI training workloads running on compute engines like PyTorch, TensorFlow, and Ray require consistent, high-throughput access to training data to maintain high GPU utilization. However, with the decoupling of compute and storage and with today’s hybrid and multi-cloud landscape, AI Platform and Data Infrastructure teams are struggling to cost-effectively deliver the high-performance data access needed for AI workloads at scale.
Join Tom Luckenbach, Alluxio Solutions Engineering Manager, to learn how Alluxio enables high-speed, cost-effective data access for AI training workloads in hybrid and multi-cloud architectures, while eliminating the need to manage data copies across regions and clouds.
What Tom will share:
- AI data access challenges in cross-region, cross-cloud architectures.
- The architecture and integration of Alluxio with frameworks like PyTorch, TensorFlow, and Ray using POSIX, REST, or Python APIs across AWS, GCP and Azure.
- A live demo of an AI training workload accessing cross-cloud datasets leveraging Alluxio's distributed cache, unified namespace, and policy-driven data management.
- MLPerf and FIO benchmark results and cost-savings analysis.
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces... Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Koundinya Pidaparthi (VP of Analytics @ Poshmark)
Scaling experimentation in digital marketplaces is crucial for driving growth and enhancing user experiences. However, varied methodologies and a lack of experiment governance can hinder the impact of experimentation leading to inconsistent decision-making, inefficiencies, and missed opportunities for innovation.
At Poshmark, we developed a homegrown experimentation platform, Lightspeed, that allowed us to make reliable and confident reads on product changes, which led to a 10x growth in experiment velocity and positive business outcomes along the way.
This session will provide a deep dive into the best practices and lessons learned from successful implementations of large-scale experiments. We will explore the importance of experimentation, how to overcome scalability challenges, and the frameworks and technologies that enable effective testing.
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A... Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Mahesh Pasupuleti (VP of DS, ML & Data Infra @ Poshmark)
In the rapidly evolving world of e-commerce, visual search has become a game-changing technology. Poshmark, a leading fashion resale marketplace, has developed Posh Lens – an advanced visual search engine that revolutionizes how shoppers discover and purchase items.
Under the hood of Posh Lens lies Milvus, a vector database enabling efficient product search and recommendation across our vast catalog of over 150 million items. However, with such an extensive and growing dataset, maintaining high-performance search capabilities while scaling AI infrastructure presents significant challenges.
In this talk, Mahesh Pasupuleti shares:
- The architecture and strategies to scale Milvus effectively within the Posh Lens infrastructure
- Key considerations, including optimizing vector indexing, managing data partitioning, and ensuring query efficiency amidst large-scale data growth
- Distributed computing principles and advanced indexing techniques to handle the complexity of Poshmark's diverse product catalog
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor... Alluxio, Inc.
Alluxio Webinar
Sept. 10, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Senior Program Manager, Alluxio)
As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data loading and GPU utilization, often leading to costly investments in high-performance computing (HPC) storage. However, this approach can result in overspending without addressing the core issues of data bottlenecks and infrastructure complexity.
A better approach is adding a data caching layer between compute and storage, like Alluxio, which offers a cost-effective alternative through its innovative data caching strategy. In this webinar, Jingwen will explore how Alluxio's caching solutions optimize AI workloads for performance, user experience and cost-effectiveness.
What you will learn:
- The I/O bottlenecks that slow down data loading in model training
- How Alluxio's data caching strategy optimizes I/O performance for training and GPU utilization, and significantly reduces cloud API costs
- The architecture and key capabilities of Alluxio
- Using Rapid Alluxio Deployer to install Alluxio and run benchmarks in AWS in just 30 minutes
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi... Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology, Founding Engineer @ Alluxio)
In the rapidly evolving landscape of AI and machine learning, infra teams face critical challenges in managing large-scale data for AI. Performance bottlenecks, cost inefficiencies, and management complexities pose significant challenges for AI platform teams supporting large-scale model training and serving.
In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.
What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- How to leverage GPUs anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
- Strategies for reducing cloud costs and simplifying management of AI infrastructure at scale
AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Ankit Khare (Developer Relations, @OpenAI)
This session aims to provide practical insights for AI enthusiasts on effectively customizing and leveraging LLMs in various applications through preference tuning and fine-tuning.
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf TechSoup
In this webinar we will dive into the essentials of generative AI, address key AI concerns, and demonstrate how nonprofits can benefit from using Microsoft’s AI assistant, Copilot, to achieve their goals.
This event series to help nonprofits obtain Copilot skills is made possible by generous support from Microsoft.
What You’ll Learn in Part 2:
Explore real-world nonprofit use cases and success stories.
Participate in live demonstrations and a hands-on activity to see how you can use Microsoft 365 Copilot in your own work!
Who Watches the Watchmen (SciFiDevCon 2025) Allon Mureinik
Tests, especially unit tests, are the developers’ superheroes. They allow us to mess around with our code and keep us safe.
We often trust them with the safety of our codebase, but how do we know that we should? How do we know that this trust is well-deserved?
Enter mutation testing – by intentionally injecting harmful mutations into our code and seeing if they are caught by the tests, we can evaluate the quality of the safety net they provide. By watching the watchmen, we can make sure our tests really protect us, and we aren’t just green-washing our IDEs to a false sense of security.
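The idea described above can be sketched in a few lines. This is a hand-rolled illustration under stated assumptions, not the talk's code or any mutation-testing tool: `is_adult`, `is_adult_mutant`, `weak_suite`, `strong_suite`, and `mutant_killed` are hypothetical names, and the single mutation (changing `>=` to `>`) is applied by hand.

```python
def is_adult(age):
    # Original code under test.
    return age >= 18


def is_adult_mutant(age):
    # Hand-made mutant: the boundary operator >= was changed to >.
    return age > 18


def weak_suite(fn):
    # Checks only values far from the boundary.
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    # Adds a boundary check at exactly 18.
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite):
    # A mutant is "killed" when the suite passes on the original
    # code but fails on the mutated code.
    return suite(is_adult) and not suite(is_adult_mutant)


weak_result = mutant_killed(weak_suite)      # mutant survives: suite is too lenient
strong_result = mutant_killed(strong_suite)  # mutant is killed by the boundary check
```

A surviving mutant points at a gap in the tests; here the missing case is the boundary value 18.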
Talk from SciFiDevCon 2025
https://www.scifidevcon.com/courses/2025-scifidevcon/contents/680efa43ae4f5
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR... Andre Hora
Unittest and pytest are the most popular testing frameworks in Python. Overall, pytest provides some advantages, including simpler assertions, reuse of fixtures, and interoperability. Due to such benefits, multiple projects in the Python ecosystem have migrated from unittest to pytest. To facilitate the migration, pytest can also run unittest tests, so the migration can happen gradually over time. However, the migration can be time-consuming and take a long time to conclude. In this context, projects would benefit from automated solutions to support the migration process. In this paper, we propose TestMigrationsInPy, a dataset of test migrations from unittest to pytest. TestMigrationsInPy contains 923 real-world migrations performed by developers. Future research proposing novel solutions to migrate frameworks in Python can rely on TestMigrationsInPy as a ground truth. Moreover, as TestMigrationsInPy includes information about the migration type (e.g., changes in assertions or fixtures), our dataset enables novel solutions to be verified effectively, for instance, from simpler assertion migrations to more complex fixture migrations. TestMigrationsInPy is publicly available at: https://github.com/altinoalvesjunior/TestMigrationsInPy.
Societal challenges of AI: biases, multilinguism and sustainability Jordi Cabot
Towards a fairer, inclusive and sustainable AI that works for everybody.
Reviewing the state of the art on these challenges and what we're doing at LIST to test current LLMs and help you select the one that works best for you
Discover why Wi-Fi 7 is set to transform wireless networking and how Router Architects is leading the way with next-gen router designs built for speed, reliability, and innovation.
This interactive Odoo dashboard module lets users create dynamic, visually appealing dashboards tailored to their specific requirements, with support for multiple dashboards covering different aspects of a business.
Join Ajay Sarpal and Miray Vu to learn about key Marketo Engage enhancements. Discover improved in-app Salesforce CRM connector statistics for easy monitoring of sync health and throughput. Explore new Salesforce CRM Synch Dashboards providing up-to-date insights into weekly activity usage, thresholds, and limits with drill-down capabilities. Learn about proactive notifications for both Salesforce CRM sync and product usage overages. Get an update on improved Salesforce CRM synch scale and reliability coming in Q2 2025.
Key Takeaways:
Improved Salesforce CRM User Experience: Learn how self-service visibility enhances satisfaction.
Utilize Salesforce CRM Synch Dashboards: Explore real-time weekly activity data.
Monitor Performance Against Limits: See threshold limits for each product level.
Get Usage Over-Limit Alerts: Receive notifications for exceeding thresholds.
Learn About Improved Salesforce CRM Scale: Understand upcoming cloud-based incremental sync.
Meet the Agents: How AI Is Learning to Think, Plan, and Collaborate Maxim Salnikov
Imagine if apps could think, plan, and team up like humans. Welcome to the world of AI agents and agentic user interfaces (UI)! In this session, we'll explore how AI agents make decisions, collaborate with each other, and create more natural and powerful experiences for users.
Why Orangescrum Is a Game Changer for Construction Companies in 2025 Orangescrum
Orangescrum revolutionizes construction project management in 2025 with real-time collaboration, resource planning, task tracking, and workflow automation, boosting efficiency, transparency, and on-time project delivery.
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi... Egor Kaleynik
This case study explores how we partnered with a mid-sized U.S. healthcare SaaS provider to help them scale from a successful pilot phase to supporting over 10,000 users—while meeting strict HIPAA compliance requirements.
Faced with slow, manual testing cycles, frequent regression bugs, and looming audit risks, their growth was at risk. Their existing QA processes couldn’t keep up with the complexity of real-time biometric data handling, and earlier automation attempts had failed due to unreliable tools and fragmented workflows.
We stepped in to deliver a full QA and DevOps transformation. Our team replaced their fragile legacy tests with Testim’s self-healing automation, integrated Postman and OWASP ZAP into Jenkins pipelines for continuous API and security validation, and leveraged AWS Device Farm for real-device, region-specific compliance testing. Custom deployment scripts gave them control over rollouts without relying on heavy CI/CD infrastructure.
The result? Test cycle times were reduced from 3 days to just 8 hours, regression bugs dropped by 40%, and they passed their first HIPAA audit without issue—unlocking faster contract signings and enabling them to expand confidently. More than just a technical upgrade, this project embedded compliance into every phase of development, proving that SaaS providers in regulated industries can scale fast and stay secure.
PDF Reader Pro Crack Latest Version FREE Download 2025mu394968
🌍📱👉COPY LINK & PASTE ON GOOGLE https://dr-kain-geera.info/👈🌍
PDF Reader Pro is a software application, often referred to as an AI-powered PDF editor and converter, designed for viewing, editing, annotating, and managing PDF files. It supports various PDF functionalities like merging, splitting, converting, and protecting PDFs. Additionally, it can handle tasks such as creating fillable forms, adding digital signatures, and performing optical character recognition (OCR).
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?steaveroggers
Migrating from Lotus Notes to Outlook can be a complex and time-consuming task, especially when dealing with large volumes of NSF emails. This presentation provides a complete guide on how to batch export Lotus Notes NSF emails to Outlook PST format quickly and securely. It highlights the challenges of manual methods, the benefits of using an automated tool, and introduces eSoftTools NSF to PST Converter Software — a reliable solution designed to handle bulk email migrations efficiently. Learn about the software’s key features, step-by-step export process, system requirements, and how it ensures 100% data accuracy and folder structure preservation during migration. Make your email transition smoother, safer, and faster with the right approach.
Read More:- https://www.esofttools.com/nsf-to-pst-converter.html
Landscape of Requirements Engineering for/by AI through Literature ReviewHironori Washizaki
Hironori Washizaki, "Landscape of Requirements Engineering for/by AI through Literature Review," RAISE 2025: Workshop on Requirements engineering for AI-powered SoftwarE, 2025.
3. Alluxio is Open-Source Data Orchestration
Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, GCS Driver, S3 Driver, Azure Driver
4. The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by Ph.D. student Haoyuan (H.Y.) Li - now Alluxio CTO
2015: Open Source project established & company to commercialize Alluxio founded
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
2018
2019
2019 Top 10 Big Data
2019 Top 10 Cloud Software
5. Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
6. Community Across Industries
Consumer
Travel & Transportation
Telco & Media
Healthcare
Technology
Financial Services
Retail & Entertainment
Data & Analytics Services
Learn more
7. Data Locality via Intelligent Multi-tiering
§ Local performance from remote data using multi-tier storage
RAM SSD HDD
Hot Warm Cold
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
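The tiering behaviour on this slide can be illustrated with a toy model. This is a minimal sketch, not Alluxio's implementation: the class `TieredStore`, its methods, and the LRU demotion rule are all hypothetical names and choices made for illustration, and TTL is not modeled.

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier cache: tier 0 is the hottest (e.g. RAM), the last is the coldest.

    Hypothetical illustration only; not Alluxio's actual data structures.
    """

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), ordered hot -> cold
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, block_id):
        """Pinned blocks are never chosen for demotion."""
        self.pinned.add(block_id)

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: evicted from the cache
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)  # mark as most recently used
        if len(blocks) > cap:
            # Demote the least recently used unpinned block one tier down.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                return  # everything is pinned; the toy model tolerates overflow
            self._insert(level + 1, victim, blocks.pop(victim))

    def write(self, block_id, data):
        """New data lands in the hottest tier."""
        self._insert(0, block_id, data)

    def where(self, block_id):
        """Return the name of the tier holding the block, or None."""
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None

    def read(self, block_id):
        """Read a block and promote it to the hottest tier on access."""
        for level, (_, _, blocks) in enumerate(self.tiers):
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)
                return data
        return None


store = TieredStore([("RAM", 1), ("SSD", 1)])
store.write("a", b"1")
store.write("b", b"2")   # "a" is demoted to SSD to make room
store.read("a")          # "a" is promoted back to RAM, "b" is demoted
```

The application only calls `read` and `write`; movement between tiers happens inside the store, which is the sense in which tiering is transparent to the app.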
8. Data Accessibility via popular APIs
Spark
> rdd = sc.textFile("alluxio://master:19998/myInput")
Presto
> CREATE SCHEMA hive.web
> WITH (location = 'alluxio://master:19998/my-table/')
Bash
~$ cat /mnt/alluxio/myInput
TensorFlow
~$ python classify_image.py --model_dir /mnt/fuse/imagenet/
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
9. Data Abstraction via Unified Namespace
Enables effective data management across different Under Stores
$ ./bin/alluxio fs mount /Data s3://bucket/directory
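One way to picture what the mount command sets up is a table that maps paths in the unified namespace to locations in an under store. The sketch below is an assumption-laden illustration, not Alluxio's code: `MountTable`, its methods, and the root under store `hdfs://namenode:8020/alluxio` are hypothetical; only `/Data` and `s3://bucket/directory` come from the slide.

```python
class MountTable:
    """Toy mount table resolving namespace paths by longest matching mount point.

    Hypothetical illustration only; not Alluxio's actual implementation.
    """

    def __init__(self, root_ufs):
        # The root of the namespace is itself backed by an under store.
        self.mounts = {"/": root_ufs.rstrip("/")}

    def mount(self, alluxio_path, ufs_uri):
        """Attach an under store location at a path in the namespace."""
        self.mounts[alluxio_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path to the under store URI that backs it."""
        best = "/"
        for mount_point in self.mounts:
            if mount_point == "/":
                continue
            # Match whole path components only, so /Database is not under /Data.
            under = path == mount_point or path.startswith(mount_point + "/")
            if under and len(mount_point) > len(best):
                best = mount_point
        relative = path[len(best):].lstrip("/")
        base = self.mounts[best]
        return f"{base}/{relative}" if relative else base


table = MountTable("hdfs://namenode:8020/alluxio")  # hypothetical root under store
table.mount("/Data", "s3://bucket/directory")       # mirrors the command above
print(table.resolve("/Data/2019/events.parquet"))
print(table.resolve("/logs/app.log"))
```

Applications address everything through one path space while each subtree can live in a different storage system.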
11. Typical Use Cases
Cloud Analytics Caching
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
Hybrid Cloud Analytics
Get in-memory data access for Spark, Presto,
or any analytics framework on Cloud storage
12. Deployment Approaches
Spark
Alluxio
Storage
Co-locate Alluxio Workers with Spark for
optimal I/O performance
Any Cloud
Same instance
/ container
Spark
Alluxio
Storage
Deploy Alluxio as standalone cluster
between Spark and Storage
Any Cloud
Same data
center / region
Presto
13. Use Case | On-premise Caching for Presto
HDFS
§ Large query variance during peak hours before
§ Alluxio brings data local to Presto to reduce
the latency during peak hours
NetEase Games
Leading Online Game Company in China
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/
Presto
HDFS
Presto
Alluxio
14. Architecture: Colocate Alluxio with Presto
• Black/Red line – Large Query variance without Alluxio
• Green line - Stable query time with Alluxio
15. JD.com | $70B e-commerce retailer: Hadoop Offload Use Case
Project:
• Offload HDFS with separate clusters of Presto and Spark
Problem:
• HDFS cluster is compute and network bound
• Performance is inconsistent
Alluxio solution:
• Alluxio offloads the network I/O as well as the compute
Result:
• Teams can run additional workloads without taxing the existing HDFS cluster
Diagram labels: 3000 Node HDFS, Datacenter, Separate Compute (PRESTO, SPARK), shown with and without ALLUXIO
https://www.slideshare.net/Alluxio/alluxio-in-jd
16. Performance Evaluation
• Yellow line - Stable query time with Alluxio
• < 1sec after first query (cold read)
• Green line – JD Presto without Alluxio : > 10sec
18. Read data in Alluxio, on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Read of Data
Application
Alluxio
Client
Alluxio
Master
19. Read data not in Alluxio
RAM / SSD / HDD
Network / Disk Speed Read of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
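The two read paths on slides 18 and 19 amount to a read-through cache. The following is a minimal sketch under stated assumptions: `UnderStore` and `Worker` are hypothetical stand-ins written for this example, and the real system also involves the client and master shown in the diagrams.

```python
class UnderStore:
    """Stand-in for remote storage such as S3 or HDFS; counts how often it is read."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]


class Worker:
    """Toy worker: serves cached data locally, otherwise fetches and caches it."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cache = {}

    def read(self, path):
        if path in self.cache:
            # Slide 18: data already in Alluxio, served at local speed.
            return self.cache[path], "worker cache"
        # Slide 19: data not in Alluxio, fetched at network / disk speed,
        # then kept so that later reads are local.
        data = self.under_store.read(path)
        self.cache[path] = data
        return data, "under store"


under_store = UnderStore({"/myInput": b"hello"})
worker = Worker(under_store)
first = worker.read("/myInput")    # cold read goes to the under store
second = worker.read("/myInput")   # warm read is served from the worker
```

Only the first read pays the remote cost, which matches the cold-read versus repeated-read gap reported in the performance evaluation slide.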
20. Write data only to Alluxio on same node as client
Alluxio
Worker
RAM / SSD / HDD
Memory Speed Write of Data
Application
Alluxio
Client
Alluxio
Master
21. Write data to Alluxio and Under Store synchronously
RAM / SSD / HDD
Network / Disk Speed Write of
Data
Application
Alluxio
Client
Alluxio
Master
Alluxio
Worker
Under Store
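The two write paths on slides 20 and 21 differ only in whether the under store is updated before the write returns. The sketch below is a simplified illustration: the enum members borrow the names of Alluxio's `MUST_CACHE` and `CACHE_THROUGH` write types, but the `write` function and the dictionaries standing in for the worker and under store are hypothetical.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "write only to Alluxio"                       # slide 20
    CACHE_THROUGH = "write to Alluxio and under store, sync"   # slide 21


def write(worker_cache, under_store, path, data, write_type):
    """Toy write path; dicts stand in for the worker and the under store."""
    worker_cache[path] = data  # always lands in Alluxio storage first
    if write_type is WriteType.CACHE_THROUGH:
        # Persist before returning, so the write runs at network / disk speed
        # but the data survives the loss of the worker.
        under_store[path] = data


worker_cache, under_store = {}, {}
write(worker_cache, under_store, "/tmp/scratch", b"x", WriteType.MUST_CACHE)
write(worker_cache, under_store, "/out/result", b"y", WriteType.CACHE_THROUGH)
```

The trade-off is speed versus durability: the cache-only write is fastest but exists in one place, while the synchronous write is slower but persisted.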
22. Alluxio 2.0 & Coming in 2.1 Release
§ Alluxio 2.0: Released in July
§ Metadata scales to 1 billion files or more (based on RocksDB)
§ Self-managed Metadata service based on Quorum
§ Async writes, distributed load
§ Many more: https://www.alluxio.io/download/releases/alluxio-2-0-0-release/
§ Alluxio 2.1: Scheduled in Sept
§ A Presto-Alluxio Connector with Iceberg Integration
§ Use Alluxio as a caching layer without modifying HMS
23. Next steps - Try it out!
• Getting Started
• Try 10 Minutes Alluxio & Presto Tutorial on Laptop
• Try 10 Minutes Alluxio & Presto Tutorial on AWS
• Top 5 performance tips for running Presto on Alluxio
Questions or Suggestions? Engage with us at alluxio.io/slack!
24. Questions
Slides will be available in the Slack channel (https://alluxio.io/slack)