Hands-on with Alluxio Structured Data Management

Jan 14, 2020

Alluxio Online Office Hours, January 14, 2020. We introduce the concepts and components of Alluxio Structured Data Management and walk through a demo with Presto.

Alluxio Office Hour:
Alluxio Structured Data Management
Gene Pang | Alluxio

Outline
Motivation
Alluxio Structured Data Management
Developer Preview in Alluxio 2.1
Demo
Conclusion

Common Alluxio Use Cases
Unified Interface
Unified Namespace
Caching and Locality
SQL Engines are popular

Storage Systems: Files/Objects, Directories, Raw Bytes (Storage Optimized)
SQL Frameworks: Tables, Schemas, Rows/Columns (Compute Optimized)
Impedance Mismatch
Further Expand Benefits!

Benefits of Alluxio Data Orchestration
Storage Systems
SQL Frameworks
Caching
Unified Interface/Namespace
Schema-Aware Optimizations
Compute-Optimized Formats
Physical Data Independence

Alluxio Structured Data Management
Integrating SQL Engines with Alluxio

High-Level Philosophy
Provide Structured Data APIs
Focus on how frameworks interact with data
Cache Logical Data Access
Focus on caching what frameworks want

Alluxio Structured Data Management
Storage System
Transformation Service
Structured Data and Metadata
Logical Data Access Layer
Structured Data Client
SQL Engine
Engine

Developer Preview in Alluxio 2.1
Try out initial components!

Target Environment
Presto
Hive Connector
Hive Metastore
Storage

Alluxio Structured Data Management
Presto
Alluxio Caching Service
Alluxio Catalog Service
Alluxio Transformation Service
Hive Connector
Alluxio Connector
Hive Metastore
Storage

Alluxio Catalog Service
Hive Metastore
Hive Under Database
Functionality
Manages metadata for structured data
Abstracts other database catalogs as Under Database (UDB)
Benefits
Schema-aware optimizations
Simple deployment
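To make the Under Database (UDB) idea concrete, here is a minimal Python sketch of a catalog that copies table metadata from an attached external catalog and then serves it. Every name in it (`UnderDatabase`, `FakeHiveUdb`, `Catalog`, `attach_db`, the `s3://example-bucket/...` location) is hypothetical and chosen for illustration; it is not the Alluxio API, and the Hive Metastore is replaced by an in-memory stand-in.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableInfo:
    """Minimal table metadata: name, column schema, and data location."""
    name: str
    schema: Dict[str, str]  # column name -> column type
    location: str


class UnderDatabase(ABC):
    """Hypothetical interface for an external catalog (a UDB)."""

    @abstractmethod
    def list_tables(self, db_name: str) -> List[TableInfo]:
        ...


class FakeHiveUdb(UnderDatabase):
    """In-memory stand-in for a Hive Metastore; no real Hive calls are made."""

    def __init__(self, databases: Dict[str, List[TableInfo]]):
        self._databases = databases

    def list_tables(self, db_name: str) -> List[TableInfo]:
        return list(self._databases[db_name])


@dataclass
class Catalog:
    """Toy catalog that copies metadata from attached UDBs and serves it."""
    tables: Dict[str, Dict[str, TableInfo]] = field(default_factory=dict)

    def attach_db(self, udb: UnderDatabase, db_name: str) -> int:
        """Import all table metadata of db_name; return the table count."""
        found = udb.list_tables(db_name)
        self.tables[db_name] = {t.name: t for t in found}
        return len(found)

    def get_table(self, db_name: str, table_name: str) -> TableInfo:
        """Serve metadata from the catalog without contacting the UDB."""
        return self.tables[db_name][table_name]


# Illustrative usage with made-up metadata.
hive = FakeHiveUdb({
    "tpcds": [
        TableInfo("store_sales", {"ss_item_sk": "bigint"},
                  "s3://example-bucket/store_sales/"),
    ]
})
catalog = Catalog()
catalog.attach_db(hive, "tpcds")
print(catalog.get_table("tpcds", "store_sales").schema)
```

Because the query engine only talks to the catalog, a different UDB implementation could be swapped in behind the same interface without changing the engine side.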

Alluxio Presto Connector
Tighter integration with Presto
New plugin based on the Presto Hive connector
Available in Alluxio 2.1 distribution
In Progress: Merging connector into Presto codebase

Transformation Service
Transform data to be compute-optimized, independent from storage-optimized format
Coalesce
Format Conversion (CSV to Parquet)
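The coalescing half of a transformation can be sketched in a few lines of standard-library Python: many small CSV files are merged into fewer, larger ones. This is only an illustration under stated assumptions, not how the Alluxio Transformation Service is implemented. The function name `coalesce_csv`, the `part-00000.csv` naming, the row-count threshold, and the assumption that input files have no header row are all hypothetical. The format-conversion half (writing Parquet) is not shown, since it would need a Parquet library.

```python
import csv
from pathlib import Path
from typing import List


def coalesce_csv(input_dir: Path, output_dir: Path,
                 max_rows_per_file: int) -> List[Path]:
    """Merge many small headerless CSV files into fewer, larger files."""
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    handle = None
    writer = None
    rows_in_current = 0
    try:
        # sorted() materializes the file list before any output is written.
        for src in sorted(input_dir.glob("*.csv")):
            with src.open(newline="") as f:
                for row in csv.reader(f):
                    # Start a new output file when none is open or the
                    # current one has reached the row limit.
                    if writer is None or rows_in_current >= max_rows_per_file:
                        if handle is not None:
                            handle.close()
                        out_path = output_dir / f"part-{len(outputs):05d}.csv"
                        handle = out_path.open("w", newline="")
                        writer = csv.writer(handle)
                        outputs.append(out_path)
                        rows_in_current = 0
                    writer.writerow(row)
                    rows_in_current += 1
    finally:
        if handle is not None:
            handle.close()
    return outputs
```

A call such as `coalesce_csv(Path("in"), Path("out"), 100_000)` would rewrite the contents of `in/*.csv` into as few files as the row limit allows. Fewer, larger files generally mean fewer open and list operations for the query engine, which is the motivation for coalescing before format conversion.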

Demo
2 isolated AWS 10-node clusters
Presto + Hive Metastore + S3 Data
Presto + Alluxio + Hive Metastore + S3 Data
TPCDS sample dataset on S3
~10,000 CSV files

Demo Summary
Attached existing Hive database into Alluxio Catalog
Alluxio Catalog served table metadata for Presto
Transformed store_sales by coalescing and converting CSV to Parquet
Presto Without Alluxio: 20s
Alluxio Transformations: 7s
Alluxio Transformations With Caching: 3s
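The three demo timings translate into speedups relative to the Presto-only baseline as follows. The snippet uses only the numbers reported on the slide (20s, 7s, 3s); the variable names are illustrative.

```python
# Query times reported in the demo, in seconds.
timings_s = {
    "Presto without Alluxio": 20.0,
    "Alluxio transformations": 7.0,
    "Alluxio transformations with caching": 3.0,
}

baseline = timings_s["Presto without Alluxio"]

# Speedup = baseline time / measured time.
speedups = {name: baseline / t for name, t in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

That is roughly a 2.9x speedup from transformations alone and roughly 6.7x once caching is added, for this single demo query.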

Future Work
User community feedback/collaboration is important!
Future projects
New UDB implementations (AWS Glue)
More conversion formats (JSON)
DDL/DML workloads (CREATE TABLE, INSERT, etc.)
New Client APIs for structured data (Arrow)

Developer Preview Available in Alluxio 2.1
Try it out!
Documentation
Provide feedback
Feature requests and issues in GitHub Alluxio/alluxio
Thank You!




More from Alluxio, Inc. (20)

How Coupang Leverages Distributed Cache to Accelerate ML Model Training | Alluxio, Inc.

Alluxio Tech Talk Webinar, Apr. 22, 2025. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Hyun Jung Baek (Staff Backend Engineer @ Coupang)
Description: Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in annual revenue. Coupang's AI platform team builds and manages a large-scale AI platform in AWS for machine learning engineers to train models that enhance and customize product search results and product recommendations for its 100+ million customers. As the search and recommendation models evolve, optimizing the underlying infrastructure for AI/ML workloads is essential for the e-commerce business. Coupang's platform team actively sought to improve their model training pipeline to boost machine learning engineers' productivity, publish models to production faster, and reduce operational costs. Coupang focused on addressing several key areas:
- Shortening data preparation and model training time
- Improving GPU utilization in training clusters in different regions
- Reducing S3 API and egress costs incurred from copying large training datasets across regions
- Simplifying the operational complexity of storage system management
In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for leveraging Alluxio to power search and recommendation model training infrastructure. Hyun will discuss:
- How Coupang builds a world-class large-scale AI platform for machine learning engineers to deliver better search and recommendation models
- How adding distributed caching to their multi-region AI infrastructure improves GPU utilization, accelerates end-to-end training time, and significantly reduces cross-region data transfer costs
- How to simplify platform operations and easily deploy the same architecture to new GPU clusters

Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu... | Alluxio, Inc.

Alluxio Webinar, Apr. 1, 2025.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Stephen Pu (Staff Software Engineer @ Alluxio)
Deepseek’s recent announcement of the Fire-flyer File System (3FS) has sparked excitement across the AI infra community, promising a breakthrough in how machine learning models access and process data. In this webinar, an expert in distributed systems and AI infrastructure will take you inside Deepseek 3FS, the purpose-built file system for handling large files and high-bandwidth workloads. We’ll break down how 3FS optimizes data access and speeds up AI workloads, as well as the design tradeoffs made to maximize throughput for AI workloads. In this webinar, you’ll learn how 3FS works under the hood, including:
- The system architecture
- Core software components
- Read/write flows
- Data distribution/placement algorithms
- Cluster/node management and disaster recovery
Whether you’re an AI researcher, ML engineer, or infrastructure architect, this deep dive will give you the technical insights you need to determine if 3FS is the right solution for you.

AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat... | Alluxio, Inc.

AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune | Alluxio, Inc.

AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ... | Alluxio, Inc.

AI/ML Infra Meetup, Mar. 06, 2025. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Bin Fan (VP of Technology @ Alluxio)
In this talk, Bin Fan shares his insights on data access challenges in ML applications, with particular emphasis on how Alluxio's distributed caching helps bridge the gap between storage and compute in preprocessing, pretraining and inference.

AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale | Alluxio, Inc.

AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack | Alluxio, Inc.

AI/ML Infra Meetup, Jan. 23, 2025. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Junchen Jiang (Assistant Professor @ University of Chicago)
LLM inference costs can be huge, particularly with long contexts. In this on-demand video, Junchen Jiang, Assistant Professor at University of Chicago, presents a 10x solution for long-context inference: an easy-to-deploy stack over multiple vLLM engines with a tailored KV-cache backend.

AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU... | Alluxio, Inc.

AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit... | Alluxio, Inc.

Alluxio Webinar | Accelerate AI: Alluxio 101 | Alluxio, Inc.

Alluxio Webinar, Dec. 3, 2024.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
In the rapidly evolving landscape of AI and machine learning, Platform and Data Infrastructure Teams face critical challenges in building and managing large-scale AI platforms. Performance bottlenecks, scalability of the platform, and scarcity of GPUs pose significant challenges in supporting large-scale model training and serving. In this talk, we will introduce how Alluxio helps Platform and Data Infrastructure teams deliver faster, more scalable platforms to ML Engineering teams developing and training AI models. Alluxio’s highly-distributed cache accelerates AI workloads by eliminating data loading bottlenecks and maximizing GPU utilization. Customers report up to 4x faster training performance with high-speed access to petabytes of data spread across billions of files regardless of persistent storage type or proximity to GPU clusters. Alluxio’s architecture lowers data infrastructure costs, increases GPU utilization, and enables workload portability for navigating GPU scarcity challenges.

AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI | Alluxio, Inc.

AI/ML Infra Meetup, Nov. 7, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Zhe Zhang (Distinguished Engineer @ NVIDIA)
In this talk, Zhe Zhang (NVIDIA, ex-Anyscale) introduced Ray and its applications in the LLM and multi-modal AI era. He shared his perspective on ML infrastructure, noting that it presents more unstructured challenges, and recommended using Ray and Alluxio as solutions for increasingly data-intensive multi-modal AI workloads.

AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi... | Alluxio, Inc.

AI/ML Infra Meetup, Nov. 7, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Bin Fan (Founding Engineer, VP of Technology @ Alluxio)
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.

AI/ML Infra Meetup | Big Data and AI, Zoom Developers | Alluxio, Inc.

AI/ML Infra Meetup, Nov. 7, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Sandeep Manchem (ML Platform Engineering Manager @ Zoom)
In this talk, Sandeep Manchem (Zoom) discussed big data and AI, covering typical platform architecture and data challenges. We had engaging discussions about ensuring data safety and compliance in Big Data and AI applications.

AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product... | Alluxio, Inc.

AI/ML Infra Meetup, Nov. 7, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Tianyu Liu (Research Scientist @ Meta)
TorchTitan is a proof-of-concept for large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase. In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.

Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu... | Alluxio, Inc.

Alluxio Webinar, Oct. 15, 2024.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Tom Luckenbach (Solutions Engineering Manager, Alluxio)
AI training workloads running on compute engines like PyTorch, TensorFlow, and Ray require consistent, high-throughput access to training data to maintain high GPU utilization. However, with the decoupling of compute and storage and with today’s hybrid and multi-cloud landscape, AI Platform and Data Infrastructure teams are struggling to cost-effectively deliver the high-performance data access needed for AI workloads at scale. Join Tom Luckenbach, Alluxio Solutions Engineering Manager, to learn how Alluxio enables high-speed, cost-effective data access for AI training workloads in hybrid and multi-cloud architectures, while eliminating the need to manage data copies across regions and clouds. What Tom will share:
- AI data access challenges in cross-region, cross-cloud architectures
- The architecture and integration of Alluxio with frameworks like PyTorch, TensorFlow, and Ray using POSIX, REST, or Python APIs across AWS, GCP and Azure
- A live demo of an AI training workload accessing cross-cloud datasets leveraging Alluxio's distributed cache, unified namespace, and policy-driven data management
- MLPerf and FIO benchmark results and cost-savings analysis

AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces... | Alluxio, Inc.

AI/ML Infra Meetup, Aug. 29, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Koundinya Pidaparthi (VP of Analytics @ Poshmark)
Scaling experimentation in digital marketplaces is crucial for driving growth and enhancing user experiences. However, varied methodologies and a lack of experiment governance can hinder the impact of experimentation, leading to inconsistent decision-making, inefficiencies, and missed opportunities for innovation. At Poshmark, we developed a homegrown experimentation platform, Lightspeed, that allowed us to make reliable and confident reads on product changes, which led to a 10x growth in experiment velocity and positive business outcomes along the way. This session will provide a deep dive into the best practices and lessons learned from successful implementations of large-scale experiments. We will explore the importance of experimentation, how to overcome scalability challenges, and the frameworks and technologies that enable effective testing.

AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A... | Alluxio, Inc.

AI/ML Infra Meetup, Aug. 29, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Mahesh Pasupuleti (VP of DS, ML & Data Infra @ Poshmark)
In the rapidly evolving world of e-commerce, visual search has become a game-changing technology. Poshmark, a leading fashion resale marketplace, has developed Posh Lens – an advanced visual search engine that revolutionizes how shoppers discover and purchase items. Under the hood of Posh Lens lies Milvus, a vector database enabling efficient product search and recommendation across our vast catalog of over 150 million items. However, with such an extensive and growing dataset, maintaining high-performance search capabilities while scaling AI infrastructure presents significant challenges. In this talk, Mahesh Pasupuleti shares:
- The architecture and strategies to scale Milvus effectively within the Posh Lens infrastructure
- Key considerations, including optimizing vector indexing, managing data partitioning, and ensuring query efficiency amid large-scale data growth
- Distributed computing principles and advanced indexing techniques to handle the complexity of Poshmark's diverse product catalog

Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor... | Alluxio, Inc.

Alluxio Webinar, Sept. 10, 2024.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Jingwen Ouyang (Senior Program Manager, Alluxio)
As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data loading and GPU utilization, often leading to costly investments in high-performance computing (HPC) storage. However, this approach can result in overspending without addressing the core issues of data bottlenecks and infrastructure complexity. A better approach is adding a data caching layer between compute and storage, like Alluxio, which offers a cost-effective alternative through its innovative data caching strategy. In this webinar, Jingwen will explore how Alluxio's caching solutions optimize AI workloads for performance, user experience and cost-effectiveness. What you will learn:
- The I/O bottlenecks that slow down data loading in model training
- How Alluxio's data caching strategy optimizes I/O performance for training and GPU utilization, and significantly reduces cloud API costs
- The architecture and key capabilities of Alluxio
- Using Rapid Alluxio Deployer to install Alluxio and run benchmarks in AWS in just 30 minutes

AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi... | Alluxio, Inc.

AI/ML Infra Meetup, Aug. 29, 2024. Organized by Alluxio.
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Bin Fan (VP of Technology, Founding Engineer @ Alluxio)
In the rapidly evolving landscape of AI and machine learning, infra teams face critical challenges in managing large-scale data for AI. Performance bottlenecks, cost inefficiencies, and management complexities pose significant challenges for AI platform teams supporting large-scale model training and serving. In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency. What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- How to leverage GPUs anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
- Strategies for reducing cloud costs and simplifying management of AI infrastructure at scale

AI/ML Infra Meetup | Preference Tuning and Fine Tuning LLMs | Alluxio, Inc.