Scalable Filesystem Metadata Services with RocksDB

Jul 22, 2019

Alluxio maintainer and founding engineer Calvin Jia presents on Scalable Filesystem Metadata Services with RocksDB at the RocksDB meetup at Twitter.

Scalable Filesystem Metadata Services
Featuring RocksDB
Calvin Jia - 07/11 RocksDB Meetup

About Me
Calvin Jia
● Release Manager for Alluxio 2.0.0
● Contributor since Tachyon 0.4 (2012)
● Founding Engineer @ Alluxio

Alluxio Overview
• Open source data orchestration
• Commonly used for data analytics such as OLAP on Hadoop
• Deployed at Huya, Two Sigma, Tencent, and many others
• Largest deployments of over 1000 nodes

Agenda
1. Architecture
2. Challenges
3. Solutions

Alluxio Master
• Responsible for storing and serving metadata in Alluxio
• Alluxio Metadata consists of files and blocks
• Main data structure is the Filesystem Tree
• The namespace for files in Alluxio
• Can include mounts of other file system namespaces
• The size of the tree can be very large!

Metadata Storage Challenges
• Storing the raw metadata becomes a problem with a large number of files
• On average, each file takes 1KB of on-heap storage
• 1 billion files would take 1 TB of heap space!
• A typical JVM runs with < 64GB of heap space
• GC becomes a big problem when using larger heaps
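The arithmetic behind these bullets can be sketched as follows. This is a rough estimate, not a measurement: it assumes decimal units (1 KB = 1,000 bytes) and optimistically treats the whole 64 GB heap as available for file metadata. The function names are hypothetical.

```python
# Back-of-envelope heap sizing, using the figures from the slide.
# Assumption: decimal units (1 KB = 1,000 bytes; 1 GB = 10**9; 1 TB = 10**12).
BYTES_PER_FILE = 1_000       # ~1 KB of on-heap metadata per file
HEAP_BYTES = 64 * 10**9      # a 64 GB JVM heap, the upper end of "typical"


def heap_needed(num_files: int) -> int:
    """Bytes of heap needed to hold metadata for num_files files."""
    return num_files * BYTES_PER_FILE


def max_files(heap_bytes: int) -> int:
    """Upper bound on files if the heap held nothing but file metadata."""
    return heap_bytes // BYTES_PER_FILE


if __name__ == "__main__":
    print(heap_needed(10**9) / 10**12, "TB of heap for 1 billion files")
    print(max_files(HEAP_BYTES), "files fit in a 64 GB heap at best")
```

Under these assumptions a 64 GB heap tops out at roughly 64 million files, more than an order of magnitude short of 1 billion.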

Metadata Serving Challenges
• File operations (e.g. getStatus, create) need to be fast
• On-heap data structures excel in this case
• Operations need to be optimized for high concurrency
• Generally many readers and few writers

Store 1B+ files while serving at high performance

RocksDB
• Embeddable
• Key-Value interface
• LSM-tree based storage (sorted)
• Has Java API
• Vibrant community

Tiered Metadata Storage = 1 Billion Files
Alluxio Master
Local Disk
RocksDB (Embedded)
● Inode Tree
● Block Map
● Worker Block Locations
On Heap
● Inode Cache
● Mount Table
● Locks

Working with RocksDB
• Abstract the metadata storage layer
• Redesign the data structure representation of the Filesystem Tree
• Each inode is represented by a numerical ID
• Edge table maps <ID, childname> to <ID of child> Ex: <(1, foo), 2>
• Inode table maps <ID> to <Metadata blob of inode> Ex: <2, proto>
• Two-table solution provides good performance for common operations
• One lookup for listing by using prefix scan
• One lookup per path component (path depth) for tree traversal
• Constant number of inserts for updates/deletes/creates
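Why a sorted store makes listing a single prefix scan can be illustrated with a small sketch. Everything here is an assumption for illustration: the `SortedTable` class is an in-memory stand-in for a sorted key-value table (it is not the RocksDB API), and the key layout (8-byte big-endian parent ID followed by the UTF-8 child name) is one plausible encoding, not necessarily the one Alluxio uses.

```python
import bisect
import struct


def edge_key(parent_id: int, child_name: str) -> bytes:
    """Encode <parent ID, child name> as 8-byte big-endian ID + UTF-8 name."""
    return struct.pack(">Q", parent_id) + child_name.encode("utf-8")


class SortedTable:
    """Tiny in-memory stand-in for a sorted key-value table (not RocksDB)."""

    def __init__(self):
        self._keys = []      # kept in sorted order
        self._values = {}

    def put(self, key: bytes, value) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def prefix_scan(self, prefix: bytes):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def list_child_ids(edges: SortedTable, parent_id: int):
    """A single prefix scan over the edge table returns all child IDs."""
    prefix = struct.pack(">Q", parent_id)
    return [child_id for _, child_id in edges.prefix_scan(prefix)]


if __name__ == "__main__":
    edges = SortedTable()
    edges.put(edge_key(1, "foo"), 2)
    edges.put(edge_key(1, "bar"), 3)
    edges.put(edge_key(2, "baz"), 4)
    print(list_child_ids(edges, 1))
```

The fixed-width ID is a deliberate choice in this sketch: with a variable-width textual ID, the keys for parent 1 and parent 10 would share a prefix and a scan for one could return children of the other.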

Example RocksDB Operations
• To create a file, /s3/data/june.txt:
• Look up <rootID, s3> in the edge table to get <s3ID>
• Look up <s3ID, data> in the edge table to get <dataID>
• Look up <dataID> in the inode table to get <dataID metadata>
• Update <dataID, dataID metadata> in the inode table
• Put <june.txtID, june.txt metadata> in the inode table
• Put <dataID, june.txt> → <june.txtID> in the edge table
• To list children of /:
• Prefix lookup of <rootID> in the edge table to get all <childID>s
• Look up each <childID> in the inode table to get <child metadata>
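The two walkthroughs above can be sketched with plain dictionaries standing in for the edge and inode tables. This is a minimal illustration, not Alluxio's implementation: `MetadataStore`, `ROOT_ID`, and the metadata fields are hypothetical, and updating the parent's `mtime` is an assumption about what the "update parent metadata" step changes.

```python
ROOT_ID = 0  # hypothetical ID for "/"


class MetadataStore:
    """Edge table + inode table, with dicts standing in for RocksDB tables."""

    def __init__(self):
        self.edges = {}   # (parent ID, child name) -> child ID
        self.inodes = {ROOT_ID: {"name": "/", "dir": True, "mtime": 0}}
        self.next_id = ROOT_ID + 1

    def resolve(self, path: str) -> int:
        """Walk the edge table: one lookup per path component."""
        inode_id = ROOT_ID
        for name in [p for p in path.split("/") if p]:
            inode_id = self.edges[(inode_id, name)]  # KeyError if missing
        return inode_id

    def create(self, path: str, is_dir: bool = False, now: int = 0) -> int:
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent_id = self.resolve(parent_path)
        if (parent_id, name) in self.edges:
            raise FileExistsError(path)
        # Look up the parent's metadata and write back the updated copy.
        parent = dict(self.inodes[parent_id])
        parent["mtime"] = now
        self.inodes[parent_id] = parent
        # Put the new inode, then the edge that links it to its parent.
        child_id = self.next_id
        self.next_id += 1
        self.inodes[child_id] = {"name": name, "dir": is_dir, "mtime": now}
        self.edges[(parent_id, name)] = child_id
        return child_id

    def list_children(self, path: str):
        parent_id = self.resolve(path)
        # Filtering sorted keys stands in for a prefix scan on the parent ID.
        child_ids = [cid for (pid, _), cid in sorted(self.edges.items())
                     if pid == parent_id]
        return [self.inodes[cid] for cid in child_ids]


if __name__ == "__main__":
    store = MetadataStore()
    store.create("/s3", is_dir=True, now=1)
    store.create("/s3/data", is_dir=True, now=2)
    store.create("/s3/data/june.txt", now=3)
    print([c["name"] for c in store.list_children("/")])
```

Note that a create touches a fixed number of entries (one inode update, one inode put, one edge put) regardless of how large the tree is; only the lookups grow with path depth.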

Effects of the Inode Cache
• Generally can store up to 10M inodes
• Caching top levels of the Filesystem Tree greatly speeds up read performance
• 20-50% performance loss when addressing a filesystem tree that does not mostly fit into memory
• Writes can be buffered in the cache and are asynchronously flushed to RocksDB
• No requirement for durability - that is handled by the journal
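A write-back cache of this kind can be sketched as follows. The slides do not name the eviction policy, so LRU is an assumption here, as are the class and method names. The sketch flushes synchronously when `flush()` is called or when a dirty entry is evicted; in a real system a background thread would presumably drive the flush, and a plain dict stands in for RocksDB.

```python
from collections import OrderedDict


class WriteBackInodeCache:
    """LRU cache that buffers writes and flushes them to a backing store later."""

    def __init__(self, backing_store: dict, capacity: int):
        self.store = backing_store       # stand-in for the RocksDB inode table
        self.capacity = capacity
        self.entries = OrderedDict()     # inode ID -> metadata, oldest first
        self.dirty = set()               # IDs written but not yet flushed

    def get(self, inode_id):
        if inode_id in self.entries:
            self.entries.move_to_end(inode_id)   # mark as recently used
            return self.entries[inode_id]
        value = self.store.get(inode_id)         # cache miss: read through
        if value is not None:
            self._insert(inode_id, value)
        return value

    def put(self, inode_id, metadata) -> None:
        # Writes land in the cache only; the backing store is updated later.
        self._insert(inode_id, metadata)
        self.dirty.add(inode_id)

    def _insert(self, inode_id, metadata) -> None:
        self.entries[inode_id] = metadata
        self.entries.move_to_end(inode_id)
        while len(self.entries) > self.capacity:
            old_id, old_value = self.entries.popitem(last=False)
            if old_id in self.dirty:             # never drop unflushed data
                self.store[old_id] = old_value
                self.dirty.discard(old_id)

    def flush(self) -> None:
        """Write every dirty entry to the backing store."""
        for inode_id in list(self.dirty):
            self.store[inode_id] = self.entries[inode_id]
        self.dirty.clear()


if __name__ == "__main__":
    rocks = {}
    cache = WriteBackInodeCache(rocks, capacity=2)
    cache.put(1, "a")
    cache.put(2, "b")
    cache.put(3, "c")    # evicts inode 1, which is written out first
    cache.flush()
    print(sorted(rocks))
```

Losing buffered writes on a crash is acceptable in this design only because, as the slide notes, durability comes from the journal rather than from the cache or RocksDB.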

Additional & Future Work
• Fast startup time through using RocksDB checkpoints
• More sophisticated cache management policies

Conclusion
• RocksDB enables us to leverage off-heap storage
• Scales our raw metadata storage by an order of magnitude, allowing us to address over 1 billion files
• Available in Alluxio 2.0 - Released June 27th 2019!

Questions?
Alluxio Website - https://www.alluxio.io
Alluxio Community Slack Channel - https://www.alluxio.io/slack
Alluxio Office Hours & Webinars - https://www.alluxio.io/events


Cloud computing UNIT 2.1 presentation inRahulBhole12

Cloud storage allows users to store files online through cloud storage providers like Apple iCloud, Dropbox, Google Drive, Amazon Cloud Drive, and Microsoft SkyDrive. These providers offer various amounts of free storage and options to purchase additional storage. They allow files to be securely uploaded, accessed, and synced across devices. The best cloud storage provider depends on individual needs and preferences regarding storage space requirements and features offered.

Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.

Alluxio - Scalable Filesystem Metadata ServicesAlluxio, Inc.

Building a Distributed File System for the Cloud-Native EraAlluxio, Inc.

Accesso ai dati con Azure Data PlatformLuca Di Fino

Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.

Evolving HDFS to Generalized Storage SubsystemDataWorks Summit/Hadoop Summit

Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.

Intro to Big DataZohar Elkayam

Sizing your Content Databases: Understanding the LimitsRandy Williams
