Alluxio Day VI
October 12, 2021
https://www.alluxio.io/alluxio-day/
Speakers:
Vinoth Chandar, Apache Software Foundation
Raymond Xu, Zendesk
Building large scale transactional data lake using Apache Hudi (Bill Liu)
Data is critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency and to help distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then take a deep dive into how it improves data operations with features such as data versioning and time travel.
We will also go over how Hudi brings the kappa architecture to big data systems and enables efficient incremental processing for near real-time use cases.
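As a hedged illustration of the data versioning and incremental-processing ideas mentioned above (not code from the talk), the sketch below shows a Hudi time-travel read and an incremental pull using the standard Spark DataSource options; the table path and instant times are hypothetical.

```python
# Minimal sketch (assumed paths/instant times); requires Spark with the Hudi bundle on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-incremental-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

base_path = "s3://my-bucket/hudi/trips"  # hypothetical table location

# Time travel: read the table as of an earlier commit (instant time is hypothetical).
as_of = (spark.read.format("hudi")
         .option("as.of.instant", "20211001120000")
         .load(base_path))

# Incremental pull: only rows changed after a given commit, instead of a full scan.
incremental = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20211001120000")
               .load(base_path))

incremental.createOrReplaceTempView("trips_changes")
spark.sql("SELECT _hoodie_commit_time, count(*) FROM trips_changes GROUP BY 1").show()
```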
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Building a Data Strategy – Practical Steps for Aligning with Business Goals (DATAVERSITY)
Developing a Data Strategy for your organization can seem like a daunting task – but it’s worth the effort. Getting your Data Strategy right can provide significant value, as data drives many of the key initiatives in today’s marketplace – from digital transformation, to marketing, to customer centricity, to population health, and more. This webinar will help demystify Data Strategy and its relationship to Data Architecture and will provide concrete, practical ways to get started.
The document discusses modern data architectures. It presents conceptual models for data ingestion, storage, processing, and insights/actions. It compares traditional vs modern architectures. The modern architecture uses a data lake for storage and allows for on-demand analysis. It provides an example of how this could be implemented on Microsoft Azure using services like Azure Data Lake Storage, Azure Databricks, and Azure Data Warehouse. It also outlines common data management functions such as data governance, architecture, development, operations, and security.
Introducing Change Data Capture with Debezium (ChengKuan Gan)
This document discusses change data capture (CDC) and how it can be used to stream change events from databases. It introduces Debezium, an open source CDC platform that captures change events from transaction logs. Debezium supports capturing changes from multiple databases and transmitting them as a stream of events. The summary discusses how CDC can be used for data replication between databases, auditing, and in microservices architectures. It also covers deployment of CDC on Kubernetes using OpenShift.
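To make the CDC setup concrete, here is a small, hedged sketch of registering a Debezium PostgreSQL connector through the Kafka Connect REST API; the host names, credentials, and table list are hypothetical placeholders, not values from the talk.

```python
# Minimal sketch: register a Debezium PostgreSQL source connector via Kafka Connect's REST API.
# Host names, credentials, and table names below are hypothetical.
import json
import requests

connector_config = {
    "name": "inventory-cdc",                      # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres.internal", # hypothetical host
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",              # change events land on topics with this prefix
        "table.include.list": "public.orders",    # capture only this table
    },
}

resp = requests.post(
    "http://connect.internal:8083/connectors",    # hypothetical Kafka Connect endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
)
resp.raise_for_status()
print(resp.json())
```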
IBM Cloud Pak for Data is a unified platform that simplifies data collection, organization, and analysis through an integrated cloud-native architecture. It allows enterprises to turn data into insights by unifying various data sources and providing a catalog of microservices for additional functionality. The platform addresses challenges organizations face in leveraging data due to legacy systems, regulatory constraints, and time spent preparing data. It provides a single interface for data teams to collaborate and access over 45 integrated services to more efficiently gain insights from data.
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... (HostedbyConfluent)
Apache Hudi is a data lake platform that provides streaming primitives (upserts/deletes/change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood and other companies, while being pre-installed on four major cloud platforms.
Hudi supports exactly-once, near real-time data ingestion from Apache Kafka to cloud storage, which is typically used in place of an S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini batches, leading to lower data freshness. In this talk, we introduce a Kafka Connect Sink Connector for Apache Hudi, which writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services like indexing, compaction, and clustering work behind the scenes to further re-organize for better query performance.
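For context on the Kafka-to-Hudi ingestion pattern described above, the hedged sketch below shows the conventional Spark Structured Streaming route (the approach the new Kafka Connect sink is meant to complement, not the connector itself); the topic, schema, keys, and paths are hypothetical.

```python
# Minimal sketch: ingest a Kafka topic into a Hudi table with Spark Structured Streaming.
# Topic names, paths, and key fields are hypothetical; requires the Hudi Spark bundle.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hudi").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker.internal:9092")  # hypothetical
          .option("subscribe", "orders")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("v"))
          .select("v.*"))

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

query = (events.writeStream
         .format("hudi")
         .options(**hudi_options)
         .option("checkpointLocation", "s3://my-bucket/checkpoints/orders")  # hypothetical
         .outputMode("append")
         .start("s3://my-bucket/hudi/orders"))                               # hypothetical
```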
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ... (Chester Chen)
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized & inefficient. Data lakes are a common architectural pattern to organize big data and democratize access to the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and file sizes of the resulting data lake using purely open-source file formats, also providing for optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a Technical Lead on the Uber Data Infrastructure Team
3 Things to Learn About:
-How Kudu is able to fill the analytic gap between HDFS and Apache HBase
-The trade-offs between real-time transactional access and fast analytic performance
-How Kudu provides an option to achieve fast scans and random access from a single API
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use-cases of the second generation of ‘Hudi’, a self-contained Apache Spark library to build large scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) is created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has emerged. Along with the Hive Metastore, these table formats are trying to solve long-standing problems in traditional data lakes with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs (a minimal conversion sketch follows the talk outline below).
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
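As a hedged companion to the outline above, this sketch shows what switching an existing Parquet-based Spark job to Delta Lake typically looks like; the paths are hypothetical and this is not code from the talk.

```python
# Minimal sketch: write with Delta instead of Parquet, and convert an existing Parquet table in place.
# Requires the delta-spark package; paths are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-to-delta")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.range(0, 1000).withColumnRenamed("id", "event_id")

# Before: df.write.format("parquet").save(...)
# After: for new tables, the only change is the format string.
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# An existing Parquet directory (assumed to already hold data) can be converted in place;
# the files stay where they are and gain a Delta transaction log.
DeltaTable.convertToDelta(spark, "parquet.`/tmp/events_parquet`")
```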
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... (Databricks)
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline to solve your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
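As a hedged illustration of the WHAT/WHY/HOW framing above, this sketch wires a Kafka source to a Delta sink with Structured Streaming, with the trigger interval called out as the knob that trades latency against cost; the topic, paths, and trigger value are hypothetical.

```python
# Minimal sketch: a Kafka -> transform -> Delta pipeline where the trigger interval
# is the explicit latency/cost knob. Topic names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl-design-pattern").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker.internal:9092")  # hypothetical
       .option("subscribe", "clickstream")
       .load())

# WHAT we produce: parsed, filtered events.
events = (raw.selectExpr("CAST(value AS STRING) AS body", "timestamp")
          .filter(col("body").isNotNull()))

# HOW we run it: a 1-minute trigger favors cost over latency; shrink it only if
# the business actually needs fresher results (the WHY question above).
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/clickstream")  # hypothetical
         .trigger(processingTime="1 minute")
         .outputMode("append")
         .start("/tmp/delta/clickstream"))                              # hypothetical
```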
Transactional operations in Apache Hive: present and future (DataWorks Summit)
Apache Hive is an enterprise data warehouse built on top of Hadoop. Hive supports insert, update, delete, and merge SQL operations with transactional semantics and read operations that run at snapshot isolation. The well-defined semantics of these operations in the face of failure and concurrency are critical to building robust applications on top of Apache Hive. In the past there were many preconditions to enabling these features, which meant giving up other functionality. The need to make these tradeoffs is rapidly being eliminated.
This talk will describe the intended use cases, architecture of the implementation, and recent improvements and new features built for Hive 3.0. For example, bucketing transactional tables, while supported, is no longer required. The performance overhead of using transactional tables is nearly eliminated relative to identical non-transactional tables. We’ll also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL.
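A hedged sketch of the transactional operations described above: creating a full-ACID ORC table and applying a MERGE, issued here through the PyHive client to keep the example self-contained; the connection details, database, and table names are hypothetical, not material from the talk.

```python
# Minimal sketch: Hive 3 ACID table plus a MERGE statement, run through PyHive.
# Host, port, and table names are hypothetical; a customers_staging table is assumed to exist.
from pyhive import hive

conn = hive.connect(host="hiveserver2.internal", port=10000, username="etl")  # hypothetical
cur = conn.cursor()

# Full-ACID tables are ORC-backed and transactional; bucketing is no longer required in Hive 3.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id BIGINT, name STRING, email STRING
    ) STORED AS ORC TBLPROPERTIES ('transactional'='true')
""")

# Apply inserts and updates from a staging table in a single transactional MERGE.
cur.execute("""
    MERGE INTO customers AS t
    USING customers_staging AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email)
""")
conn.close()
```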
Speaker
Eugene Koifman, Hortonworks, Principal Software Engineer
How to build a streaming Lakehouse with Flink, Kafka, and Hudi (Flink Forward)
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by Ethan Guo & Kyle Weller
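To ground the Flink-plus-Hudi pipeline described above, here is a hedged PyFlink sketch defining a Kafka source table and a Hudi sink table in Flink SQL; the connector options shown are the commonly documented ones, and all topics, servers, and paths are hypothetical.

```python
# Minimal sketch: Flink SQL pipeline from Kafka into a Hudi MERGE_ON_READ table (PyFlink).
# Broker addresses, topics, and paths are hypothetical; requires the Kafka and Hudi Flink connectors.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders_kafka (
        order_id STRING,
        status STRING,
        updated_at TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker.internal:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE orders_hudi (
        order_id STRING PRIMARY KEY NOT ENFORCED,
        status STRING,
        updated_at TIMESTAMP(3)
    ) WITH (
        'connector' = 'hudi',
        'path' = 's3://my-bucket/hudi/orders',
        'table.type' = 'MERGE_ON_READ'
    )
""")

# Continuously upsert Kafka changes into the Hudi table.
t_env.execute_sql("INSERT INTO orders_hudi SELECT * FROM orders_kafka")
```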
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
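To make a few of the S3 tuning points above concrete, the hedged sketch below sets some widely documented S3A options on a Spark session (connection and thread pool sizes, multipart upload part size, streaming uploads); the specific values are illustrative, not recommendations from the talk.

```python
# Minimal sketch: common Hadoop S3A tuning knobs applied to a Spark session.
# Values are illustrative only; tune them against your own workload.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-tuning-demo")
         # Larger connection and thread pools for highly parallel readers/writers.
         .config("spark.hadoop.fs.s3a.connection.maximum", "200")
         .config("spark.hadoop.fs.s3a.threads.max", "64")
         # 128 MB multipart parts to reduce request counts on large uploads.
         .config("spark.hadoop.fs.s3a.multipart.size", "134217728")
         # Stream data to S3 during the write instead of buffering whole files on local disk.
         .config("spark.hadoop.fs.s3a.fast.upload", "true")
         .getOrCreate())

# Writing with the s3a:// scheme exercises the settings above; the bucket is hypothetical.
spark.range(0, 1_000_000).write.mode("overwrite").parquet("s3a://my-bucket/tmp/s3a_tuning_demo")
```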
High Performance Data Lake with Apache Hudi and Alluxio at T3Go (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal... (Databricks)
The reality of most large scale data deployments includes storage decoupled from computation, pipelines operating directly on files, and metadata services with no locking mechanisms or transaction tracking. For this reason, attempts at achieving transactional behavior, snapshot isolation, safe schema evolution or performant support for CRUD operations have always been marred with tradeoffs.
This talk will focus on technical aspects, practical capabilities and the potential future of three table formats that have emerged in recent years as solutions to the issues mentioned above – ACID ORC (in Hive 3.x), Iceberg and Delta Lake. To provide a richer context, a comparison between traditional databases and big data tools as well as an overview of the reasons for the current state of affairs will be included.
After the talk, the audience is expected to have a clear understanding of the current development trends in large scale table formats, on the conceptual and practical level. This should allow the attendees to make better informed assessments about which approaches to data warehousing, metadata management and data pipelining they should adopt in their organizations.
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We will cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail, along with our contributions to the open-source community.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
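As a hedged, concrete companion to the format discussion above, the sketch below uses PyArrow to write a Parquet file with explicit row-group sizing and dictionary encoding, then reads it back with a filter that min/max statistics can serve; the file name and columns are illustrative.

```python
# Minimal sketch: row groups, dictionary encoding, and statistics-based filtering with PyArrow.
# File path and columns are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": pa.array(range(1_000_000)),
    "country": pa.array(["US", "DE", "JP", "BR"] * 250_000),  # low-cardinality, dictionary-friendly
})

pq.write_table(
    table,
    "events.parquet",
    row_group_size=128_000,     # smaller row groups -> finer-grained min/max skipping
    use_dictionary=True,        # dictionary-encode low-cardinality columns
    compression="snappy",       # page compression
)

# Predicate pushdown: only row groups whose statistics can match the filter are read.
subset = pq.read_table("events.parquet", filters=[("user_id", ">", 900_000)])
print(subset.num_rows)

# Inspect the physical layout described above (row groups, column chunks, statistics).
meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_row_groups, meta.row_group(0).column(0).statistics)
```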
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays those changes promptly into an external store such as Delta or Kudu for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether it is easy to support a variety of databases with little code.
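To illustrate the replay step described above, here is a hedged sketch that applies a micro-batch of change records to a Delta table with MERGE inside foreachBatch; the change-record schema, key column, topic, and paths are hypothetical.

```python
# Minimal sketch: apply CDC micro-batches to a Delta table via MERGE in foreachBatch.
# The change-record schema (id, name, op), topic, and paths are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-to-delta").getOrCreate()

def apply_changes(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/tmp/delta/customers")   # hypothetical, pre-created target
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")          # deletes from the binlog
        .whenMatchedUpdateAll(condition="s.op != 'DELETE'")      # updates
        .whenNotMatchedInsertAll(condition="s.op != 'DELETE'")   # inserts
        .execute())

changes = (spark.readStream.format("kafka")                      # CDC stream, e.g. from a binlog reader
           .option("kafka.bootstrap.servers", "broker.internal:9092")
           .option("subscribe", "customers-cdc")
           .load()
           .selectExpr("CAST(value AS STRING) AS json")
           .selectExpr("from_json(json, 'id LONG, name STRING, op STRING') AS r")
           .select("r.*"))

(changes.writeStream
    .foreachBatch(apply_changes)
    .option("checkpointLocation", "/tmp/checkpoints/customers_cdc")
    .start())
```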
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022 (HostedbyConfluent)
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds, by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach combining MVCC with optimistic concurrency control lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management such as cleaning older file versions, compaction of delta logs into base files, dynamic re-clustering for faster query performance, or the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
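As a hedged illustration of the table services mentioned at the end of the abstract, the sketch below enables inline compaction, clustering, and cleaning through commonly documented Hudi write options on a Spark write; the option values, table name, and path are illustrative, not settings from the talk.

```python
# Minimal sketch: a Hudi upsert with table services (cleaning, compaction, clustering) configured inline.
# Table name, keys, thresholds, and path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-table-services").getOrCreate()

df = spark.createDataFrame(
    [(1, "created", "2021-10-12 10:00:00"), (2, "shipped", "2021-10-12 10:05:00")],
    ["order_id", "status", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Cleaning: keep a bounded history of file versions.
    "hoodie.cleaner.commits.retained": "10",
    # Compaction: fold delta logs into base files every few delta commits.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: periodically re-sort/re-size files for faster queries.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/hudi/orders"))  # hypothetical path
```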
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting columns improved statistics and reduced data reads by 30%. For the reader, splitting it to first evaluate filter columns and then read the remaining columns prevented loading unnecessary data. These changes improved Spark SQL performance at ByteDance without requiring changes to user jobs.
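A hedged sketch of the sorting idea described above: writing Parquet data sorted on a frequently filtered column tightens per-row-group min/max statistics, so filtered reads can skip more data; the column names, paths, and filter are illustrative, not ByteDance's actual implementation.

```python
# Minimal sketch: sort on a commonly filtered column before writing so row-group
# min/max statistics become selective, then rely on Parquet filter pushdown when reading.
# Paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("parquet-pushdown-demo").getOrCreate()

events = spark.range(0, 10_000_000).withColumn("user_id", (col("id") * 7919) % 1_000_000)

# Without sorting, every row group spans almost the full user_id range and little can be skipped.
# Sorting clusters values, so each row group covers a narrow range.
(events.sort("user_id")
       .write.mode("overwrite")
       .parquet("/tmp/events_sorted"))

# This filter is pushed down to the Parquet reader; row groups whose statistics
# cannot match are skipped entirely.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
hits = spark.read.parquet("/tmp/events_sorted").filter(col("user_id") < 1000)
print(hits.count())
```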
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa... (StreamNative)
Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector, and then demonstrate how you can use the Hudi DeltaStreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.
At Salesforce, we have deployed many thousands of HBase/HDFS servers, and learned a lot about tuning during this process. This talk will walk you through the many relevant HBase, HDFS, Apache ZooKeeper, Java/GC, and Operating System configuration options and provides guidelines about which options to use in what situation, and how they relate to each other.
An Open Source Incremental Processing Framework called Hoodie is summarized. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
The document discusses LinkedIn's data ecosystem and the challenge of bridging operational transactional data (OLTP) with analytical processing (OLAP) at scale. It describes LinkedIn's solution called Lumos, which is a scalable ETL framework that uses change data capture, delta processing, and virtual snapshots to frequently refresh petabyte-scale data from OLTP databases into Hadoop for OLAP. Lumos supports requirements like handling multiple data centers, schema evolution, and efficient change capture while ensuring data consistency and low latency refresh times.
Hadoop and object stores: can we do it better? (gvernik)
This document discusses using object stores for unstructured data storage instead of HDFS. It notes that existing Hadoop components are inefficient when working with object stores, due to algorithms that create temporary files and to supporting full HDFS shell functionality. An alternative approach called Stocator is presented that is optimized for object stores by interacting with them directly, without Hadoop modules. Stocator provides significant performance gains over HDFS connectors by generating fewer requests and being tailored for analytic workflows rather than full HDFS compatibility. Case studies with SETI show Stocator processing large datasets from object stores much faster than HDFS connectors.
Hadoop and object stores: Can we do it better? (gvernik)
Strata Data Conference, London, May 2017
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used.
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
In this session, we'll discuss architectural, design and tuning best practices for building rock solid and scalable Alfresco Solutions. We'll cover the typical use cases for highly scalable Alfresco solutions, like massive injection and high concurrency, also introducing 3.3 and 3.4 Transfer / Replication services for building complex high availability enterprise architectures.
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap... (nadine39280)
Discover the evolution of Apache Hudi within the open-source realm - a community and project pushing the boundaries of data lake possibilities. This presentation delves into Apache Hudi 1.0, a pivotal release reimagining its transactional database layer while honoring its foundational principles. Join us in this transformative journey!
Join the Apache Hudi Community
https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg.
Follow us on LinkedIn and Twitter
https://www.linkedin.com/company/apache-hudi/
https://twitter.com/apachehudi
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes and chaining of incremental processing in Hadoop.
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015alanfgates
The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to TB of data.
The document discusses Project Voldemort, a distributed key-value storage system developed at LinkedIn. It provides an overview of Voldemort's motivation and features, including high availability, horizontal scalability, and consistency guarantees. It also describes LinkedIn's use of Voldemort and Hadoop for applications like event logging, online lookups, and batch processing of large datasets.
Optimizing Big Data to run in the Public Cloud (Qubole)
Qubole is a cloud-based platform that allows customers to easily run Hadoop and Spark clusters on AWS for big data analytics. It optimizes performance and reduces costs through techniques like caching data in S3 for faster access, using spot instances, and directly writing query outputs to S3. The document discusses Qubole's features, capabilities, and how it provides an easier way for more users like data scientists and analysts to access and query big data compared to building and managing Hadoop clusters themselves.
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming (Dibyendu Bhattacharya)
My presentation at the recently concluded Apache Big Data Conference Europe about the reliable low-level Kafka Spark consumer I developed and a use case of real-time indexing to Apache Blur using this consumer.
This document summarizes a presentation on building a streaming Lakehouse with Apache Flink and Apache Hudi. The presentation introduces Hudi as a way to unify batch and streaming workloads in a centralized data lake platform. It discusses how Hudi enables features like efficient upserts/deletes, incremental processing for change streams, and automatic catalog synchronization. The presentation demonstrates using Flink and Hudi on Amazon EMR and outlines several ongoing Hudi projects, including a new metaserver and lake cache, to further optimize query performance and metadata handling for streaming data lakes.
Apache Marmotta is a linked data platform that provides a linked data server, SPARQL server, and development environment for building linked data applications. It uses modular components including a triplestore backend, SPARQL endpoint, LDCache for remote data access, and an optional reasoner. Marmotta is implemented as a Java web application and uses services, dependency injection, and REST APIs.
Overview of MongoDB and Other Non-Relational Databases (Andrew Kandels)
My Minnesota PHP Usergroup (mnphp.org) presentation where I give an overview on MongoDB and other non-relational databases and their ability to solve unique, complex problems.
How Coupang Leverages Distributed Cache to Accelerate ML Model Training (Alluxio, Inc.)
Alluxio Tech Talk Webinar
Apr. 22, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Hyun Jung Baek (Staff Backend Engineer @ Coupang)
Description
Coupang is a leading e-commerce company in South Korea, with over 50,000 employees and $20+ billion in annual revenue. Coupang's AI platform team builds and manages a large-scale AI platform in AWS for machine learning engineers to train models that enhance and customize product search results and product recommendations for its 100+ million customers.
As the search and recommendation models evolve, optimizing the underlying infrastructure for AI/ML workloads is essential for the e-commerce business. Coupang's platform team actively sought to improve their model training pipeline to boost machine learning engineers' productivity, publish models to production faster, and reduce operational costs.
Coupang focused on addressing several key areas:
- Shortening data preparation and model training time
- Improving GPU utilization in training clusters in different regions
- Reducing S3 API and egress costs incurred from copying large training datasets across regions
- Simplifying the operational complexity of storage system management
In this tech talk, Hyun Jung Baek, Staff Backend Engineer at Coupang, will share best practices for leveraging Alluxio to power search and recommendation model training infrastructure.
Hyun will discuss:
- How Coupang builds a world-class large-scale AI platform for machine learning engineers to deliver better search and recommendation models
- How adding distributed caching to their multi-region AI infrastructure improves GPU utilization, accelerates end-to-end training time, and significantly reduces cross-region data transfer costs.
- How to simplify platform operations and to easily deploy the same architecture to new GPU clusters.
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu... (Alluxio, Inc.)
Alluxio Webinar
Apr 1, 2025
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Stephen Pu (Staff Software Engineer @ Alluxio)
Deepseek’s recent announcement of the Fire-flyer File System (3FS) has sparked excitement across the AI infra community, promising a breakthrough in how machine learning models access and process data.
In this webinar, an expert in distributed systems and AI infrastructure will take you inside Deepseek 3FS, the purpose-built file system for handling large files and high-bandwidth workloads. We’ll break down how 3FS optimizes data access and speeds up AI workloads as well as the design tradeoffs made to maximize throughput for AI workloads.
In this webinar, you’ll learn how 3FS works under the hood, including:
✅ The system architecture
✅ Core software components
✅ Read/write flows
✅ Data distribution/placement algorithms
✅ Cluster/node management and disaster recovery
Whether you’re an AI researcher, ML engineer, or infrastructure architect, this deep dive will give you the technical insights you need to determine if 3FS is the right solution for you.
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat... (Alluxio, Inc.)
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Xu Ning (Director of Engineering, AI Platform @ Snap)
In this talk, Xu Ning from Snap provides a comprehensive overview of the unique challenges in building and scaling recommendation systems compared to LLM applications.
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune (Alluxio, Inc.)
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Chongxiao Cao (Senior SWE @ Uber)
Chongxiao Cao from Uber's Michelangelo training team shared valuable insights into Uber's approach to optimizing LLM training and fine-tuning workflows.
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ... (Alluxio, Inc.)
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology @ Alluxio)
In this talk, Bin Fan shares his insights on data access challenges in ML applications, with particular emphasis on how Alluxio's distributed caching helps bridge the gap between storage and compute in preprocessing, pretraining and inference.
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale (Alluxio, Inc.)
AI/ML Infra Meetup
Mar. 06, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Sean Po (Staff SWE @ Uber)
- Tse-Chi Wang (Senior SWE @ Uber)
This talk provided a deep dive into how Uber manages its Generative AI Gateway, which powers all generative AI applications across the company.
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio... (Alluxio, Inc.)
Alluxio Webinar
Feb. 25, 2025
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
Tom Luckenbach (Solutions Engineering Manager, Alluxio)
Join us to learn about the latest release of Alluxio Enterprise AI. In this webinar, we’ll provide an overview of the new features and capabilities of Alluxio Enterprise AI, built to accelerate AI workloads and maximize GPU utilization.
Key highlights include:
- New caching mode accelerates AI checkpoints
- Advanced cache eviction policies provide fine-grained control
- Python SDK integrations enhance AI framework compatibility
- A demo of Alluxio accelerating AI training workloads in AWS
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack (Alluxio, Inc.)
AI/ML Infra Meetup
Jan. 23, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor @ University of Chicago)
LLM inference can be huge, particularly, with long contexts. In this on-demand video, Junchen Jiang, Assistant Professor at University of Chicago, presents a 10x solution for long contexts inference: an easy-to-deploy stack over multiple vLLM engines with tailored KV-cache backend.
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU... (Alluxio, Inc.)
AI/ML Infra Meetup
Jan. 23, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology @ Alluxio)
Ready to optimize your AI infra strategy? Watch this on-demand video, where Bin Fan, VP of Technology at Alluxio, will guide you through how to balance cost & performance for GPU/CPU workloads.
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit... (Alluxio, Inc.)
AI/ML Infra Meetup
Jan. 23, 2025
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Robert Nishihara (Co-Founder @ Anyscale)
You won't want to miss this talk presented by Robert Nishihara, Co-Founder of Anyscale, which is packed with insights on using Ray to conquer the last-mile challenges in AI deployment.
Alluxio Webinar | Accelerate AI: Alluxio 101 (Alluxio, Inc.)
Alluxio Webinar
Dec. 3, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Bill Hodak (VP of Marketing and Product Marketing, Alluxio)
In the rapidly evolving landscape of AI and machine learning, Platform and Data Infrastructure Teams face critical challenges in building and managing large-scale AI platforms. Performance bottlenecks, scalability of the platform, and scarcity of GPUs pose significant challenges in supporting large-scale model training and serving.
In this talk, we will introduce how Alluxio helps Platform and Data Infrastructure teams deliver faster, more scalable platforms to ML Engineering teams developing and training AI models. Alluxio’s highly-distributed cache accelerates AI workloads by eliminating data loading bottlenecks and maximizing GPU utilization. Customers report up to 4x faster training performance with high-speed access to petabytes of data spread across billions of files regardless of persistent storage type or proximity to GPU clusters. Alluxio’s architecture lowers data infrastructure costs, increases GPU utilization, and enables workload portability for navigating GPU scarcity challenges.
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AIAlluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Zhe Zhang (Distinguished Engineer @ NVIDIA)
In this talk, Zhe Zhang (NVIDIA, ex-Anyscale) introduced Ray and its applications in the LLM and multi-modal AI era. He shared his perspective on ML infrastructure, noting that it presents more unstructured challenges, and recommended using Ray and Alluxio as solutions for increasingly data-intensive multi-modal AI workloads.
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (Founding Engineer, VP of Technology @ Alluxio)
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) are becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
AI/ML Infra Meetup | Big Data and AI, Zoom DevelopersAlluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Sandeep Manchem (ML Platform Engineering Manager @ Zoom)
In this talk, Sandeep Manchem (Zoom) discussed big data and AI, covering typical platform architecture and data challenges. We had engaging discussions about ensuring data safety and compliance in Big Data and AI applications.
AI/ML Infra Meetup | TorchTitan, One-stop PyTorch native solution for product...Alluxio, Inc.
AI/ML Infra Meetup
Nov. 7, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tianyu Liu (Research Scientist @ Meta)
TorchTitan is a proof-of-concept for Large-scale LLM training using native PyTorch. It is a repo that showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
Alluxio Webinar | Model Training Across Regions and Clouds – Challenges, Solu...Alluxio, Inc.
Alluxio Webinar
Oct. 15, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tom Luckenbach (Solutions Engineering Manager, Alluxio)
AI training workloads running on compute engines like PyTorch, TensorFlow, and Ray require consistent, high-throughput access to training data to maintain high GPU utilization. However, with the decoupling of compute and storage and with today’s hybrid and multi-cloud landscape, AI Platform and Data Infrastructure teams are struggling to cost-effectively deliver the high-performance data access needed for AI workloads at scale.
Join Tom Luckenbach, Alluxio Solutions Engineering Manager, to learn how Alluxio enables high-speed, cost-effective data access for AI training workloads in hybrid and multi-cloud architectures, while eliminating the need to manage data copies across regions and clouds.
What Tom will share:
- AI data access challenges in cross-region, cross-cloud architectures.
- The architecture and integration of Alluxio with frameworks like PyTorch, TensorFlow, and Ray using POSIX, REST, or Python APIs across AWS, GCP and Azure.
- A live demo of an AI training workload accessing cross-cloud datasets leveraging Alluxio's distributed cache, unified namespace, and policy-driven data management.
- MLPerf and FIO benchmark results and cost-savings analysis.
AI/ML Infra Meetup | Scaling Experimentation Platform in Digital Marketplaces...Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Koundinya Pidaparthi (VP of Analytics @ Poshmark)
Scaling experimentation in digital marketplaces is crucial for driving growth and enhancing user experiences. However, varied methodologies and a lack of experiment governance can hinder the impact of experimentation, leading to inconsistent decision-making, inefficiencies, and missed opportunities for innovation.
At Poshmark, we developed a homegrown experimentation platform, Lightspeed, that allowed us to make reliable and confident reads on product changes, which led to a 10x growth in experiment velocity and positive business outcomes along the way.
This session will provide a deep dive into the best practices and lessons learned from successful implementations of large-scale experiments. We will explore the importance of experimentation, overcome scalability challenges, and gain insights into the frameworks and technologies that enable effective testing.
AI/ML Infra Meetup | Scaling Vector Databases for E-Commerce Visual Search: A...Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Mahesh Pasupuleti (VP of DS, ML & Data Infra @ Poshmark)
In the rapidly evolving world of e-commerce, visual search has become a game-changing technology. Poshmark, a leading fashion resale marketplace, has developed Posh Lens – an advanced visual search engine that revolutionizes how shoppers discover and purchase items.
Under the hood of Posh Lens lies Milvus, a vector database enabling efficient product search and recommendation across our vast catalog of over 150 million items. However, with such an extensive and growing dataset, maintaining high-performance search capabilities while scaling AI infrastructure presents significant challenges.
In this talk, Mahesh Pasupuleti shares:
- The architecture and strategies to scale Milvus effectively within the Posh Lens infrastructure
- Key considerations, including optimizing vector indexing, managing data partitioning, and ensuring query efficiency amidst large-scale data growth
- Distributed computing principles and advanced indexing techniques to handle the complexity of Poshmark's diverse product catalog
Alluxio Webinar | Optimize, Don't Overspend: Data Caching Strategy for AI Wor...Alluxio, Inc.
Alluxio Webinar
Sept. 10, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Senior Program Manager, Alluxio)
As machine learning and deep learning models grow in complexity, AI platform engineers and ML engineers face significant challenges with slow data loading and GPU utilization, often leading to costly investments in high-performance computing (HPC) storage. However, this approach can result in overspending without addressing the core issues of data bottlenecks and infrastructure complexity.
A better approach is adding a data caching layer between compute and storage, like Alluxio, which offers a cost-effective alternative through its innovative data caching strategy. In this webinar, Jingwen will explore how Alluxio's caching solutions optimize AI workloads for performance, user experience and cost-effectiveness.
What you will learn:
- The I/O bottlenecks that slow down data loading in model training
- How Alluxio's data caching strategy optimizes I/O performance for training and GPU utilization, and significantly reduces cloud API costs
- The architecture and key capabilities of Alluxio
- Using Rapid Alluxio Deployer to install Alluxio and run benchmarks in AWS in just 30 minutes
AI/ML Infra Meetup | Maximizing GPU Efficiency : Optimizing Model Training wi...Alluxio, Inc.
AI/ML Infra Meetup
Aug. 29, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (VP of Technology, Founding Engineer @ Alluxio)
In the rapidly evolving landscape of AI and machine learning, infra teams face critical challenges in managing large-scale data for AI. Performance bottlenecks, cost inefficiencies, and management complexities pose significant challenges for AI platform teams supporting large-scale model training and serving.
In this talk, Bin Fan will discuss the challenges of I/O stalls that lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.
What you will learn:
- How to identify GPU utilization and I/O-related performance bottlenecks in model training
- Leverage GPU anywhere to maximize resource utilization
- Best practices for monitoring and optimizing GPU usage across training and serving pipelines
- Strategies for reducing cloud costs and simplifying management of AI infrastructure at scale
5. Hudi - the Pioneer
Serverless, transactional layer over
lakes.
Multi-engine, Decoupled storage
from engine/compute
Introduced notions of
Copy-On-Write and
Merge-on-Read
Change capture on lakes
Ideas now heavily borrowed
outside.
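To make the Copy-On-Write / Merge-On-Read distinction concrete, here is a minimal PySpark sketch of picking a table type at write time; it assumes a Spark session with the Hudi Spark bundle on the classpath, and the DataFrame, column names, and base path are illustrative.

```python
# Minimal sketch: choosing the Hudi table type at write time.
# Assumes `spark` has the Hudi Spark bundle available and `df` is an input DataFrame
# with `uuid`, `ts`, and `partition` columns (all illustrative).
hudi_options = {
    "hoodie.table.name": "trips",                               # illustrative
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",      # or "MERGE_ON_READ"
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "partition",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")                                              # Hudi creates the table on first write
   .save("s3://bucket/lake/trips"))                             # illustrative base path
```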
6. The Hudi Stack
Lakes on cheap, scalable Hadoop compatible storage
Built on open file and data formats
Transactional Database Kernel
- Table Format for file layouts, schema, …
- Indexing for faster updates/deletes
- Built-in “daemons” aka table services
- MVCC, OCC Concurrency Control
SQL and Programming APIs
Platform services and operational tools
Universally queryable from popular engines
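The "SQL and Programming APIs" layer can also be exercised from Spark SQL; a minimal sketch, assuming Hudi 0.9+ with the Hudi Spark session extensions enabled (table and column names are illustrative):

```python
# Minimal sketch of the SQL surface (Hudi 0.9+ Spark SQL extensions assumed).
spark.sql("""
  CREATE TABLE IF NOT EXISTS hudi_trips (
    uuid STRING,
    rider STRING,
    fare DOUBLE,
    ts BIGINT
  )
  USING hudi
  TBLPROPERTIES (
    type = 'cow',             -- or 'mor'
    primaryKey = 'uuid',
    preCombineField = 'ts'
  )
""")

spark.sql("INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 12.5, 1633046400)")
```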
7. It’s a platform!
Both streaming + batch style pipelines
- State store for incremental merging intermediate results
- Change events like Apache Kafka topics
For data lake workloads
- Optimized, self-managing data plane
- Large scale data processing
- Lakehouse?
With tightly-integrated components
- Loose coupling => too many to integrate
- Reduce build out time for data lakes
http://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform
9. File Formats
Base and Delta Log Files
- Parquet, Orc, HFile Base files
- Avro log files
- Encode changes as blocks
Ongoing
- Parquet log blocks for large
batch writes
- CSV, unstructured formats
- pre-materialization for
masking/data privacy
10. Indexes
Pluggable, Consistent with txns
For upserts, deletes
- HBase, External index ->
pluggable
- Simple, Bloom/Local vs Global
Ongoing
- RFC-27 Range indexes
- Bucketed Index
- DynamoDB index
- Metadata index
- Record level indexing
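A sketch of how an index type is chosen per write, using Hudi's documented write configs; `df` is an assumed input DataFrame and the tuning values are illustrative, not recommendations:

```python
# Index selection is a write-side config; BLOOM is the default,
# GLOBAL_* variants enforce key uniqueness across partitions,
# HBASE delegates to an external index.
index_options = {
    "hoodie.index.type": "GLOBAL_BLOOM",              # BLOOM | GLOBAL_BLOOM | SIMPLE | GLOBAL_SIMPLE | HBASE
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",   # illustrative tuning knob
    "hoodie.bloom.index.prune.by.ranges": "true",     # use key ranges to skip files
}

(df.write.format("hudi")
   .option("hoodie.table.name", "orders")
   .option("hoodie.datasource.write.recordkey.field", "order_id")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .options(**index_options)
   .mode("append")
   .save("/tmp/hudi/orders"))
```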
11. Concurrency Control
Hudi did not need multi-writer support
- Treat writers and services differently
- MVCC, non-blocking
- Table services satisfy most needs
Hudi now does Optimistic Concurrency Control
- File level, timeline consistent
- Still MVCC for table services
Future/Ongoing
- Multi-table transactions
- MVCC, fully lock free transactions
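A sketch of the optimistic concurrency knobs as they appear in the write configs (Hudi 0.8+); the ZooKeeper endpoint and paths are placeholders, and these options would be merged into the normal writer options:

```python
# Optimistic concurrency control for multi-writer scenarios (Hudi 0.8+);
# merged into the usual writer options. Endpoints/paths are placeholders.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",   # cleaner reclaims failed writes lazily
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",          # placeholder
    "hoodie.write.lock.zookeeper.port": "2181",            # placeholder
    "hoodie.write.lock.zookeeper.lock_key": "orders",      # one lock key per table
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}
```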
12. Writers
Incremental & Batch write operations
- File sizing, Layout control upon write
- Sorting, compression, Index maintenance
- Spill handling, Multi-threaded write pipeline
Record-level merge APIs
- Unique keys, composite keys
- Key generators, virtual or physical keys
- Partial merges, event-time processing
Record level metadata
- Arrival and event time, watermarks
- Encode source CDC operation
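A sketch of the record-level merge surface described above: a composite key via a key generator, plus event-time style ordering through the pre-combine field (the DataFrame, columns, and path are illustrative):

```python
# Record-level merges: composite key via a key generator, event-time ordering
# via the pre-combine field (the newest `event_ts` wins on upsert).
writer_options = {
    "hoodie.table.name": "device_events",                    # illustrative
    "hoodie.datasource.write.operation": "upsert",           # insert | upsert | bulk_insert | delete ...
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "device_id,event_id",   # composite key
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "event_ts",
}

(events_df.write.format("hudi")                              # events_df is an illustrative DataFrame
    .options(**writer_options)
    .mode("append")
    .save("s3://bucket/lake/device_events"))
```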
13. Readers
Hive, Impala, Presto, Spark, Trino, Redshift
Use engine’s native readers
First class support for incremental queries
Flexibility - snapshot vs read-optimized
Future
- Flexible change stream data models.
- Snowflake/BigQuery external tables
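The snapshot, read-optimized, and incremental views map to query-side options; a minimal sketch (the base path and begin instant are placeholders):

```python
base_path = "s3://bucket/lake/device_events"   # placeholder

# Snapshot view (default): latest committed data, merging logs for MOR tables.
snapshot_df = spark.read.format("hudi").load(base_path)

# Read-optimized view: base files only, trading freshness for speed on MOR tables.
read_optimized_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(base_path))

# Incremental view: only records changed since a given commit instant.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20211001000000")   # placeholder instant
    .load(base_path))
```

Other engines in the list above expose the same views through their native readers.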
14. Table Services
Self managing database runtime
Table services know each other
- E.g. avoid duplicate schedules
- E.g. skip compacting files being clustered
Cleaning (committed/uncommitted), archival,
clustering, compaction, ..
Services can be run continuously or scheduled
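A sketch of turning the table services on inline with writes, using Hudi's write configs (the retention numbers are illustrative, not recommendations):

```python
# Inline table services: compaction (MOR), clustering, cleaning and archival.
# Numbers are illustrative, not recommendations.
table_service_options = {
    "hoodie.compact.inline": "true",                    # compact delta logs after commits (MOR tables)
    "hoodie.compact.inline.max.delta.commits": "5",
    "hoodie.clustering.inline": "true",                 # re-layout / sort small files
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.cleaner.commits.retained": "10",            # cleaning: keep file versions for the last N commits
    "hoodie.keep.min.commits": "20",                    # archival window for the timeline
    "hoodie.keep.max.commits": "30",
}
```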
17. What qualifies as table metadata?
Schema - Column names/types, keys, partitioning, evolution/versions
- Typically small, < 1MB per version.
Files/Objects - Length, paths, URIs
- 2M objects => 10s of MBs
Stats - Min, max, null counts etc., per column per file
- 2M objects => 100+ of MBs
Redo Logs - Changes to metadata => writes, rollbacks, table optimizations.
- Committing (~200 KB) every minute for a year => ~100 GB (see the quick check below)
Indexes? - Remember Stats != Index, They can be much bigger.
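The redo-log figure above is back-of-envelope arithmetic; a quick check:

```python
# ~200 KB of timeline metadata committed every minute, retained for a year:
commits_per_year = 60 * 24 * 365            # 525,600 commits
bytes_per_commit = 200 * 1024               # ~200 KB
total_gb = commits_per_year * bytes_per_commit / 1024**3
print(round(total_gb))                      # ~100 (GB), matching the slide's estimate
```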
18. How’s this stored in Hudi, today?
Schema - Stored within the redo log, consistent with table changes.
- Synced out to different meta-stores, post commit
Files/Objects - Obtained from an internal metadata table partition `files`
- Or just by listing storage - sometimes it’s faster!
Redo Logs - As an event log in the timeline folder “.hoodie”
- Archived out, once transactions/table operations complete/expire.
Stats - We don’t. Yet. Fetch from file footers.
- Again sometimes faster if parallelized, even on cloud storage.
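The file listings described here can come from Hudi's internal metadata table, which is toggled by a single write config (Hudi 0.7+); a minimal sketch:

```python
# Enable the internal metadata table so file listings are served from its
# `files` partition instead of (sometimes slower) storage listing calls.
metadata_options = {
    "hoodie.metadata.enable": "true",
}
```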
19. RFC-27 (Ongoing): Flat Files are not cool
Scaling file stats for high scale writing
- 65536 files (1TB data, stored as 16MB
small files)
- 100 columns, 6.5M stat entries
- O(total_cols_tracked_in_table)
- Slow, 10s of seconds.
Range reads to the rescue!
- O(num_cols_in_query) performance
- Interval trees with smart skipping
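To make the O(num_cols_in_query) point concrete, here is a toy, non-Hudi sketch of pruning files from per-column min/max stats; the actual RFC-27 design uses range reads and interval trees over a stats index rather than an in-memory dict:

```python
# Toy pruning with per-column [min, max] stats.
# stats: {file_path: {column: (min_value, max_value)}}
def prune_files(stats, predicates):
    """predicates: {column: (lo, hi)} ranges the query filters on.
    Only queried columns are consulted, so the lookup cost per file scales
    with the number of columns in the query, not all columns tracked."""
    kept = []
    for path, col_stats in stats.items():
        overlaps = all(
            not (hi < col_stats[col][0] or lo > col_stats[col][1])
            for col, (lo, hi) in predicates.items()
            if col in col_stats
        )
        if overlaps:
            kept.append(path)
    return kept

files = {
    "f1.parquet": {"ts": (100, 200), "fare": (1.0, 9.0)},
    "f2.parquet": {"ts": (201, 300), "fare": (5.0, 20.0)},
}
print(prune_files(files, {"ts": (250, 400)}))   # ['f2.parquet']
```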
20. The Hudi Timeline server
Metadata needs efficient serving and caching
- Not just efficient storage
Responsibilities
- Cache file listings across executors
- Amortize access to metadata table
- Performant uncommitted file
cleanup
Incremental sync
- Streaming/continuous writes
- Lazy refreshing of timeline
S3 Baseline: listing p90
- 1sec (10k files),
- 10 sec (100K files)
Timeline Server: 1-10 ms!
File-backed metadata: ~1 second!
21. Extending the Timeline Server
New APIs
- Also serve stats and redo log information
- Locking APIs
Let’s make a cluster!
- Shard servers by table/db
- Pluggable backing storage
- Local DB w/ recovery/checkpointing
- Remote DB with
newSQL/transactional storage
23. Basic Idea
Problems
- Frequent commits => small objects /
blocks => I/O costly
- File System / Block level caching not
very effective
[Diagram: a Hudi FileGroup over time — base file b @ t1 with its log files 1 and 2, then base file b' @ t2 with its log files 1, 2 and 3]
Hudi FileGroup fits caching
- Smallest unit to compact
- Size properly to fit cache store
- Cache compacted data for
real-time views => save
computation
24. Design Considerations
Refresh-Ahead
- Works with Change-Data-Capture
scenario
- Micro-compact FileGroup and save in
cache
[Diagram: refresh-ahead path — change-data-capture events trigger micro-compaction of a FileGroup (base file b plus its log files), and the compacted result is pushed into the cache]
Read-Through
- Driven by usage, on-demand
computation
- LRU or LFU
[Diagram: read-through path — query I/O populates the cache on demand]
25. Design Considerations
FileGroup consistent hashing
- Each FileGroup has a unique ID
- Work with distributed cache servers
[Diagram: FileGroups consistently hashed across cache nodes A and B (cache, e.g. Alluxio), with a coordinator (the timeline server?) routing query I/O to the right node above lake storage]
Transactionality
- Only committed files can be cached
- Rollbacks include cache invalidation
Pluggable Caching Layer
- Define APIs for pluggable caching implementations
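A toy sketch of the consistent-hashing idea on this slide: route each FileGroup ID to a cache node so assignments move minimally when nodes join or leave (hash ring only; no virtual nodes or replication):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so FileGroup -> node assignment survives process restarts.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class FileGroupRing:
    """Toy consistent-hash ring mapping FileGroup IDs to cache nodes."""
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._ring]

    def node_for(self, file_group_id: str) -> str:
        idx = bisect.bisect(self._keys, _hash(file_group_id)) % len(self._ring)
        return self._ring[idx][1]

ring = FileGroupRing(["cache-node-a", "cache-node-b"])
print(ring.node_for("fg-0007"))   # stable routing for this FileGroup ID
```

The slide leaves open whether the coordinator is the timeline server; the sketch only illustrates that a FileGroup's unique ID makes a natural sharding key.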
27. How we roll?
Friendly and diverse community
- Open and Collaborative
- 20+ PMCs/Committers from 10+
organizations
Developers
- Propose new RFCs (design docs)
- Dev list discussions, JIRA for issue tracking.
Users
- Weekly community on-call rotations
- Issue triage, bug filing process on Github
1200+ on Slack, 200+ Contributors, 1000+ GH Engagers, ~10-20 PRs/week, 20+ Committers, 10+ PMCs
28. Major Ongoing Works
RFC-26: Z-order indexing, Hilbert curves (PR #3330)
RFC-27: Data skipping/Range indexing (PR #3475)
RFC-29: Hashed Indexing (PR #3173)
RFC-32: Kafka Connect Sink for Hudi (Pre-release; available in 0.10.0)
RFC-33: Full-schema evolution support (PR #3668)
RFC-35: BigQuery integration
29. Major Ongoing Works
RFC-20: Error tables (PR #3312)
RFC-08: Record level indexing (PR #3508)
RFC-15: Synchronous, Multi table Metadata writes (PR #3590)
Hudi + Dbt (dbt-labs/dbt-spark/pull/210)
PrestoDB/Trino Connectors (Early design)
30. Hudi is broadly adopted outside
More at : http://hudi.apache.org/powered-by
31. Engage With Our Community
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : [email protected] (send an empty email to subscribe)
[email protected] (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup