Cisco Connect Toronto 2015: Big Data - Sean McKeown (Cisco Canada)
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Hoodie, an open-source incremental processing framework, is summarized below. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data (see the write sketch after this list).
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
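As a rough illustration of the upsert primitive described above, here is a minimal PySpark sketch that writes a small change set into a Hudi (Hoodie) table. The table name, path, columns, and key/precombine/partition fields are assumptions for the example, not values from the summary.

```python
from pyspark.sql import SparkSession

# Assumes Spark was launched with a Hudi bundle on the classpath, e.g.
#   --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>
spark = (SparkSession.builder
         .appName("hoodie-upsert-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# A tiny batch of changed records; in practice this would come from an upstream feed.
changes = spark.createDataFrame(
    [("trip-1", "2023-01-01 10:00:00", "sf", 12.5),
     ("trip-2", "2023-01-01 10:05:00", "nyc", 7.9)],
    ["trip_id", "ts", "region", "fare"])

hudi_options = {
    "hoodie.table.name": "trips",                           # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",          # the upsert primitive
}

# Upsert: rows with existing keys are updated in place, new keys are inserted.
(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/trips"))
```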
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa... (StreamNative)
Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-end use case for change data capture from a relational database, starting with capture changes using the Pulsar CDC connector and then demonstrate how you can use the Hudi deltastreamer tool to then apply these changes into a table on the data lake. We will discuss various tips to operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects including a native Hudi/Pulsar connector and Hudi tiered storage.
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic... (Chicago Hadoop Users Group)
John Leach, Co-Founder and CTO of Splice Machine, with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation, and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes, or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that were traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs' OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement for traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
Splice Machine is a SQL relational database management system built on Hadoop. It aims to provide the scalability, flexibility and cost-effectiveness of Hadoop with the transactional consistency, SQL support and real-time capabilities of a traditional RDBMS. Key features include ANSI SQL support, horizontal scaling on commodity hardware, distributed transactions using multi-version concurrency control, and massively parallel query processing by pushing computations down to individual HBase regions. It combines Apache Derby for SQL parsing and processing with HBase/HDFS for storage and distribution. This allows it to elastically scale out while supporting rich SQL, transactions, analytics and real-time updates on large datasets.
Unified Batch & Stream Processing with Apache Samza (DataWorks Summit)
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We cover the need for CDC and the benefits of building a CDC pipeline. We compare various CDC streaming and reconciliation frameworks, and cover the architecture and the challenges we faced while running this system in production. Finally, we conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
This document discusses SQL engines for Hadoop, including Hive, Presto, and Impala. Hive is best for batch jobs due to its stability. Presto provides interactive queries across data sources and is easier to manage than Hive with Tez. Presto's distributed architecture allows queries to run in parallel across nodes. It supports pluggable connectors to access different data stores and has language bindings for multiple clients.
Spark AI Summit, Oct 17 2019 - Kim Hammar & Jim Dowling, v6 (Kim Hammar)
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world's first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and data scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges both in building the feature engineering pipelines that feed our Feature Store and in managing the feature data itself. Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model. We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ... (Chester Chen)
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern to organize big data and democratize access to the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and file sizes of the resulting data lake using purely open-source file formats, while also providing optimized query performance and file-system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is a Technical Lead on the Uber Data Infrastructure Team.
This document discusses the new features of Apache Hive 2.0, including:
1) The addition of procedural SQL capabilities through HPL/SQL to add features like cursors and loops.
2) Performance improvements for interactive queries through LLAP which uses in-memory caching and persistent daemons.
3) Using HBase as the metastore to speed up query planning by reducing metadata access times.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized joins.
5) Improvements to the cost-based optimizer including better statistics collection.
GNW03: Stream Processing with Apache Kafka by Gwen Shapira (Gluent)
Gwen Shapira of Confluent presented episode #03 of Gluent New World series and talked about stream processing in modern enterprises using Apache Kafka.
The video recording for this presentation is at: http://vimeo.com/gluent/
Hadoop 2 introduces the YARN framework to provide a common platform for multiple data processing paradigms beyond just MapReduce. YARN splits cluster resource management from application execution, allowing different applications like MapReduce, Spark, Storm and others to run on the same Hadoop cluster. HDFS 2 improves HDFS with features like high availability, federation and snapshots. Apache Tez provides a new data processing engine that enables pipelining of jobs to improve performance over traditional MapReduce.
Designing and Building Next Generation Data Pipelines at Scale with Structure... (Databricks)
This document discusses the evolution of data pipelines at Databricks over time from 2014 to present day. Early pipelines involved copying data from S3 hourly, which did not scale. Later pipelines used Amazon Kinesis but led to performance issues with many small files. The document then introduces structured streaming and Delta Lake as better solutions. Structured streaming provides correctness while Delta Lake improves performance, scalability, and makes data management and GDPR compliance easier through features like ACID transactions, automatic schema management, and built-in deletion/update support.
Ingesting data at scale into Elasticsearch with Apache Pulsar (Timothy Spann)
Ingesting data at scale into Elasticsearch with Apache Pulsar - FLiP stack (Flink, Pulsar, Spark, NiFi, Elasticsearch, MQTT, JSON); topics: data ingest, ETL, ELT, SQL.
Timothy Spann, Developer Advocate at StreamNative. Elasticsearch Community Conference, 2/11/2022.
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark (Evan Chan)
You want to ingest event, time-series, and streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk on combining Apache Cassandra and Apache Spark using a new open-source database, FiloDB.
Hoodie: How (And Why) We Built an Analytical Datastore on Spark (Vinoth Chandar)
This talk explores a specific problem of ingesting petabytes of data at Uber and why the team ended up building an analytical datastore from scratch using Spark, then discusses the design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational (OLTP) database and replays these changes in a timely manner to external storage, such as Delta or Kudu, for real-time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code.
The document discusses Rocana Search, a system built by Rocana to enable large scale real-time collection, processing, and analysis of event data. It aims to provide higher indexing throughput and better horizontal scaling than general purpose search systems like Solr. Key features include fully parallelized ingest and query, dynamic partitioning of data, and assigning partitions to nodes to maximize parallelism and locality. Initial benchmarks show Rocana Search can index over 3 times as many events per second as Solr.
Application architectures with Hadoop – Big Data TechCon 2014 (hadooparchbook)
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez (DataWorks Summit)
Last year at Yahoo, we put great effort into scaling, stabilizing, and making Pig on Tez production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After successful migration and the improved performance we shifted our focus to addressing some of the bottlenecks we identified and new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen like custom YARN ShuffleHandler, reworking DAG scheduling order, serialization changes, etc.
We will also cover exciting new features that were added to Pig for performance such as bloom join and byte code generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs. It vastly improved performance and reduced disk and network utilization for our large joins. Byte code generation for projection and filtering of records is another big feature that we are targeting for Pig 0.17 which will speed up processing by reducing the virtual function calls.
This document summarizes a presentation on using Apache Calcite for cost-based query optimization in Apache Phoenix. Key points include:
- Phoenix is adding Calcite's query planning capabilities to improve performance and SQL compliance over its existing query optimizer.
- Calcite models queries as relational algebra expressions and uses rules, statistics, and a cost model to choose the most efficient execution plan.
- Examples show how Calcite rules like filter pushdown and exploiting sortedness can generate better plans than Phoenix's existing optimizer.
- Materialized views and interoperability with other Calcite data sources like Apache Drill are areas for future improvement beyond the initial Phoenix integration.
I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, query-processing logic, and also competitive information.
Why is My Stream Processing Job Slow? with Xavier Leaute (Databricks)
The goal of most streams processing jobs is to process data and deliver insights to the business – fast. Unfortunately, sometimes our streams processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn’t fast enough? In this talk, we’ll dive into the challenges of analyzing performance of real-time stream processing applications. We’ll share troubleshooting suggestions and some of our favorite tools. So next time someone asks “why is this taking so long?”, you’ll know what to do.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O'Reilly Webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... (Sumeet Singh)
This document discusses lessons learned from building a scalable, self-serve, real-time, multi-tenant monitoring service at Yahoo. It describes transitioning from a classical architecture to one based on real-time big data technologies like Storm and Kafka. Key lessons include properly handling producer-consumer problems at scale, challenges of debugging skewed data, strategically managing multi-tenancy and resources, issues optimizing asynchronous systems, and not neglecting assumptions outside the application.
Flexible and Real-Time Stream Processing with Apache Flink (DataWorks Summit)
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
Alluxio Day VI
October 12, 2021
https://www.alluxio.io/alluxio-day/
Speaker:
Vinoth Chandar, Apache Software Foundation
Raymond Xu, Zendesk
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... (HostedbyConfluent)
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency-control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near-real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management such as cleaning older file versions, compaction of delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
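To make the table-services discussion concrete, below is a hedged sketch of Hudi write options that enable some of the inline services mentioned (cleaning, compaction, clustering). The option keys reflect recent Hudi releases, and the values are illustrative only; verify them against the version you run.

```python
# Illustrative Hudi write options that turn on inline table services; exact keys
# and sensible values vary by Hudi version, so treat these as assumptions to verify.
table_service_options = {
    "hoodie.clean.automatic": "true",               # clean up older file versions
    "hoodie.cleaner.commits.retained": "10",
    "hoodie.compact.inline": "true",                # fold delta logs into base files (MOR tables)
    "hoodie.compact.inline.max.delta.commits": "5",
    "hoodie.clustering.inline": "true",             # periodically re-cluster data for faster queries
    "hoodie.clustering.inline.max.commits": "4",
}

# These would be passed alongside the usual write options, e.g.
# df.write.format("hudi").options(**table_service_options)...
```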
Building a high-performance data lake analytics engine at Alibaba Cloud with ... (Alluxio, Inc.)
This document discusses optimizations made to Alibaba Cloud's Data Lake Analytics (DLA) engine, which uses Presto, to improve performance when querying data stored in Object Storage Service (OSS). The optimizations included decreasing OSS API request counts, implementing an Alluxio data cache using local disks on Presto workers, and improving disk throughput by utilizing multiple ultra disks. These changes increased cache hit ratios and query performance for workloads involving large scans of data stored in OSS. Future plans include supporting an Alluxio cluster shared by multiple users and additional caching techniques.
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac... (Cloudera, Inc.)
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, lookup dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... (HostedbyConfluent)
Apache Hudi is a data lake platform, that provides streaming primitives (upserts/deletes/change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood and other companies, while being pre-installed on four major cloud platforms.
Hudi supports exactly-once, near real-time data ingestion from Apache Kafka to cloud storage, and is typically used in place of an S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini batches, leading to lower data freshness. In this talk, we introduce a Kafka Connect Sink Connector for Apache Hudi, which writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services like indexing, compaction, and clustering work behind the scenes to further re-organize for better query performance.
Eric Baldeschwieler Keynote from Storage Developers Conference (Hortonworks)
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable storage of petabytes of data and large-scale computations across commodity hardware.
- Apache Hadoop is used widely by internet companies to analyze web server logs, power search engines, and gain insights from large amounts of social and user data. It is also used for machine learning, data mining, and processing audio, video, and text data.
- The future of Apache Hadoop includes making it more accessible and easy to use for enterprises, addressing gaps like high availability and management, and enabling partners and the community to build on it through open APIs and a modular architecture.
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ... (StreamNative)
Lakehouses are quickly growing in popularity as a new approach to Data Platform Architecture bringing some of the long-established benefits from OLTP world to OLAP, including transactions, record-level updates/deletes, and changes streaming. In this talk, we will discuss Apache Hudi and how it unlocks possibilities of building your own fully open-source Lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will present: - What Lakehouses are, and why they are needed. - What Apache Hudi is and how it works. - Provide a use-case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar.
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
Big data architectures and the data lake (James Serra)
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
The other Apache Technologies your Big Data solution needs (gagravarr)
The document discusses many Apache projects relevant to big data solutions, including projects for loading and querying data like Pig and Gora, building MapReduce jobs like Avro and Thrift, cloud computing with LibCloud and DeltaCloud, and extracting information from unstructured data with Tika, UIMA, OpenNLP, and cTakes. It also mentions utility projects like Chemistry, JMeter, Commons, and ManifoldCF.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop: the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Optimizing Big Data to run in the Public Cloud (Qubole)
Qubole is a cloud-based platform that allows customers to easily run Hadoop and Spark clusters on AWS for big data analytics. It optimizes performance and reduces costs through techniques like caching data in S3 for faster access, using spot instances, and directly writing query outputs to S3. The document discusses Qubole's features, capabilities, and how it provides an easier way for more users like data scientists and analysts to access and query big data compared to building and managing Hadoop clusters themselves.
SF Big Analytics meetup: Hoodie From Uber (Chester Chen)
Even after a decade, the name "Hadoop" remains synonymous with "big data", even as new options for processing/querying (stream processing, in-memory analytics, interactive SQL) and storage services (S3/Google Cloud/Azure) have emerged and unlocked new possibilities. However, the overall data architecture has become more complex, with more moving parts and specialized systems, leading to duplication of data and strain on usability. In this talk, we argue that by adding some missing blocks to the existing Hadoop stack, we are able to provide similar capabilities right on top of Hadoop, at reduced cost and increased efficiency, greatly simplifying the overall architecture in the process. We will discuss the need for incremental processing primitives on Hadoop, motivating them with some real-world problems from Uber. We will then introduce "Hoodie", an open-source Spark library built at Uber, to enable faster data for petabyte-scale data analytics and solve these problems. We will deep dive into the design and implementation of the system and discuss the core concepts around timeline consistency and the tradeoffs between ingest speed and query performance. We contrast Hoodie with similar systems in the space, discuss how it is deployed across the Hadoop ecosystem at Uber, and finally share the technical direction ahead for the project.
Speaker: VINOTH CHANDAR, Staff Software Engineer at Uber
Vinoth is the founding engineer/architect of the data team at Uber, as well as the author of many data processing and querying systems at Uber, including "Hoodie". He has a keen interest in unified architectures for data analytics and processing.
Previously, Vinoth was the lead on LinkedIn's Voldemort key-value store and has also worked on the Oracle Database replication engine, HPC, and stream processing.
The document discusses big data challenges and solutions. It describes how specialized systems like Hadoop are more efficient than relational databases for large-scale data. It provides examples of open source projects that can be used for tasks like storage, search, streaming data, and batch processing. The document also summarizes the design of the Voldemort distributed key-value store and how it was inspired by Dynamo and Memcached.
This document discusses connecting Oracle Analytics Cloud (OAC) Essbase data to Microsoft Power BI. It provides an overview of Power BI and OAC, describes various methods for connecting the two including using a REST API and exporting data to Excel or CSV files, and demonstrates some visualization capabilities in Power BI including trends over time. Key lessons learned are that data can be accessed across tools through various connections, analytics concepts are often similar between tools, and while partnerships exist between Microsoft and Oracle, integration between specific products like Power BI and OAC is still limited.
This document proposes a scalable application insights framework using Hadoop, HBase, Hive and other technologies to aggregate and visualize data from large production application data sets. It describes using Hadoop for raw data storage and processing, Oozie for workflow control, HBase for storing processed data, and Hive or Splunk for data visualization. The framework aims to provide metrics, track mail delivery, consolidate scattered data and provide informative visualizations while being scalable, reliable and ensuring data quality.
Data Analytics Meetup: Introduction to Azure Data Lake Storage (CCG)
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen1 with Microsoft Data Platform Specialist Audrey Hammonds. In this video she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog, Data Lakes: Data Agility is Here Now: https://bit.ly/2NUX1H6
Hadoop Strata Talk - Uber, your Hadoop has arrived (Vinoth Chandar)
The document discusses Uber's use of Hadoop to store and analyze large amounts of data. Some key points:
1) Uber was facing challenges with data reliability, system scalability, fragile data ingestion, and lack of multi-DC support with its previous data systems.
2) Uber implemented a Hadoop data lake to address these issues. The Hadoop ecosystem at Uber includes tools for data ingestion (Streamific, Komondor), storage (HDFS, Hive), processing (Spark, Presto) and serving data to applications and data marts.
3) Uber continues to work on challenges like enabling low-latency interactive SQL, implementing an all-active architecture for high availability, and reducing
Voldemort is a distributed key-value store developed at LinkedIn that provides high availability and partition tolerance. The document discusses improvements made to Voldemort's storage layer and cluster expansion capabilities. The storage layer was rewritten to improve performance and flexibility by moving data off the Java heap, enabling partition-aware storage, and allowing dynamic cache partitioning. The cluster expansion process was redesigned using a "zone n-ary" model to eliminate costly cross-datacenter data moves and enable safe aborts of rebalancing operations. The changes improved Voldemort's scalability, flexibility, and operational characteristics in LinkedIn's production environment.
Voldemort is a distributed key-value storage system used at LinkedIn that was migrated to use solid state drives (SSDs) to improve performance. This led to issues with Java garbage collection causing increased latency. The authors address these issues by modifying the caching strategy to cache only index nodes and evict data objects quickly to reduce fragmentation. They also reduce the cost of garbage collection by controlling promotion rates and the scheduling of garbage collection cycles. Migrating to SSDs requires rethinking caching strategies and garbage collection to fully realize the performance benefits.
Composing and Executing Parallel Data Flow Graphs with Shell Pipes (Vinoth Chandar)
The document discusses extending shell languages to enable parallel computing using shell pipes. It proposes adding features like pipeline forking, joining, and cycling to express parallel dataflow graphs. It describes implementing these features in BASH by using a startup overlay to launch parallel shell workers and having them execute command files to perform parallel tasks. Experimental results showed this approach can effectively express and execute parallel programs using shell pipes.
Triple-Triple RDF Store with Greedy Graph-based Grouping (Vinoth Chandar)
This document proposes improvements to the query performance of triple stores that store RDF data in a relational database. It explores storing RDF triples in different orders in three tables and developing a query rewriting scheme. It also looks at optimizing the physical schema through graph clustering techniques that aim to reduce the number of disk accesses by grouping related triples closer together. The paper presents an implementation of these techniques over a million triples and shows they can yield significant performance benefits on complex queries.
The document proposes a distributed database system for delay tolerant networks (DTNs). It discusses two key contributions: 1) practical heuristics for query scheduling that optimize for lower delays by moving relevant data to sites with the largest amounts of data or highest likelihood of contact, and 2) a prereplication scheme that reduces latency for on-demand retrieval by actively pre-caching popular data between nodes based on bandwidth availability. The paper evaluates the proposed system through simulations on artificial and real-world DTN traces.
This document outlines a proposal for video streaming over Bluetooth using a peer-to-peer approach. It discusses the goals, challenges, proposed solution, and implementation of Bluetube. The key aspects are using a content distribution tree to distribute video chunks amongst clients, with early clients serving late clients. An evaluation of the theoretical service acceptance ratio, jitter, and workload distribution is provided. The document concludes with details about the Bluetube proof-of-concept implementation developed in Java for emulation.
2. Speaker Bio
PMC Chair / Creator of Apache Hudi
Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
Staff Eng @ LinkedIn (Voldemort, DDS)
Sr. Eng @ Oracle (CDC/GoldenGate/XStream)
5. Change Data Capture
Design Pattern for Data Integration
- Not tied to any particular technology
- Delivers low latency
System for tracking, fetching new data
- Not concerned with how to use such data
- Ideally, incremental updates downstream
- Minimizes the number of bits read/written per change
Change is the ONLY Constant
- Even in computer science
- Data is immutable = Myth (well, kinda)
6. Polling an external API
- Timestamps, status indicators, versions
- Simple, works for small-scale data changes
- E.g: Polling the GitHub events API (see the sketch below)
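A minimal sketch of the polling pattern in Scala, using only the JDK 11 HTTP client. The endpoint and its updated_since query parameter are assumptions for illustration, not part of the GitHub API; the point is the timestamp watermark that makes each poll incremental.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import java.time.Instant

    object PollingCdcSketch {
      private val client = HttpClient.newHttpClient()

      // Fetch everything changed since `since` and return the body plus the new watermark.
      def pollOnce(baseUrl: String, since: Instant): (String, Instant) = {
        val request = HttpRequest.newBuilder()
          .uri(URI.create(s"$baseUrl?updated_since=$since"))   // assumed query parameter
          .GET()
          .build()
        val body = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
        // A real poller would parse `body` and use max(updated_at) as the next watermark.
        (body, Instant.now())
      }
    }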
7. Emit Event From App
- Data model to encode deltas
- Scales for high-volume data changes
- E.g: Emitting sensor state changes to Pulsar (see the sketch below)
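A minimal sketch of this approach in Scala, using the Pulsar Java client. The broker URL, topic name, and JSON delta format are assumptions; the key idea is that the application emits only the change, not the full state.

    import org.apache.pulsar.client.api.{PulsarClient, Schema}

    object SensorChangeEmitter {
      def main(args: Array[String]): Unit = {
        val client = PulsarClient.builder()
          .serviceUrl("pulsar://localhost:6650")   // assumed broker address
          .build()
        val producer = client.newProducer(Schema.STRING)
          .topic("sensor-changes")                 // assumed topic name
          .create()

        // Encode just the delta; the schema/format is up to the application.
        producer.send("""{"sensorId":"s-42","field":"temperature","old":21.5,"new":23.1}""")

        producer.close()
        client.close()
      }
    }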
8. Scan Database Redo Log
- SCN and other watermarks to extract data/metadata changes
- Operationally heavy, very high fidelity
- E.g: Using Debezium to obtain changelogs from MySQL
9. CDC vs Extract-Transform-Load?
CDC is merely Incremental Extraction
- Not really competing concepts
- ETL needs one-time full bootstrap
CDC changes T and L significantly
- T on change streams, not just table state
- L incrementally, not just bulk reloads
Incremental L = Apply
10. CDC vs Stream Processing
CDC enables Streaming ETL
- Why bulk T & L anymore?
- Process change streams
- Mutable Sinks
Reliable Stream Processing needs distributed logs
- Rewind/Replay CDC logs
- Absorb spikes/batch writes to sinks
11. Ideal CDC Source
Support reliable incremental consumption
- Offsets are stable, deterministic
- Efficient fetching of new changes
Support rewinding/replay (see the sketch below)
- Database redo logs are typically purged frequently
- Event logs offer tiering/large retention
Support ordering of changes
- Out-of-order apply => incorrect results
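As a sketch of the rewind/replay property, Pulsar's Reader API can start from the earliest retained message and replay the change stream deterministically (broker URL and topic name below are assumptions):

    import org.apache.pulsar.client.api.{MessageId, PulsarClient, Schema}

    object ReplayChangeStream {
      def main(args: Array[String]): Unit = {
        val client = PulsarClient.builder()
          .serviceUrl("pulsar://localhost:6650")   // assumed broker address
          .build()
        val reader = client.newReader(Schema.STRING)
          .topic("mysql.orders.changelog")          // assumed CDC topic
          .startMessageId(MessageId.earliest)       // rewind to the start of retention
          .create()

        // MessageIds are stable offsets, so this replay is deterministic and repeatable.
        while (reader.hasMessageAvailable) {
          println(reader.readNext().getValue)
        }
        reader.close()
        client.close()
      }
    }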
12. Ideal CDC Sink
Mutable, Transactional
- Reliably apply changes
Quickly absorb changes
- Sinks are often bottlenecks
- Random I/O
Bonus: Also act as CDC Source
- Keep the stream flowing
14. What is a Data Lake?
Architectural Pattern for Analytical Data
- Data Lake != Spark, Flink
- Data Lake != Files on S3
- Raw data (OLTP schema)
- Derived Data (OLAP/BI, ML schema)
Vast Storage
- Object storage vs dedicated storage nodes
- Open formats (data + metadata)
Scalable Compute
- Many query engine choices
Source: https://martinfowler.com/bliki/images/dataLake/context.png
16. Challenges
Data Lakes are often file dumps
- Reliably change subset of files
- Transactional, Concurrency Control
Getting “ALL” data quickly
- Apply Updates quickly
- Scalable Deletes, to ensure compliance
Lakes think in big batches
- Difficult to align batch intervals, to join
- Large skews for “event_time” streaming joins
17. Apache Hudi
Transactional Writes, MVCC/OCC
- Fully managed file/object storage
- Automatic compaction, clustering, sizing
First class support for Updates, Deletes (see the sketch below)
- Record level updates/deletes inspired by stream processors
CDC Streams From Lake Storage
- Storage layout optimized for incremental fetches
- Hudi's unique contribution in the space
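A minimal sketch of a record-level upsert through Hudi's Spark datasource (table name, key/precombine columns, and paths are assumptions; the option keys are the standard Hudi write configs):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()
    val changes = spark.read.json("/tmp/incoming_changes")   // assumed input of changed records

    changes.write.format("hudi")
      .option("hoodie.table.name", "trips")                          // assumed table name
      .option("hoodie.datasource.write.recordkey.field", "uuid")     // record key column
      .option("hoodie.datasource.write.precombine.field", "ts")      // latest ts wins on key collisions
      .option("hoodie.datasource.write.partitionpath.field", "partition")
      .option("hoodie.datasource.write.operation", "upsert")         // or "delete" for deletions
      .mode(SaveMode.Append)
      .save("/tmp/hudi/trips")                                       // assumed base path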
23. Hudi Data Lake
Original pioneer of the transactional data lake movement
Embeddable, serverless, distributed database abstraction layer over DFS
- We invented this!
Hadoop Upserts, Deletes & Incrementals
- Provides transactional updates/deletes
- First class support for record level CDC streams
25. What If: Streaming Model on Batch Data?
The Incremental Stack
+ Intelligent, Incremental
+ Fast, Efficient
+ Scans, Columnar formats
+ Scalable Compute
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ (2016)
26. Hudi : Open Sourcing & Evolution
2015 : Published core ideas/principles for incremental processing (O'Reilly article)
2016 : Project created at Uber; powers all database/business-critical feeds @ Uber
2017 : Project open sourced by Uber; work began on Merge-On-Read and cloud support
2018 : Picked up adopters, hardening, async compaction
2019 : Incubated into the ASF, community growth, added more platform components
2020 : Top-level Apache project; over 10x growth in community, downloads, adoption
2021 : SQL DMLs, Flink Continuous Queries, more indexing schemes, Metaserver, Caching
30. Delta Logs at File Level over Global
Each file group is its own self-contained log
- Constant metadata size, controlled by "retention" parameters
- Leverage append() when available; lower metadata overhead
Merges are local to each file group
- UUID keys throw off any range pruning
31. Record Indexes over Just File/Column Stats
Index maps a key to a file group
- During upserts/deletes
- Much like a streaming state store
Workloads have different shapes
- Late-arriving updates; totally random
- Trickle down to derived tables
Many pluggable options (see the sketch below)
- Bloom filters + key ranges
- HBase, join based
- Global vs local
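Continuing the earlier upsert sketch, the index implementation is selected through a write config; the values below are the standard ones, though the right choice depends on the workload shape described above:

    // "BLOOM" scopes key lookups to a partition, "GLOBAL_BLOOM" enforces uniqueness across
    // partitions, and "HBASE" keeps the key-to-file-group mapping in an external HBase table.
    changes.write.format("hudi")
      .option("hoodie.index.type", "BLOOM")   // or GLOBAL_BLOOM, SIMPLE, HBASE
      // ... same table/key options as in the earlier upsert sketch ...
      .mode(SaveMode.Append)
      .save("/tmp/hudi/trips")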
32. MVCC Concurrency Control over Only OCC
Frequent commits => more frequent clustering/compaction => more contention
Differentiate writers vs table services (see the sketch below)
- Much like what databases do
- Table services don't contend with writers
- Async compaction/clustering
Don't be so "Optimistic"
- OCC between writers works, until it doesn't
- Retries, split txns, wasted resources
- MVCC/log based between writers/table services
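A hedged sketch of the two setups above as Hudi write configs (option keys and the lock provider class are from recent Hudi releases and worth double-checking against the version in use). A single writer can keep compaction asynchronous and lock-free, while multiple writers enable OCC with a lock provider:

    // Single writer on a MERGE_ON_READ table: keep compaction off the write path.
    val singleWriterAsync = Map(
      "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
      "hoodie.compact.inline"              -> "false"
    )

    // Multiple writers: optimistic concurrency control plus a lock provider.
    val multiWriterOcc = Map(
      "hoodie.write.concurrency.mode"       -> "optimistic_concurrency_control",
      "hoodie.cleaner.policy.failed.writes" -> "LAZY",
      "hoodie.write.lock.provider"          ->
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"
    )

    // Either map is passed to the writer via .options(...), alongside the usual table/key configs.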
33. Record Level Merge API over Only Overwrites
More generalized approach (see the sketch below)
- Default: overwrite, i.e. latest writer wins
- Support business-specific resolution
Log partial updates
- Log just the changed columns
- Drastic reduction in write amplification
Log based reconciliation
- Delete, undelete based on business logic
- CRDT / Operational Transform style delayed conflict resolution
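A short sketch of how the merge behavior is chosen per write: the payload class below ships with Hudi and implements "latest writer wins"; a custom HoodieRecordPayload implementation is where business-specific resolution, partial updates, or delete/undelete rules would go.

    changes.write.format("hudi")
      .option("hoodie.datasource.write.payload.class",
              "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")
      // ... same table/key options as in the earlier upsert sketch ...
      .mode(SaveMode.Append)
      .save("/tmp/hudi/trips")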
34. Specialized Database over Generalized Format
Approach it more like a shared-nothing database
- Daemons aware of each other
- E.g: compaction and cleaning in RocksDB
E.g: clustering & compaction know each other
- Reconcile metadata based on time order
- Compactions avoid redundant scheduling
Self Managing
- Sorting, time-order preservation, file sizing
35. Record Level CDC over File/Snapshot Diffing
Per-record metadata (see the sketch below)
- _hoodie_commit_time : Kafka-style compacted change streams in commit order
- _hoodie_commit_seqno : consume large commits in chunks, a la Kafka offsets
File group design => CDC friendly
- Efficient retrieval of old, new values
- Efficient retrieval of all values for a key
Infinite retention/lookback coming later in 2021
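A minimal sketch of an incremental pull against the table from the earlier upsert sketch: only records committed after the given instant are returned, in commit order, using the per-record metadata columns above (the begin instant is an assumed checkpoint value):

    val incremental = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20210620000000")  // assumed checkpoint
      .load("/tmp/hudi/trips")

    incremental.select("_hoodie_commit_time", "_hoodie_commit_seqno", "uuid").show(false)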
37. Pulsar Source In Hudi
PR#3096 is up and WIP! (Contributions welcome)
- Supports Avro/KEY_VALUE, partitioned/non-partitioned topics, checkpointing
- All Hudi operations, bells-and-whistles
- Consolidate with Kafka-Debezium Source work
Hudi facing work
- Adding transformers, record payload for parsing Debezium logs
- Hardening, functional/scale testing
Pulsar facing work
- Better Spark batch query datasource support in Apache Pulsar
- Streamnative/pulsar-spark : upgrade to Spark 3 (we know it's painful) / Scala 2.12, support for KV records
38. Hudi Sink from Pulsar
Push to the lake in real-time
- Today’s model is “pull based”, micro-batch
- Transactional, Concurrency Control
Hudi facing work
- Harden hudi-java-client
- ~1 min commit frequency, while retaining a month of history
Pulsar facing work
- How to write exactly once across tasks?
- How to perform indexing etc. efficiently without shuffling data around
39. Pulsar Tiered Storage
Columnar reads off Hudi
- Leverage Hudi’s metadata to track ordering/changes
- Push projections/filters to Hudi
- Faster backfills!
Work/Challenges
- Most open-ended of the lot
- Pluggable Tiered storage API in Pulsar (exists?)
- Mapping offsets to _hoodie_commit_seqno
- Leverage Hudi’s compaction and other machinery
40. Engage With Our Community
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : [email protected] (send an empty email to subscribe)
[email protected] (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup