Impala Architecture Presentation at the Toronto Hadoop User Group in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
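To make the multi-source, multi-format point concrete, here is a minimal PySpark sketch (file paths, column names, and table names are hypothetical) that reads a Parquet dataset and a JSON dataset and combines them with a single Spark SQL query:

```python
from pyspark.sql import SparkSession

# Start a local session; in a real deployment this would point at a cluster.
spark = SparkSession.builder.appName("spark-sql-multi-source").getOrCreate()

# Hypothetical inputs: one Parquet dataset and one JSON dataset.
orders = spark.read.parquet("/data/orders.parquet")
customers = spark.read.json("/data/customers.json")

# Register both as temporary views so they can be queried together with plain SQL.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# One Spark SQL query spanning the two sources.
result = spark.sql("""
    SELECT c.country, COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
""")
result.show()
```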
This is a presentation deck for Data+AI Summit 2021 at
https://databricks.com/session_na21/enabling-vectorized-engine-in-apache-spark
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
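As a rough illustration of how these optimizations are switched on from a client, the sketch below uses the PyHive library against a hypothetical HiveServer2 endpoint; the property names are standard Hive settings, and the table name is an assumption, so verify the values against your Hive version:

```python
from pyhive import hive

# Hypothetical HiveServer2 endpoint and user.
conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Run Hive on Tez instead of MapReduce and enable the optimizations discussed above.
cursor.execute("SET hive.execution.engine=tez")
cursor.execute("SET hive.vectorized.execution.enabled=true")  # vectorized query processing
cursor.execute("SET hive.cbo.enable=true")                    # cost-based optimizer
cursor.execute("SET hive.compute.query.using.stats=true")     # answer simple queries from stats

# Collect column-level statistics so the optimizer has something to work with.
cursor.execute("ANALYZE TABLE web_logs COMPUTE STATISTICS FOR COLUMNS")
```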
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
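A minimal PySpark sketch of paying the one-time bucketing cost at write time (the input path, table name, bucket count, and column are hypothetical); note that bucketBy is used with saveAsTable rather than a plain path-based write:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-example").enableHiveSupport().getOrCreate()

events = spark.read.parquet("/data/events.parquet")  # hypothetical input

# Bucket and sort by the join/filter key while writing the table out.
(events.write
    .bucketBy(64, "user_id")   # 64 buckets hashed on user_id
    .sortBy("user_id")         # pre-sort within each bucket
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))

# Later joins on user_id between two tables bucketed the same way can avoid the shuffle
# and sort, giving the single-stage sort-merge join described above.
```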
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
This document discusses Apache Tez, a framework for accelerating Hadoop query processing. Some key points:
- Tez is a dataflow framework that expresses computations as directed acyclic graphs (DAGs) of tasks, allowing for optimizations like container reuse and locality-aware scheduling.
- It is built on YARN and provides a customizable execution engine as well as APIs for applications like Hive and Pig.
- By expressing jobs as DAGs, Tez can reduce overheads and queueing delays and better utilize cluster resources compared to the traditional MapReduce framework.
- The document provides examples of how Tez can improve performance for operations like joins, aggregations, and handling of multiple outputs.
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast and efficient data pipelines, ETL processes, and database systems.
This presentation explains where it fits into the data ecosystem and how it helps you implement your system in Rust.
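For a feel of the embeddable-engine idea, here is a small sketch using the datafusion Python bindings; the CSV path and table name are hypothetical, and the exact binding API may differ between releases, so treat this as an assumption to check against the version you install:

```python
from datafusion import SessionContext

# Create an in-process query engine session (no external server involved).
ctx = SessionContext()

# Register a hypothetical CSV file as a queryable table.
ctx.register_csv("trips", "/data/trips.csv")

# Run SQL entirely inside the embedded engine and collect Arrow record batches.
df = ctx.sql("SELECT vendor_id, COUNT(*) AS n FROM trips GROUP BY vendor_id")
print(df.collect())
```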
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save storage costs on cloud storage like S3 and to improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark’s Avro file format in Spark 3.2.0.
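A hedged sketch of switching these codecs on in PySpark; the settings below are standard Spark configuration keys, but the Zstandard support for ORC/Parquet data files assumes Spark 3.2+ with the corresponding ORC/Parquet versions, and the output path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("zstd-everywhere")
    # 1) Compress shuffle and other internal IO streams with Zstandard.
    .config("spark.io.compression.codec", "zstd")
    # 2) Compress event logs with Zstandard instead of LZ4.
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd")
    # 3) Use Zstandard for ORC and Parquet data files.
    .config("spark.sql.orc.compression.codec", "zstd")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate())

# Hypothetical write: the resulting Parquet files are Zstandard-compressed.
spark.range(1_000_000).write.mode("overwrite").parquet("/data/zstd_demo")
```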
Iceberg provides capabilities beyond traditional partitioning of data in Spark/Hive. It allows updating or deleting individual rows without rewriting partitions through row-level operations such as merge-on-read (MOR). It also supports ACID transactions through versions, faster queries through statistics and sorting, and flexible schema changes. Iceberg manages metadata that traditional formats like Parquet do not, enabling these new capabilities. It is useful for workloads that require updating or filtering data at a granular record level, managing data history through versions, or frequent schema changes.
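To illustrate the row-level operations, here is a sketch using Spark SQL against a hypothetical Iceberg catalog and table; the catalog name, namespace, table, and the "updates" view are all assumptions, and the session is assumed to be configured with the Iceberg runtime and SQL extensions:

```python
from pyspark.sql import SparkSession

# Assumes a catalog named "demo" has been configured on the session
# (spark.sql.catalog.demo, warehouse location, Iceberg runtime jar, etc.).
spark = SparkSession.builder.appName("iceberg-row-level").getOrCreate()

# Delete individual rows without rewriting whole partitions.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 42")

# Upsert change records with MERGE; "updates" is assumed to be a temp view
# of incoming changes registered earlier in the job.
spark.sql("""
    MERGE INTO demo.db.events t
    USING updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```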
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
In this presentation, we will introduce Hotspot's Garbage First collector (G1GC) as the most suitable collector for latency-sensitive applications running in large-memory environments. We will first discuss G1GC internal operations and tuning opportunities, and also cover tuning flags that set desired GC pause targets, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several HBase case studies using Java heaps as large as 100GB that show how to best tune applications to remove unpredictable, protracted GC pauses.
Apache Hive 3 introduces new capabilities for data analytics including materialized views, default columns, constraints, and improved JDBC and Kafka connectors to enable real-time streaming and integration with external systems like Druid; Hive 3 also improves performance and query optimization through a new query result cache, workload management, and cloud storage optimizations. Data Analytics Studio provides self-service analytics on top of Hive 3 through a visual interface to optimize queries, monitor performance, and manage data lifecycles.
This document discusses optimizing Spark write-heavy workloads to S3 object storage. It describes problems with eventual consistency, renames, and failures when writing to S3. It then presents several solutions implemented at Qubole to improve the performance of Spark writes to Hive tables, including writing directly to the Hive warehouse location. These optimizations include parallelizing renames, writing directly to the warehouse, and making partition recovery faster by using more efficient S3 listing. Performance improvements of up to 7x were achieved.
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.
This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping from Kubernetes containers to physical nodes and to HDFS datanode daemons. You’ll also learn how you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff, and can be used as an introduction to Apache Spark.
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions, optimized table metadata, and powerful storage layout optimizations to data lakes, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management such as cleaning older file versions, compacting delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
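A minimal PySpark write sketch for a Hudi copy-on-write table; the input path, table name, key fields, and warehouse path are hypothetical, the option keys follow Hudi's documented datasource options, and the session is assumed to have the hudi-spark bundle and recommended serializer configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

updates = spark.read.json("/data/incoming_rides.json")  # hypothetical change records

hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",    # record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # keep the latest version per key
    "hoodie.datasource.write.operation": "upsert",           # update/insert instead of rewrite
}

# Upsert into the Hudi table rather than re-computing the whole output.
(updates.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/warehouse/rides"))
```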
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
Building large scale transactional data lake using apache hudiBill Liu
Data is critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency; it helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then dive deep into improving data operations through features such as data versioning and time travel.
We will also go over how Hudi brings the kappa architecture to big data systems and enables efficient incremental processing for near real-time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
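A tiny PySpark sketch of the RDD model described above; the chain of transformations (textFile, flatMap, map, reduceByKey) is the lineage that lets Spark recompute lost partitions after a failure, and the input path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage").getOrCreate()
sc = spark.sparkContext

# Each transformation records its parent in the lineage graph instead of materializing data.
lines = sc.textFile("/data/access.log")          # hypothetical input
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Caching keeps the result in memory for repeated interactive queries.
counts.cache()
print(counts.take(10))
print(counts.toDebugString())  # shows the recorded lineage
```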
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – down to the level of how Spark distributes the data within the cluster. You’ll also find out how to work around common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
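A short sketch of the two basic join paths such a talk covers, using hypothetical DataFrames: the default shuffle-based sort-merge join versus an explicit broadcast hint for a small dimension table.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

facts = spark.read.parquet("/data/clicks.parquet")    # large fact table (hypothetical)
dims = spark.read.parquet("/data/countries.parquet")  # small dimension table (hypothetical)

# Default: both sides are shuffled on the join key (sort-merge join).
shuffled = facts.join(dims, "country_code")

# Hint: ship the small table to every executor and avoid shuffling the large one.
broadcasted = facts.join(broadcast(dims), "country_code")

# explain() shows which physical join strategy Spark chose for each plan.
shuffled.explain()
broadcasted.explain()
```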
Apache Spark is a fast and general engine for large-scale data processing. It provides a unified API for batch, interactive, and streaming data processing using in-memory primitives. A benchmark showed Spark was able to sort 100TB of data 3 times faster than Hadoop using 10 times fewer machines by keeping data in memory between jobs.
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help understand HBase's interactions with HDFS for tuning IO performance.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
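A hedged sketch of the shuffle-related configuration options the document refers to; these are standard Spark settings, but defaults and availability vary by version, and the values shown are illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("shuffle-tuning")
    # Number of partitions used by Spark SQL shuffles (default 200).
    .config("spark.sql.shuffle.partitions", "400")
    # Compress map outputs written to local disk during the shuffle.
    .config("spark.shuffle.compress", "true")
    # Serve shuffle files from the external shuffle service so executors can be removed.
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())
```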
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.
Impala is a SQL query engine for Apache Hadoop that allows real-time queries on large datasets. It is designed to provide high performance for both analytical and transactional workloads by running directly on Hadoop clusters and utilizing C++ code generation and in-memory processing. Impala uses the existing Hadoop ecosystem including metadata storage in Hive and data formats like Avro, but provides faster performance through its new query execution engine compared to traditional MapReduce-based systems like Hive. Future development of Impala will focus on improved support for features like HBase, additional SQL functionality, and query optimization.
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets using existing SQL skills. Impala's architecture includes impalad daemons that process queries in parallel across nodes, a statestore for metadata coordination, and a new execution engine written in C++. It aims to provide faster performance than Hive for interactive queries while leveraging Hadoop's existing ecosystem. The first general availability release is planned for April 2013.
Presentation on Cloudera Impala at the PDX Data Science Group in Portland. Delivered on February 27, 2013. Presentation slides borrowed from Cloudera Impala's architect, Marcel Kornacker.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, performance improvements of 3-4x (and up to 90x) over MapReduce, and the flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.
Impala is a SQL query engine for Apache Hadoop that allows for interactive queries on large datasets. It uses a distributed architecture where each node runs an Impala daemon and queries are distributed across nodes. Impala aims to provide general-purpose SQL with high performance by using C++ instead of Java and avoiding MapReduce execution. It runs directly on Hadoop storage systems and supports common file formats like Parquet and Avro.
Impala is a massively parallel processing SQL query engine for Hadoop. It allows users to issue SQL queries directly to their data in Apache Hadoop. Impala uses a distributed architecture where queries are executed in parallel across nodes by Impala daemons. It uses a new execution engine written in C++ with runtime code generation for high performance. Impala also supports commonly used Hadoop file formats and can query data stored in HDFS and HBase.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
The document discusses Impala releases and roadmaps. It outlines key features released in different Impala versions, including SQL capabilities, performance improvements, and support for additional file formats and data types. It also describes Impala's performance advantages compared to other SQL-on-Hadoop systems and how its approach is expected to increasingly favor performance gains. Lastly, it encourages trying out Impala and engaging with their community.
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets by using a new execution engine written in C++ instead of Java and MapReduce. Impala can process queries in milliseconds to hours by distributing query execution across Hadoop clusters. It uses existing Hadoop file formats and metadata but is optimized for performance through techniques like runtime code generation and in-memory processing.
Impala is an open-source SQL query engine for Apache Hadoop that allows for fast, interactive queries directly against data stored in HDFS and other data storage systems. It provides low-latency queries in seconds by using a custom query engine instead of MapReduce. Impala allows users to interact with data using standard SQL and business intelligence tools while leveraging existing metadata in Hadoop. It is designed to be integrated with the Hadoop ecosystem for distributed, fault-tolerant and scalable data processing and analytics.
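For completeness, a small client-side sketch using the impyla library against a hypothetical Impala daemon; the host, port, and table are assumptions (21050 is Impala's usual HiveServer2-protocol port), and queries are planned and executed by the Impala daemons directly rather than through MapReduce:

```python
from impala.dbapi import connect

# Connect to a hypothetical impalad endpoint.
conn = connect(host="impalad.example.com", port=21050)
cursor = conn.cursor()

# Standard DB-API usage: issue SQL and fetch results.
cursor.execute("SELECT day, COUNT(*) FROM web_logs GROUP BY day ORDER BY day")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```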
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Introduction to Apache Apex and writing a big data streaming application Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016, and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (i.e., hosting, presenting, community leadership), please email [email protected]
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
View our quarterly customer education webcast to learn about the new advancements in Syncsort DMX and DMX-h data integration software and DataFunnel - our new easy-to-use browser-based database onboarding application. Learn about DMX Change Data Capture and the advantages of true streaming over micro-batch.
View this webcast on-demand where you'll hear the latest news on:
• Improvements in Syncsort DMX and DMX-h
• What’s next in the new DataFunnel interface
• Streaming data in DMX Change Data Capture
• Hadoop 3 support in Syncsort Integrate products
Architecting a next-generation data platformhadooparchbook
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
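As a sketch of the ingest leg of such an architecture, the snippet below reads a hypothetical "taxi-trips" topic from Kafka with Spark Structured Streaming and lands the raw events on HDFS; the broker addresses, topic, and paths are assumptions, and the Spark-Kafka connector package must be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("taxi-ingest").getOrCreate()

# Continuously read trip events from Kafka.
trips = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "taxi-trips")
    .load())

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
events = trips.select(col("value").cast("string").alias("json"))

# Land raw events on HDFS; the checkpoint directory lets the query resume after failures.
query = (events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/taxi_trips")
    .option("checkpointLocation", "hdfs:///checkpoints/taxi_trips")
    .trigger(processingTime="1 minute")
    .start())

query.awaitTermination()
```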
Top 5 mistakes when writing Streaming applicationshadooparchbook
This document discusses 5 common mistakes when writing streaming applications and provides solutions. It covers: 1) Not shutting down apps gracefully; the fix is to use thread hooks or external markers to stop processing after batches finish. 2) Assuming exactly-once semantics when things can fail at multiple points, which requires tracking offsets and using idempotent operations. 3) Using streaming for everything when batch processing is better for some goals. 4) Not preventing data loss; enable checkpointing and write-ahead logs. 5) Not monitoring jobs; use tools like the Spark Streaming UI and Graphite, and YARN cluster mode for automatic restarts.
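A hedged sketch of the "external marker" shutdown pattern mentioned in point 1, applied to Structured Streaming; the marker path is hypothetical and `query` is assumed to be an already-started StreamingQuery handle (such as the one in the previous ingest sketch):

```python
import os
import time

MARKER = "/tmp/stop-streaming-job"  # hypothetical marker file an operator can touch

def stop_gracefully(query, poll_seconds=30):
    """Poll for a marker file and stop the query from the driver between checks."""
    while query.isActive:
        if os.path.exists(MARKER):
            # stop() shuts the query down cleanly from the driver instead of
            # killing the process mid-write.
            query.stop()
            break
        time.sleep(poll_seconds)
```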
Architecting a Next Generation Data Platformhadooparchbook
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high level architecture including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka streams and storage in Hadoop.
What no one tells you about writing a streaming apphadooparchbook
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
Architecting next generation big data platformhadooparchbook
A tutorial on architecting a next-generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
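A sketch of the salting technique mentioned above for a join that is skewed on one key; the input paths, column names, and salt factor are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, floor, lit, rand, sequence

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

facts = spark.read.parquet("/data/clicks.parquet")  # skewed on user_id (hypothetical)
dims = spark.read.parquet("/data/users.parquet")    # one row per user_id (hypothetical)

SALT = 8  # spread each hot key across 8 sub-keys

# Add a random salt to the skewed side so one hot key becomes SALT smaller keys.
facts_salted = facts.withColumn("salt", floor(rand() * SALT).cast("int"))

# Replicate the small side once per salt value so every salted key still finds a match.
dims_salted = dims.withColumn("salt", explode(sequence(lit(0), lit(SALT - 1))))

# Joining on (user_id, salt) splits the hot key's rows across many tasks.
joined = facts_salted.join(dims_salted, ["user_id", "salt"])
```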
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
Delivered by Mark Grover at Northern CO Hadoop User Group:
http://www.meetup.com/Northern-Colorado-Big-Data-Meetup/events/224717963/
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
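A short sketch of the processed-data layout the document recommends: read the raw Avro/Snappy clickstream and rewrite it as columnar Parquet partitioned by date; the paths and column name are hypothetical, and reading Avro assumes the spark-avro package is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-layout").getOrCreate()

# Raw clickstream previously landed as Avro with Snappy compression (hypothetical path).
raw = spark.read.format("avro").load("/data/raw/clickstream")

# Processed layout: columnar Parquet, partitioned by event date for partition pruning.
(raw.write
    .partitionBy("event_date")
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("/data/processed/clickstream"))
```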
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
Architectural considerations for Hadoop Applicationshadooparchbook
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
Impala Architecture presentation
1. Impala: A Modern, Open-Source SQL Engine for Hadoop
Mark Grover
Software Engineer, Cloudera
January 7th, 2014
Twitter: mark_grover
github.com/markgrover/impala-thug/
2. Agenda
• What is Impala and what is Hadoop?
• Execution frameworks on Hadoop – MR, Hive, etc.
• Goals and user view of Impala
• Architecture of Impala
• Comparing Impala to other systems
• Impala Roadmap
4. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
✓ Distributed
✓ Fault tolerant
✓ Scalable
CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework
Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
5. So, what's wrong with MapReduce?
• Batch oriented
• High latency
• Not all paradigms fit very well
• Only for developers
6. What are Hive and Pig?
• MR is hard and only for developers
• Higher-level platforms for converting declarative syntax to MapReduce
  • SQL – Hive
  • workflow language – Pig
• Built on top of MapReduce
7. What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta version out since October 2012
• General availability (v1.0) release out since April 2013
• Open source under the Apache license
• Latest release (v1.2.3) released on December 23rd
8. Impala Overview: Goals
• General-purpose SQL query engine:
  • works for both analytical and transactional/single-row workloads
  • supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • reads widely used Hadoop file formats
  • talks to widely used Hadoop storage managers
  • runs on the same nodes that run Hadoop processes
• High performance:
  • C++ instead of Java
  • runtime code generation
  • completely new execution engine – no MapReduce
9. User View of Impala: Overview
• Runs as a distributed service in the cluster: one Impala daemon on each node with data
• Highly available: no single point of failure
• User submits query via ODBC/JDBC, the Impala CLI, or Hue to any of the daemons
• Query is distributed to all nodes with relevant data
• Impala uses Hive's metadata interface, connects to the Hive metastore
10. User View of Impala: Overview
• There is no ‘Impala format’!
• There is no ‘Impala format’!!
• Supported file formats:
  • uncompressed/LZO-compressed text files
  • SequenceFiles and RCFile with snappy/gzip compression
  • Avro data files
  • Parquet columnar format (more on that later)
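For illustration only (not part of the original deck), a minimal sketch of exposing existing files in one of these formats to Impala. The table and the HDFS path /data/logs are hypothetical; the DDL uses standard Impala/Hive syntax for comma-delimited text files:

  CREATE EXTERNAL TABLE logs (
    ts BIGINT,
    user_id STRING,
    url STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/logs';

  -- query it like any other table
  SELECT url, COUNT(*) AS hits
  FROM logs
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;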
11. User View of Impala: SQL
• SQL support:
  • essentially SQL-92, minus correlated subqueries
  • INSERT INTO … SELECT …
  • only equi-joins; no non-equi joins, no cross products
  • ORDER BY requires LIMIT
  • (limited) DDL support
  • SQL-style authorization via Apache Sentry (incubating)
  • UDFs and UDAFs are supported
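As a hedged illustration of this subset (not from the slides; the tables orders, customers, and top_customers are hypothetical):

  -- equi-join; ORDER BY must be accompanied by LIMIT in this release
  SELECT c.name, SUM(o.amount) AS total
  FROM orders o
  JOIN customers c ON (o.cust_id = c.id)
  GROUP BY c.name
  ORDER BY total DESC
  LIMIT 20;

  -- INSERT INTO … SELECT … into an existing table with a matching schema
  INSERT INTO top_customers
  SELECT c.id, c.name
  FROM customers c
  JOIN orders o ON (o.cust_id = c.id)
  GROUP BY c.id, c.name;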
12. User View of Impala: SQL
• Functional limitations:
  • no custom file formats or SerDes
  • no "beyond SQL" features (buckets, samples, transforms, arrays, structs, maps, xpath, json)
• Broadcast joins and partitioned hash joins supported
  • the smaller table has to fit in the aggregate memory of all executing nodes
13. User View of Impala: HBase
• Functionality highlights:
  • support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES(…)
  • predicates on rowkey columns are mapped into start/stop rows
  • predicates on other columns are mapped into SingleColumnValueFilters
• But: mapping of HBase tables and metastore table patterned after Hive
  • all data stored as scalars and in ASCII
  • the rowkey needs to be mapped into a single string column
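For context, a sketch of the Hive-patterned mapping described above; table and column names are hypothetical. The HBase table is registered in the Hive metastore using Hive's HBase storage handler (DDL run through Hive), with the rowkey mapped to a single string column; Impala can then query it, and the rowkey predicate is turned into an HBase start/stop row scan:

  -- run in Hive: map the HBase table 'users' into the metastore
  CREATE EXTERNAL TABLE hbase_users (
    rowkey STRING,
    name   STRING,
    state  STRING
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:state')
  TBLPROPERTIES ('hbase.table.name' = 'users');

  -- run in Impala: the rowkey predicate maps to a start/stop row
  SELECT name, state FROM hbase_users WHERE rowkey = 'user_00042';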
14. User View of Impala: HBase
• Roadmap:
  • full support for UPDATE and DELETE
  • storage of structured data to minimize storage and access overhead
  • composite row key encoding, mapped into an arbitrary number of table columns
15. Impala Architecture
• Three binaries: impalad, statestored, catalogd
• Impala daemon (impalad) – N instances
  • handles client requests and all internal requests related to query execution
• State store daemon (statestored) – 1 instance
  • provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
  • relays metadata changes to all impalad's
16. Impala Architecture
• Query execution phases:
  • request arrives via ODBC/JDBC
  • planner turns request into collections of plan fragments
  • coordinator initiates execution on remote impalad's
17. Impala Architecture
• During execution:
  • intermediate results are streamed between executors
  • query results are streamed back to the client
  • subject to limitations imposed by blocking operators (top-N, aggregation)
19. Impala Architecture: Query Execution
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on remote impalad's
[Diagram: a SQL app submits a query over ODBC to the Query Planner of one impalad; that node's Query Coordinator fans the plan fragments out to the Query Executors of the other impalad's, each running alongside an HDFS DataNode and HBase; the coordinator also consults the Hive metastore, the HDFS NameNode, and the Statestore + Catalogd.]
20. Impala Architecture: Query Execution
• Intermediate results are streamed between impalad's
• Query results are streamed back to the client
[Diagram: same cluster layout as the previous slide (Query Planner / Coordinator / Executor on each node with an HDFS DataNode and HBase, plus Hive metastore, HDFS NameNode, Statestore + Catalogd); the executors exchange intermediate results and the coordinating node streams the final query results back to the ODBC client.]
21. Query Planning: Overview
• 2-phase planning process:
  • single-node plan: left-deep tree of plan operators
  • plan partitioning: partition the single-node plan to maximize scan locality, minimize data movement
• Parallelization of operators:
  • all query operators are fully distributed
23. Single-Node Plan: Example Query
SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
24. Query Planning: Single-Node Plan
• Single-node plan for the example:
[Plan tree: TopN over Agg over HashJoin(HashJoin(Scan: t1, Scan: t2), Scan: t3)]
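To see how the planner fragments a statement, EXPLAIN can be run on the example query; this is a usage note rather than original slide content, and the exact output depends on table statistics and the Impala release:

  EXPLAIN
  SELECT t1.custid, SUM(t2.revenue) AS revenue
  FROM LargeHdfsTable t1
  JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
  JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
  WHERE t3.category = 'Online'
  GROUP BY t1.custid
  ORDER BY revenue DESC LIMIT 10;

The output lists the plan fragments and exchange nodes that the next slides walk through (partitioned vs. broadcast joins, pre- and merge aggregation, top-N).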
25. Query Planning: Distributed Plans
• Goals:
  o maximize scan locality, minimize data movement
  o full distribution of all query operators (where semantically correct)
• Parallel joins:
  o broadcast join: join is collocated with the left input; the right-hand side table is broadcast to each node executing the join -> preferred for small right-hand side input
  o partitioned join: both tables are hash-partitioned on the join columns -> preferred for large joins
  o cost-based decision based on column stats/estimated cost of data transfers
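Two practical notes, offered as a hedged aside rather than slide content: the cost-based choice depends on statistics, which can be gathered with COMPUTE STATS (available as of Impala 1.2.2), and the planner's join strategy can be overridden with [BROADCAST]/[SHUFFLE] hints; exact hint syntax may vary by release. Table names are the ones from the example query:

  COMPUTE STATS LargeHdfsTable;

  SELECT t1.custid, SUM(t2.revenue) AS revenue
  FROM LargeHdfsTable t1
  JOIN [SHUFFLE] LargeHdfsTable t2 ON (t1.id1 = t2.id)
  GROUP BY t1.custid
  ORDER BY revenue DESC LIMIT 10;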
26. Query Planning: Distributed Plans
• Parallel aggregation:
  o pre-aggregation where data is first materialized
  o merge aggregation partitioned by grouping columns
• Parallel top-N:
  o initial top-N operation where data is first materialized
  o final top-N in a single-node plan fragment
27. Query Planning: Distributed Plans
• In the example:
  o scans are local: each scan receives its own fragment
  o 1st join: large x large -> partitioned join
  o 2nd join: large x small -> broadcast join
  o pre-aggregation in the fragment that materializes the join result
  o merge aggregation after repartitioning on the grouping column
  o initial top-N in the fragment that does the merge aggregation
  o final top-N in the coordinator fragment
28. Query Planning: Distributed Plans
[Distributed plan diagram for the example: local scans of t1 and t2 at the HDFS DataNodes feed a partitioned HashJoin (hash exchange on t1.id1 and t2.id); the scan of t3 at the HBase RegionServer is broadcast into a second HashJoin; pre-aggregation runs in that join fragment; merge aggregation and the initial TopN run after a hash repartition on t1.custid; the final TopN runs in the coordinator fragment.]
29. Metadata Handling
• Impala metadata:
  o Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
  o HDFS NameNode: directory contents and block replica locations
  o HDFS DataNode: block replicas' volume ids
30. Metadata Handling
• Caches metadata: no synchronous metastore API calls during query execution
• impalad instances read metadata from the metastore at startup
• Catalog Service relays metadata changes when you run DDL or update metadata on one of the impalad's
• REFRESH [<tbl>]: reloads metadata on all impalad's (if you added new files via Hive)
• INVALIDATE METADATA: reloads metadata for all tables
• Roadmap: HCatalog
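Usage sketch for the two statements above (the table name sales is hypothetical):

  -- after adding new data files to an existing table outside Impala (e.g., via Hive)
  REFRESH sales;

  -- after creating or dropping tables, or changing schemas, outside Impala
  INVALIDATE METADATA;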
31. Impala Execution Engine
• Written in C++ for minimal execution overhead
• Internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics/special CPU instructions for text parsing, CRC32 computation, etc.
• Runtime code generation for "big loops"
32. Impala Execution Engine
• More on runtime code generation
  • example of a "big loop": insert a batch of rows into a hash table
  • known at query compile time: # of tuples in a batch, tuple layout, column types, etc.
  • generated at compile time: an unrolled loop that inlines all function calls, contains no dead code, and minimizes branches
  • code generated using LLVM
33. Impala's Statestore
• Central system state repository
  • name service (membership)
  • metadata
  • Roadmap: other scheduling-relevant or diagnostic state
• Soft-state
  • all data can be reconstructed from the rest of the system
  • cluster continues to function when the statestore fails, but per-node state becomes increasingly stale
• Sends periodic heartbeats
  • pushes new data
  • checks for liveness
34. Statestore: Why not ZooKeeper?
• ZK is not a good pub-sub system
  • Watch API is awkward and requires a lot of client logic
  • multiple round-trips required to get data for changes to a node's children
  • push model is more natural for our use case
• Don't need all the guarantees ZK provides:
  • serializability
  • persistence
  • prefer to avoid complexity where possible
• ZK is bad at the things we care about and good at the things we don't
35. Comparing Impala to Dremel
• What is Dremel?
  • columnar storage for data with nested structures
  • distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  • stores data in appropriate native/binary types
  • can also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
36. More about Parquet
• What is it:
  • container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
  • successor to Trevni
  • jointly developed between Cloudera and Twitter
  • open source; hosted on GitHub
• Features
  • rowgroup format: file contains multiple horizontal slices
  • supports storing each column in a separate file
  • supports fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO
  • column values stored in native types (bool, int<x>, float, double, byte array)
  • support for index pages for fast lookup
  • extensible value encodings
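As a concrete usage sketch (not from the slides), an existing text-format table can be converted to Parquet from Impala; this release used the keyword PARQUETFILE, and later releases also accept STORED AS PARQUET. Table names are hypothetical:

  CREATE TABLE logs_parquet LIKE logs STORED AS PARQUETFILE;

  INSERT OVERWRITE TABLE logs_parquet SELECT * FROM logs;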
37. Comparing Impala to Hive
• Hive: MapReduce as an execution engine
  • high-latency, low-throughput queries
  • fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
  • Java runtime allows for easy late binding of functionality: file formats and UDFs
  • extensive layering imposes high runtime overhead
• Impala:
  • direct, process-to-process data exchange
  • no fault tolerance
  • an execution engine designed for low runtime overhead
38. Impala Roadmap: 2013
• Additional SQL:
  • ORDER BY without LIMIT
  • analytic window functions
  • support for structured data types
• Improved HBase support:
  • composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE
39. Impala Roadmap: 2013
• Runtime optimizations:
  • straggler handling
  • improved cache management
  • data collocation for improved join performance
• Resource management:
  • goal: run exploratory and production workloads in the same cluster, against the same data, without impacting production jobs
40. Demo
• Uses Cloudera's Quickstart VM: http://tiny.cloudera.com/quick-start
41. Try it out!
• Open source! Available at cloudera.com, AWS EMR!
• We have packages for:
  • RHEL 5 and 6, SLES 11, Ubuntu Lucid, Maverick, Precise, Debian, etc.
• Questions/comments? community.cloudera.com
• My Twitter handle: mark_grover
• Slides at: github.com/markgrover/impala-thug