I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, its query processing logic, and also competitive information.
2. Who I am
Enterprise Architect @ Pivotal
• 7 years in data processing
• 5 years of experience with MPP
• 4 years with Hadoop
• Using HAWQ since the first internal Beta
• Responsible for designing most of the EMEA HAWQ
and Greenplum implementations
• Spark contributor
• http://0x0fff.com
17. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
18. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
19. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
20. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
21. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
22. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
23. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
– More than 10 of them in EMEA
24. Apache HAWQ
• Apache HAWQ (incubating) from 09’2015
– http://hawq.incubator.apache.org
– https://github.com/apache/incubator-hawq
• What’s in Open Source
– Sources of HAWQ 2.0 alpha
– HAWQ 2.0 beta is planned for 2015’Q4
– HAWQ 2.0 GA is planned for 2016’Q1
• The community is still young – come and join!
26. Why do we need it?
• SQL interface to Hadoop data for BI solutions,
compliant with ANSI SQL-92, -99, -2003
27. Why do we need it?
• SQL interface to Hadoop data for BI solutions,
compliant with ANSI SQL-92, -99, -2003
– Example - a 5000-line query with a number of
window functions generated by Cognos
28. Why do we need it?
• SQL interface to Hadoop data for BI solutions,
compliant with ANSI SQL-92, -99, -2003
– Example - a 5000-line query with a number of
window functions generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
29. Why do we need it?
• SQL interface to Hadoop data for BI solutions,
compliant with ANSI SQL-92, -99, -2003
– Example - a 5000-line query with a number of
window functions generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
– Example - parse URL to extract protocol, host
name, port, GET parameters
30. Why do we need it?
• SQL interface to Hadoop data for BI solutions,
compliant with ANSI SQL-92, -99, -2003
– Example - a 5000-line query with a number of
window functions generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
– Example - parse URL to extract protocol, host
name, port, GET parameters
• Good performance
31. Why do we need it?
• SQL interface to Hadoop data for BI solutions,
compliant with ANSI SQL-92, -99, -2003
– Example - a 5000-line query with a number of
window functions generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
– Example - parse URL to extract protocol, host
name, port, GET parameters
• Good performance
– How many times does the data hit the HDD during
a single Hive query?
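As a quick illustration of the "ad hoc analytics" point above, a URL can be dissected directly in SQL. This is a minimal sketch: the weblogs table and its url column are hypothetical, and split_part is a standard Postgres string function that HAWQ inherits.

-- Hypothetical table: weblogs(url text)
SELECT
    split_part(url, '://', 1)                     AS protocol,      -- e.g. http
    split_part(split_part(url, '://', 2), '/', 1) AS host_and_port, -- e.g. example.com:8080
    split_part(url, '?', 2)                       AS get_params     -- e.g. a=1&b=2
FROM weblogs;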
32. HAWQ Cluster
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
Server 5
Datanode
interconnect
…
33. HAWQ Cluster
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
…
34. HAWQ Cluster
HAWQ Master
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
HAWQ
Standby
Server 2
ZK JM
HAWQ Segment
Server 6
Datanode
HAWQ Segment
Server N
Datanode
HAWQ Segment
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
…
35. Master Servers
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
HAWQ Segment
Server 6
Datanode
HAWQ Segment
Server N
Datanode
HAWQ Segment
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
…
HAWQ Master
HAWQ
Standby
37. HAWQ Master
HAWQ
Standby
Segments
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
HAWQ Segment HAWQ Segment HAWQ Segment …
40. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
41. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
42. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
43. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
44. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
45. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
– Average width of the field in bytes
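A sketch of how these statistics are collected and inspected, assuming the Postgres-style catalog views (pg_class, pg_stats) that HAWQ inherits; the sales table is made up.

-- Gather optimizer statistics for a (hypothetical) table
ANALYZE sales;

-- Number of rows and pages
SELECT relname, reltuples, relpages
FROM   pg_class
WHERE  relname = 'sales';

-- Per-column statistics: null fraction, distinct values, average width,
-- most common values and the value-distribution histogram
SELECT attname, null_frac, n_distinct, avg_width,
       most_common_vals, histogram_bounds
FROM   pg_stats
WHERE  tablename = 'sales';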
48. Statistics
• No Statistics
– How many rows would the join of two tables produce?
– From 0 to infinity
• Row Count
– How many rows would the join of two 1000-row tables produce?
49. Statistics
• No Statistics
– How many rows would the join of two tables produce?
– From 0 to infinity
• Row Count
– How many rows would the join of two 1000-row tables produce?
– From 0 to 1’000’000
50. Statistics
• No Statistics
– How many rows would the join of two tables produce?
– From 0 to infinity
• Row Count
– How many rows would the join of two 1000-row tables produce?
– From 0 to 1’000’000
• Histograms and MCV
– How many rows would the join of two 1000-row tables produce, given known field cardinality, value distribution histogram, number of nulls and most common values?
51. Statistics
• No Statistics
– How many rows would the join of two tables produce?
– From 0 to infinity
• Row Count
– How many rows would the join of two 1000-row tables produce?
– From 0 to 1’000’000
• Histograms and MCV
– How many rows would the join of two 1000-row tables produce, given known field cardinality, value distribution histogram, number of nulls and most common values?
– ~ From 500 to 1’500
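To see these estimates in practice, EXPLAIN prints the optimizer’s expected row count for every plan node. A minimal sketch, assuming two hypothetical 1000-row tables t1 and t2:

EXPLAIN
SELECT *
FROM   t1
JOIN   t2 ON t1.k = t2.k;
-- Each node in the output carries an estimated "rows=..." value; compare the
-- join estimate before and after running ANALYZE t1; ANALYZE t2;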
53. Metadata
• Table structure information
– Distribution fields
ID Name Num Price
1 Apple 10 50
2 Pear 20 80
3 Banana 40 40
4 Orange 25 50
5 Kiwi 5 120
6 Watermelon 20 30
7 Melon 40 100
8 Pineapple 35 90
hash(ID)
54. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
ID Name Num Price
1 Apple 10 50
2 Pear 20 80
3 Banana 40 40
4 Orange 25 50
5 Kiwi 5 120
6 Watermelon 20 30
7 Melon 40 100
8 Pineapple 35 90
hash(ID)
ID Name Num Price
1 Apple 10 50
2 Pear 20 80
3 Banana 40 40
4 Orange 25 50
5 Kiwi 5 120
6 Watermelon 20 30
7 Melon 40 100
8 Pineapple 35 90
55. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
ID Name Num Price
1 Apple 10 50
2 Pear 20 80
3 Banana 40 40
4 Orange 25 50
5 Kiwi 5 120
6 Watermelon 20 30
7 Melon 40 100
8 Pineapple 35 90
hash(ID)
ID Name Num Price
1 Apple 10 50
2 Pear 20 80
3 Banana 40 40
4 Orange 25 50
5 Kiwi 5 120
6 Watermelon 20 30
7 Melon 40 100
8 Pineapple 35 90
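A sketch of how this table-structure metadata is declared, using the Greenplum-style DDL that HAWQ follows. The fruits table mirrors the example above; the partitioned sales table is invented for illustration, and the number of hash buckets is configured separately.

-- Hash-distributed table (distribution field: id)
CREATE TABLE fruits (
    id    int,
    name  text,
    num   int,
    price numeric
)
DISTRIBUTED BY (id);

-- Range-partitioned table (hypothetical)
CREATE TABLE sales (
    id        bigint,
    sale_date date,
    amount    numeric
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
(
    START (date '2015-01-01') INCLUSIVE
    END   (date '2016-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);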
56. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
57. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
– Access privileges
58. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
– Access privileges
• Stored procedures
– PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
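A minimal PL/pgSQL sketch, since HAWQ inherits the Postgres procedural-language machinery; the function and its flat 20% markup are purely illustrative.

CREATE OR REPLACE FUNCTION price_with_tax(price numeric)
RETURNS numeric AS $$
BEGIN
    -- illustrative flat 20% markup
    RETURN price * 1.20;
END;
$$ LANGUAGE plpgsql;

SELECT name, price_with_tax(price) FROM fruits;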
60. Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query
optimizer
– ORCA (Pivotal Query Optimizer) – developed
specifically for HAWQ
61. Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query
optimizer
– ORCA (Pivotal Query Optimizer) – developed
specifically for HAWQ
• Optimizer hints work just like in Postgres
– Enable/disable specific operation
– Change the cost estimations for basic actions
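A sketch of what such hints look like in a session, assuming the Postgres/Greenplum-style settings HAWQ descends from; check the exact GUC names against your version.

-- Choose the optimizer: ORCA (Pivotal Query Optimizer) or the legacy Planner
SET optimizer = on;    -- use ORCA
SET optimizer = off;   -- fall back to the Planner

-- Enable/disable a specific operation for the Planner
SET enable_nestloop = off;

-- Change a cost-estimation constant
SET random_page_cost = 4;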
64. Storage Formats
Which storage format is the most optimal?
It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
65. Storage Formats
Which storage format is the most optimal?
It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
66. Storage Formats
Which storage format is the most optimal?
It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
– Minimal time to retrieve record by key
67. Storage Formats
Which storage format is the most optimal?
It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
– Minimal time to retrieve record by key
– Minimal time to retrieve subset of columns
– etc.
68. Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No TOAST
• No ctid, xmin, xmax, cmin, cmax
69. Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No TOAST
• No ctid, xmin, xmax, cmin, cmax
– Compression
• No compression
• Quicklz
• Zlib levels 1 - 9
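A sketch of the corresponding DDL, assuming the Greenplum-style append-only storage options that HAWQ exposes; the table and the compression choice are illustrative.

-- Row-oriented append-only table compressed with zlib level 5
CREATE TABLE events_row (
    event_id bigint,
    payload  text
)
WITH (APPENDONLY    = true,
      ORIENTATION   = row,
      COMPRESSTYPE  = zlib,
      COMPRESSLEVEL = 5)
DISTRIBUTED BY (event_id);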
70. Storage Formats
• Apache Parquet
– Mixed row-columnar table store: the data is split
into “row groups” stored in columnar format
71. Storage Formats
• Apache Parquet
– Mixed row-columnar table store: the data is split
into “row groups” stored in columnar format
– Compression
• No compression
• Snappy
• Gzip levels 1 – 9
72. Storage Formats
• Apache Parquet
– Mixed row-columnar table store: the data is split
into “row groups” stored in columnar format
– Compression
• No compression
• Snappy
• Gzip levels 1 – 9
– The size of “row group” and page size can be set
for each table separately
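And a Parquet-oriented variant with per-table row group and page sizes; the option names follow HAWQ’s documented storage options, but treat this as a sketch rather than a definitive reference.

-- Parquet table with Snappy compression and explicit row group / page sizes (bytes)
CREATE TABLE events_parquet (
    event_id bigint,
    payload  text
)
WITH (APPENDONLY   = true,
      ORIENTATION  = parquet,
      COMPRESSTYPE = snappy,
      ROWGROUPSIZE = 8388608,
      PAGESIZE     = 1048576)
DISTRIBUTED BY (event_id);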
73. Resource Management
• Two main options
– Static resource split – HAWQ and YARN do not
know about each other
74. Resource Management
• Two main options
– Static resource split – HAWQ and YARN do not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
75. Resource Management
• Two main options
– Static resource split – HAWQ and YARN do not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
• Flexible cluster utilization
– Query might run on a subset of nodes if it is small
76. Resource Management
• Two main options
– Static resource split – HAWQ and YARN do not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
• Flexible cluster utilization
– Query might run on a subset of nodes if it is small
– Query might have many executors on each cluster
node to make it run faster
77. Resource Management
• Two main options
– Static resource split – HAWQ and YARN do not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
• Flexible cluster utilization
– Query might run on a subset of nodes if it is small
– Query might have many executors on each cluster
node to make it run faster
– You can control the parallelism of each query
80. Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
81. Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
82. Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
83. Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
– MIN/MAX number of executors on each node
84. Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
– MIN/MAX number of executors on each node
• Can be set up for user or group
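A hedged sketch of resource queue DDL: the queue name, role and limits are invented, and the option names are assumed from HAWQ 2.x-style resource queues, so verify them against your version’s documentation.

-- Hypothetical queue for reporting users
CREATE RESOURCE QUEUE reporting_queue WITH (
    PARENT               = 'pg_root',
    ACTIVE_STATEMENTS    = 10,     -- max parallel queries
    MEMORY_LIMIT_CLUSTER = 20%,    -- share of cluster memory
    CORE_LIMIT_CLUSTER   = 20%     -- share of cluster CPU cores
);

-- Attach a user to the queue
ALTER ROLE analyst RESOURCE QUEUE reporting_queue;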
85–86. External Data
• PXF
– Framework for external data access
– Easy to extend, many public plugins available
– Official plugins: CSV, SequenceFile, Avro, Hive, HBase
– Open-source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe
• HCatalog
– HAWQ can query tables from HCatalog the same way as HAWQ native tables
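A hedged sketch of both access paths. Host, port, paths, profile and table names are placeholders, and the PXF and HCatalog syntax is taken from the Apache HAWQ documentation rather than from the slides.
-- External table over a CSV file in HDFS via the PXF HdfsTextSimple profile
-- (host, port and path are placeholders)
CREATE EXTERNAL TABLE ext_sells (bar TEXT, beer TEXT, price NUMERIC)
LOCATION ('pxf://namenode:51200/data/sells.csv?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

SELECT count(*) FROM ext_sells;

-- Querying a Hive table registered in HCatalog as if it were a native table
-- (the hcatalog.<hive_db>.<hive_table> naming is an assumption)
SELECT * FROM hcatalog.default.sells_hive LIMIT 10;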
87–114. Query Example
[Diagram repeated on every slide of this walkthrough: the HAWQ Master (Metadata, Transaction Mgr., Query Parser, Query Optimizer, Query Dispatch, Resource Mgr.) talks to the NameNode and the YARN RM; Server 1, Server 2, … Server N each run a Postmaster, a HAWQ Segment with a local directory, and an HDFS Datanode. The slides animate the query lifecycle step by step: Plan → Resource → Prepare → Execute → Result → Cleanup.]
• Plan – the master parses and optimizes the query into a distributed plan made of the operators Scan Bars b, Filter b.city = 'San Francisco', Scan Sells s, HashJoin b.name = s.bar, Project s.beer, s.price, Motion Redist(b.name), Motion Gather
• Resource – the master asks the resource manager for containers: “I need 5 containers, each with 1 CPU core and 256 MB RAM”; the allocation comes back as Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers
• Prepare – query executors (QE) are started in the allocated containers on the segments and receive the plan
• Execute – the executors run the plan
• Result – the result is gathered back through the master
• Cleanup – the query resources are freed (“Free query resources: Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers” – “OK”)
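The plan in this walkthrough corresponds to a join of a bars table and a sells table. A reconstruction of the query behind it (table and column names are read off the plan; everything else is an assumption):
-- Which beers are sold, and at what price, in San Francisco bars
SELECT s.beer, s.price
FROM   bars  b
JOIN   sells s ON b.name = s.bar
WHERE  b.city = 'San Francisco';

-- EXPLAIN prints the distributed plan, including the Redistribute and
-- Gather motions between segments
EXPLAIN
SELECT s.beer, s.price
FROM   bars  b
JOIN   sells s ON b.name = s.bar
WHERE  b.city = 'San Francisco';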
116–120. Query Performance
• Data does not hit the disk unless this cannot be avoided
• Data is not buffered on the segments unless this cannot be avoided
• Data is transferred between the nodes over UDP
• HAWQ has a good cost-based query optimizer
• The C/C++ implementation is more efficient than the Java implementations of competing solutions
• Query parallelism can be easily tuned
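When checking these claims on a concrete workload, the PostgreSQL-style EXPLAIN ANALYZE that HAWQ inherits is the natural starting point; a minimal sketch (table and column names reuse the earlier example):
-- Executes the query and reports per-operator statistics across the
-- segments, which helps locate the slow operator or motion
EXPLAIN ANALYZE
SELECT s.beer, avg(s.price)
FROM   sells s
GROUP  BY s.beer;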
132–133. Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Native support for the Cloudera, MapR and IBM Hadoop distributions
• Make it the best SQL-on-Hadoop engine ever!
134. Summary
• Modern SQL-on-Hadoop engine
• For structured data processing and analysis
• Combines the best techniques of competing solutions
• Just released as open source
• The community is very young
Join our community and contribute!