Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout to address the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project that specifies a portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
We cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We also cover the architecture and the challenges we faced while running this system in production. Finally, we conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
Should I use more, smaller instances, or fewer, bigger instances? Is 1 Gbps enough for my network cards? Should I use batches? Can I have a collection 3 GB in size? Those are just some of the many questions we see users asking themselves daily on our mailing list, in Slack, and in corporate ticket requests. In this talk, I will explore the answers to these common questions and help you make sure that your deployment is up to the highest standards.
This document discusses database security and best practices for securing MySQL databases. It covers common database vulnerabilities like poor configurations, weak authentication, lack of encryption, and improper credential management. It also discusses database attacks like SQL injection and brute force attacks. The document provides recommendations for database administrators to properly configure access controls, encryption, auditing, backups and monitoring to harden MySQL databases.
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
Data lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouse by ensuring the reliability of your concurrent streams, from processing down to the underlying cloud object store. The Flink/Delta connector enables you to store data in Delta tables so that you harness Delta’s reliability (ACID transactions and scalability) while maintaining Flink’s end-to-end exactly-once processing. This ensures that data from Flink is written to Delta tables idempotently: even if the Flink pipeline is restarted from its checkpoint, no data is lost or duplicated, preserving Flink’s exactly-once semantics.
by
Scott Sandre & Denny Lee
This document summarizes a presentation on building a streaming Lakehouse with Apache Flink and Apache Hudi. The presentation introduces Hudi as a way to unify batch and streaming workloads in a centralized data lake platform. It discusses how Hudi enables features like efficient upserts/deletes, incremental processing for change streams, and automatic catalog synchronization. The presentation demonstrates using Flink and Hudi on Amazon EMR and outlines several ongoing Hudi projects, including a new metaserver and lake cache, to further optimize query performance and metadata handling for streaming data lakes.
RedisConf17- Using Redis at scale @ TwitterRedis Labs
The document discusses Nighthawk, Twitter's distributed caching system which uses Redis. It provides caching services at a massive scale of over 10 million queries per second and 10 terabytes of data across 3000 Redis nodes. The key aspects of Nighthawk's architecture that allow it to scale are its use of a client-oblivious proxy layer and cluster manager that can independently scale and rebalance partitions across Redis nodes. It also employs replication between data centers to provide high availability even in the event of node failures. Some challenges discussed are handling "hot keys" that get an unusually high volume of requests and more efficiently warming up replicas when nodes fail.
Scylla Summit 2022: Scylla 5.0 New Features, Part 2ScyllaDB
Scylla 5.0 introduces several new features to improve node operations and compaction:
1. Repair-based node operations (RBNO) provide more efficient, consistent, and simplified bootstrap, replace, rebuild, and other node operations by using row-level repair as the underlying mechanism instead of streaming.
2. Off-strategy compaction keeps sstables generated during node operations in a separate data set and compacts them together after the operation finishes for less compaction work and faster completion.
3. Space amplification goal (SAG) for compaction optimizes space efficiency for overwrite workloads by dynamically adapting compaction to meet latency and space goals, improving storage density.
1. Log structured merge trees store data in multiple levels with different storage speeds and costs, requiring data to periodically merge across levels.
2. This structure allows fast writes by storing new data in faster levels before merging to slower levels, and efficient reads by querying multiple levels and merging results.
3. The merging process involves loading, sorting, and rewriting levels to consolidate and propagate deletions and updates between levels.
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
In the last few years, Netflix’s S3 data warehouse has grown to more than 100 PB. In that time, the company has shared several techniques and released open source tools for working around S3’s quirks, including s3mper to work around eventual consistency, S3 multipart committers to commit data without renames, and the batchid pattern for cross-partition atomic commits.
Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3 that is replacing many of the company’s current tools. Iceberg enables a new generation of improvements, including:
* Snapshot isolation with no directory listing or file renames
* Distributed planning to relieve metastore bottlenecks
* Improved data layout for S3 performance
* Immediately available writes from streaming applications
* Opportunistic compaction and data optimization
PostgreSQL + Kafka: The Delight of Change Data CaptureJeff Klukas
PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
The document discusses updates to InfluxDB IOx, a new columnar time series database. It covers changes and improvements to the API, CLI, query capabilities, and path to open sourcing builds. Key points include moving to gRPC for management, adding PostgreSQL string functions to queries, optimizing functions for scalar values and columns, and monitoring internal systems as the first step to releasing open source builds.
MySQL and MariaDB share the same roots for replication. Both support parallel replication, but they diverge in how parallel replication is implemented.
RocksDB is an embedded key-value store written in C++ and optimized for fast storage environments like flash or RAM. It uses a log-structured merge tree to store data by writing new data sequentially to an in-memory log and memtable, periodically flushing the memtable to disk in sorted SSTables. It reads from the memtable and SSTables, and performs background compaction to merge SSTables and remove overwritten data. RocksDB supports two compaction styles - level style, which stores SSTables in multiple levels sorted by age, and universal style, which stores all SSTables in level 0 sorted by time.
This document provides information about PostgreSQL and Command Prompt Inc. It discusses that Command Prompt offers professional services for migrating from Oracle to PostgreSQL. It then covers aspects of PostgreSQL like its licensing, large international community, low costs, mature codebase, enterprise features, and technical capabilities like its SQL implementation, replication, foreign data wrappers and user-defined functions. It also notes tools that can help ease migration from Oracle and some differences from Oracle.
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
Join us for a developer workshop where we’ll go hands-on to explore the affinities between Rust, the Tokio framework, and ScyllaDB. You’ll go live with our sample Rust application, built on our new, high performance native Rust client driver.
MariaDB Server Performance Tuning & OptimizationMariaDB plc
This document discusses various techniques for optimizing MariaDB server performance, including:
- Tuning configuration settings like the buffer pool size, query cache size, and thread pool settings.
- Monitoring server metrics like CPU usage, memory usage, disk I/O, and MariaDB-specific metrics.
- Analyzing slow queries with the slow query log and EXPLAIN statements to identify optimization opportunities like adding indexes.
Parquet performance tuning: the missing guideRyan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink Jobs for different use-cases and take a look at common challenges they have experienced such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Agreement in a distributed system is complicated but required. Scylla gained lightweight transactions through Paxos, but the latter has a cost of 3x roundtrips. Raft can allow consistent transactions without the performance penalty. Beyond LWT, we plan to integrate Raft with most aspects of Scylla, making a leap forward in manageability and consistency.
This document discusses techniques for improving latency in HBase. It analyzes the write and read paths, identifying sources of latency such as networking, HDFS flushes, garbage collection, and machine failures. For writes, it finds that single puts can achieve millisecond latency while streaming puts can hide latency spikes. For reads, it notes cache hits are sub-millisecond while cache misses and seeks add latency. GC pauses of 25-100ms are common, and failures hurt locality and require cache rebuilding. The document outlines ongoing work to reduce GC, use off-heap memory, improve compactions and caching to further optimize for low latency.
Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)confluent
The document discusses streams and tables as two sides of the same coin. It proposes a dual streaming model that handles out-of-order data by continuously updating stateful operators and emitting changelog streams. This allows processing data with low latency while avoiding buffering and reordering. The model is implemented in Apache Kafka Streams and adopted in industry through tools like KSQL.
This document discusses how to implement an Active Session History feature in PostgreSQL similar to what is available in Oracle. It describes how the default PostgreSQL monitoring views only show current activity and lack historical blocking diagnostics. The document then explains how to write a PostgreSQL extension to sample the pg_stat_activity view periodically and store the results, similar to the pgsentinel extension. The pgsentinel extension is discussed in detail, including how it uses background workers and hooks to collect activity samples. Finally, potential future enhancements are mentioned like integrating with pg_stat_statements.
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
As more and more people move to PostgreSQL from Oracle, a pattern of mistakes is emerging. They can be caused by the tools being used or by simply not understanding how PostgreSQL differs from Oracle. In this talk we will discuss the top mistakes people generally make when moving to PostgreSQL from Oracle and what the correct course of action is.
This presentation is designed to provide a comprehensive overview of the top 10 features of MySQL 8.0, explaining why they are advantageous and how they will improve the MySQL experience for users. Furthermore, this presentation will provide a timeline for users to plan for and upgrade from MySQL 5.7, which will reach its end of life by October 2023.
Recording available YouTube Channel: https://www.youtube.com/c/Mydbops?app=desktop
Managing Data and Operation Distribution In MongoDBJason Terpko
In a sharded MongoDB cluster, scale and data distribution are defined by your shard keys. Even when choosing the correct shard key, ongoing maintenance and review can still be required to maintain optimal performance.
This presentation will review shard key selection and how the distribution of chunks can create scenarios where you may need to manually move, split, or merge chunks in your sharded cluster. Scenarios requiring these actions can exist with both optimal and sub-optimal shard keys. Example use cases will provide tips on selection of shard key, detecting an issue, reasons why you may encounter these scenarios, and specific steps you can take to rectify the issue.
This document provides an overview of managing data and operations distribution in MongoDB sharded clusters. It discusses shard key selection, applying and reverting shard keys, automatic and manual chunk splitting and merging, balancing data distribution across shards, and cleaning up orphaned documents. The goal is to optimally distribute data and load across shards in the cluster.
QuestDB: ingesting a million time series per second on a single instance. Big...javier ramirez
In this session I will show you the technical decisions we made when building QuestDB, the open source, Postgres compatible, time-series database, and how we can achieve a million row writes per second without blocking or slowing down the reads.
MariaDB ColumnStore is a high performance columnar storage engine for MariaDB that supports analytical workloads on large datasets. It uses a distributed, massively parallel architecture to provide faster and more efficient queries. Data is stored column-wise which improves compression and enables fast loading and filtering of large datasets. The cpimport tool allows loading data into MariaDB ColumnStore in bulk from CSV files or other sources, with options for centralized or distributed parallel loading. Proper sizing of ColumnStore deployments depends on factors like data size, workload, and hardware specifications.
How QBerg scaled to store data longer, query it fasterMariaDB plc
The continuous increase in terms of services and countries to which QBerg delivers its services requires an ever-increasing load of resources. During the last year QBerg has reached a critical point, storing so much transactional data that standard relational databases were unable to meet the SLAs, or support the features, required by customers. As an example, they had to cap web analytics to running on a maximum of four months of history. The introduction of MariaDB ColumnStore, flanked by existing MariaDB Server databases, not only will allow them to store multiple years’ worth of historical data for analytics – it decreased overall processing time by one order of magnitude right off the bat. The move to a unified platform was incremental, using MariaDB MaxScale as both a router and a replicator. QBerg is now able to replicate full InnoDB schemas to MariaDB ColumnStore and incrementally update big tables without impacting the performance of ongoing transactions.
Big data challenges are common: we are all doing aggregations, machine learning, anomaly detection, OLAP, and more.
This presentation describes how InnerActive answers those requirements.
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]Kevin Xu
This presentation was delivered at the NYC SQL meetup on September 27, 2018. It provided a technical overview of the TiDB Platform, a deep dive into TiDB's MySQL compatible layer and MySQL ecosystem tools, use case of Mobike, and appendix with detail materials on coprocessor and transaction model.
Memcached is an in-memory key-value store used to improve performance by caching data and database queries. It uses a simple get/set interface and stores data in memory for low-latency access. Memcached is commonly used to cache data for dynamic database-driven websites to reduce the load on databases and improve response time. It provides basic caching functions but has no persistence or security features. Areas for improvement include networking, parallel access, data structures, and memory management.
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
This document provides a summary of migrating to ClickHouse for analytics use cases. It discusses the author's background and company's requirements, including ingesting 10 billion events per day and retaining data for 3 months. It evaluates ClickHouse limitations and provides recommendations on schema design, data ingestion, sharding, and SQL. Example queries demonstrate ClickHouse performance on large datasets. The document outlines the company's migration timeline and challenges addressed. It concludes with potential future integrations between ClickHouse and MySQL.
Testing Persistent Storage Performance in Kubernetes with SherlockScyllaDB
Getting to understand your Kubernetes storage capabilities is important in order to run a proper cluster in production. In this session I will demonstrate how to use Sherlock, an open source platform written to test persistent NVMe/TCP storage in Kubernetes, either via synthetic workloads or via a variety of databases, all easily done and summarized to give you an estimate of the IOPS, latency, and throughput your storage can provide to the Kubernetes cluster.
The document discusses Snowflake architecture and data ingestion workflows. It provides an overview of Snowflake's architecture including its storage, compute, and virtual warehouse components. It also covers topics like file formats, stages, the COPY command, load metadata, and automation of data ingestion including for semi-structured data formats.
Achieving Extreme Scale with ScyllaDB: Tips & TradeoffsScyllaDB
Explore critical strategies – and antipatterns – for achieving low latency at extreme scale
If you’re getting started with ScyllaDB, you’re probably intrigued by its potential to achieve predictable low latency at extreme scale. But how do you ensure that you’re maximizing that potential for your team’s specific workloads and technical requirements?
This webinar offers practical advice for navigating the various decision points you’ll face as you evaluate ScyllaDB for your project and move into production. We’ll cover the most critical considerations, tradeoffs, and recommendations related to:
- Infrastructure selection
- ScyllaDB configuration
- Client-side setup
- Data modeling
Join us for an inside look at the lessons learned across thousands of real-world distributed database projects.
KSCOPE 2013: Exadata Consolidation Success StoryKristofferson A
This document summarizes an Exadata consolidation success story. It describes how three Exadata clusters were consolidated to host 60 databases total. Tools and methodology used included gathering utilization metrics, creating a provisioning plan, implementing the plan, and auditing. The document describes some "war stories" including resolving a slow HR time entry system through SQL profiling, addressing a memory exhaustion issue from an OBIEE report, and using I/O resource management to prioritize critical processes when storage cells became saturated.
Argus Production Monitoring at Salesforce HBaseCon
Tom Valine and Bhinav Sura (Salesforce)
We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
This year, a set of technologies we’ve been developing for years graduated together to bring you uncompromised performance, availability, and elasticity - from Raft consensus to serverless and goodput. In this session Dor Laor, ScyllaDB’s CEO, will cover the major innovations coming to ScyllaDB core and ScyllaDB Cloud.
How We Added Replication to QuestDB - JonTheBeachjavier ramirez
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
This document summarizes a company's migration from their old MySQL database to MariaDB Columnstore to address scalability issues. They identified alternative databases, chose MariaDB Columnstore for its performance and support. They saw a 71% reduction in query times after migrating and refactored their ETL processes and application to take advantage of the new database. While deployment had some challenges around storage capacity, they were able to automate cluster creation and define a migration process to successfully move to the new database.
Introduction to Apache Tajo: Data Warehouse for Big DataJihoon Son
Tajo can infer the schema of self-describing data formats like JSON, ORC, and Parquet at query execution time without needing to pre-define and store the schema separately. This allows Tajo to query nested, complex data without requiring tedious schema definition by the user. Tajo's support of self-describing formats simplifies the process of querying nested, hierarchical data from files like the JSON log example shown.
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveScyllaDB
Want to learn practical tips for designing systems that can scale efficiently without compromising speed?
Join us for a workshop where we’ll address these challenges head-on and explore how to architect low-latency systems using Rust. During this free interactive workshop oriented for developers, engineers, and architects, we’ll cover how Rust’s unique language features and the Tokio async runtime enable high-performance application development.
As you explore key principles of designing low-latency systems with Rust, you will learn how to:
- Create and compile a real-world app with Rust
- Connect the application to ScyllaDB (NoSQL data store)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...ScyllaDB
With over a billion Indians set to shop online, Meesho is redefining e-commerce by making it accessible, affordable, and inclusive at an unprecedented scale. But scaling for Bharat isn’t just about growth—it’s about building a tech backbone that can handle massive traffic surges, dynamic pricing, real-time recommendations, and seamless user experiences. In this session, we’ll take you behind the scenes of Meesho’s journey in democratizing e-commerce while operating at Monster Scale. Discover how ScyllaDB plays a crucial role in handling millions of transactions, optimizing catalog ranking, and ensuring ultra-low-latency operations. We’ll deep dive into our real-world use cases, performance optimizations, and the key architectural decisions that have helped us scale effortlessly.
Navigating common mistakes and critical success factors
Is your team considering or starting a database migration? Learn from the frontline experience gained guiding hundreds of high-stakes migration projects – from startups to Google and Twitter. Join us as Miles Ward and Tim Koopmans have a candid chat about what tends to go wrong and how to steer things right.
We will explore:
- What really pushes teams to the database migration tipping point
- How to scope and manage the complexity of a migration
- Proven migration strategies and antipatterns
- Where complications commonly arise and ways to prevent them
Expect plenty of war stories, along with pragmatic ways to make your own migration as “blissfully boring” as possible.
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...ScyllaDB
Cloudflare’s boot infrastructure dynamically generates and signs boot artifacts for nodes worldwide, ensuring secure, scalable, and customizable deployments. This talk dives into its architecture, scaling decisions, and how it enables seamless testing while maintaining a strong chain of trust.
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn IsarathamScyllaDB
Learn about Agoda's performance tuning strategies for ScyllaDB. Worakarn shares how they optimized disk performance, fine-tuned compaction strategies, and adjusted SSTable settings to match their workload for peak efficiency.
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd ColemanScyllaDB
Yieldmo processes hundreds of billions of ad requests daily with subsecond latency. Initially using DynamoDB for its simplicity and stability, they faced rising costs, suboptimal latencies, and cloud provider lock-in. This session explores their journey to ScyllaDB’s DynamoDB-compatible API.
There’s a common adage that it takes 10 years to develop a file system. As ScyllaDB reaches that 10 year milestone in 2025, it’s the perfect time to reflect on the last decade of ScyllaDB development – both hits and misses. It’s especially appropriate given that our project just reached a critical mass with certain scalability and elasticity goals that we dreamed up years ago. This talk will cover how we arrived at ScyllaDB X Cloud achieving our initial vision, and share where we’re heading next.
Reduce Your Cloud Spend with ScyllaDB by Tzach LivyatanScyllaDB
This talk will explore why ScyllaDB Cloud is a cost-effective alternative to DynamoDB, highlighting efficient design implementations like shared compute, local NVMe storage, and storage compression. It will also discuss new X Cloud features, better plans and pricing, and a direct cost comparison between ScyllaDB and DynamoDB
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence LiuScyllaDB
Terence shares how Clearview AI's infra needs evolved and why they chose ScyllaDB after first-principles research. From fast ingestion to production queries, the talk explores their journey with Rust, embedded DB readers, and the ScyllaDB Rust driver—plus config tips for bulk ingestion and achieving data parity.
Vector Search with ScyllaDB by Szymon WasikScyllaDB
Vector search is an essential element of contemporary machine learning pipelines and AI tools. This talk will share preliminary results on the forthcoming vector storage and search features in ScyllaDB. By leveraging Scylla's scalability and USearch library's performance, we have designed a system with exceptional query latency and throughput. The talk will cover vector search use cases, our roadmap, and a comparison of our initial implementation with other vector databases.
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...ScyllaDB
Workload Prioritization is a ScyllaDB exclusive feature for controlling how different workloads compete for system resources. It's used to prioritize urgent application requests that require immediate response times versus others that can tolerate slighter delays (e.g., large scans). Join this session for a demo of how applying workload prioritization reduces infrastructure costs while ensuring predictable performance at scale.
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...ScyllaDB
Should you move code to data or data to code? Conventional wisdom favors the former, but cloud trends push the latter. This session by the creator of PACELC explores the shift, its risks, and the ongoing debate in data virtualization between push- and pull-based processing.
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...ScyllaDB
Scaling from 66M to 25B+ records in a core financial system is tough—every number must be right, and data must be fresh. In this session, Dmytro shares real-world strategies to balance accuracy with real-time performance and avoid scaling pitfalls. It's purely practical, no-BS insights for engineers.
Object Storage in ScyllaDB by Ran Regev, ScyllaDBScyllaDB
In this talk we take a look at how Object Storage is used by Scylla. We focus on current usage, namely for backup, and we look at the shift in implementation from an external tool to native Scylla. We take a close look at the complexity of backup and restore, mostly in the face of topology changes and token assignments. We also take a glimpse into the future and see how Scylla is going to use Object Storage as its native storage. We explore a few possible applications of it and understand the tradeoffs.
Lessons Learned from Building a Serverless Notifications System by Srushith R...ScyllaDB
Reaching your audience isn’t just about email. Learn how we built a scalable, cost-efficient notifications system using AWS serverless—handling SMS, WhatsApp, and more. From architecture to throttling challenges, this talk dives into key decisions for high-scale messaging.
A Dist Sys Programmer's Journey into AI by Piotr SarnaScyllaDB
This talk explores the culture shock of transitioning from distributed databases to AI. While AI operates at massive scale, distributed storage and compute remain essential. Discover key differences, unexpected parallels, and how database expertise applies in the AI world.
High Availability: Lessons Learned by Paul PreuveneersScyllaDB
How does ScyllaDB keep your data safe, and your mission critical applications running smoothly, even in the face of disaster? In this talk we’ll discuss what we have learned about High Availability, how it is implemented within ScyllaDB and what that means for your business. You’ll learn about ScyllaDB cloud architecture design, consistency, replication and even load balancing and much more.
How Natura Uses ScyllaDB and ScyllaDB Connector to Create a Real-time Data Pi...ScyllaDB
Natura, a top global cosmetics brand with 3M+ beauty consultants in Latin America, processes massive data for orders, campaigns, and analytics. In this talk, Rodrigo Luchini & Marcus Monteiro share how Natura leverages ScyllaDB’s CDC Source Connector for real-time sales insights.
Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce b...ScyllaDB
This is a case study on managing mutable big data: Exploring the evolution of the persistence layer in a processing graph, tackling design challenges, and refining key operational principles along the way.
Database Migration Strategies and Pitfalls by Patrick BossmanScyllaDB
Jump start your migration to ScyllaDB! In this talk, we will discuss how to migrate from Cassandra, DynamoDB, as well as other sources. Review tooling available to assist with Scylla Migrations. Review approaches and considerations for online and offline migrations, and how to plan for a faster migration if necessary.
5. Dataprime Query Engine
■ Custom distributed query engine for a proprietary query language (DataPrime) on arbitrary semi-structured data
■ Queries data stored in object storage
■ Storage format is specialized Parquet files
6. Metastore: Motivation
■ Reading Parquet metadata from object storage is too expensive for large queries
■ Move metadata into separate (faster) storage
■ Block listing with bloom filters
■ Transactional commit log
7. Metastore: Requirements
■ Low latency
■ Scalable
■ Transactional guarantees
Initial implementation: Postgres
Example from one (large) customer:
■ 2k Parquet files / hour
■ 50k Parquet files / day
■ 15 TB of data / day
■ 20 GB of metadata / day
8. Blocks
Important for listing: give me all the blocks for a table in a given time range
Example:
■ Block URL:
s3://cgx-production-c4c-archive-data/cx/parquet/v1/team_id=555585/…
…dt=2022-12-02/hr=10/0246f9e9-f0da-4723-9b64-a12346095d25.parquet
■ Row group: 0, 1, 2 …
■ Min timestamp
■ Max timestamp
■ Number of rows
■ Total size
■ …
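A minimal sketch of how such a block index could be laid out in CQL, written as constants for the Rust service. All names (keyspace "metastore", table "blocks", the text-typed "dt" day bucket) are assumptions based on the fields listed above, not the actual Coralogix schema.

```rust
// Hedged sketch only: table and column names are assumptions based on the fields
// listed on the slide above, not the actual Coralogix schema.
const CREATE_BLOCKS: &str = r#"
CREATE TABLE IF NOT EXISTS metastore.blocks (
    table_url  text,       -- table prefix, e.g. s3://.../team_id=555585/
    dt         text,       -- day bucket taken from the object path, e.g. '2022-12-02'
    block_url  text,       -- full Parquet object URL
    row_group  int,
    min_ts     timestamp,
    max_ts     timestamp,
    row_count  bigint,
    total_size bigint,
    PRIMARY KEY ((table_url, dt), block_url, row_group)
)"#;

// "Give me all the blocks for a table in a given time range": read the day buckets
// covering the range, then prune rows client-side by min_ts/max_ts.
const LIST_BLOCKS: &str = "SELECT block_url, row_group, min_ts, max_ts \
    FROM metastore.blocks WHERE table_url = ? AND dt = ?";
```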
9. Bloom Filters
Used for pruning blocks when filtering by a search term:
■ is a given token maybe in this block, or definitely not?
Works by hashing each token multiple times and setting the corresponding bits to 1. When checking, hash again and verify that all of those bits are 1.
Specifically, a blocked bloom filter (a sequence of small bloom filters) is used:
8192 * 32 bytes
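A minimal, self-contained sketch of the blocked-bloom-filter idea described above (not the Coralogix implementation): 8192 blocks of 32 bytes each, one hash to pick the block and a few more to pick bits inside it. The hash function and the number of bits set per token are assumptions.

```rust
// Illustrative sketch: a blocked bloom filter of 8192 blocks * 32 bytes (~262 kB).
// One hash selects the block, K further hashes select bits inside that block.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BLOCKS: usize = 8192;
const BLOCK_BYTES: usize = 32; // 256 bits per block
const K: u64 = 7;              // bits set per token (assumed)

fn hash(token: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    token.hash(&mut h);
    h.finish()
}

pub struct BlockedBloom {
    bits: Vec<u8>, // 8192 * 32 bytes = 262,144 bytes, as on the slide
}

impl BlockedBloom {
    pub fn new() -> Self {
        Self { bits: vec![0u8; BLOCKS * BLOCK_BYTES] }
    }

    // One hash picks the block, K more pick bit positions within that block.
    fn positions(token: &str) -> (usize, Vec<usize>) {
        let block = (hash(token, 0) as usize) % BLOCKS;
        let bits: Vec<usize> = (1..=K)
            .map(|seed| (hash(token, seed) as usize) % (BLOCK_BYTES * 8))
            .collect();
        (block, bits)
    }

    pub fn insert(&mut self, token: &str) {
        let (block, bits) = Self::positions(token);
        for b in bits {
            self.bits[block * BLOCK_BYTES + b / 8] |= 1u8 << (b % 8);
        }
    }

    /// "Maybe in this block" (true) or "definitely not" (false).
    pub fn maybe_contains(&self, token: &str) -> bool {
        let (block, bits) = Self::positions(token);
        bits.iter()
            .all(|&b| self.bits[block * BLOCK_BYTES + b / 8] & (1u8 << (b % 8)) != 0)
    }
}
```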
10. Column Metadata
Per-column Parquet metadata required for scanning and decoding a Parquet file
Example:
■ Block URL
■ Row Group
■ Column Name
■ Column metadata (blob)
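A possible CQL layout for this per-column metadata, again with assumed names; the blob would hold the serialized Parquet column-chunk metadata.

```rust
// Hypothetical layout; keyspace, table, and column names are assumptions.
const CREATE_COLUMN_METADATA: &str = r#"
CREATE TABLE IF NOT EXISTS metastore.column_metadata (
    block_url   text,
    row_group   int,
    column_name text,
    metadata    blob,   -- serialized Parquet column-chunk metadata
    PRIMARY KEY ((block_url, row_group), column_name)
)"#;
```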
13. Bloom Filters
Problem: how to verify bits are set?
Solution: read the bloom filters and check the bits in the application
Problem: ~50k blocks/day * 262 kB = ~12 GB of data, too much for one query
Solution:
■ chunk the bloom filters and split them into rows
■ by chunking per bloom filter block, we read one row per token:
50k * 32 bytes / token = 1.6 MB / token
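A small sketch of that chunking step: one ~262 kB blocked bloom filter is split into 8192 rows of 32 bytes keyed by chunk index, so a token lookup only has to fetch the single chunk its block hash points at.

```rust
// Sketch only: split a serialized blocked bloom filter (8192 * 32 bytes) into
// (chunk_index, 32-byte chunk) pairs, each of which becomes one row in ScyllaDB.
const BLOCK_BYTES: usize = 32;

fn bloom_filter_chunks(filter: &[u8]) -> impl Iterator<Item = (i32, &[u8])> {
    filter
        .chunks(BLOCK_BYTES)
        .enumerate()
        .map(|(i, chunk)| (i as i32, chunk))
}
```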
14. Bloom Filters: Primary Key (1)
Primary key: ((block_url, row_group), chunk index)
~ 8192 chunks of 32 bytes per bloom filter = ~262kB per partition
Pros:
■ Easy to insert and delete, single batch query
Cons:
■ Need to know the block id before reading
■ A lot of partitions to access, 1 day: 50k partitions
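A hedged CQL sketch of this first layout (names assumed): the whole filter for one block sits in a single ~262 kB partition, which makes writes a single batch but forces a day's listing to touch ~50k partitions.

```rust
// Layout (1): partition per (block_url, row_group), clustering by chunk index.
const CREATE_BLOOM_V1: &str = r#"
CREATE TABLE IF NOT EXISTS metastore.bloom_filters_v1 (
    block_url   text,
    row_group   int,
    chunk_index int,
    chunk       blob,   -- one 32-byte bloom filter block
    PRIMARY KEY ((block_url, row_group), chunk_index)
)"#;
```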
15. Bloom Filters: Primary Key (2)
Primary key: ((table url, hour, chunk index), block url, row group)
~2,000 chunks of 32 bytes each (one per block in the hour) = ~64 kB per partition
Pros:
■ Very fast listing, fewer partitions.
1 day, 5 tokens: 24 * 5 = 120 partitions
■ No dependency on the block, can read in parallel
Cons:
■ Expensive to insert and delete: 8192 partitions for a single block!
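The corresponding hedged CQL sketch for this second layout (names assumed): one partition per (table, hour, chunk index), so a token lookup reads a handful of partitions, while writing one block's filter now spreads across all 8192 chunk-index partitions.

```rust
// Layout (2): partition per (table_url, hour, chunk_index), clustering by block.
const CREATE_BLOOM_V2: &str = r#"
CREATE TABLE IF NOT EXISTS metastore.bloom_filters_v2 (
    table_url   text,
    hour        text,    -- hour bucket, e.g. '2022-12-02T10'
    chunk_index int,
    block_url   text,
    row_group   int,
    chunk       blob,    -- one 32-byte bloom filter block
    PRIMARY KEY ((table_url, hour, chunk_index), block_url, row_group)
)"#;
```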
16. Bloom Filters: Future Approach
Investigate optimal chunking:
find a middle ground between writing large enough chunks and reading unnecessary data
Can we use UDFs with WebAssembly?
SELECT block_url, row_group
FROM bloom_filters
WHERE table_url = ? AND hour = ? AND bloom_filter_matches(bloom_filter, indexes)
■ Let ScyllaDB do the hard work
■ Don’t need to worry about the amount of data we’re sending back to the app
■ Code is already written in Rust
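For illustration, here is the core predicate a bloom_filter_matches(bloom_filter, indexes) UDF would have to evaluate, written as plain Rust; how it would be packaged as a ScyllaDB WebAssembly UDF (ABI, CREATE FUNCTION syntax) is not shown here.

```rust
// Sketch of the matching logic only: given one 32-byte chunk and the bit positions
// derived from the search token, report "maybe present" (true) or "definitely not" (false).
fn bloom_filter_matches(chunk: &[u8], bit_indexes: &[usize]) -> bool {
    bit_indexes
        .iter()
        .all(|&b| chunk.get(b / 8).map_or(false, |byte| ((*byte >> (b % 8)) & 1) == 1))
}
```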
17. Be Careful
It’s very much not SQL - try to avoid migrations (and bugs)
Solutions:
■ Rename columns?
■ Add new columns, UPDATE blocks SET query?
■ Truncate table and start over again
18. ScyllaDB: Ecosystem
Extensive usage of ScyllaDB libraries and components:
■ Written in Rust on top of scylla-rust-driver
■ ScyllaDB Operator for k8s
■ ScyllaDB Monitoring
■ ScyllaDB Manager
From knowing ScyllaDB exists to production-ready and terabytes of data in 2 months
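As a rough illustration of what that looks like in practice, a minimal connection-and-query sketch against the hypothetical blocks table from earlier, using the 0.x-style scylla crate API (method names differ between driver releases).

```rust
// Hedged sketch: assumes the hypothetical metastore.blocks schema shown earlier and
// the 0.x scylla crate API; not the actual Coralogix service code.
use scylla::{IntoTypedRows, Session, SessionBuilder};

async fn connect(nodes: &[&str]) -> Result<Session, Box<dyn std::error::Error>> {
    let mut builder = SessionBuilder::new();
    for node in nodes {
        builder = builder.known_node(*node);
    }
    Ok(builder.build().await?)
}

async fn list_blocks(
    session: &Session,
    table_url: &str,
    dt: &str,
) -> Result<Vec<(String, i32)>, Box<dyn std::error::Error>> {
    // Prepared statement for the hot listing path.
    let prepared = session
        .prepare("SELECT block_url, row_group FROM metastore.blocks WHERE table_url = ? AND dt = ?")
        .await?;
    let result = session.execute(&prepared, (table_url, dt)).await?;
    let mut blocks = Vec::new();
    if let Some(rows) = result.rows {
        for row in rows.into_typed::<(String, i32)>() {
            blocks.push(row?);
        }
    }
    Ok(blocks)
}
```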
19. Hardware
Cost is very important
3-node cluster:
■ 8 vCPU
■ 32 GiB memory
■ ARM/Graviton
■ EBS volumes (gp3)
■ 500 MBps bandwidth
■ 12k IOPS
20. Metastore: Block Listing
Largest cluster: 4-5 TB on each node, mostly for one customer
Writes:
■ p99 latency: <1ms
■ ~10k writes / s
Block listing:
■ Depends on query & whether we’re using bloom filters
■ for 1 hour: <20ms latency
■ for 1 day: <500ms latency
21. Metastore: Column Metadata
Reads:
■ p50 latency: 5ms
■ p99 latency: 100ms (when we time out)
Issue: a large number of concurrent queries
Probably a disk issue
22. Conclusion
■ Keep an eye on partition sizes
■ Think about read/write patterns
■ Very happy with block listing…
… but unpredictable tail latency for reading column metadata
■ Probably shouldn’t use EBS :-)
23. Thank You
Stay in Touch
Dan Harris
[email protected]
@thinkharderdev
github.com/thinkharderdev
www.linkedin.com/in/dsh2va
github.com/sebver
[email protected]
www.linkedin.com/in/sebastian-vercruysse
Sebastian Vercruysse