Building a Virtual Data Lake with Apache Arrow

Jul 31, 2017 · 13 likes · 8,247 views

Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.

Analytics on modern data is incredibly hard
Unprecedented complexity

The demands for data are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine Learning
Segmenting
Fraud prevention

Your analysts are hungry for data
SQL
But your data is everywhere
And it’s not in the shape they need

Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
SQL

Today you engineer data flows and reshaping
Data Staging
Data Warehouse
• $$$
• High overhead
• Proprietary lock in
• Custom ETL
• Fragile transforms
• Slow moving
SQL

Today you engineer data flows and reshaping
Data Staging
Data Warehouse
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
• $$$
• High overhead
• Proprietary lock in
• Custom ETL
• Fragile transforms
• Slow moving
SQL

How can we Tackle this Age-old Problem?
Direct access to data
In-memory, GPU, …
Columnar
Distributed

Apache Arrow: Process & Move Data Fast
• Top-level Apache project as of Feb 2016
• Collaboration among many open source projects around shared needs
• Three components:
  • Language-independent columnar data structures
    • Implementations available for C++, Java, Python
  • Metadata for describing schemas/record batches
  • Protocol for moving data between processes without serialization overhead

High-Performance Data Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
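The contrast on this slide can be sketched with the Python standard library alone. This is a toy illustration, not the Arrow API: `array.array` stands in for a shared columnar buffer, `pickle` stands in for a system-specific serialization step, and the variable names are hypothetical.

```python
import array
import pickle

# Producer holds one column of 64-bit integers in a contiguous buffer.
session_ids = array.array("q", range(1_000))

# "Today": the hand-off goes through serialize + deserialize,
# so the consumer ends up with a separate copy of the data.
copied = pickle.loads(pickle.dumps(session_ids))
copied[0] = -1
producer_unchanged = session_ids[0] == 0  # the copy is independent

# "With a shared format": the consumer reads the producer's bytes directly.
view = memoryview(session_ids)
shared_first = view[0]
session_ids[0] = 42
shares_memory = view[0] == 42  # the consumer sees the write, so nothing was copied
view.release()
```

The second path only works because both sides agree on the layout of the buffer, which is the role a common in-memory format plays across systems.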

Data is Organized in Record Batches
[Figure: two layouts side by side. Streaming Format: a Schema followed by a sequence of Record Batches. File Format: Record Batches together with Schema & File Layout.]
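A minimal sketch of the streaming idea, assuming a simplified model in which a stream is the schema sent once followed by fixed-size record batches. The `Schema` alias, `RecordBatch` class, and `stream` function are hypothetical names for illustration and are not Arrow's own types.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Hypothetical schema representation: ordered (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


@dataclass
class RecordBatch:
    """A slice of rows stored column by column."""
    columns: Dict[str, List]

    @property
    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()), []))


def stream(schema: Schema, table: Dict[str, List], batch_size: int) -> Iterator[object]:
    """Yield the schema once, then record batches of at most batch_size rows."""
    yield schema
    total = len(table[schema[0][0]])
    for start in range(0, total, batch_size):
        yield RecordBatch(
            {name: table[name][start:start + batch_size] for name, _ in schema}
        )


schema: Schema = (("session_id", "int64"), ("url", "utf8"))
table = {"session_id": [1, 2, 3, 4, 5], "url": ["/a", "/b", "/c", "/d", "/e"]}

messages = list(stream(schema, table, batch_size=2))
header, batches = messages[0], messages[1:]
```

Because every batch conforms to the schema sent up front, a consumer can start processing the first batch before the rest of the stream has arrived.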

Each Record Batch is Columnar
SELECT * FROM clickstream WHERE session_id = 1331246351
[Figure: an Intel CPU reading from a Traditional Memory Buffer versus an Arrow Memory Buffer]
Arrow leverages the data parallelism (SIMD) in modern Intel CPUs.
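The layout difference behind the query above can be sketched in plain Python. The rows, the second session id, and the column names other than `session_id` are made up for illustration. Python itself does not issue SIMD instructions here; the sketch only shows why a contiguous, same-typed column is the shape that native vectorized code can scan efficiently.

```python
import array

# Row-oriented: the fields of each record are stored together.
rows = [
    (1331246351, "/home", 200),
    (1331244570, "/cart", 200),
    (1331246351, "/checkout", 500),
]

# Column-oriented: each field is one contiguous buffer of a single type.
session_id = array.array("q", (r[0] for r in rows))
url = [r[1] for r in rows]
status = array.array("h", (r[2] for r in rows))

target = 1331246351

# A row scan walks whole records to test a single field.
row_hits = [r for r in rows if r[0] == target]

# A columnar scan reads only the session_id buffer, then gathers the matches.
matches = [i for i, sid in enumerate(session_id) if sid == target]
col_hits = [(session_id[i], url[i], status[i]) for i in matches]
```

Both paths return the same rows; the columnar path evaluates the predicate without touching the `url` or `status` data.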

Example: Spark to Pandas via Apache Arrow

Fast Import of Arrow in Pandas & R
Credit: Wes McKinney, Two Sigma

Fast Export of Arrow in Spark
• Legacy export from Spark to Pandas (toPandas) was extremely slow
• Row-by-row conversion from Spark driver to Python memory
• SPARK-13534 introduced an Arrow based implementation
• Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM)
• Set spark.sql.execution.arrow.enable = True
                     Legacy toPandas     Arrow-based
Clock time           12.5s               1.89s (6.6x)
Deserialization      88% of the time     1% of the time
Peak memory usage    8x dataset size     2x dataset size
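The gap between row-by-row conversion and a columnar hand-off can be illustrated without Spark. The sketch below is a stand-in using only the Python standard library: `pickle` per record plays the role of the legacy path, and one byte buffer per column plays the role of the batch path. It does not reproduce the Spark implementation, and the record layout is hypothetical.

```python
import array
import pickle

rows = [(i, i * 0.5) for i in range(10_000)]  # (id, value) records

# Row-by-row: one serialized message per record, each decoded on its own.
row_messages = [pickle.dumps(r) for r in rows]
decoded_rows = [pickle.loads(m) for m in row_messages]

# Columnar batch: one contiguous buffer per column, decoded in bulk.
ids = array.array("q", (r[0] for r in rows))
values = array.array("d", (r[1] for r in rows))
column_messages = [ids.tobytes(), values.tobytes()]

decoded_ids = array.array("q")
decoded_ids.frombytes(column_messages[0])
decoded_values = array.array("d")
decoded_values.frombytes(column_messages[1])
```

The row path performs 10,000 decode calls for this input while the columnar path performs two, which is the kind of per-record overhead the deserialization row in the table reflects.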

Designing a Virtual Data Lake Powered by Apache Arrow

Arrow-based Distributed Execution
Persistent Columnar Cache (Parquet)
In-Memory Columnar Cache (Arrow)
Pandas
R
BI
Data Sources (NoSQL, RDBMS, Hadoop, S3)
Arrow-based Execution and Integration
