In-Memory Analytics with Apache Arrow: Accelerate data analytics for efficient processing of flat and hierarchical data structures
By Matthew Topol and Wes McKinney
In-Memory Analytics with Apache Arrow
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Apeksha Shetty
Publishing Product Manager: Chayan Majumdar
Book Project Manager: Aparna Nair
Senior Content Development Editor: Shreya Moharir
Technical Editor: Sweety Pagaria
Copy Editor: Safis Editing
Proofreader: Shreya Moharir
Indexer: Pratik Shirodkar
Production Designer: Prafulla Nikalje
DevRel Marketing Executive: Nivedita Singh
First published: June 2022
Second edition: September 2024
Production reference: 1060924
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-83546-122-8
www.packtpub.com
For my family, Kat and Haley, who managed to tolerate me the entire time I was writing this.
Also for Logan and Penny, my fuzzy coding companions who got me through so much. Their memory is a blessing.
Foreword
Since launching as an open source project in 2016, Apache Arrow has rapidly become the de facto standard for interoperability and accelerated in-memory processing for tabular data. We have broadened support to a dozen programming languages while expanding substantially beyond the project’s initial goal of defining a standardized columnar data format to create a true multi-language developer toolbox for creating high-performance data applications. While Arrow has helped greatly with improving interoperability and performance in heterogeneous systems (such as across programming languages or different kinds of execution engines), it is also increasingly being chosen as the foundation for building new data processing systems and databases. With Dremio as the first true Arrow-native system, we hope that many more production systems will become Arrow-compatible or Arrow-native over the coming years.
Part of Arrow’s success and the rapid growth of its developer community comes from the passion and time investment of its early adopters and most prolific core contributors. Matt Topol has been a driving force in the Go libraries for Arrow, and with this new book, he has made a significant contribution to making the whole project a lot more accessible to newcomers. The book goes in depth into the details of how different pieces of Arrow work while highlighting the many different building blocks that could be employed by an Arrow user to accelerate or simplify their application.
I am thrilled to see this updated second edition of this book as the Arrow project and its open source ecosystem continue to expand in new, impactful directions, even more than eight years since the project started. This was the first true Arrow book since the project’s founding, and it is a valuable resource for developers who want to explore different areas in depth and to learn how to apply new tools in their projects. I’m always happy to recommend it to new users of Arrow as well as existing users who are looking to deepen their knowledge by learning from an expert like Matt.
– Wes McKinney
Co-founder of Voltron Data and Principal Architect at Posit
Co-creator and PMC for Apache Arrow
Contributors
About the author
Matthew Topol is a member of the Apache Arrow Project Management Committee (PMC) and a staff software engineer at Voltron Data, Inc. Matt has worked in infrastructure, application development, and large-scale distributed system analytical processing for financial data. At Voltron Data, Matt’s primary responsibilities have been working on and enhancing the Apache Arrow libraries and associated sub-projects. In his spare time, Matt likes to bash his head against a keyboard, develop and run delightfully demented fantasy games for his victims—er—friends, and share his knowledge and experience with anyone interested enough to listen.
A very special thanks go out to my friends Hope and Stan, whose encouragement is the only reason I wrote a book in the first place. Finally, thanks go to my parents, who beam with pride every time I talk about this book. Thank you for your support and for being there through everything.
About the reviewers
Weston Pace is a maintainer for the Apache Arrow project and a member of the Arrow PMC and Substrait SMC. He has worked closely with the C++, Python, and Rust implementations of Apache Arrow. He has developed components in several of the systems described in this book, such as datasets and Acero. Weston is currently employed at LanceDB, where he is working on new Arrow-compatible storage formats to enable even more Arrow-native technology.
Jacob Wujciak-Jens is an Apache Arrow committer and an elected member of the Apache Software Foundation. His work at Voltron Data as a senior software release engineer has included pivotal roles in the Apache Arrow and Velox projects. During his tenure, he has developed a deep knowledge of the release processes, build systems, and inner workings of these high-profile open source software projects. Jacob has a passion for open source and its use, both in the open source community and industry. Holding a Master of Education in computer science and public health, he loves to share his knowledge, enriching the community and enhancing collaborative projects.
Raúl Cumplido is a member of the Apache Arrow PMC and has been the release manager for the project for more than 10 releases now. He has worked on several areas of the project and has always been involved with open source communities, contributing mainly to Python-related projects. He’s one of the cofounders of the Python Spanish Association and has also been involved in the organization of several EuroPython and PyCon ES conferences. He currently works as a senior software release engineer at Voltron Data, where he contributes to the Apache Arrow project.
Table of Contents
Preface
Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
1
Getting Started with Apache Arrow
Technical requirements
Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and physical memory layout
Quick summary of physical layouts, or TL;DR
How to speak Arrow
Arrow format versioning and stability
Would you download a library? Of course!
Setting up your shooting range
Using PyArrow for Python
C++ for the 1337 coders
Go, Arrow, go!
Summary
References
2
Working with Key Arrow Specifications
Technical requirements
Playing with data, wherever it might be!
Working with Arrow tables
Accessing data files with PyArrow
Accessing data files with Arrow in C++
Bears firing arrows
Putting pandas in your quiver
Making pandas run fast
Keeping pandas from running wild
Polar bears use Rust-y arrows
Sharing is caring… especially when it’s your memory
Diving into memory management
Managing buffers for performance
Crossing boundaries
Summary
3
Format and Memory Handling
Technical requirements
Storage versus runtime in-memory versus message-passing formats
Long-term storage formats
In-memory runtime formats
Message-passing formats
Summing up
Passing your Arrows around
What is this sorcery?!
Producing and consuming Arrows
Learning about memory cartography
The base case
Parquet versus CSV
Mapping data into memory
Too long; didn’t read (TL;DR) – computers are magic
Leaving the CPU – using device memory
Starting with a few pointers
Device-agnostic buffer handling
Summary
Part 2: Interoperability with Arrow: The Power of Open Standards
4
Crossing the Language Barrier with the Arrow C Data API
Technical requirements
Using the Arrow C data interface
The ArrowSchema structure
The ArrowArray structure
Example use cases
Using the C data API to export Arrow-formatted data
Importing Arrow data with Python
Exporting Arrow data with the C Data API from Python to Go
Streaming Arrow data between Python and Go
What about non-CPU device data?
The ArrowDeviceArray struct
Using ArrowDeviceArray
Other use cases
Some exercises
Summary
5
Acero: A Streaming Arrow Execution Engine
Technical requirements
Letting Acero do the work for you
Input shaping
Value casting
Types of functions in Acero
Invoking functions
Using the C++ compute library
Using the compute library in Python
Picking the right tools
Adding a constant value to an array
Compute Add function
A simple for loop
Using std::for_each and reserve space
Divide and conquer
Always have a plan
Where does Acero fit?
Acero’s core concepts
Let’s get streaming!
Simplifying complexity
Summary
6
Using the Arrow Datasets API
Technical requirements
Querying multifile datasets
Creating a sample dataset
Discovering dataset fragments
Filtering data programmatically
Expressing yourself – a quick detour
Using expressions for filtering data
Deriving and renaming columns (projecting)
Using the Datasets API in Python
Creating our sample dataset
Discovering the dataset
Using different file formats
Filtering and projecting columns with Python
Streaming results
Working with partitioned datasets
Writing partitioned data
Connecting everything together
Summary
7
Exploring Apache Arrow Flight RPC
Technical requirements
The basics and complications of gRPC
Building modern APIs for data
Efficiency and streaming are important
Arrow Flight’s building blocks
Horizontal scalability with Arrow Flight
Adding your business logic to Flight
Other bells and whistles
Understanding the Flight Protobuf definitions
Using Flight, choose your language!
Building a Python Flight server
Building a Go Flight server
What is Flight SQL?
Setting up a performance test
Everyone gets a containerized development environment!
Running the performance test
Flight SQL, the new kid on the block
Summary
8
Understanding Arrow Database Connectivity (ADBC)
Technical requirements
ODBC takes an Arrow to the knee
Lost in translation
Arrow adoption in ODBC drivers
The benefits of standards around connectivity
The ADBC specification
ADBC databases
ADBC connections
ADBC statements
ADBC error handling
Using ADBC for performance and adaptability
ADBC with C/C++
Using ADBC with Python
Using ADBC with Go
Summary
9
Using Arrow with Machine Learning Workflows
Technical requirements
SPARKing new ideas on Jupyter
Understanding the integration of Arrow in Spark
Containerization makes life easier
SPARKing joy with Arrow and PySpark
Facehuggers implanting data
Setting up your environment
Proving the benefits by checking resource usage
Using Arrow with the standard tools for ML
More GPU, more speed!
Summary
Part 3: Real-World Examples, Use Cases, and Future Development
10
Powered by Apache Arrow
Swimming in data with Dremio Sonar
Clarifying Dremio Sonar’s architecture
The library of the gods…of data analysis
Spicing up your data workflows
Arrow in the browser using JavaScript
Gaining a little perspective
Taking flight with Falcon
An Influx of connectivity
Summary
11
How to Leave Your Mark on Arrow
Technical requirements
Contributing to open source projects
Communication is key
You don’t necessarily have to contribute code
There are a lot of reasons why you should contribute!
Preparing your first pull request
Creating and navigating GitHub issues
Setting up Git
Orienting yourself in the code base
Building the Arrow libraries
Creating the pull request
Understanding Archery and the CI configuration
Find your interest and expand on it
Getting that sweet, sweet approval
Finishing up with style!
C++ code styling
Python code styling
Go code styling
Summary
12
Future Development and Plans
Globetrotting with data – GeoArrow and GeoParquet
Collaboration breeds success
Expanding ADBC adoption
Final words
Index
Other Books You May Enjoy
Preface
To quote a famous blue hedgehog, Gotta Go Fast! When it comes to data, speed is important. Whether you’re collecting data, analyzing it, or developing utilities for others to do so, performance and efficiency are going to be huge factors in your technology choices, not just in the efficiency of the software itself, but also in development time. You need the right tools and the right technology, or you’re dead in the water.
The Apache Arrow ecosystem is developer-centric, and this book is no different. Get started with understanding what Arrow is and how it works, then learn how to utilize it in your projects. You’ll find code examples, explanations, and diagrams here, all with the express purpose of helping you learn. You’ll integrate your data sources with Python DataFrame libraries such as pandas or NumPy and utilize Arrow Flight to create efficient data services.
With real-world datasets, you’ll learn how to leverage Apache Arrow with Apache Spark and other technologies. Apache Arrow’s format is language-independent and organized so that analytical operations are performed extremely quickly on modern CPU and GPU hardware. Join the industry adoption of this open source data format and save yourself valuable development time creating high-performance, memory-efficient analytical workflows.
This book has been a labor of love to share knowledge. I hope you learn a lot from it! I sure did when writing it.
Who this book is for
This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. This book will also be useful for any engineers who are working on building utilities for data analytics, query engines, or otherwise working with tabular data, regardless of the language they are programming in.
What this book covers
Chapter 1, Getting Started with Apache Arrow, introduces you to the basic concepts underpinning Apache Arrow. It introduces and explains the Arrow format and the data types it supports, along with how they are represented in memory. Afterward, you’ll set up your development environment and run some simple code examples showing the basic operation of Arrow libraries.
Chapter 2, Working with Key Arrow Specifications, continues your introduction to Apache Arrow by explaining how to read both local and remote data files using different formats. You’ll learn how to integrate Arrow with the Python pandas and Polars libraries and how to utilize the zero-copy aspects of Arrow to share memory for performance.
Chapter 3, Format and Memory Handling, discusses the relationships between Apache Arrow and Apache Parquet, Feather, Protocol Buffers, JSON, and CSV data, along with when and why to use these different formats. Following this, the Arrow IPC format is introduced and described, along with an explanation of using memory mapping to further improve performance. Finally, we wrap up with some basic leveraging of Arrow on a GPU.
Chapter 4, Crossing the Language Barrier with the Arrow C Data API, introduces the titular C Data API for efficiently passing Apache Arrow data between different language runtimes and devices. This chapter will cover the struct definitions utilized for this interface along with describing use cases that make it beneficial.
Chapter 5, Acero: A Streaming Arrow Execution Engine, describes how to utilize the reference implementation of an Arrow computation engine named Acero. You’ll learn when and why you should use the compute engine to perform analytics rather than implementing something yourself and why we’re seeing Arrow showing up in many popular execution engines.
Chapter 6, Using the Arrow Datasets API, demonstrates querying, filtering, and otherwise interacting with multi-file datasets that can potentially be across multiple sources. Partitioned datasets are also covered, along with utilizing Acero to perform streaming filtering and other operations on the data.
Chapter 7, Exploring Apache Arrow Flight RPC, examines the Flight RPC protocol and its benefits. You will be walked through building a simple Flight server and client in multiple languages to produce and consume tabular data.
Chapter 8, Understanding Arrow Database Connectivity (ADBC), introduces and explains an Apache Arrow-based alternative to ODBC/JDBC and why it matters for the ecosystem. You will be walked through several examples with sample code that interact with multiple database systems such as DuckDB and PostgreSQL.
Chapter 9, Using Arrow with Machine Learning Workflows, integrates multiple concepts that have been covered to explain the various ways that Apache Arrow can be utilized to improve parts of data pipelines and the performance of machine learning model training. It will describe how Arrow’s interoperability and defined standards make it ideal for use with Spark, GPU compute, and many other tools.
Chapter 10, Powered by Apache Arrow, provides a few examples of current real-world usage of Apache Arrow, such as Dremio, Spice.AI, and InfluxDB.
Chapter 11, How to Leave Your Mark on Arrow, provides a brief introduction to contributing to open source projects in general, but specifically how to contribute to the Arrow project itself. You will be walked through finding starter issues, setting up your first pull request to contribute, and what to expect when doing so. To that end, this chapter also contains various instructions on locally building Arrow C++, Python, and Go libraries from source to test your contributions.
Chapter 12, Future Development and Plans, wraps up the book by examining the features that are still in development at the time of writing. This includes geospatial integrations with GeoArrow and GeoParquet along with expanding Arrow Database Connectivity (ADBC) adoption. Finally, there are some parting words and a challenge from me to you.
To get the most out of this book
It is assumed that you have a basic understanding of writing code in at least one of C++, Python, or Go to benefit from and use the code snippets. You should know how to compile and run code in the desired language. Some familiarity with basic concepts of data analysis will help you get the most out of the scenarios and use cases explained in this book. Beyond this, concepts such as tabular data and installing software on your machine are assumed to be understood rather than explained.
The sample data is in the book’s GitHub repository. You’ll need to use Git Large File Storage (LFS) or a browser to download the large data files. There are also a couple of larger sample data files in publicly accessible AWS S3 buckets. The book will provide a link to download the files when necessary. Code examples are provided in C++, Python, and Go.
If you are using the digital version of this book, we advise you to access the complete code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Take your time, enjoy, and experiment in all kinds of ways, and please, have fun with the exercises!
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-Second-Edition. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: We’re using PyArrow in this example, but if you have the ArrowDeviceArray struct definition, you could create and populate the struct without ever needing to directly include or link against the Arrow libraries!
A block of code is set as follows:
>>> import numba.cuda
>>> import pyarrow as pa
>>> from pyarrow import cuda
>>> import numpy as np
>>> from pyarrow.cffi import ffi
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
std::unique_ptr<arrow::ArrayBuilder> tmp;
// returns a status, handle the error case
arrow::MakeBuilder(arrow::default_memory_pool(), st_type, &tmp);
std::shared_ptr<arrow::StructBuilder> builder;
builder.reset(static_cast<arrow::StructBuilder*>(tmp.release()));
Any command-line input or output is written as follows:
$ mkdir arrow_chapter1 && cd arrow_chapter1
$ go mod init arrow_chapter1
$ go get -u github.com/apache/arrow/go/v17/arrow@latest
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: You'll notice that for the Filter and Project nodes in the figure, since they each use a compute expression, there is a sub-tree of the execution graph representing the expression tree.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Share Your Thoughts
Once you’ve read In-Memory Analytics with Apache Arrow, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below
https://packt.link/free-ebook/9781835461228
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly
Part 1: Overview of What Arrow is, Its Capabilities, Benefits, and Goals
This section is an introduction to Apache Arrow as a format specification and a project, the benefits it claims, and the goals it’s trying to achieve. You’ll also find a high-level overview of basic use cases and examples.
This part has the following chapters:
Chapter 1, Getting Started with Apache Arrow
Chapter 2, Working with Key Arrow Specifications
Chapter 3, Format and Memory Handling
1
Getting Started with Apache Arrow
Regardless of whether you’re a data scientist/engineer, a machine learning (ML) specialist, or a software engineer trying to build something to perform data analytics, you’ve probably heard of or read about something called Apache Arrow and either looked for more information or wondered what it was. Hopefully, this book can serve as a springboard in understanding what Apache Arrow is and isn’t, as well as a reference book to be continuously utilized so that you can supercharge your analytical capabilities.
For now, we’ll start by explaining what Apache Arrow is and what you will use it for. Following that, we will walk through the Arrow specifications, set up a development environment where you can play around with the various Apache Arrow libraries, and walk through a few simple exercises so that you can get a feel for how to use them.
In this chapter, we’re going to cover the following topics:
Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and the physical memory layout
Arrow format versioning and stability
Setting up your shooting range
Technical requirements
For the portion of this chapter that describes how to set up a development environment for working with various Arrow libraries, you’ll need the following:
Your preferred integrated development environment (IDE) – for example, VS Code, Sublime, Emacs, or Vim
Plugins for your desired language (optional but highly recommended)
An interpreter or toolchain for your desired language(s):
Python 3.8+: pip and venv and/or pipenv
Go 1.21+
C++ Compiler (capable of compiling C++17 or newer)
Understanding the Arrow format and specifications
The Apache Arrow documentation states the following [1]:
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
Well, that’s a lot of technical jargon! Let’s start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation (https://apache.org) that is released under the Apache License, version 2.0 [2]. It was co-created by Jacques Nadeau and Wes McKinney, the creator of pandas, and first released in 2016. Simply put, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets. It consists of a collection of libraries related to in-memory data processing, including specifications for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. When we’re talking about in-memory data processing, we’re talking exclusively about processing data in RAM and eliminating slow data access (as well as redundantly copying and converting data) wherever possible to improve performance. This is where Arrow excels and provides libraries to support this with utilities for streaming and transportation to speed up data access.
When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into main memory before you can operate on it. As a result, formats for data on disk tend to focus much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example is the Apache Parquet format, a columnar on-disk file format. Arrow, by contrast, focuses on the in-memory format, which targets processing efficiency through tactics such as cache locality and vectorization of computation.
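To make the distinction concrete, here is a minimal, illustrative PyArrow sketch (the filename is hypothetical, and this is only a preview of APIs covered properly later in the book). Reading a Parquet file decompresses and decodes the on-disk representation into Arrow’s in-memory columnar format, which can then be processed directly:
import pyarrow.parquet as pq
# Hypothetical file path, purely for illustration
table = pq.read_table("measurements.parquet")
# The data now lives in Arrow's in-memory columnar format
print(table.schema)
print(table.num_rows)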
The primary goal of Arrow is to become the lingua franca of data analytics and processing – the One Format to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use separate internal formats for managing data, which means that any time you’re moving data between these components for different uses, you’re paying a cost to serialize and deserialize that data every time. Not only that but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used instead, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with their own data formats, having to be copied and/or converted for the different components to work with each other:
Figure 1.1 – Copy and convert components
In many cases, the serialization and deserialization processes can end up taking nearly 90% of the processing time in such a system and prevent you from being able to spend that CPU on analytics. Alternatively, if every component is using Arrow’s in-memory format, you end up with a system similar to the one shown in Figure 1.2, where the data can be transferred between components at little-to-no cost. All the components can either share memory directly or send the data as-is without having to convert between different formats:
Figure 1.2 – Sharing Arrow memory between components
At this point, there’s no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly so that it refers to the same data rather than copying multiple times between them. An example of this idea will be covered in Chapter 8, Understanding Arrow Database Connectivity (ADBC), where we’ll consider a specification for leveraging common database drivers in a cross-platform way to enable efficient interactions using Arrow-formatted data.
Most data processing systems now use distributed processing by breaking the data into chunks and sending those chunks across the network to various workers. So, even if we can share memory across processes on a box, there’s still the cost of sending data across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can avoid having to deserialize that data before you can use it (skipping a copy), or you can reference the memory buffers you were operating on and send them across the network without having to serialize them first. Only a small amount of metadata needs to be sent along with the raw data buffers, and interfaces that perform zero copies can be created to achieve performance benefits by reducing memory usage and improving throughput. We’ll cover this more directly in Chapter 3, Format and Memory Handling, so look forward to it!
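As a rough preview of that idea, here is a minimal sketch using PyArrow’s IPC stream APIs: the bytes written to an Arrow stream can be read back as Arrow arrays without any row-by-row deserialization step.
import pyarrow as pa
table = pa.table({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
# Serialize the table as an Arrow IPC stream; the buffers on the wire
# have the same layout as the in-memory representation
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
# "Receiving" those bytes simply maps them back into Arrow arrays,
# with no row-by-row decoding step
received = pa.ipc.open_stream(sink.getvalue()).read_all()
assert received.equals(table)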
Let’s quickly recap the features of the Arrow format we just described before moving on:
Using the same high-performance internal format across components allows for much more code reuse in libraries instead of the need to reimplement common workflows.
The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation, regardless of the language. This is what’s being referred to whenever you see the term zero-copy (see the short sketch after this list).
The wire format is the same as the in-memory format to eliminate serialization and deserialization costs when sending data across networks between components of a system.
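For instance, here is a tiny, illustrative PyArrow sketch of the zero-copy idea: slicing an array produces a new array that references the same underlying data buffer rather than copying the values.
import pyarrow as pa
arr = pa.array([1, 2, 3, 4, 5], type=pa.int64())
# Slicing does not copy the values; the slice references the same
# underlying data buffer with an adjusted offset and length
sliced = arr.slice(1, 3)
# Both arrays point at the same data buffer in memory
print(arr.buffers()[1].address == sliced.buffers()[1].address)  # True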
Now, you might be thinking, Well, this sounds too good to be true!
And of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don’t need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don’t utilize the Arrow format themselves.
Here’s just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:
SQL execution engines (such as Dremio Sonar, InfluxDB, or Apache DataFusion)
Data analysis utilities and pipelines (such as pandas or Apache Spark)
Streaming and message queue systems (such as Apache Kafka or Storm)
Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)
As for how Arrow can help you, it depends on which piece of the data puzzle you work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it’s by no means a complete list though:
If you’re a data scientist:
You can utilize Arrow via Polars or pandas and NumPy integration to significantly improve the performance of your data manipulations.
If the tools you use integrate Arrow support, you can gain significant speed-ups for your queries and computations by using Arrow directly to reduce copies and/or serialization costs.
If you’re a data engineer specializing in extract, transform, and load (ETL):
The higher adoption of Arrow as an internal and externally-facing format can make it easier to integrate with many different utilities.
By using Arrow, data can be shared between processes and tools, with shared memory increasing the tools available to you for building pipelines, regardless of the language you’re operating in. You could take data from Python, use it in Spark, and then pass it directly to the Java virtual machine (JVM) without paying the cost of copying between them.
If you’re a software engineer or ML specialist building computation tools and utilities for data analysis:
Arrow, as an internal format, can be used to improve your memory usage and performance by reducing serialization and deserialization between components.
Understanding how to best utilize the data transfer protocols can improve your ability to parallelize queries and access your data, wherever it might be.
Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.
Now that you know what Arrow is, let’s dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you’ll see why a column-oriented memory representation was chosen for Arrow’s internal format. In later chapters, we’ll cover specific integration points, explicit examples, and transfer protocols.
Why does Arrow use a columnar in-memory format?
There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow’s data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you’re not familiar with columnar as a term, let’s take a look at what it means. First, imagine the following table of data:
Figure 1.3 – Sample data table
Traditionally, if you were to read this table into memory, you’d likely have some structure to represent a row and then read the data in one row at a