Apache Arrow as a full stack data
engineering solution
Alessandro Molina
@__amol__
https://alessandro.molina.fyi/
Who am I, Alessandro
● Maintainer of TurboGears2,
Apache Arrow Contributor,
Author of DukPy and DEPOT
● Director of Engineering at Voltron Data Labs
● Author of
“Modern Python Standard Library Cookbook” and
“Crafting Test-Driven Software with Python”.
What’s Apache Arrow?
● a data interchange standard
● an in-memory format
● a networking format
● a storage format
● an i/o library
● a computation engine
● a tabular data library
● a query engine
● a partitioned dataset
manager
So much there!
The Apache Arrow project is a huge effort, aimed at
solving the fundamental problems of the data
analytics world.
Because it aims at providing a “write everywhere, run
everywhere” experience, it’s easy to get lost if you
don’t know where to start.
PyArrow is the entry point to the Apache Arrow
ecosystem for Python developers, and it can easily
give you access to many of the benefits of Arrow itself.
Introducing PyArrow
● Apache Arrow was born as a Columnar Data Format
● So the fundamental type in PyArrow is a “column of data”,
which is exposed by the pyarrow.Array object and its
subclasses.
● At this level, PyArrow is similar to single-dimension NumPy
arrays.
PyArrow Arrays
import pyarrow as pa
# Arrays can be made of numbers
>>> pa.array([1, 2, 3, 4, 5])
<pyarrow.lib.Int64Array object at 0xffff77d75d20>
# Or strings
>>> pa.array(["A", "B", "C", "D", "E"])
<pyarrow.lib.StringArray object at 0xffff77d75b40>
# And even complex objects
>>> pa.array([{"a": 5}, {"a": 7}])
<pyarrow.lib.StructArray object at 0xffff77d75d20>
# Arrays can also be masked
>>> pa.array([1, 2, 3, 4, 5],
... mask=pa.array([True, False, True, False, True]))
<pyarrow.lib.Int64Array object at 0xffff77d75d80>
Compared to classic NumPy arrays, PyArrow
arrays are a bit more complex.
● They pair the buffer holding the data with a
second buffer holding the validity map, so
that null values can be represented natively
instead of as just None.
● Even arrays of strings retain the guarantee
of a single contiguous buffer for the values.
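A minimal sketch of the data/validity buffer pair (the exact
addresses and sizes in the output will vary):

import pyarrow as pa

arr = pa.array([1, None, 3])
print(arr.null_count)  # 1: the validity bitmap marks the missing slot
print(arr.buffers())   # [validity bitmap buffer, values buffer]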
Introducing PyArrow Tables
● As Arrays are “columns”, a group of them can form a pyarrow.Table
● Tables are actually constituted by pyarrow.ChunkedArray columns, so
that appending rows to them is a cheap operation.
● At this level, PyArrow is similar to pandas DataFrames
PyArrow Tables
>>> table = pa.table([
... pa.array([1, 2, 3, 4, 5]),
... pa.array(["a", "b", "c", "d", "e"]),
... pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
... ], names=["col1", "col2", "col3"])
>>> table.take([0, 1, 4])
col1: [[1,2,5]]
col2: [["a","b","e"]]
col3: [[1,2,5]]
>>> table.schema
col1: int64
col2: string
col3: double
Compared to pandas, PyArrow tables are fully
implemented in C++ and never modify data in
place.
Tables are based on ChunkedArrays, so
appending data to them is a zero-copy
operation: a new table is created that
references the data from the existing table as
the first chunks of its arrays and the added
data as the new chunks.
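A minimal sketch of this zero-copy append through
pa.concat_tables (the table contents are illustrative):

import pyarrow as pa

table = pa.table({"col1": [1, 2, 3]})
more = pa.table({"col1": [4, 5]})

# No values are copied: the result just references the
# chunks of both input tables.
combined = pa.concat_tables([table, more])
print(combined.column("col1").num_chunks)  # 2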
The Acero compute engine in Arrow is able to
provide many common analytics and
transformation capabilities, like joining, filtering
and aggregating data in tables.
Running Analytics
The Acero compute engine
powers the analytics and
transformation capabilities
available on tables.
Many pyarrow.compute
functions provide kernels
that work on tables, and
Table exposes join, filter
and group_by methods
import pyarrow as pa
import pyarrow.compute as pc
>>> table = pa.table([
... pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
... pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
... ], names=["keys", "values"])
>>> table.filter(pc.field("values") == 4)
keys: [["b","e"]]
values: [[4,4]]
>>> table.group_by("keys").aggregate([("values", "sum")])
values_sum: [[31,7,15,1,4]]
keys: [["a","b","c","d","e"]]
>>> table1 = pa.table({'id': [1, 2, 3],
... 'year': [2020, 2022, 2019]})
>>>
>>> table2 = pa.table({'id': [3, 4],
... 'n_legs': [5, 100],
... 'animal': ["Brittle stars", "Centipede"]})
>>>
>>> table1.join(table2, keys="id")
id: [[3,1,2]]
year: [[2019,2020,2022]]
n_legs: [[5,null,null]]
animal: [["Brittle stars",null,null]]
PyArrow, NumPy and Pandas
One of the original design goals of Apache Arrow was
to allow easy exchange of data without the cost of
converting it across multiple formats or marshaling it
before transfer.
In the spirit of those capabilities, PyArrow provides
copy-free support for converting data to and from
pandas and NumPy.
If you have data in PyArrow you can invoke to_numpy
on pyarrow.Array, and to_pandas on pyarrow.Array and
pyarrow.Table, to get them as NumPy or pandas
objects without facing any additional conversion cost.
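A minimal sketch of both conversions; zero-copy applies when
the types map directly to a NumPy dtype and, for to_numpy,
the array has no nulls:

import numpy as np
import pyarrow as pa

arr = pa.array([1, 2, 3])

# Fails loudly if a copy would be required.
np_view = arr.to_numpy(zero_copy_only=True)

table = pa.table({"col1": arr})
df = table.to_pandas()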
And it’s fast!
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> data = [a % 5 for a in range(100000000)]
>>> npdata = np.array(data)
>>> padata = pa.array(data)
>>> import timeit
>>> timeit.timeit(
... lambda: np.unique(npdata, return_counts=True),
... number=1
... )
1.5212857750011608
>>> timeit.timeit(
... lambda: pc.value_counts(padata),
... number=1
... )
0.3754262370057404
Very fast!
In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pandas itself can use PyArrow to speed up CSV loading
df = pd.read_csv("large.csv", engine="pyarrow")
Full-Stack Solution
● DISK: Arrow Storage Format. Data can be stored in the Arrow disk format itself.
● MEMORY: Arrow In-Memory Format. When loaded, it will still be in the Arrow format.
● COMPUTE: Acero. Computation can be performed natively on the Arrow format.
● NETWORK: Arrow Flight. The Arrow format can be used to ship data across the network through Arrow Flight.
Arrow from disk to memory
● Saving data in the Arrow format allows PyArrow
to leverage the exact same format for disk and
in-memory data.
● This means that no marshaling cost is paid
when loading the data back.
● It also allows leveraging memory mapping to
avoid processing data until it’s actually
accessed.
● This means reducing the latency to access data
from seconds to milliseconds.
● Memory mapping also allows managing data
bigger than memory.
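A minimal sketch of the round trip, writing a table in the
Arrow IPC file format and memory-mapping it back (the file
name is illustrative):

import pyarrow as pa

table = pa.table({"col1": [1, 2, 3]})

# Write the table in the Arrow IPC file format.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map it back: no parsing and no copies; pages are
# loaded from disk only when the data is actually accessed.
with pa.memory_map("data.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()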
Arrow format does not solve it all
● The Arrow format can make working with your data very fast
● But it is expensive in terms of disk space, as it’s optimized for fast computation and SIMD
instructions, not for storage size.
● It natively supports compression algorithms, but those come at a cost that nullifies most of the
benefits of using the Arrow format itself.
● The Arrow format is a great hot format, but there are better solutions for cold storage.
total 1.3G
-rw-r--r-- 1 root root 1.2G Nov 2 16:10 data.arrow
-rw-r--r-- 1 root root 155M Nov 2 16:10 data.pqt
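To give a sense of where the gap above comes from, a minimal
sketch producing both files; actual sizes will depend heavily
on the data:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": list(range(1_000_000))})

# Hot format: Arrow IPC file, laid out for compute, big on disk.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Cold format: Parquet, encoded and compressed for storage size.
pq.write_table(table, "data.pqt")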
Yes, you can read 17 Million Rows in 9ms*
* for some definitions of read
From memory to network: Arrow Flight
● Arrow Flight is a protocol and implementation, provided in Arrow itself, optimized for
transferring columnar data in the Apache Arrow format.
● pyarrow.flight.FlightServerBase provides the server implementation, and
pyarrow.flight.connect lets you create clients that connect to Flight servers.
● Flight hooks directly into gRPC,
so no marshaling or
unmarshaling happens when
sending data through the network.
● https://arrow.apache.org/cookbook/py/flight.html
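A minimal Flight server sketch, modeled on the cookbook linked
above; the port, class name and table contents are illustrative:

import pyarrow as pa
import pyarrow.flight as flight

class TinyFlightServer(flight.FlightServerBase):
    # Serves a fixed table to any do_get request.
    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self._table = pa.table({"col1": [1, 2, 3]})

    def do_get(self, context, ticket):
        # Record batches travel over gRPC already in the Arrow
        # wire format, so neither side pays a marshaling cost.
        return flight.RecordBatchStream(self._table)

server = TinyFlightServer()
# server.serve() blocks; from another process a client can do:
# client = flight.connect("grpc://localhost:8815")
# table = client.do_get(flight.Ticket(b"")).read_all()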
Arrow Flight speed
Based on the same foundations we saw for dealing with data on disk, using Arrow Flight for
data on the network can provide major performance gains compared to other existing solutions
for transferring data.
Full-Stack Solution, evolved
● DISK: Arrow Storage Format. Data can be stored in the Arrow disk format itself.
● MEMORY: Arrow In-Memory Format. When loaded, it will still be in the Arrow format.
● COMPUTE: Acero. Computation can be performed natively on the Arrow format.
● NETWORK: Arrow Flight. The Arrow format can be used to ship data across the network through Arrow Flight.
● COLD STORAGE: Parquet. PyArrow natively supports optimized Parquet loading.
● FLIGHT SQL / ADBC: ADBC & FlightSQL. Native support for fetching data from databases in Arrow format.
● NANOARROW: NanoArrow. Sharing Arrow data between languages and libraries in the same process.
Arrow & Database: FlightSQL
● Flight SQL aims to provide broadly similar functionality to existing APIs like JDBC and ODBC
● Using Flight, it provides an efficient implementation of a wire format that supports features
like encryption and authentication out of the box, while allowing for further optimizations like
parallel data access
● You get the performance
of Flight, with the
convenience of a SQL
database.
● FlightSQL is mostly a
transport for higher-level
APIs; you are not meant
to use it directly.
Arrow & Database: ADBC
● Standard database interface built around
Arrow data, especially for efficiently fetching
large datasets (i.e. with minimal or no
serialization and copying)
● ADBC can leverage FlightSQL or directly
connect to the database (currently supports
Postgres, DuckDB, SQLite, …)
● Optimized for transferring column major data
instead of row major data like most database
drivers.
● Support both SQL dialects and the emergent
Substrait standard.
Arrow & Database: ADBC
import pyarrow
import pandas
from pandas.testing import assert_frame_equal
import adbc_driver_sqlite.dbapi

# Assumes the adbc-driver-sqlite package; connect to an
# in-memory SQLite database through the ADBC DBAPI layer.
sqlite = adbc_driver_sqlite.dbapi.connect(":memory:")

# Fetch the result as an Arrow Table
with sqlite.cursor() as cur:
    cur.execute('SELECT 1, "foo", 2.0')
    assert cur.fetch_arrow_table() == pyarrow.table(
        {
            "1": [1],
            '"foo"': ["foo"],
            "2.0": [2.0],
        }
    )

# Fetch the result as a pandas DataFrame
with sqlite.cursor() as cur:
    cur.execute('SELECT 1, "foo", 2.0')
    assert_frame_equal(
        cur.fetch_df(),
        pandas.DataFrame(
            {
                "1": [1],
                '"foo"': ["foo"],
                "2.0": [2.0],
            }
        ),
    )
Questions?
● PyArrow Documentation
https://arrow.apache.org/docs/python/getstarted.html
● PyArrow Cookbook
https://arrow.apache.org/cookbook/py/index.html