
Parquet in Practice & Detail

What is Parquet? How is it so efficient? Why should I actually use it?
About me

• Data Scientist at Blue Yonder (@BlueYonderTech)

• Committer to Apache {Arrow, Parquet}

• Work in Python, Cython, C++11 and SQL

xhochy
[email protected]
Agenda
Origin and Use Case
Parquet under the bonnet
Python & C++
The Community and its neighbours
About Parquet

1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. 2015: top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
Why use Parquet?

1. Columnar format
—> vectorized operations
2. Efficient encodings and compressions
—> small size without heavy CPU cost
3. Query push-down
—> bring computation to the I/O layer
4. Language independent format
—> libs in Java / Scala / C++ / Python /…
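As a quick taste of these points from Python, here is a minimal sketch using the pyarrow API; the file name trips.parquet and the column names are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small in-memory table to a columnar, encoded Parquet file.
table = pa.table({
    "passenger_count": [1, 2, 1],
    "trip_distance": [1.2, 3.4, 0.8],
})
pq.write_table(table, "trips.parquet")

# Read back only one column; the other columns are never touched on disk.
distances = pq.read_table("trips.parquet", columns=["trip_distance"])
print(distances.to_pydict())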
Who uses Parquet?

• Query Engines
  • Hive
  • Impala
  • Drill
  • Presto
  • …
• Frameworks
  • Spark
  • MapReduce
  • Pandas
  • …
Nested data
• More than a flat table!
• Structure borrowed from Dremel paper
• https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Schema (the Document example from the Dremel paper):
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url

Columns:
  docid
  links.backward
  links.forward
  name.language.code
  name.language.country
  name.url
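As a hedged illustration (not part of the original talk), the same Document structure can be written down with pyarrow's type system; field names follow the diagram above.

import pyarrow as pa

document = pa.schema([
    pa.field("DocId", pa.int64()),
    pa.field("Links", pa.struct([
        pa.field("Backward", pa.list_(pa.int64())),
        pa.field("Forward", pa.list_(pa.int64())),
    ])),
    pa.field("Name", pa.list_(pa.struct([
        pa.field("Language", pa.list_(pa.struct([
            pa.field("Code", pa.string()),
            pa.field("Country", pa.string()),
        ]))),
        pa.field("Url", pa.string()),
    ]))),
])
print(document)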
Why columnar?

(Diagram: the same 2D table stored in row layout vs. columnar layout)
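A tiny, purely illustrative Python sketch of the difference; the example records are made up.

rows = [(1, "cash", 9.5), (2, "card", 12.0), (3, "cash", 7.25)]

# Row layout: all values of one record are adjacent on disk.
row_layout = [value for record in rows for value in record]
# -> [1, 'cash', 9.5, 2, 'card', 12.0, 3, 'cash', 7.25]

# Columnar layout: all values of one column are adjacent on disk,
# so a query touching one column reads a contiguous block.
columnar_layout = [list(column) for column in zip(*rows)]
# -> [[1, 2, 3], ['cash', 'card', 'cash'], [9.5, 12.0, 7.25]]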
File Structure

File
  RowGroup
    Column Chunks
      Page
Statistics
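These pieces can be inspected from Python; a hedged sketch with pyarrow, reusing the hypothetical trips.parquet from earlier.

import pyarrow.parquet as pq

pf = pq.ParquetFile("trips.parquet")
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)

rg = meta.row_group(0)          # one RowGroup
chunk = rg.column(0)            # one Column Chunk inside it
print(chunk.path_in_schema)     # which column this chunk stores
print(chunk.statistics)         # min/max/null count, used for query pushdown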
Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Encodings — PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• —> 1499 MiB
Encodings — RLE & Bit Packing
• bit-packing: only use the necessary bits
• RunLengthEncoding: store "378 times the value 12" as a single run
• hybrid: dynamically choose the best of the two
• Used for Definition & Repetition levels
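A conceptual Python sketch of the two ideas (not Parquet's actual hybrid RLE/bit-packing codec, just the principle):

from itertools import groupby

values = [12] * 378 + [7, 7, 3]

# Run-length encoding: store (value, run length) pairs instead of raw values.
runs = [(value, sum(1 for _ in group)) for value, group in groupby(values)]
# -> [(12, 378), (7, 2), (3, 1)]

# Bit packing: every value here fits into 4 bits, so there is no need
# to spend 32 or 64 bits per value.
bits_per_value = max(values).bit_length()   # 4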
Encodings — Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• every value is assigned a code
• Dictionary: store a map of code —> value
• Data: store only codes, use RLE on that
• —> 329 MiB (22%)
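The same idea can be seen with pyarrow's in-memory dictionary encoding; Parquet applies it on disk when writing with use_dictionary=True. The column values below are made up.

import pyarrow as pa

payment_type = pa.array(["cash", "card", "cash", "cash", "card"])
encoded = payment_type.dictionary_encode()
print(encoded.dictionary)   # ["cash", "card"]: the code-to-value map
print(encoded.indices)      # [0, 1, 0, 0, 1]: only small codes are stored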
Compression

1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than compression alone, at less CPU cost
4. LZO, Snappy, GZIP, Brotli
—> If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
https://github.com/apache/parquet-mr/pull/384
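A hedged pyarrow sketch of picking the codec at write time; codec availability depends on how pyarrow was built, and the table and file names are made up.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"passenger_count": [1, 2, 1], "trip_distance": [1.2, 3.4, 0.8]})

# Snappy: fast to (de)compress, moderate ratio, the usual default choice.
pq.write_table(table, "trips_snappy.parquet", compression="snappy")

# GZIP: smaller files, noticeably more CPU.
pq.write_table(table, "trips_gzip.parquet", compression="gzip")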
Query pushdown

1. Only load used data
  1. skip columns that are not needed
  2. skip (chunks of) rows that are not relevant
2. saves I/O load as the data is not transferred
3. saves CPU as the data is not decoded
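From Python, both forms of pushdown map to arguments of pyarrow's reader; a hedged sketch (newer pyarrow versions; file and column names are hypothetical):

import pyarrow.parquet as pq

table = pq.read_table(
    "trips.parquet",
    columns=["trip_distance", "passenger_count"],   # skip unused columns
    filters=[("passenger_count", ">", 1)],           # skip non-matching row groups/rows
)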
Competitors (Python)
• HDF5
• binary (with schema)
• fast, just not with strings
• not a first-class citizen in the Hadoop ecosystem
• msgpack
• fast but unstable
• CSV
• The universal standard.
• row-based
• schema-less
C++

1. General purpose read & write of Parquet


• data structure independent
• pluggable interfaces (allocator, I/O, …)
2. Routines to read into specific data structures
• Apache Arrow
• …
Use Parquet in Python

https://pyarrow.readthedocs.io/en/latest/install.html#building-from-source
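Once installed, a minimal round trip looks like this (a sketch with the current pyarrow and pandas APIs; the file name is made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"passenger_count": [1, 2, 1], "trip_distance": [1.2, 3.4, 0.8]})

# DataFrame -> Arrow Table -> Parquet file
pq.write_table(pa.Table.from_pandas(df), "trips.parquet")

# Parquet file -> Arrow Table -> DataFrame
df_roundtrip = pq.read_table("trips.parquet").to_pandas()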
Get involved!

1. Mailing list: [email protected]
2. Website: https://parquet.apache.org/
3. Or directly start contributing by grabbing an issue on
https://issues.apache.org/jira/browse/PARQUET
4. Slack: https://parquet-slack-invite.herokuapp.com/
