What Is Apache Parquet
xhochy
Agenda
Origin and Use Case
Parquet under the bonnet
Python & C++
The Community and its neighbours
About Parquet
1. Columnar format
—> vectorized operations
2. Efficient encodings and compressions
—> small size without the need for a fat CPU
3. Query push-down
—> bring computation to the I/O layer
4. Language independent format
—> libs in Java / Scala / C++ / Python /…
Who uses Parquet?
Nested schema:
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
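A minimal sketch of how a nested schema maps to flat column names such as those listed above. The `SCHEMA` dictionary and the `leaf_columns` helper are hypothetical names made up for this illustration; they are not part of any Parquet library, and real Parquet additionally stores definition and repetition levels per column.

```python
# Hypothetical nested schema mirroring the Document example.
# A value of None marks a leaf field; a dict marks a nested group.
SCHEMA = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {
        "Language": {"Code": None, "Country": None},
        "Url": None,
    },
}


def leaf_columns(schema, prefix=""):
    """Return the dotted, lower-cased path of every leaf field."""
    paths = []
    for name, children in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if children is None:
            paths.append(path.lower())
        else:
            paths.extend(leaf_columns(children, path))
    return paths


columns = leaf_columns(SCHEMA)
```

Each leaf of the tree becomes one column; the groups themselves (Links, Name, Language) are not stored as columns.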
Why columnar?
2D Table
row layout
columnar layout
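The difference between the two layouts can be sketched in a few lines of Python. The table contents are invented for illustration, and plain lists stand in for bytes on disk.

```python
# Hypothetical 2D table: three rows with the columns (id, city, fare).
rows = [
    (1, "NYC", 12.5),
    (2, "NYC", 7.0),
    (3, "BOS", 9.25),
]

# Row layout: all values of one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Columnar layout: all values of one column sit next to each other.
columns = list(zip(*rows))
columnar_layout = [value for column in columns for value in column]

# An aggregate over a single column only touches one contiguous run.
total_fare = sum(columns[2])
```

In the columnar layout, values of the same type and similar content are adjacent, which is what makes vectorized operations and per-column encodings effective.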
File Structure
File
RowGroup
Column Chunks
Page
Statistics
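The hierarchy above can be modelled roughly as follows. This is a simplified sketch, not the actual Parquet metadata layout: all class and function names are hypothetical, and the statistics are recomputed on demand here, whereas a real file stores them in its metadata so that a reader can skip data without touching the pages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    values: List[int]


@dataclass
class ColumnChunk:
    name: str
    pages: List[Page]

    def statistics(self):
        """Return (min, max) over all pages of this chunk."""
        values = [v for page in self.pages for v in page.values]
        return min(values), max(values)


@dataclass
class RowGroup:
    columns: List[ColumnChunk]


@dataclass
class File:
    row_groups: List[RowGroup]


def row_groups_matching(file, column, lower_bound):
    """Indices of row groups that may contain values >= lower_bound."""
    hits = []
    for index, group in enumerate(file.row_groups):
        for chunk in group.columns:
            if chunk.name == column:
                _, maximum = chunk.statistics()
                if maximum >= lower_bound:
                    hits.append(index)
    return hits


example = File([
    RowGroup([ColumnChunk("fare", [Page([3, 5]), Page([8])])]),
    RowGroup([ColumnChunk("fare", [Page([20, 25]), Page([31])])]),
])
candidates = row_groups_matching(example, "fare", 10)
```

Only the second row group can satisfy `fare >= 10`, so a reader using the statistics would skip the first one entirely. This is the idea behind query push-down.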
Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Encodings — PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• —> 1499 MiB
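As an illustration of PLAIN, the sketch below writes 32-bit integers in their fixed-width little-endian binary form using only the standard library. The sample values are invented, and page headers and other file framing are left out.

```python
import struct

values = [12, 12, 7, 1024]

# "<i" is a little-endian signed 32-bit integer: 4 bytes per value,
# regardless of how small the value is.
plain = b"".join(struct.pack("<i", v) for v in values)

# Reading back is just as simple: walk the buffer in 4-byte steps.
decoded = [v for (v,) in struct.iter_unpack("<i", plain)]
```

Every value costs the full 4 bytes, which is why the size on disk stays close to the raw data size.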
Encodings — RLE & Bit Packing
• bit-packing: only use the necessary bits per value
• run-length encoding (RLE): store "378 times 12" instead of 378 copies of 12
• hybrid: dynamically choose the better of the two for each run
• Used for Definition & Repetition levels
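Both building blocks can be sketched in plain Python. This illustrates the ideas only and does not reproduce Parquet's actual hybrid wire format; all function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def bit_width(values):
    """Bits needed to represent the largest non-negative value."""
    return max(values).bit_length()


def bit_pack(values, width):
    """Pack non-negative ints using `width` bits each, low bits first."""
    packed = 0
    for position, value in enumerate(values):
        packed |= value << (position * width)
    n_bytes = (len(values) * width + 7) // 8
    return packed.to_bytes(n_bytes, "little")


def bit_unpack(data, width, count):
    """Inverse of bit_pack."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


# Long runs: 380 values shrink to two (count, value) pairs.
runs = run_length_encode([12] * 378 + [7, 7])

# Small, varying values such as levels: 2 bits each instead of 32.
levels = [0, 1, 2, 3, 3, 2, 1, 0]
width = bit_width(levels)
packed = bit_pack(levels, width)
```

Run-length encoding wins on long repeats, bit-packing on short ranges of small numbers, which is why a hybrid that picks between them works well for definition and repetition levels.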
Encodings — Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• every value is assigned a code
• Dictionary: store a map of code —> value
• Data: store only codes, use RLE on that
• —> 329 MiB (22%)
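A minimal sketch of dictionary encoding, assuming codes are assigned in order of first appearance. The function names and sample column are made up for illustration, and the follow-up RLE step on the codes is omitted.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): distinct values and one code per row."""
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look every code up in the dictionary again."""
    return [dictionary[code] for code in codes]


column = ["CASH", "CARD", "CARD", "CARD", "CASH", "CARD"]
dictionary, codes = dictionary_encode(column)
```

Columns with few distinct values benefit most: the dictionary stays tiny and the codes are small integers that compress well with RLE and bit-packing.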
Compression
https://pyarrow.readthedocs.io/en/latest/install.html#building-from-source
Get involved!
1. Mailing list: dev@parquet.apache.org
2. Website: https://parquet.apache.org/
3. Or directly start contributing by grabbing an issue on https://issues.apache.org/jira/browse/PARQUET
4. Slack: https://parquet-slack-invite.herokuapp.com/