
Parquet in Practice & Detail

What is Parquet? How is it so efficient? Why should I actually use it?
About me

• Data Scientist at Blue Yonder (@BlueYonderTech)

• Committer to Apache {Arrow, Parquet}

• Work in Python, Cython, C++11 and SQL

xhochy
[email protected]
Agenda
Origin and Use Case
Parquet under the bonnet
Python & C++
The Community and its neighbours
About Parquet

1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. 2015: top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
Why use Parquet?

1. Columnar format
—> vectorized operations
2. Efficient encodings and compressions
—> small size without heavy CPU cost
3. Query push-down
—> bring computation to the I/O layer
4. Language independent format
—> libs in Java / Scala / C++ / Python /…
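As a quick taste of these points from Python, here is a minimal sketch using the pyarrow API; the file name trips.parquet and the column names are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small in-memory table to a columnar, encoded Parquet file.
table = pa.table({
    "passenger_count": [1, 2, 1],
    "trip_distance": [1.2, 3.4, 0.8],
})
pq.write_table(table, "trips.parquet")

# Read back only one column; the other columns are never touched on disk.
distances = pq.read_table("trips.parquet", columns=["trip_distance"])
print(distances.to_pydict())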
Who uses Parquet?

• Query Engines
  • Hive
  • Impala
  • Drill
  • Presto
  • …
• Frameworks
  • Spark
  • MapReduce
  • Pandas
  • …
Nested data
• More than a flat table!
• Structure borrowed from Dremel paper
• https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Schema (the Document example from the Dremel paper):
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url

Columns:
  docid
  links.backward
  links.forward
  name.language.code
  name.language.country
  name.url
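As a hedged illustration (not part of the original talk), the same Document structure can be written down with pyarrow's type system; field names follow the diagram above.

import pyarrow as pa

document = pa.schema([
    pa.field("DocId", pa.int64()),
    pa.field("Links", pa.struct([
        pa.field("Backward", pa.list_(pa.int64())),
        pa.field("Forward", pa.list_(pa.int64())),
    ])),
    pa.field("Name", pa.list_(pa.struct([
        pa.field("Language", pa.list_(pa.struct([
            pa.field("Code", pa.string()),
            pa.field("Country", pa.string()),
        ]))),
        pa.field("Url", pa.string()),
    ]))),
])
print(document)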
Why columnar?

(Diagram: the same 2D table stored in row layout vs. columnar layout)
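A tiny, purely illustrative Python sketch of the difference; the example records are made up.

rows = [(1, "cash", 9.5), (2, "card", 12.0), (3, "cash", 7.25)]

# Row layout: all values of one record are adjacent on disk.
row_layout = [value for record in rows for value in record]
# -> [1, 'cash', 9.5, 2, 'card', 12.0, 3, 'cash', 7.25]

# Columnar layout: all values of one column are adjacent on disk,
# so a query touching one column reads a contiguous block.
columnar_layout = [list(column) for column in zip(*rows)]
# -> [[1, 2, 3], ['cash', 'card', 'cash'], [9.5, 12.0, 7.25]]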
File Structure

File
  RowGroup
    Column Chunks
      Page
Statistics
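These pieces can be inspected from Python; a hedged sketch with pyarrow, reusing the hypothetical trips.parquet from earlier.

import pyarrow.parquet as pq

pf = pq.ParquetFile("trips.parquet")
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)

rg = meta.row_group(0)          # one RowGroup
chunk = rg.column(0)            # one Column Chunk inside it
print(chunk.path_in_schema)     # which column this chunk stores
print(chunk.statistics)         # min/max/null count, used for query pushdown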
Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Encodings — PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• —> 1499 MiB
Encodings — RLE & Bit Packing
• bit-packing: only use the necessary bits
• RunLengthEncoding: store "378 times the value 12" as a single run
• hybrid: dynamically choose the best of the two
• Used for Definition & Repetition levels
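A conceptual Python sketch of the two ideas (not Parquet's actual hybrid RLE/bit-packing codec, just the principle):

from itertools import groupby

values = [12] * 378 + [7, 7, 3]

# Run-length encoding: store (value, run length) pairs instead of raw values.
runs = [(value, sum(1 for _ in group)) for value, group in groupby(values)]
# -> [(12, 378), (7, 2), (3, 1)]

# Bit packing: every value here fits into 4 bits, so there is no need
# to spend 32 or 64 bits per value.
bits_per_value = max(values).bit_length()   # 4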
Encodings — Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• every value is assigned a code
• Dictionary: store a map of code —> value
• Data: store only codes, use RLE on that
• —> 329 MiB (22%)
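The same idea can be seen with pyarrow's in-memory dictionary encoding; Parquet applies it on disk when writing with use_dictionary=True. The column values below are made up.

import pyarrow as pa

payment_type = pa.array(["cash", "card", "cash", "cash", "card"])
encoded = payment_type.dictionary_encode()
print(encoded.dictionary)   # ["cash", "card"]: the code-to-value map
print(encoded.indices)      # [0, 1, 0, 0, 1]: only small codes are stored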
Compression

1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than compression alone, at less CPU cost
4. LZO, Snappy, GZIP, Brotli
—> If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
https://github.com/apache/parquet-mr/pull/384
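A hedged pyarrow sketch of picking the codec at write time; codec availability depends on how pyarrow was built, and the table and file names are made up.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"passenger_count": [1, 2, 1], "trip_distance": [1.2, 3.4, 0.8]})

# Snappy: fast to (de)compress, moderate ratio, the usual default choice.
pq.write_table(table, "trips_snappy.parquet", compression="snappy")

# GZIP: smaller files, noticeably more CPU.
pq.write_table(table, "trips_gzip.parquet", compression="gzip")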
Query pushdown

1. Only load used data
  1. skip columns that are not needed
  2. skip (chunks of) rows that are not relevant
2. saves I/O load as the data is not transferred
3. saves CPU as the data is not decoded
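From Python, both forms of pushdown map to arguments of pyarrow's reader; a hedged sketch (newer pyarrow versions; file and column names are hypothetical):

import pyarrow.parquet as pq

table = pq.read_table(
    "trips.parquet",
    columns=["trip_distance", "passenger_count"],   # skip unused columns
    filters=[("passenger_count", ">", 1)],           # skip non-matching row groups/rows
)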
Competitors (Python)
• HDF5
• binary (with schema)
• fast, just not with strings
• not a first-class citizen in the Hadoop ecosystem
• msgpack
• fast but unstable
• CSV
• The universal standard.
• row-based
• schema-less
C++

1. General purpose read & write of Parquet


• data structure independent
• pluggable interfaces (allocator, I/O, …)
2. Routines to read into specific data structures
• Apache Arrow
• …
Use Parquet in Python

https://pyarrow.readthedocs.io/en/latest/install.html#building-from-source
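Once installed, a minimal round trip looks like this (a sketch with the current pyarrow and pandas APIs; the file name is made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"passenger_count": [1, 2, 1], "trip_distance": [1.2, 3.4, 0.8]})

# DataFrame -> Arrow Table -> Parquet file
pq.write_table(pa.Table.from_pandas(df), "trips.parquet")

# Parquet file -> Arrow Table -> DataFrame
df_roundtrip = pq.read_table("trips.parquet").to_pandas()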
Get involved!

1. Mailing list: [email protected]
2. Website: https://parquet.apache.org/
3. Or directly start contributing by grabbing an issue on
https://issues.apache.org/jira/browse/PARQUET
4. Slack: https://parquet-slack-invite.herokuapp.com/
