SlideShare a Scribd company logo
File Format Benchmark -
Avro, JSON, ORC, & Parquet
Owen O’Malley
owen@hortonworks.com
@owen_omalley
August 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Am I?
Worked on Hadoop since Jan 2006
MapReduce, Security, Hive, and ORC
Worked on different file formats
–Sequence File, RCFile, ORC File, T-File, and Avro
requirements
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Goal
Seeking to discover unknowns
–How do the different formats perform?
–What could they do better?
–Best part of open source is looking inside!
Use real & diverse data sets
–Over-reliance on similar datasets leads to weakness
Open & reviewed benchmarks
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The File Formats
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Avro
Cross-language file format for Hadoop
Schema evolution was primary goal
Schema segregated from data
–Unlike Protobuf and Thrift
Row major format
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
JSON
Serialization format for HTTP & Javascript
Text-format with MANY parsers
Schema completely integrated with data
Row major format
Compression applied on top
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
ORC
Originally part of Hive to replace RCFile
–Now top-level project
Schema segregated into footer
Column major format with stripes
Rich type model, stored top-down
Integrated compression, indexes, & stats
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parquet
Design based on Google’s Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type-model with logical types
All data pushed to leaves of the tree
Integrated compression and indexes
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Sets
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NYC Taxi Data
Every taxi cab ride in NYC from 2009
–Publically available
–http://tinyurl.com/nyc-taxi-analysis
18 columns with no null values
–Doubles, integers, decimals, & strings
2 months of data – 22.7 million rows
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Logs
All actions on Github public repositories
–Publically available
–https://www.githubarchive.org/
704 columns with a lot of structure & nulls
–Pretty much the kitchen sink
 1/2 month of data – 10.5 million rows
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Finding the Github Schema
The data is all in JSON.
No schema for the data is published.
We wrote a JSON schema discoverer.
–Scans the document and figures out the type
Available at https://github.com/hortonworks/hive-json
Schema is huge (12k)
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales
Generated data
–Real schema from a production Hive deployment
–Random data based on the data statistics
55 columns with lots of nulls
–A little structure
–Timestamps, strings, longs, booleans, list, & struct
25 million rows
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Storage costs
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compression
Data size matters!
–Hadoop stores all your data, but requires hardware
–Is one factor in read speed
ORC and Parquet use RLE & Dictionaries
All the formats have general compression
–ZLIB (GZip) – tight compression, slower
–Snappy – some compression, faster
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Size Analysis
Don’t use JSON
Use either Snappy or Zlib compression
Avro’s small compression window hurts
Parquet Zlib is smaller than ORC
–Group the column sizes by type
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Size Analysis
ORC did better than expected
–String columns have small cardinality
–Lots of timestamp columns
–No doubles 
Need to revalidate results with original
–Improve random data generator
–Add non-smooth distributions
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Size Analysis
Surprising win for JSON and Avro
–Worst when uncompressed
–Best with zlib
Many partially shared strings
–ORC and Parquet don’t compress across columns
Need to investigate shared dictionaries.
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use Cases
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Full Table Scans
Read all columns & rows
All formats except JSON are splitable
–Different workers do different parts of file
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Read Performance Analysis
JSON is very slow to read
–Large storage size for this data set
–Needs to do a LOT of string parsing
Tradeoff between space & time
–Less compression is sometimes faster
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales Read Performance Analysis
Read performance is dominated by format
–Compression matters less for this data set
–Straight ordering: ORC, Avro, Parquet, & JSON
Garbage collection is important
–ORC 0.3 to 1.4% of time
–Avro < 0.1% of time
–Parquet 4 to 8% of time
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Read Performance Analysis
Garbage collection is critical
–ORC 2.1 to 3.4% of time
–Avro 0.1% of time
–Parquet 11.4 to 12.8% of time
A lot of columns needs more space
–Suspect that we need bigger stripes
–Rows/stripe - ORC: 18.6k, Parquet: 88.1k
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Column Projection
Often just need a few columns
–Only ORC & Parquet are columnar
–Only read, decompress, & deserialize some columns
Dataset format compression us/row projection Percent time
github orc zlib 21.319 0.185 0.87%
github parquet zlib 72.494 0.585 0.81%
sales orc zlib 1.866 0.056 3.00%
sales parquet zlib 12.893 0.329 2.55%
taxi orc zlib 2.766 0.063 2.28%
taxi parquet zlib 3.496 0.718 20.54%
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Projection & Predicate Pushdown
Sometimes have a filter predicate on table
–Select a superset of rows that match
–Selective filters have a huge impact
Improves data layout options
–Better than partition pruning with sorting
ORC has added optional bloom filters
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Metadata Access
ORC & Parquet store metadata
–Stored in file footer
–File schema
–Number of records
–Min, max, count of each column
Provides O(1) Access
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Conclusions
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Recomendations
Disclaimer – Everything changes!
–Both these benchmarks and the formats will change.
For complex tables with common strings
–Avro with Snappy is a good fit (w/o projection)
For other tables
–ORC with Zlib is a good fit
Experiment with the benchmarks
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Fun Stuff
Built open benchmark suite for files
Built pieces of a tool to convert files
–Avro, CSV, JSON, ORC, & Parquet
Built a random parameterized generator
–Easy to model arbitrary tables
–Can write to Avro, ORC, or Parquet
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you!
Twitter: @owen_omalley
Email: owen@hortonworks.com
Ad

More Related Content

What's hot (20)

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
DataWorks Summit
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Apache flink
Apache flinkApache flink
Apache flink
pranay kumar
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
Altinity Ltd
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
DataWorks Summit
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
Altinity Ltd
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 

Similar to File Format Benchmark - Avro, JSON, ORC & Parquet (20)

Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
Aaron Brooks
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
DataWorks Summit/Hadoop Summit
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without Data
Bryan Bende
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Owen O'Malley
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
Aaron Brooks
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
DataWorks Summit/Hadoop Summit
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without Data
Bryan Bende
 
Ad

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Ad

Recently uploaded (20)

Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 

File Format Benchmark - Avro, JSON, ORC & Parquet

  • 1. File Format Benchmark - Avro, JSON, ORC, & Parquet Owen O’Malley [email protected] @owen_omalley August 2016
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Who Am I? Worked on Hadoop since Jan 2006 MapReduce, Security, Hive, and ORC Worked on different file formats –Sequence File, RCFile, ORC File, T-File, and Avro requirements
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goal Seeking to discover unknowns –How do the different formats perform? –What could they do better? –Best part of open source is looking inside! Use real & diverse data sets –Over-reliance on similar datasets leads to weakness Open & reviewed benchmarks
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The File Formats
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Avro Cross-language file format for Hadoop Schema evolution was primary goal Schema segregated from data –Unlike Protobuf and Thrift Row major format
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved JSON Serialization format for HTTP & Javascript Text-format with MANY parsers Schema completely integrated with data Row major format Compression applied on top
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved ORC Originally part of Hive to replace RCFile –Now top-level project Schema segregated into footer Column major format with stripes Rich type model, stored top-down Integrated compression, indexes, & stats
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Parquet Design based on Google’s Dremel paper Schema segregated into footer Column major format with stripes Simpler type-model with logical types All data pushed to leaves of the tree Integrated compression and indexes
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Sets
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NYC Taxi Data Every taxi cab ride in NYC from 2009 –Publically available –http://tinyurl.com/nyc-taxi-analysis 18 columns with no null values –Doubles, integers, decimals, & strings 2 months of data – 22.7 million rows
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Github Logs All actions on Github public repositories –Publically available –https://www.githubarchive.org/ 704 columns with a lot of structure & nulls –Pretty much the kitchen sink  1/2 month of data – 10.5 million rows
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Finding the Github Schema The data is all in JSON. No schema for the data is published. We wrote a JSON schema discoverer. –Scans the document and figures out the type Available at https://github.com/hortonworks/hive-json Schema is huge (12k)
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sales Generated data –Real schema from a production Hive deployment –Random data based on the data statistics 55 columns with lots of nulls –A little structure –Timestamps, strings, longs, booleans, list, & struct 25 million rows
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Storage costs
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Compression Data size matters! –Hadoop stores all your data, but requires hardware –Is one factor in read speed ORC and Parquet use RLE & Dictionaries All the formats have general compression –ZLIB (GZip) – tight compression, slower –Snappy – some compression, faster
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Taxi Size Analysis Don’t use JSON Use either Snappy or Zlib compression Avro’s small compression window hurts Parquet Zlib is smaller than ORC –Group the column sizes by type
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Taxi Size Analysis ORC did better than expected –String columns have small cardinality –Lots of timestamp columns –No doubles  Need to revalidate results with original –Improve random data generator –Add non-smooth distributions
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Github Size Analysis Surprising win for JSON and Avro –Worst when uncompressed –Best with zlib Many partially shared strings –ORC and Parquet don’t compress across columns Need to investigate shared dictionaries.
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Use Cases
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Full Table Scans Read all columns & rows All formats except JSON are splitable –Different workers do different parts of file
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Taxi Read Performance Analysis JSON is very slow to read –Large storage size for this data set –Needs to do a LOT of string parsing Tradeoff between space & time –Less compression is sometimes faster
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sales Read Performance Analysis Read performance is dominated by format –Compression matters less for this data set –Straight ordering: ORC, Avro, Parquet, & JSON Garbage collection is important –ORC 0.3 to 1.4% of time –Avro < 0.1% of time –Parquet 4 to 8% of time
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Github Read Performance Analysis Garbage collection is critical –ORC 2.1 to 3.4% of time –Avro 0.1% of time –Parquet 11.4 to 12.8% of time A lot of columns needs more space –Suspect that we need bigger stripes –Rows/stripe - ORC: 18.6k, Parquet: 88.1k
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Column Projection Often just need a few columns –Only ORC & Parquet are columnar –Only read, decompress, & deserialize some columns Dataset format compression us/row projection Percent time github orc zlib 21.319 0.185 0.87% github parquet zlib 72.494 0.585 0.81% sales orc zlib 1.866 0.056 3.00% sales parquet zlib 12.893 0.329 2.55% taxi orc zlib 2.766 0.063 2.28% taxi parquet zlib 3.496 0.718 20.54%
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Projection & Predicate Pushdown Sometimes have a filter predicate on table –Select a superset of rows that match –Selective filters have a huge impact Improves data layout options –Better than partition pruning with sorting ORC has added optional bloom filters
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Metadata Access ORC & Parquet store metadata –Stored in file footer –File schema –Number of records –Min, max, count of each column Provides O(1) Access
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Conclusions
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Recomendations Disclaimer – Everything changes! –Both these benchmarks and the formats will change. For complex tables with common strings –Avro with Snappy is a good fit (w/o projection) For other tables –ORC with Zlib is a good fit Experiment with the benchmarks
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Fun Stuff Built open benchmark suite for files Built pieces of a tool to convert files –Avro, CSV, JSON, ORC, & Parquet Built a random parameterized generator –Easy to model arbitrary tables –Can write to Avro, ORC, or Parquet
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank you! Twitter: @owen_omalley Email: [email protected]