SlideShare a Scribd company logo
ORC File –
Optimizing Your Big Data
Owen O’Malley, Co-founder Hortonworks
Apache Hadoop, Hive, ORC, and
Incubator
@owen_omalley
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Overview
3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
In the Beginning…
 Hadoop applications used text or SequenceFile
– Text is slow and not splittable when compressed
– SequenceFile only supports key and value and user-defined serialization
 Hive added RCFile
– User controls the columns to read and decompress
– No type information and user-defined serialization
– Finding splits was expensive
 Avro files created
– Type information included!
– Had to read and decompress entire row
4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
ORC File Basics
 Columnar format
– Enables user to read & decompress just the bytes they need
 Fast
– See https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
 Indexed
 Self-describing
– Includes all of the information about types and encoding
 Rich type system
– All of Hive’s types including timestamp, struct, map, list, and union
5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
File Compatibility
 Backwards compatibility
– Automatically detect the version of the file and read it.
 Forward compatibility
– Most changes are made so old readers will read the new files
– Maintain the ability to write old files via orc.write.format
– Always write old version until your last cluster upgrades
 Current file versions
– 0.11 – Original version
– 0.12 – Updated run length encoding (RLE)
6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
File Structure
 File contains a list of stripes, which are sets of rows
– Default size is 64MB
– Large stripe size enables efficient reads
 Footer
– Contains the list of stripe locations
– Type description
– File and stripe statistics
 Postscript
– Compression parameters
– File format version
7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Stripe Structure
 Indexes
– Offsets to jump to start of row group
– Row group size defaults to 10,000 rows
– Minimum, Maximum, and Count of each column
 Data
– Data for the stripe organized by column
 Footer
– List of stream locations
– Column encoding information
8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
File Layout
Page 8
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Column 1
Column 2
Column 7
Column 8
Column 3
Column 6
Column 4
Column 5
Index Data
Row Data
Stripe Footer
~64MBStripe
Index Data
Row Data
Stripe Footer
~64MBStripe
Index Data
Row Data
Stripe Footer
~64MBStripe
File Footer
Postscript
File Metadata
9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Schema Evolution
 ORC now supports schema evolution
– Hive 2.1 – append columns or type conversion
– Upcoming Hive 2.3 – map columns or inner structures by name
– User passes desired schema to ORC reader
 Type conversions
– Most types will convert although some are ugly.
– If the value doesn’t fit in the new type, it will become null.
 Cautions
– Name mapping requires ORC files written by Hive ≥ 2.0
– Some of the type conversions are slow
10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Using ORC
11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
From Hive or Presto
 Modify your table definition:
– create table my_table (
name string,
address string,
) stored as orc;
 Import data:
– insert overwrite table my_table select * from my_staging;
 Use either configuration or table properties
– tblproperties ("orc.compress"="NONE")
– set hive.exec.orc.default.compress=NONE;
12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
From Java
 Use the ORC project rather than Hive’s ORC.
– Hive’s master branch uses it.
– Maven group id: org.apache.orc version: 1.4.0
– nohive classifier avoids interfering with Hive’s packages
 Two levels of access
– orc-core – Faster access, but uses Hive’s vectorized API
– orc-mapreduce – Row by row access, simpler OrcStruct API
 MapReduce API implements WritableComparable
– Can be shuffled
– Need to specify type information in configuration for shuffle or output
13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
From C++
 Pure C++ client library
– No JNI or JDK so client can estimate and control memory
 Combine with pure C++ HDFS client from HDFS-8707
– Work ongoing in feature branch, but should be committed soon.
 Reader is stable and in production use.
 Alibaba has created a writer and is contributing it to Apache ORC.
– Should be in the next release ORC 1.5.0.
14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Command Line
 Using hive –orcfiledump from Hive
– -j -p – pretty prints the metadata as JSON
– -d – prints data as JSON
 Using java -jar orc-tools-1.4.0-uber.jar from ORC
– meta – print the metadata as JSON
– data – print data as JSON
– convert – convert JSON to ORC
– json-schema – scan a set of JSON documents to find the matching schema
15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Optimization
16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Stripe Size
 Makes a huge difference in performance
– orc.stripe.size or hive.exec.orc.default.stripe.size
– Controls the amount of buffer in writer. Default is 64MB
– Trade off
• Large stripes = Large more efficient reads
• Small stripes = Less memory and more granular processing splits
 Multiple files written at the same time will shrink stripes
– Use Hive’s hive.optimize.sort.dynamic.partition
– Sorting dynamic partitions means a one writer at a time
17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
HDFS Block Padding
 The stripes don’t align exactly with HDFS blocks
 HDFS scatters blocks around cluster
 Often want to pad to block boundaries
– Costs space, but improves performance
– hive.exec.orc.default.block.padding – true
– hive.exec.orc.block.padding.tolerance – 0.05
Index Data
Row Data
Stripe Footer
~64MBStripe
Index Data
Row Data
Stripe Footer
~64MBStripe
Index Data
Row Data
Stripe Footer
~64MBStripe
HDFS Block
HDFS Block
Padding
File Footer
Postscript
File Metadata
18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Predicate Push Down
 Reader is given a SearchArg
– Limited set predicates over column and literal value
– Reader will skip over any parts of file that can’t contain valid rows
 ORC indexes at three levels:
– File
– Stripe
– Row Group (10k rows)
 Reader still needs to apply predicate to filter out single rows
19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Row Pruning
 Every primitive column has minimum and maximum at each level
– Sorting your data within a file helps a lot
– Consider sorting instead of making lots of partitions
 Writer can optionally include bloomfilters
– Provides a probabilistic bitmap of hashcodes
– Only works with equality predicates at the row group level
– Requires significant space in the file
– Manually enabled by using orc.bloom.filter.columns
– Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)
– Set the default charset in JVM via -Dfile.encoding=UTF-8
20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Row Pruning Example
 TPC-DS
– from tpch1000.lineitem where l_orderkey = 1212000001;
 Rows Read
– Nothing – 5,999,989,709
– Min/Max – 540,000
– BloomFilter – 10,000
 Time Taken
– Nothing – 74 sec
– Min/Max – 4.5 sec
– BloomFilter – 1.3 sec
21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Split Calculation
 Hive’s OrcInputFormat has three strategies for split calculation
– BI
• Small fast queries
• Splits based on HDFS blocks
– ETL
• Large queries
• Read file footer and apply SearchArg to stripes
• Can include footer in splits (hive.orc.splits.include.file.footer)
– Hybrid
• If small files or lots of files, use BI
22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
LLAP – Live Long & Process
 Provides a persistent service to speed up Hive
– Caches ORC and text data
– Saves costs of Yarn container & JVM spin up
– JIT finishes after first few seconds
 Cache uses ORC’s RLE
– Decompresses zlib or Snappy
– RLE is fast and saves memory
– Automatically caches hot columns and partitions
 Allows Spark to use Hive’s column and row security
23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Current Work In Progress
24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Speed Improvements for ACID
 Hive supports ACID transactions on ORC tables
– Uses delta files in HDFS to store changes to each partition
– Delta files store insert/update/delete operations
– Used to support SQL insert commands
 Unfortunately, update operations don’t allow predicate push down
on the deltas
 In the upcoming Hive 2.3, we added a new ACID layout
– It change updates to an insert and delete
– Allows predicate pushdown even on the delta files
 Also added SQL merge command in Hive 2.2
25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Column Encryption (ORC-14)
 Allows users to encrypt some of the columns of the file
– Provides column level security even with access to raw files
– Uses Key Management Server from Ranger or Hadoop
– Includes both the data and the index
– Daily key rolling can anonymize data after 90 days
 User specifies how data is masked if user doesn’t have access
– Nullify
– Redact
– SHA256
26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Thank You
@owen_omalley
owen@hortonworks.com
Ad

More Related Content

What's hot (20)

The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
ORC Files
ORC FilesORC Files
ORC Files
Owen O'Malley
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Databricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Databricks
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Databricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Databricks
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
 

Similar to ORC File - Optimizing Your Big Data (20)

Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in Hive
Eugene Koifman
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
DataWorks Summit/Hadoop Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
HiveWarehouseConnector
HiveWarehouseConnectorHiveWarehouseConnector
HiveWarehouseConnector
Eric Wohlstadter
 
Ansible + Hadoop
Ansible + HadoopAnsible + Hadoop
Ansible + Hadoop
Michael Young
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
Hortonworks
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hortonworks
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
Chris Nauroth
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
Steve Loughran
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
Muhammad Ali
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop
Gergely Devenyi
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
Hortonworks
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Owen O'Malley
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in Hive
Eugene Koifman
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
Hortonworks
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hortonworks
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
Chris Nauroth
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
Steve Loughran
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoop
Gergely Devenyi
 
Ad

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Ad

Recently uploaded (20)

#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 

ORC File - Optimizing Your Big Data

  • 1. ORC File – Optimizing Your Big Data Owen O’Malley, Co-founder Hortonworks Apache Hadoop, Hive, ORC, and Incubator @owen_omalley
  • 2. 2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Overview
  • 3. 3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved In the Beginning…  Hadoop applications used text or SequenceFile – Text is slow and not splittable when compressed – SequenceFile only supports key and value and user-defined serialization  Hive added RCFile – User controls the columns to read and decompress – No type information and user-defined serialization – Finding splits was expensive  Avro files created – Type information included! – Had to read and decompress entire row
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved ORC File Basics  Columnar format – Enables user to read & decompress just the bytes they need  Fast – See https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet  Indexed  Self-describing – Includes all of the information about types and encoding  Rich type system – All of Hive’s types including timestamp, struct, map, list, and union
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved File Compatibility  Backwards compatibility – Automatically detect the version of the file and read it.  Forward compatibility – Most changes are made so old readers will read the new files – Maintain the ability to write old files via orc.write.format – Always write old version until your last cluster upgrades  Current file versions – 0.11 – Original version – 0.12 – Updated run length encoding (RLE)
  • 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved File Structure  File contains a list of stripes, which are sets of rows – Default size is 64MB – Large stripe size enables efficient reads  Footer – Contains the list of stripe locations – Type description – File and stripe statistics  Postscript – Compression parameters – File format version
  • 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Stripe Structure  Indexes – Offsets to jump to start of row group – Row group size defaults to 10,000 rows – Minimum, Maximum, and Count of each column  Data – Data for the stripe organized by column  Footer – List of stream locations – Column encoding information
  • 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved File Layout Page 8 Column 1 Column 2 Column 7 Column 8 Column 3 Column 6 Column 4 Column 5 Column 1 Column 2 Column 7 Column 8 Column 3 Column 6 Column 4 Column 5 Index Data Row Data Stripe Footer ~64MBStripe Index Data Row Data Stripe Footer ~64MBStripe Index Data Row Data Stripe Footer ~64MBStripe File Footer Postscript File Metadata
  • 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Schema Evolution  ORC now supports schema evolution – Hive 2.1 – append columns or type conversion – Upcoming Hive 2.3 – map columns or inner structures by name – User passes desired schema to ORC reader  Type conversions – Most types will convert although some are ugly. – If the value doesn’t fit in the new type, it will become null.  Cautions – Name mapping requires ORC files written by Hive ≥ 2.0 – Some of the type conversions are slow
  • 10. 10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Using ORC
  • 11. 11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved From Hive or Presto  Modify your table definition: – create table my_table ( name string, address string, ) stored as orc;  Import data: – insert overwrite table my_table select * from my_staging;  Use either configuration or table properties – tblproperties ("orc.compress"="NONE") – set hive.exec.orc.default.compress=NONE;
  • 12. 12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved From Java  Use the ORC project rather than Hive’s ORC. – Hive’s master branch uses it. – Maven group id: org.apache.orc version: 1.4.0 – nohive classifier avoids interfering with Hive’s packages  Two levels of access – orc-core – Faster access, but uses Hive’s vectorized API – orc-mapreduce – Row by row access, simpler OrcStruct API  MapReduce API implements WritableComparable – Can be shuffled – Need to specify type information in configuration for shuffle or output
  • 13. 13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved From C++  Pure C++ client library – No JNI or JDK so client can estimate and control memory  Combine with pure C++ HDFS client from HDFS-8707 – Work ongoing in feature branch, but should be committed soon.  Reader is stable and in production use.  Alibaba has created a writer and is contributing it to Apache ORC. – Should be in the next release ORC 1.5.0.
  • 14. 14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Command Line  Using hive –orcfiledump from Hive – -j -p – pretty prints the metadata as JSON – -d – prints data as JSON  Using java -jar orc-tools-1.4.0-uber.jar from ORC – meta – print the metadata as JSON – data – print data as JSON – convert – convert JSON to ORC – json-schema – scan a set of JSON documents to find the matching schema
  • 15. 15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Optimization
  • 16. 16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Stripe Size  Makes a huge difference in performance – orc.stripe.size or hive.exec.orc.default.stripe.size – Controls the amount of buffer in writer. Default is 64MB – Trade off • Large stripes = Large more efficient reads • Small stripes = Less memory and more granular processing splits  Multiple files written at the same time will shrink stripes – Use Hive’s hive.optimize.sort.dynamic.partition – Sorting dynamic partitions means a one writer at a time
  • 17. 17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved HDFS Block Padding  The stripes don’t align exactly with HDFS blocks  HDFS scatters blocks around cluster  Often want to pad to block boundaries – Costs space, but improves performance – hive.exec.orc.default.block.padding – true – hive.exec.orc.block.padding.tolerance – 0.05 Index Data Row Data Stripe Footer ~64MBStripe Index Data Row Data Stripe Footer ~64MBStripe Index Data Row Data Stripe Footer ~64MBStripe HDFS Block HDFS Block Padding File Footer Postscript File Metadata
  • 18. 18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Predicate Push Down  Reader is given a SearchArg – Limited set predicates over column and literal value – Reader will skip over any parts of file that can’t contain valid rows  ORC indexes at three levels: – File – Stripe – Row Group (10k rows)  Reader still needs to apply predicate to filter out single rows
  • 19. 19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Row Pruning  Every primitive column has minimum and maximum at each level – Sorting your data within a file helps a lot – Consider sorting instead of making lots of partitions  Writer can optionally include bloomfilters – Provides a probabilistic bitmap of hashcodes – Only works with equality predicates at the row group level – Requires significant space in the file – Manually enabled by using orc.bloom.filter.columns – Use orc.bloom.filter.fpp to set the false positive rate (default 0.05) – Set the default charset in JVM via -Dfile.encoding=UTF-8
  • 20. 20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Row Pruning Example  TPC-DS – from tpch1000.lineitem where l_orderkey = 1212000001;  Rows Read – Nothing – 5,999,989,709 – Min/Max – 540,000 – BloomFilter – 10,000  Time Taken – Nothing – 74 sec – Min/Max – 4.5 sec – BloomFilter – 1.3 sec
  • 21. 21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Split Calculation  Hive’s OrcInputFormat has three strategies for split calculation – BI • Small fast queries • Splits based on HDFS blocks – ETL • Large queries • Read file footer and apply SearchArg to stripes • Can include footer in splits (hive.orc.splits.include.file.footer) – Hybrid • If small files or lots of files, use BI
  • 22. 22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved LLAP – Live Long & Process  Provides a persistent service to speed up Hive – Caches ORC and text data – Saves costs of Yarn container & JVM spin up – JIT finishes after first few seconds  Cache uses ORC’s RLE – Decompresses zlib or Snappy – RLE is fast and saves memory – Automatically caches hot columns and partitions  Allows Spark to use Hive’s column and row security
  • 23. 23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Current Work In Progress
  • 24. 24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Speed Improvements for ACID  Hive supports ACID transactions on ORC tables – Uses delta files in HDFS to store changes to each partition – Delta files store insert/update/delete operations – Used to support SQL insert commands  Unfortunately, update operations don’t allow predicate push down on the deltas  In the upcoming Hive 2.3, we added a new ACID layout – It change updates to an insert and delete – Allows predicate pushdown even on the delta files  Also added SQL merge command in Hive 2.2
  • 25. 25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Column Encryption (ORC-14)  Allows users to encrypt some of the columns of the file – Provides column level security even with access to raw files – Uses Key Management Server from Ranger or Hadoop – Includes both the data and the index – Daily key rolling can anonymize data after 90 days  User specifies how data is masked if user doesn’t have access – Nullify – Redact – SHA256
  • 26. 26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Thank You @owen_omalley [email protected]