Hive Partitioning
Best Practices
(Developer audience)
Nabeel Moidu
Solutions Architect
Cloudera Professional Services
2 © Cloudera, Inc. All rights reserved.
Topics
• Data Warehouses in Hadoop
• Hadoop Data Modelling
• Columnar data storage
• Partitioning
• Bucketing
• Best Practices
• Small Files
• Storage optimizations
• Query optimisation
• Debugging Hive queries
Data Warehouses in Hadoop
❏ Targeted for analytical query processing (OLAP).
❏ Query processing is handled differently by each engine:
❏ Hive uses MapReduce as the engine.
❏ CDP versions include Tez engine
❏ SparkSQL uses Spark engine
❏ Both Hive and Spark use table metadata stored in the Hive metastore.
❏ Both Hive and Spark are based on “Schema-on-read” approach
❏ Traditional RDBMS use “Schema-on-write” approach
Key differences in query patterns from RDBMS (OLTP):
❏ May need only selected columns from a table
❏ The entire set of columns for selected rows is not required in most cases.
❏ May involve large join operations.
❏ Denormalized tables are preferred
❏ Query optimisation focus includes minimizing amount of data fetched from disk (pruning)
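The OLAP pattern described above can be sketched with a hypothetical denormalized table `sales_denorm` (table and column names are illustrative):

```sql
-- Only two columns of a wide, denormalized table are touched, so a
-- columnar format can read just those columns instead of fetching
-- entire rows (as an OLTP row store would).
SELECT region, SUM(revenue) AS total_revenue
FROM sales_denorm
GROUP BY region;
```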
Hadoop Data Modelling
General guidelines for Data Modelling in Hadoop:
1. Denormalize tables that are joined frequently**.
2. RDBMS implementations of any comparable features differ significantly from those of Hadoop/Hive.
3. Significant level of optimisation is done outside of partitioning.
4. ORC/Parquet file formats with embedded metadata greatly improve predicate pushdown
5. Storing numerical data as strings prevents predicate pushdown during query optimisation
6. Query engines assess the data involved and optimize the SQL statement before actual execution.
7. Excessive partitioning/bucketing can negatively impact overall performance due to added overhead
** (Note: Very wide tables may have memory implications)
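As a hedged illustration of guidelines 4 and 5, keeping numeric data in numeric column types preserves predicate pushdown (table and column names are hypothetical):

```sql
-- Hypothetical table; names and types are illustrative.
-- Declaring amount with a numeric type keeps min/max-based predicate
-- pushdown available; storing it as STRING would defeat it (guideline 5).
CREATE TABLE sales_typed (
  txn_id   BIGINT,
  amount   DECIMAL(10,2),
  txn_date DATE
)
STORED AS PARQUET;  -- embedded footer metadata enables pushdown (guideline 4)
```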
Columnar data storage
Physical data organization on the disk is optimised for OLAP patterns of access.
1. Columnar data organization improves performance of data retrieval.
2. Storing similar patterns together makes storage compression very efficient.
a. Low cardinality columns are replaced with dictionary mapped references
b. Repeated values in column are only stored once
c. Numerically close values are stored as delta
d. Bits are packed for efficient use of disk space
e. Per column metadata is separately stored in footer/trailer
3. Efficient storage on disk significantly improves disk I/O rate for data fetch
4. Columnar storage optimisation is applied both on Parquet and ORC
5. Common file formats like CSV, JSON and XML are not optimal for performance
Partitioning
1. Achieves data pruning by dividing the table’s backend storage location into sub-folders.
2. Each folder is named as “key=value” and corresponds to a unique value of the partition column.
3. Partitioning on multiple columns creates multiple levels of folders in the backend storage
4. Each subfolder level corresponds to one column designated as a partition column.
5. Partitioned directories are selectively chosen during data fetch for a query.
6. File structure doesn’t have to be opened and read to identify partitioned column value
7. Partitions cannot be used for a range of values.
8. Partitions should only be used for columns with low cardinality, e.g. year, country etc.
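The mechanics above can be sketched with a hypothetical two-level partitioned table (all names are illustrative):

```sql
CREATE TABLE web_logs (
  session_id STRING,
  url        STRING,
  bytes_sent BIGINT
)
PARTITIONED BY (year INT, country STRING)  -- two levels of folders
STORED AS PARQUET;

-- Backend layout created by the two partition levels:
--   .../web_logs/year=2020/country=US/<data files>
--   .../web_logs/year=2020/country=IN/<data files>

-- Filtering on the partition columns prunes to matching folders only;
-- no file needs to be opened to evaluate year or country:
SELECT count(*) FROM web_logs
WHERE year = 2020 AND country = 'US';
```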
Bucketing
1. Divides a table’s data into a preset number of buckets based on a column.
2. A hash is computed on the column value to place each row into a bucket.
3. Each unique column value will always end up in the same bucket file
4. Helps isolate the data fetch to a single file when joining on the bucketed column
5. Bucketing and partitioning can be used together, but set a target file size as 200 MB - 1 GB *
6. Too much partitioning and bucketing can end up with the small file issue
7. Parquet/ORC storage optimisations largely cover the same ground as bucketing
* During the session this was mentioned as 1 GB - 2 GB. Please note the correction
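A minimal bucketing sketch; the table name and the bucket count of 32 are illustrative only (size buckets toward the file-size target above):

```sql
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- Every customer_id hashes to exactly one bucket file, so a join on
-- customer_id against another table bucketed the same way can be
-- served from one bucket file per side.
```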
Best Practices ( partitioning & bucketing )
● Choose no more than two levels of partitioning
● Partition along columns that are likely to be filtered in queries on the data
● Keep partition count in a table at max of 1000-2000 for optimal performance
● Never partition on columns that have high cardinality / unique values per row
● Target an optimal range of 200 MB - 1 GB of file size inside each partition/bucket.
● Ensure all files inside one partition are merged into one during the ingestion process itself.
● Where a cluster already suffers from small-file issues, bucketing is better skipped for ORC/Parquet tables
● For bucketing optimisation inputs, refer to :
https://community.cloudera.com/t5/Support-Questions/Hive-Deciding-the-number-of-buckets/m-p/129310
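One hedged way to honour the “merge files during ingestion” guideline is Hive’s built-in output merging; the size values and table names below are illustrative assumptions, not recommendations:

```sql
SET hive.merge.mapfiles = true;     -- merge small outputs of map-only jobs
SET hive.merge.tezfiles = true;     -- merge outputs when running on Tez
SET hive.merge.smallfiles.avgsize = 268435456;  -- merge if avg file < ~256 MB
SET hive.merge.size.per.task = 536870912;       -- aim for ~512 MB merged files

-- Dynamic-partition insert from a hypothetical staging table:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE web_logs PARTITION (year, country)
SELECT session_id, url, bytes_sent, year, country
FROM staging_web_logs;
```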
Small Files
1. The Hadoop cluster filesystem and the Yarn processing framework involve some overhead, which is:
a. Negligible when the chunk of data stored and processed by each storage block or task unit is large enough
b. Expensive when the data involved is proportionately small.
2. Hadoop and big data processing is generally optimized for large file sizes.
3. Small files make disk reads random during data fetch and significantly reduce performance of data retrieval.
4. No hard boundary is defined as to what constitutes a small file, but
a. Files < 30 MB are not optimal.
b. Files < 1 MB will significantly impact overall performance.
5. Numerous small files increase the size of metadata on the master nodes of the cluster filesystem
a. Impacts performance of the filesystem response.
6. Metadata processing for numerous individual small files impacts query planning stages in Hive and Spark
a. File internal structure has to be individually read to get footer/trailer metadata in each file.
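Two hedged ways to compact an already-ingested small-file partition, using a hypothetical partitioned table `web_logs`:

```sql
-- For ORC tables, Hive can merge a partition's files in place:
ALTER TABLE web_logs PARTITION (year = 2020, country = 'US') CONCATENATE;

-- A format-agnostic alternative is to rewrite the partition onto itself,
-- letting the merge settings produce fewer, larger files:
INSERT OVERWRITE TABLE web_logs PARTITION (year = 2020, country = 'US')
SELECT session_id, url, bytes_sent
FROM web_logs
WHERE year = 2020 AND country = 'US';
```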
Storage optimizations
Parquet and ORC are the main file formats that are efficient and perform well in Hive.
1. Parquet was originally developed by Cloudera and Twitter - available in CDH clusters
2. ORC was originally by Hortonworks, before the merger with Cloudera - will be available in CDP clusters
3. Both formats involve columnar storage of data in the file
a. Both contain useful metadata in the footer of the file.
4. Both formats involve data being split first into sets of rows.
a. ORC calls it a stripe, while Parquet refers to it as a row group.
5. Each set of rows then has its individual column data stored together.
a. This helps in both efficient compression and efficient retrieval of column based query output.
6. Data within columns is generally stored using dictionary and RLE (run-length encoding) methods
7. Use parquet-tools utility to view metadata on a parquet file.
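Declaring either format in Hive DDL is a one-line choice; the compression codecs below are common defaults, shown here as assumptions rather than recommendations:

```sql
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');      -- stripes + footer metadata written automatically

CREATE TABLE events_parquet (
  event_id BIGINT,
  payload  STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');  -- row groups + footer metadata
```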
Query optimisation
1. Partitioning on filter column enables data to be fetched only from selected subfolders in HDFS
2. Column metadata is present at the footer of each file in ORC/Parquet
3. Footer data includes among others, min and max values for each column.
4. Footers also contain a dictionary list of column values for lower cardinality columns.
5. These and similar pieces of metadata help in predicate pushdown to optimise queries
6. Query engines can skip entire files during processing based on the metadata. Eg :
6.1. If the filter criteria don’t fall within the range identified by the per-column min-max values, the entire file can be skipped
6.2. If the filter criteria are not among the keys in the dictionary of column values, the entire file can be skipped
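As a hedged illustration, assume a hypothetical Parquet table `sales_typed` whose file footers record min(amount) = 10 and max(amount) = 90:

```sql
-- The filter range lies entirely outside the footer's min-max range,
-- so the engine can skip the whole file without reading its data pages:
SELECT txn_id FROM sales_typed WHERE amount > 500;

-- EXPLAIN shows whether the predicate was pushed into the table scan:
EXPLAIN SELECT txn_id FROM sales_typed WHERE amount > 500;
```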
Debugging Hive queries
1. Use the PerfLogger setting in the Hiveserver2 instance to debug Hive query performance.
2. Each stage in Hive query execution is separately timed and logged once PerfLogger is enabled.
3. Compile and Yarn stages are separately identifiable.
4. Compile stages normally should not take more than a few seconds.
a. For complicated queries this may go up to a minute or so.
b. 90% or more of query time is often spent in Yarn stage
5. Use Hive session ID and assigned thread ID to track individual user sessions in Hiveserver2.
6. Use application ID logged against the Hive query ID to track progress of query in Yarn.
Q&A - Part 1
1. How can tables with skewed data columns be partitioned ?
a. Use the SKEWED BY option
2. Why is there a difference in performance when using “BETWEEN” and “GREATER THAN OR EQUAL TO” ?
a. Needs to be investigated using EXPLAIN output
b. Speculative possibilities include difference in predicate pushdown or vectorisation
3. If downstream use patterns are not known at time of table design, what is to be done ?
a. Optimise based on generally known patterns (query by year/month etc)
b. Optimise based on ingest
4. Can bucketing be done on tables after data ingest ?
a. No. Data will have to be fully re-organized, so it’s better to create another bucketed table and copy into it.
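Hedged sketches for answers 1 and 4 above (all table names, columns, skew values and bucket counts are illustrative):

```sql
-- 1. Skewed-table handling: list the heavy values so Hive can place
--    them in their own directories (list bucketing).
CREATE TABLE orders_skewed (
  order_id    BIGINT,
  customer_id BIGINT
)
SKEWED BY (customer_id) ON (1001, 1002)
STORED AS DIRECTORIES
STORED AS ORC;

-- 4. Bucketing after ingest: create a new bucketed table and copy in.
CREATE TABLE orders_bucketed (
  order_id    BIGINT,
  customer_id BIGINT
)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE orders_bucketed
SELECT order_id, customer_id FROM orders_skewed;
```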
Q&A - Part 2
1. How is HBase columnar format different from Hive columnar optimizations ?
a. HBase stores data in different column families in different folders in HDFS.
b. Inside that the column data is still stored with the key alongside every cell value
c. HBase is extremely fast for key based lookup, and key range based lookup.
d. Queries filtering on column values are an anti-pattern for HBase and will not perform well
e. Column-based filtering on HBase has to scan through all columns for the selected rows.
f. HBase file format design is not optimized for any join type operations
g. Schema design in HBase primarily focuses on the way the row key is designed
Q&A - Part 3
1. Why does Spark not have its own metastore?
a. Spark was designed as a data processing framework separate from Hadoop
b. The initial focus of Spark optimizations was the query engine.
c. When Spark was introduced, data in Hadoop was already catalogued in the Hive metastore
d. Spark and Hive being open source, it was possible for Spark to directly talk to Hive metastore
e. Hence there was no necessity for Spark to introduce a separate catalog
2. Can Hive PerfLogger output be made available to users without access to Hiveserver2, the way Spark logs are?
a. The Driver process is where a query is compiled and passed to Yarn in Hadoop
b. The Driver in Spark sits on the Application Master of the launched job, hence it is segregated per job
c. The Driver in Hive sits on the Hiveserver2 instance. Hence the logs sit on the central server for all queries.
Thank you
Ad

More Related Content

What's hot (20)

Spark architecture
Spark architectureSpark architecture
Spark architecture
GauravBiswas9
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
Prashant Gupta
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
kristinferrier
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Marin Dimitrov
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Sqoop
SqoopSqoop
Sqoop
Prashant Gupta
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
ETL Design for Impala Zero Touch Metadata.pptx
ETL Design for Impala Zero Touch Metadata.pptxETL Design for Impala Zero Touch Metadata.pptx
ETL Design for Impala Zero Touch Metadata.pptx
Manish Maheshwari
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
Yue Chen
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
kristinferrier
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
ETL Design for Impala Zero Touch Metadata.pptx
ETL Design for Impala Zero Touch Metadata.pptxETL Design for Impala Zero Touch Metadata.pptx
ETL Design for Impala Zero Touch Metadata.pptx
Manish Maheshwari
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
Yue Chen
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 

Similar to Hive partitioning best practices (20)

Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
Bob Pusateri
 
Hadoop storage
Hadoop storageHadoop storage
Hadoop storage
SanSan149
 
Designing data intensive applications
Designing data intensive applicationsDesigning data intensive applications
Designing data intensive applications
Hemchander Sannidhanam
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
Minio
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
CitiusTech
 
Storage in hadoop
Storage in hadoopStorage in hadoop
Storage in hadoop
Puneet Tripathi
 
New in Hadoop: You should know the Various File Format in Hadoop.
New in Hadoop: You should know the Various File Format in Hadoop.New in Hadoop: You should know the Various File Format in Hadoop.
New in Hadoop: You should know the Various File Format in Hadoop.
veeracynixit
 
HCSA-Presales-Storage V4.0 Training Material (2).pdf
HCSA-Presales-Storage V4.0 Training Material (2).pdfHCSA-Presales-Storage V4.0 Training Material (2).pdf
HCSA-Presales-Storage V4.0 Training Material (2).pdf
priyosantoso13
 
A quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE ClaritasA quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE Claritas
Guy Maslen
 
Hadoop compression strata conference
Hadoop compression strata conferenceHadoop compression strata conference
Hadoop compression strata conference
nkabra
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Data Analytics presentation.pptx
Data Analytics presentation.pptxData Analytics presentation.pptx
Data Analytics presentation.pptx
SwarnaSLcse
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
earnwithme2522
 
Elastic storage in the cloud session 5224 final v2
Elastic storage in the cloud session 5224 final v2Elastic storage in the cloud session 5224 final v2
Elastic storage in the cloud session 5224 final v2
BradDesAulniers2
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud Environment
Fei Dong
 
Unit 6 - Compression and Serialization in Hadoop.pptx
Unit 6 - Compression and Serialization in Hadoop.pptxUnit 6 - Compression and Serialization in Hadoop.pptx
Unit 6 - Compression and Serialization in Hadoop.pptx
muhweziart
 
H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
The HDF-EOS Tools and Information Center
 
CS 2212- UNIT -4.pptx
CS 2212-  UNIT -4.pptxCS 2212-  UNIT -4.pptx
CS 2212- UNIT -4.pptx
LilyMkayula
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18
karenostil
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
Bob Pusateri
 
Hadoop storage
Hadoop storageHadoop storage
Hadoop storage
SanSan149
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
Minio
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
CitiusTech
 
New in Hadoop: You should know the Various File Format in Hadoop.
New in Hadoop: You should know the Various File Format in Hadoop.New in Hadoop: You should know the Various File Format in Hadoop.
New in Hadoop: You should know the Various File Format in Hadoop.
veeracynixit
 
HCSA-Presales-Storage V4.0 Training Material (2).pdf
HCSA-Presales-Storage V4.0 Training Material (2).pdfHCSA-Presales-Storage V4.0 Training Material (2).pdf
HCSA-Presales-Storage V4.0 Training Material (2).pdf
priyosantoso13
 
A quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE ClaritasA quick start guide to using HDF5 files in GLOBE Claritas
A quick start guide to using HDF5 files in GLOBE Claritas
Guy Maslen
 
Hadoop compression strata conference
Hadoop compression strata conferenceHadoop compression strata conference
Hadoop compression strata conference
nkabra
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Data Analytics presentation.pptx
Data Analytics presentation.pptxData Analytics presentation.pptx
Data Analytics presentation.pptx
SwarnaSLcse
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
earnwithme2522
 
Elastic storage in the cloud session 5224 final v2
Elastic storage in the cloud session 5224 final v2Elastic storage in the cloud session 5224 final v2
Elastic storage in the cloud session 5224 final v2
BradDesAulniers2
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud Environment
Fei Dong
 
Unit 6 - Compression and Serialization in Hadoop.pptx
Unit 6 - Compression and Serialization in Hadoop.pptxUnit 6 - Compression and Serialization in Hadoop.pptx
Unit 6 - Compression and Serialization in Hadoop.pptx
muhweziart
 
CS 2212- UNIT -4.pptx
CS 2212-  UNIT -4.pptxCS 2212-  UNIT -4.pptx
CS 2212- UNIT -4.pptx
LilyMkayula
 
Csci12 report aug18
Csci12 report aug18Csci12 report aug18
Csci12 report aug18
karenostil
 
Ad

Recently uploaded (20)

Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Columnar data storage
Physical data organization on disk is optimised for OLAP access patterns.
1. Columnar data organization improves the performance of data retrieval.
2. Storing similar patterns together makes storage compression very efficient.
   a. Low-cardinality columns are replaced with dictionary-mapped references.
   b. Repeated values in a column are stored only once.
   c. Numerically close values are stored as deltas.
   d. Bits are packed for efficient use of disk space.
   e. Per-column metadata is stored separately in the footer/trailer.
3. Efficient storage on disk significantly improves the disk I/O rate for data fetch.
4. Columnar storage optimisation is applied in both Parquet and ORC.
5. Common file formats like CSV, JSON, XML, etc. are not optimal for performance.
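The dictionary and run-length techniques in point 2 can be illustrated with a toy sketch in Python. This is not the actual Parquet/ORC code, just the idea; the example column is made up.

```python
# Conceptual sketch (not the real Parquet/ORC implementation) of two
# columnar encodings: dictionary encoding for a low-cardinality column,
# and run-length encoding (RLE) for repeated values.

def dictionary_encode(values):
    """Replace each value with a small integer reference into a dictionary."""
    dictionary = {}
    refs = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        refs.append(dictionary[v])
    return dictionary, refs

def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

country = ["US", "US", "US", "IN", "IN", "US"]
dictionary, refs = dictionary_encode(country)
# dictionary == {"US": 0, "IN": 1}; refs == [0, 0, 0, 1, 1, 0]
runs = run_length_encode(refs)
# RLE stores the 6 references as 3 runs: [(0, 3), (1, 2), (0, 1)]
```

On a real low-cardinality column (e.g. country over millions of rows) the dictionary references plus run lengths occupy a small fraction of the raw string data, which is why columnar compression is so effective.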
6 © Cloudera, Inc. All rights reserved.
Partitioning
1. Achieves data pruning by dividing the table's backend storage location into sub-folders.
2. Each folder is named “key=value” and corresponds to a unique value of the partition column.
3. Partitioning on multiple columns creates multiple levels of folders in the backend storage.
4. Each subfolder level corresponds to one of the columns designated for partitioning.
5. Partitioned directories are selectively chosen during data fetch for a query.
6. The file structure does not have to be opened and read to identify the partition column value.
7. Partitions cannot be used for a range of values.
8. Partitions should only be used for columns with low cardinality, e.g. year, country.
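The directory-level pruning described above can be sketched in a few lines of Python. The table name, paths, and partition columns are made up for illustration; Hive lays partitions out in the same "key=value" way under the table's warehouse directory.

```python
# Conceptual sketch of how "key=value" partition folders enable pruning:
# the engine selects matching directories from the folder names alone,
# without opening any data file.

partitions = [
    "sales/year=2022/country=US",
    "sales/year=2022/country=IN",
    "sales/year=2023/country=US",
    "sales/year=2023/country=IN",
]

def prune(partitions, filters):
    """Keep only partition directories whose key=value parts match the filters."""
    selected = []
    for path in partitions:
        kv = dict(part.split("=") for part in path.split("/") if "=" in part)
        if all(kv.get(k) == v for k, v in filters.items()):
            selected.append(path)
    return selected

# A query filtering on year=2023 touches only two of the four directories.
selected = prune(partitions, {"year": "2023"})
```

This is also why partitioning only helps equality-style filters on low-cardinality columns: the decision is made purely on directory names.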
7 © Cloudera, Inc. All rights reserved.
Bucketing
1. Divides table data into a preset number of buckets.
2. A hash is computed on the bucketing column value to place each row into one of the buckets.
3. Each unique column value always ends up in the same bucket file.
4. Helps isolate data fetch to a single file when joining on the bucketed column value.
5. Bucketing and partitioning can be used together, but set a target file size of 200 MB - 1 GB.*
6. Too much partitioning and bucketing can end in the small files issue.
7. Parquet/ORC storage optimisations largely cover the same ground as bucketing.
* During the session this was mentioned as 1 GB - 2 GB. Please note the correction.
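The hash-based placement in points 2-3 can be sketched as follows. Hive's exact hash function differs by column type; for string columns it uses Java's `String.hashCode`, which is reimplemented here for illustration (the column value is hypothetical).

```python
# Conceptual sketch of bucket assignment: hash the bucketing column value,
# take it modulo the bucket count. The same value always lands in the
# same bucket file.

def java_string_hashcode(s):
    """Mimic Java's String.hashCode(), which Hive uses for string columns."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit integer, as Java would.
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(value, num_buckets):
    """Rows with equal values always map to the same bucket index."""
    return java_string_hashcode(value) % num_buckets

# A join on the bucketed column can then match bucket file i on one side
# against bucket file i on the other, instead of scanning all files.
b1 = bucket_for("customer_42", 8)
b2 = bucket_for("customer_42", 8)
```

Because the assignment is deterministic, two tables bucketed the same way on the join key can be joined bucket-by-bucket, which is the basis of Hive's bucketed map join.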
8 © Cloudera, Inc. All rights reserved.
Best Practices (partitioning & bucketing)
● Choose no more than two levels of partitioning.
● Partition along columns that are likely to be filtered on in queries.
● Keep the partition count in a table at a maximum of 1000-2000 for optimal performance.
● Never partition on columns with high cardinality / unique values per row.
● Target an optimal file size of 200 MB - 1 GB inside each partition/bucket.
● Ensure all files inside one partition are merged into one during the ingestion process itself.
● If the cluster has small file issues, bucketing is better skipped when using ORC/Parquet tables.
● For inputs on bucketing optimisation, refer to: https://community.cloudera.com/t5/Support-Questions/Hive-Deciding-the-number-of-buckets/m-p/129310
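Partition-column cardinalities multiply, so the limits above are easy to blow through. A back-of-the-envelope check, with entirely hypothetical table sizes and cardinalities:

```python
# Back-of-the-envelope check of the sizing guidance. All numbers are
# made up for illustration; the point is that cardinalities multiply.

table_size_gb = 500
years, countries = 5, 200             # two partition levels
partition_count = years * countries   # 1000 partitions - at the advised limit

avg_partition_mb = table_size_gb * 1024 / partition_count  # 512 MB each
# 512 MB per partition sits inside the 200 MB - 1 GB target, so one merged
# file per partition is reasonable here. Adding a third level (e.g. day)
# would multiply the count by ~365 and shrink files to ~1.4 MB each -
# deep into small-file territory.
```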
9 © Cloudera, Inc. All rights reserved.
Small Files
1. The Hadoop cluster filesystem and the Yarn processing framework involve some overhead, which is:
   a. Negligible when the chunk of data stored and processed by each storage block or task unit is large enough.
   b. Expensive when the data involved is proportionately small.
2. Hadoop and big data processing in general are optimized for large file sizes.
3. Small files make disk reads random during data fetch and significantly reduce the performance of data retrieval.
4. There is no hard boundary defining what constitutes a small file, but:
   a. Files < 30 MB are not optimal.
   b. Files < 1 MB will significantly impact overall performance.
5. Numerous small files increase the size of the metadata held on the master nodes of the cluster filesystem.
   a. This impacts the responsiveness of the filesystem.
6. Metadata processing for numerous individual small files impacts the query planning stages in Hive and Spark.
   a. The internal structure of each file has to be read individually to get its footer/trailer metadata.
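The per-file overhead argument in point 1 can be made concrete with a toy cost model. The constants (per-file overhead, read throughput) are invented purely to show the shape of the effect, not measured values:

```python
# Illustrative cost model (made-up constants) of why small files hurt:
# every file adds a fixed open/schedule overhead on top of the actual read.

def scan_seconds(num_files, total_mb, per_file_overhead_s=0.5, read_mb_per_s=100):
    """Total scan time = fixed per-file overhead + raw sequential read time."""
    return num_files * per_file_overhead_s + total_mb / read_mb_per_s

total_mb = 10_000  # ~10 GB of table data in both layouts
many_small = scan_seconds(num_files=10_000, total_mb=total_mb)  # 10k x 1 MB
few_large = scan_seconds(num_files=80, total_mb=total_mb)       # 80 x 128 MB
# Same data either way, and the raw read time (100 s) is identical - but
# the small-file layout pays 5000 s of per-file overhead versus 40 s.
```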
10 © Cloudera, Inc. All rights reserved.
Storage optimizations
Parquet and ORC are the main file formats that are efficient and perform well in Hive.
1. Parquet was originally developed by Cloudera and Twitter - available in CDH clusters.
2. ORC was originally developed by Hortonworks, before the merger with Cloudera - available in CDP clusters.
3. Both formats store data in the file column-wise.
   a. Both keep useful metadata in the footer of the file.
4. Both formats first split data into sets of rows.
   a. ORC calls this a stripe, while Parquet refers to it as a row group.
5. Each set of rows then has its individual column data stored together.
   a. This helps both efficient compression and efficient retrieval of column-based query output.
6. Data within columns is generally stored using dictionary and RLE (Run-Length Encoding) methods.
7. Use the parquet-tools utility to view the metadata of a Parquet file.
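The row-group/stripe layout in points 4-5 can be sketched as follows. This is a simplification of the real file formats (rows and column names are hypothetical), but it shows rows being split into groups with each group's column values stored contiguously, plus the per-column min/max that ends up in the footer:

```python
# Conceptual sketch of the row-group (Parquet) / stripe (ORC) layout:
# rows are split into groups, and within each group every column's values
# are stored together, with min/max stats collected for the footer.

def to_row_groups(rows, group_size):
    """Return per-group column chunks plus min/max stats for each column."""
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        columns = {}
        for name in group[0]:
            values = [row[name] for row in group]
            columns[name] = {"values": values,
                             "min": min(values), "max": max(values)}
        groups.append(columns)
    return groups

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5},
        {"id": 3, "amount": 8}, {"id": 4, "amount": 20}]
groups = to_row_groups(rows, group_size=2)
# groups[0]["amount"] == {"values": [10, 5], "min": 5, "max": 10}
```

Because each column chunk carries its own stats, a reader can fetch just the columns a query selects and skip whole groups using the min/max values.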
11 © Cloudera, Inc. All rights reserved.
Query optimisation
1. Partitioning on a filter column enables data to be fetched only from selected subfolders in HDFS.
2. Column metadata is present in the footer of each ORC/Parquet file.
3. Footer data includes, among other things, min and max values for each column.
4. Footers also contain a dictionary list of column values for lower-cardinality columns.
5. These and similar pieces of metadata help in predicate pushdown to optimise queries.
6. Query engines can skip entire files during processing based on this metadata, e.g.:
   a. If the filter criteria fall outside the range identified by the per-column min-max values, the entire file can be skipped.
   b. If the filter criteria are not among the keys in the dictionary of column values, the entire file can be skipped.
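The min-max file skipping in point 6a can be sketched like this (file names and statistics are hypothetical):

```python
# Conceptual sketch of predicate pushdown using footer min/max stats:
# a file whose stats exclude the filter value is skipped without ever
# reading its data pages.

files = {
    "part-0000": {"min": 1,   "max": 100},
    "part-0001": {"min": 101, "max": 200},
    "part-0002": {"min": 201, "max": 300},
}

def files_to_scan(files, filter_value):
    """Keep only files whose [min, max] range could contain the value."""
    return [name for name, stats in files.items()
            if stats["min"] <= filter_value <= stats["max"]]

# A query with WHERE col = 150 needs only one of the three files.
to_scan = files_to_scan(files, 150)
```

The dictionary check in point 6b works the same way, but tests membership in the footer's dictionary keys instead of a numeric range.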
12 © Cloudera, Inc. All rights reserved.
Debugging Hive queries
1. Use the PerfLogger setting on the Hiveserver2 instance to debug Hive query performance.
2. Each stage of Hive query execution is separately timed and logged once PerfLogger is enabled.
3. The compile and Yarn stages are separately identifiable.
4. The compile stage normally should not take more than a few seconds.
   a. For complicated queries this may go up to a minute or so.
   b. 90% or more of query time is often spent in the Yarn stage.
5. Use the Hive session ID and the assigned thread ID to track individual user sessions in Hiveserver2.
6. Use the application ID logged against the Hive query ID to track the progress of a query in Yarn.
13 © Cloudera, Inc. All rights reserved.
Q&A - Part 1
1. How can tables with skewed data columns be partitioned?
   a. Use the Skewed By option.
2. Why is there a difference in performance between “BETWEEN” and “GREATER THAN OR EQUAL TO”?
   a. This needs to be investigated using EXPLAIN output.
   b. Speculative possibilities include differences in predicate pushdown or vectorisation.
3. If downstream use patterns are not known at table design time, what should be done?
   a. Optimise based on generally known patterns (query by year/month, etc.).
   b. Optimise based on ingest.
4. Can bucketing be done on tables after data ingest?
   a. No. The data would have to be fully re-organized, so it's better to create another bucketed table and copy into it.
14 © Cloudera, Inc. All rights reserved.
Q&A - Part 2
1. How is the HBase columnar format different from Hive columnar optimizations?
   a. HBase stores data of different column families in different folders in HDFS.
   b. Within those, the column data is still stored with the key alongside every cell value.
   c. HBase is extremely fast for key-based lookup and key-range-based lookup.
   d. Queries filtering on column values are an anti-pattern for HBase and will not perform well.
   e. Column-based filtering on HBase has to scan through all columns for the selected rows.
   f. The HBase file format design is not optimized for any join-type operations.
   g. Schema design in HBase primarily focuses on the way the row key is designed.
15 © Cloudera, Inc. All rights reserved.
Q&A - Part 3
1. Why does Spark not have its own metastore?
   a. Spark was designed as a data processing framework separate from Hadoop.
   b. The initial focus of Spark optimizations was the query engine.
   c. When Spark was introduced, data in Hadoop was already being catalogued in the Hive metastore.
   d. Spark and Hive both being open source, it was possible for Spark to talk directly to the Hive metastore.
   e. Hence there was no need for Spark to introduce a separate catalog.
2. Can Hive PerfLogger output be made available to users without access to Hiveserver2, the way Spark logs are accessible?
   a. The Driver process is where a query is compiled and handed to Yarn in Hadoop.
   b. The Driver in Spark sits on the Application Master of the launched job, so it is segregated per job.
   c. The Driver in Hive sits on the Hiveserver2 instance, so the logs for all queries sit on the central server.
16 © Cloudera, Inc. All rights reserved.
Thank you