ORACLESTORE: A HIGHLY PERFORMANT RAWSTORE IMPLEMENTATION FOR HIVE METASTORE
Chris Drome, Sr. Principal Engineer
Jin Sun, Principal Engineer
2
IT’S ABOUT SCALE
• 20+ clusters
• 2-4 dedicated metastore servers per cluster
• More including HS2 instances
• Large cluster:
• ~6-9M partitions over ~6K tables
• Largest table has ~1.1M partitions
• Medium cluster:
• ~3-6M partitions over ~3-4K tables
• Largest table has ~250K partitions
3
SO WHAT?
• Client-side timeouts
• Queries over large number of partitions
• socket.timeout=200s; socket.lifetime=300s
• Increased load and memory usage on metastore
• Concurrent connections across clients (server.max.threads=4000)
• Long running queries to retrieve large amounts of data
• Retrieve/convert/serialize duplicate data
• Abandoned/rerun operations
• Increased load on Oracle
• Concurrent connections across metastores
• Read/write operations on large tables take time
• Queries start to back up
4
Background & Issues
Goals
Implementation Details
Test Results
Final Thoughts
5
METASTORE ARCHITECTURE
[Diagram: layered stack, top to bottom]
• CLI clients
• Thrift Server
• Metastore Core Logic
• ObjectStore/DirectSQL
• ORM Model / DataNucleus
• DBCP
• JDBC
• Backing database: Oracle, SQL Server, Derby, MySQL, PostgreSQL
6
WHAT IS DATANUCLEUS?
• ORM framework
• ORM model classes
• ORM model object-relational mapping
• Executes queries via JDOQL
• Generates DB-specific SQL
WHAT IT MEANS
• Black box limits control; hampers debugging
• Requires two sets of classes
• ORM classes to interact with the DB
• Thrift classes in core logic and over wire
• Conversion between the two
• Object-relational impedance mismatch
• Limits opportunity to optimize schema
• Tendency to duplicate data needlessly
• Limits control of SQL
• Limits opportunity to optimize queries
• Clean-up thread to identify abandoned ids
7
WHAT IS DIRECTSQL?
• Custom code to improve performance
• Focus on get_partition operations
  • Yahoo! added drop_partitions
• DB agnostic SQL instead of JDOQL
• Timeouts and fallback to ORM
• Works through DataNucleus
• Adds tables to evaluate filters
• Batch retrieval of lists of objects
WHAT IT MEANS
• Greatly improves performance
• Large amount of code for specific functionality
• Requires code to identify underlying DB
• Failure/timeout results in long latency
• Constrained by DataNucleus
• Does not address fundamental issues
• Requires self-JOIN for each filter condition
• Deep table (# partitions x avg # partition columns)
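The "self-JOIN for each filter condition" point can be made concrete with a small sketch. It assumes a deep key/value layout with one row per partition per partition column, modelled here as a simplified PARTITION_KEY_VALS table in an in-memory SQLite database standing in for Oracle. The helper build_filter_sql is hypothetical and is not DirectSQL's actual code.

```python
import sqlite3

def build_filter_sql(conditions):
    """Build a query matching partitions that satisfy every (idx, op, value) condition.

    Each condition needs its own alias of the deep table: one self-JOIN per filter.
    """
    joins, where, params = [], [], []
    for n, (idx, op, value) in enumerate(conditions):
        if op not in ("=", ">=", "<=", "<", ">"):
            raise ValueError(f"unsupported operator: {op}")
        alias = f"F{n}"
        joins.append(
            f"JOIN PARTITION_KEY_VALS {alias} "
            f"ON {alias}.PART_ID = P.PART_ID AND {alias}.INTEGER_IDX = {int(idx)}"
        )
        where.append(f"{alias}.PART_KEY_VAL {op} ?")
        params.append(value)
    sql = "SELECT P.PART_ID FROM PARTITIONS P " + " ".join(joins)
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql + " ORDER BY P.PART_ID", params

# Simplified, assumed schema: SQLite stands in for the real backing database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY);
    CREATE TABLE PARTITION_KEY_VALS (PART_ID INTEGER, INTEGER_IDX INTEGER, PART_KEY_VAL TEXT);
""")
for part_id, dt, p1 in [(1, "20170101", "a"), (2, "20170102", "a"), (3, "20170102", "b")]:
    conn.execute("INSERT INTO PARTITIONS VALUES (?)", (part_id,))
    conn.executemany("INSERT INTO PARTITION_KEY_VALS VALUES (?, ?, ?)",
                     [(part_id, 0, dt), (part_id, 1, p1)])

# Range filter on dt (column 0) plus equality filter on p1 (column 1): two self-JOINs.
sql, params = build_filter_sql([(0, ">=", "20170102"), (1, "=", "a")])
matching = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(matching)
```

Every extra filter adds another join against a table whose row count is roughly (# partitions x # partition columns), which is why the deck treats the deep layout as a fundamental issue.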
9
GOALS
• Reduce load on Oracle
• Fix database schema inefficiencies
  • Foundation for more performant queries
  • Reduce the storage/computational requirements of data
  • Better utilize native constructs and SQL
• Fix database layer inefficiencies
  • Improve performance characteristics of SQL
  • Improve maintainability of code
• Address repetitive/redundant requests for data (future)
• Reduce payload over wire
• Optimize client-server communication protocol (future)
11
WHAT SHOULD WE DO?
• Recognize there is a problem
• Understand effects of original schema on query performance
• Identify pain points and areas for improvement
• Design more performant schema
• Write OracleStore
• Leverage lessons learned from DirectSQL
• Create migration and validation toolset
• How to migrate existing data?
• How to ensure data integrity?
• How to roll back if necessary?
• Deploy it!
12
LESSONS LEARNED
• Object structure should not dictate schema
• Operations on Tables/Partitions are king
• Most frequent and most expensive operations
• Group data specific to Tables/Partitions
• Promote table columns to TBLS/PARTITIONS
• Direct references to all satellite data
• Invert relationships between member objects
• Merge tables
• Gets rid of needless JOINs
• Deduplicate, deduplicate, deduplicate
13
LESSONS LEARNED
Table             ObjectStore  OracleStore  +/-      Comment
SDS               ~3.3M        6            n/a      Restructure; Dedup
SERDES            ~3.3M        11           n/a      Merge; Dedup
SDS JOIN SERDES   ~3.3M        15           -100%    Dedup
COLUMNS_V2        ~24.9M       ~0.7M        -97.2%   Dedup
TABLE_PARAMS      ~0.06M       n/a          n/a      Merge; Dedup
PARTITION_PARAMS  ~11.5M       ~0.1M        -98.9%   Merge; Dedup
SD_PARAMS         0            0            0.0%
SERDE_PARAMS      ~5.4M        ~0.02M       -99.7%   Dedup
14
SCHEMA REDESIGN
• OracleStore tables should co-exist with ObjectStore tables
• Utilize native constructs and features
• SEQUENCE
• FOREIGN KEY … CASCADE
• LIMIT
• Oracle built-in functions
• One degree of separation from TBLS/PARTITIONS
• Attribute tables
• PARAM tables
• De-emphasize importance of SDS
• Promote SDS.LOCATION, SDS.CD_ID to TBLS/PARTITIONS
• No indexes on Oracle tables yet
15
ORACLESTORE IMPLEMENTATION
• OracleStore implements RawStore
• HBaseStore (HIVE-9453)
• OracleStore co-exists with ObjectStore
• Code changes are additive
  • Annotations to identify read/write operations
  • Log messages display performance numbers
  • HybridRawStoreProxy for tee’d reads/writes
• Aggressive deduplication of data
• Batched retrieval of lists of objects
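The tee idea behind HybridRawStoreProxy can be sketched roughly as follows. This is a minimal illustration and not the actual Hive class: DictStore, TeeStoreProxy and the WRITE_OPS set are hypothetical names, the set stands in for the read/write annotations, and reads are served from the primary store only for brevity (the slide mentions tee'd reads as well).

```python
import time

class DictStore:
    """Toy in-memory store standing in for ObjectStore or OracleStore."""
    def __init__(self, name):
        self.name = name
        self.tables = {}

    def create_table(self, db, name, table):
        self.tables[(db, name)] = table

    def get_table(self, db, name):
        return self.tables.get((db, name))

class TeeStoreProxy:
    """Tee writes to both stores; serve reads from the primary only."""
    WRITE_OPS = {"create_table"}  # hypothetical stand-in for write annotations

    def __init__(self, primary, secondary, log=print):
        self.primary, self.secondary, self.log = primary, secondary, log

    def __getattr__(self, op):
        # Only reached for names not set in __init__, i.e. store operations.
        def call(*args, **kwargs):
            start = time.perf_counter()
            result = getattr(self.primary, op)(*args, **kwargs)
            if op in self.WRITE_OPS:
                getattr(self.secondary, op)(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.log(f"{op} took {elapsed_ms:.3f} ms")
            return result
        return call

object_store, oracle_store = DictStore("ObjectStore"), DictStore("OracleStore")
messages = []
proxy = TeeStoreProxy(object_store, oracle_store, log=messages.append)
proxy.create_table("db1", "t1", {"owner": "hive"})
table = proxy.get_table("db1", "t1")
```

Because both sets of tables receive every write, either store can be made primary again, which is what gives the rollback path.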
16
METASTORE ARCHITECTURE
[Diagram: the same stack with OracleStore added beside ObjectStore/DirectSQL]
• CLI clients
• Thrift Server
• Metastore Core Logic
• ObjectStore/DirectSQL path: ORM Model / DataNucleus, DBCP, JDBC; databases: Oracle, SQL Server, Derby, MySQL, PostgreSQL
• OracleStore path: DBCP, JDBC; database: Oracle
18
METASTORE OPERATIONS
Operation % of Total
get_table 54.1%
get_database 10.0%
get_function 6.9%
get_partitions_by_filter 6.7%
add_partitions 3.9%
get_delegation_token 3.6%
drop_partitions 3.0%
get_partitions_with_auth 3.0%
get_all_databases 2.4%
get_partitions_by_expr 2.1%
other 4.3%
19
METASTORE OPERATIONS
• # databases: 643
• # functions: 6
• # tables: 12170
20
METASTORE OPERATIONS
• # partitions (total): 3337925
• # partitions (table): 88261
• # partition columns: 6
• dt=20170101/p1=a/p2=b/p3=c/p4=d/p5=e
• Range query on dt
• 1 hour and 4 hour increments
• OracleStore 2971ms latency on 13K partitions
• 46x faster than ObjectStore
• 13x faster than DirectSQL
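As a rough back-of-the-envelope check, the quoted speedups imply the following latencies for the other two paths. This assumes "Nx faster" means N times the OracleStore latency; the speedups are rounded on the slide, so the results are estimates and not measured values.

```python
ORACLESTORE_MS = 2971          # from the slide: latency on ~13K partitions
SPEEDUP_VS_OBJECTSTORE = 46    # from the slide
SPEEDUP_VS_DIRECTSQL = 13      # from the slide
CLIENT_SOCKET_TIMEOUT_S = 200  # socket.timeout quoted earlier in the deck

def implied_latency_s(base_ms, speedup):
    """Latency implied for the slower path, assuming 'Nx faster' means N times the latency."""
    return base_ms * speedup / 1000.0

objectstore_s = implied_latency_s(ORACLESTORE_MS, SPEEDUP_VS_OBJECTSTORE)
directsql_s = implied_latency_s(ORACLESTORE_MS, SPEEDUP_VS_DIRECTSQL)

print(f"ObjectStore: ~{objectstore_s:.1f} s")
print(f"DirectSQL:   ~{directsql_s:.1f} s")
print(f"Client socket timeout: {CLIENT_SOCKET_TIMEOUT_S} s")
```

Under that assumption the ObjectStore path already spends well over half of the 200s client timeout on a single range query of this size.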
21
METASTORE OPERATIONS
• # partitions (total): 3337925
• # partitions (table): 88261
• # partition columns: 6
• dt=20170101/p1=a/p2=b/p3=c/p4=d/p5=e
• Range query on dt and equality filter on p1=abc
• 1 hour and 4 hour increments
• OracleStore latency is relatively constant at this scale
22
METASTORE OPERATIONS
• # partitions (total): 3337925
• # partitions (table): 88261
• # partition columns: 6
• dt=20170101/p1=a/p2=b/p3=c/p4=d/p5=e
• Range query on dt and equality filter on p1=abc and p2=xyz
• 1 hour and 4 hour increments
• OracleStore latency is relatively constant at this scale
23
METASTORE OPERATIONS (AUDIT)
[Chart; operations in legend order: get_table, get_function, get_database, get_partition_with_auth, get_all_databases, get_partitions_ps_with_auth, get_partitions_names_ps, get_partitions_by_filter, get_index_names, get_partition, get_indexes, get_all_tables, get_multi_table, get_partitions, get_partition_names, get_tables, get_databases, get_table_statistics_req]
24
METASTORE OPERATIONS (AUDIT)
[Chart; operations in legend order: get_partitions, get_table, get_partitions_by_filter, get_partition_with_auth, get_multi_table, get_partitions_ps_with_auth, get_function, get_partitions_names_ps, get_all_databases, get_database, get_partition, get_partition_names, get_indexes, get_index_names, get_all_tables, get_tables, get_databases, get_table_statistics_req]
28
SHIP IT
• Deployed to 4 clusters
• Configured with HybridRawStoreProxy
• Tee writes to both sets of tables
• Provides rollback path
• Scheduled validation process
• Verifies data integrity
• Compares every object and reports differences
• HIVE-14870
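A validation pass of the kind described above could look roughly like this. It is a minimal sketch that models each store as a plain dict of nested metadata objects; diff_objects and validate are hypothetical helpers and not the deployed validation tool.

```python
def diff_objects(path, left, right):
    """Yield (path, left, right) for every leaf where two metadata objects differ."""
    if isinstance(left, dict) and isinstance(right, dict):
        for key in sorted(set(left) | set(right)):
            yield from diff_objects(f"{path}.{key}", left.get(key), right.get(key))
    elif left != right:
        yield (path, left, right)

def validate(object_store, oracle_store):
    """Compare every object present in either store and collect the differences."""
    report = []
    for name in sorted(set(object_store) | set(oracle_store)):
        report.extend(diff_objects(name, object_store.get(name), oracle_store.get(name)))
    return report

# Toy data: one parameter differs and one table is missing from the second store.
old = {"db1.t1": {"owner": "hive", "params": {"a": "1"}}, "db1.t2": {"owner": "etl"}}
new = {"db1.t1": {"owner": "hive", "params": {"a": "2"}}}
report = validate(old, new)
for path, left, right in report:
    print(f"{path}: {left!r} != {right!r}")
```

An empty report means the two sets of tables agree for every compared object.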
29
FUTURE WORK
• Reduce impact of redundant calls
• HIVE-9453 introduces per query object cache
• Optimize Thrift layer communication protocol
• Lessons learned from schema redesign
• Thrift objects should promote deduplication of data
31
/* CODE COMMENTS */
• Use sparingly, we don't want to devolve into another DataNucleus...
• Get partition objects for the query using direct SQL queries, to avoid bazillion queries created by DN retrieving stuff for each object individually.
• Essentially it's an object join. DN could do this for us, but it issues queries separately for every object, which is suboptimal.
• Makes shallow copy of a list to avoid DataNucleus mucking with our objects.
• DataNucleus objects get detached all over the place for no (real) reason. So let's not use them anywhere unless absolutely necessary.
• We have to get mtable again because DataNucleus.
• We need Partition-s for firing events and for result; DN needs MPartition-s to drop. Great... Maybe we could bypass fetching MPartitions by issuing direct SQL deletes.
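The comments above contrast per-object queries with batched retrieval. A minimal sketch of the batched approach follows, using an in-memory SQLite table as a stand-in for the backing database; fetch_params and chunks are hypothetical helpers. The batch cap reflects Oracle's limit of 1000 expressions in an IN list (ORA-01795).

```python
import sqlite3

MAX_IN_LIST = 1000  # Oracle rejects IN lists with more than 1000 expressions (ORA-01795)

def chunks(ids, size=MAX_IN_LIST):
    """Split a list of ids into consecutive batches of at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_params(conn, part_ids, batch_size=MAX_IN_LIST):
    """Fetch parameters for many partitions with one query per batch, not one per partition."""
    result = {pid: {} for pid in part_ids}
    queries = 0
    for batch in chunks(part_ids, batch_size):
        placeholders = ", ".join("?" * len(batch))
        sql = ("SELECT PART_ID, PARAM_KEY, PARAM_VALUE FROM PARTITION_PARAMS "
               f"WHERE PART_ID IN ({placeholders})")
        for pid, key, value in conn.execute(sql, batch):
            result[pid][key] = value
        queries += 1
    return result, queries

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_PARAMS (PART_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT)")
conn.executemany("INSERT INTO PARTITION_PARAMS VALUES (?, 'numFiles', ?)",
                 [(pid, str(pid % 7)) for pid in range(250)])

# A small batch size keeps the demo well inside SQLite's bind-variable limit.
params, queries = fetch_params(conn, list(range(250)), batch_size=100)
print(queries, "queries for", len(params), "partitions")
```

With 250 partitions and a batch size of 100 this issues 3 queries where a per-object approach would issue 250.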
32
METASTORE OPERATIONS
• # partitions (total): 3337925
• # partitions (table): 765
• # partition columns: 2
• job_ts, dt
• Query for specific values of job_ts, dt
33
OBJECTSTORE TBLS
CREATE TABLE TBLS (
    TBL_ID NUMBER,
    CREATE_TIME NUMBER,
    DB_ID NUMBER,
    LAST_ACCESS_TIME NUMBER,
    OWNER VARCHAR,
    RETENTION NUMBER,
    SD_ID NUMBER,
    TBL_NAME VARCHAR,
    TBL_TYPE VARCHAR,
    VIEW_EXPANDED_TEXT CLOB,
    VIEW_ORIGINAL_TEXT CLOB
)
ORACLESTORE V2_TBLS
CREATE TABLE V2_TBLS (
    TBL_ID NUMBER,
    DB_ID NUMBER,
    SD_ID NUMBER,
    CD_ID NUMBER,
    SD_PARAM_ID NUMBER,
    SERDE_PARAM_ID NUMBER,
    NAME VARCHAR,
    TYPE VARCHAR,
    OWNER_NAME VARCHAR,
    LOCATION VARCHAR,
    RETENTION NUMBER,
    CREATION_TIME NUMBER,
    LAST_MODIFIED_TIME NUMBER,
    LAST_ACCESS_TIME NUMBER,
    BUCKET_ID NUMBER,
    NUM_BUCKETS NUMBER,
    VIEW_EXPANDED_TEXT CLOB,
    VIEW_ORIGINAL_TEXT CLOB
)
34
OBJECTSTORE SDS
CREATE TABLE SDS (
    SD_ID NUMBER,
    CD_ID NUMBER,
    INPUT_FORMAT VARCHAR,
    IS_COMPRESSED NUMBER,
    LOCATION VARCHAR,
    NUM_BUCKETS NUMBER,
    OUTPUT_FORMAT VARCHAR,
    SERDE_ID NUMBER,
    IS_STOREDASSUBDIRECTORIES NUMBER
)
CREATE TABLE SERDES (
    SERDE_ID NUMBER,
    NAME VARCHAR,
    SLIB VARCHAR
)
ORACLESTORE V2_SDS
CREATE TABLE V2_SDS (
    SD_ID NUMBER,
    HASHCODE NUMBER,
    IS_COMPRESSED NUMBER,
    IS_STOREDASSUBDIRECTORIES NUMBER,
    INPUT_FORMAT VARCHAR,
    OUTPUT_FORMAT VARCHAR,
    SERDE_NAME VARCHAR,
    SERDE_LIB VARCHAR
)
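V2_SDS carries a HASHCODE column, which suggests content-based deduplication of storage descriptors. The deck does not show how the hash is computed or used, so the following is only a sketch under assumptions: StorageDescriptorTable is a hypothetical in-memory stand-in for the table, SHA-256 is an arbitrary choice of hash, and the format and SerDe names are shortened placeholders.

```python
import hashlib

class StorageDescriptorTable:
    """Toy stand-in for V2_SDS: one row per distinct descriptor, located via its hash."""
    def __init__(self):
        self.rows = {}      # SD_ID -> descriptor tuple
        self.by_hash = {}   # HASHCODE -> list of SD_IDs (a list tolerates collisions)
        self.next_id = 1    # stands in for a database SEQUENCE

    @staticmethod
    def hashcode(descriptor):
        """Derive a 64-bit hash from the descriptor's fields (assumed scheme)."""
        joined = "\x00".join(map(str, descriptor)).encode("utf-8")
        return int.from_bytes(hashlib.sha256(joined).digest()[:8], "big")

    def intern(self, descriptor):
        """Return the SD_ID for descriptor, inserting a row only if it is new."""
        code = self.hashcode(descriptor)
        for sd_id in self.by_hash.get(code, []):
            if self.rows[sd_id] == descriptor:
                return sd_id
        sd_id, self.next_id = self.next_id, self.next_id + 1
        self.rows[sd_id] = descriptor
        self.by_hash.setdefault(code, []).append(sd_id)
        return sd_id

# Fields follow the V2_SDS column order after SD_ID and HASHCODE; values are placeholders.
orc = (0, 0, "OrcInputFormat", "OrcOutputFormat", "orc", "OrcSerde")
text = (0, 0, "TextInputFormat", "TextOutputFormat", "text", "LazySimpleSerDe")
sds = StorageDescriptorTable()
ids = [sds.intern(d) for d in [orc] * 1000 + [text] * 500 + [orc]]
print(len(ids), "references,", len(sds.rows), "stored rows")
```

Because LOCATION and CD_ID are promoted out of the descriptor, millions of partitions can share a handful of rows, which is consistent with the single-digit and double-digit row counts reported for SDS and SERDES.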
35
OBJECTSTORE GET_TABLE
SELECT DISTINCT ... FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0."NAME" = ?
SELECT ... FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID LEFT OUTER JOIN SDS C0 ON A0.SD_ID = C0.SD_ID WHERE A0.TBL_ID = ?
SELECT ... FROM TABLE_PARAMS A0 WHERE A0.TBL_ID = ?
SELECT ... FROM PARTITION_KEYS A0 WHERE A0.TBL_ID = ? AND A0.INTEGER_IDX >= 0 ORDER BY NUCORDER0
SELECT B0.CD_ID FROM SDS A0 LEFT OUTER JOIN CDS B0 ON A0.CD_ID = B0.CD_ID WHERE A0.SD_ID = ?
SELECT COUNT(*) FROM COLUMNS_V2 THIS WHERE CD_ID=?
SELECT ... FROM COLUMNS_V2 A0 WHERE A0.CD_ID = ? ORDER BY NUCORDER0
SELECT ... FROM SDS A0 LEFT OUTER JOIN SERDES B0 ON A0.SERDE_ID = B0.SERDE_ID WHERE A0.SD_ID = ?
SELECT ... FROM SERDE_PARAMS A0 WHERE A0.SERDE_ID = ?
...
ORACLESTORE GET_TABLE
SELECT ... FROM V2_TBLS WHERE DB_ID = ? AND NAME = ?
SELECT ... FROM V2_PARTITION_COLS WHERE TBL_ID = ? ORDER BY POSITION ASC
SELECT ... FROM V2_SDS WHERE SD_ID = ?
SELECT ... FROM V2_TBL_COLS WHERE CD_ID = ? ORDER BY POSITION ASC
SELECT ... FROM V2_SD_PARAMS WHERE SD_PARAM_ID = ?
SELECT ... FROM V2_SERDE_PARAMS WHERE SERDE_PARAM_ID = ?
SELECT ... FROM V2_TBL_PARAMS WHERE TBL_ID = ?
36
"Object-Relational Mapping is the Vietnam of our industry" - Ted Neward
37
ORM ALTERNATIVES
… nice quick initial development, and a big drain on your resources further on in the project when tracking ORM related bugs and inefficiencies
• Hibernate
• HQL; generated SQL; generated skeleton code; de facto driver of JPA
• jOOQ
• Light database mapping; jOOQ DSL; generated SQL; generated code
• Apache Cayenne
• GUI; generated SQL; generated skeleton code; nested contexts
• TopLink
• Commercial
38
GET_TABLE ISSUES
• get_table request is non-trivial
• Requires 19 queries to create one Table object
  • 6 multi-table JOIN queries with tables containing millions of records
  • 3 queries to populate auxiliary parameters
  • 4 COUNT queries to determine existence
• get_table is called multiple times during plan generation
• Unnecessarily queries the same data
• Potential data consistency problems
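Repeated get_table calls during plan generation are the case a per-query object cache (listed under future work in this deck) is meant to address. A minimal sketch of the idea follows; CountingStore and PerQueryCache are hypothetical classes and not Hive code.

```python
class CountingStore:
    """Toy backing store that counts how often get_table reaches the database."""
    def __init__(self, tables):
        self.tables = tables
        self.calls = 0

    def get_table(self, db, name):
        self.calls += 1
        return self.tables[(db, name)]

class PerQueryCache:
    """Memoize get_table for the lifetime of one query's planning phase."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get_table(self, db, name):
        key = (db, name)
        if key not in self.cache:
            self.cache[key] = self.store.get_table(db, name)
        return self.cache[key]

store = CountingStore({("db1", "t1"): {"owner": "hive"}})
planner_view = PerQueryCache(store)  # created per query, discarded afterwards
results = [planner_view.get_table("db1", "t1") for _ in range(5)]
print(store.calls, "backing-store call(s) for", len(results), "lookups")
```

Scoping the cache to a single query also means every lookup within that query sees the same snapshot of the table, which speaks to the consistency concern above.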
39
THRIFT OBJECT ISSUES
• Thrift objects map to ORM objects one-to-one
• Inherits inefficiencies in the ORM model
• Duplicates data across lists of objects
• Requires conversion between model and Thrift objects
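The cost of duplicating data across lists of objects can be illustrated with a small sketch. JSON stands in for the Thrift wire format here, and the payload layout (a shared descriptors list plus sd_ref indexes) is a hypothetical design, not an existing Hive API.

```python
import json

def dedup_payload(partitions):
    """Replace each partition's embedded descriptor with an index into a shared list."""
    descriptors, index_of, slim = [], {}, []
    for part in partitions:
        key = json.dumps(part["sd"], sort_keys=True)
        if key not in index_of:
            index_of[key] = len(descriptors)
            descriptors.append(part["sd"])
        slim.append({"values": part["values"], "sd_ref": index_of[key]})
    return {"descriptors": descriptors, "partitions": slim}

def expand_payload(payload):
    """Inverse of dedup_payload, as a client could apply after receiving the response."""
    return [{"values": p["values"], "sd": payload["descriptors"][p["sd_ref"]]}
            for p in payload["partitions"]]

# 24 hourly partitions that all share one 50-column storage descriptor (toy data).
sd = {"input_format": "OrcInputFormat", "output_format": "OrcOutputFormat",
      "cols": [{"name": f"c{i}", "type": "string"} for i in range(50)]}
partitions = [{"values": ["20170101", str(hour)], "sd": sd} for hour in range(24)]

payload = dedup_payload(partitions)
full_size = len(json.dumps(partitions))
slim_size = len(json.dumps(payload))
print(f"{full_size} bytes duplicated vs {slim_size} bytes deduplicated")
```

The sketch assumes per-partition fields such as location live outside the shared descriptor, mirroring the schema redesign's promotion of SDS.LOCATION to the partition level.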
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore

  • 1. ORACLESTORE: A HIGHLY PERFORMANT RAWSTORE IMPLEMENTATION FOR HIVE METASTORE
      Chris Drome, Sr. Principal Engineer
      Jin Sun, Principal Engineer
  • 2. IT’S ABOUT SCALE
      • 20+ clusters
      • 2-4 dedicated metastore servers per cluster
          • More including HS2 instances
      • Large cluster:
          • ~6-9M partitions over ~6K tables
          • Largest table has ~1.1M partitions
      • Medium cluster:
          • ~3-6M partitions over ~3-4K tables
          • Largest table has ~250K partitions
  • 3. SO WHAT?
      • Client-side timeouts
          • Queries over a large number of partitions
          • socket.timeout=200s; socket.lifetime=300s
      • Increased load and memory usage on metastore
          • Concurrent connections across clients (server.max.threads=4000)
          • Long-running queries to retrieve large amounts of data
          • Retrieve/convert/serialize duplicate data
          • Abandoned/rerun operations
      • Increased load on Oracle
          • Concurrent connections across metastores
          • Read/write operations on large tables take time
          • Queries start to back up
  • 4. (section overview)
      • Background & Issues
      • Goals
      • Implementation Details
      • Test Results
      • Final Thoughts
  • 5. METASTORE ARCHITECTURE (diagram; components as labeled on the slide): CLI clients; Thrift Server; Metastore Core Logic; ObjectStore/DirectSQL; DataNucleus (ORM Model); DBCP; JDBC; backing databases: Oracle, SQL Server, Derby, MySQL, PostgreSQL
  • 6. WHAT IS DATANUCLEUS?
      • ORM framework
      • ORM model classes
      • ORM model object-relational mapping
      • Executes queries via JDOQL
      • Generates DB-specific SQL
    WHAT IT MEANS
      • Black box limits control; hampers debugging
      • Requires two sets of classes
          • ORM classes to interact with the DB
          • Thrift classes in core logic and over wire
          • Conversion between the two
      • Object-relational impedance mismatch
          • Limits opportunity to optimize schema
          • Tendency to duplicate data needlessly
      • Limits control of SQL
          • Limits opportunity to optimize queries
      • Clean-up thread to identify abandoned ids
  • 7. WHAT IS DIRECTSQL?
      • Custom code to improve performance
      • Focus on get_partition operations
          • Yahoo! added drop_partitions
      • DB agnostic SQL instead of JDOQL
      • Timeouts and fallback to ORM
      • Works through DataNucleus
      • Adds tables to evaluate filters
      • Batch retrieval of lists of objects
    WHAT IT MEANS
      • Greatly improves performance
      • Large amount of code for specific functionality
      • Requires code to identify underlying DB
      • Failure/timeout results in long latency
      • Constrained by DataNucleus
      • Does not address fundamental issues
          • Requires self-JOIN for each filter condition
          • Deep table (# partitions x avg # partition columns)
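The "self-JOIN for each filter condition" point on slide 7 can be illustrated with a small sketch. It is not the MetaStoreDirectSQL code: the helper build_filter_query is hypothetical, the two tables are cut down to the columns needed here, and SQLite stands in for Oracle. It only shows the shape of the problem: one row per partition per partition column, and one more join for every filter.

```python
import sqlite3

def build_filter_query(filters):
    """Hypothetical helper: one PARTITION_KEY_VALUES join per filter.

    `filters` maps a partition-column position to the required value.
    """
    joins, params = [], []
    for n, (idx, value) in enumerate(sorted(filters.items())):
        alias = f"F{n}"
        joins.append(
            f"JOIN PARTITION_KEY_VALUES {alias} "
            f"ON {alias}.PART_ID = P.PART_ID "
            f"AND {alias}.INTEGER_IDX = ? AND {alias}.PART_KEY_VAL = ?"
        )
        params += [idx, value]
    sql = ("SELECT P.PART_NAME FROM PARTITIONS P "
           + " ".join(joins) + " ORDER BY P.PART_NAME")
    return sql, params

# Simplified stand-ins for the metastore tables (assumption: only these columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT);
CREATE TABLE PARTITION_KEY_VALUES (PART_ID INTEGER, INTEGER_IDX INTEGER, PART_KEY_VAL TEXT);
""")
names = ["dt=20170101/p1=a", "dt=20170101/p1=b", "dt=20170102/p1=a"]
for part_id, name in enumerate(names, start=1):
    conn.execute("INSERT INTO PARTITIONS VALUES (?, ?)", (part_id, name))
    # The "deep table": one key-value row per partition column.
    for idx, kv in enumerate(name.split("/")):
        conn.execute("INSERT INTO PARTITION_KEY_VALUES VALUES (?, ?, ?)",
                     (part_id, idx, kv.split("=")[1]))

sql, params = build_filter_query({0: "20170101", 1: "a"})
rows = [r[0] for r in conn.execute(sql, params)]
print(sql.count("JOIN"), rows)
```

With two filters the generated statement carries two joins against the key-value table, and each additional filter adds another.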
  • 9. GOALS
      • Reduce load on Oracle
      • Fix database schema inefficiencies
          • Foundation for more performant queries
          • Reduce the storage/computational requirements of data
          • Better utilize native constructs and SQL
      • Fix database layer inefficiencies
          • Improve performance characteristics of SQL
          • Improve maintainability of code
      • Address repetitive/redundant requests for data (future)
      • Reduce payload over wire
      • Optimize client-server communication protocol (future)
  • 11. WHAT SHOULD WE DO?
      • Recognize there is a problem
      • Understand effects of original schema on query performance
      • Identify pain points and areas for improvement
      • Design more performant schema
      • Write OracleStore
          • Leverage lessons learned from DirectSQL
      • Create migration and validation toolset
          • How to migrate existing data?
          • How to ensure data integrity?
          • How to roll back if necessary?
      • Deploy it!
  • 12. LESSONS LEARNED
      • Object structure should not dictate schema
      • Operations on Tables/Partitions are king
          • Most frequent and most expensive operations
      • Group data specific to Tables/Partitions
          • Promote table columns to TBLS/PARTITIONS
          • Direct references to all satellite data
          • Invert relationships between member objects
      • Merge tables
          • Gets rid of needless JOINs
      • Deduplicate, deduplicate, deduplicate
  • 13. LESSONS LEARNED
      Table              ObjectStore   OracleStore   +/-       Comment
      SDS                ~3.3M         6             n/a       Restructure; Dedup
      SERDES             ~3.3M         11            n/a       Merge; Dedup
      SDS JOIN SERDES    ~3.3M         15            -100%     Dedup
      COLUMNS_V2         ~24.9M        ~0.7M         -97.2%    Dedup
      TABLE_PARAMS       ~0.06M        n/a           n/a       Merge; Dedup
      PARTITION_PARAMS   ~11.5M        ~0.1M         -98.9%    Merge; Dedup
      SD_PARAMS          0             0             0.0%
      SERDE_PARAMS       ~5.4M         ~0.02M        -99.7%    Dedup
  • 14. SCHEMA REDESIGN
      • OracleStore tables should co-exist with ObjectStore tables
      • Utilize native constructs and features
          • SEQUENCE
          • FOREIGN KEY … CASCADE
          • LIMIT
          • Oracle built-in functions
      • One degree of separation from TBLS/PARTITIONS
          • Attribute tables
          • PARAM tables
      • De-emphasize importance of SDS
          • Promote SDS.LOCATION, SDS.CD_ID to TBLS/PARTITIONS
      • No indexes on Oracle tables yet
  • 15. ORACLESTORE IMPLEMENTATION
      • OracleStore implements RawStore
          • HBaseStore (HIVE-9453)
      • OracleStore co-exists with ObjectStore
      • Code changes are additive
          • Annotations to identify read/write operations
          • Log messages display performance numbers
          • HybridRawStoreProxy for tee’d reads/writes
      • Aggressive deduplication of data
      • Batched retrieval of lists of objects
  • 16. METASTORE ARCHITECTURE (diagram; components as labeled on the slide): CLI clients; Thrift Server; Metastore Core Logic; ObjectStore/DirectSQL; DataNucleus (ORM Model); OracleStore; DBCP; JDBC; backing databases: Oracle, SQL Server, Derby, MySQL, PostgreSQL
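The "aggressive deduplication" on the implementation slides can be pictured as storing each distinct descriptor once and letting every table or partition reference it by id. The sketch below is an assumption about the general technique, not OracleStore's code: DedupTable and its intern method are hypothetical names, and a content hash is used as the lookup key only because the redesigned V2_SDS table carries a HASHCODE column. Hash collisions are ignored for brevity.

```python
import hashlib

class DedupTable:
    """Hypothetical in-memory table that stores each distinct row once."""

    def __init__(self):
        self._id_by_hash = {}
        self.rows = {}

    def intern(self, row):
        # Hash a canonical rendering of the row so equal rows map to one id.
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        if digest not in self._id_by_hash:
            new_id = len(self.rows) + 1
            self._id_by_hash[digest] = new_id
            self.rows[new_id] = dict(row)
        return self._id_by_hash[digest]

orc = {
    "INPUT_FORMAT": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "OUTPUT_FORMAT": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    "SERDE_LIB": "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
}
text = {
    "INPUT_FORMAT": "org.apache.hadoop.mapred.TextInputFormat",
    "OUTPUT_FORMAT": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SERDE_LIB": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
}

sds = DedupTable()
# 1000 partitions, but only two distinct storage descriptors among them.
partition_sd_ids = [sds.intern(text if i % 10 == 0 else orc) for i in range(1000)]
print(len(partition_sd_ids), "references,", len(sds.rows), "stored rows")
```

Long class-name strings repeated once per partition are the kind of data that shrinks most under this scheme.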
  • 18. METASTORE OPERATIONS
      Operation                  % of Total
      get_table                  54.1%
      get_database               10.0%
      get_function               6.9%
      get_partitions_by_filter   6.7%
      add_partitions             3.9%
      get_delegation_token       3.6%
      drop_partitions            3.0%
      get_partitions_with_auth   3.0%
      get_all_databases          2.4%
      get_partitions_by_expr     2.1%
      other                      4.3%
  • 19. METASTORE OPERATIONS
      • # databases: 643
      • # functions: 6
      • # tables: 12170
  • 20. METASTORE OPERATIONS
      • # partitions (total): 3337925
      • # partitions (table): 88261
      • # partition columns: 6
          • dt=20170101/p1=a/p2=b/p3=c/p4=d/p5=e
      • Range query on dt
          • 1 hour and 4 hour increments
      • OracleStore 2971ms latency on 13K partitions
          • 46x faster than ObjectStore
          • 13x faster than DirectSQL
  • 21. METASTORE OPERATIONS
      • # partitions (total): 3337925
      • # partitions (table): 88261
      • # partition columns: 6
          • dt=20170101/p1=a/p2=b/p3=c/p4=d/p5=e
      • Range query on dt and equality filter on p1=abc
          • 1 hour and 4 hour increments
      • OracleStore latency is relatively constant at this scale
  • 22. METASTORE OPERATIONS
      • # partitions (total): 3337925
      • # partitions (table): 88261
      • # partition columns: 6
          • dt=20170101/p1=a/p2=b/p3=c/p4=d/p5=e
      • Range query on dt and equality filter on p1=abc and p2=xyz
          • 1 hour and 4 hour increments
      • OracleStore latency is relatively constant at this scale
  • 28. SHIP IT
      • Deployed to 4 clusters
      • Configured with HybridRawStoreProxy
          • Tee writes to both sets of tables
          • Provides rollback path
      • Scheduled validation process
          • Verifies data integrity
          • Compares every object and reports differences
      • HIVE-14870
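The deployment approach on slide 28 (tee writes to both sets of tables, then validate on a schedule) can be sketched roughly as follows. None of this is the HybridRawStoreProxy source: DictStore, TeeStore and validate are hypothetical names, the stores are in-memory dictionaries, and serving reads from the primary only is an assumption made to keep the example small.

```python
class DictStore:
    """Toy stand-in for one RawStore backend (hypothetical, in-memory)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, table):
        self.tables[name] = dict(table)

    def get_table(self, name):
        return self.tables.get(name)


class TeeStore:
    """Sends every write to both backends; reads come from the primary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def create_table(self, name, table):
        self.primary.create_table(name, table)
        self.secondary.create_table(name, table)

    def get_table(self, name):
        return self.primary.get_table(name)


def validate(primary, secondary):
    """Compare every object in both stores and return the names that differ."""
    names = set(primary.tables) | set(secondary.tables)
    return sorted(n for n in names
                  if primary.get_table(n) != secondary.get_table(n))


old_store, new_store = DictStore(), DictStore()
store = TeeStore(old_store, new_store)
store.create_table("db.t1", {"owner": "alice", "location": "/data/t1"})
store.create_table("db.t2", {"owner": "bob", "location": "/data/t2"})

clean = validate(old_store, new_store)
# Simulate drift in one backend to show what the validation pass reports.
new_store.tables["db.t2"]["location"] = "/data/other"
drifted = validate(old_store, new_store)
print(clean, drifted)
```

Because both backends receive every write, either one can serve as the rollback target while the comparison pass builds confidence in the new tables.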
  • 29. FUTURE WORK
      • Reduce impact of redundant calls
          • HIVE-9453 introduces per query object cache
      • Optimize Thrift layer communication protocol
          • Lessons learned from schema redesign
          • Thrift objects should promote deduplication of data
  • 31. /* CODE COMMENTS */
      • Use sparingly, we don't want to devolve into another DataNucleus...
      • Get partition objects for the query using direct SQL queries, to avoid bazillion queries created by DN retrieving stuff for each object individually.
      • Essentially it's an object join. DN could do this for us, but it issues queries separately for every object, which is suboptimal.
      • Makes shallow copy of a list to avoid DataNucleus mucking with our objects.
      • DataNucleus objects get detached all over the place for no (real) reason. So let's not use them anywhere unless absolutely necessary.
      • We have to get mtable again because DataNucleus.
      • We need Partition-s for firing events and for result; DN needs MPartition-s to drop. Great... Maybe we could bypass fetching MPartitions by issuing direct SQL deletes.
  • 32. METASTORE OPERATIONS
      • # partitions (total): 3337925
      • # partitions (table): 765
      • # partition columns: 2
          • job_ts, dt
      • Query for specific values of job_ts, dt
  • 33. OBJECTSTORE TBLS
      CREATE TABLE TBLS (
        TBL_ID NUMBER
        CREATE_TIME NUMBER
        DB_ID NUMBER
        LAST_ACCESS_TIME NUMBER
        OWNER VARCHAR
        RETENTION NUMBER
        SD_ID NUMBER
        TBL_NAME VARCHAR
        TBL_TYPE VARCHAR
        VIEW_EXPANDED_TEXT CLOB
        VIEW_ORIGINAL_TEXT CLOB
      )
    ORACLESTORE V2_TBLS
      CREATE TABLE V2_TBLS (
        TBL_ID NUMBER
        DB_ID NUMBER
        SD_ID NUMBER
        CD_ID NUMBER
        SD_PARAM_ID NUMBER
        SERDE_PARAM_ID NUMBER
        NAME VARCHAR
        TYPE VARCHAR
        OWNER_NAME VARCHAR
        LOCATION VARCHAR
        RETENTION NUMBER
        CREATION_TIME NUMBER
        LAST_MODIFIED_TIME NUMBER
        LAST_ACCESS_TIME NUMBER
        BUCKET_ID NUMBER
        NUM_BUCKETS NUMBER
        VIEW_EXPANDED_TEXT CLOB
        VIEW_ORIGINAL_TEXT CLOB
      )
  • 34. OBJECTSTORE SDS
      CREATE TABLE SDS (
        SD_ID NUMBER
        CD_ID NUMBER
        INPUT_FORMAT VARCHAR
        IS_COMPRESSED NUMBER
        LOCATION VARCHAR
        NUM_BUCKETS NUMBER
        OUTPUT_FORMAT VARCHAR
        SERDE_ID NUMBER
        IS_STOREDASSUBDIRECTORIES NUMBER
      )
      CREATE TABLE SERDES (
        SERDE_ID NUMBER
        NAME VARCHAR
        SLIB VARCHAR
      )
    ORACLESTORE V2_SDS
      CREATE TABLE V2_SDS (
        SD_ID NUMBER
        HASHCODE NUMBER
        IS_COMPRESSED NUMBER
        IS_STOREDASSUBDIRECTORIES NUMBER
        INPUT_FORMAT VARCHAR
        OUTPUT_FORMAT VARCHAR
        SERDE_NAME VARCHAR
        SERDE_LIB VARCHAR
      )
  • 35. OBJECTSTORE GET_TABLE
      SELECT DISTINCT ... FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0."NAME" = ?
      SELECT ... FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID LEFT OUTER JOIN SDS C0 ON A0.SD_ID = C0.SD_ID WHERE A0.TBL_ID = ?
      SELECT ... FROM TABLE_PARAMS A0 WHERE A0.TBL_ID = ?
      SELECT ... FROM PARTITION_KEYS A0 WHERE A0.TBL_ID = ? AND A0.INTEGER_IDX >= 0 ORDER BY NUCORDER0
      SELECT B0.CD_ID FROM SDS A0 LEFT OUTER JOIN CDS B0 ON A0.CD_ID = B0.CD_ID WHERE A0.SD_ID = ?
      SELECT COUNT(*) FROM COLUMNS_V2 THIS WHERE CD_ID=?
      SELECT ... FROM COLUMNS_V2 A0 WHERE A0.CD_ID = ? ORDER BY NUCORDER0
      SELECT ... FROM SDS A0 LEFT OUTER JOIN SERDES B0 ON A0.SERDE_ID = B0.SERDE_ID WHERE A0.SD_ID = ?
      SELECT ... FROM SERDE_PARAMS A0 WHERE A0.SERDE_ID = ?
      ...
    ORACLESTORE GET_TABLE
      SELECT ... FROM V2_TBLS WHERE DB_ID = ? AND NAME = ?
      SELECT ... FROM V2_PARTITION_COLS WHERE TBL_ID = ? ORDER BY POSITION ASC
      SELECT ... FROM V2_SDS WHERE SD_ID = ?
      SELECT ... FROM V2_TBL_COLS WHERE CD_ID = ? ORDER BY POSITION ASC
      SELECT ... FROM V2_SD_PARAMS WHERE SD_PARAM_ID = ?
      SELECT ... FROM V2_SERDE_PARAMS WHERE SERDE_PARAM_ID = ?
      SELECT ... FROM V2_TBL_PARAMS WHERE TBL_ID = ?
  • 36. "Object-Relational Mapping is the Vietnam of our industry" - Ted Neward
  • 37. ORM ALTERNATIVES
      "… nice quick initial development, and a big drain on your resources further on in the project when tracking ORM related bugs and inefficiencies"
      • Hibernate
          • HQL; generated SQL; generated skeleton code; de facto driver of JPA
      • jOOQ
          • Light database mapping; jOOQ DSL; generated SQL; generated code
      • Apache Cayenne
          • GUI; generated SQL; generated skeleton code; nested contexts
      • TopLink
          • Commercial
  • 38. GET_TABLE ISSUES
      • get_table request is non-trivial
      • Requires 19 queries to create one Table object
          • 6 multi-table JOIN queries with tables containing millions of records
          • 3 queries to populate auxiliary parameters
          • 4 COUNT queries to determine existence
      • get_table is called multiple times during plan generation
          • Unnecessarily queries the same data
          • Potential data consistency problems
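Since get_table is called repeatedly while one query is being planned, a cache scoped to that query is one way to avoid re-fetching the same object. The sketch below only illustrates the idea under stated assumptions: CountingStore and QueryScopedCache are hypothetical names, and this is not the cache referenced on the future-work slide.

```python
class CountingStore:
    """Toy backend that counts how often it is actually queried (hypothetical)."""

    def __init__(self):
        self.calls = 0

    def get_table(self, db, name):
        self.calls += 1
        return {"db": db, "name": name}


class QueryScopedCache:
    """Memoizes get_table for the lifetime of one query plan."""

    def __init__(self, store):
        self._store = store
        self._tables = {}

    def get_table(self, db, name):
        key = (db, name)
        if key not in self._tables:
            # First request for this table during the query: hit the backend.
            self._tables[key] = self._store.get_table(db, name)
        return self._tables[key]


backend = CountingStore()
cache = QueryScopedCache(backend)  # create one per query, discard afterwards
results = [cache.get_table("default", "clicks") for _ in range(5)]
print(backend.calls)
```

Every lookup within the query sees the same snapshot of the table, which also sidesteps the consistency concern of reading an object that changes between calls.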
  • 39. THRIFT OBJECT ISSUES
      • Thrift objects map to ORM objects one-to-one
          • Inherits inefficiencies in the ORM model
          • Duplicates data across lists of objects
          • Requires conversion between model and Thrift objects

Editor's Notes

  • #7: JDO = Java Data Objects
      • YHIVE: datanucleus-api-jdo.version=3.0.7, datanucleus-core.version=3.0.9, datanucleus-rdbms.version=3.0.8
      • branch-1: datanucleus-api-jdo.version=3.2.6, datanucleus-core.version=3.2.10, datanucleus-rdbms.version=3.2.9
      • master: datanucleus-api-jdo.version=4.2.1, datanucleus-core.version=4.1.6, datanucleus-rdbms.version=4.1.7
      • DBCP vs BoneCP
  • #8: ObjectStore: 7682 lines; MetaStoreDirectSQL: 2087 lines; OracleStore: 6261 lines
  • #14: SDS.INPUT_FORMAT=161MB, SDS.OUTPUT_FORMAT=181MB, SDS.LOCATION=423MB, SERDES.SLIB=152MB, TOTAL=917MB
  • #19: Ignore get_functions operation, which skews distributions. Top 4 operations account for 77.7%.
  • #21: ObjectStore: 134s; DirectSQL: 38s; OracleStore: 2.9s
      • DirectSQL uses PARTITION_KEY_VALUES table (7,067,982 records): SELF JOIN * # of distinct partition keys, then LEFT OUTER JOIN with PARTITIONS
  • #22: ObjectStore: 4.0s; DirectSQL: 1.1s; OracleStore: 270ms
  • #23: ObjectStore: 3.4s; DirectSQL: 1.0s; OracleStore: 260ms
  • #24: get_table: 215K (79%); get_function: 20K (7.2%); get_partition_with_auth: 10K (3.7%); get_partitions*: ~2000 ea; get_partitions: 352 (0.1%)
  • #25: total: 10359s(3h)/2542s(42m)/4x; get_partitions: 5600s/503s/11x
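The headline ratios in the deck can be re-derived from the figures in these notes. The short script below is only a cross-check of that arithmetic: the timings come from notes #21, #22 and #23, the column sizes from note #14, and the operation shares from the slide 18 table. The variable and function names are illustrative.

```python
# Timings in seconds, copied from editor's notes #21, #22 and #23.
timings = {
    "#21": {"ObjectStore": 134.0, "DirectSQL": 38.0, "OracleStore": 2.9},
    "#22": {"ObjectStore": 4.0, "DirectSQL": 1.1, "OracleStore": 0.27},
    "#23": {"ObjectStore": 3.4, "DirectSQL": 1.0, "OracleStore": 0.26},
}

def speedups(row):
    """Return how many times slower each store is than OracleStore."""
    base = row["OracleStore"]
    return {name: round(secs / base, 1)
            for name, secs in row.items() if name != "OracleStore"}

results = {note: speedups(row) for note, row in timings.items()}

# Note #14: repeated string columns in SDS/SERDES, sizes in MB.
sds_total_mb = 161 + 181 + 423 + 152
# Slide 18: share of the four most frequent operations, in percent.
top4_share = round(54.1 + 10.0 + 6.9 + 6.7, 1)

for note, ratio in results.items():
    print(note, ratio)
print(sds_total_mb, top4_share)
```

For note #21 this gives roughly 46x over ObjectStore and 13x over DirectSQL, in line with the figures quoted on slide 20; the size total and the top-four share agree with the 917MB and 77.7% stated in notes #14 and #19.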