Hive ACID Apache BigData 2016

Apache Hive on ACID
Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
History
 Hive only updated partitions
– INSERT...OVERWRITE rewrote an entire partition
– Forced daily or even hourly partitions
– Could add files to partition directory, file compaction was manual
 What about concurrent readers?
– Ok for inserts, but overwrite caused races
– There is a ZooKeeper lock manager, but…
 No way to delete or update rows
 No INSERT INTO T VALUES…
– Breaks some tools

Why Do You Need ACID?
 Hadoop and Hive have always…
– Just said no to ACID
– Perceived as tradeoff for performance
 But, your data isn’t static
– It changes daily, hourly, or faster
– Sometimes it needs to be restated (late arriving data) or facts change (e.g. a user’s physical address)
– Loading data into Hive every hour is so 2010; data should be available in Hive as soon as it arrives
 We saw users implementing ad hoc solutions
– This is a lot of work and hard to get right
– Hive should support this as a first class feature

When Should You Use Hive’s ACID?
 NOT OLTP!!!
 Updating a Dimension Table
– Changing a customer’s address
 Delete Old Records
– Remove records for compliance
 Update/Restate Large Fact Tables
– Fix problems after they are in the warehouse
 Streaming Data Ingest
– A continual stream of data coming in
– Typically from Flume or Storm
 NOT OLTP!!!

SQL Changes for ACID
 Since Hive 0.14
 New DML
– INSERT INTO T VALUES(1, 'fred', ...);
– UPDATE T SET x = 5[, ...] [WHERE ...]
– DELETE FROM T [WHERE ...]
– Supports partitioned and non-partitioned tables, WHERE clause can specify partition but not required
 Restrictions
– Table must have format that extends AcidInputFormat
• currently ORC
• work started on Parquet (HIVE-8123)
– Table must be bucketed and not sorted
• can use 1 bucket but this will restrict write parallelism
– Table must be marked transactional
• create table T(...) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
• Existing ORC tables that are bucketed can be marked transactional via ALTER

Ingesting Data Into Hive From a Stream
 Data is flowing in from generators in a stream
 Without this, you have to add it to Hive in batches, often every hour
– Thus your users have to wait an hour before they can see their data
 New interface in hive.hcatalog.streaming lets applications write small batches of records and commit them
– Users can now see data within a few seconds of it arriving from the data generators
 Available for Apache Flume and Apache Storm

Design
 HDFS does not allow arbitrary writes
– Store changes as delta files
– Stitched together by client on read
 Writes get a transaction ID
– Sequentially assigned by metastore
 Reads get highest committed transaction & list of open/aborted transactions
– Provides snapshot consistency
– No exclusive locks required
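The snapshot rule above can be sketched in a few lines. This is a minimal model, not Hive's implementation: the `Snapshot` class and its field names are hypothetical, and the only behaviour taken from the slide is that a reader gets the highest committed transaction ID plus the list of open or aborted transactions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """What a reader receives at query start (illustrative names)."""
    high_water_mark: int  # highest committed transaction ID
    excluded: frozenset = field(default_factory=frozenset)  # open or aborted IDs

    def is_visible(self, txn_id: int) -> bool:
        # A write is visible only if its transaction is at or below the
        # high-water mark and was neither open nor aborted at snapshot time.
        return txn_id <= self.high_water_mark and txn_id not in self.excluded


# Transaction 7 is the highest committed; 5 is still open and 6 was aborted.
snapshot = Snapshot(high_water_mark=7, excluded=frozenset({5, 6}))
visible = [t for t in range(1, 10) if snapshot.is_visible(t)]
print(visible)
```

Because the reader filters by transaction ID instead of blocking writers, later transactions can keep committing while the query runs, which is why no exclusive lock is needed.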

Why Not HBase
 Good
– Handles compactions for us
– Already has similar data model with LSM
 Bad
– When we started this there were no transaction managers for HBase, and this work requires transactions
– HFile is column family based rather than columnar
– HBase focused on point lookups and range scans
• Warehousing requires full scans

Stitching Buckets Together

HDFS Layout
 Partition locations remain unchanged
– Still warehouse/$db/$tbl/$part
 Bucket Files Structured By Transactions
– Base files $part/base_$tid/bucket_*
– Delta files $part/delta_$tid_$tid/bucket_*
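As a rough illustration of the layout above, the sketch below parses `base_$tid` and `delta_$tid_$tid` directory names and picks what a reader would need. The function name and the selection rule (newest base, then every delta that starts after it) are simplifying assumptions; the real reader also has to honour its snapshot.

```python
import re

BASE_RE = re.compile(r"^base_(\d+)$")
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)$")


def choose_dirs(names):
    """Return (newest base transaction ID, deltas that start after it)."""
    bases = [int(m.group(1)) for n in names if (m := BASE_RE.match(n))]
    best_base = max(bases, default=None)
    deltas = []
    for name in names:
        m = DELTA_RE.match(name)
        if not m:
            continue
        first_txn, last_txn = int(m.group(1)), int(m.group(2))
        # Deltas at or below the base are already folded into it.
        if best_base is None or first_txn > best_base:
            deltas.append((first_txn, last_txn, name))
    deltas.sort()  # apply deltas in transaction order
    return best_base, [name for _, _, name in deltas]


base, deltas = choose_dirs(
    ["base_2", "delta_1_5", "base_5", "delta_7_9", "delta_6_6"]
)
print(base, deltas)
```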

Input and Output Formats
 Created new AcidInput/OutputFormat
– Unique key is original transaction id, bucket, row id
 Reader returns correct version of row based on transaction state
 Also added raw API for compactor
– Provides previous events as well
 ORC implements new API
– Extends records with change metadata
• Add operation (d, u, i), latest transaction id, and key
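A minimal sketch of how a reader could return the correct version of each row from such events. The dictionary layout and the `read_rows` helper are hypothetical; what comes from the slide is the key (original transaction ID, bucket, row ID), the operation codes (i, u, d), and the latest transaction ID carried on each event.

```python
def read_rows(events, is_visible):
    """Keep the newest visible event per key and drop deleted rows."""
    latest = {}
    for ev in events:
        if not is_visible(ev["txn"]):
            continue  # written by a transaction outside the reader's snapshot
        key = ev["key"]
        if key not in latest or ev["txn"] > latest[key]["txn"]:
            latest[key] = ev
    return {k: ev["row"] for k, ev in sorted(latest.items()) if ev["op"] != "d"}


events = [
    {"key": (1, 0, 0), "txn": 1, "op": "i", "row": "fred"},
    {"key": (1, 0, 1), "txn": 1, "op": "i", "row": "barney"},
    {"key": (1, 0, 0), "txn": 3, "op": "u", "row": "frederick"},
    {"key": (1, 0, 1), "txn": 4, "op": "d", "row": None},
    {"key": (1, 0, 0), "txn": 6, "op": "u", "row": "freddy"},
]

# Assume a snapshot in which transactions 1 to 5 are committed and visible.
rows = read_rows(events, is_visible=lambda txn: txn <= 5)
print(rows)
```

The update from transaction 6 is ignored because it falls outside the snapshot, and the row deleted by transaction 4 does not appear in the result.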

Transaction Manager
 Existing lock managers
– In memory - not durable
– ZooKeeper - requires additional components to install, administer, etc.
 Locks need to be integrated with transactions
– commit/rollback must atomically release locks
 We sort of have this database lying around which has ACID characteristics (metastore)
 Transactions and locks stored in metastore
 Uses metastore DB to provide unique, ascending ids for transactions and locks

Transaction & Locking Model
 DML statements are auto-commit
 Snapshot isolation
– Reader will see consistent data for the duration of a query
 Current transactions can be displayed using SHOW TRANSACTIONS
 Three types of locks
– shared read
– shared write (can co-exist with shared read, but not other shared write)
– exclusive
 Operations require different locks
– SELECT, INSERT – shared read (inserts cannot conflict because there is no primary key)
– UPDATE, DELETE – shared write
– DROP, INSERT OVERWRITE – exclusive
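The lock rules above fit in a small compatibility check. The enum and function names are illustrative only; the mapping from operation to lock type and the co-existence rules are the ones listed on the slide, with the usual assumption that an exclusive lock conflicts with every other lock.

```python
from enum import Enum


class Lock(Enum):
    SHARED_READ = "shared read"
    SHARED_WRITE = "shared write"
    EXCLUSIVE = "exclusive"


# Lock type each operation acquires, as listed on the slide.
REQUIRED = {
    "SELECT": Lock.SHARED_READ,
    "INSERT": Lock.SHARED_READ,
    "UPDATE": Lock.SHARED_WRITE,
    "DELETE": Lock.SHARED_WRITE,
    "DROP": Lock.EXCLUSIVE,
    "INSERT OVERWRITE": Lock.EXCLUSIVE,
}


def compatible(held: Lock, requested: Lock) -> bool:
    if Lock.EXCLUSIVE in (held, requested):
        return False  # exclusive conflicts with everything
    if held is Lock.SHARED_WRITE and requested is Lock.SHARED_WRITE:
        return False  # only one shared write at a time
    return True  # shared read co-exists with shared read and shared write


def can_run_together(op_a: str, op_b: str) -> bool:
    return compatible(REQUIRED[op_a], REQUIRED[op_b])


print(can_run_together("SELECT", "UPDATE"))  # reader alongside a writer
print(can_run_together("UPDATE", "DELETE"))  # two shared writes
```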

Compaction
 Each transaction (or batch of transactions in streaming) creates a new delta directory
 Too many files = NameNode strain and poor read performance due to fan-in on merge
 Need to automatically compact files
– Initiated by metastore server, run as MR jobs in the cluster
– Can be manually initiated by user via ALTER TABLE COMPACT
 Minor compaction merges many deltas into one
– Run when there are more than 10 delta directories (configurable)
 Major compaction merges deltas with base and rewrites base
– Run when size of the deltas > 10% of the size of the base (configurable)
 Old files kept around until all readers are done with their snapshots, then cleaned up
– Compaction and data read/writes can be done in parallel with no need to pause the world
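The two triggers above can be written as one decision function. This is a sketch under assumptions: the function and parameter names are hypothetical, the defaults mirror the thresholds quoted on the slide (more than 10 deltas, deltas larger than 10% of the base), and checking the major rule first and skipping it when there is no base are choices made here for illustration.

```python
def choose_compaction(delta_count, delta_bytes, base_bytes,
                      max_deltas=10, max_delta_ratio=0.10):
    """Return 'major', 'minor', or None for one partition."""
    # Major: deltas have grown large relative to the base, so rewrite the base.
    if base_bytes > 0 and delta_bytes > max_delta_ratio * base_bytes:
        return "major"
    # Minor: many small deltas, so merge them into a single delta.
    if delta_count > max_deltas:
        return "minor"
    return None


print(choose_compaction(delta_count=3, delta_bytes=50, base_bytes=1000))
print(choose_compaction(delta_count=11, delta_bytes=50, base_bytes=1000))
print(choose_compaction(delta_count=3, delta_bytes=200, base_bytes=1000))
```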

Issues Found and (Some) Fixed
 Not GA ready in Hive 1.2 or 2.0, hope to have GA ready by 1.3 and 2.1
 Deadlocks in the RDBMS
– The way the Hive metastore used the RDBMS caused a lot of deadlocks – greatly improved
 Usability
– SHOW COMPACTIONS and SHOW LOCKS did not give users/admins enough information to successfully determine who was blocking whom or what was getting compacted – improved, some work still to do here
 Resilience
– System was easy to knock over when clients did silly things (like open 1M+ transactions) – improved, though I am sure there are still some ways to kill it
– Initially compactor threads only ran in 1 metastore instance – resolved, now can run in multiple instances
 Correctness
– Streaming ingest did not enforce proper bucket spraying – resolved
– Initial versions of the compactor had a race condition that resulted in record loss – resolved
– Adding a column to a table or changing a column’s type caused read time errors - resolved
– Updates can get lost when overlapping transactions update the same partition – HIVE-13395
 Performance
– Some work done here (e.g. making predicate push down work, efficient split combinations)
– Much still to be done

Next: MERGE
 Standard SQL, added in SQL 2003
 Problem: today each UPDATE requires a scan of the partition or table
– There is no way to apply separate updates in a batch
 Allows upserts
 Use case:
– bring in batch from transactional/front end systems
– Apply as insert or updates (as appropriate) in one read/write pass
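To make the one-pass idea concrete, here is a small model of an upsert. It is not MERGE itself and not how Hive would execute it; `merge_batch` and the row layout are hypothetical. It only shows how a batch keyed by ID can be applied as updates or inserts while reading the target once, instead of scanning once per UPDATE.

```python
def merge_batch(target, batch, key="id"):
    """Apply a batch as updates or inserts in a single pass over target."""
    incoming = {row[key]: row for row in batch}
    merged, updated = [], 0
    for row in target:  # the only scan of the target
        replacement = incoming.pop(row[key], None)
        if replacement is not None:
            merged.append(replacement)  # matched: update
            updated += 1
        else:
            merged.append(row)  # not in batch: keep as is
    inserted = len(incoming)
    merged.extend(incoming.values())  # unmatched batch rows: insert
    return merged, updated, inserted


target = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}]
batch = [{"id": 2, "city": "Chicago"}, {"id": 3, "city": "Denver"}]
merged, updated, inserted = merge_batch(target, batch)
print(merged, updated, inserted)
```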

Future Work
 Multi-statement transactions (BEGIN, COMMIT, ROLLBACK)
 Integration with LLAP
– Figure out how MVCC works with LLAP’s caching
– Build a write path through LLAP
 Lower the user burden
– Make the bucketing automatic so the user does not have to be aware of it
– Allow user to determine sort order of the table
– Eventually remove the transactional/non-transactional distinction in tables
 Improve monitoring and alerting facilities
– Make it easier for an admin to determine when the system is in trouble, e.g. the compactor is not running or is failing on every run, there are too many open transactions, etc.

Thank You
