© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Dancing Elephants:
Working with Object Storage in
Apache Spark and Hive
Sanjay Radia
June 2017
About the Speaker
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
–Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria …
Ph.D., University of Waterloo
Why Cloud?
• No upfront hardware costs – Pay as you use
• Elasticity
• Often lower TCO
• Natural for data ingress from IoT, mobile apps, …
• Business agility
Key Architectural Considerations for Hadoop in the Cloud
• Shared Data & Storage
• On-Demand Ephemeral Workloads
• Elastic Resource Management
• Shared Metadata, Security & Governance
Shared Data Requires Shared Metadata, Security, and Governance
⬢ Shared Metadata Across All Workloads/Clusters
–Metadata considerations
• Tabular data metastore
• Lineage and provenance metadata
• Add upon ingest
• Update as processing modifies data
–Access / tag-based policies (classification, prohibition, time, location) & audit logs
–Key observation: shared metadata and policies span streams, pipelines, feeds, tables, files, and objects – they cannot be private to any single cluster
Shared Data and Cloud Storage
⬢ Cloud Storage is the Shared Data Lake
–For both Hadoop and Cloud-native (non-Hadoop) Apps
–Lower Cost
•HDFS on EBS can get very expensive
•HDFS’s role changes
–Built-in geo-distribution and DR
⬢ Challenges
–Cloud storage is designed for scale, low cost, and geo-distribution
–Performance is slower – it was not designed for data-intensive apps
–Cloud storage is segregated from compute
–API and semantics are not those of a filesystem – especially with respect to consistency
Making Object Stores work for Big Data Apps
⬢ Focus Areas
–Address cloud storage consistency
–Performance (changes in connectors and frameworks)
–Caching in memory and local storage
⬢ Other issues not covered in this talk
–Shared Metastore, Common Governance, Security across multiple clusters
–Columnar access control to tabular data
(See Hortonworks cloud offering.)
Cloud Storage Integration: Evolution for Agility
Goal: evolution towards cloud storage as the persistent Data Lake
[Diagram: three deployment stages – (1) HDFS as primary storage, with backup/restore to cloud storage; (2) cloud storage as input/output, with upload from HDFS; (3) applications reading and writing cloud storage directly, with HDFS used only for temporary data. Today, on-prem deployments sit at stage 1, AWS at stage 2, and Azure at stage 3.]
Danger: Object stores are not hierarchical filesystems
Focus: Cost & geo-distribution over consistency and performance
A Filesystem: Directories, Files ⇒ Data
[Diagram: directory tree /work with pending/ (part-00, part-01) and complete/; each part file maps to replicated blocks of data]
rename("/work/pending/part-01", "/work/complete")
Object Store: hash(name) ⇒ data
[Diagram: objects distributed across storage nodes s01–s04 by hashing the full object name]
hash("/work/pending/part-00") ⇒ ["s01", "s02", "s04"]
hash("/work/pending/part-01") ⇒ ["s02", "s03", "s04"]
No rename, hence:
copy("/work/pending/part-01", "/work/complete/part-01")
delete("/work/pending/part-01")
Often: Eventually Consistent
[Diagram: request trace against object storage nodes s01–s04]
DELETE /work/pending/part-00 → 200
GET /work/pending/part-00 → 200
GET /work/pending/part-00 → 200
Eventual Consistency problems
⬢ When listing a directory
–Newly created files may not yet be visible; deleted ones may still be listed
⬢ After updating a file
–Opening and reading the file may still return the previous data
⬢ After deleting a file
–Opening the file may still succeed, returning the old data
⬢ While reading an object
–If the object is updated or deleted during the read, the result may be stale or partial
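A small illustration of the first problem, assuming a plain S3A connector of this era without a consistency layer; the bucket and paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAfterCreate {
  public static void main(String[] args) throws Exception {
    Path dir = new Path("s3a://my-bucket/work/pending/");   // illustrative bucket
    FileSystem fs = FileSystem.get(dir.toUri(), new Configuration());
    try (FSDataOutputStream out = fs.create(new Path(dir, "part-02"))) {
      out.writeBytes("new data");
    }
    // Without a consistency layer this listing may or may not include part-02,
    // and a recently deleted file may still show up.
    for (FileStatus st : fs.listStatus(dir)) {
      System.out.println(st.getPath() + " " + st.getLen());
    }
  }
}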
The dangers of Eventual Consistency and Lack of Atomicity
⬢ Temp data leftovers
–Annoying garbage, or worse if a direct output committer is used
⬢ List inconsistency means new data may not be visible
–Hadoop treats directories as containers of data
⬢ Lack of atomic rename() can leave output directories inconsistent
You can get bad or missing data and not even notice – especially if only a portion of your large dataset is missing
org.apache.hadoop.fs.FileSystem
hdfs, s3a, wasb, adl, swift, gs
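All of these connectors sit behind the same org.apache.hadoop.fs.FileSystem API, so application code stays the same when only the URI scheme changes. A minimal sketch; the paths are illustrative and assume the relevant connector JARs and credentials are configured:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeAgnostic {
  // The same code works for hdfs://, s3a://, wasb://, adl://, gs:// ...
  static void listTable(String uri) throws Exception {
    Path p = new Path(uri);
    FileSystem fs = FileSystem.get(p.toUri(), new Configuration());
    for (FileStatus st : fs.listStatus(p)) {
      System.out.println(st.getPath());
    }
  }

  public static void main(String[] args) throws Exception {
    listTable("hdfs:///warehouse/sales");          // illustrative paths
    listTable("s3a://my-bucket/warehouse/sales");
  }
}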
History of Object Storage Support
[Timeline, 2006–2017]
• s3:// – “inode on S3”
• s3:// – Amazon EMR S3 (proprietary)
• s3n:// – “native” S3
• s3a:// – replaces s3n
• swift:// – OpenStack
• wasb:// – Azure WASB
• adl:// – Azure Data Lake
• oss:// – Aliyun
• gs:// – Google Cloud
S3A phases: Phase I – stabilize S3A; Phase II – speed & scale; Phase III – scale & consistency
Cloud Storage Connectors
Azure
● WASB: strongly consistent; good performance; well-tested on applications (incl. HBase)
● ADL: strongly consistent; tuned for big data analytics workloads
Amazon Web Services
● S3A: eventually consistent – consistency work recently completed by Hortonworks, Cloudera, and others; performance improvements recent and in progress; active development in Apache
● EMRFS: proprietary connector used in EMR; optional strong consistency for a cost
Google Cloud Platform
● GCS: multiple configurable consistency policies; currently Google open source; good performance; could improve test coverage
Make Apache Hadoop at home
in the cloud
Step 1: Hadoop runs great on Azure ✔
Step 2: Be the best performer on EC2 (i.e., beat proprietary solutions like EMR) ✔
Problem: S3 analytics is too slow/broken
1. Analyze benchmarks and bug reports
2. Optimize the non-I/O metadata operations (very cheap on HDFS)
3. Fix the read path for columnar data
4. Fix the write path
5. Improve query partitioning (not covered in this talk)
6. The commitment problem
[Flame graph: LLAP (single node) on AWS, TPC-DS queries at 200 GB scale – operation counts dominated by getFileStatus(), read(), and readFully(pos)]
Hadoop 2.8/HDP 2.6 transforms I/O performance!
// forward seek by skipping stream
fs.s3a.readahead.range=256K
// faster backward seek for Columnar Storage
fs.s3a.experimental.input.fadvise=random
// Write-IO - enhanced data upload (parallel background uploads)
// Additional flags for mem vs disk
fs.s3a.fast.output.enabled=true
fs.s3a.multipart.size=32M
fs.s3a.fast.upload.active.blocks=8
// Additional per-bucket flags
—see HADOOP-11694 for lots more!
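These are Hadoop configuration properties, so they can be set in core-site.xml or passed straight into a job. A rough sketch of wiring them into Spark via the standard spark.hadoop.* pass-through; the bucket name is made up, and the per-bucket form only applies where the S3A version in use supports per-bucket options:

# spark-defaults.conf fragment (illustrative values)
spark.hadoop.fs.s3a.experimental.input.fadvise random
spark.hadoop.fs.s3a.readahead.range 256K
spark.hadoop.fs.s3a.multipart.size 32M
# per-bucket override (where supported): tune only the "warehouse" bucket
spark.hadoop.fs.s3a.bucket.warehouse.multipart.size 64M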
Every HTTP request is precious
⬢ HADOOP-13162: Reduce number of getFileStatus calls in mkdirs()
⬢ HADOOP-13164: Optimize deleteUnnecessaryFakeDirectories()
⬢ HADOOP-13406: Consider reusing filestatus in delete() and mkdirs()
⬢ HADOOP-13145: DistCp to skip getFileStatus when not preserving metadata
⬢ HADOOP-13208: listFiles(recursive=true) to do a bulk listObjects
see HADOOP-11694
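That last one matters in practice: walking a deep tree yourself issues one listing per directory, while the recursive listFiles() call lets S3A satisfy the whole scan with flat, paged object listings. A small sketch with an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class BulkList {
  public static void main(String[] args) throws Exception {
    Path root = new Path("s3a://my-bucket/warehouse/sales");  // illustrative
    FileSystem fs = FileSystem.get(root.toUri(), new Configuration());
    long files = 0, bytes = 0;
    // Prefer listFiles(path, true) over a recursive listStatus() treewalk:
    // object stores can answer it with bulk object listings.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      LocatedFileStatus st = it.next();
      files++;
      bytes += st.getLen();
    }
    System.out.println(files + " files, " + bytes + " bytes");
  }
}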
Caching in Memory or on Local Disk (SSD)
even more relevant for slow cloud storage
benchmarks != your queries, your data, your VMs, …
…but we think we've made a good start
S3 Data Source, 1 TB TPC-DS: LLAP vs Hive 1.x
[Bar chart: query times for LLAP-1TB-TPCDS vs Hive-1-1TB-TPCDS, y-axis 0–2,500]
1 TB TPC-DS ORC dataset
3 x i2.4xlarge (16 CPU x 122 GB RAM x 4 SSD)
Rename Problem and Direct Output Committer
The S3 Commitment Problem
rename() is used as the atomic commit transaction
⬢ Additional time: copy() + delete(), proportional to data * files
– Server-side copy is used to make this faster, but it is still a copy
– Non-atomic!!
⬢ Alternative: a direct output committer can solve the performance problem
⬢ BOTH can give wrong results
–Intermediate data may be visible
–Failures (task or job) leave storage in an unknown state
–Speculative execution makes it worse
⬢ BTW: compared to Azure Storage, S3 is slow (6–10+ MB/s)
Spark's Direct Output Committer? Risk of data corruption
Netflix Staging Committer
1. Saves output to file://
2. Task commit: uploads to S3 via S3A as a multipart PUT — but does not commit the PUT, just saves the information about it to hdfs://
3. The normal commit protocol manages task and job data promotion in HDFS
4. The final job committer reads the pending information and generates the final PUT — possibly from a different host
– Multiple files, hence not fully atomic, but the window is much smaller
Outcome:
⬢ No visible output until final job commit: resilience and speculation are safe
⬢ Task commit time = data / bandwidth
⬢ Job commit time = POST * #files
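The mechanism this relies on is that an S3 multipart upload is invisible until it is completed. A stripped-down sketch of that idea with the AWS Java SDK; bucket, key, and file names are illustrative, and the hand-off of the pending-upload metadata through HDFS is omitted:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class TwoPhaseUpload {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-bucket", key = "warehouse/sales/part-0000.orc"; // illustrative

    // Task commit: upload the data, but do NOT complete the multipart upload.
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
    UploadPartResult part = s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key)
        .withUploadId(uploadId).withPartNumber(1)
        .withFile(new File("/tmp/part-0000.orc")));
    // uploadId + ETags would be persisted (e.g. to HDFS) for the job committer.

    // Job commit (possibly on another host): complete the pending upload.
    List<PartETag> etags = Arrays.asList(part.getPartETag());
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
    // Only now does the object become visible in the bucket.
  }
}

The real committer tracks many such pending uploads and can abort them on task or job failure, which is what keeps partial output invisible.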
Use the Hive Metastore to Commit Atomically
⬢ Work in progress – use the Hive metastore to record the commit
–Databricks seems to have done a similar thing for Databricks Spark (i.e., proprietary)
⬢ Fits into the Hive ACID work
S3Guard
Fast, consistent S3 metadata
HADOOP-13445
S3Guard: Fast And Consistent S3 Metadata
⬢ Goals
–Provide consistent list and get status operations on S3 objects written with S3Guard enabled
•listStatus() after put and delete
•getFileStatus() after put and delete
–Performance improvements that impact real workloads
–Provide tools to manage associated metadata and caching policies.
⬢ Again, 100% open source in Apache Hadoop community
–Hortonworks, Cloudera, Western Digital, Disney …
⬢ Inspired by Apache licensed S3mper project from Netflix
–Note: EMRFS’s implementation is apparently also inspired by this, but was copied and kept proprietary
⬢ Seamless integration with S3AFileSystem
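Once available, turning it on is an S3A configuration change rather than an application change. A hedged sketch of the expected settings, using the property names from the S3Guard work; the table name and region are illustrative, and exact names may differ by release:

fs.s3a.metadatastore.impl=org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore
fs.s3a.s3guard.ddb.table=my-s3guard-table
fs.s3a.s3guard.ddb.region=us-west-2
fs.s3a.s3guard.ddb.table.create=true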
Use DynamoDB as a fast, consistent metadata store
[Diagram: object storage nodes s01–s04 plus a DynamoDB table tracking metadata; PUT, DELETE, and HEAD requests on part-00 are checked against the metadata store, so a HEAD after a DELETE correctly returns 404 instead of stale data]
Availability
– Read + Write in HDP 2.6 and Apache Hadoop 2.8
– S3Guard: preview of DDB caching soon
– Zero-rename commit: work in progress
Summary
⬢ Cloud Storage is the Data Lake on the Cloud
– HDFS plays a different role
⬢ Challenges: Performance, Consistency, Correctness
– Output committer – non-atomicity should not be ignored
⬢ We have made significant improvements
– Object store connectors
– Upper layers, such as Hive and ORC
– S3Guard branch merged
⬢ LLAP as the cache for tabular data
⬢ Other considerations
– Shared Metadata, Security and Governance (See HDP Cloud offerings)
Big thanks to:
Rajesh Balamohan
Steve Loughran
Mingliang Liu
Chris Nauroth
Dominik Bialek
Ram Venkatesh
Everyone in QE, RE
+ everyone who reviewed and tested patches, added their own, filed bug reports, and measured performance
Questions?
sanjay@hortonworks.com @srr
Editor's Notes

  • #5: Key architectural considerations for improving Hadoop on the cloud. Shared Data & Storage – the shared data lake is on cloud storage, not HDFS; memory and local storage play a different role, that of caching. An important distinction in the cloud is On-Demand Ephemeral Workloads – this changes a number of things in fundamental ways. Shared Metadata, Security, and Governance remain important but need to be adjusted in the face of ephemeral clusters. And finally, Elastic Resource Management – we need to shift our thinking away from cluster resource management and towards SLA-driven workloads.
  • #6: Shared data requires a shared approach to metadata, security, and governance. Each ephemeral cluster cannot have its own private copy of the metadata; in the cloud, metadata must be centrally stored across all ephemeral clusters. The metadata is not just the classic Hive metadata that describes the tabular data – it is also about storing and tracking the lineage and provenance of data, and about details related to data pipeline processing and job management. Also, as data is ingested and processed, metadata needs to be created and adjusted. Governance and securing the data remain critical, and their metadata needs to be managed across all workloads. Projects such as Ranger and Atlas need to be evolved to fit the cloud environment.
  • #9: On-prem is at stage 1, AWS at stage 2, Azure at stage 3. There are also some problems in using S3 to replace HDFS, especially for high-performance apps. For example, HDFS provides much larger read/write throughput and strong filesystem guarantees. We will talk about the practical problems shortly. They have different pros and cons for different use cases, and fortunately they are not mutually exclusive. We suggest our customers use the right storage in the right place, but the goal is to move to the right one.
  • #10: In all the examples, object stores take a role which replaces HDFS. But this is dangerous, because...
  • #11: This is one of the simplest deployments in the cloud: scheduled/dynamic ETL. Incoming data sources save to an object store; a Spark cluster is brought up for ETL. Either direct cleanup/filter or multistep operations, but either way: an ETL pipeline. HDFS on the VMs is used for transient storage, and the object store is the destination for data – now in a more efficient format such as ORC or Parquet.
  • #12: Notebooks on demand: the notebook talks to Spark in the cloud, which then does the work against external and internal data. The notebook itself can be saved to the object store for persistence and sharing.
  • #13: Example: streaming on Azure + on LHS add streaming
  • #14: In all the examples, object stores take a role which replaces HDFS. But this is dangerous, because...
  • #15: This picture is classic HDFS – 3 replicas and metadata. In the rename, all you expect is that the data remains and only some metadata changes – that is what happens in HDFS, but not so in S3. Rename is important at the task level and the job level: tasks may fail, the job may fail, jobs may speculate …
  • #16: First of all, there are no real directories in S3 – the directory names are part of the object name; the cloud object store does not treat "/" as a special separator. Note the hash function – if you rename, the hash result changes – hence rename, and output commitment, is not simple. There is no rename operation – a copy and a delete have to be done: the copied data is new (yellow), the deleted data is orange. So you might imagine: let's write directly to avoid the copy and delete – many committers did this, and it was great for the benchmark! EMR did this, Spark/Databricks did this … All soon realized that under production loads you run into problems.
  • #18: You might say: big deal, the data in pending is deleted later – but imagine if you used a direct output committer and were deleting failed task output. The same problem can also occur for new data you added.
  • #20: We will come back to the inconsistency problem and the new S3Guard
  • #21: Everything uses the Hadoop APIs to talk to both HDFS, Hadoop Compatible Filesystems and object stores; the Hadoop FS API. There's actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter. HDFS is "real" filesystem; WASB/Azure close enough. What is "real?". Best test: can support HBase.
  • #22: This is the history of the connectors
  • #24: Simple goal. Make ASF hadoop at home in cloud infra. It's always been a bit of a mixed bag, need to address many things: things fail differently. - Folks are surprised that this wasn’t the case – we did the s3 connectors in 2008, 2009, 2010!! - Netflix was the first real production workload – open sourced but integrated by community and not stack tested by community - EMR copied but kept proprietary Step 1: Azure. That's the work with Microsoft on wasb://; you can use Azure as a drop-in replacement for HDFS in Azure Step 2: EMR. More specifically, have the ASF Hadoop codebase get better performance than EMR
  • #29: So what are the kinds of problems one can run into? Initially we did a lot of performance analysis and benchmarks and looked at bug reports, which showed consistency problems. We optimized many non-I/O metadata operations – note these optimizations were not critical in HDFS, since HDFS's metadata operations are very cheap and no one bothered optimizing them. Then work on the read path, and then on the write path. Further work on query partitioning, which I will not cover in this talk. Then the output commitment and consistency problem – I will touch on this.
  • #30: Here's a flamegraph of LLAP (single node) with AWS+HDC for a set of TPC-DS queries at 200 GB scale; we should stick this up online Shows number of operations (not time) – Small number of operations are IO, rest are metadata operations which are expensive REST operations. - getListing, seek, etc. only about 2% of time (optimized code) is doing S3 IO. Something at start partitioning data
  • #32: Why so much seeking? It's the default implementation of read
  • #34: Things you will want from Hadoop 2.8. They are in HDP 2.5 and HDP 2.6, and possibly in the next CDH release. The first two reduce the cost of seeking. Seeking was originally implemented as reopening the connection and reading from the new position; however, it is sometimes cheaper to read ahead. The readahead config variable indicates that hundreds of KB can be skipped before reconnecting (yes, it can take that long to reconnect). The experimental fadvise=random feature speeds reading of optimized binary formats like ORC, which involve lots of random reads – indeed, the footer at the end of the file is read first, then the actual data from different portions of the file. The last group supports fast uploads of data: there are parameters for buffering output to local files and then doing multipart uploads, which offers the potential of significantly more effective use of bandwidth. The new version in Hadoop 2.8 is an improvement over the previous one, which buffered in memory and could conflict with RDD caching.
  • #35: A big killer turned out to be the fact that if we had to break and re-open the connection on a large file, this was done by closing the TCP connection and opening a new one. The fix: ask for data in smaller blocks – the max of (requested-length, min-request-len). Result: significantly lower cost for backward seeks and very-long-distance forward seeks, at the expense of increased cost for end-to-end reads of a file (gzip, CSV). It's an experimental option for this reason; I'd like to make it an API call that libraries like Parquet and ORC can explicitly request on their I/O – it should apply to all blob stores.
  • #36: If you look at what we've done, much of it (credit to Rajesh & Chris) is minimizing HTTP requests and # of connections. Recall metadata operations are cheap in HDFS BUT NOT in S3. Each one can take hundreds of millis, sometimes even seconds due to load balancer issues (tip: reduce DNS TTL on your clients to <30s). A lot of the work internal to S3A was culling those getFileStatus() calls by (a) caching results, (b) perhaps not needed. Example: cheaper to issue a DELETE listing all parent paths than actually looking to see if they exist, wait for the response, and then delete them. The one at the end, HADOOP-13208, replaces a slow recursive tree walk (many status, many list) with a flat listing of all objects in a tree. This works only for the listStatus(path, true) call Better to use that API call, not do your own treewalk.
  • #38: Note: much work remains. Data layout in cloud storage and its implications for applications needs further study – we need to understand the implications of sharding and throttling in S3. What we do know is that deep/shallow trees are pathological for recursive treewalks, and they end up storing data on the same S3 nodes, so adjacent requests get throttled. Now let's talk about caching!
  • #39: This benchmark shows the benefits of caching, especially with slow cloud storage. LLAP is in the latest Hive. Two benefits (both on-prem and especially in the cloud): thread-based work allocation (instead of process-based), and caching. LLAP caches columnar data – only the columns that you need, and in an intermediate form. 26x faster!
  • #43: Spark early on focused on the direct output committer, ignoring the problem of atomicity and the careful design of Hadoop's HDFS-centric file output committer. Performance was good on the benchmarks and it helped Spark's adoption. But the production problems surfaced and the direct output committer was removed. It was taken away because it can corrupt your data without you noticing. This is generally considered harmful.
  • #44: Let's now turn our attention to Netflix's output committer. Output goes to local disk; it uses a multipart PUT to upload to S3, with a final operation to commit the job. This significantly narrows the window of visibility of partial data and of job-committer failure.
  • #46: Recap – S3 is eventually consistent. Additional work we are doing uses a cloud DB to cache the metadata and achieve consistency.
  • #48: Let's look at this in a little detail: there is an additional metadata store (e.g. DynamoDB) where all metadata updates are recorded. So there is an additional operation, but some metadata operations become faster. However, note there are issues around out-of-band updates and the source of truth that need to be considered.
  • #55: And this is the big one, as it spans the lot: Hadoop's own code (so far: distcp), Spark, Hive, Flink, related tooling. If we can't speed up the object stores, we can tune the apps
  • #59: Hadoop 2.8 adds a lot of control here (credit: Netflix, plus later us & Cloudera). You can define a list of credential providers to use; the default is simple, env, instance, but you can add temporary and anonymous, choose which are unsupported, etc. Passwords/secrets can be encrypted in Hadoop credential files stored locally or in HDFS. IAM auth is what EC2 VMs need.