Accordion: HBase Breathes with In-Memory Compaction

Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov ⎪ HBaseCon, Jun 12, 2017
The Team

Edward Bortnikov, Anastasia Braginsky (committer), Eshcar Hillel (committer)

Michael Stack, Anoop Sam John, Ramkrishna Vasudevan
Quest: The User’s Holy Grail

Reliable Persistent Storage with In-Memory Database Performance
What is Accordion?

Novel Write-Path Algorithm

Better Performance for Write-Intensive Workloads
  Write throughput ↑, read latency ↓

Better Disk Use
  Write amplification ↓

GA in HBase 2.0 (becomes the default MemStore implementation)
In a Nutshell

Inspired by the Log-Structured Merge (LSM) tree design
  Transforms random I/O to sequential I/O (efficient!)
  Governs the HBase storage organization

Accordion reapplies the LSM tree design to RAM data
  → Efficient resource use – data lives in memory longer
  → Less disk I/O
  → Ultimately, higher speed
How LSM Trees Work

[Diagram of an HRegion: Puts go to the MemStore (RAM); Flush writes it out as an HFile (disk); Gets/Scans read the MemStore and the HFiles; Compaction merges HFiles]

Data updates are stored as versions

Compaction eliminates redundancies (see the sketch below)
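To make the flow concrete, here is a minimal, illustrative Java sketch of an LSM write/read path. The LsmStore class and its String-keyed maps are simplified stand-ins, not the actual HBase types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal LSM-tree sketch: writes hit a sorted in-memory map, reads
// consult the map first and then "on-disk" files, newest to oldest.
class LsmStore {
    // MemStore: sorted, mutable, in RAM
    private ConcurrentSkipListMap<String, String> memStore = new ConcurrentSkipListMap<>();
    // HFiles: immutable sorted runs (newest first); modeled here as maps
    private final List<ConcurrentSkipListMap<String, String>> hFiles = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 4; // toy threshold

    void put(String key, String value) {
        memStore.put(key, value);          // a random write becomes an in-memory insert
        if (memStore.size() >= FLUSH_THRESHOLD) {
            flush();                       // one sequential write of a sorted run
        }
    }

    String get(String key) {
        String v = memStore.get(key);      // newest version wins
        if (v != null) return v;
        for (ConcurrentSkipListMap<String, String> f : hFiles) {
            v = f.get(key);                // probe files newest to oldest
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {
        hFiles.add(0, memStore);           // MemStore becomes an immutable "HFile"
        memStore = new ConcurrentSkipListMap<>();
    }
}
```

A compaction in this model would merge several of the immutable runs into one, keeping only the newest version of each key – exactly the redundancy elimination the slide describes.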
LSM Trees in Action

[Animation: repeated flushes turn successive MemStores into an ever-growing set of HFiles, until a compaction merges them]
Accordion: In-Memory LSM Tree

[Diagram: a Compacting MemStore in RAM serves Puts through an active segment backed by a pipeline of immutable segments; in-memory compaction merges the pipeline, and only an occasional flush writes an HFile to disk; Gets/Scans read both RAM and disk]
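The same idea can be sketched one level up: a hedged sketch of a compacting MemStore, with an active segment feeding an immutable-segment pipeline. CompactingMemStoreSketch and its String maps are illustrative stand-ins, not the real CompactingMemStore internals:

```java
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of a compacting MemStore: a mutable active segment feeds a
// pipeline of immutable segments; data reaches disk only rarely.
class CompactingMemStoreSketch {
    private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private final List<ConcurrentSkipListMap<String, String>> pipeline = new CopyOnWriteArrayList<>();

    void put(String key, String value) { active.put(key, value); }

    // In-memory flush: cheap, no disk I/O; the active segment becomes immutable
    void inMemoryFlush() {
        pipeline.add(0, active);                    // newest segment at the head
        active = new ConcurrentSkipListMap<>();
    }

    // In-memory compaction: merge the pipeline segments into one,
    // keeping only the newest version of each key
    void inMemoryCompact() {
        ConcurrentSkipListMap<String, String> merged = new ConcurrentSkipListMap<>();
        for (int i = pipeline.size() - 1; i >= 0; i--) {
            merged.putAll(pipeline.get(i));         // newer segments overwrite older ones
        }
        pipeline.clear();
        pipeline.add(merged);
    }
}
```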
Accordion in Action

[Animation: each in-memory flush moves the active segment into the compaction pipeline as an immutable segment; an in-memory compaction merges the pipeline’s segments into one; eventually a snapshot is taken and a disk flush writes it out]
Flat Immutable Segment Index

[Diagram: an immutable segment’s skiplist index, built of per-cell KV-objects, is flattened into a CellArrayMap index; the cell storage itself stays in place]

Lean footprint – the smaller the cells, the better!
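A hypothetical flattening routine, to illustrate the transformation, using plain String cells in place of HBase Cell objects: the immutable segment’s sorted entries are copied once into a flat array, after which lookups are binary searches with no per-node objects:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Flattening sketch: once a segment is immutable, its skiplist index
// (many small, pointer-heavy node objects) can be replaced by a plain
// sorted array that is binary-searched. Illustrative types only.
class FlattenSketch {
    static String[][] flatten(ConcurrentSkipListMap<String, String> skiplist) {
        String[][] cellArray = new String[skiplist.size()][2];
        int i = 0;
        for (Map.Entry<String, String> e : skiplist.entrySet()) {
            cellArray[i][0] = e.getKey();     // entries arrive in sorted order
            cellArray[i][1] = e.getValue();
            i++;
        }
        return cellArray;
    }

    // Binary search over the flattened index – no per-node objects or links
    static String get(String[][] cellArray, String key) {
        int lo = 0, hi = cellArray.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = cellArray[mid][0].compareTo(key);
            if (cmp == 0) return cellArray[mid][1];
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null;
    }
}
```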
Redundancy Elimination

In-memory compaction merges the pipelined segments
  Keeps Get access latency under control (fewer segments to scan)

BASIC compaction
  Multiple indexes are merged into one; cell data remains in place

EAGER compaction
  Redundant data versions are eliminated (SQM scan)
BASIC vs EAGER

BASIC: universal optimization, avoids physical data copy

EAGER: high value for highly redundant workloads
  The SQM scan is expensive
  Data relocation cost may be high (think MSLAB!)

Configuration (see the sketch below)
  BASIC is the default; EAGER may be configured
  A future implementation may figure out the right mode automatically
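For reference, a sketch of selecting the policy per column family through the HBase 2.0 client API. The setInMemoryCompaction method and the MemoryCompactionPolicy enum match HBase 2.0 to the best of our knowledge, but verify against your release; the policy can also be set cluster-wide via the hbase.hregion.compacting.memstore.type property. Table and family names are illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: choosing the in-memory compaction policy per column family
// through the HBase 2.0 client API.
public class AccordionConfig {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("mytable"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf"))
                    .setInMemoryCompaction(MemoryCompactionPolicy.EAGER) // BASIC is the default
                    .build())
                .build());
        }
    }
}
```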
Compaction Pipeline: Correctness & Performance

Shared data structure
  Read access: Get, Scan, in-memory compaction
  Write access: in-memory flush, in-memory compaction, disk flush

Design choice: non-blocking reads (sketched below)
  Read-only pipeline clone – no synchronization upon read access
  Copy-on-write upon modification
  Versioning prevents compaction from racing with concurrent updates
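A small copy-on-write sketch of this design (PipelineSketch is an illustrative class, not the real pipeline code): readers take a lock-free snapshot via a volatile reference, writers swap in a fresh clone, and a version counter lets a compaction detect concurrent updates and back off:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the non-blocking read design over the segment pipeline.
class PipelineSketch<S> {
    private volatile List<S> segments = new ArrayList<>();
    private volatile long version = 0;

    // Readers: no synchronization, just a volatile read of the current clone
    List<S> readSnapshot() { return segments; }

    long currentVersion() { return version; }

    // Writers (in-memory flush): copy-on-write under a short lock
    synchronized void pushHead(S segment) {
        List<S> next = new ArrayList<>(segments);
        next.add(0, segment);                      // newest segment at the head
        segments = next;
        version++;
    }

    // Compaction: replace the oldest `count` segments with the merged one,
    // but only if the pipeline has not changed since it was read
    synchronized boolean swapTail(long expectedVersion, int count, S merged) {
        if (version != expectedVersion) return false;  // a concurrent update won; retry
        List<S> next = new ArrayList<>(segments.subList(0, segments.size() - count));
        next.add(merged);
        segments = next;
        version++;
        return true;
    }
}
```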
More Memory Efficiency – KV-Object Elimination

[Diagram: the CellArrayMap index over the cell storage is replaced by a CellChunkMap index over the same cell storage]

Lean footprint (no KV-Objects). Friendly to an off-heap implementation.
The Software Side: What’s New?

CompactingMemStore: BASIC and EAGER configurations
DefaultMemStore: NONE configuration

Segment class hierarchy: Mutable, Immutable, Composite

NavigableMap implementations: CellArrayMap, CellChunkMap

MemStoreCompactor: implementation of the compaction algorithms
CellChunkMap Support (Experimental)

Cell objects are embedded directly into the CellChunkMap (CCM)
  A new cell type references its data by a unique ChunkID

ChunkCreator: chunk allocation + ChunkID management (sketched below)
  Stores the mapping of ChunkIDs to Chunk references
  Holds strong references to chunks managed by CCMs, weak references to the rest
  The CCMs themselves are allocated via the same mechanism

Some exotic use cases
  E.g., jumbo cells are allocated in one-time chunks outside the chunk pools
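A toy sketch of the ChunkID indirection (ChunkRegistrySketch is an illustrative name, not the HBase internals): a registry maps IDs to buffers, so a cell reference can be a compact (chunkId, offset, length) triple rather than a Java object pointer:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of ChunkID indirection: a central registry resolves a chunkId
// to its backing buffer, so the index stores only small numeric triples.
class ChunkRegistrySketch {
    private final Map<Integer, ByteBuffer> chunks = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    int allocateChunk(int size) {
        int id = nextId.incrementAndGet();
        chunks.put(id, ByteBuffer.allocateDirect(size)); // off-heap friendly
        return id;
    }

    // Resolve a (chunkId, offset, length) cell reference to its bytes
    ByteBuffer resolve(int chunkId, int offset, int length) {
        ByteBuffer chunk = chunks.get(chunkId).duplicate();
        chunk.position(offset);
        chunk.limit(offset + length);
        return chunk.slice();
    }
}
```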
Evaluation Setup

System
  2-node HBase on top of 3-node HDFS, 1Gbps interconnect
  Intel Xeon E5620 (12-core), 2.8TB SSD storage, 48GB RAM
  RegionServer config: 16GB RAM (40% cache / 40% MemStore), on-heap, no MSLAB

Data
  1 table (100 regions, 50 columns), 30GB–100GB

Workload Driver
  YCSB (1 node, 12 threads)
  Batched (async) writes (10KB buffer; see the sketch below)
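The batching above corresponds to the HBase client’s BufferedMutator; a minimal sketch of such a writer, with illustrative table, family, and value names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of client-side write batching: puts accumulate in a 10KB
// buffer and are shipped asynchronously when the buffer fills.
public class BatchedWriter {
    public static void main(String[] args) throws Exception {
        BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("usertable"))
            .writeBufferSize(10 * 1024);   // 10KB buffer, as in the setup above
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             BufferedMutator mutator = conn.getBufferedMutator(params)) {
            for (int i = 0; i < 1000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c0"), Bytes.toBytes("value-" + i));
                mutator.mutate(put);       // buffered, not sent immediately
            }
        }                                  // close() flushes any remainder
    }
}
```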
Experiments

Metrics
  Write throughput, read latency (distribution), disk footprint/amplification

Workloads (varied at the client side)
  Write-only (100% Put) vs mixed (50% Put / 50% Get)
  Uniform vs Zipfian key distributions
  Small values (100B) vs big values (1KB)

Configurations (varied at the server side)
  Most experiments exercise the async WAL
Write Throughput

[Chart: throughput (ops/sec) for Zipf and Uniform key distributions under NONE, BASIC, and EAGER; in-memory compaction gains +25% and +44% over NONE]

100GB dataset, 100% writes, 100B values; every write updates a single column

Gains are less pronounced with big values (1KB): +11% (why?)
Single-Key Write Latency

[Chart: write latency (ms) at the 50th (median), 75th, 95th, and 99th (tail) percentiles under NONE, BASIC, and EAGER]

100GB dataset, Zipf distribution, 100% writes, 100B values
Single-Key Read Latency

[Chart: read latency (ms) at the 50th (median), 75th, 95th, and 99th (tail) percentiles under NONE, BASIC, and EAGER; annotated +9% (why?) at the median and -13% at the tail]

30GB dataset, Zipf distribution, 50% writes / 50% reads, 100B values
Disk Footprint / Write Amplification

[Chart: number of flushes, number of compactions, and total data written (GB) under NONE, BASIC, and EAGER; data written drops by 29%]

100GB dataset, Zipf distribution, 100% writes, 100B values
Status

In-Memory Compaction is GA in HBase 2.0
  Master JIRA HBASE-14918 complete (~20 subtasks)
  Major refactoring/extension of the MemStore code
  Many details in Apache HBase blog posts

CellChunkMap index and off-heap support in progress
  Master JIRA HBASE-16421
Summary

Accordion = a leaner and faster write path

Space-efficient index + redundancy elimination → less I/O
Less frequent flushes → increased write throughput
Less on-disk compaction → reduced write amplification
Data stays longer in RAM → reduced tail read latency

Edging closer to in-memory database performance
Thanks to Our Partners for Being Awesome
