HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter

EXPLOITING COARSE-GRAINED PARALLELISM IN
B+ TREE SEARCHES ON APUS
MAYANK DAGA, MARK NUTTER
AMD RESEARCH

RELEVANCE OF B+ TREE SEARCHES
 B+ Trees are a special case of B Trees
‒ Fundamental data structure used in several popular database
management systems
[Figure: logos of database systems built on B Trees and B+ Trees: mongoDB, MySQL, CouchDB, SQLite]

 High-throughput, read-only index searches are gaining traction in
‒ Video-copy detection
‒ Audio-search
‒ Online Transaction Processing (OLTP) Benchmarks

 Increase in memory capacity allows many database tables to
reside in memory
‒ Brings computational performance to the forefront

APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public

A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES

Mobile:
Step 1. Record Audio
Step 2. Generate Audio Fingerprint
Step 3. Send search request to server

App on a smartphone

Server:
Step 1. Receive search requests
Step 2. Query Database
Step 3. Return search results to client

Thousands of clients send requests
Music Library – Millions of Songs

[Figure: the server's database, drawn as a B+ Tree index with keys 1 to 8 over records d1 to d8]

DATABASE PRIMITIVES ON ACCELERATORS
 Discrete graphics processing units (dGPUs) provide a compelling
mix of
‒ Performance per Watt
‒ Performance per Dollar

 dGPUs have been used to accelerate critical database primitives
‒ scan
‒ sort
‒ join
‒ aggregation
‒ B+ Tree Searches?

B+ TREE SEARCHES ON ACCELERATORS
 B+ Tree searches present significant challenges
‒ Irregular representation in memory
‒ An artifact of malloc() and new()

‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address
space
‒ Indirect links need to be converted to relative offsets

‒ Requirement to copy the tree to the dGPU, which means
‒ One is always bound by the amount of GPU device memory


OUR SOLUTION
 Accelerated B+ Tree searches on a fused CPU+GPU processor (or APU [1])
‒ Eliminates data-copies by combining x86 CPU
and vector GPU cores on the same silicon die

 Developed a memory allocator to form a regular representation
of the tree in memory
‒ Fundamental data structure is not altered
‒ Only parts of its layout are changed

[1] www.hsafoundation.com

OUTLINE
 Motivation and Contribution
 Background
‒ AMD APU Architecture
‒ B+ Trees

 Approach
‒ Transforming the Memory Layout
‒ Eliminating the Divergence

 Results
‒ Performance
‒ Analysis

 Summary and Next Steps

AMD APU ARCHITECTURE
[Figure: block diagram of the AMD 2nd Gen. A-series APU. Labeled blocks: System Memory (Host Memory, GPU Frame-Buffer), DRAM Controllers, x86 Cores, GPU Vector Cores, RMB, System Request Interface (SRI), xBar, Link Controller, MCT, UNB, FCL, Platform Interfaces]
UNB - Unified Northbridge, MCT - Memory Controller

 The APU includes dedicated IOMMUv2 hardware
‒ Provides direct mapping between GPU and CPU virtual address (VA) space
‒ Enables GPUs to access the system memory
‒ Enables GPUs to track whether pages are resident in memory

 Today, GPU cores can access VA space at a granularity of contiguous chunks


B+ TREES

[Figure: example B+ Tree with keys 1 to 8 in the index nodes and records d1 to d8 at the leaves]

 A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
‒ has a high fan-out to reduce disk I/O operations

 Order (b) of a B+ Tree measures the capacity of its nodes
 Number of children (m) in an internal node is
‒ ⌈b/2⌉ <= m <= b
‒ Root node can have as few as two children

 Number of keys in an internal node = (m – 1)
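The occupancy rules above can be written down as a few helper functions. This is a minimal sketch; the function names are hypothetical and do not come from the presentation.

```c
#include <assert.h>

/* Minimum number of children of a non-root internal node: ceil(b / 2). */
int min_children(int order) {
    assert(order >= 2);
    return (order + 1) / 2;
}

/* Maximum number of children of an internal node: b. */
int max_children(int order) {
    assert(order >= 2);
    return order;
}

/* An internal node with m children stores m - 1 keys. */
int keys_for_children(int children) {
    assert(children >= 2);
    return children - 1;
}
```

For an order of 16, a non-root internal node then has between 8 and 16 children, and therefore between 7 and 15 keys.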

APPROACH FOR PARALLELIZATION
 Fine-grained (Accelerate a single query)
‒ Replace Binary search in each node with K-ary search
‒ Maximum performance improvement = log(k)/log(2)
‒ Results in poor occupancy of the GPU cores

 Coarse-grained (Perform many queries in parallel)
‒ Enables data-parallelism

‒ Increases memory bandwidth with parallel reads
‒ Increases throughput (transactions per second for OLTP)
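The fine-grained bound quoted above can be derived as follows. This is a sketch under the assumption that one search step costs the same in both schemes, with n the number of keys in a node and a k-ary step evaluating its k - 1 separators in parallel.

```latex
\[
  \text{steps}_{\text{binary}} \approx \log_2 n,
  \qquad
  \text{steps}_{k\text{-ary}} \approx \log_k n = \frac{\log_2 n}{\log_2 k}
\]
\[
  \text{speedup}_{\max}
  = \frac{\log_2 n}{\log_k n}
  = \frac{\log k}{\log 2}
  = \log_2 k
\]
```

For example, k = 16 gives at most a 4-fold improvement for a single query, which is one reason the coarse-grained approach is pursued instead.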

TRANSFORMING THE MEMORY LAYOUT
[Figure: one contiguous memory buffer laid out as: nodes w/ metadata .. | keys .. | values ..]

 Metadata
‒ Number of keys in a node
‒ Offset to keys/values in the buffer
‒ Offset to the first child node
‒ Whether a node is a leaf

 Pass a pointer to this memory buffer to the accelerator
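A minimal sketch of such a layout follows, assuming 32-bit keys and values and a root node at offset 0. The struct, its field names, and both functions are hypothetical illustrations, not the authors' allocator. The point is that every link is a byte offset from the start of the buffer, so the same buffer can be traversed from any address space.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-node metadata; all links are byte offsets from the buffer start. */
typedef struct {
    uint32_t num_keys;         /* number of keys in this node */
    uint32_t keys_off;         /* offset to this node's keys */
    uint32_t vals_off;         /* offset to this node's values (leaves only) */
    uint32_t first_child_off;  /* offset to the first child header (internal nodes only) */
    uint32_t is_leaf;          /* non-zero for a leaf */
} flat_node;

#define NODE_SZ ((uint32_t)sizeof(flat_node))
#define KEY_SZ  ((uint32_t)sizeof(uint32_t))

/* Builds a demo tree: root key {3}, leaf0 keys {1,2}, leaf1 keys {3,4}.
   Layout: [3 node headers][5 keys][4 values]. Returns the bytes used. */
size_t build_demo_tree(uint32_t *words) {
    uint8_t *base = (uint8_t *)words;
    const uint32_t keys_off = 3 * NODE_SZ;
    const uint32_t vals_off = keys_off + 5 * KEY_SZ;
    const flat_node nodes[3] = {
        {1, keys_off,              0,                     NODE_SZ, 0}, /* root  */
        {2, keys_off + 1 * KEY_SZ, vals_off,              0,       1}, /* leaf0 */
        {2, keys_off + 3 * KEY_SZ, vals_off + 2 * KEY_SZ, 0,       1}, /* leaf1 */
    };
    const uint32_t keys[5] = {3, 1, 2, 3, 4};
    const uint32_t vals[4] = {10, 20, 30, 40};
    memcpy(base, nodes, sizeof nodes);
    memcpy(base + keys_off, keys, sizeof keys);
    memcpy(base + vals_off, vals, sizeof vals);
    return vals_off + sizeof vals;
}

/* Looks up key; returns 1 and stores the value on a hit, 0 on a miss. */
int flat_search(const void *buf, uint32_t key, uint32_t *value_out) {
    const uint8_t *base = (const uint8_t *)buf;
    const flat_node *n = (const flat_node *)base;
    while (!n->is_leaf) {
        const uint32_t *keys = (const uint32_t *)(base + n->keys_off);
        uint32_t i = 0;
        while (i < n->num_keys && key >= keys[i]) i++;
        /* Siblings are stored contiguously, so child i is i headers past the first. */
        n = (const flat_node *)(base + n->first_child_off + i * NODE_SZ);
    }
    const uint32_t *keys = (const uint32_t *)(base + n->keys_off);
    const uint32_t *vals = (const uint32_t *)(base + n->vals_off);
    for (uint32_t i = 0; i < n->num_keys; i++) {
        if (keys[i] == key) { *value_out = vals[i]; return 1; }
    }
    return 0;
}
```

Calling `build_demo_tree` on a 32-word `uint32_t` array and then `flat_search` with that array is enough to try it out. Because no absolute pointers are stored, the buffer could equally be copied or mapped elsewhere without rewriting any links.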

ELIMINATING THE DIVERGENCE
 Each work-item/thread executes a single query
 May increase divergence within a wave-front
‒ Every query may follow a different path in the B+ Tree

[Figure: work-items WI-1 and WI-2 following different paths through the example B+ Tree (keys 1 to 8, records d1 to d8)]

 Sort the keys to be searched
‒ Increases the chance that work-items within a wave-front follow similar paths in the B+ Tree
‒ We use Radix Sort [1] to sort the keys on the GPU

[1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for gpgpu stream architectures,” in Proceedings of the 19th intl. conf. on
Parallel architectures and compilation techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010.
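The effect of sorting can be illustrated on the CPU with a rough divergence proxy: the number of distinct leaves touched by the queries of one wave-front. The sketch below makes several assumptions that are not from the presentation: a wave-front of 64 work-items, leaves that each hold a fixed run of consecutive keys, and the standard library's `qsort` standing in for the GPU radix sort. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define WAVEFRONT_SIZE 64  /* assumed number of work-items per wave-front */

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Stand-in for the GPU radix sort: sorts the query keys in ascending order. */
void sort_queries(uint32_t *q, size_t n) {
    qsort(q, n, sizeof *q, cmp_u32);
}

/* Distinct leaves touched by one wave-front, assuming leaf id = key / keys_per_leaf. */
size_t distinct_leaves(const uint32_t *q, size_t n, uint32_t keys_per_leaf) {
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i && !seen; j++)
            seen = (q[j] / keys_per_leaf == q[i] / keys_per_leaf);
        distinct += !seen;
    }
    return distinct;
}

/* Sum of distinct leaves over all wave-fronts; lower means less divergence. */
size_t total_divergence(const uint32_t *q, size_t n, uint32_t keys_per_leaf) {
    size_t total = 0;
    for (size_t start = 0; start < n; start += WAVEFRONT_SIZE) {
        size_t len = n - start < WAVEFRONT_SIZE ? n - start : WAVEFRONT_SIZE;
        total += distinct_leaves(q + start, len, keys_per_leaf);
    }
    return total;
}
```

Comparing `total_divergence` before and after `sort_queries` on a shuffled batch shows the count dropping toward its minimum, since each sorted wave-front covers a narrow key range.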

IMPACT OF DIVERGENCE IN B+ TREE SEARCHES

[Figure: Impact of Divergence (y-axis from 0 to 5) vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (8 to 128). Series: AMD Radeon™ HD 7660, AMD Phenom™ II X6 1090T]

Impact of Divergence on GPU – 3.7-fold (average)
Impact of Divergence on CPU – 1.8-fold (average)


EXPERIMENTAL SETUP
 Software
‒ A B+ Tree w/ 4M records is used
‒ Search queries are created using
  ‒ normal_distribution() (C++11 feature)
  ‒ The queries have been sorted
‒ CPU Implementation from
  ‒ http://www.amittai.com/prose/bplustree.html
‒ Driver: AMD Catalyst™ v12.8
‒ Programming Model: OpenCL™

 Hardware
‒ AMD Radeon™ HD 7660 APU (Trinity)
  ‒ 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
‒ AMD Phenom™ II X6 1090T + AMD Radeon™ HD 7970 (Tahiti)
  ‒ 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
‒ Device Memory does not include data-copy time

[Figure: histogram of the generated search queries. x-axis: Bin (100000 to 4100000), y-axis: Frequency (0 to 60000)]
[Figure: example database table with columns E ID (PKey) and Age, holding 4 million (4194304) entries]

RESULTS – QUERIES PER SECOND
[Figure: Queries/Second (Million), log scale from 1 to 1000, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4 to 128). Series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7970 (Device Memory), AMD Radeon™ HD 7970 (Pinned Memory). Annotations: 700M Queries/Sec; dGPU (device memory); ~350M Queries/Sec. (avg.); dGPU (pinned memory); AMD Phenom™ CPU]

RESULTS – QUERIES PER SECOND
[Figure: Queries/Second (Million), from 0 to 140, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4 to 128). Series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7660 (Device Memory), AMD Radeon™ HD 7660 (Pinned Memory). Annotations: APU (device memory); APU (pinned memory)]

APU (pinned memory) is faster than the CPU implementation

RESULTS - SPEEDUP
[Figure: Speedup (from 0 to 12) vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4 to 128). Series: AMD Radeon™ HD 7660 (Pinned Memory), AMD Radeon™ HD 7660 (Device Memory). Annotation: 4.9-fold speedup]

Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation.

Average Speedup – 4.3-fold (Device Memory), 2.5-fold (Pinned Memory)

• Efficacy of IOMMUv2 + HSA on the APU

Platform                            Size of the B+ Tree
                                    < 1.5GB    1.5GB – 2.7GB    > 2.7GB
Discrete GPU (memory size = 3GB)    ✓          ✓                ✗
APU (prototype software)            ✓          ✓                ✓

ANALYSIS
 The accelerators and the CPU yield best performance for different orders of the B+ Tree
‒ CPU → order = 64
  ‒ Ability of CPUs to prefetch data is beneficial for higher orders
‒ APU and dGPU → order = 16
  ‒ GPUs do not have a prefetcher → each cache line should be utilized as efficiently as possible
  ‒ GPUs have a cache-line size of 64 bytes
  ‒ Order = 16 is most beneficial (16 * 4 bytes)

[Figure: buffer layout: nodes w/ metadata .. | keys .. | values ..]
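The cache-line arithmetic behind the order = 16 observation can be checked with a small helper. This is a sketch that assumes 4-byte keys and a key array starting on a cache-line boundary; the function name is hypothetical.

```c
#include <stddef.h>

/* Number of cache lines spanned by num_keys keys of key_bytes each,
   assuming the key array starts on a cache-line boundary. */
size_t cache_lines_for_keys(size_t num_keys, size_t key_bytes, size_t line_bytes) {
    return (num_keys * key_bytes + line_bytes - 1) / line_bytes;
}
```

With 64-byte lines, 16 four-byte keys fill exactly one line, whereas 64 keys span four lines. Without a prefetcher, each additional line is a separate memory access on the search path.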


ANALYSIS
 Minimum batch size to match the CPU performance

                          Order = 64     Order = 16
dGPU (device memory)      4K queries     2K queries
dGPU (pinned memory)      N.A.           N.A.
APU (device memory)       10K queries    4K queries
APU (pinned memory)       20K queries    16K queries

 reuse_factor - amortizing the cost of data-copies to the GPU

          90% Queries    100% Queries
dGPU      15             54
APU       100            N.A.


PROGRAMMABILITY

GPU

    typedef global unsigned int g_uint;
    typedef global mynode g_mynode;

    int tid = get_global_id(0);
    int i = 0;
    g_mynode *c = (g_mynode *)root;

    /* find the leaf node */
    while(!c->is_leaf){
        i = 0;  /* restart the key scan in each node */
        while (i < c->num_keys){
            if(keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                i++;
            else break;
        }
        c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
    }

    /* match the key in the leaf node */
    for(i=0; i<c->num_keys; i++){
        if((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid]) break;
    }

    /* retrieve the record */
    if(i != c->num_keys)
        records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];

CPU-SSE

    int i = 0, j;
    node * c = root;
    __m128i vkey = _mm_set1_epi32(key);
    __m128i vnodekey, *vptr;
    short int mask;

    /* find the leaf node */
    while( !c->is_leaf ){
        /* compare four node keys at a time against the search key */
        for(i = 0; i < (c->num_keys-3); i+=4){
            vptr = (__m128i *)&(c->keys[i]);
            vnodekey = _mm_load_si128(vptr);
            mask = _mm_movemask_ps(_mm_cvtepi32_ps( _mm_cmplt_epi32(vkey, vnodekey)));
            if((mask) & 8) break;
        }
        /* finish with a scalar scan from where the vector loop stopped */
        for(j = i; j < c->num_keys; j++){
            if(key < c->keys[j]) break;
        }
        c = (node *)c->pointers[j];
    }

    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key) break;

    /* retrieve the record */
    if (i != c->num_keys)
        return (record *)c->pointers[i];


RELATED WORK
 J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA." In
Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis,
Implementation, and Performance, in conjunction with ISCA, 2011
‒ Authors report ~10-fold speedup over single-thread-non-SSE CPU implementation, using a discrete NVIDIA GTX
480 GPU (do not take data-copies into account)

 C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, P. Dubey, “FAST:
fast architecture sensitive tree search on modern CPUs and GPUs”, SIGMOD Conference, 2010
‒ Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (do not take data-copies into
account)

 J. Sewall, J. Chhugani, C. Kim, N. Satish, P. Dubey, “PALM: Parallel, Architecture-Friendly, Latch-Free
Modifications to B+ Trees on Multi-Core Processors”, Proceedings of VLDB Endowment, (VLDB 2011)
‒ Applicable for B+ Tree modifications on the GPU


POSSIBILITIES WITH HSA
[Figure: the example B+ Tree (keys 1 to 8, records d1 to d8) held in Memory shared by both kinds of cores. Labels: Parallel Reads for Searches; Serial Modifications; GPU Vector Cores; x86 Cores]

Graphics Core Next, User-mode Queuing, Shared Virtual Memory

SUMMARY
 B+ Tree is the fundamental data structure in many RDBMS
‒ Accelerating B+ Tree searches is critical
‒ Presents significant challenges on discrete GPUs

 We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
‒ 2.5-fold (avg.) speedup over 6-threads+SSE CPU implementation

 Possible Next Steps
‒ HSA + IOMMUv2 would reduce the need to modify the B+ Tree representation
‒ Investigate CPU-GPU co-scheduling

‒ Investigate modifications on the B+ Tree

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of
their respective owners.

