EXPLOITING COARSE-GRAINED PARALLELISM IN
B+ TREE SEARCHES ON APUS
MAYANK DAGA, MARK NUTTER
AMD RESEARCH
RELEVANCE OF B+ TREE SEARCHES
 B+ Trees are a special case of B Trees
‒ Fundamental data structure used in several popular database management systems
[Figure: logos of database systems that use B Trees or B+ Trees: mongoDB, MySQL, CouchDB, SQLite]

 High-throughput, read-only index searches are gaining traction in
‒ Video-copy detection
‒ Audio-search
‒ Online Transaction Processing (OLTP) Benchmarks

 Increase in memory capacity allows many database tables to reside in memory
‒ Brings computational performance to the forefront

2 | APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES

Mobile (app on a smartphone):
Step 1. Record Audio
Step 2. Generate Audio Fingerprint
Step 3. Send search request to server

Server (thousands of clients send requests):
Step 1. Receive search requests
Step 2. Query Database
Step 3. Return search results to client

Database: Music Library – Millions of Songs
[Figure: B+ Tree index over the database; the leaves hold keys 1–8 and point to records d1–d8]
DATABASE PRIMITIVES ON ACCELERATORS
 Discrete graphics processing units (dGPUs) provide a compelling
mix of
‒ Performance per Watt
‒ Performance per Dollar

 dGPUs have been used to accelerate critical database primitives
‒ scan
‒ sort
‒ join
‒ aggregation
‒ B+ Tree Searches?
B+ TREE SEARCHES ON ACCELERATORS
 B+ Tree searches present significant challenges
‒ Irregular representation in memory
‒ An artifact of malloc() and new()

‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address
space
‒ Indirect links need to be converted to relative offsets

‒ The tree must be copied to the dGPU
‒ The tree size is therefore always bound by the amount of GPU device memory
OUR SOLUTION
 Accelerated B+ Tree searches on a fused CPU+GPU processor (or APU [1])
‒ Eliminates data-copies by combining x86 CPU
and vector GPU cores on the same silicon die

 Developed a memory allocator to form a regular representation
of the tree in memory
‒ Fundamental data structure is not altered
‒ Only parts of its layout are changed

[1] www.hsafoundation.com
OUTLINE
 Motivation and Contribution
 Background
‒ AMD APU Architecture
‒ B+ Trees

 Approach
‒ Transforming the Memory Layout
‒ Eliminating the Divergence

 Results
‒ Performance
‒ Analysis

 Summary and Next Steps
AMD APU ARCHITECTURE
[Figure: block diagram of the AMD 2nd Gen. A-series APU, showing System Memory (Host Memory, GPU Frame-Buffer), DRAM Controllers, x86 Cores, GPU Vector Cores, RMB, System Request Interface (SRI), xBar, Link Controller, MCT, UNB, FCL, Platform Interfaces]
UNB - Unified Northbridge, MCT - Memory Controller

 The APU includes dedicated IOMMUv2 hardware
‒ Provides direct mapping between GPU and CPU virtual address (VA) space
‒ Enables GPUs to access the system memory
‒ Enables GPUs to track whether pages are resident in memory

 Today, GPU cores can access VA space at a granularity of contiguous chunks
B+ TREES

[Figure: example B+ Tree; the leaves hold keys 1–8 and point to records d1–d8]

 A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
‒ has a high fan-out to reduce disk I/O operations

 Order (b) of a B+ Tree measures the capacity of its nodes
 Number of children (m) in an internal node is
‒ ⌈b/2⌉ <= m <= b
‒ Root node can have as few as two children

 Number of keys in an internal node = (m – 1)
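The bounds on children and keys can be written out as a small helper. This is only an illustrative sketch: the type `bplus_bounds` and the function `internal_node_bounds` are hypothetical names, not part of the presented implementation, and the order b is assumed to be at least 2.

```c
#include <assert.h>

/* Bounds implied by the order b of a B+ Tree (b >= 2 is assumed). */
typedef struct {
    int min_children; /* ceil(b/2); the root may have as few as two */
    int max_children; /* b */
    int min_keys;     /* min_children - 1 */
    int max_keys;     /* max_children - 1 */
} bplus_bounds;

/* Hypothetical helper: bounds for a non-root internal node of order b. */
static bplus_bounds internal_node_bounds(int b)
{
    bplus_bounds r;
    r.min_children = (b + 1) / 2; /* integer form of ceil(b/2) */
    r.max_children = b;
    r.min_keys = r.min_children - 1;
    r.max_keys = r.max_children - 1;
    return r;
}
```

For example, an internal node of order 16 has between 8 and 16 children, and therefore between 7 and 15 keys.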
APPROACH FOR PARALLELIZATION
 Fine-grained (Accelerate a single query)
‒ Replace Binary search in each node with K-ary search
‒ Maximum performance improvement = log(k)/log(2)
‒ Results in poor occupancy of the GPU cores

 Coarse-grained (Perform many queries in parallel)
‒ Enables data-parallelism

‒ Increases memory bandwidth with parallel reads
‒ Increases throughput (transactions per second for OLTP)
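The fine-grained bound log(k)/log(2) = log2(k) can be checked by counting comparison rounds. The sketch below is an illustration under simple assumptions (each round splits the remaining sorted range into k equal parts); `search_rounds` is a hypothetical helper, not code from the presentation.

```c
#include <assert.h>

/* Rounds needed to narrow n sorted keys down to one candidate when every
 * round splits the remaining range into k parts (k >= 2 is assumed).
 * k = 2 corresponds to binary search, larger k to K-ary search. */
static int search_rounds(unsigned n, unsigned k)
{
    int rounds = 0;
    while (n > 1) {
        n = (n + k - 1) / k; /* size of the largest remaining part */
        rounds++;
    }
    return rounds;
}
```

With 64 keys per node, binary search takes 6 rounds, 4-ary search takes 3 (a 2x reduction, log2(4) = 2), and 8-ary search takes 2 (a 3x reduction, log2(8) = 3). The gain per query is bounded by log2(k), which is why the coarse-grained approach of running many queries in parallel is the focus here.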
TRANSFORMING THE MEMORY LAYOUT
[Figure: one contiguous memory buffer with three regions, in order: nodes w/ metadata, keys, values]

 Metadata
‒ Number of keys in a node
‒ Offset to keys/values in the buffer
‒ Offset to the first child node
‒ Whether a node is a leaf

 Pass a pointer to this memory buffer to the accelerator
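A minimal host-side sketch of such a buffer is given below. It assumes 32-bit keys and values, byte offsets relative to the start of the buffer, and children of a node stored contiguously; the struct `flat_node`, its field names, and the helpers `flat_search`, `flat_search_batch` and `make_example` are hypothetical and only illustrate the idea, they are not the allocator described in the talk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-node metadata; all links are byte offsets into the buffer. */
typedef struct {
    uint32_t num_keys;   /* number of keys in the node */
    uint32_t keys_off;   /* offset of the node's keys */
    uint32_t values_off; /* offset of the node's values (leaves only) */
    uint32_t child_off;  /* offset of the first child node */
    uint32_t is_leaf;    /* non-zero if the node is a leaf */
} flat_node;

#define NOT_FOUND UINT32_MAX

/* Search one key; the root is assumed to sit at offset 0. */
static uint32_t flat_search(const unsigned char *base, uint32_t key)
{
    const flat_node *c = (const flat_node *)base;
    while (!c->is_leaf) {
        const uint32_t *keys = (const uint32_t *)(base + c->keys_off);
        uint32_t i = 0;
        while (i < c->num_keys && key >= keys[i])
            i++;
        /* children are contiguous, so child i is a fixed stride away */
        c = (const flat_node *)(base + c->child_off + i * sizeof(flat_node));
    }
    const uint32_t *keys = (const uint32_t *)(base + c->keys_off);
    const uint32_t *vals = (const uint32_t *)(base + c->values_off);
    for (uint32_t i = 0; i < c->num_keys; i++)
        if (keys[i] == key)
            return vals[i];
    return NOT_FOUND;
}

/* Coarse-grained batch: each iteration is what one work-item would do. */
static void flat_search_batch(const unsigned char *base, const uint32_t *queries,
                              uint32_t *results, size_t n)
{
    for (size_t t = 0; t < n; t++)
        results[t] = flat_search(base, queries[t]);
}

/* Build a 3-node example: root {5} -> leaf {1,3}, leaf {5,7}. */
static unsigned char *make_example(void)
{
    enum { KEYS = 3 * sizeof(flat_node), VALS = KEYS + 5 * sizeof(uint32_t) };
    const flat_node nodes[3] = {
        { 1, KEYS,      0,        (uint32_t)sizeof(flat_node), 0 }, /* root */
        { 2, KEYS + 4,  VALS,     0,                           1 }, /* left leaf */
        { 2, KEYS + 12, VALS + 8, 0,                           1 }, /* right leaf */
    };
    const uint32_t keys[5] = { 5, 1, 3, 5, 7 };  /* root key, then leaf keys */
    const uint32_t vals[4] = { 10, 30, 50, 70 }; /* one value per leaf key */
    unsigned char *buf = malloc(VALS + sizeof vals);
    if (buf == NULL)
        return NULL;
    memcpy(buf, nodes, sizeof nodes);
    memcpy(buf + KEYS, keys, sizeof keys);
    memcpy(buf + VALS, vals, sizeof vals);
    return buf;
}
```

Because every link is an offset rather than a raw pointer, the same buffer stays valid whether it is read through a CPU pointer or handed to an accelerator as a single allocation.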
ELIMINATING THE DIVERGENCE
 Each work-item/thread executes a single query
 May increase divergence within a wave-front
‒ Every query may follow a different path in the B+ Tree

[Figure: work-items WI-1 and WI-2 of one wave-front following different root-to-leaf paths in the B+ Tree]

 Sort the keys to be searched
‒ Increases the chances of work-items within a wave-front to follow similar paths in the B+ Tree
‒ We use Radix Sort [1] to sort the keys on the GPU

[1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for gpgpu stream architectures,” in Proceedings of the 19th intl. conf. on
Parallel architectures and compilation techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010.
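The effect of sorting can be illustrated on the host. The sketch below is not the GPU radix sort used in the talk: it substitutes the C library's `qsort` as a stand-in, and it assumes, purely for illustration, that a key's leaf is `key / keys_per_leaf`. The `query` struct and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t key;  /* key to search for */
    uint32_t slot; /* position of the query in the original batch */
} query;

static int by_key(const void *a, const void *b)
{
    uint32_t x = ((const query *)a)->key, y = ((const query *)b)->key;
    return (x > y) - (x < y);
}

/* Sort a batch by key so that neighbouring work-items search nearby keys.
 * The slot field lets results be written back in the original order. */
static void sort_batch(query *q, size_t n)
{
    qsort(q, n, sizeof *q, by_key);
}

/* Distinct leaves touched by one group of `width` consecutive queries,
 * under the illustrative assumption leaf id = key / keys_per_leaf. */
static size_t leaves_touched(const query *q, size_t width, uint32_t keys_per_leaf)
{
    size_t distinct = 0;
    for (size_t i = 0; i < width; i++) {
        size_t j = 0;
        while (j < i && q[j].key / keys_per_leaf != q[i].key / keys_per_leaf)
            j++;
        if (j == i) /* no earlier query in the group hit this leaf */
            distinct++;
    }
    return distinct;
}
```

In a small batch such as {70, 1, 33, 2, 71, 34, 3, 72} with 16 keys per leaf, the first group of four queries touches three different leaves before sorting and two after, so more work-items in the group follow the same path.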
IMPACT OF DIVERGENCE IN B+ TREE SEARCHES

[Figure: Impact of Divergence. y-axis: impact of divergence (0–5); x-axis: Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (8, 16, 32, 64, 128). Series: AMD Radeon™ HD 7660, AMD Phenom™ II X6 1090T]

Impact of Divergence on GPU – 3.7-fold (average)
Impact of Divergence on CPU – 1.8-fold (average)
OUTLINE
 Motivation and Contribution
 Background
‒ AMD APU Architecture
‒ B+ Trees

 Approach
‒ Transforming the Memory Layout
‒ Eliminating the Divergence

 Results
‒ Performance
‒ Analysis

 Summary and Next Steps
EXPERIMENTAL SETUP
 Software
‒ A B+ Tree w/ 4M records is used
‒ Search queries are created using normal_distribution() (C++11 feature)
‒ The queries have been sorted
‒ CPU Implementation from http://www.amittai.com/prose/bplustree.html
‒ Driver: AMD Catalyst™ v12.8
‒ Programming Model: OpenCL™

 Hardware
‒ AMD Radeon™ HD 7660 APU (Trinity)
‒ 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
‒ AMD Phenom™ II X6 1090T + AMD Radeon™ HD 7970 (Tahiti)
‒ 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
‒ Device Memory results do not include data-copy time

[Figure: histogram of the generated search keys. x-axis: Bin (100000 to 4100000); y-axis: Frequency (0 to 60000)]
[Figure: example table with columns E ID (PKey) and Age, from E ID 0000001 (Age 34) to E ID 4194304 (Age 50); 4 million entries]
RESULTS – QUERIES PER SECOND
[Figure: Queries/Second (Million), log scale from 1 to 1000, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4, 8, 16, 32, 64, 128). Series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7970 (Device Memory), AMD Radeon™ HD 7970 (Pinned Memory). Annotation: 700M Queries/Sec]

dGPU (device memory): ~350M Queries/Sec. (avg.)
dGPU (pinned memory): ~9M Queries/Sec. (avg.)
AMD Phenom™ CPU: ~18M Queries/Sec. (avg.)
RESULTS – QUERIES PER SECOND
[Figure: Queries/Second (Million), 0 to 140, vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4, 8, 16, 32, 64, 128). Series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7660 (Device Memory), AMD Radeon™ HD 7660 (Pinned Memory)]

APU (device memory): ~66M Queries/Sec. (avg.)
APU (pinned memory): ~40M Queries/Sec. (avg.)

APU (pinned memory) is faster than the CPU implementation
RESULTS - SPEEDUP
[Figure: Speedup (0 to 12) vs. Number of Queries (16K, 32K, 64K, 128K) w/ Order of B+ Tree (4, 8, 16, 32, 64, 128). Series: AMD Radeon™ HD 7660 (Pinned Memory), AMD Radeon™ HD 7660 (Device Memory). Annotation: 4.9-fold speedup]

Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation.

Average Speedup – 4.3-fold (Device Memory), 2.5-fold (Pinned Memory)

• Efficacy of IOMMUv2 + HSA on the APU

Size of the B+ Tree:                < 1.5GB    1.5GB – 2.7GB    > 2.7GB
Discrete GPU (memory size = 3GB)    ✓          ✓                ✗
APU (prototype software)            ✓          ✓                ✓
ANALYSIS
 The accelerators and the CPU yield best performance for different
orders of the B+ Tree
‒ CPU → order = 64
‒ Ability of CPUs to prefetch data is beneficial for higher orders
‒ APU and dGPU → order = 16
‒ GPUs do not have a prefetcher → each cache line should be utilized as efficiently as possible
‒ GPUs have a cache-line size of 64 bytes
‒ Order = 16 is most beneficial (16 * 4 bytes = 64 bytes)

[Figure: buffer layout with three regions: nodes w/ metadata, keys, values]
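The cache-line argument is plain arithmetic, sketched below. The helpers `keys_per_line` and `lines_per_node` are hypothetical names, and the sketch assumes 4-byte keys and a key array that starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Number of keys that fit in one cache line. */
static size_t keys_per_line(size_t line_bytes, size_t key_bytes)
{
    return line_bytes / key_bytes;
}

/* Cache lines touched when scanning all keys of one node, assuming the
 * node's key array starts on a cache-line boundary. */
static size_t lines_per_node(size_t num_keys, size_t line_bytes, size_t key_bytes)
{
    size_t bytes = num_keys * key_bytes;
    return (bytes + line_bytes - 1) / line_bytes; /* round up */
}
```

With a 64-byte line and 4-byte keys, 16 keys fill exactly one line, so a node with 16 keys costs a single line fetch, while a node with 64 keys spans four lines that a GPU without a prefetcher has to fetch on demand.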
ANALYSIS
 Minimum batch size to match the CPU performance

                        Order = 64     Order = 16
dGPU (device memory)    4K queries     2K queries
dGPU (pinned memory)    N.A.           N.A.
APU (device memory)     10K queries    4K queries
APU (pinned memory)     20K queries    16K queries

 reuse_factor - amortizing the cost of data-copies to the GPU

         90% Queries    100% Queries
dGPU     15             54
APU      100            N.A.
PROGRAMMABILITY

GPU

typedef global unsigned int g_uint;
typedef global mynode g_mynode;
int tid = get_global_id(0);    /* one work-item per query */
int i = 0;
g_mynode *c = (g_mynode *)root;
/* find the leaf node */
while(!c->is_leaf){
    i = 0;                     /* restart the key scan in every node */
    while (i < c->num_keys){
        if(keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
            i++;
        else break;
    }
    /* links are offsets relative to root; children are contiguous */
    c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
}
/* match the key in the leaf node */
for(i=0; i<c->num_keys; i++){
    if((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid]) break;
}
/* retrieve the record */
if(i != c->num_keys)
    records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];

CPU-SSE

int i = 0, j;
node * c = root;
__m128i vkey = _mm_set1_epi32(key);    /* broadcast the search key */
__m128i vnodekey, *vptr;
short int mask;
/* find the leaf node */
while( !c->is_leaf ){
    /* compare four node keys per iteration */
    for(i = 0; i < (c->num_keys-3); i+=4){
        vptr = (__m128i *)&(c->keys[i]);
        vnodekey = _mm_load_si128(vptr);
        mask = _mm_movemask_ps(_mm_cvtepi32_ps( _mm_cmplt_epi32(vkey, vnodekey)));
        if((mask) & 8) break;
    }
    /* scalar scan to find the exact child index */
    for(j = i; j < c->num_keys; j++){
        if(key < c->keys[j]) break;
    }
    c = (node *)c->pointers[j];
}
/* match the key in the leaf node */
for (i = 0; i < c->num_keys; i++)
    if (c->keys[i] == key) break;
/* retrieve the record */
if (i != c->num_keys)
    return (record *)c->pointers[i];
RELATED WORK
 J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA." In
Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis,
Implementation, and Performance, in conjunction with ISCA, 2011
‒ Authors report ~10-fold speedup over single-thread-non-SSE CPU implementation, using a discrete NVIDIA GTX
480 GPU (do not take data-copies into account)

 C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, P. Dubey, “FAST:
fast architecture sensitive tree search on modern CPUs and GPUs”, SIGMOD Conference, 2010
‒ Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (do not take data-copies into
account)

 J. Sewall, J. Chhugani, C. Kim, N. Satish, P. Dubey, “PALM: Parallel, Architecture-Friendly, Latch-Free
Modifications to B+ Trees on Multi-Core Processors”, Proceedings of VLDB Endowment, (VLDB 2011)
‒ Applicable for B+ Tree modifications on the GPU

POSSIBILITIES WITH HSA
[Figure: a B+ Tree held in shared Memory; the GPU Vector Cores perform Parallel Reads for Searches while the x86 Cores perform Serial Modifications]

Graphics Core Next, User-mode Queuing, Shared Virtual Memory
SUMMARY
 B+ Tree is the fundamental data structure in many RDBMS
‒ Accelerating B+ Tree searches is critical
‒ Presents significant challenges on discrete GPUs

 We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
‒ 2.5-fold (avg.) speedup over 6-threads+SSE CPU implementation

 Possible Next Steps
‒ HSA + IOMMUv2 would reduce the need to modify the B+ Tree representation
‒ Investigate CPU-GPU co-scheduling
‒ Investigate modifications on the B+ Tree
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of
their respective owners.

Kindratenko hpc day 2011 Kiev
Volodymyr Saviak
 
HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter

  • 1. EXPLOITING COARSE-GRAINED PARALLELISM IN B+ TREE SEARCHES ON APUS
      Mayank Daga, Mark Nutter (AMD Research)
  • 2. RELEVANCE OF B+ TREE SEARCHES
      • B+ Trees are a special case of B Trees
          ‒ Fundamental data structure used in several popular database management systems
          ‒ [Figure: B Tree vs. B+ Tree usage across MongoDB, MySQL, CouchDB, and SQLite]
      • High-throughput, read-only index searches are gaining traction in
          ‒ Video-copy detection
          ‒ Audio search
          ‒ Online Transaction Processing (OLTP) benchmarks
      • Increase in memory capacity allows many database tables to reside in memory
          ‒ Brings computational performance to the forefront
      APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
  • 3. A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES
      • Mobile (app on a smartphone; thousands of clients send requests):
          ‒ Step 1. Record audio
          ‒ Step 2. Generate audio fingerprint
          ‒ Step 3. Send search request to server
      • Server (database: music library with millions of songs):
          ‒ Step 1. Receive search requests
          ‒ Step 2. Query database
          ‒ Step 3. Return search results to client
      [Figure: B+ Tree index over the database, with leaf keys 1–8 pointing to records d1–d8]
  • 4. DATABASE PRIMITIVES ON ACCELERATORS
      • Discrete graphics processing units (dGPUs) provide a compelling mix of
          ‒ Performance per Watt
          ‒ Performance per Dollar
      • dGPUs have been used to accelerate critical database primitives
          ‒ scan
          ‒ sort
          ‒ join
          ‒ aggregation
          ‒ B+ Tree searches?
  • 5. B+ TREE SEARCHES ON ACCELERATORS
      • B+ Tree searches present significant challenges
          ‒ Irregular representation in memory
              ‒ An artifact of malloc() and new()
          ‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address space
              ‒ Indirect links need to be converted to relative offsets
          ‒ Requirement to copy the tree to the dGPU, which entails
              ‒ One is always bound by the amount of GPU device memory
  • 6. OUR SOLUTION
      • Accelerated B+ Tree searches on a fused CPU+GPU processor (or APU [1])
          ‒ Eliminates data copies by combining x86 CPU and vector GPU cores on the same silicon die
      • Developed a memory allocator to form a regular representation of the tree in memory
          ‒ The fundamental data structure is not altered
          ‒ Merely parts of its layout are changed
      [1] www.hsafoundation.com
  • 7. OUTLINE
      • Motivation and Contribution
      • Background
          ‒ AMD APU Architecture
          ‒ B+ Trees
      • Approach
          ‒ Transforming the Memory Layout
          ‒ Eliminating the Divergence
      • Results
          ‒ Performance
          ‒ Analysis
      • Summary and Next Steps
  • 8. AMD APU ARCHITECTURE
      [Figure: block diagram of an AMD 2nd Gen. A-series APU showing x86 cores, GPU vector cores, System Request Interface (SRI), xBar, link controller, MCT, DRAM controllers, RMB, FCL, and platform interfaces, with system memory split into host memory and GPU frame-buffer. UNB = Unified Northbridge, MCT = Memory Controller]
      • The APU includes dedicated IOMMUv2 hardware
          ‒ Provides direct mapping between GPU and CPU virtual address (VA) space
          ‒ Enables GPUs to access the system memory
          ‒ Enables GPUs to track whether pages are resident in memory
      • Today, GPU cores can access VA space at a granularity of contiguous chunks
  • 9. B+ TREES
      [Figure: example B+ Tree with leaf keys 1–8 pointing to records d1–d8]
      • A B+ Tree …
          ‒ is a dynamic, multi-level index
          ‒ is efficient for retrieval of data stored in a block-oriented context
          ‒ has a high fan-out to reduce disk I/O operations
      • Order (b) of a B+ Tree measures the capacity of its nodes
      • Number of children (m) in an internal node is
          ‒ ⌈b/2⌉ <= m <= b
          ‒ Root node can have as few as two children
      • Number of keys in an internal node = (m – 1)
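The occupancy rules on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: `NodeBounds` and `internal_node_bounds` are hypothetical names, not from the slides, and the order b is taken to be the maximum number of children of a node.

```cpp
#include <cassert>

// Hypothetical helper types (not from the slides).
// b is the order of the tree: the maximum number of children per node.
struct NodeBounds {
    int min_children;  // ceil(b / 2), for non-root internal nodes
    int max_children;  // b
    int min_keys;      // min_children - 1
    int max_keys;      // max_children - 1
};

NodeBounds internal_node_bounds(int b) {
    assert(b >= 3);  // smaller orders are degenerate for this sketch
    NodeBounds nb;
    nb.min_children = (b + 1) / 2;  // integer ceiling of b / 2
    nb.max_children = b;
    nb.min_keys = nb.min_children - 1;  // keys = children - 1
    nb.max_keys = nb.max_children - 1;
    return nb;
}
```

For example, `internal_node_bounds(16)` describes internal nodes with 8 to 16 children and therefore 7 to 15 keys.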
  • 10. APPROACH FOR PARALLELIZATION
      • Fine-grained (accelerate a single query)
          ‒ Replace binary search in each node with K-ary search
          ‒ Maximum performance improvement = log(k)/log(2)
          ‒ Results in poor occupancy of the GPU cores
      • Coarse-grained (perform many queries in parallel)
          ‒ Enables data-parallelism
          ‒ Increases memory bandwidth with parallel reads
          ‒ Increases throughput (transactions per second for OLTP)
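The log(k)/log(2) bound for the fine-grained approach follows from counting search steps. The derivation below is a sketch under two assumptions that are not stated on the slide: n denotes the number of keys in a node, and one k-ary step is idealized as costing the same as one binary step.

```latex
\[
\text{steps}_{\text{binary}} = \log_2 n,
\qquad
\text{steps}_{k\text{-ary}} = \log_k n = \frac{\log_2 n}{\log_2 k}
\]
\[
\text{speedup}_{\max}
  = \frac{\text{steps}_{\text{binary}}}{\text{steps}_{k\text{-ary}}}
  = \log_2 k
  = \frac{\log k}{\log 2}
\]
```

With k = 16, for instance, the best case is a 4-fold improvement per query no matter how many GPU cores are available, which is consistent with the slide's point that the coarse-grained approach makes better use of the device.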
  • 11. TRANSFORMING THE MEMORY LAYOUT
      [Figure: one contiguous buffer laid out as: nodes w/ metadata | keys | values]
      • Metadata
          ‒ Number of keys in a node
          ‒ Offset to keys/values in the buffer
          ‒ Offset to the first child node
          ‒ Whether a node is a leaf
      • Pass a pointer to this memory buffer to the accelerator
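A minimal sketch of such a pointer-free layout is shown below. It is not the authors' allocator: the four-word node record, the constant names, and the helpers `build_flat_tree` and `flat_search` are assumptions made for illustration, and the builder only handles a two-level tree (one root plus leaves). What it does share with the slide is the idea that every link is an offset into a single buffer, with the children of a node stored consecutively.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node record: four 32-bit words of metadata per node.
// All offsets are word indices from the start of the buffer.
constexpr uint32_t NODE_WORDS = 4;
constexpr uint32_t NUM_KEYS = 0;  // number of keys in the node
constexpr uint32_t KEYS_OFF = 1;  // offset of the node's first key
constexpr uint32_t LINK_OFF = 2;  // internal: first child; leaf: first value
constexpr uint32_t IS_LEAF  = 3;  // 1 if the node is a leaf

// Buffer layout: [root, leaves | root keys, leaf keys | values].
std::vector<uint32_t> build_flat_tree(const std::vector<uint32_t>& keys,
                                      const std::vector<uint32_t>& vals,
                                      uint32_t leaf_cap) {
    assert(!keys.empty() && keys.size() == vals.size() && leaf_cap > 0);
    const uint32_t n = static_cast<uint32_t>(keys.size());
    const uint32_t leaves = (n + leaf_cap - 1) / leaf_cap;
    const uint32_t keys_base = (1 + leaves) * NODE_WORDS;  // after all nodes
    const uint32_t leaf_keys_base = keys_base + (leaves - 1);
    const uint32_t vals_base = leaf_keys_base + n;
    std::vector<uint32_t> buf(vals_base + n);

    // Root: separator j is the first key of leaf j + 1.
    buf[NUM_KEYS] = leaves - 1;
    buf[KEYS_OFF] = keys_base;
    buf[LINK_OFF] = NODE_WORDS;  // leaves follow the root directly
    buf[IS_LEAF] = 0;
    for (uint32_t j = 1; j < leaves; ++j)
        buf[keys_base + j - 1] = keys[j * leaf_cap];

    for (uint32_t j = 0; j < leaves; ++j) {
        uint32_t* node = &buf[(1 + j) * NODE_WORDS];
        const uint32_t first = j * leaf_cap;
        node[NUM_KEYS] = std::min(leaf_cap, n - first);
        node[KEYS_OFF] = leaf_keys_base + first;
        node[LINK_OFF] = vals_base + first;
        node[IS_LEAF] = 1;
    }
    for (uint32_t i = 0; i < n; ++i) {
        buf[leaf_keys_base + i] = keys[i];
        buf[vals_base + i] = vals[i];
    }
    return buf;
}

// Walks the buffer using offsets only, so the same logic works wherever
// the buffer is mapped.
bool flat_search(const std::vector<uint32_t>& buf, uint32_t key, uint32_t& out) {
    uint32_t node = 0;  // the root sits at the start of the buffer
    while (!buf[node + IS_LEAF]) {
        const uint32_t* k = &buf[buf[node + KEYS_OFF]];
        uint32_t i = 0;
        while (i < buf[node + NUM_KEYS] && key >= k[i]) ++i;
        node = buf[node + LINK_OFF] + i * NODE_WORDS;
    }
    const uint32_t* k = &buf[buf[node + KEYS_OFF]];
    for (uint32_t i = 0; i < buf[node + NUM_KEYS]; ++i) {
        if (k[i] == key) {
            out = buf[buf[node + LINK_OFF] + i];
            return true;
        }
    }
    return false;
}
```

Because the buffer contains no raw pointers, `buf.data()` could in principle be handed to an accelerator as-is, which is the property the slide relies on.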
  • 12. ELIMINATING THE DIVERGENCE
      • Each work-item/thread executes a single query
      • May increase divergence within a wave-front
          ‒ Every query may follow a different path in the B+ Tree
      [Figure: work-items WI-1 and WI-2 following different paths through the example B+ Tree]
      • Sort the keys to be searched
          ‒ Increases the chances of work-items within a wave-front to follow similar paths in the B+ Tree
          ‒ We use Radix Sort [1] to sort the keys on the GPU
      [1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for GPGPU stream architectures,” in Proceedings of the 19th Intl. Conf. on Parallel Architectures and Compilation Techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010.
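The effect of sorting can be illustrated on the host with a toy model. In the sketch below, `std::sort` stands in for the GPU radix sort used by the authors, and the leaf reached by a query is approximated as `key / keys_per_leaf`; both are simplifying assumptions, and the function names are hypothetical. The metric counts how many distinct leaves each wave-front-sized group of queries touches: fewer distinct leaves means more work-items following the same path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Sum, over consecutive groups of `group_size` queries, of the number of
// distinct leaves the group touches (toy model: leaf = key / keys_per_leaf).
std::size_t distinct_leaves_per_group(const std::vector<uint32_t>& queries,
                                      std::size_t group_size,
                                      uint32_t keys_per_leaf) {
    assert(group_size > 0 && keys_per_leaf > 0);
    std::size_t total = 0;
    for (std::size_t start = 0; start < queries.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, queries.size());
        std::set<uint32_t> leaves;
        for (std::size_t i = start; i < end; ++i)
            leaves.insert(queries[i] / keys_per_leaf);
        total += leaves.size();
    }
    return total;
}

// Host-side stand-in for the GPU radix sort: returns a sorted copy.
std::vector<uint32_t> sorted_copy(std::vector<uint32_t> queries) {
    std::sort(queries.begin(), queries.end());
    return queries;
}
```

Comparing `distinct_leaves_per_group(q, 64, 16)` with `distinct_leaves_per_group(sorted_copy(q), 64, 16)` for a batch `q` gives a rough feel for how much path sharing sorting buys; 64 is the wave-front size on the AMD GPUs discussed here.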
  • 13. IMPACT OF DIVERGENCE IN B+ TREE SEARCHES
      [Figure: impact of divergence (y-axis, 0 to 5) against number of queries (16K, 32K, 64K, 128K) and order of the B+ Tree (8, 16, 32, 64, 128), for the AMD Radeon™ HD 7660 and the AMD Phenom™ II X6 1090T]
      • Impact of divergence on GPU – 3.7-fold (average)
      • Impact of divergence on CPU – 1.8-fold (average)
  • 14. OUTLINE
      • Motivation and Contribution
      • Background
          ‒ AMD APU Architecture
          ‒ B+ Trees
      • Approach
          ‒ Transforming the Memory Layout
          ‒ Eliminating the Divergence
      • Results
          ‒ Performance
          ‒ Analysis
      • Summary and Next Steps
  • 15. EXPERIMENTAL SETUP
      • Software
          ‒ A B+ Tree w/ 4M records is used
          ‒ Search queries are created using
              ‒ normal_distribution() (C++11 feature)
              ‒ The queries have been sorted
          ‒ CPU implementation from
              ‒ http://www.amittai.com/prose/bplustree.html
          ‒ Driver: AMD Catalyst™ v12.8
          ‒ Programming model: OpenCL™
      • Hardware
          ‒ AMD Radeon™ HD 7660 APU (Trinity)
              ‒ 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
          ‒ AMD Phenom™ II X6 1090T + AMD Radeon™ HD 7970 (Tahiti)
              ‒ 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
              ‒ Device Memory does not include data-copy time
      [Figure: histogram of the generated queries; x-axis: Bin (100000 to 4100000), y-axis: Frequency (0 to 60000)]
      [Figure: example table with columns E ID (PKey) and Age, running from E ID 0000001 to 4194304 (4 million entries)]
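A query batch of the kind described above could be generated roughly as follows. This is a sketch, not the authors' generator: the slides name only `normal_distribution()` and the sorting step, so the function name `make_queries`, the `std::mt19937` engine, the mean and standard deviation parameters, the seed, and the choice to resample out-of-range draws are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draws `count` keys in [1, num_records] from a normal distribution and
// returns them sorted. Draws that fall outside the key range are discarded
// and redrawn (an assumption; the slides do not say how the tails are handled).
std::vector<uint32_t> make_queries(std::size_t count, uint32_t num_records,
                                   double mean, double stddev, uint32_t seed) {
    assert(num_records > 0 && stddev > 0.0);
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(mean, stddev);
    std::vector<uint32_t> queries;
    queries.reserve(count);
    while (queries.size() < count) {
        const double x = std::round(dist(gen));
        if (x >= 1.0 && x <= static_cast<double>(num_records))
            queries.push_back(static_cast<uint32_t>(x));
    }
    std::sort(queries.begin(), queries.end());  // sorted, as in the setup above
    return queries;
}
```

A call such as `make_queries(16384, 4194304, 2097152.0, 500000.0, 42)` would produce a 16K-query batch over the 4M-record key range; the mean and spread used here are placeholders, not values from the talk.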
  • 16. RESULTS – QUERIES PER SECOND
      [Figure: queries/second in millions (log-scale y-axis, 1 to 1000) against number of queries (16K, 32K, 64K, 128K) and order of the B+ Tree (4, 8, 16, 32, 64, 128). Series: AMD Phenom™ II X6 1090T (6 threads + SSE), AMD Radeon™ HD 7970 (device memory), AMD Radeon™ HD 7970 (pinned memory). Chart annotation: 700M queries/sec]
      • dGPU (device memory): ~350M queries/sec (avg.)
      • dGPU (pinned memory): ~9M queries/sec (avg.)
      • AMD Phenom™ CPU: ~18M queries/sec (avg.)
  • 17. RESULTS – QUERIES PER SECOND
      [Figure: queries/second in millions (y-axis, 0 to 140) against number of queries (16K, 32K, 64K, 128K) and order of the B+ Tree (4, 8, 16, 32, 64, 128). Series: AMD Phenom™ II X6 1090T (6 threads + SSE), AMD Radeon™ HD 7660 (device memory), AMD Radeon™ HD 7660 (pinned memory)]
      • APU (device memory): ~66M queries/sec (avg.)
      • APU (pinned memory): ~40M queries/sec (avg.)
      • APU (pinned memory) is faster than the CPU implementation
  • 18. RESULTS - SPEEDUP
      [Figure: speedup (y-axis, 0 to 12) against number of queries (16K, 32K, 64K, 128K) and order of the B+ Tree (4, 8, 16, 32, 64, 128). Series: AMD Radeon™ HD 7660 (pinned memory), AMD Radeon™ HD 7660 (device memory). Chart annotation: 4.9-fold speedup]
      • Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation
      • Average speedup – 4.3-fold (device memory), 2.5-fold (pinned memory)
      • Efficacy of IOMMUv2 + HSA on the APU platform

          Size of the B+ Tree | Discrete GPU (memory size = 3GB) | APU (prototype software)
          < 1.5GB             | ✓                                | ✓
          1.5GB – 2.7GB       | ✓                                | ✓
          > 2.7GB             | ✗                                | ✓
  • 19. ANALYSIS
      • The accelerators and the CPU yield best performance for different orders of the B+ Tree
          ‒ CPU → order = 64
              ‒ Ability of CPUs to prefetch data is beneficial for higher orders
          ‒ APU and dGPU → order = 16
              ‒ GPUs do not have a prefetcher → cache line should be most efficiently utilized
              ‒ GPUs have a cache-line size of 64 bytes
              ‒ Order = 16 is most beneficial (16 * 4 bytes)
      [Figure: buffer layout: nodes w/ metadata | keys | values]
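The cache-line argument is simple arithmetic, sketched below with hypothetical helper names. The only inputs are the figures quoted on the slide: a 64-byte cache line and 4-byte keys.

```cpp
#include <cassert>
#include <cstddef>

// How many keys of `key_bytes` fit in one cache line of `line_bytes`.
std::size_t keys_per_cache_line(std::size_t line_bytes, std::size_t key_bytes) {
    assert(key_bytes > 0);
    return line_bytes / key_bytes;
}

// How many cache lines the key array of a node spans (rounded up),
// assuming the array starts on a cache-line boundary.
std::size_t cache_lines_for_keys(std::size_t num_keys, std::size_t key_bytes,
                                 std::size_t line_bytes) {
    assert(line_bytes > 0);
    return (num_keys * key_bytes + line_bytes - 1) / line_bytes;
}
```

With these numbers, 16 keys fill exactly one line, while 64 keys span four lines; without a prefetcher, each extra line is an additional memory access per node visit.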
  • 20. ANALYSIS
    ‒ Minimum batch size to match the CPU performance:
                            | Order = 64  | Order = 16
      dGPU (device memory)  | 4K queries  | 2K queries
      dGPU (pinned memory)  | N.A.        | N.A.
      APU (device memory)   | 10K queries | 4K queries
      APU (pinned memory)   | 20K queries | 16K queries
    ‒ reuse_factor: amortizing the cost of data copies to the GPU:
             | 90% Queries | 100% Queries
      dGPU   | 15          | 54
      APU    | 100         | N.A.
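The slide does not spell out how reuse_factor is computed, so the following is only one plausible reading, sketched under stated assumptions: the tree is copied to the GPU once, then searched by r batches, and the copy is amortized once the total GPU time no longer exceeds the CPU time for the same r batches. The function name break_even_reuse and all timing parameters are hypothetical; none of the values are measurements from the talk.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of batches r >= 1 such that
       copy_us + r * gpu_batch_us <= r * cpu_batch_us,
   or 0 if the GPU is not faster per batch than the CPU, in which case
   the copy can never be amortized. All inputs are hypothetical timings
   in microseconds. */
static uint64_t break_even_reuse(uint64_t copy_us,
                                 uint64_t cpu_batch_us,
                                 uint64_t gpu_batch_us)
{
    if (gpu_batch_us >= cpu_batch_us)
        return 0;
    uint64_t gain = cpu_batch_us - gpu_batch_us;  /* time saved per batch */
    assert(copy_us <= UINT64_MAX - gain);         /* guard the rounding below */
    uint64_t r = (copy_us + gain - 1) / gain;     /* ceil(copy_us / gain) */
    return r ? r : 1;
}
```

In this model a larger copy cost or a smaller per-batch gain pushes the break-even point up, which is the qualitative behavior the reuse_factor table describes.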
  • 21. PROGRAMMABILITY
    GPU:
      typedef global unsigned int g_uint;
      typedef global mynode g_mynode;
      int tid = get_global_id(0);
      int i = 0;
      g_mynode *c = (g_mynode *)root;
      /* find the leaf node */
      while(!c->is_leaf){
        i = 0; /* restart the key scan at every level */
        while (i < c->num_keys){
          if(keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
            i++;
          else break;
        }
        /* child links are offsets relative to the root, not pointers */
        c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
      }
      /* match the key in the leaf node */
      for(i=0; i<c->num_keys; i++){
        if((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid]) break;
      }
      /* retrieve the record */
      if(i != c->num_keys)
        records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];
    CPU-SSE:
      int i = 0, j;
      node * c = root;
      __m128i vkey = _mm_set1_epi32(key);
      __m128i vnodekey, *vptr;
      short int mask;
      /* find the leaf node */
      while( !c->is_leaf ){
        /* compare the key against four node keys at a time */
        for(i = 0; i < (c->num_keys-3); i+=4){
          vptr = (__m128i *)&(c->keys[i]);
          vnodekey = _mm_load_si128(vptr);
          mask = _mm_movemask_ps(_mm_cvtepi32_ps(
                     _mm_cmplt_epi32(vkey, vnodekey)));
          if((mask) & 8) break;
        }
        /* scalar scan from the start of the selected group of four */
        for(j = i; j < c->num_keys; j++){
          if(key < c->keys[j]) break;
        }
        c = (node *)c->pointers[j];
      }
      /* match the key in the leaf node */
      for (i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key) break;
      /* retrieve the record */
      if (i != c->num_keys)
        return (record *)c->pointers[i];
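The GPU listing on this slide reaches every child as root plus a stored offset instead of following a pointer. The same idea can be tried on a CPU with plain C. This is a minimal sketch, not the authors' data layout: the flat_node_t struct, the tiny ORDER of 4, and the helper names node_at, flat_search, and build_sample_tree are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ORDER = 4 };  /* small order, chosen only to keep the example short */

/* Hypothetical flat node. In inner nodes slots[] holds byte offsets of the
   children, measured from the root; in leaves it holds the record values. */
typedef struct {
    uint32_t is_leaf;
    uint32_t num_keys;
    uint32_t keys[ORDER];
    uint32_t slots[ORDER + 1];
} flat_node_t;

/* Resolve a byte offset relative to the root into a node pointer. */
static const flat_node_t *node_at(const flat_node_t *root, uint32_t offset)
{
    return (const flat_node_t *)((const unsigned char *)root + offset);
}

/* Returns 1 and stores the record in *record if key is present, else 0. */
static int flat_search(const flat_node_t *root, uint32_t key, uint32_t *record)
{
    assert(root != NULL && record != NULL);
    const flat_node_t *c = root;
    while (!c->is_leaf) {                 /* find the leaf node */
        uint32_t i = 0;
        while (i < c->num_keys && key >= c->keys[i])
            i++;
        c = node_at(root, c->slots[i]);
    }
    for (uint32_t i = 0; i < c->num_keys; i++) {  /* match the key */
        if (c->keys[i] == key) {
            *record = c->slots[i];
            return 1;
        }
    }
    return 0;
}

/* Fill nodes[0..2] with a two-level tree: root key 5, leaves {1,3} and
   {5,7}; each record value is ten times its key. */
static void build_sample_tree(flat_node_t nodes[3])
{
    flat_node_t root  = { 0, 1, {5},
                          { (uint32_t)(1 * sizeof(flat_node_t)),
                            (uint32_t)(2 * sizeof(flat_node_t)) } };
    flat_node_t left  = { 1, 2, {1, 3}, {10, 30} };
    flat_node_t right = { 1, 2, {5, 7}, {50, 70} };
    nodes[0] = root;
    nodes[1] = left;
    nodes[2] = right;
}
```

Because no absolute address is stored in the nodes, the whole array can be copied to another buffer and searched there without any fix-up, which is the property a tree needs before it can be copied to device memory.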
  • 22. RELATED WORK
    ‒ J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA", in Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis, Implementation, and Performance, in conjunction with ISCA, 2011
      ‒ Authors report a ~10-fold speedup over a single-threaded, non-SSE CPU implementation, using a discrete NVIDIA GTX 480 GPU (data copies not taken into account)
    ‒ C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey, "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs", SIGMOD Conference, 2010
      ‒ Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (data copies not taken into account)
    ‒ J. Sewall, J. Chhugani, C. Kim, N. Satish, and P. Dubey, "PALM: Parallel, Architecture-Friendly, Latch-Free Modifications to B+ Trees on Multi-Core Processors", Proceedings of the VLDB Endowment (VLDB 2011)
      ‒ Applicable for B+ Tree modifications on the GPU
  • 23. POSSIBILITIES WITH HSA
    [Figure: a B+ Tree (keys 1 to 8, records d1 to d8) held in memory shared by the GPU vector cores and the x86 cores]
    ‒ GPU vector cores: parallel reads for searches
    ‒ x86 cores: serial modifications
    ‒ Graphics Core Next, User-mode Queuing, Shared Virtual Memory
  • 24. SUMMARY
    ‒ The B+ Tree is the fundamental data structure in many RDBMS
      ‒ Accelerating B+ Tree searches is critical
      ‒ Presents significant challenges on discrete GPUs
    ‒ We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
      ‒ 2.5-fold (avg.) speedup over the 6-threads+SSE CPU implementation
    ‒ Possible next steps
      ‒ HSA + IOMMUv2 would reduce the need to modify the B+ Tree representation
      ‒ Investigate CPU-GPU co-scheduling
      ‒ Investigate modifications on the B+ Tree
  • 25. DISCLAIMER & ATTRIBUTION
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    ATTRIBUTION
    © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.