Presentation HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
The document discusses the specifications and architecture of the AMD Radeon R9-290X graphics processing unit (GPU). Some key points:
- The R9-290X contains 44 compute units with a total of 2816 stream processors. It has a 512-bit GDDR5 memory interface providing 320 GB/sec of memory bandwidth.
- The GPU uses AMD's Graphics Core Next (GCN) architecture. This includes improvements to geometry processing, new local data share memory operations, and enhanced media processing instructions.
- The GCN architecture is built from compute units, each containing vector units and a local data share. Together, the 44 compute units supply the 2816 stream processors.
- New features include support for flat
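The headline figures in the list above can be checked with simple arithmetic. This is a minimal sketch: the 64 stream processors per compute unit and the 5 Gbit/s effective data rate per memory pin are assumptions taken from commonly published specifications, not from the summary itself.

```python
# Assumed values (not given in the summary above): 64 stream processors per
# GCN compute unit and a 5 Gbit/s effective GDDR5 data rate per pin.
COMPUTE_UNITS = 44
STREAM_PROCESSORS_PER_CU = 64    # assumption
BUS_WIDTH_BITS = 512
DATA_RATE_GBIT_PER_PIN = 5.0     # assumption


def total_stream_processors(compute_units, per_cu):
    """Total stream processors across all compute units."""
    return compute_units * per_cu


def peak_bandwidth_gb_per_s(bus_width_bits, gbit_per_pin):
    """Peak bandwidth: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * gbit_per_pin


print(total_stream_processors(COMPUTE_UNITS, STREAM_PROCESSORS_PER_CU))  # 2816
print(peak_bandwidth_gb_per_s(BUS_WIDTH_BITS, DATA_RATE_GBIT_PER_PIN))   # 320.0
```

Both results agree with the 2816 stream processors and 320 GB/sec quoted in the list.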
Keynote (Phil Rogers) - The Programmers Guide to Reaching for the Cloud - by ... - AMD Developer Central
Keynote presentation, The Programmers Guide to Reaching for the Cloud, by Phil Rogers, AMD Corporate Fellow, AMD, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S... - AMD Developer Central
This document discusses debugging and profiling challenges with OpenCL and how AMD CodeXL addresses them. It provides an overview of CodeXL's debugging and profiling capabilities for OpenCL, including API-level debugging, kernel source debugging, profiling views for APIs, objects, and kernel variables, and integrated support in Visual Studio. Demo code is included to illustrate pinpointing OpenCL errors and optimizing work item loads.
AMD held a developer summit to share updates on their APU and GPU products. They discussed how computing demands are increasing for gaming, simulations and cloud applications. AMD's APUs combine CPU and GPU capabilities on a single chip. Their newest APU, codenamed "Kaveri", will feature heterogeneous system architecture capabilities. It will offer improved graphics and efficiency over previous APU designs. AMD also unveiled their new Radeon R9 290X GPU and discussed how both products will benefit from lower-level APIs like Mantle.
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley - AMD Developer Central
The document introduces AMD's developer tools strategy and CodeXL tool. It discusses how AMD is converging its CPU and GPU tools into a unified HSA Developer Tools Suite, with CodeXL being a key tool. CodeXL allows debugging, profiling, and analyzing applications across AMD CPUs, GPUs, and APUs in a "white box" view. It is available for Windows, Visual Studio, and Linux. The document then describes several CodeXL capabilities such as GPU debugging, CPU and GPU profiling, static kernel analysis, and what is new in CodeXL.
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W... - AMD Developer Central
Presentation HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael Wootton at the AMD Developer Summit (APU13) November 11-13, 2013.
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni... - AMD Developer Central
This document provides an overview of using the Synthetic Workload Analysis Toolkit (SWAT) and IPython notebooks to analyze big data workloads. SWAT is a software platform that automates the creation, deployment, execution, and data gathering of synthetic compute workloads on clusters. IPython notebooks can be used to interactively explore system logs gathered by SWAT to identify performance bottlenecks and optimize workloads. Graphs of resource utilization are generated to determine if the system is CPU-bound, disk-bound, or network-bound. This analysis helps tune workloads and characterize systems.
PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi C... - AMD Developer Central
This document discusses optimizing raytracing on AMD's GCN architecture using AMD development tools. It provides an overview of raytracing and KD trees, describes the GCN architecture and its scalar nature, and how this impacts raytracing. It then discusses mapping raytracing to GPUs, optimizing with a stackless traversal and short stack in local memory. CodeXL is used to analyze occupancy and optimize the kernel further.
WT-4072, Rendering Web Content at 60fps, by Vangelis Kokkevis, Antoine Labour... - AMD Developer Central
Presentation WT-4072, Rendering Web Content at 60fps, by Vangelis Kokkevis, Antoine Labour and Brian Salomon at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
The document discusses AMD's RapidFire technology for remote graphics solutions. RapidFire uses dedicated cloud hardware and software to deliver multiple HD game streams from a single GPU with low latency. It has four independent components - the server, network, client, and user interface. The server performs GPU encoding of the desktop into an H264 video stream. The stream is sent to the client over the network, where it is decoded by the client hardware and displayed in the UI. RapidFire is designed to work across different hardware and support various use cases and workflows.
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi... - AMD Developer Central
This document discusses optimizing FFmpeg and Handbrake using OpenCL. It describes FFmpeg as a popular open-source multimedia software library used for recording, converting, and streaming audio and video. It was optimized to leverage heterogeneous computing by accelerating video decoding and encoding using hardware accelerators and accelerating video processing filters using the GPU. Specific filters were implemented in OpenCL for improved performance compared to CPU. Performance tests showed the accelerated FFmpeg approach achieved significantly higher frame rates than the original CPU-only FFmpeg.
The document discusses HSA compiler technology. It outlines the architecture of HSA compilers, which leverage the LLVM framework and generate the HSAIL intermediate representation. Performance is improved through optimizations in the high-level compiler and a thin finalizer. OpenCL 2.0 features like shared virtual memory and platform atomics will be supported. The first release of the OpenCL/HSA compiler is planned for Q2 2014.
MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Ac... - AMD Developer Central
Presentation MM-4104, Smart Sharpen using OpenCL in Adobe Photoshop CC – Challenges and Achievements, by Joseph Hsieh at the AMD Developer Summit, November 11-13, 2013.
This document summarizes improvements to the TressFX hair rendering and simulation technology. TressFX 2.0 features improved performance through deferred lighting and shadowing, continuous LOD, and code restructuring. Rendering is faster through optimizations to the anti-aliasing, self-shadowing, and transparency techniques. The simulation is formulated with general constraints and solved using a tridiagonal matrix approach for better stability under various hair conditions like wet, dry, or with wind. Overall, TressFX 2.0 provides over 2x performance increases for hair rendering compared to the previous version.
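The summary above mentions a tridiagonal matrix approach without saying which solver is used. As an illustration only, the sketch below shows the Thomas algorithm, a standard direct solver for tridiagonal systems; it is not taken from the TressFX code, and the function and argument names are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs for a tridiagonal A using the Thomas algorithm.

    lower[i] is A[i][i-1] (lower[0] is ignored), diag[i] is A[i][i],
    and upper[i] is A[i][i+1] (upper[-1] is ignored).
    Assumes no zero pivots arise, e.g. A is diagonally dominant.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward sweep: eliminate the sub-diagonal.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] with known solution x = [1, 2, 3]
solution = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                             [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
print(solution)
```

The solver runs in time linear in the number of unknowns, which is one reason tridiagonal formulations are attractive for chain-like structures such as hair strands.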
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael ... - AMD Developer Central
Presentation WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael Sevenier, at the AMD Developer Summit (APU13) November 11-13, 2013.
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha... - AMD Developer Central
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Harris Gasparakis, AMD, at the Embedded Vision Alliance Summit, May 2014.
Harris Gasparakis, Ph.D., is AMD’s OpenCV manager. In addition to enhancing OpenCV with OpenCL acceleration, he is engaged in AMD’s Computer Vision strategic planning, ISVs, and AMD Ventures engagements, including technical leadership and oversight in the AMD Gesture product line. He holds a Ph.D. in theoretical high energy physics from YITP at SUNYSB. He is credited with enabling real-time volumetric visualization and analysis in Radiology Information Systems (Terarecon), including the first commercially available virtual colonoscopy system (Vital Images). He was responsible for cutting edge medical technology (Biosense Webster, Stereotaxis, Boston Scientific), incorporating image and signal processing with AI and robotic control.
Direct3D12 aims to address issues with existing APIs by providing a more direct mapping to hardware capabilities. It features command buffers that allow work to be built in parallel threads and scheduled more efficiently. Pipeline state objects avoid runtime compilation overhead. Descriptor tables provide bindless resources through pointers and reduce state changes. While this gives more control and efficiency, it also means applications have more responsibility to avoid errors. Overall, Direct3D12 is designed to better expose the capabilities of modern graphics hardware.
A graphics processing unit or GPU (also occasionally called a visual processing unit or VPU) is a specialized microprocessor that offloads and accelerates graphics rendering from the central (micro) processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip does computations, whereas a GPU devotes more transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ... - AMD Developer Central
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
HC-4018, How to make the most of GPU accessible memory, by Paul Blinzer - AMD Developer Central
The document discusses the challenges of memory access when using a GPU. It describes the programmer's view of memory as a flat address space and how GPUs complicate this model. GPUs have their own memory hierarchies with local memory caches and different types of memory. GPU memory is accessed through specialized APIs that allocate objects like buffers and textures instead of regular malloc memory. This introduces complexity in ensuring coherency between CPU and GPU memory views. The talk will address these memory challenges and how solutions like HSA and hUMA aim to provide a more unified memory model.
SYCL is a C++ programming model for OpenCL that builds on OpenCL concepts like portability and efficiency while adding C++ ease of use and flexibility. The example code shows a typical SYCL application that schedules work on an OpenCL GPU using a queue, buffer, and parallel_for kernel. It initializes data in a buffer, enqueues work via a command group, and prints results.
This document provides a quick reference to the key functions and commands in the OpenGL SC 2.0 specification. It summarizes the functions for handling errors and resets, vertex specification and drawing, buffer objects, viewport settings, rasterization, shaders and programs, per-fragment operations, framebuffer operations, and state queries.
This document provides a reference guide for vision functions in OpenVX 1.1, listing functions for common image processing tasks like filtering, feature detection, arithmetic operations on images, and more. Each function is described briefly, noting input/output image formats and data types for parameters. The guide is intended to help developers working with the OpenVX API to select appropriate functions and understand their parameters and data formats.
This document summarizes the OpenGL ES 3.2 API reference guide. It provides an overview of key concepts like command syntax, shader and program objects, buffer objects, textures and samplers. It also lists functions for creating/deleting objects, binding objects, setting object parameters, and synchronizing operations. Section and table references are provided to the OpenGL ES specification for more details.
The document summarizes the OpenCL runtime API and platform layer. It provides an overview of managing OpenCL objects like command queues and memory objects. It lists functions for querying platforms, devices, creating contexts, partitioning devices, and managing memory objects. It also describes pipe objects which are memory objects storing data organized as a FIFO.
Vulkan is a graphics and compute API that specifies shader programs, compute kernels, objects, and operations for producing high-quality 3D graphics images. It defines a programmable and state-driven pipeline with fixed-function stages invoked by drawing operations. The API consists of functions and procedures to initialize Vulkan, create device and command buffer objects, and submit commands for graphics processing and synchronization.
WebGL uses ArrayBuffers and typed arrays to transfer data to the GPU. ArrayBuffers represent unstructured binary data that can be modified by typed array views of the buffer, such as Int8Array or Float32Array views. Views can reference the whole buffer or a subset of it, and methods like set() are used to populate the views and transfer data to WebGL buffers.
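The relationship between a raw buffer and its typed views can be illustrated outside the browser. The sketch below is a Python analogy using `bytearray` and `memoryview`, not WebGL or JavaScript code; it only shows that several views can share one block of bytes, covering either the whole buffer or a subset of it.

```python
import struct

buffer = bytearray(16)                  # 16 bytes of raw, unstructured storage
whole = memoryview(buffer)

as_floats = whole.cast("f")             # four 32-bit floats, like a Float32Array view
tail_bytes = whole[8:]                  # a byte view over the last 8 bytes only

as_floats[0] = 1.5                      # writes through the view land in `buffer`
tail_bytes[:4] = struct.pack("f", 2.0)  # fill a sub-range, loosely similar to set()

print(list(as_floats))                  # [1.5, 0.0, 2.0, 0.0]
```

Because both views alias the same storage, the value written through the byte view becomes visible as the third float without any copy.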
The document discusses how modern cloud workloads are becoming more heterogeneous with growing parallel content like video processing and big data analytics. It describes how the Heterogeneous System Architecture (HSA) enables efficient acceleration of these workloads across CPUs, GPUs and other processors. HSA provides a unified memory model and programming model to easily harness the power of parallel processors while increasing performance and reducing power. Emerging programming languages and frameworks are embracing HSA to enable developers to leverage heterogeneous computing resources.
1) SAP HANA is an in-memory database that allows for real-time analytics by storing and processing large amounts of data in memory rather than on disk.
2) It provides faster processing speeds than traditional disk-based databases by keeping all the data in RAM, allowing for faster queries, reporting, planning and advanced analytics.
3) SAP HANA combines both software and optimized hardware to deliver an in-memory computing platform for real-time analytics and applications.
The document discusses AMD's technologies for improving energy efficiency and performance in servers and graphics processors. It highlights AMD's Opteron and FireStream processors, which provide high performance while using less power than competitors. AMD is focusing on innovations like its CoolCore technology, 45nm manufacturing process, and support for OpenCL to enhance efficiency and acceleration capabilities across its product lines.
This document discusses disk I/O performance testing tools. It introduces SQLIO and IOMETER for measuring disk throughput, latency, and IOPS. Examples are provided for running SQLIO tests and interpreting the output, including metrics like throughput in MB/s, latency in ms, and I/O histograms. Other disk performance factors discussed include the number of outstanding I/Os, block size, and sequential vs random access patterns.
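The metrics listed above are related by two simple identities: throughput is IOPS multiplied by block size, and, by Little's law, sustained IOPS is the number of outstanding I/Os divided by the average latency. The sketch below uses hypothetical numbers rather than real SQLIO or IOMETER output, and the function names are illustrative.

```python
def iops_from_queue_depth(outstanding_ios, latency_ms):
    """Little's law: sustained IOPS = outstanding I/Os / average latency."""
    return outstanding_ios * 1000 / latency_ms


def throughput_mb_per_s(iops, block_size_kb):
    """Throughput in MB/s for a given I/O rate and block size in KB."""
    return iops * block_size_kb / 1024


# Hypothetical test: 8 outstanding I/Os, 2 ms average latency, 64 KB blocks.
iops = iops_from_queue_depth(8, 2)
print(iops)                            # 4000.0
print(throughput_mb_per_s(iops, 64))   # 250.0
```

The same identities explain why large sequential blocks give high MB/s at modest IOPS, while small random I/O gives high IOPS at modest MB/s.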
The document provides an overview of the new AMD AM1 platform. It introduces the AMD Athlon and Sempron APUs that are part of the AM1 lineup and are designed for the entry-level desktop PC market. The AM1 platform delivers improved performance and features over previous generations at competitive price points starting at $39.
Gluster for Geeks: Performance Tuning Tips & Tricks - GlusterFS
This document summarizes a webinar on performance tuning tips and tricks for GlusterFS. The webinar covered planning cluster hardware configuration to meet performance requirements, choosing the correct volume type for workloads, key tuning parameters, benchmarking techniques, and the top 5 causes of performance issues. The webinar provided guidance on optimizing GlusterFS performance through hardware sizing, configuration, implementation best practices, and tuning.
Designing Information Structures For Performance And Reliability - bryanrandol
This document discusses optimizing database server performance through hardware, operating system, and database design considerations. It covers topics like CPU performance, memory architecture, disk I/O, and database types like OLTP and OLAP. The document compares GreenPlum and PostgreSQL databases and explains how to tweak PostgreSQL configuration parameters to optimize performance.
This document discusses the hardware components needed for multimedia production. It outlines the main components as the processing unit (CPU and GPU), memory (RAM and ROM), graphics card, external storage devices (HDD, SSD, optical discs like CDs, DVDs, Blu-Ray), and cameras. It provides details on the latest and best CPU and GPU options from Intel, AMD, and Nvidia. It also explains the differences between various external storage types and optical discs.
This document summarizes Chris Fregly's presentation on how Apache Spark beat Hadoop at sorting 100 TB of data. Key points include:
- Spark set a new record in the Daytona GraySort benchmark by sorting 100 TB of data in 23 minutes using 250,000 partitions on EC2.
- Optimizations that contributed to Spark's win included using CPU cache locality with (Key, Pointer) pairs, an optimized sorting algorithm, reducing network overhead with Netty, and reducing OS resources with a sort-based shuffle.
- The sort-based shuffle merges mapper outputs into a single file per partition to minimize disk seeks during the shuffle.
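The (Key, Pointer) idea from the list above can be sketched as follows. This is an illustration of the structure only: Python lists do not provide the cache-locality benefit, the record layout and prefix length are hypothetical, and it is not Spark's implementation.

```python
def key_pointer_sort(records, prefix_len=2):
    """Sort records by key via compact (key prefix, index) pairs.

    Only the small pairs are reordered; the records themselves are read
    once at the end, and again only when two prefixes tie.
    """
    pairs = [(key[:prefix_len], index) for index, (key, _) in enumerate(records)]
    # Compare prefixes first; fall back to the full key only to break ties.
    pairs.sort(key=lambda pair: (pair[0], records[pair[1]][0]))
    return [records[index] for _, index in pairs]


records = [
    (b"pear", "payload-0"),
    (b"apple", "payload-1"),
    (b"apricot", "payload-2"),
    (b"fig", "payload-3"),
]
for key, payload in key_pointer_sort(records):
    print(key, payload)
```

In a native implementation the pairs sit contiguously in memory, so most comparisons touch only the pair array rather than chasing pointers to full records.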
1) The document discusses implementing and evaluating deep neural networks (DNNs) on mainstream heterogeneous systems like CPUs, GPUs, and APUs.
2) Preliminary results show that an APU achieves the highest performance per watt compared to CPUs and GPUs for DNN models like MLP and autoencoders.
3) Data transfers between the CPU and GPU are identified as a bottleneck, but APUs can help avoid this issue through efficient data sharing and zero-copy techniques between the CPU and GPU.
The document provides an overview and agenda for a 5-day training course on performance analysis of IBM DS8000 storage subsystems. The course covers hardware overview, performance implications, Disk Magic performance modeling tool, techniques for performance analysis including common metrics, and features that enhance DS8000 performance such as memory, cache, disks and host adapters.
1) NVIDIA-Iguazio Accelerated Solutions for Deep Learning and Machine Learning (30 mins):
About the speaker:
Dr. Gabriel Noaje, Senior Solutions Architect, NVIDIA
http://bit.ly/GabrielNoaje
2) GPUs in Data Science Pipelines (30 mins)
- GPU as a Service for enterprise AI
- A short demo on the usage of GPUs for model training and model inferencing within a data science workflow
About the speaker:
Anant Gandhi, Solutions Engineer, Iguazio Singapore. https://www.linkedin.com/in/anant-gandhi-b5447614/
The document discusses plans to establish an institutional high performance computing (HPC) facility at North-West University. It outlines the technical goals of building a Beowulf cluster to link existing departmental clusters and integrate with national and international computational grids. It also discusses management principles for the new HPC facility to ensure sustainability, efficiency, reliability, availability and high performance.
Storage, San And Business Continuity Overview - Alan McSweeney
The document provides an overview of storage systems and business continuity options. It discusses various types of storage including DAS, NAS and SAN. It then covers business continuity and disaster recovery strategies like replication, snapshots and mirroring. It also discusses how server virtualization can help improve disaster recovery.
The document discusses network performance profiling of Hadoop jobs. It presents results from running two common Hadoop benchmarks - Terasort and Ranked Inverted Index - on different Amazon EC2 instance configurations. The results show that the shuffle phase accounts for a significant portion (25-29%) of total job runtime. They aim to reproduce existing findings that network performance is a key bottleneck for shuffle-intensive Hadoop jobs. Some questions are also raised about inconsistencies in reported network bandwidth capabilities for EC2.
Greg Smith
Cover how to use simple low-level tools such as memtest86, dd, bonnie++, and sysbench to benchmark the hardware of a server intended for database use. A heavy dose of vendor management suggestions will be included as well, for the inevitable time when your shiny new server fails to deliver the performance it should.
This document discusses two methods for measuring Firebird disk I/O: 1) Using the MON$IO_STATS tables within Firebird to track page reads, writes, fetches, and marks, and 2) Using the host operating system's performance monitoring tools like Windows Performance Monitor. It notes some limitations with MON$IO_STATS and provides examples of specific disk and process counters to log. The document also covers estimating required IOPS based on potential disk throughput and accounting for factors like RAID write penalties.
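The effect of a RAID write penalty on an IOPS estimate can be sketched as below. The penalty values are commonly cited rules of thumb, and the workload and per-disk figures are hypothetical; none of them come from the document being summarized.

```python
import math

# Commonly cited back-end I/Os per logical write (rule-of-thumb values).
RAID_WRITE_PENALTY = {"RAID0": 1, "RAID1": 2, "RAID10": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(read_iops, write_iops, raid_level):
    """Physical IOPS the disks must deliver for a given logical workload."""
    return read_iops + write_iops * RAID_WRITE_PENALTY[raid_level]


def disks_needed(read_iops, write_iops, raid_level, iops_per_disk):
    """Minimum number of disks able to sustain the back-end IOPS."""
    return math.ceil(backend_iops(read_iops, write_iops, raid_level) / iops_per_disk)


# Hypothetical workload: 700 reads/s and 300 writes/s on 150-IOPS disks.
print(backend_iops(700, 300, "RAID5"))        # 1900
print(disks_needed(700, 300, "RAID5", 150))   # 13
print(disks_needed(700, 300, "RAID10", 150))  # 9
```

The comparison shows how a write-heavy workload can need noticeably more spindles on parity RAID than on mirrored RAID for the same logical load.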
The GPU HPC Clusters document discusses GPU cluster research at NCSA, including early GPU clusters like QP and Lincoln, follow-up clusters like AC that expanded GPU resources, and the eco-friendly cluster EcoG. It describes ISL research in GPU and heterogeneous computing, including systems software, runtimes, tools and application development.
This document discusses new graphics APIs like DX12 and Vulkan that aim to provide lower overhead and more direct hardware access compared to earlier APIs. It covers topics like increased parallelism, explicit memory management using descriptor sets and pipelines, and best practices like batching draw calls and using multiple asynchronous queues. Overall, the new APIs allow more explicit control over GPU hardware for improved performance but require following optimization best practices around areas like parallelism, memory usage, and command batching.
AMD’s math libraries can support a range of programmers from hobbyists to ninja programmers. Kent Knox from AMD’s library team introduces you to OpenCL libraries for linear algebra, FFT, and BLAS, and shows you how to leverage the speed of OpenCL through the use of these libraries.
Review the material presented in the AMD Math libraries webinar in this deck.
For more:
Visit the AMD Developer Forums: http://devgurus.amd.com/welcome
Watch the replay: www.youtube.com/user/AMDDevCentral
Follow us on Twitter: https://twitter.com/AMDDevCentral
This is the slide deck from the popular "Introduction to Node.js" webinar with AMD and DevelopIntelligence, presented by Joshua McNeese. Watch our AMD Developer Central YouTube channel for the replay at https://www.youtube.com/user/AMDDevCentral.
This presentation accompanies the webinar replay located here: http://bit.ly/1zmvlkL
AMD Media SDK Software Architect Mikhail Mironov shows you how to leverage an AMD platform for multimedia processing using the new Media Software Development Kit. He discusses how to use a new set of C++ interfaces for easy access to AMD hardware blocks, and shows you how to leverage the Media SDK in the development of video conferencing, wireless display, remote desktop, video editing, transcoding, and more.
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
This deck presents highlights from the Introduction to OpenCL™ Programming Webinar presented by Acceleware & AMD on Sept. 17, 2014. Watch a replay of this popular webinar on the AMD Dev Central YouTube channel here: https://www.youtube.com/user/AMDDevCentral or here for the direct link: http://bit.ly/1r3DgfF
This document discusses AMD's DirectGMA technology, which allows direct access to GPU memory from other devices. It introduces DirectGMA and explains how it enables peer-to-peer transfers between GPUs and GPUs and FPGAs. It then provides details on implementing DirectGMA in APIs like OpenGL, OpenCL, DirectX 9, 10 and 11 to enable efficient data transfers without CPU involvement.
This Webinar explores a variety of new and updated features in Java 8, and discuss how these changes can positively impact your day-to-day programming.
Watch the video replay here: http://bit.ly/1vStxKN
Your Webinar presenter, Marnie Knue, is an instructor for Develop Intelligence and has taught Sun & Oracle certified Java classes, RedHat JBoss administration, Spring, and Hibernate. Marnie also has spoken at JavaOne.
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
This presentation discusses the Mantle API, what it is, why choose it, and abstraction level, small batch performance and platform efficiency.
Download the presentation from the AMD Developer website here: http://bit.ly/TrEUeC
The document is about an AMD and Microsoft Game Developer Day event held in Stockholm, Sweden on June 2, 2014. It provides the date and location of the event multiple times but no other details.
This document discusses the TressFX hair and fur rendering technique. It begins by stating that next-gen quality hair is expected in current generation titles. It then covers the key components needed for high quality hair, including antialiasing, self-shadowing, and transparency. The document discusses isoline tessellation versus a vertex shader approach and describes TressFX's deferred rendering pipeline with selective shading of only the closest fragments. It demonstrates that TressFX can achieve next-gen quality hair and fur at real-time performance through techniques like variable ratio hair simulation, extrusion into triangles in the vertex shader, selective shading, and distance-based level of detail.
Mantle allows Battlefield 4 to significantly improve CPU and GPU performance compared to DirectX 11. The game utilizes Mantle's low-level access to optimize shader compilation, pipeline state management, asynchronous compute and memory handling. Multi-GPU rendering is supported through Alternate Frame Rendering where resources are duplicated and updated asynchronously across GPUs.
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
The document discusses low-level shader optimization techniques for next-generation consoles and DirectX 11 hardware. It provides lessons from last year on writing efficient shader code, and examines how modern GPU hardware has evolved over the past 7-8 years. Key points include separating scalar and vector work, using hardware-mapped functions like reciprocals and trigonometric functions, and being aware of instruction throughput and costs on modern GCN-based architectures.
The document summarizes a presentation given by Stephan Hodes on optimizing performance for AMD's Graphics Core Next (GCN) architecture. The presentation covers key aspects of the GCN architecture, including compute units, registers, and latency hiding. It then provides a top 10 list of performance advice for GCN, such as using DirectCompute threads in groups of 64, avoiding over-tessellation, keeping shader pipelines short, and batching drawing calls.
The document repeatedly states that AMD and Microsoft held a Game Developer Day event in Stockholm, Sweden on June 2, 2014 to work with game developers.
Direct3D 12 aims to reduce CPU overhead and increase scalability across CPU cores by allowing developers greater control over the graphics pipeline. It optimizes pipeline state handling through pipeline state objects and reduces redundant resource binding by introducing descriptor heaps and tables. Command lists and bundles further improve performance by enabling parallel command list generation and reuse of draw commands.
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
The document discusses faster particle rendering using DirectCompute. It describes using the GPU for particle simulation by taking advantage of its parallel processing capabilities. It discusses using compute shaders to simulate particle behavior, handle collisions via the depth buffer, sort particles using bitonic sort, and render particles in tiles via DirectCompute to avoid overdraw from large particles. Tiled rendering involves culling particles, building per-tile particle indices, and sorting particles within each tile before shading them in parallel threads to composite onto the scene.
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
This document provides an overview of OpenCL libraries for GPU programming. It discusses specialized GPU libraries like clFFT for fast Fourier transforms and Random123 for random number generation. It also covers general GPU libraries like Bolt, OpenCV, and ArrayFire. ArrayFire is highlighted as it provides a flexible array data structure and hundreds of parallel functions across domains like image processing, machine learning, and linear algebra. It supports JIT compilation and data-parallel constructs like GFOR to improve performance.
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
Johan Andersson will show how the Frostbite 3 game engine is using the low-level graphics API Mantle to deliver significantly improved performance in Battlefield 4 on PC and future games from Electronic Arts in this presentation from the 2014 Game Developers Conference in San Francisco March 17-21. Also view this and other presentations on our developer website at http://developer.amd.com/resources/documentation-articles/conference-presentations/
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
RapidFire is a dedicated cloud gaming hardware and software solution from AMD that aims to simplify integration and deliver more high-definition game streams per GPU with low latency. It utilizes AMD hardware on both the server and client sides. The API provides functions for encoding and decoding video and audio streams, capturing input events, and displaying frames with low latency for cloud gaming applications. Eureva has implemented RapidFire in their Swiich solution to virtualize and stream any DirectX or OpenGL game in real-time with ultra-low latency over existing networks.
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
Oxide Games Partners Dan Baker and Tim Kipp will show you how to build a high throughput renderer using the Mantle API in this AMD technology presentation from the 2014 Game Developers Conference in San Francisco March 17-21. Also view this and other presentations on our developer website at http://developer.amd.com/resources/documentation-articles/conference-presentations/
2. RELEVANCE OF B+ TREE SEARCHES
B+ Trees are a special case of B Trees
‒ Fundamental data structure used in several popular database
management systems
[Figure: database management systems grouped by index type (B Tree, B+ Tree): mongoDB, MySQL, CouchDB, SQLite]
High-throughput, read-only index searches are gaining traction in
‒ Video-copy detection
‒ Audio-search
‒ Online Transaction Processing (OLTP) Benchmarks
Increase in memory capacity allows many database tables to
reside in memory
‒ Brings computational performance to the forefront
2 | APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
3. A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES
Mobile (app on a smartphone):
Step 1. Record Audio
Step 2. Generate Audio Fingerprint
Step 3. Send search request to server
Server:
Step 1. Receive search requests
Step 2. Query Database
Step 3. Return search results to client
[Figure: thousands of clients send requests to the server, which searches a B+ Tree index (leaf records d1 to d8) over a database holding a music library of millions of songs]
4. DATABASE PRIMITIVES ON ACCELERATORS
Discrete graphics processing units (dGPUs) provide a compelling
mix of
‒ Performance per Watt
‒ Performance per Dollar
dGPUs have been used to accelerate critical database primitives
‒ scan
‒ sort
‒ join
‒ aggregation
‒ B+ Tree Searches?
5. B+ TREE SEARCHES ON ACCELERATORS
B+ Tree searches present significant challenges
‒ Irregular representation in memory
‒ An artifact of malloc() and new()
‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address
space
‒ Indirect links need to be converted to relative offsets
‒ Requirement to copy the tree to the dGPU, which means that
‒ The tree size is always bound by the amount of GPU device memory
6. OUR SOLUTION
Accelerated B+ Tree searches on a fused CPU+GPU processor (or
APU [1])
‒ Eliminates data-copies by combining x86 CPU
and vector GPU cores on the same silicon die
Developed a memory allocator to form a regular representation
of the tree in memory
‒ Fundamental data structure is not altered
‒ Only parts of its layout are changed
[1] www.hsafoundation.com
7. OUTLINE
Motivation and Contribution
Background
‒ AMD APU Architecture
‒ B+ Trees
Approach
‒ Transforming the Memory Layout
‒ Eliminating the Divergence
Results
‒ Performance
‒ Analysis
Summary and Next Steps
8. AMD APU ARCHITECTURE
[Figure: AMD 2nd Gen. A-series APU block diagram. Labels: System Memory (Host Memory, GPU Frame-Buffer), DRAM Controllers, x86 Cores, GPU Vector Cores, RMB, FCL, System Request Interface (SRI), xBar, Link Controller, MCT, UNB, Platform Interfaces. UNB - Unified Northbridge, MCT - Memory Controller]
The APU includes dedicated IOMMUv2 hardware
‒ Provides direct mapping between GPU and CPU virtual address (VA) space
‒ Enables GPUs to access the system memory
‒ Enables GPUs to track whether pages are resident in memory
Today, GPU cores can access VA space at a granularity of contiguous chunks
9. B+ TREES
[Figure: example B+ Tree with integer keys in the internal nodes and leaf records d1 to d8]
A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
‒ has a high fan-out to reduce disk I/O operations
Order (b) of a B+ Tree measures the capacity of its nodes
Number of children (m) in an internal node is
‒ ⌈b/2⌉ <= m <= b
‒ Root node can have as few as two children
Number of keys in an internal node = (m – 1)
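As a worked illustration of these bounds, the sketch below computes the child and key limits of an internal node for a given order. The function names are hypothetical and chosen only for this example; they do not come from the presentation.

```c
#include <assert.h>

/* Hypothetical helpers illustrating the node-capacity bounds; not from the deck. */

/* Minimum number of children of a non-root internal node: ceil(b / 2). */
int min_children(int order)
{
    assert(order >= 2);
    return (order + 1) / 2;
}

/* Maximum number of children of an internal node: b. */
int max_children(int order)
{
    assert(order >= 2);
    return order;
}

/* An internal node with m children holds m - 1 keys. */
int keys_for_children(int children)
{
    assert(children >= 2);
    return children - 1;
}
```

For an order of 16, for instance, a non-root internal node has between 8 and 16 children and therefore between 7 and 15 keys.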
10. APPROACH FOR PARALLELIZATION
Fine-grained (Accelerate a single query)
‒ Replace Binary search in each node with K-ary search
‒ Maximum performance improvement = log(k)/log(2)
‒ Results in poor occupancy of the GPU cores
Coarse-grained (Perform many queries in parallel)
‒ Enables data-parallelism
‒ Increases memory-bandwidth utilization with parallel reads
‒ Increases throughput (transactions per second for OLTP)
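The bound quoted for the fine-grained approach can be derived by counting search steps inside one node. The derivation below is a sketch that assumes a node with n keys and idealizes one K-ary step as costing the same as one binary step:

```latex
\[
\text{steps}_{\text{binary}} = \log_2 n, \qquad
\text{steps}_{k\text{-ary}} = \log_k n = \frac{\log_2 n}{\log_2 k}
\]
\[
\text{maximum improvement}
  = \frac{\text{steps}_{\text{binary}}}{\text{steps}_{k\text{-ary}}}
  = \log_2 k = \frac{\log k}{\log 2}
\]
```

With k = 16, for example, the improvement per node is at most 4-fold, regardless of how many GPU lanes are available.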
11. TRANSFORMING THE MEMORY LAYOUT
[Figure: one contiguous memory buffer laid out as: nodes w/ metadata .. keys .. values]
Metadata
‒ Number of keys in a node
‒ Offset to keys/values in the buffer
‒ Offset to the first child node
‒ Whether a node is a leaf
Pass a pointer to this memory buffer to the accelerator
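The layout described above can be sketched in plain C. This is a minimal illustration, not the authors' allocator: the struct name `flat_node`, its field names, the 32-bit key and value types, and the tiny hand-built tree in `build_demo_tree` are all assumptions made for the example. The point it shows is that every link is a byte offset from the start of the buffer, so the same buffer stays valid whichever address it is mapped at.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-node metadata; all offsets are in bytes from the buffer start. */
typedef struct {
    uint32_t num_keys;   /* number of keys in this node */
    uint32_t keys_off;   /* offset to this node's keys */
    uint32_t vals_off;   /* offset to this node's values (leaves only) */
    uint32_t child_off;  /* offset to the first child node (internal nodes only) */
    uint32_t is_leaf;    /* non-zero if the node is a leaf */
} flat_node;

/* Search the flattened tree; the root is assumed to sit at offset 0.
 * Returns 1 and stores the value in *out if the key is found, else 0. */
int flat_search(const void *buf, uint32_t key, uint32_t *out)
{
    const char *base = (const char *)buf;
    const flat_node *c = (const flat_node *)base;
    uint32_t i;

    while (!c->is_leaf) {
        const uint32_t *keys = (const uint32_t *)(base + c->keys_off);
        i = 0;
        while (i < c->num_keys && key >= keys[i])
            i++;
        /* children of a node are stored contiguously, so child i is i nodes on */
        c = (const flat_node *)(base + c->child_off) + i;
    }

    const uint32_t *keys = (const uint32_t *)(base + c->keys_off);
    const uint32_t *vals = (const uint32_t *)(base + c->vals_off);
    for (i = 0; i < c->num_keys; i++) {
        if (keys[i] == key) {
            *out = vals[i];
            return 1;
        }
    }
    return 0;
}

/* Build a demo buffer: root [30] -> leaf {10,20} and leaf {30,40}.
 * The caller frees the returned buffer. */
void *build_demo_tree(void)
{
    enum { NODES = 3, KEYS = 5, VALS = 4 };
    const uint32_t k[KEYS] = {30, 10, 20, 30, 40};   /* root key, then leaf keys */
    const uint32_t v[VALS] = {100, 200, 300, 400};   /* leaf values */
    size_t nodes_sz = NODES * sizeof(flat_node);
    size_t keys_sz = KEYS * sizeof(uint32_t);
    char *buf = malloc(nodes_sz + keys_sz + sizeof v);
    if (buf == NULL)
        return NULL;

    uint32_t koff = (uint32_t)nodes_sz;
    uint32_t voff = (uint32_t)(nodes_sz + keys_sz);
    memcpy(buf + koff, k, sizeof k);
    memcpy(buf + voff, v, sizeof v);

    flat_node *n = (flat_node *)buf;
    n[0] = (flat_node){1, koff, 0, (uint32_t)sizeof(flat_node), 0};
    n[1] = (flat_node){2, koff + 4, voff, 0, 1};
    n[2] = (flat_node){2, koff + 12, voff + 8, 0, 1};
    return buf;
}
```

Because the search only adds offsets to the base pointer, the buffer could be handed to an accelerator as a single pointer, which is the property the slide relies on.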
12. ELIMINATING THE DIVERGENCE
Each work-item/thread executes a single query
May increase divergence within a wave-front
‒ Every query may follow a different path in the B+ Tree
[Figure: work-items WI-1 and WI-2 following different paths through the example B+ Tree]
Sort the keys to be searched
‒ Increases the chances of work-items within a wave-front to follow similar paths in the B+ Tree
‒ We use Radix Sort [1] to sort the keys on the GPU
[1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for GPGPU stream architectures,” in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’10. New York, NY, USA: ACM, 2010.
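The effect of sorting the query batch can be sketched on the host side in C. This example is an assumption-laden stand-in: it uses the C library's `qsort` rather than the GPU radix sort used in the presentation, and the `query` struct and function names are hypothetical. Each query carries its original position so that results can be written back in the caller's order after the sorted batch has been searched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical query record: the key plus its position in the original batch. */
typedef struct {
    uint32_t key;
    uint32_t slot;
} query;

/* Three-way comparison on the key, written to avoid overflow. */
static int cmp_query(const void *a, const void *b)
{
    const query *x = a;
    const query *y = b;
    return (x->key > y->key) - (x->key < y->key);
}

/* Pair every key with its original index. */
void make_batch(const uint32_t *keys, size_t n, query *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].key = keys[i];
        out[i].slot = (uint32_t)i;
    }
}

/* Sort the batch by key (qsort here stands in for the GPU radix sort). */
void sort_batch(query *q, size_t n)
{
    qsort(q, n, sizeof *q, cmp_query);
}
```

After sorting, neighbouring work-items hold neighbouring keys, so they are more likely to visit the same nodes; the `slot` field lets the result of sorted query i be stored at its original index.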
13. IMPACT OF DIVERGENCE IN B+ TREE SEARCHES
[Figure: impact of divergence (y-axis, 0 to 5) versus number of queries (16K, 32K, 64K, 128K), each with B+ Tree orders 8, 16, 32, 64, 128; series: AMD Radeon™ HD 7660 and AMD Phenom™ II X6 1090T]
Impact of Divergence on GPU – 3.7-fold (average)
Impact of Divergence on CPU – 1.8-fold (average)
15. EXPERIMENTAL SETUP
Software
‒ A B+ Tree w/ 4M records is used
‒ Search queries are created using normal_distribution() (C++11 feature)
‒ The queries have been sorted
‒ CPU Implementation from
‒ http://www.amittai.com/prose/bplustree.html
‒ Driver: AMD Catalyst™ v12.8
‒ Programming Model: OpenCL™
Hardware
‒ AMD Radeon™ HD 7660 APU (Trinity)
‒ 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
‒ AMD Phenom™ II X6 1090T + AMD Radeon™ HD 7970 (Tahiti)
‒ 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
‒ Device Memory does not include data-copy time
[Figure: histogram of the generated query keys, frequency (0 to 60000) versus bin (100000 to 4100000)]
[Figure: example table with columns E ID (PKey) and Age, holding 4 million (4194304) entries]
16. RESULTS – QUERIES PER SECOND
[Figure: queries/second (million, log scale from 1 to 1000) versus number of queries (16K, 32K, 64K, 128K), each with B+ Tree orders 4, 8, 16, 32, 64, 128; series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7970 (Device Memory), AMD Radeon™ HD 7970 (Pinned Memory); chart annotation: 700M Queries/Sec]
dGPU (device memory): ~350M Queries/Sec. (avg.)
dGPU (pinned memory): ~9M Queries/Sec. (avg.)
AMD Phenom™ CPU: ~18M Queries/Sec. (avg.)
17. RESULTS – QUERIES PER SECOND
[Figure: queries/second (million, 0 to 140) versus number of queries (16K, 32K, 64K, 128K), each with B+ Tree orders 4, 8, 16, 32, 64, 128; series: AMD Phenom™ II X6 1090T (6-Threads+SSE), AMD Radeon™ HD 7660 (Device Memory), AMD Radeon™ HD 7660 (Pinned Memory)]
APU (device memory): ~66M Queries/Sec. (avg.)
APU (pinned memory): ~40M Queries/Sec. (avg.)
APU (pinned memory) is faster than the CPU implementation
18. RESULTS - SPEEDUP
[Figure: speedup (0 to 12) versus number of queries (16K, 32K, 64K, 128K), each with B+ Tree orders 4, 8, 16, 32, 64, 128; series: AMD Radeon™ HD 7660 (Pinned Memory), AMD Radeon™ HD 7660 (Device Memory); chart annotation: 4.9-fold speedup]
Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation.
Average Speedup – 4.3-fold (Device Memory), 2.5-fold (Pinned Memory)
• Efficacy of IOMMUv2 + HSA on the APU
Size of the B+ Tree supported on each platform:
Platform | < 1.5GB | 1.5GB – 2.7GB | > 2.7GB
Discrete GPU (memory size = 3GB) | ✓ | ✓ | ✗
APU (prototype software) | ✓ | ✓ | ✓
19. ANALYSIS
The accelerators and the CPU yield best performance for different
orders of the B+ Tree
‒ CPU order = 64
‒ Ability of CPUs to prefetch data is beneficial for higher orders
‒ APU and dGPU order = 16
‒ GPUs do not have a prefetcher, so each cache line should be utilized as efficiently as possible
‒ GPUs have a cache-line size of 64 bytes
‒ Order = 16 is most beneficial (16 * 4 bytes)
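The cache-line arithmetic behind this observation can be written out as a short C sketch. It assumes 4-byte keys and a node whose keys start on a cache-line boundary; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* GPU cache-line size quoted on the slide */

/* Number of fixed-size keys that fit in one cache line. */
size_t keys_per_cache_line(size_t key_bytes)
{
    assert(key_bytes > 0 && key_bytes <= CACHE_LINE_BYTES);
    return CACHE_LINE_BYTES / key_bytes;
}

/* Cache lines touched when scanning n keys that start on a line boundary. */
size_t cache_lines_for_keys(size_t n, size_t key_bytes)
{
    size_t per_line = keys_per_cache_line(key_bytes);
    return (n + per_line - 1) / per_line;
}
```

Under these assumptions, 16 four-byte keys occupy exactly one 64-byte line, whereas 64 keys span four lines, which without a prefetcher means up to four separate memory fetches per node.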
20. ANALYSIS
Minimum batch size to match the CPU performance
Platform | Order = 64 | Order = 16
dGPU (device memory) | 4K queries | 2K queries
dGPU (pinned memory) | N.A. | N.A.
APU (device memory) | 10K queries | 4K queries
APU (pinned memory) | 20K queries | 16K queries
reuse_factor - amortizing the cost of data-copies to the GPU
Platform | 90% Queries | 100% Queries
dGPU | 15 | 54
APU | 100 | N.A.
21. PROGRAMMABILITY
GPU
    typedef global unsigned int g_uint;
    typedef global mynode g_mynode;
    int tid = get_global_id(0);   /* one query per work-item */
    int i = 0;
    g_mynode *c = (g_mynode *)root;
    /* find the leaf node */
    while(!c->is_leaf){
        i = 0;   /* restart the key scan in each node */
        while (i < c->num_keys){
            if(keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                i++;
            else break;
        }
        /* links are byte offsets relative to root, not pointers */
        c = (g_mynode *)((intptr_t)root + c->ptr + i*sizeof(mynode));
    }
    /* match the key in the leaf node */
    for(i=0; i<c->num_keys; i++){
        if((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid]) break;
    }
    /* retrieve the record (the slide uses c->is_leaf as the byte offset of the leaf's values) */
    if(i != c->num_keys)
        records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i*sizeof(g_uint)))[0];
CPU-SSE
    int i = 0, j;
    node * c = root;
    __m128i vkey = _mm_set1_epi32(key);   /* broadcast the key to all four lanes */
    __m128i vnodekey, *vptr;
    short int mask;
    /* find the leaf node */
    while( !c->is_leaf ){
        /* compare four node keys per iteration */
        for(i = 0; i < (c->num_keys-3); i+=4){
            vptr = (__m128i *)&(c->keys[i]);
            vnodekey = _mm_load_si128(vptr);
            mask = _mm_movemask_ps(_mm_cvtepi32_ps( _mm_cmplt_epi32(vkey, vnodekey)));
            if((mask) & 8) break;   /* key < keys[i+3]: the child lies in this group */
        }
        /* scalar scan for the exact child, also covers the tail */
        for(j = i; j < c->num_keys; j++){
            if(key < c->keys[j]) break;
        }
        c = (node *)c->pointers[j];
    }
    /* match the key in the leaf node */
    for (i = 0; i < c->num_keys; i++)
        if (c->keys[i] == key) break;
    /* retrieve the record */
    if (i != c->num_keys)
        return (record *)c->pointers[i];
22. RELATED WORK
J. Fix, A. Wilkes, and K. Skadron, "Accelerating Braided B+ Tree Searches on a GPU with CUDA." In
Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis,
Implementation, and Performance, in conjunction with ISCA, 2011
‒ Authors report ~10-fold speedup over single-thread-non-SSE CPU implementation, using a discrete NVIDIA GTX
480 GPU (do not take data-copies into account)
C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, P. Dubey, “FAST:
fast architecture sensitive tree search on modern CPUs and GPUs”, SIGMOD Conference, 2010
‒ Authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (do not take data-copies into
account)
J. Sewall, J. Chhugani, C. Kim, N. Satish, P. Dubey, “PALM: Parallel, Architecture-Friendly, Latch-Free
Modifications to B+ Trees on Multi-Core Processors”, Proceedings of VLDB Endowment, (VLDB 2011)
‒ Applicable for B+ Tree modifications on the GPU
24. SUMMARY
B+ Tree is the fundamental data structure in many RDBMS
‒ Accelerating B+ Tree searches is critical
‒ Presents significant challenges on discrete GPUs
We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
‒ 2.5-fold (avg.) speedup over 6-threads+SSE CPU implementation
Possible Next Steps
‒ HSA + IOMMUv2 would reduce the issue of modifying B+ Tree
representation
‒ Investigate CPU-GPU co-scheduling
‒ Investigate modifications on the B+ Tree