OpenPOWER Acceleration of HPCC Systems

OpenPOWER Acceleration of HPCC Systems
2015 LexisNexis® Risk Solutions HPCC Engineering Summit – Community Day
Presenters: JT Kellington & Allan Cantle
September 29, 2015
Delray Beach, FL
Accelerating the
Pace of Innovation

Overview/Agenda
• The OpenPOWER Story
• Porting HPCC Systems to ppc64el
• POWER8 at a Glance
• Power 8 Data Centric Architectures with FPGA Acceleration
• Summary
2

OpenPOWER, a catalyst for Open Innovation
4
• Moore’s law no longer satisfies
performance gain
• Growing workload demands
• Numerous IT consumption models
• Mature Open software ecosystem
Performance of POWER architecture
amplified capability
Open Development
open software, open hardware
Collaboration of thought leaders
simultaneous innovation, multiple disciplines
The OpenPOWER Foundation is an open development community,
using the POWER Architecture to serve the evolving needs of customers.
• Rich software ecosystem
• Spectrum of power servers
• Multiple hardware options
• Derivative POWER chips
Market Shifts New Open Innovation
 145+ members investing on Power architecture
 Geographic, sector, and expertise diversity
 1,500 ISVs serving Linux on POWER, now with easier
porting from x86!
 Open from the chip through software
 9 workgroups chartered, 15 hardware innovations
revealed at March 2015 summit
 Partner announced plans, offerings, deployments
www.openpowerfoundation.org

Porting HPCC Systems to ppc64el

HPCC Systems and POWER8 Designed For Easy Porting
HPCC Systems Framework
• ECL is Platform Agnostic
• Hardware and architecture
changes transparent to end user
• CMake, gcc, binutils, Node.js, etc.
• Open Source!
• Quick and easy collaboration
POWER8 Designed for Open Innovation
• Little Endian OS
• More compatible with industry
• IBM Software Development Kit for
Linux on Power
• Advance Toolchain
• Post-Link Optimization
• Linux performance analysis tools
• Full support in Ubuntu, RHEL, and
SLES
6

Porting Details
Initial port took less than a week!
• Added ARCH defines for PPC64EL (system/include/platform.h)
• Added stack description for POWER8 (ecl/hql/hqlstack.hpp)
• Added binutils support for powerpc (ecl/hqlcpp/hqlres.cpp)
7
Deployment and testing took a bit longer
• Couple of bugs found:
• Stricter alignment for atomic operations
• New compiler did not allow compare of “this” to NULL
• Updates to Init system with latest Ubuntu
• Monkey on the keyboard!
• Basic setup and running was easy, but harder to go to the “next level”
• Storage configuration, memory usage, thread support, etc.

POWER8 Is Designed For Big Data!
9
Technology
• 22 nm SOI, eDRAM, 15 ML 650 mm2
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
Memory
• Up to 230 GB/s
sustained bandwidth
Bus Interfaces
• Durable open memory attach
interface
• Up to 48x 48x Integrated PCI
gen3/socket
• SMP interconnect
• CAPI
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue,
16 execution pipes
• 2x internal data
flows/queues
• Enhanced prefetching
• 64 KB data cache,
32 KB instruction cache
Accelerators
• Crypto and memory
expansion
• Transactional memory
• VMM assist
• Data move/VM mobility
POWER8 Scale-Out Dual Chip Module
Chip Interconnect
CoreCoreCore
L2L2L2
L3
Bank
L3
Bank
L3
Bank
L3
Bank
L3
Bank
L3
Bank
L2L2L2
Core Core Core
Chip Interconnect
Core Core Core
L2 L2 L2
L2 L2 L2
L3
Bank
L3
Bank
L3
Bank
L3
Bank
L3
Bank
L3
Bank
Core Core Core
MemoryBus
MemoryBus
SMPInterconnect
SMPInterconnect
SMPSMPCAPIPCIe
SMPCAPIPCIeSMP

POWER8 Introduces CAPI Technology
10
CAPP PCIe
POWER8 Processor
FPGA
Accelerated Functional Unit
(AFU)
CAPI
IBM Supplied POWER Service
Layer
Typical I/O Model Flow
Flow with a Coherent Model
Shared Mem.
Notify Accelerator
Acceleration
Shared Memory
Completion
DD Call
Copy or Pin
Source Data
MMIO Notify
Accelerator
Acceleration
Poll / Int
Completion
Copy or Unpin
Result Data
Ret. From DD
Completion
Advantages of Coherent Attachment Over I/O Attachment
 Virtual Addressing & Data Caching (significant latency reduction)
 Easier, Natural Programming Model (avoid application restructuring)
 Enables Apps Not Possible on I/O (Pointer chasing, shared mem semaphores, …)

strategy ( )
CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs from applications to eliminate 97% of code path length
• Saves 20-30 cores per 1M IOPS
Pin buffers, Translate,
Map DMA, Start I/O
Application
Read/Write Syscall
Interrupt, unmap,
unpin,Iodone scheduling
20K
instructions
reduced to
<2000
Disk and Adapter DD
strategy ( ) iodone ( )
FileSystem
Application
User Library
Posix Async
I/O Style API
Shared Memory
Work Queue
aio_read()
aio_write()
iodone ( )
LVM

CAPI Unlocks the Next Level of Performance for Flash
Identical hardware with 2 different
paths to data
FlashSystem
Conventional
I/O (FC) CAPI
0
20,000
40,000
60,000
80,000
100,000
120,000
Conventional CAPI
IOPS per Hardware Thread
0
50
100
150
200
250
300
350
400
450
500
Conventional CAPI
Latency (microseconds)
IBM POWER S822L >5x better IOPS
per HW thread
>2x lower latency

POWER 8 Data Centric Architectures
with FPGA Acceleration
• Allan Cantle – President & Founder, Nallatech

Nallatech at a glance
Server qualified accelerator cards featuring FPGAs, network I/O and an open
architecture software/firmware framework
» Energy-efficient High Performance
Heterogeneous Computing
» Real-time, low latency network
and I/O processing
» Design Services for Application Porting
» IBM Open Power partner
» Altera OpenCL partner
» Xilinx Alliance partner
» 20 Year History of FPGA Acceleration including
IBM System Z qualified & 1000+ Node deployments
14

Heterogeneous Computing – Motivation
• Efficient Evolutional Compute Performance
• 1998 - DARPA Polymorphic Computing
• 2003 – Power Wall
GPGPUProcessors
FPGA Processors Multicore Microprocessors
Polymorphic Processor
A processor that is equally efficient at processing
all different data types
Polymorphic
Application Data Types
SymbolicStreaming – SIMD - Vector
(DSP)
Bit Level
SWEPTEfficiency
Size,Weight,Energy,Performance,Time

Heterogeneous Computing – Motivation
• Efficient Evolutional Compute Performance
• 1998 - DARPA Polymorphic Computing
• 2003 – Power Wall
GPGPUProcessors
FPGA Processors Multicore Microprocessors
Application Data Types
SymbolicStreaming – SIMD - Vector
(DSP)
Bit Level
SWEPTEfficiency
Size,Weight,Energy,Performance,Time

FPGA’s for “Software Accessible” Compute Acceleration
• FPGAs are best processor for
• “In the pipe” Stream Computing Operations
• Highly parallel Bit & Byte level computation
• FPGA are NOW programmable by software engineers with OpenCL!
• Industry Standard Language managed by Khronos
• Heterogeneous language of choice for all processor types
• Low Level Explicitly Parallel language
• Heterogeneous Support is Evolving
17

Typical Thor & Roxie Cluster Node Configurations
• Large Thor Scale Out Cluster = 400 Nodes
• Large Roxie Scale Out Cluster = 100 Nodes
Xeon
E5-2670 V3
12 Cores
Memory
16GB
SAS
RAID5
Controller
PCIe
L3Cache–30MB
L2/Core–256KB
L1/Core–64KB
1.2TB
HDD
Network
Controller
~60GB/s
PCIe
10GbE
200MB/s
1.2TB
HDD
1.2TB
HDD
140 IOPs
200MB/s
140 IOPs
200MB/s
140 IOPs
Thor
Xeon
E5-2670 V3
12 Cores
L3Cache–30MB
L2/Core–256KB
L1/core–64KB
SATA
RAID5
Controller
1TB
SSD
560MB/s
1TB
SSD
1TB
SSD
100 KIOPs
560MB/s
100 KIOPs
560MB/s
100 KIOPs
Roxie
PCIe
~60GB/s
~60GB/s 2x QPI Links
Memory
16GB
400MB/s
140 IOPs
1.1 GB/s
100 KIOPs
= I/O Bus
= Memory Bus

Server
Node 1
Server
Node 400
Server
Node 2
Server
Node 399
400 Node
THOR Cluster
Hierarchical
400 Port
Network
Switch
Disc
Disc
Disc
Disc
400MB/s
140 IOPs
400MB/s
140 IOPs
400MB/s
140 IOPs
400MB/s
140 IOPs
1GB/s
1GB/s
1GB/s
1GB/s
1GB/s
Server
Node 1
Server
Node 100
Server
Node 2
Server
Node 99
100 Node
ROXIE Cluster
Hierarchical
100 Port
Network
Switch
SSD
SSD
SSD
SSD
1.1GB/s
100 KIOPs
1.1GB/s
100 KIOPs
1.1GB/s
100 KIOPs
1.1GB/s
100 KIOPs
1GB/s
1GB/s
1GB/s
1GB/s
1GB/s
• CPU intensive operations on
fragmented storage data
• Average 1GB/s to storage
• Mapper type parallel operations
• Aggregate 160GBytes/s
Storage bandwidth
• Aggregate 56K IOPs
• CPU intensive operations on
fragmented storage data
• Average 1GB/s to storage
• Pre-indexed eases limitation
• Mapper type parallel operations
• Aggregate 110GBytes/s
Storage bandwidth
• Aggregate 10M IOPs
Typical Scale Out Thor & Roxie Cluster Configurations

Memory Coherency & Data Movement Conflict
• Mapper Type – Needle in a Haystack Operations
• Little to no compute on lots of data
• But hey, it’s not a problem for the Xeon so why worry?
• Creates unnecessary traffic congestion across IO interfaces
• I/O Management Consumes CPU Threads
• Excessive Cache thrashing consumes memory bandwidth
• E.g. only 8x 4K IOPs will fit in a 32K L1 Data Cache
• Power & efficiency issues with all the unnecessary data movement
1x
Network
Controller
1x
10GbE
Storage uP
L3
L2
L1
Memory
1x
100x
Data Movement example of
CPU operating on large data
files with mapper type
functions
The burden of
using CPU centric
architectures for
Data Centric
Problems
= I/O Bus
= Memory Bus

1x
Network
Controller
1x
10GbE
Storage uP
L3
L2
L1
Memory
1x
100x
Example of CPU moving
stored data to network
for another CPU
Memory Coherency & Data Movement Conflict
• Processor Intensive Operations on fragmented data across cluster
• CPU burdened by passing lots of fragmented data from storage to the
network
• Similarly CPU has to wait for fragmented data to arrive
• Network Congestion & latency become big issues
• Less CPU cycles & cache memory available for compute intensive
operations
The burden of
using CPU centric
architectures for
Data Centric
Problems
= I/O Bus
= Memory Bus

Homogeneous Commodity Clusters still win out…right?
• Maybe, but storage is getting a LOT faster with NVMe!
Potential
72GB/s
Network
Controller
PCIe
10GbE
24 NVMe
Drives
uP
L3
L2
L1
Memory
60GB/s
Example of CPU moving
stored data to network
for another CPU
Increased Storage IO Bandwidth would
saturate CPU’s available memory
bandwidth & CPU centric architecture
breaks down completely
The burden of
using CPU centric
architectures for
Data Centric
Problems
= I/O Bus
= Memory Bus

IBM initiated Data Centric Transformation with CAPI
• CAPI = Coherent Accelerator Processor Interface
• Transport over PCIe bus
• IBM’s CAPI Flash Product
• Make Flash memory appear as Block Memory, BM
• FPGA provides “translation” BM to Flash commands
• CPU threads for storage management reduced from 40-50 down to 2
• CAPI attached Network Adaptor
• 3 Symmetrical MultiProcessing, SMP, Interconnects for Scale Up architectures
Network
Controller
PCIe
10GbE
56TB
Storage
Power 8
L3
L2
L1
Memory
230GB/s
SMP x3
FPGA
4GB/s
CAPI
CAPI
4GB/s
IBM
Standard Flash Drawer
vs CAPI Flash Drawer
= I/O Bus
= Memory Bus

Introducing In-Storage Acceleration over CAPI
• Extend CAPI Flash Product Innovations
• Increase CAPI Flash Bandwidth
• FPGA Acceleration
• Leverage NVMe IOPS & Sequential data rates
• Data Compression, Encryption, Filtering
• Bit/Byte Level Stream Computing
• Very power efficient architecture
10GbE
Power 8
L3
L2
L1
FPGA
Acceleration
4GB/s
CAPI
CAPI
Memory
230GB/s
56TB
Storage
Network
Controller
16GB/s
96GB/s
FPGA
SMP x3
= Memory Bus

P8 Node with CAPI Flash – Available Today
P8
12 Core
Socket
56TB
4GB/s
FPGA
4GB/s
1 MIOPS
P8
12 Core
Socket
56TB
4GB/s
FPGA
4GB/s
76.8GB/s
1TB Memory
230GB/s
1TB Memory
230GB/s
56TB
4GB/s
FPGA
4GB/s
56TB
4GB/s
FPGA
4GB/s
40/100GbE
NIC
40/100GbE
NIC
1 MIOPS
1 MIOPS
1 MIOPS
Maximum Configuration Shown
• Possible Configuration for Scale Out Roxie & Thor Clusters

Scale Out P8 Node with FPGA accelerated CAPI Flash
P8
12 Core
Socket
56TB
4GB/s
FPGA
1 MIOPS
P8
12 Core
Socket
56TB
4GB/s
FPGA
76.8GB/s
1TB Memory
230GB/s
1TB Memory
230GB/s
56TB
4GB/s
FPGA
56TB
4GB/s
FPGA
40/100GbE
NIC
40/100GbE
NIC
1 MIOPS
1 MIOPS
1 MIOPS
Maximum Configuration Shown
60GB/s
14.4 MIOPs
60GB/s
14.4 MIOPs
60GB/s
14.4 MIOPs
60GB/s
14.4 MIOPs
• Possible Configuration for Scale Out Roxie & Thor Clusters

Scale Up IBM E870/E880 with Scale out FPGA Acceleration
P8
Socket 56TB
4GB/s
FPGA
60GB/s
14.4 MIOPs
P8
Socket 56TB
4GB/s
FPGA
60GB/s
14.4 MIOPs
P8
Socket56TB
4GB/s
FPGA
60GB/s
14.4 MIOPs
P8
Socket56TB
4GB/s
FPGA
60GB/s
14.4 MIOPs
76.8GB/s
76.8GB/s
76.8GB/s 76.8GB/s
76.8GB/s
76.8GB/s
1TB Memory
230GB/s
1TB Memory
230GB/s
1TB Memory
230GB/s
1TB Memory
230GB/s
• Balanced Scale Up & Scale Out Solution
• Scale Up Processors ideal for Compute intensive operations
• Scale Out FPGA Accelerated Storage, ideal for Parallel Mapper type operations

P8
Socket 56TB
FPGA
P8
Socket 56TB
FPGA
P8
Socket56TB
FPGA
P8
Socket56TB
FPGA 1TB Memory 1TB Memory
1TB Memory 1TB Memory
Scale Up IBM E880 with Scale out FPGA Acceleration
• 16 Way, 192 Core, Scale Up Compute with aggregate
• 16TB Memory @ 3.68TBytes/s
• 896TB Storage @ 960 GBytes/s

In Summary: HPCC Systems and OpenPOWER Are Ready For Business!
• IBM Power8’s Data Centric Architecture is ideal for Big Data Problems
• FPGAs provide an ideal streaming, near storage, accelerator
• Opportunity to re-architect HPCC’s ECL for Power 8 with FPGA Acceleration
• Will provide a step change in performance at lower power & footprint
• A community effort will help to accelerate the transition
29
Contact information: JT Kellington | Allan Cantle
Phone: 512-286-6948 | 805 377 4993
Email: jwkellin@us.ibm.com | a.cantle@nallatech.com
Web: www.ibm.com | www.nallatech.com

OpenPOWER Acceleration of HPCC Systems

Recommended

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to OpenPOWER Acceleration of HPCC Systems (20)

More from HPCC Systems (20)

Recently uploaded (20)

OpenPOWER Acceleration of HPCC Systems