FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

Qi Xie (qi.xie@intel.com)
Hao Cheng (hao.cheng@intel.com)
Quanfu Wang (quanfu.wang@intel.com)
FPGA-BASED ACCELERATION
ARCHITECTURE FOR SPARK SQL

LEGAL NOTICES
• You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning
Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter
drafted which includes subject matter disclosed herein.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness
for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing,
or usage in trade.
• This document contains information on products, services and/or processes in development. All information provided here is
subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and
roadmaps.
• The products and services described may contain defects or errors known as errata which may cause deviations from
published specifications. Current characterized errata are available on request.
• Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-
4725 or by visiting www.intel.com/design/literature.htm.
• Intel, the Intel logo, Intel® are trademarks of Intel Corporation in the U.S. and/or other countries.
• *Other names and brands may be claimed as the property of others.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
• Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
• Copyright © 2017 Intel Corporation.
2

About me
• Software engineer from Intel Big Data Engineering Spark team
• Focused on Spark optimization for Intel Architecture
3

Outline
• What’s an FPGA
• Intel FPGA Platform
• Workload & Benchmark Introduction
• Baseline Profile - Hotspot Analysis
• FPGA Acceleration Arch Overview
• Performance Comparison
• Future Works
4

Outline
• Future Works
5

What is an FPGA?
• Field Programmable Gate Array
6
‒ Configurable Logic Blocks (CLB)
‒ Embedded Memory
‒ Digital signal processing (DSP) blocks
‒ I/O pads
‒ Hard IP(PCIe, DDR, GigE, etc )

7
Why FPGA?
a
b
c
y
y a b c  
Truth Table
a b c y
0 0 0 1
0 0 1 0
0 1 0 1
0 1 1 1
1 0 0 1
1 0 1 0
1 1 0 1
1 1 1 1
Programmed LUT
1
0
1
1
1
0
1
1
MUX y
a,b,c
LUT
Required Function
‒ Reconfigurable architecture
CLB consists of LUTs. LUT is a RAM with data width of 1 bit.
The contents are programmed at power up.
‒ Low-power, energy efficiency, compared with CPU/GPU
Extreme degree of customizations, Well positioned for High performance and providing flexibility

Outline
• Future Works
8

Discrete and Integrated FPGA platforms
9

Intel Accelerator Abstraction Layer(AAL)
10
FPGAHardware

End User Programming Interfaces
11
FPGACPU
User Application
CPU
Infrastructure IP
(UPI, PCIe*, HSSI, FPGA Management)
FPGA Runtime Software
(Accelerator Abstraction Layer)
FPGA IP
(Acceleration
Function Unit)
Intel-Provided
Infrastructure
USER SOFTWARE
INTERFACE
User Developed
Application
Specific
Functions
UPI/PCIe
HSSI
= New blocks that simplify code development.
CORE CACHE
INTERFACE
Intel® Confidential

Traditional FPGA Development Approach
Kernels
exe
AFU
Bitstream
SW
Compiler
OpenCL
Compiler
HDL
SW
Compiler
exe AFU
Bitstream
HDL Programming
Syn.
PAR
AAL
Software
Blue
Bitstream
CPU FPGA
Green
Bitstream
OpenCL
Emulator
Application
Host
AFU
Simulation
Environment
(ASE)
C
OpenCL Programming
ASE
from Intel
AAL
from Intel
Altera® Quartus
Prime Pro
OpenCL BSP
AAL
Software
Blue
Bitstream
Green
Bitstream
Application
CPU FPGA
12

Outline
• Future Works
13

Workload Introduction
14Intel Confidential
The test case is from a customer and it utilizes SQL query to get the accounting summaries by
USER_ID on a big table. The SQL query contains heavy expression evaluations.
Accounting Big Table:
TIME_ID MBUSER_ID OPER_TID SUM_TIMES CHARGE1 …
20140407 2700007679977 5B013363363w 3 0 …
20140407 2704012998344 31011G13iG0 48 57180 …
20140407 2704040114238 31Q11512ZT0 1 180 …
20140407 2700007012466 31011G13iG0 8 52320 …
20140407 2700001523491 1T0311G80610ydH10G00 2 0 …
20140407 2700000765632 310103015G0 1 30 …
20140407 2700007800325 4562210021 1 0 …
…
1.6x10^8
Rows
38 Columns
 SQL queries to summarize customers consumption characteristics utilizing
billing data.
 5GB parquet format stored on HDFS, 160 Million rows.

Workload code snippet
Function Count
Max 13
Sum 155
Substr 329
Case 133
Implicit Data type cast (String to Double) n/a
Total 630
// Prepare
val parquet = spark.read.parquet ("/mnt/nvme/inputParquet/")
parquet.createOrReplaceTempView ("inputTable")
// Query
A very Long SQL statement, intensive use build-in functions:

SQL Query Physical Execution Plan
Two stages and with a shuffle(cross the data in network), the map stage contains file scan, projection and
partial aggregation while the reduce stage do further aggregation by merging the partial aggregation results.
Stage 1 (Map)
• File Scan
Read data from source.
• Projection
Expression evaluation
consumes most CPU cycles.
• Partial Aggregation
Aggregate per partition.
Shuffle
Stage 2 (Reduce)
• Full Aggregation. Tiny
task, consumes minor
CPU cycles.

Outline
• Future Works
17

Benchmark H/W Setup
In a single server for profiling and performance evaluation.
• MCP(Skylake-FPGA Multiple Chips Package)
o CPU
Intel Xeon Skylake-P, 2Socketsx14Cores@2.8GHz, 56Hyper Threads
o FPGA
1xArria10 GX, 427,200ALM, 8MB RAM (10AX115U3F45E2SG)
o DMA Channels
1xUPI (80Gbps)
• Memory
384GB, DDR4@2133 MHz
• Disk
1xIntel SSD P3700, 1.6TB, SR:2800MB/s, SW:1900MB/s, RR:450K IOPS, RW:150K IOPS

Baseline Profile - CPU, The Bottleneck
• PAT(Performance Analysis Tool) shows CPU is heavily utilized (assigned 54/56 Virtual Cores to
Spark). The total query execution time is 85 seconds.
Note: We started measurement from the 2nd run(the 1st run is to warm up data Linux file system cache), so no disk access
bandwidth in general.
Reduce
Stage does
very simple
aggregation
and takes
minor
CPU.(~1s)
*For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Baseline Profile - CPU, The Bottleneck, Contd.
• From the VisualVM map task’s CPU breakdown we can see the projection consumes 66.7% CPU.
Projection takes
66.7% of CPU

Outline
• Future Works
21

Arch Overview – Typical SQL Query Operators
2121aaIntel Confidential
This POC Target

JVM
Spark
Spark FPGA Adaptor
Native
HW
InternalRow
to FPGA Batch
FPGA Batch
to InternalRow
FPGA Java Wrapper
FPGA Driver
FPGA
FPGA Project
Pattern
Configure
DMA
Configure
Huge Page
Memory Pool
Computation
Starter, Monitor
Java Native Interface(JNI)
Accelerator Abstraction Layer (AAL)
• Spark FPGA Adaptor
• Identify the expressions in projection and
export to FPGA SQL engine instructions
• Data conversions between Spark Internal
Rows  FPGA Batches.
• FPGA Driver
• Configure the SQL Engine Patterns according
to the instructions from Spark FPGA Adaptor
• Trigger the FPGA computation and collect
results
• Huge pages memory management
• Configure the DMA channel between main
memory & FPGA
• AAL
• FPGA runtime library
• low level API to FPGA Driver
• FPGA SQL Engine (RTL)
• SQL expression pattern units, can be
configurable.
• DMA RX: FPGA reads input data from main
memory.
• DMA TX: FPGA writes results to main memory.
23
Arch Overview - S/W Stack
DMA RX/TX(RTL) SQL Engine(RTL)

null bit set(1 bit/field) values(8 bytes/field) variable length portion
4 bytes(TIME_ID) …… 64 bytes(For 4xCL alignment)
…… 4 bytes(For 4xCL alignment)8 bytes(MBUSER_ID)
8 bytes(MBUSER_ID)FPGA Input Batch
FPGA Output Batch
Internal Row
InternalRow
to FPGA Batch
FPGA Batch
to InternalRow
FPGA Java Wrapper
FPGA Project
1. Get HugePage
wrapped in
DirectByteBuffer
Internal Rows
FPGAInputBatch
FPGAOutputBatch
Internal Rows
4. Input
for
Computation
5. Collect
computation
result
2. Data Conversion 6. Data Conversion
7. Free HugePage
wrapped in
DirectByteBuffer
• Internal Row
Spark representation of one record, flexible to represent fixed and variable length fields.
• FPGA Input Batch
For memory and computation efficiency fields are placed in a sequential physical memory.
• FPGA Output Batch
Similar as FPGA Input Batch.
3. Engine
Configuration,
Start
24
Arch Overview - S/W Stack, Contd.
12 bytes(ACC_NBR)
Input Output
Data Flow
Control Flow
Spark FPGA Adaptor

Spark
Engine
Unit
Engine
Unit
Engine
Unit …DMA
RX
DMA
TX
Output BufferInput Buffer
CPU
FPGA
FPGA
Adapter & Driver
Data Source
Input BufferInput Buffer
Output BufferOutput Buffer
Engine
Unit
Engine
Unit
Engine
Unit …
Engine
Unit
Engine
Unit
Engine
Unit …
169 Levels Pipeline
Data Flow
Control Flow
Pattern Configure,
Computation Control
25
Arch Overview - Engine Pipeline, Data Flow
• Engine Pipeline
Spark FPGA SQL Engine is designed as Engine Unit Pipelines. Every Engine Unit plays a single computation, different Engine Units are
assembled together(configured by Spark) to perform a complex computation and works in the way of pipeline. A lot of pipelines(say N
pipelines) can be constructed to perform N parallel computations, so that in a single FPGA cycle, N records can be digested.
• Data Flow
Spark pumps Data from Data Source and converts them into the format as FPGA required, and then put them into InputBuffer Array.
Then FPGA gets input data via DMA RX and feed them into Engine Pipelines. The results of Engine Pipelines are filled into OutBuffer
Array via DMA TX. Finally Spark converts data back in the format of Spark SQL needed.

Arch Overview – SQL Engine Micro Architecture
26
• Every SQL Expression Evaluation engine is configurable.
• Every engine contain max four pattern engines. The input data is parallel fed into
pattern engine. The final result is the combine of the pattern engine result.
 Pattern Engine 1 is configured to
evaluate the SQL expression
Substr(oper_tid,1,1) IN (‘1’, ‘7’)
 Pattern Engine 2 is configured to
evaluate the SQL expression
Substr(oper_tid, 2, 1) IN (‘o’)

Outline
• Future Works
27

• The FPGA accelerated version significantly reduced the total execution time, from 86
seconds(baseline) to 44 seconds in the end to end benchmark.
Speedup Ratio: 86s/44s => ~2X
FPGA: 44s
Baseline: 86s
Performance Comparison - FPGA vs Baseline
28

• The FPGA accelerated version reduced the CPU time in expression evaluation,
from 66.7%(baseline) to 6.6-% in Map stage.
Projection with FPGA, less
than 6.6%
Projection in Baseline,
66.7%
29
Performance Comparison - FPGA vs Baseline, Contd.

Outline
• Future Works
30

Future Works
• Fully Configurable FPGA SQL Acceleration Engine
• In this PoC, we identified the SQL expression patterns manually in
frontend and configure them to the FPGA SQL Engine units in
runtime; however, we have limit FPGA SQL engines to support
some of the typical expression patterns, and arbitrary SQL
expression combinations is not supported yet.
• More Operators Support
• SQL Expression Evaluation in Projection is the first step, and for the
other typical operators like Aggregation/Sort/Join probably also can
be offload to FPGA.
• CPU can also computes the expression evaluation when FPGA
resources are fully occupied in computation.
31

qi.xie@intel.com
hao.cheng@intel.com
quanfu.wang@intel.com
Thank You

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

Recommended

More Related Content

What's hot (20)

Similar to FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang (20)

More from Spark Summit (20)

Recently uploaded (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang