OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray Camera at the Swiss Light Source synchrotron

Filip Leonarski :: Beamline Data Scientist :: Macromolecular Crystallography

Plan
• Introduction: macromolecular crystallography at synchrotrons and X-ray detectors
• Technology: POWER + OpenCAPI
• Solution: Jungfraujoch

X-ray macromolecular crystallography (MX)
1901 Nobel Prize: W. Röntgen, discovery of X-rays
1962 Nobel Prize: F. Crick, J. Watson and M. Wilkins, structure of the DNA double helix solved with X-rays (Photo 51 by R. Gosling and R. Franklin)
2009 Nobel Prize: V. Ramakrishnan*, T. Steitz, A. Yonath*, structure of the ribosome
(*) some of their structures were solved at PSI

Wikipedia:
X-ray crystallography is the experimental science determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions. By measuring the angles and intensities of these diffracted beams, a crystallographer can produce a three-dimensional picture of the density of electrons within the crystal.

MX at synchrotron
• Particle accelerators are the source of the brightest X-ray beams (multiple orders of magnitude brighter than conventional X-ray tubes); the radiation is emitted when charged particles travel through a magnetic field
- The effect is a nuisance for high energy physics (undesirable energy loss),
- but it is a blessing for structural science => modern storage rings are built exclusively as light sources.
• Synchrotrons provide a continuous X-ray beam, while X-ray free electron lasers produce bright, femtosecond-long pulses

Paul Scherrer Institute
[Photo with labels: SwissFEL, Swiss Light Source, Swiss Alps]

MX at Swiss Light Source and SwissFEL
• 3 experimental stations at the synchrotron
• 1 experimental station at SwissFEL
• Beamtime is shared between academic and industrial users
- Industrial customers are mostly pharmaceutical companies looking at how drugs bind to potential drug targets
- Academic users are universities and scientific institutes worldwide doing basic research in structural biology

Major upgrade in 2024/2025 for SLS 2.0
• New storage ring to be installed in 2024-2025
• Flux (photons/second) will increase by an order of magnitude
• Measurements can be done 10x faster
• Enables the fragment screening method, i.e. a single protein target is crystallized with hundreds or thousands of molecular fragments to find the best drug
- This is like molecular docking, but fully experimental

New detector for SwissFEL and SLS 2.0
• PSI is a major detector developer
- Hybrid pixel detectors were developed for CERN high energy physics experiments
- The design could be used for X-ray cameras: first PILATUS in the 2000s
- PSI start-up Dectris commercialized the PILATUS and EIGER detectors; most synchrotrons are equipped with their detectors
• Currently PSI is rolling out a new generation: JUNGFRAU

Adaptive gain detector to increase dynamic range
• Silicon sensor converts X-rays to electric charge
• An ASIC with dedicated electronics for each pixel is bump bonded to the sensor
• Each pixel has three capacitors allowing different amplification
• They are dynamically switched during exposure to adjust for incoming charge
Aim: measure reliably from 1 to 20,000,000 photons per second

Adaptive gain detector to increase dynamic range
Pixel output in JF (16 bits): 0001010111110011
Gain (two most significant bits): 00 = G0, 01 = G1, 11 = G2
ADC value: remaining 14 bits
$$\text{Photon number} = \frac{\text{ADC} - \text{pedestal}}{\text{gain} \times \text{photon energy}}$$
Gain and pedestal factors are specific for pixel and gain setting
- Gain: prior calibration
- Pedestal: dedicated dark run
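The read-out decoding and photon conversion on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the gain bits are the two most significant bits of the 16-bit word, and the calibration numbers in the usage note are made up for the example, not real detector constants.

```python
# Gain bits -> gain level index (G0, G1, G2), as listed on the slide.
GAIN_LEVELS = {0b00: 0, 0b01: 1, 0b11: 2}


def decode_pixel(raw):
    """Split a 16-bit pixel word into (gain level, 14-bit ADC value)."""
    gain_bits = (raw >> 14) & 0b11   # assumed: two most significant bits
    adc = raw & 0x3FFF               # assumed: lower 14 bits
    return GAIN_LEVELS[gain_bits], adc  # KeyError for the unused pattern 10


def photon_count(raw, pedestal, gain, photon_energy_kev):
    """Apply (ADC - pedestal) / (gain * photon energy).

    pedestal and gain hold one value per gain level for this pixel.
    """
    level, adc = decode_pixel(raw)
    return (adc - pedestal[level]) / (gain[level] * photon_energy_kev)
```

For the example word on the slide, `decode_pixel(0b0001010111110011)` gives gain level 0 and ADC value 5619. With a hypothetical pedestal of 619 and gain of 50 ADC units per keV at 10 keV, that corresponds to 10 photons.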

Modular detector
• Detector is modular
• 524,288 pixels per module
• 2.2 kHz * 524,288 pixels * 16 bit = 2.3 GB/s per module
- 2 x 10 Gbit/s links
• 4 Mpixel detector (2020)
- 16 x 10 Gbit/s
• 10 Mpixel (2022)
- 40 x 10 Gbit/s
[Images: 4 Mpixel (2020) and 10 Mpixel (2022) detectors]
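The data-rate arithmetic above can be checked with a short Python sketch. The module counts used in the usage note (8 modules for 4 Mpixel, 20 for 10 Mpixel) are inferred from the link counts on the slide, at two 10 Gbit/s links per module.

```python
PIXELS_PER_MODULE = 524_288
FRAME_RATE_HZ = 2_200
BYTES_PER_PIXEL = 2  # 16-bit read-out


def data_rate_gb_per_s(modules):
    """Raw detector output in GB/s (1 GB = 1e9 bytes)."""
    return modules * PIXELS_PER_MODULE * FRAME_RATE_HZ * BYTES_PER_PIXEL / 1e9


def links_10g(modules):
    """Number of 10 Gbit/s links, two per module as on the slide."""
    return 2 * modules
```

`data_rate_gb_per_s(1)` is about 2.3 GB/s, while 8 and 20 modules give roughly 18.4 and 46.1 GB/s, carried over 16 and 40 links respectively.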

MX detector data rates double every 2 years
[Chart: data rate in GB/s (logarithmic axis from 0.1 to 100, labelled "Frame rate [GB/s]") versus year, 2006 to 2024]
Year  Detector            Size       Frame rate  Data rate
2007  PSI PILATUS         6 Mpixel   12.5 Hz     0.2 GB/s
2014  Dectris EIGER       16 Mpixel  133 Hz      3.4 GB/s
2019  Dectris EIGER 2 XE  16 Mpixel  400 Hz      13.5 GB/s
2020  PSI JUNGFRAU        4 Mpixel   2200 Hz     18.4 GB/s
2022  PSI JUNGFRAU        10 Mpixel  2200 Hz     46.1 GB/s
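The "double every 2 years" statement can be checked against the first and last table entries. The sketch below assumes simple exponential growth between the 2007 and 2022 data points.

```python
import math


def doubling_time_years(rate_start, rate_end, years):
    """Doubling time, assuming exponential growth between two data points."""
    return years * math.log(2) / math.log(rate_end / rate_start)


# 0.2 GB/s (PILATUS, 2007) to 46.1 GB/s (JUNGFRAU 10 Mpixel, 2022)
doubling = doubling_time_years(0.2, 46.1, 2022 - 2007)
```

With these two points the doubling time comes out slightly under two years, in line with the slide title.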

First approach: scale conventional architecture
• Detector streams frames over UDP
- Receiver using Linux datagram socket
• Conversion of pixel read-out
- CPU SIMD code
• Compression
- CPU compression

Result of the first approach: aim 20 GB/s, reached 5 GB/s

POWER / OpenCAPI / FPGA architecture

FPGAs are perfect devices for data acquisition
• Real-time performance
- FPGA design is cycle-accurate, with fixed latency and throughput
• Large memory throughput
- FPGAs with HBM2 have 460 GB/s bandwidth to 8 GB of memory
• Ethernet on-board
- FPGAs are made to work with networks, often having dedicated “hard” cores for Ethernet
• Development for FPGAs is difficult and time consuming
- Hardware description languages
- PCI Express
• Virtex UltraScale+ HBM (XCVU33P and XCVU35P)
- Available as low-profile half-length 75 W cards

High-level synthesis
• C/C++ compiler that produces hardware description language (Verilog or VHDL)
• All code is valid C++ code; it can be executed on a CPU and is generally functionally equivalent
• Dedicated pragmas guide FPGA synthesis
• It is generally understandable for software developers, but may contain strange or suboptimal constructs from a software point of view
[Code listing on slide: Bitshuffle for 16-bit numbers]

High-bandwidth memory
• For VU33/35P:
- Size: 8 GB
- Bandwidth: up to 460 GB/s
- Latency: up to 120 cycles @ 200 MHz
• Complex architecture
- 32 x 256-bit AXI3 interfaces
- Operating either as 32 separate memories
- Or as a single memory with crossbar (at the cost of up to 50% throughput)
• The 256-bit width is a problem, as data are 512-bit (PCIe Gen3 x16) or 1024-bit (OpenCAPI, PCIe Gen4 x16)
• Simulation only with special tools (Cadence Xcelium), impossible with Xilinx tools
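A few figures follow directly from the numbers on this slide. The Python sketch below only does the arithmetic; it assumes the aggregate bandwidth is spread evenly over the 32 interfaces.

```python
HBM_BANDWIDTH_GB_S = 460
HBM_INTERFACES = 32
HBM_WIDTH_BITS = 256


def per_interface_bandwidth_gb_s():
    """Share of the aggregate bandwidth per AXI3 interface (even split assumed)."""
    return HBM_BANDWIDTH_GB_S / HBM_INTERFACES


def latency_ns(cycles=120, clock_mhz=200):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000 / clock_mhz


def hbm_beats(host_word_bits):
    """How many 256-bit HBM transfers one host-side data word needs."""
    return host_word_bits // HBM_WIDTH_BITS
```

Each interface then carries about 14.4 GB/s, the worst-case latency is 600 ns, and a 1024-bit OpenCAPI word has to be split into four 256-bit transfers (two for a 512-bit PCIe Gen3 word).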

PCI Express DMA
• PCI Express is a CPU-centric bus, as it is designed to support peripherals
• This is a good model when the FPGA is a coprocessor to the CPU, which sends data and waits for a reply
=> but for data acquisition, it is the FPGA that produces the data; the CPU has no prior knowledge of which packet will be processed at a given time
• DMA operates on physical addresses: virtual addresses need to be pinned by the kernel (so they are not swapped or moved)
=> need to maintain own driver
=> address translation cache possible on FPGA, but requires memory
Xilinx QDMA is a robust but highly complex solution for PCI Express, used to interface FPGAs with x86 AMD and Intel CPUs

POWER architecture
• IBM POWER9 showed great numbers for I/O and memory throughput in the Summit and Sierra supercomputers
• IBM designed its own memory-coherent interface for accelerators (CAPI/OpenCAPI), which has advantages over PCIe
Source: Wikipedia

OpenCAPI
[Photo with labels: FPGA board, POWER9 CPU, OpenCAPI cable]

• Predecessor CAPI => proprietary IBM
• Communication over PCIe physical lines (but different protocol)
• OpenCAPI => consortium model
• Dedicated cabling (8 x 25 Gbit/s lines)
• For POWER10 this will be the default memory interface (allowing any type of memory to be attached to the CPU, and memory to be “shared” over the network)

What difference does OpenCAPI bring?
• A difference similar to the one that 80286/80386 virtual mode brought to software development
• In OpenCAPI one needs a single kernel operation
=> Attach accelerator to running process
• Then the accelerator has access to the virtual address space of the running process; it is the FPGA that initiates the communication
=> Address translation is handled by TLB and OS
=> FPGA sees memory in a fully cache-coherent way
• All security/reliability/efficiency mechanisms in CPU and kernel are also present in OpenCAPI
Source: Wikipedia

How to develop with OpenCAPI?
• Main function for the action contains a pointer to the virtual address space
- On the device the pointer will be synthesized as a 1024-bit master memory-mapped AXI interface
- On the CPU this pointer just has to be set to zero (which is the first address of the virtual address space)
• Any cell in virtual memory is simply accessed as an offset from this pointer
• The only requirement is that memory is aligned to 128 bytes
- No special memory allocator, malloc or mmap is fine
- No pinning/registering
• The same memory buffer class is used for both simulation and working with the device
• For configuration, there is also a 4 MiB memory-mapped I/O space (like BAR in PCIe)
- On the device implemented as slave AXI-lite (32-bit)
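The addressing model above can be illustrated with plain integer arithmetic. This is a Python sketch with hypothetical helper names, not the OC-Accel API: a 1024-bit interface moves 128 bytes per word, so a buffer address that is aligned to 128 bytes maps to a whole word offset from the zero pointer.

```python
WORD_BYTES = 128  # one 1024-bit AXI word


def is_aligned(address):
    """True if a virtual address satisfies the 128-byte alignment requirement."""
    return address % WORD_BYTES == 0


def word_index(address):
    """Offset of a virtual address from pointer zero, counted in 1024-bit words."""
    if not is_aligned(address):
        raise ValueError("buffer must be aligned to 128 bytes")
    return address // WORD_BYTES


def aligned_size(n_bytes):
    """Round a buffer size up to a whole number of 128-byte words."""
    return -(-n_bytes // WORD_BYTES) * WORD_BYTES
```

For example, a buffer at virtual address 0x1000 starts at word 32, and a 130-byte payload needs a 256-byte buffer.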

OC-Accel
• Open source “shell” maintained by IBM
• http://github.com/OpenCAPI/oc-accel
• Provides ready-made tools to work with OpenCAPI (from transceiver setup to AXImm bridge)
• Provides preconfigured interfaces for I/O peripherals (HBM, 100G, NVMe)
• Provides simulation environment
- One can simulate both SW and HW in a single simulation (neither the user FPGA design nor the software is modified from its “real” implementation)

Jungfraujoch – FPGA implementation

Jungfraujoch server
Up to 50 GB/s acquisition and data analysis in a single 2U IBM POWER9 server with 1-4 FPGA boards
FPGA board with OpenCAPI interface (2.5 Mpixel/board for JF)
- Data acquisition
- Initial data analysis
- Pre-compression
[Pipeline diagram with cores: Ethernet UDP/IP, Dark current, Conversion, Frame summation, Strong pixel finder, Bitshuffle, Memory writer]
Jungfraujoch FPGA streaming design
Modular design
• Stream of data handled by successive cores doing work in parallel
=> throughput and latency of each core are determined by the hardware design
• Extra stages can be added relatively simply, with an option to bypass cores
• All cores are C++ functions, connected with AXI-Stream FIFOs
• As buffering is expensive on FPGA, the design is best suited for algorithms that have limited dependencies between frames
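As a software analogy only, the chain of cores connected by FIFOs resembles a chain of Python generators: each stage consumes a stream and yields a stream, and a stage can be swapped for a bypass. The stage names below are hypothetical, and generators run one after another rather than truly in parallel as FPGA cores do.

```python
def source(frames):
    """Feed frames into the pipeline one at a time."""
    for frame in frames:
        yield frame


def subtract_offset(stream, offset):
    """Example stage: subtract a constant from every pixel."""
    for frame in stream:
        yield [p - offset for p in frame]


def clip_negative(stream):
    """Example stage: clamp negative pixel values to zero."""
    for frame in stream:
        yield [max(p, 0) for p in frame]


def bypass(stream):
    """Pass frames through unchanged, like a bypassed core."""
    yield from stream


def build_pipeline(frames, stages):
    """Connect stages in order; each stage maps a stream to a stream."""
    stream = source(frames)
    for stage in stages:
        stream = stage(stream)
    return stream
```

Adding or removing an entry in `stages` changes the processing chain without touching the other stages, which mirrors the modularity claimed on the slide.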

Jungfraujoch
Ethernet UDP/IP core
Processes Ethernet packets from the network, ignores unnecessary packets, and reads the frame header to get frame number, module number, etc.
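The filtering and header parsing step can be sketched in Python with the `struct` module. The header layout and payload size below are invented for the illustration and are not the real JUNGFRAU packet format.

```python
import struct

# Hypothetical header: frame number (u64), module number (u16), packet number (u16)
HEADER = struct.Struct("<QHH")
EXPECTED_PAYLOAD = 8192  # hypothetical payload size in bytes


def parse_packet(packet):
    """Return (frame, module, packet_no, payload), or None for packets to ignore."""
    if len(packet) != HEADER.size + EXPECTED_PAYLOAD:
        return None  # not a detector packet of the expected size
    frame, module, packet_no = HEADER.unpack_from(packet)
    return frame, module, packet_no, packet[HEADER.size:]
```

Packets that do not match the expected size are dropped early, so later stages only see well-formed detector data.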

Jungfraujoch
Dark current core
This core is responsible for calculating a moving average of detector frames. The calculated value is used as dark current (pedestal) for subsequent frames.
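A per-pixel moving average can be sketched as below. The fixed-length window is an assumption made for this Python example; the slide does not say which kind of moving average the FPGA core uses.

```python
from collections import deque


class PedestalTracker:
    """Per-pixel moving average over the last `window` dark frames."""

    def __init__(self, n_pixels, window):
        self.frames = deque(maxlen=window)
        self.sums = [0] * n_pixels

    def update(self, frame):
        """Add a frame, dropping the oldest one once the window is full."""
        if len(self.frames) == self.frames.maxlen:
            oldest = self.frames[0]
            self.sums = [s - o for s, o in zip(self.sums, oldest)]
        self.frames.append(frame)  # deque discards the oldest entry itself
        self.sums = [s + p for s, p in zip(self.sums, frame)]

    def pedestal(self):
        """Current per-pixel average; call after at least one update."""
        n = len(self.frames)
        return [s / n for s in self.sums]
```

Keeping running sums means each new frame costs one subtraction and one addition per pixel, regardless of the window length.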

Jungfraujoch
Conversion core
This core translates JUNGFRAU read-out into units of energy or photon counts. It benefits from very fast HBM2 memory within the FPGA (460 GB/s). Data leaving this core can be used for processing by data analysis software.

Jungfraujoch
Frame summation core (work in progress)
Because data leaving the gain correction (conversion) core are on a linear scale, frames can be summed to reduce the downstream data rate when a frame rate lower than that of the detector is sufficient.
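Summation of consecutive frames is straightforward once pixel values are linear. In this minimal Python sketch, dropping a trailing incomplete group is a choice made for the example.

```python
def sum_frames(frames, factor):
    """Sum each group of `factor` consecutive frames pixel by pixel.

    A trailing group with fewer than `factor` frames is dropped.
    """
    summed = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        summed.append([sum(pixels) for pixels in zip(*group)])
    return summed
```

Summing with `factor=2` halves both the frame rate and the downstream data rate, provided the output pixel type is wide enough to hold the sums.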

Jungfraujoch
Strong pixel finder core
This is the first step of a spot finding algorithm (for example COLSPOT). It identifies pixels that are stronger than their neighborhood by a given number of standard deviations.
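The criterion can be written down directly in Python. The sketch below assumes a square neighborhood that excludes the pixel itself and uses the population standard deviation; the actual core and COLSPOT may define the neighborhood and statistics differently.

```python
import statistics


def strong_pixels(image, radius, n_sigma):
    """Return (row, col) of pixels above mean + n_sigma * std of their neighborhood.

    The image needs at least two pixels so every neighborhood is non-empty.
    """
    rows, cols = len(image), len(image[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            # Neighborhood: square window clipped at the image border,
            # excluding the pixel under test (an assumption of this sketch).
            window = [image[y][x]
                      for y in range(max(0, r - radius), min(rows, r + radius + 1))
                      for x in range(max(0, c - radius), min(cols, c + radius + 1))
                      if (y, x) != (r, c)]
            mean = statistics.mean(window)
            std = statistics.pstdev(window)
            if image[r][c] > mean + n_sigma * std:
                found.append((r, c))
    return found
```

On a flat background with a single bright pixel, only that pixel is reported; its neighbors see a raised local mean and stay below the threshold.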

Jungfraujoch
Bitshuffle
FPGAs are bit order agnostic. Therefore exchanging the bit order, as done by this popular compression prefilter, is pretty much free on an FPGA.
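The idea behind the prefilter is a bit transpose: all bits of the same significance are grouped together. The simplified Python sketch below works on a list of 16-bit values; block sizes and the exact bit and byte ordering of the reference bitshuffle library are not reproduced here.

```python
def bitshuffle16(values):
    """Bit-transpose 16-bit values: plane k collects bit k of every value."""
    planes = []
    for bit in range(16):
        plane = 0
        for i, value in enumerate(values):
            plane |= ((value >> bit) & 1) << i
        planes.append(plane)
    return planes


def bitunshuffle16(planes, count):
    """Inverse transform: rebuild `count` 16-bit values from 16 bit planes."""
    values = []
    for i in range(count):
        value = 0
        for bit in range(16):
            value |= ((planes[bit] >> i) & 1) << bit
        values.append(value)
    return values
```

When most pixels hold small numbers, the planes for the high bits are entirely zero, which is what makes the shuffled stream easy for a following compressor.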

Jungfraujoch
Host memory write
The address in the host memory buffer is calculated and data are forwarded to host memory via OpenCAPI. Additional image statistics are saved as well.
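One possible address calculation is sketched below in Python. The ring-buffer layout (frames stored back to back, modules in order within a frame) is purely hypothetical; only the module size and the 128-byte alignment requirement come from earlier slides.

```python
MODULE_BYTES = 524_288 * 2  # one module image of 16-bit pixels
ALIGNMENT = 128             # OpenCAPI alignment requirement in bytes


def write_offset(frame_number, module, n_modules, buffer_frames):
    """Byte offset of a module image inside a hypothetical ring buffer of frames."""
    slot = frame_number % buffer_frames  # reuse slots once the buffer wraps
    offset = (slot * n_modules + module) * MODULE_BYTES
    assert offset % ALIGNMENT == 0       # holds because MODULE_BYTES is a multiple of 128
    return offset
```

Because the offset depends only on the frame and module numbers read from the packet header, the writer needs no per-frame negotiation with the CPU.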

Jungfraujoch implementation on VU33P FPGA
[Figure with labels: Spot finding, HBM, Gain, Pedestal, Write data, OpenCAPI, 100G, UDP]

Jungfraujoch FPGA power usage is 18 W/board for the whole streaming functionality
Source: Xilinx Vivado Power Report
2 boards for 4 Mpixel JUNGFRAU and 4 boards for 10 Mpixel JUNGFRAU
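The board counts follow from the 2.5 Mpixel/board figure quoted for the Jungfraujoch server. The Python sketch below uses the nominal detector sizes and the 18 W/board estimate to derive the total FPGA power; the totals are arithmetic, not measured values.

```python
import math

MPIXEL_PER_BOARD = 2.5
WATTS_PER_BOARD = 18


def boards_needed(detector_mpixel):
    """Number of FPGA boards for a detector of the given nominal size."""
    return math.ceil(detector_mpixel / MPIXEL_PER_BOARD)


def fpga_power_watts(detector_mpixel):
    """Total FPGA power for the streaming design, from the per-board estimate."""
    return boards_needed(detector_mpixel) * WATTS_PER_BOARD
```

This gives 2 boards and 36 W for the 4 Mpixel detector, and 4 boards and 72 W for the 10 Mpixel detector.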

Alpha Data 9H3 board
• VU33P or VU35P with 8 GB of HBM2
• OpenCAPI link and PCIe Gen3 x16 (or two PCIe Gen4 x8)
• Small flash (2 kb) to store MAC address, board IR
• QSFP-DD optical socket (same as QSFP28, but with 8 lanes for 2x100G) => compatible with QSFP28 transceivers
• Up to 75 W

OpenCAPI programming - testing
• Software tests: Catch2
- 8 min
- Among other software tests, includes 13 FPGA action tests (whole SLS code)
- Automated tests cover 95% of lines of high-level synthesis code
- Covers most of the functional correctness, including address calculation
- Main limitation is debugging the parallel behavior of FIFOs (deadlocks, etc.)
• Hardware simulation: Cadence Xcelium
- 4 hours
- Collection of 8 frames from a single module
- Checks if the hardware description is correct; can find problems with synchronization and other, very rare, issues
- Too slow to verify functionality

Commissioning in KEK (Jan-May 2021)
• Detector and data acquisition system were sent in November for an experiment at Photon Factory, KEK
• More than 2,000 datasets collected for protein targets, a few real-life native-SAD structures solved
• Due to the pandemic, detector support and development (including deployment of new FPGA design) was done fully remotely from Switzerland
[Photo: BL-1A, Photon Factory. JUNGFRAU detector (up) tested in helium chamber for native-SAD measurements with 3.75 keV X-rays]

Structure of Nucleocapsid Phosphoprotein from SARS-CoV-2 solved in 1 second
• The crystal was previously measured with a conventional setup at our beamline, with the measurement taking longer than one minute
• With the JUNGFRAU detector and OpenCAPI readout, 2000 images collected in one second were sufficient to solve the structure of this protein
• Experimental team: Filip Leonarski, Sylvain Engilberge, Vincent Olieric, Meitian Wang (MX Group), Aldo Mozzanica (PSI Detector Group)
• SARS-CoV-2 protein was produced by Zinzula, L., Basquin, J., Bracher, A., Baumeister, W. (MPI, Martinsried)

Possible gain from using FPGA based system
Courtesy: B. Mesnet (IBM)

Acknowledgements
MX Group (PSI): Vincent Olieric, Takashi Tomizaki, Chia-Ying Huang, Sylvain Engilberge, Justyna Wojdyła, Meitian Wang
Detector Group (PSI): Aldo Mozzanica, Martin Brückner, Carlos Lopez-Cuenca, Bernd Schmitt
Science IT (PSI): Leonardo Sala
Controls (PSI): Andrej Babic, Leonardo Hax-Damiani
SLS management (PSI): Oliver Bunk
Photon Factory, KEK: Naohiro Matsugaki, Yusuke Yamada, Masahide Hikita
MAX IV: Jie Nan, Zdenek Matej
Uni Konstanz: Kay Diederichs
LBL: Aaron Brewster
DLS: Graeme Winter, DIALS Team
ESRF: Jerome Kieffer
IBM Systems (France): Alexandre Castellane, Bruno Mesnet
InnoBoost SA: Lionel Clavien
