wp-01231-understanding-how-hyperflex-architecture-enables-high-performance-systems
FPGA
Intel Stratix® 10 FPGAs and SoCs leverage the Intel HyperFlex FPGA Architecture to
deliver 2X the core clock frequency performance of previous generations.
Authors
Mike Hutton
Architect
Intel Programmable Solutions Group

Introduction
To address the ever-increasing bandwidth requirements of next-generation
high-performance systems, FPGA vendors are continually making incremental
improvements in their device architectures. Even with these advanced
architectures, designers often resort to implementing their designs using very
wide on-chip buses. In fact, on-chip buses that are 512, 1,024, or 2,048 bits wide are
increasingly common. Although this method improves data throughput in the
FPGA core, these wide buses consume significant fabric resources and power. Also,
as the FPGA fills up, routing resources can become congested and the core clock
frequency may be limited.
Another way to increase bandwidth is to implement the design in an FPGA
fabricated using the most advanced process node, hoping to benefit from the
higher transistor switching speeds that are available with the latest technology.
However, as geometries continue to shrink, the interconnect delays between the
logic cells dominate the total delay in the FPGA and this limits the effectiveness
of the higher transistor switching speeds. Fundamentally, conventional FPGA
architectures cannot keep up with tomorrow’s performance demands.
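
To make the wide-bus trade-off concrete, peak datapath throughput can be approximated as bus width times core clock frequency, assuming one bus word is transferred per clock cycle. The following Python sketch is illustrative only: the function name `throughput_gbps` is hypothetical, and the two operating points (1,024 bits at 350 MHz and 512 bits at 700 MHz) are the ones this paper cites in its Power efficiency section.

```python
def throughput_gbps(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak throughput in Gbit/s, assuming one bus word per clock cycle."""
    return bus_width_bits * clock_mhz * 1e6 / 1e9


# Operating points quoted in the Power efficiency section of this paper.
wide_slow = throughput_gbps(1024, 350)    # wide bus, lower clock
narrow_fast = throughput_gbps(512, 700)   # half-width bus, doubled clock

print(f"1,024 bits @ 350 MHz: {wide_slow:.1f} Gbit/s")
print(f"  512 bits @ 700 MHz: {narrow_fast:.1f} Gbit/s")
```

Both configurations give the same peak throughput under this simple model, which is why a higher core clock can substitute for a wider bus.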
White Paper | Understanding How the New Intel HyperFlex FPGA Architecture Enables Next-Generation High Performance Systems
fabric as shown in Figure 2. The Hyper-Registers are represented by the small squares at
the intersection of every horizontal and vertical routing segment.

With this architecture, there is no need to use an ALM to find a pipeline register. Every
horizontal and vertical interconnect line in the device contains a Hyper-Register that can
be turned on or off by configuring the FPGA.

Hyper-Registers are simple, one-input one-output bypassable registers without routing
multiplexers on the input. You control these registers with configuration bits. They are
inexpensive and do not add significant silicon area to the device. Because Hyper-Registers
are ubiquitous in the core fabric, designers are not limited by the number of registers in
their design. They can retime and pipeline as needed without consuming additional LAB
resources. In many cases, the design uses fewer LAB resources because registers are
implemented using the Hyper-Registers in the routing instead of partially consuming an ALM
simply to use its register.

The HyperFlex Advantage
Because the Hyper-Registers are included in the interconnect routing architecture, timing
optimizations can be done after place-and-route, without changing the design’s routing.
The Intel Quartus Prime software can easily find and use the Hyper-Registers during
retiming operations. Figure 3 compares a conventional routing multiplexer and a HyperFlex
routing multiplexer with the included Hyper-Register.

[Figure 3: Conventional routing multiplexer compared with the Stratix 10 HyperFlex routing
multiplexer (with Hyper-Register). Labels: Interconnect, CRAM Config, clk.]

The Hyper-Registers allow you to take advantage of traditional performance enhancement
methods (retiming, pipelining, and optimization) implemented in a new and better way. When
implemented using the Hyper-Registers instead of the ALM registers, these techniques are
referred to as Hyper-Retiming, Hyper-Pipelining, and Hyper-Optimization. Table 1 summarizes
the performance gains achieved when these techniques are used in sequence, giving a
three-step process to maximize performance using the Intel HyperFlex FPGA Architecture.

Step  Architecture Advantage  Effort Required                 Core Performance *
1     Hyper-Retiming          No change or minor RTL changes  1.5X
2     Hyper-Pipelining        Added pipelining                1.65X
3     Hyper-Optimization      Design dependent                2X or more

Table 1. Three-Step Process to Maximize Performance Using the Intel HyperFlex FPGA Architecture
* vs. previous-generation high-performance FPGAs

Hyper-retiming
In conventional architectures, software performs retiming by finding a nearby, unused ALM
register and including it in the circuit. This retiming method is limited by the
granularity of the ALM register placement:
• The unused ALM may not be located conveniently, causing additional delay to include it in
the design.
• There is a delay overhead to route through the ALM to the register.
• If software is trying to retime a wide bus (512 bits, 1,024 bits, or wider), retiming
requires a large number of additional logic cells.
• The algorithms required to determine the best location for a retimed register are
difficult.

Figure 4 shows a routing example before and after retiming with conventional architectures.

In the new HyperFlex core architecture, the Hyper-Registers are used to enable fine-grained
Hyper-Retiming. The Intel Quartus Prime software retimes the path by moving the register
out of the logic cell and into the interconnect. Because there are Hyper-Registers
available in every routing segment, there are many register locations available, making
the optimization easy.

With Hyper-Registers, the retiming granularity is extremely fine; it is the delay of an
individual routing wire, which is a few tens of picoseconds. The compromises made when
trying to locate the retiming registers in conventional architectures, shown in Figure 4,
are unnecessary with the Intel HyperFlex FPGA Architecture. Therefore, paths that are a few
nanoseconds long can be split perfectly during Hyper-Retiming as shown in Figure 5.

Hyper-Retiming does not affect existing LABs and ALMs, which means that there is no
incremental placement or routing required and no significant impact on compilation time.
To retime a register, the register location is simply pushed into the routing to its
naturally balanced final location (see Figure 5) after place and route. This feature is a
tremendous benefit for designs with wide data buses that require hundreds or thousands of
additional ALMs to achieve retiming in conventional architectures, and typically require
extensive rerouting.

• For more information about using the Intel Quartus Prime software to perform
Hyper-Retiming, refer to the Using Intel Quartus Prime Software to Maximize Performance in
the Intel HyperFlex FPGA Architecture technical white paper.
[Figure 4: Retiming with a conventional architecture. Before retiming: stages of 1.5 ns
(short interconnect) and 3.5 ns (long interconnect, many hops), 286 MHz. After retiming,
with an additional ALM in the path: stages of 3 ns and 2.5 ns (shorter interconnects, fewer
hops), 333 MHz.]

[Figure 5: Hyper-Retiming. Before retiming: stages of 1.5 ns (short interconnect) and
3.5 ns (long interconnect, many hops), 286 MHz. The after-retiming panel is not recoverable
from this copy.]
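
The frequencies quoted in the retiming figures follow from the slowest register-to-register stage: fMAX is roughly the reciprocal of the longest stage delay. The Python sketch below reproduces the 286 MHz and 333 MHz values from the stage delays in Figure 4, ignoring register setup time and clock skew. The function name `fmax_mhz` is hypothetical, and the evenly split 2.5 ns / 2.5 ns case is an assumption made for illustration: it models an ideal split of the 5 ns total path, not a value read from Figure 5.

```python
def fmax_mhz(stage_delays_ns):
    """Approximate fMAX in MHz: limited by the slowest register-to-register stage."""
    return 1000.0 / max(stage_delays_ns)


# Stage delays taken from Figure 4 (conventional retiming).
before = fmax_mhz([1.5, 3.5])              # long 3.5 ns stage dominates
after_conventional = fmax_mhz([3.0, 2.5])  # register moved to a nearby ALM

# ASSUMPTION: an ideal split of the same 5 ns path into two equal stages.
ideal_split = fmax_mhz([2.5, 2.5])

for name, f in [("before", before),
                ("conventional retiming", after_conventional),
                ("ideal split (assumed)", ideal_split)]:
    print(f"{name}: {f:.0f} MHz")
```

The closer the register can be placed to the midpoint of the path, the higher the resulting fMAX, which is the motivation for the fine retiming granularity described above.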
[Figure: Pipelining example (figure numbers and captions are not recoverable from this
copy). Before pipelining: long interconnect (many hops) of 3.5 ns, 286 MHz. After
pipelining, with an additional ALM splitting the long interconnect into two shorter
interconnects (fewer hops): 400 MHz. A further before-pipelining panel shows stages of
1.5 ns and 3.5 ns, 286 MHz.]
fanout clock enables, allow the software to move registers easily through the FPGA core
fabric.

For more information on design optimizations that take advantage of the Intel HyperFlex
FPGA Architecture, refer to the Tailoring RTL Designs for Optimum Performance in the Intel
HyperFlex FPGA Architecture technical white paper.

Hyper-optimization
After Hyper-Retiming and Hyper-Pipelining are complete, the performance gains may be so
great in some sections of the design that other areas are exposed as bottlenecks that
prevent further gains. These bottlenecks may be circuits such as long feedback loops or
complex state machines that need to be evaluated on every clock cycle.

A common method for improving the design performance is to optimize specific portions of
the design. For example, a long feedback loop can limit the design's maximum frequency
(fMAX). Redesigning the circuit to pre-compute the possible feedback values, and using a
short feedback loop to select between them, increases the maximum frequency. With
Hyper-Registers, this process can achieve higher speeds than are possible with conventional
architectures because the pre-compute paths can be optimized using Hyper-Retiming and
Hyper-Pipelining. Figure 9 shows an example of Hyper-Optimization: performing a Shannon
decomposition (or Boolean factorization) shortens the loop, thus increasing the maximum
frequency. Typically, you target these optimizations at control loops in which the
performance benefit greatly outweighs any area cost of the additional logic required to
achieve the factorization.

[Figure 9: Hyper-Optimization example. Surviving labels: B, C, "Time Around Feedback Loop
Limits fMAX".]

Flexible, high-speed programmable clock tree synthesis
Clocking in high-performance FPGA designs is becoming more challenging for designers.
Conventional FPGAs have fixed global clock tree networks that are designed to support
high-fanout, chip-wide, global clock domains. At GHz performance, however, clock networks
require greater flexibility. Designers want to create time-shifted clocks for performance
balancing and clock crossing, and generate dynamically gated clocks for rate-matching and
system power management.

To address these needs, the Intel HyperFlex FPGA Architecture contains an entirely new
clock structure with pre-routed clock paths onto which a design’s clock region is
synthesized (as is common for ASIC clock tree synthesis). This structure allows
unprecedented flexibility to create small, localized clock domains. It also lets the
software manage skew: taking advantage of beneficial skew when available and minimizing
skew when necessary. Additionally, when required, this clock structure can be used to
synthesize
traditional global and regional balanced H-tree clocks for backwards compatibility.

The Intel Quartus Prime software manages the programmable clock tree synthesis; it
synthesizes clock trees in an integrated fashion during place-and-route. Figure 10 shows an
example of balanced and unbalanced clock trees synthesized by this approach.

[Figure 10: Balanced and unbalanced clock trees.]

Power efficiency
Intel Stratix 10 FPGAs and SoCs offer a significant power improvement over previous
families in large part due to the use of Intel’s 14 nm Tri-Gate (FinFET) process technology
to fabricate the devices. Additionally, the Intel HyperFlex FPGA Architecture facilitates
dramatic power savings.

The increased performance of the Intel HyperFlex FPGA Architecture enables a 1,024-bit
datapath clocked at 350 MHz to be implemented as a 512-bit datapath clocked at 700 MHz. As
a result, the design fits into a device half the size. This change is neutral for dynamic
power but reduces static power by half and also results in significant cost savings by
using a smaller device. Alternatively, the designer has the freedom to use part of the
performance dividend to improve clock speed, and convert the remaining performance dividend
into power savings through a reduced core power supply voltage or the use of a slower
speed-grade device.

Productivity
The increased core performance available with the Intel HyperFlex FPGA Architecture
provides benefits that go beyond simply running the core at a faster clock rate. The
additional performance results in easier timing closure, improved design team productivity,
and shorter time-to-market for the product.

Conclusion
Meeting the needs of next-generation, high-performance designs is a challenge with
conventional FPGA core architectures. The value of techniques such as retiming, pipelining,
and optimization is limited by the architecture itself. The Intel HyperFlex FPGA
Architecture in Intel Stratix 10 devices, with its “registers everywhere” approach, takes
these optimizations to the next level, resulting in 2X the core performance compared to
previous-generation high-performance FPGAs.
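
The Shannon decomposition described under Hyper-optimization can be illustrated with a small behavioral model. This Python sketch is not RTL and is not the circuit of Figure 9: `slow_logic`, `run_direct`, and `run_shannon` are hypothetical names, and the 1-bit state is an assumption chosen to keep the example small. It pre-computes both cofactors f(x, 0) and f(x, 1) outside the loop, so the feedback path only has to select between them.

```python
def slow_logic(x: int, s: int) -> int:
    """Stand-in for a deep combinational function of input x and 1-bit state s."""
    return ((x * 3 + s) ^ (x >> 1)) & 1


def run_direct(inputs, s0=0):
    """Original loop: the full function sits inside the feedback path."""
    s, trace = s0, []
    for x in inputs:
        s = slow_logic(x, s)
        trace.append(s)
    return trace


def run_shannon(inputs, s0=0):
    """Shannon-decomposed loop: f(x, s) = f(x, 1) if s else f(x, 0)."""
    # The cofactors depend only on x, so they lie outside the feedback loop.
    cofactors = [(slow_logic(x, 0), slow_logic(x, 1)) for x in inputs]
    s, trace = s0, []
    for f0, f1 in cofactors:
        s = f1 if s else f0  # only a 2:1 select remains in the loop
        trace.append(s)
    return trace


inputs = [5, 2, 7, 0, 3, 6, 1, 4]
print(run_direct(inputs) == run_shannon(inputs))
```

Because the cofactors depend only on the input, in hardware they could be retimed or pipelined freely, leaving only the select inside the feedback loop. The cost is the duplicated logic for the two cofactors, which matches the area trade-off noted in the text.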
WP-01231-1.4