
White Paper

FPGA

Understanding How the New Intel® HyperFlex™ FPGA Architecture Enables
Next-Generation High-Performance Systems

Intel Stratix® 10 FPGAs and SoCs leverage the Intel HyperFlex FPGA Architecture to
deliver 2X the core clock frequency performance of previous generations.

Authors
Mike Hutton
Architect
Intel Programmable Solutions Group

Introduction

To address the ever-increasing bandwidth requirements of next-generation
high-performance systems, FPGA vendors are continually making incremental
improvements in their device architectures. Even with these advanced
architectures, designers often resort to implementing their designs using very
wide on-chip buses. In fact, on-chip buses 512, 1,024, or 2,048 bits wide are
increasingly common. Although this method improves data throughput in the
FPGA core, these wide buses consume significant fabric resources and power. Also,
as the FPGA fills up, routing resources can become congested and the core clock
frequency may be limited.

Another way to increase bandwidth is to implement the design in an FPGA
fabricated using the most advanced process node, hoping to benefit from the
higher transistor switching speeds that are available with the latest technology.
However, as geometries continue to shrink, the interconnect delays between the
logic cells dominate the total delay in the FPGA, and this limits the
effectiveness of the higher transistor switching speeds. Fundamentally,
conventional FPGA architectures cannot keep up with tomorrow’s performance
demands.

Table of Contents
Introduction
The Intel HyperFlex FPGA Architecture
The HyperFlex Advantage
Conclusion
Where to Get More Information

The need for bandwidth

Optical transport network (OTN), wireline, military, and high-performance
computing applications require ever increasing bandwidth. The need to move
greater quantities of data has resulted in increasing datapath widths inside the
FPGA. The amount of data that can be moved through the routing architecture is a
function of the number of wires used and the speed (fMAX) of the wires. The number
of wires available is a function of technology; it is derived from the size of the
device and the minimum pitch of wires in the technology.

Routing architectures can make the wires more efficient by using hierarchy (for
example, local routing in logic array blocks (LABs) and global routing on horizontal
and vertical interconnect lines) and optimization techniques. However, doubling
the number of wires simply adds die area and increases power dissipation. The
speed of routing wires is technology driven (the RC delay on a wire), and is subject
to the FPGA architecture and the design implementation. For example, pipelining a
design can increase the clock speed without increasing the number of wires, which
increases bandwidth for the same resources.
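
The relationship described above, in which throughput depends on the number of
wires and the clock rate of those wires, can be checked with a short calculation.
The following is an illustrative Python sketch, not part of any Intel tool. The
function name bus_bandwidth_gbps is hypothetical, and the two operating points
(1,024 bits at 350 MHz and 512 bits at 700 MHz) are the ones this paper uses
later in its power efficiency discussion.

```python
def bus_bandwidth_gbps(width_bits, fmax_mhz):
    """Raw throughput in Gbit/s of a bus that moves one word per clock cycle."""
    return width_bits * fmax_mhz / 1000.0


# Same throughput, reached with half the wires at twice the clock rate.
wide_and_slow = bus_bandwidth_gbps(1024, 350)
narrow_and_fast = bus_bandwidth_gbps(512, 700)

print(f"1,024 bits at 350 MHz: {wide_and_slow:.1f} Gbit/s")
print(f"  512 bits at 700 MHz: {narrow_and_fast:.1f} Gbit/s")
```

Both configurations move the same amount of data per second, which is why a
higher fMAX can be traded for a narrower bus and fewer routing resources.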
White Paper | Understanding How the New Intel HyperFlex FPGA Architecture Enables Next-Generation High-Performance Systems

[Figure 1. Added Delays with Conventional Pipelining. Panels: Before Interconnect Pipelining; After Interconnect Pipelining (signal routed through an ALM).]

The need for efficiency

When designers pipeline a design for greater performance, they add registers to
the design. The traditional methodology of building register-look-up table (LUT)
pairs that is present in all existing FPGA core architectures means that logic is
sacrificed to reach the added pipeline registers. Pipelining in conventional
architectures also incurs a delay cost because a signal needs to be routed into
and out of a logic block. The result is diminishing returns for the pipelining
technique, especially when routing delays dominate the total delay. Figure 1
shows before and after examples of conventional pipelining, and the added delays
due to routing into and out of the added register.

The need for improved clocking

As clock speeds increase, clock skew becomes increasingly important.
Conventional FPGA core architectures have concentrated on balanced clock trees,
which minimize deterministic skew. This method has served well for designs up to
500 MHz, but to break the 500 MHz barrier and reach speeds of up to 1 GHz, a
next-generation clocking solution is needed. The solution must localize clocks
to minimize local variation and skew, as well as provide a flexible network that
services the numerous clocks that are common in high-performance designs.

The Intel® FPGA HyperFlex™ solution

To address these challenges, Intel Stratix® 10 FPGAs and SoCs (formerly Altera®
Stratix 10 FPGAs and SoCs) introduce an entirely new core architecture, the
Intel HyperFlex FPGA Architecture. The innovative Intel HyperFlex FPGA
Architecture supports previously unimaginable levels of performance: 2X the core
performance compared to previous-generation high-performance FPGAs. This
performance level is not possible with conventional architectures. To take
advantage of the Intel HyperFlex FPGA Architecture, you use familiar techniques:
register retiming, pipelining, and design optimization. These techniques can
speed up designs on conventional architectures. However, when combined with the
Intel HyperFlex FPGA Architecture, the result is designs that run at blazing
fast speeds with core clock rates up to 1 GHz.

The Intel HyperFlex FPGA Architecture

Intel Stratix 10 devices have a redesigned core architecture that includes
additional registers, called Hyper-Registers, everywhere throughout the core
fabric. These registers are available in every interconnect routing segment and
at the inputs of all functional blocks. The Hyper-Registers provide a
fine-grained solution to the problem of how to increase bandwidth and improve
area and power efficiency. With many more registers that are easy to access, you
can retime registers to eliminate critical paths, add pipeline registers to
remove routing delays, and optimize your design for best-in-class performance.
When Hyper-Registers are used to implement these techniques, all FPGA logic
resources are available for logic functions instead of being sacrificed as
feed-through cells to reach conventional LUT registers.

To keep up with the high performance of the core fabric, the dedicated function
blocks in the FPGA core, such as M20K memory and floating-point digital signal
processing (DSP) blocks, have been redesigned to support operation at clock
speeds up to 1 GHz.

To make it easy to use the Hyper-Registers, the Intel Quartus® Prime software
includes a Hyper-Aware design flow with:
• Post place-and-route performance tuning for accelerated timing closure
• Hyper-Aware synthesis and place-and-route for efficient pipelining
• Fast Forward Compilation to explore performance enhancement options

To address the need for a flexible clock network, Intel Stratix 10 FPGAs and
SoCs include programmable clock tree synthesis. This ASIC-like clocking helps
mitigate skew and uncertainty. It also lowers power dissipation by intelligently
enabling clock network branches.

Intel Stratix 10 FPGAs and SoCs use Intel’s 14 nm Tri-Gate (FinFET) process
technology. The combination of the new Intel HyperFlex FPGA Architecture and the
industry-leading FinFET process technology allows Intel Stratix 10 devices to
achieve 2X the core performance compared to previous-generation
high-performance FPGAs.

Hyper-registers

With 90 nm Stratix II FPGAs, Intel was the first FPGA vendor to shrink the
critical path depth with 6-input LUTs. In 28 nm Stratix V FPGAs, Intel
introduced time-borrowing latches to allow automatic micro-retiming of clock and
data signals. With 14 nm Intel Stratix 10 devices, Intel is the first FPGA
company to introduce an entirely new “registers everywhere” core architecture
filled with bypassable retiming and pipelining registers. This method breaks the
link between the functional registers in the adaptive logic module (ALM) itself,
and the Hyper-Registers used for retiming and pipelining critical paths and
improving design efficiency.

The Intel HyperFlex FPGA Architecture is built for retiming and pipelining
high-performance designs. All routing segments have an optional Hyper-Register
built into the programmable routing multiplexer that allows the routing segment
to be registered or combinational. These Hyper-Registers are available
everywhere throughout the core fabric as shown in Figure 2. The Hyper-Registers
are represented by the small squares at the intersection of every horizontal and
vertical routing segment.

[Figure 2. "Registers Everywhere" Intel HyperFlex FPGA Architecture. A grid of ALMs; registers are available in every routing segment and on all block inputs (ALM, M20K blocks, DSP blocks, and I/O cells).]

With this architecture, there is no need to use an ALM to find a pipeline
register. Every horizontal and vertical interconnect line in the device contains
a Hyper-Register that can be turned on or off by configuring the FPGA.

Hyper-Registers are simple, one-input one-output bypassable registers without
routing multiplexers on the input. You control these registers with
configuration bits. They are inexpensive and do not add significant silicon area
to the device. Because Hyper-Registers are ubiquitous in the core fabric,
designers are not limited by the number of registers in their design. They can
retime and pipeline as needed without consuming additional LAB resources. In
many cases, the design uses fewer LAB resources because registers are
implemented using the Hyper-Registers in the routing instead of partially
consuming an ALM simply to use its register.

The HyperFlex Advantage

Because the Hyper-Registers are included in the interconnect routing
architecture, timing optimizations can be done after place-and-route, without
changing the design’s routing. The Intel Quartus Prime software can easily find
and use the Hyper-Registers during retiming operations. Figure 3 compares a
conventional routing multiplexer and a HyperFlex routing multiplexer with the
included Hyper-Register.

[Figure 3. Comparing Conventional and HyperFlex Routing. Left: conventional routing multiplexer with CRAM configuration. Right: Stratix 10 HyperFlex routing multiplexer (with Hyper-Register), clk input, and CRAM configuration.]

The Hyper-Registers allow you to take advantage of traditional performance
enhancement methods (retiming, pipelining, and optimization) implemented in a
new and better way. When implemented using the Hyper-Registers instead of the
ALM registers, these techniques are referred to as Hyper-Retiming,
Hyper-Pipelining, and Hyper-Optimization. Table 1 summarizes the performance
gains achieved when these techniques are used in sequence, giving a three-step
process to maximize performance using the Intel HyperFlex FPGA Architecture.

Table 1. Three-Step Process to Maximize Performance Using the Intel HyperFlex FPGA Architecture

Step  Architecture Advantage  Effort Required                 Core Performance *
1     Hyper-Retiming          No change or minor RTL changes  1.5X
2     Hyper-Pipelining        Added pipelining                1.65X
3     Hyper-Optimization      Design dependent                2X or more

* vs. previous-generation high-performance FPGAs

Hyper-retiming

In conventional architectures, software performs retiming by finding a nearby,
unused ALM register and including it in the circuit. This retiming method is
limited by the granularity of the ALM register placement:
• The unused ALM may not be located conveniently, causing additional delay to
  include it in the design.
• There is a delay overhead to route through the ALM to the register.
• If software is trying to retime a wide bus (512 bits, 1,024 bits, or wider),
  retiming requires a large number of additional logic cells.
• The algorithms required to determine the best location for a retimed register
  are difficult.

Figure 4 shows a routing example before and after retiming with conventional
architectures.

In the new HyperFlex core architecture, the Hyper-Registers are used to enable
fine-grained Hyper-Retiming. The Intel Quartus Prime software retimes the path
by moving the register out of the logic cell and into the interconnect. Because
there are Hyper-Registers available in every routing segment, there are many
register locations available, making the optimization easy.

With Hyper-Registers, the retiming granularity is extremely fine; it is the
delay of an individual routing wire, which is a few tens of picoseconds. The
compromises made when trying to locate the retiming registers in conventional
architectures, shown in Figure 4, are unnecessary with the Intel HyperFlex FPGA
Architecture. Therefore, paths that are a few nanoseconds long can be split
perfectly during Hyper-Retiming as shown in Figure 5.

Hyper-Retiming does not affect existing LABs and ALMs, which means that there is
no incremental placement or routing required and no significant impact on
compilation time. To retime a register, the register location is simply pushed
into the routing to its naturally balanced final location (see Figure 5) after
place and route. This feature is a tremendous benefit for designs with wide data
buses that require hundreds or thousands of additional ALMs to achieve retiming
in conventional architectures, and typically require extensive rerouting.

• For more information about using the Intel Quartus Prime software to perform
  Hyper-Retiming, refer to the Using Intel Quartus Prime Software to Maximize
  Performance in the Intel HyperFlex FPGA Architecture technical white paper.

[Figure 4. Retiming in Conventional FPGA Architecture. Before retiming: stage delays of 1.5 ns (short interconnect) and 3.5 ns (long interconnect, many hops), 286 MHz. After retiming through an additional ALM: stage delays of 3 ns and 2.5 ns (shorter interconnects, fewer hops), 333 MHz.]

[Figure 5. Hyper-Retiming in the Intel HyperFlex FPGA Architecture. Before retiming: stage delays of 1.5 ns (short interconnect) and 3.5 ns (long interconnect, many hops), 286 MHz. After Hyper-Retiming with a Hyper-Register in the interconnect: stage delays of 2.5 ns and 2.5 ns, 400 MHz.]
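
The frequencies annotated in Figures 4 and 5 follow from the slowest
register-to-register stage: fMAX is the reciprocal of the longest stage delay.
The Python sketch below reproduces that arithmetic and then shows, under assumed
placement step sizes, why a finer register granularity lets a 5 ns path be split
more evenly. Only the stage delays come from the figures. The helper names are
hypothetical, the 50 ps step merely stands in for the "few tens of picoseconds"
wire delay, and the 2 ns coarse step is an arbitrary illustration. The model
ignores setup time, clock skew, and the extra ALM routing delay that Figure 4
includes.

```python
def fmax_mhz(stage_delays_ps):
    """Clock rate in MHz, limited by the slowest register-to-register stage."""
    return 1e6 / max(stage_delays_ps)


def split_path(total_ps, step_ps):
    """Place one register on a path of total_ps, restricted to multiples of
    step_ps, so that the longer of the two resulting stages is as short as
    possible. Returns (first_stage_ps, second_stage_ps)."""
    positions = range(0, total_ps + 1, step_ps)
    best = min(positions, key=lambda pos: max(pos, total_ps - pos))
    return best, total_ps - best


# Stage delays taken from Figures 4 and 5, in picoseconds.
cases = {
    "before retiming": [1500, 3500],
    "conventional retiming": [3000, 2500],
    "Hyper-Retiming": [2500, 2500],
}
for name, delays in cases.items():
    print(f"{name}: {fmax_mhz(delays):.0f} MHz")

# Assumed step sizes: 50 ps (fine, wire-level) and 2000 ps (coarse).
fine = split_path(5000, 50)
coarse = split_path(5000, 2000)
print(f"fine step:   {fine} -> {fmax_mhz(fine):.0f} MHz")
print(f"coarse step: {coarse} -> {fmax_mhz(coarse):.0f} MHz")
```

With the fine step, the register lands at the midpoint and the path splits into
two 2.5 ns stages. With the coarse step, the best available position leaves one
stage longer than the other, so the clock rate is lower.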


Hyper-pipelining

Conventional pipelining suffers from the same drawbacks as conventional
retiming, and the lack of register granularity reduces the effectiveness of this
optimization. Conventional pipelining is inherently an iterative process because
the number of pipeline stages required, and their optimum location, is unknown
at the start of the process. Therefore, the design must be placed and routed
several times while trying to converge on a pipelined solution that meets
performance goals. Figure 6 shows a simple example before and after conventional
pipelining.

[Figure 6. Pipelining in Conventional FPGA Architectures. Before pipelining: a 3.5 ns stage over a long interconnect (many hops), 286 MHz. After pipelining through an additional ALM: stage delays of 1.5 ns, 1.5 ns, and 2.5 ns (shorter interconnects, fewer hops), 400 MHz.]

When using the Intel HyperFlex FPGA Architecture, you can pipeline at will using
the Hyper-Registers without bloating the size of the design. This process is
known as Hyper-Pipelining. In many cases, a design with heavy register usage
experiences a decrease in the number of ALMs required to implement the design
because no “orphaned” registers are needed.

With what amounts to cost-free pipelining, you can use the technique
aggressively, particularly in datapath and feed-forward logic. Figure 7 shows an
example of Hyper-Pipelining.

[Figure 7. Hyper-Pipelining in the Intel HyperFlex FPGA Architecture. Before pipelining: stage delays of 1.5 ns and 3.5 ns, 286 MHz. After Hyper-Pipelining with a Hyper-Register in the interconnect: stage delays of 1.5 ns, 1.75 ns, and 1.75 ns, 572 MHz.]

Because the software can automatically retime the logic by moving registers into
the interconnect, you only need to specify the required number of pipeline
registers at the input to a clock domain or at a sub-design’s pin logic. The
Intel Quartus Prime software then moves the registers into the routing as
required, after place and route, solving the multiple iteration problem that
exists with pipelining in a conventional architecture. Placing registers
together in the RTL also allows for easy logic parameterization when
intellectual property (IP) libraries target more than one clock frequency
(fMAX). Figure 8 shows an example of placing additional pipeline registers at
the input of a clock domain and the resulting movement of these registers into
the optimum position in the interconnect routing.

[Figure 8. Placing Additional Pipeline Registers at the Input of a Clock Domain]

Designers who make their design pipeline-friendly will experience the greatest
benefit from the Intel HyperFlex FPGA Architecture. For example, latency-tolerant
forms of flow control, such as using data-valid signals instead of high-fanout
clock enables, allow the software to move registers easily through the FPGA core
fabric.

For more information on design optimizations that take advantage of the Intel
HyperFlex FPGA Architecture, refer to the Tailoring RTL Designs for Optimum
Performance in the Intel HyperFlex FPGA Architecture technical white paper.

Hyper-optimization

After Hyper-Retiming and Hyper-Pipelining are complete, the performance gains
may be so great in some sections of the design that other areas are exposed as
bottlenecks that prevent further gains. These bottlenecks may be circuits such
as long feedback loops or complex state machines that need to be evaluated on
every clock cycle.

A common method for improving the design performance is to optimize specific
portions of the design. For example, a design with a long feedback loop can
limit the maximum frequency (fMAX). Redesigning the circuit to pre-compute the
possible feedback values, and using a short feedback loop to select between
them, increases the maximum frequency. With Hyper-Registers, this process can
achieve higher speeds than are possible with conventional architectures because
the pre-compute paths can be optimized using Hyper-Retiming and
Hyper-Pipelining. Figure 9 shows an example of Hyper-Optimization: performing a
Shannon decomposition (or Boolean factorization) to shorten the loop, thus
increasing the maximum frequency. Typically, you target these optimizations at
control loops in which the performance benefit greatly outweighs any area cost
of the additional logic required to achieve the factorization.

[Figure 9. Hyper-Optimization of a Long Feedback Loop. Top: inputs A, B, and C feed a chain of ALM logic; the time around the feedback loop limits fMAX. Bottom: the logic is duplicated for feedback values 0 and 1 and a short loop selects between the two results; the time around the short loop does not limit fMAX.]

Flexible, high-speed programmable clock tree synthesis

Clocking in high-performance FPGA designs is becoming more challenging for
designers. Conventional FPGAs have fixed global clock tree networks that are
designed to support high-fanout, chip-wide, global clock domains. At GHz
performance, however, clock networks require greater flexibility. Designers want
to create time-shifted clocks for performance balancing and clock crossing, and
generate dynamically gated clocks for rate-matching and system power management.

To address these needs, the Intel HyperFlex FPGA Architecture contains an
entirely new clock structure with pre-routed clock paths onto which a design’s
clock region is synthesized (as is common for ASIC clock tree synthesis). This
structure allows unprecedented flexibility to create small, localized clock
domains. It also lets the software manage skew: taking advantage of beneficial
skew when available and minimizing skew when necessary. Additionally, when
required, this clock structure can be used to synthesize traditional global and
regional balanced H-tree clocks for backwards compatibility.

The Intel Quartus Prime software manages the programmable clock tree synthesis;
it synthesizes clock trees in an integrated fashion during place-and-route.
Figure 10 shows an example of balanced and unbalanced clock trees synthesized by
this approach.

Power efficiency

Intel Stratix 10 FPGAs and SoCs offer a significant power improvement over
previous families in large part due to the use of Intel’s 14 nm Tri-Gate
(FinFET) process technology to fabricate the devices. Additionally, the Intel
HyperFlex FPGA Architecture facilitates dramatic power savings. The increased
performance of the Intel HyperFlex FPGA Architecture enables a 1,024-bit
datapath clocked at 350 MHz to be implemented as a 512-bit datapath clocked at
700 MHz. As a result, the design fits into a device half the size. This change
is neutral for dynamic power but reduces static power by half and also results
in significant cost savings by using a smaller device. Alternatively, the
designer has the freedom to use part of the performance dividend to improve
clock speed, and convert the remaining performance dividend into power savings
through reduced core power supply voltage or by using a slower speed grade
device.

Productivity

The increased core performance available with the Intel HyperFlex FPGA
Architecture provides benefits that go beyond simply running the core at a
faster clock rate. The additional performance results in easier timing closure,
improved design team productivity, and shorter time-to-market for the product.

Conclusion

Meeting the needs of next-generation, high-performance designs is a challenge
with conventional FPGA core architectures. The value of techniques such as
retiming, pipelining, and optimization is limited by the architecture itself.
The Intel HyperFlex FPGA Architecture in Intel Stratix 10 devices, with its
“registers everywhere” approach, takes these optimizations to the next level,
resulting in 2X the core performance compared to previous-generation
high-performance FPGAs.
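
As a closing illustration of the optimization step, the restructuring described
in the Hyper-optimization section (Figure 9) can be modeled in software. The
sketch below is a behavioral Python model under stated assumptions, not RTL and
not Intel's implementation. The next-state function long_loop is made up for
the example, and short_loop applies a Shannon decomposition to it: the function
is evaluated for both possible feedback values, and the feedback bit only
selects between the two results.

```python
from itertools import product


def long_loop(state, a, b, c):
    """Hypothetical next-state logic: the feedback bit `state` passes
    through every gate, so the whole expression sits inside the loop."""
    return ((state & a) ^ b) | (c & (1 - state))


def short_loop(state, a, b, c):
    """Shannon decomposition of long_loop around the feedback bit."""
    # Both cofactors depend only on the inputs, not on the feedback, so in
    # hardware they could be computed ahead of time and pipelined.
    if_zero = long_loop(0, a, b, c)
    if_one = long_loop(1, a, b, c)
    # Only this two-way selection remains inside the feedback loop.
    return if_one if state else if_zero


def run(step, inputs, state=0):
    """Feed a sequence of (a, b, c) inputs through a one-bit state machine."""
    trace = []
    for a, b, c in inputs:
        state = step(state, a, b, c)
        trace.append(state)
    return trace


all_inputs = list(product((0, 1), repeat=3))
equivalent = run(long_loop, all_inputs) == run(short_loop, all_inputs)
print(equivalent)
```

The two versions produce the same state sequence. The difference is structural:
in the decomposed form, the only logic that depends on the previous state is the
final selection, which corresponds to the short loop in Figure 9.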

[Figure 10. Balanced and Unbalanced Clock Tree Synthesis. Start with pre-built H-tree templates (balanced full H-tree); the software constructs the network as needed for the design (balanced or unbalanced).]

Where to Get More Information

For more information about Intel and Stratix 10 FPGAs, visit
www.intel.com/content/www/us/en/products/details/fpga/stratix/10.html
• White Paper: A New FPGA Architecture and Leading-Edge FinFET Process Technology
  Promise to Meet Next-Generation System Requirements
• Technical White Paper: Using Intel Quartus Prime Software to Maximize Performance in
  the Intel HyperFlex FPGA Architecture

Intel technologies may require enabled hardware, software or service activation.


No product or component can be absolutely secure.
Your costs and results may vary.
Customer is responsible for safety of the overall system, including compliance with applicable safety-related requirements or standards.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

WP-01231-1.4