10 Hymel Conger Abstract
Abstract: Recent advances in Xilinx's FPGA hardware and commercial software design tools, spurred in large part by the DOD's Joint Tactical Radio System initiative, offer the possibility of incorporating dynamic partial reconfiguration (PR) into high-performance embedded systems outside of academic research laboratories. PR can provide the flexibility and run-time reconfigurability that no pure hardware or software solution can offer. By multiplexing the hardware resources of a single programmable device with time-independent tasks, a common architecture in DOD systems, one FPGA can handle the same processing workload as a multi-device equivalent. This paper analyzes the performance impact of using PR to perform remote updating, an important capability often used in embedded applications.
I. INTRODUCTION
Generically, an SRAM-based FPGA is a multiprocessing device in that multiple, user-defined hardware modules can operate in parallel and independently within the same chip. One of the great advantages of such a device is the ability to modify its configuration memory easily and at any time. PR enhances this paradigm by reconfiguring only a portion of the chip's configuration memory, allowing the user to load and unload these functional hardware modules without interrupting or resetting the rest of the device. Despite this advantage, commercial interest in PR has never materialized, due mainly to a lack of supporting software tools and merciless design flows. Nevertheless, different academic approaches have been developed to incorporate PR into embedded systems using the Virtex-II FPGA [1-2]. Recently, however, the release of the Virtex-4 and Virtex-5 series of FPGAs, with their tile-based frame architectures, coupled with the lucrative software-defined radio market, has pushed Xilinx to engineer a workable PR design flow [3]. While still unreleased to the general public, the new design flow eliminates many of the burdensome requirements put in place by the previous flow [4] and now supports the Virtex-4 (though not yet the Virtex-5). Unfortunately, due to the relatively recent unveiling of this new design flow, as well as the still restricted nature of its release, there exists a vacuum in research and results exploring high-performance PR systems targeting these new devices. In response, we present a study of the performance impact (timing, resource utilization, and other metrics) of the new design flow when targeting Virtex-4 FPGAs, with remote updating, an important usage of PR, as a platform for analysis.
II. TARGET APPLICATION
Although commercial FPGAs have enjoyed great success as development and testing platforms, their use in embedded
continue uninterrupted during partial device reconfiguration, automatically maintaining state information. The remainder of this paper analyzes the performance impact of incorporating remote updating into three permutations of a generic PR architecture targeting an XC4VLX25 FPGA.
III. EXPERIMENTAL ARCHITECTURES
In order to facilitate PR in real hardware with a commercially available design flow, key design issues and trade-offs must be addressed, including the number of partially reconfigurable regions (PRRs); the PRR shape, size, and placement; the PRRs' access to the global clock network and I/O pads; and the communication interface among the different PRRs and the static portion of the design. A complete description of each experimental study will appear in the full presentation, while a condensed version appears here.
Each design permutation contains a static communication and configuration controller, as well as a different number of PRRs, ranging from one PRR of maximal size, to two side-by-side PRRs, to four PRRs arranged in a 2x2 fashion. Each of the regions has a generic black-box, top-level interface. The advantage of such an approach is that a designer can use any high- or low-level tool to synthesize the PRR, so long as the top-level interfaces match. The designer then need only run an existing script that automatically handles the details of the PR design flow to generate the partial bitstreams.
We evaluated each design permutation using different high-performance computing cores, including Radix-4 FFT, AES, ARM7 soft-core processing, and others. We measured each design's minimum clock period twice: once when the design operated without any PR modifications and once after plugging it into the experimental PR architecture. We also measured the size of the programming bitstream twice in the same fashion.
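The paired measurements described above reduce to a signed percent change relative to the non-PR baseline. The following is a minimal sketch of that arithmetic, not the authors' tooling: the `Measurement` type, the helper names, and every number in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One core measured in one configuration (all values hypothetical)."""
    min_clock_period_ns: float  # minimum clock period at which the design runs
    bitstream_bytes: int        # size of the programming bitstream


def max_freq_mhz(m: Measurement) -> float:
    """Maximum clock frequency implied by the minimum clock period."""
    return 1000.0 / m.min_clock_period_ns


def percent_change(baseline: float, with_pr: float) -> float:
    """Signed change relative to the non-PR baseline, in percent."""
    return 100.0 * (with_pr - baseline) / baseline


# Hypothetical numbers for a single core; these are NOT results from the paper.
baseline = Measurement(min_clock_period_ns=8.0, bitstream_bytes=1_000_000)
with_pr = Measurement(min_clock_period_ns=10.0, bitstream_bytes=400_000)

freq_change = percent_change(max_freq_mhz(baseline), max_freq_mhz(with_pr))
size_change = percent_change(baseline.bitstream_bytes, with_pr.bitstream_bytes)

print(f"max frequency:  {freq_change:+.1f}%")
print(f"bitstream size: {size_change:+.1f}%")
```

A negative value means the PR version is slower or smaller than the baseline. Averaging these per-core values over all cores in a design permutation would give one bar per metric for that permutation.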
[Figure 1: Measured Effects of PR vs. non-PR Baseline. Y-axis: % change from non-PR baseline (0 to 40). Categories: Bitstream Reduction, Overhead, Max. Freq. Reduction, Max. Freq. Reduction (<100 MHz). Series: 1 PRR, 2 PRRs, 4 PRRs.]
Figure 1 displays a set of average measured PR performance effects, including the bitstream size reduction, the PR overhead of each design, and the decrease in maximum clock frequency due to PR. The PR overhead consists of resources that the FPGA uses to facilitate the design flow (e.g., bus macros) but that do not contribute to processing. The clock frequency numbers are split into two categories, one for all designs and one for designs that originally operated at less than 100 MHz. The discrepancy is due to a single enable net in the static region whose purpose is to put the PRRs into a known state during reconfiguration. This net is most often the critical path for designs over 100 MHz due to its length and fanout. In absolute terms, the results averaged across all design permutations are -162 KB, +727 slices, -57.6 MHz, and -8.09 MHz for bitstream size, overhead, maximum clock frequency (all designs), and maximum clock frequency (designs under 100 MHz), respectively. In addition, the relative percentages should remain constant across different device sizes. The full presentation will include a detailed breakdown of these results.
IV. CONCLUSIONS
The use of partial reconfiguration in conjunction with commercial FPGAs and software tools can provide a reliable, resource-saving, and flexible means for updating the processing load of a deployed programmable device. By time-multiplexing the device, the designer has, in effect, an FPGA that contains more resources than are actually physically present, providing multiprocessing across both time and space. This method reduces not only the reconfiguration time but also the amount of bitstream data. Furthermore, using a generic architecture simplifies the design flow at the hardware level to allow rapid system development by designers untrained in the nuances of PR. These factors are especially important in DOD systems, as the generic hardware can be qualified to the necessary environmental standards and then reused in other platforms without knowledge of the low-level details.
Future directions for this work include exploring full partial reconfiguration. As Virtex-4 devices contain two separate ICAP primitives, we have the ability to reconfigure the reconfiguration engine itself by switching configuration control between different regions. Doing so would allow us to update the previously static controller, e.g., to change the encryption standard or the communication protocol it uses.
V. ACKNOWLEDGEMENTS
This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. The authors gratefully acknowledge tools and equipment provided by Sandia National Laboratories and Xilinx that helped make this work possible.
VI. REFERENCES
[1] M. Ullmann, B. Grimm, M. Hübner, and J. Becker, "An FPGA Run-Time System for Dynamical On-Demand Reconfiguration," Proc. IEEE Parallel and Distributed Processing Symposium, Santa Fe, NM, Apr. 26-30, 2004.
[2] M. Hübner and J. Becker, "Exploiting Dynamic and Partial Reconfiguration for FPGAs: Toolflow, Architecture, and System Integration," Proc. 19th SBCCI Symp. on Integrated Circuits and Systems Design, Ouro Preto, Brazil, 2006.
[3] "Early Access Partial Reconfiguration User Guide," UG208 (v1.1), Xilinx Inc., Mar. 6, 2006.
[4] "Two Flows for Partial Reconfiguration: Module Based or Difference Based," XAPP290 (v1.2), Xilinx Inc., Sept. 9, 2004.