
ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / OVERVIEW

Session 2 Overview: Digital Processors


DIGITAL ARCHITECTURES AND SYSTEMS SUBCOMMITTEE

Session Chair: Shidhartha Das, AMD, Cambridge, UK
Session Co-Chair: Ji-Hoon Kim, Ewha Womans University, Korea

In this year’s conference, mainstream processors designed in advanced process technologies share the stage with domain-specific
processors addressing a range of applications, from high-performance general-purpose computing through to genomics. The
session leads off with AMD’s latest-generation “Zen 4” core and MediaTek’s flagship 5G mobile SoC, and features university
researchers demonstrating simulated-annealing processors and data-flow computing SoCs for AR/VR, robotics and next-generation
genomic sequencing.

1:30 PM
2.1 “Zen 4”: The AMD 5nm 5.7GHz x86-64 Microprocessor Core
Benjamin Munger, AMD, Boxborough, MA
In Paper 2.1, AMD highlights a 5.7GHz “Zen 4” 8-core complex fabricated in a 5nm FinFET process, occupying 55mm2. A 13% IPC
improvement is accomplished through architectural enhancements, including increased structure sizes and conflict reduction. This,
coupled with physical design innovations including Vth management, selective path tuning and intelligent power-grid optimization,
drives a 6% reduction in switching capacitance, a 16% increase in frequency, and an FMAX of 5.7GHz. An over-30% increase in iso-
power performance vs. the prior generation in desktop products is demonstrated.

2:00 PM
2.2 A 5G Mobile Gaming-Centric SoC with High-Performance Thermal Management in 4nm FinFET
Bo-Jr Huang, MediaTek, Hsinchu, Taiwan
In Paper 2.2, MediaTek demonstrates a high-performance thermal management system for a 110mm2 5G mobile gaming SoC
featuring a tri-gear CPU and GPU, designed in 4nm. With CPUs running at up to 3.35GHz, a power-predictor-and-calculator system
fed by multiple on-chip sensors, combined with energy/thermal-aware task reallocation, facilitates an average 10°C increase in the
throttling threshold, enabling a record AnTuTu score of 1.146M.




2:30 PM
2.3 Amorphica: 4-Replica 512 Fully Connected Spin 336MHz Metamorphic Annealer with Programmable
Optimization Strategy and Compressed-Spin-Transfer Multi-Chip Extension
Kazushi Kawamura, Tokyo Institute of Technology, Yokohama, Japan
In Paper 2.3, Tokyo Institute of Technology presents a 40nm programmable multi-policy simulated-annealing processor
integrating 4 replicas of 512 fully connected spins, extensible across four chips. The 9mm2 die operates at 336MHz, consuming
151-to-474mW at 1.1V.

3:15 PM
2.4 A Fully Integrated End-to-End Genome Analysis Accelerator for Next-Generation Sequencing
Yen-Lung Chen, National Taiwan University, Taipei, Taiwan
In Paper 2.4, National Taiwan University researchers present a 28nm processor for next-generation genomic sequencing
supporting the end-to-end workflow from short-read mapping through to genotyping. The 16mm2 die, designed in a TSMC 28nm
CMOS process, operates at 400MHz at 0.9V. The chip delivers up to 59× higher throughput and 935-to-4910× higher energy
efficiency compared to state-of-the-art cloud-based solutions.

3:45 PM
2.5 A 28nm 142mW Motion-Control SoC for Autonomous Mobile Robots
I-Ting Lin, National Taiwan University, Taipei, Taiwan
In Paper 2.5, National Taiwan University researchers present a 28nm 4.39mm2 200MHz SoC for autonomous robot control that
incorporates sampling-based motion control, enabling high parallelization. Optimizations such as trajectory pruning and the use
of an acceleration-based model facilitate a ~5kHz maximum control rate, with less than 1.6% tracking error, and an over-350×
improvement in energy efficiency compared to the prior state of the art.

4:15 PM
2.6 VISTA: A 704mW 4K-UHD CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration
Kai-Ping Lin, National Tsing Hua University, Hsinchu, Taiwan
In Paper 2.6, National Tsing Hua University presents a video CNN chip for 4K-UHD imaging/display applications, providing peak
throughput of 60/50fps for spatial/temporal interpolation at 704mW power dissipation. The 40nm 12.6mm2 chip achieves energy
efficiency comparable to prior work, ranging from 4.0-to-6.4TOPS/W, and an area efficiency of 222.2GOPS/mm2, while
supporting the advanced feature of multi-image processing.

4:45 PM
2.7 MetaVRain: A 133mW Real-Time Hyper-Realistic 3D-NeRF Processor with 1D-2D Hybrid-Neural Engines for
Metaverse on Mobile Devices
Donghyeon Han, Korea Advanced Institute of Science and Technology, Daejeon, Korea
In Paper 2.7, KAIST presents MetaVRain, a real-time hyper-realistic 3D-NeRF processor for the metaverse on mobile devices,
which can create 3D models by training a DNN to memorize 3D scene geometry from a few photos. The 28nm chip, integrating
5K FP8-FP16 configurable MACs with 2MB of SRAM, demonstrates a maximum of 118fps and consumes at least 99.95% less
power than modern GPUs and a TPU.



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.1
2.1 “Zen 4”: The AMD 5nm 5.7GHz x86-64 Microprocessor Core

Benjamin Munger1, Kathy Wilcox1, Jeshuah Sniderman1, Chuck Tung1, Brett Johnson2, Russell Schreiber3, Carson Henrion2, Kevin Gillespie1, Tom Burd4, Harry Fair1, David Johnson2, Jonathan White1, Scott McLelland1, Steven Bakke1, Javin Olson1, Ryan McCracken1, Matthew Pickett2, Aaron Horiuchi2, Hien Nguyen1, Tim H Jackson2

1AMD, Boxborough, MA; 2AMD, Fort Collins, CO; 3AMD, Austin, TX; 4AMD, Santa Clara, CA

“Zen 4” is AMD’s next-generation x86-64 microprocessor core, fabricated in a 5nm FinFET process. Close collaboration between the design team and TSMC enabled an optimized process and excellent process scaling relative to the 7nm process used for “Zen 3” [1]. The 55mm2 core complex (CCX), shown in Fig. 2.1.1, contains 6.5B transistors across eight cores, similar to the 8-core CCX in the previous generation. Each core includes a 1MB private L2 cache, double the previous generation, and the eight cores share a 32MB L3 cache. The design also delivers a process-neutral performance increase over “Zen 3”: instructions per cycle (IPC) is increased, the physical design improves process-neutral frequency, and changes are made to drive improved power efficiency, maximizing both single-threaded performance and performance per watt in multi-threaded workloads. Incremental improvements to the core micro-architecture provide a 13% IPC improvement over the previous generation on an average of single-threaded desktop applications, and the “Zen 4” core can operate at up to 5.7GHz, delivering a more than 29% increase generationally in single-threaded performance.

Figure 2.1.2 shows a block diagram of the “Zen 4” architecture. Up to six integer operations can be dispatched, up to three loads and two stores are supported, and the branch prediction accuracy is improved over “Zen 3.” The design also increases buffer sizes throughout the core. Structure size increases include a larger instruction op cache, retire queue, and integer register file. The floating-point register file size is also increased, and power-efficient support for 512b Advanced Vector Extension (AVX-512) floating-point instructions is added using a 256b data path. Level-one dcache bank conflicts are reduced by adding the ability to partially write level-one data-cache entries. Layout optimizations made to standard cells used in the dcache storage array reduce the area cost associated with adding the partial-write capability by more than 20%.

“Zen 4” targeted a technology-process-neutral frequency improvement of 3% and improved performance per watt compared to “Zen 3.” Frequency was improved through a combination of physical design optimization and the limited use of higher-leakage devices having higher performance. Critical timing paths were identified after synthesis and automatic place and route were completed, and subsequently improved through judicious use of the higher-leakage devices. Near-critical paths with potential sensitivity to process variation have been further optimized to improve timing yield. Effective switched capacitance (Cac) is reduced by 6% relative to the “Zen 3” core [1] through a combination of process scaling, improved clock-gating efficiency, and other RTL optimizations to reduce unnecessary net switching. Power optimizations more than offset the Cac increase associated with the higher IPC achieved by “Zen 4”. A breakdown of “Zen 4” Cac is shown in Fig. 2.1.3. Compared to “Zen 3,” flipflop (Flop) Cac proportionally decreases due to extensive use of low-power and multiple-bit latch cells. Clock distribution Cac proportionally increases because higher gate density in 5nm requires a more robust clock mesh to maintain equivalent clock skew. Increased gate density in 5nm also drives an increase in current density, increasing voltage supply droop. Voltage droop is mitigated through the use of multiple power grids with different signal track counts and power grid resistances to allow localized trade-offs between droop improvement and signal route congestion, yielding up to 27% reduction in voltage droop with minimal timing impact.

“Zen 4,” a high-performance x86 CPU core implemented in a 5nm process, is fabricated in an optimized version of TSMC’s 5nm FinFET process [2] with a 15-metal-layer telescoping stack, designed for density on the lower layers and for speed on the upper layers. Design technology co-optimization techniques applied to the power grid design, as well as the standard cell library, enable minimal disruptions in cell placement, allowing for overall area and frequency improvements. Exposing standard cell pins on the local interconnect layer (M0) improves placement density and timing. The design team worked closely with TSMC to enable interconnect optimizations beyond the foundry platform offering to reduce wire capacitance by 4%, resulting in a 1.5% frequency improvement.

Several improvements are also implemented in the “Zen 4” cache hierarchy. The L2 cache capacity is doubled to 1MB at a less-than-two-fold area cost compared to “Zen 3”. Improved efficiency is achieved through a combination of process scaling, an ECC scheme implemented at a 256b granularity, which reduces storage by 4% compared to the 128b “Zen 3” ECC scheme in both the L2 and L3, and a substantially more compact LRU implementation. Two cycles of latency are added to the L2 access to maintain high-frequency operation. The higher L2 hit rate achieved by the larger capacity reduces L3 cache active power by more than 10%. Like previous generations, separate voltage supply rails are used by the L2 SRAMs, which use VDDM, vs. logic, which uses VDD. Frequency-dependent VDDM levels, similar to the P-STATE-dependent VDD levels in previous generations, are added. Finer control over VDDM levels saves active and leakage power without adding area and can extend the VDD operating range by decreasing the maximum difference between VDDM and VDD. The L3 cache density is improved with 0.68× effective area scaling across the caches, datapaths and control logic relative to the previous 7nm generation. A more area-efficient L3 tag design based on a high-density bitcell is used, and careful co-optimization of the L3 tag and surrounding logic further improves area efficiency.

The “Zen 4” core complex die (CCD) chiplet, utilizing AMD’s next-generation AM5 package for desktop CPUs, is comprised of the core complex, a system management unit (SMU), test/debug logic and dual Infinity Fabric On-Package (IFOP) SerDes links. The “Zen 4” CCD, similar to the “Zen 3” CCD, is combined with a 6nm IO die (IOD) to provide cost-effective performance, and the modularity of chiplets enables multiple product configurations. “Zen 4”-based mainstream client products combine the client IOD with one or two “Zen 4” CCD chiplets to cover the spectrum of mainstream products from top-end 16-core performance desktop products to 12-, 8-, and 6-core products. The “Zen 4” L3 supports a second generation of AMD 3D V-cache, which extends the L3 cache from 32MB to 96MB per CCX. The overhead to support the V-cache is through-silicon vias (TSVs) on the CCX die, the area of which was reduced by 40% in the second-generation implementation.

“Zen 4” delivers a substantial performance increase over the previous generation. As shown in Fig. 2.1.4, the design provides up to 16% frequency increase at constant voltage, with improvement across a wide range of voltages, which is achieved through a combination of technology optimization and timing-focused physical design and droop mitigation techniques. IPC is also increased which, combined with the higher frequency, delivers a 30% single-threaded Geekbench score improvement over the previous generation, as shown in Fig. 2.1.5. Given the lower power and higher IPC and frequency, “Zen 4” also delivers performance-per-watt improvements. As shown in Fig. 2.1.6, up to 34% performance improvement at constant CCX power is achieved using a combination of process scaling and power-focused RTL and physical design optimizations.

Acknowledgement:
We would like to thank our talented AMD design teams across Austin, Bangalore, Boston, Fort Collins and Santa Clara who contributed to “Zen 4”.

References:
[1] T. Burd et al., “Zen 3: The AMD 2nd-Generation 7nm x86-64 Microprocessor Core,” ISSCC, pp. 54-55, 2022.
[2] G. Yeap et al., “5nm CMOS Production Technology Platform Featuring Full-Fledged EUV, and High Mobility Channel FinFETs with Densest 0.021μm2 SRAM Cells for Mobile SoC and High Performance Computing Applications,” IEDM, pp. 36.7.1-36.7.4, 2019.
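As a rough illustration of how the reported figures compose (this is a back-of-envelope sketch, not AMD's methodology), the following Python snippet combines the 13% IPC gain, 16% frequency gain and 6% Cac reduction quoted above, assuming single-threaded performance scales with IPC × frequency and dynamic power with Cac × V² × f:

ipc_gain  = 1.13   # +13% IPC (reported)
freq_gain = 1.16   # +16% frequency at constant voltage (reported)
cac_scale = 0.94   # -6% effective switched capacitance (reported)

# Single-threaded performance ~ IPC x frequency (idealized).
perf_gain = ipc_gain * freq_gain
print(f"idealized single-thread performance gain: {perf_gain:.2f}x")   # ~1.31x

# Dynamic power ~ Cac x V^2 x f: at fixed voltage, the Cac reduction partly
# offsets the power cost of the higher clock.
dyn_power_scale = cac_scale * freq_gain
print(f"dynamic-power scaling at constant voltage: {dyn_power_scale:.2f}x")

The ~1.31× product is consistent with the 29-30% single-threaded gain reported in the paper; real silicon behavior (voltage scaling, workload mix, droop) is of course more complex than this multiplication.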




Figure 2.1.1: Die photo of “Zen 4” CCX.
Figure 2.1.2: “Zen 4” architecture.
Figure 2.1.3: Cac breakdown.
Figure 2.1.4: “Zen 4” frequency improvement.
Figure 2.1.5: Desktop performance improvement.
Figure 2.1.6: Average performance vs. power for one eight-core 32M L3 CCX.
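The ECC-granularity change described in Paper 2.1 (256b vs. 128b code words) can be sanity-checked with textbook SECDED check-bit counts; the sketch below is a generic Hamming-plus-parity estimate and not AMD's actual code construction, so the exact 4% figure from the paper will not fall out of it:

def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC code plus one overall parity bit for DED."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1

for granularity in (128, 256):
    r = secded_check_bits(granularity)
    print(f"{granularity}b data: {r} check bits, "
          f"{r / (granularity + r):.1%} storage overhead")
# 128b -> 9 check bits (~6.6%); 256b -> 10 check bits (~3.8%). Coarser code
# words roughly halve the ECC storage overhead, at the cost of correcting
# at most one error per 256b word.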



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.2
2.2 A 5G Mobile Gaming-Centric SoC with High-Performance Thermal Management in 4nm FinFET

Bo-Jr Huang, Alfred Tsai, Lear Hsieh, Kathleen Chang, C.-J. Tsai, Jia-Ming Chen, Eric Jia-Wei Fang, Sung S.-Y. Hsueh, Jack Ciao, Barry Chen, Chuck Chang, Ping Kao, Ericbill Wang, Harry H. Chen, Hugh Mair, Shih-Arn Hwang

MediaTek, Hsinchu, Taiwan

In recent years, mobile gaming has grown rapidly to overtake both console and PC gaming markets globally. Thus far, modern smartphones have been powerful enough to support gaming requirements. However, demands for high performance and computing power cause thermal-management challenges in sustaining performance during gaming. As shown in Fig. 2.2.1, frequency upgrades did not bring commensurate benchmark improvements due to the thermal wall. This work presents a high-performance fully integrated 5G flagship mobile gaming-centric SoC in a 4nm FinFET process. The SoC consists of an octa-core tri-gear 3.35/3.2/1.8GHz CPU and a deca-core 955MHz GPU to provide a high-quality gaming experience. To maintain high-performance operation under heat ramp-up, a thermal management system that adopts threshold-temperature (Tthr) control along with energy/temperature-aware scheduling (E/TAS) is proposed. On-chip sensors which monitor temperature, voltage and leakage current are designed to acquire a run-time power budget for E/TAS to boost performance based on computing demand. The proposed thermal management system raises Tthr by up to 10°C with 2000/°C benchmark score gain, on average, and maintains a stable temperature range with occasional damping well controlled within 5°C while throttling. The gaming phone achieves a score of 1.146M for the AnTuTu v9.4.2 benchmark.

Figure 2.2.1 shows the SoC, featuring an ARMv9 CPU subsystem with a single Cortex-X2 high-performance (HP) core up to 3.35GHz as the first gear, three Cortex-A710 balanced-performance (BP) cores up to 3.2GHz as the second gear and four 1.8GHz Cortex-A510 cores for high efficiency (HE) as the third gear. A Mali-G710 GPU for 3D graphics and an in-house APU for AI processing are integrated. Multimedia with 8K video decoding at 30fps, 4K video encoding at 60fps, a camera with up to 320MPixels and QHD+ video with frame rates of 144Hz are supported. Furthermore, external SDRAM connected through an LPDDR5-6400/LPDDR5X-7500 memory interface has a peak transfer rate of 0.46Tbps. The 5G modem supports NR sub-6GHz with 2.5Gbps upload and 7.01Gbps download speeds.

In a flagship smartphone, the CPU and GPU operate at high speed and typically account for more than 30% of the power of the whole SoC in high-performance benchmarks. Consequently, the critical success factor for maintaining performance is efficient thermal control of the CPU and GPU. Taking the CPU as an example, in Fig. 2.2.2, the power consumed exceeds 10W because of the higher speed and power density as the process shrinks to 4nm. Based on the performance/power-efficiency curve for gaming, the CPU performance needs to be boosted by 7%, causing a 60% power increase. As the measured temperature vs. CPU power shows, 16W power dissipation will cause a 10°C temperature jump (Tjump) in 1ms, and 25°C in 5ms, posing challenges for thermal management. In a typical thermal-control policy, as the system temperature exceeds Tthr, the thermal throttling mechanism slows down the clock for cooling, causing a reduction in performance.

Figure 2.2.3 shows the proposed thermal management system for sustainable gaming performance. The process monitor is implemented based on a ring oscillator (ROSC) structure with configurable cell/wire delay. It reflects process variations to model the minimum system operation voltage (Vsov) [1]. The leakage sensor is composed of tied-off logic to mimic the leakage current (ILKG). With Vsov and ILKG, the power predictor calculates the total power of the system and converts it to the corresponding expected Tjump for Tthr determination. As shown in Fig. 2.2.4, Tjump considers the temperature guard band, which covers temperature overshoot, local temperature fluctuation and the response time of the thermal management system. Tthr is defined for throttling to prevent system failure as the temperature increases beyond the sign-off level. Fig. 2.2.4 shows the total power calculated by the power predictor vs. Vsov for 3K samples. The total power can be mapped to Tthr linearly. Observe that the Tthr of most samples can be increased by more than 10°C compared with conventional global throttling using the worst-case power. With increased Tthr, the throttling mechanism can be postponed to lengthen the duration of high-performance operation.

Based on the run-time scenario (idle, browsing, music, gaming, etc.), the target operating point (OPP) voltage/frequency is set as the initial condition (Fig. 2.2.3), with the expected computing power demand and temperature obtained from the thermal sensor. A frequency-locked loop (FLL), which utilizes a tunable-delay ROSC based on post-silicon binning, further optimizes voltage/frequency under various workloads [2]. Next, based on temperature, voltage and leakage sensor readings, the microprocessor calculates the run-time power and Tjump slew to obtain the sustainable performance time before throttling. Lastly, E/TAS optimizes the thermal gradient by task reallocation among processor cores to further extend the sustainable performance time. Fig. 2.2.5 illustrates the operation of the proposed E/TAS. For program processing, EAS is used to choose the CPU core with maximum available capacity to perform the task. For more precise scheduling considering heat coupling, TAS uses per-core temperature readings to coordinate with EAS for optimizing task assignment, thus minimizing the thermal gradient between CPU cores. Keeping a smaller thermal gradient brings lower leakage and Tjump benefits. A gaming test case shows the measured frequency distributions of HP and HE cores with E/TAS. Observe that, on average, frequency is enhanced by 3% in HP cores and 35% in HE cores.

As the sensed temperature approaches Tthr, the Cooler enables throttling. As shown in Fig. 2.2.6, the Cooler takes the workload prediction with further consideration of the PCB temperature (TPCB) to form a closed-loop smart frame-per-second (fps) control for a better gaming experience. The evaluation test case requires a minimum 75fps and an average 88fps for smooth gaming. Without the smart fps control, an average 88fps can be achieved, but the minimum fps is only 72.6. With the smart fps control, the average fps reaches 89 and the minimum fps is improved to 78.4, ensuring a better gaming experience.

The proposed thermal management system is applied to both the CPU and GPU since they are the critical gaming-performance IPs. Fig. 2.2.6 shows measured real-time temperature under the Geekbench 5 multi-core HDR test. With conventional global throttling, the system temperature fluctuates widely with damping up to 10°C, resulting in unstable performance. With the proposed thermal management system, Tthr is increased by 10°C with an average 2000/°C benchmark score gain, and the system temperature remains stable within a tight range with occasional damping kept within 5°C to sustain performance. For the AnTuTu v9.4.2 benchmark, the SoC achieves a leading score of 1.146M.

The die photograph of the 108.6mm2 SoC is shown in Fig. 2.2.7. In summary, a 5G mobile gaming-centric SoC with high-performance thermal management is demonstrated in 4nm FinFET. Sensor-assisted run-time power budget calculation and E/TAS are adopted in the proposed thermal management system to sustain performance against high-power-induced heat. The achieved Tthr increase contributes to the benchmark performance gain. The gaming phone realizes an AnTuTu score of up to 1.146M.

Acknowledgement:
The authors thank Cheng-Yuh Wu, Chao-Yang Yeh, Joey Lu, Tran Trong Hieu, Wei-Li Liao, Kevin Hung, Yuwen Tsai, and Alex Chiou, MediaTek, Hsinchu, Taiwan, for their support on this work.

References:
[1] B.-J. Huang et al., “An Octa-Core 2.8/2GHz Dual-Gear Sensor-Assisted High-Speed and Power-Efficient CPU in 7nm FinFET 5G Smartphone SoC,” ISSCC, pp. 490-491, 2021.
[2] H. Mair et al., “A 7nm FinFET 2.5GHz/2.0GHz Dual-Gear Octa-Core CPU Subsystem with Power/Performance Enhancements for a Fully Integrated 5G Smartphone SoC,” ISSCC, pp. 50-51, 2020.
[3] V. K. Kalyanam et al., “Thread-Level Power Management for a Current- and Temperature-Limiting System in a 7nm Hexagon™ Processor,” ISSCC, pp. 494-495, 2021.
[4] A. Nayak et al., “A 5nm 3.4GHz Tri-Gear ARMv9 CPU Subsystem in a Fully Integrated 5G Flagship Mobile SoC,” ISSCC, pp. 50-51, 2022.
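A minimal behavioral sketch of the two ideas above — mapping a sensor-based per-sample power estimate to a throttling threshold instead of assuming worst-case power, and E/TAS-style placement that weighs spare capacity against core temperature — is given below. Every function name and coefficient is a made-up placeholder for illustration, not a MediaTek value:

def predicted_power(v_sov, i_leak, c_eff, freq):
    """Total power = dynamic (Ceff*V^2*f) + leakage (V*Ileak); units are arbitrary."""
    return c_eff * v_sov ** 2 * freq + v_sov * i_leak

def thermal_threshold(p_total, t_signoff, k_jump, guard_band):
    """Map predicted power to an expected temperature jump (linear fit, as in
    Fig. 2.2.4) and back off from the sign-off temperature by Tjump + guard band."""
    t_jump = k_jump * p_total
    return t_signoff - t_jump - guard_band

def pick_core(cores):
    """E/TAS-flavored placement: prefer spare capacity, penalize hot cores so the
    thermal gradient between cores stays small."""
    return max(cores, key=lambda c: c["spare_capacity"] - 0.01 * c["temp_c"])

cores = [
    {"name": "HP",  "spare_capacity": 0.30, "temp_c": 78.0},
    {"name": "BP0", "spare_capacity": 0.45, "temp_c": 70.0},
    {"name": "HE0", "spare_capacity": 0.60, "temp_c": 62.0},
]
p = predicted_power(v_sov=0.75, i_leak=0.8, c_eff=2.0, freq=3.2)  # placeholder numbers
print(f"Tthr for this sample: {thermal_threshold(p, 105.0, 0.6, 5.0):.1f} C")
print("next task goes to:", pick_core(cores)["name"])

Per-sample thresholds derived this way sit above the single worst-case threshold for most parts, which is the mechanism behind the >10°C Tthr headroom reported in the paper.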




Figure 2.2.1: Thermal challenges of upgrading specifications for gaming and the 5G SoC block diagram.
Figure 2.2.2: Thermal challenges arising from performance/power increase.
Figure 2.2.3: Proposed thermal management system of the SoC.
Figure 2.2.4: Tthr determination based on power prediction.
Figure 2.2.5: Operation of E/TAS and the CPU frequency distribution with/without E/TAS in a gaming test case.
Figure 2.2.6: Smart FPS control, real-time temperature in benchmark and AnTuTu benchmark score.




Figure 2.2.7: Die photograph.



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.3
2.3 Amorphica: 4-Replica 512 Fully Connected Spin 336MHz Metamorphic Annealer with Programmable Optimization Strategy and Compressed-Spin-Transfer Multi-Chip Extension

Kazushi Kawamura*1, Jaehoon Yu*1, Daiki Okonogi1, Satoru Jimbo1, Genta Inoue1, Akira Hyodo1, Ángel López García-Arias1, Kota Ando2, Bruno Hideki Fukushima-Kimura2, Ryota Yasudo3, Thiem Van Chu1, Masato Motomura1

1Tokyo Institute of Technology, Yokohama, Japan; 2Hokkaido University, Sapporo, Japan; 3Kyoto University, Kyoto, Japan
*Equally Credited Authors (ECA)

Combinatorial optimization (CO) is vital for making wiser decisions and planning in our society. Annealing computation is a promising CO approach derived from an analogy to physical phenomena (Fig. 2.3.1). It represents a CO problem as an energy function, a quadratic form of {1, -1} vectors, where each binary element is called a (pseudo) spin. The spin vector is initialized randomly and is updated stochastically to find minimum energy states by gradually reducing the (pseudo) temperature. Local-connection annealers (quantum [1] and non-quantum [2-4]) have been constrained to spin models having only local inter-spin couplings. This restriction, however, severely limits their CO applications even with the help of clever graph embedding algorithms. Full-connection annealers [5, 6], considered here, have been proposed to address this drawback, permitting handling of arbitrary topologies and densities of inter-spin couplings, even if they are irregular.

The major contributions of this paper are three-fold. 1) We present an RPA (ratio-controlled parallel annealing) policy geared towards improving the convergence of existing SCA (stochastic cellular automata annealing) [6]. 2) Studies that show that there is no one-size-fits-all policy to cover a broad range of real-world CO problems, i.e., SA (simulated annealing, a baseline), DA (digital annealing) [5], SCA [6], or the proposed RPA. This leads to the conclusion that annealers should feature a multi-policy mechanism: metamorphic annealing. 3) We present Amorphica, featuring a parallel spin-flip mechanism that flexibly manages annealing policies and finely controls various parameters according to the target CO problem. Importantly, it is also extendable to a multi-chip full-connection system.

Figure 2.3.1 explains how SA, DA, SCA and RPA work. SA examines only one spin at a time before updating it stochastically. DA inspects all the spins at once, then selects and updates one of the flippable spins. SCA is more aggressive: for speed-up, it simultaneously examines and conducts stochastic updates on all the spins. This mechanism is rationalized by incorporating a regularization term in the SCA energy function [6]. Nonetheless, in some hard-to-optimize settings, our experiments have revealed that SCA does not converge to the minimum energy states (explained later). The RPA addresses this issue by applying a ratio (ε) to the number of flippable spins: ε ranges from 1/N (N is the total number of spins) to 1; in the lower (higher) limit, RPA becomes identical to SA (SCA). RPA extends the flipping options to a continuum, with SA and SCA as extremes.

Amorphica processes N spins (σ vector, Fig. 2.3.2, top-right) as a unit, holding C (# of chips to connect) σ vectors (i.e., a σ matrix) for annealing C×N spins. It maintains R (# of replicas) σ matrices. Annealing is performed for many heuristic trials, exploited for replica parallelism. In the present chip, N=512, C=4, and R=4: the 3D-Matrix σ buffer keeps 8K spins.

Amorphica features a near-memory architecture: N spin-coupling weights (8b each) are read from the weight memory (WMEM, C×N rows) and processed in a column-parallel manner. The per-column local field unit (LFU) accumulates inertia to flip the spin (hi) by partly using FP16 representation for a wide dynamic range (Fig. 2.3.2, bottom left). Then, a spin update is examined stochastically (using an FP16 RNG) in the Δ calculation unit (DCU). The flip flag and the updated spin are called Δ and τ, respectively. This process repeats under the supervision of a custom RISC processor and a triple-loop counter that can set various annealing policies and parameters on the fly (Fig. 2.3.2, middle-right). The zero-run-length decoder (ZRLD) and DMA are essential in multi-chip setups.

Figure 2.3.3 focuses on the multi-policy Δ selector (MPDS), the heart of the metamorphic annealing (see Fig. 2.3.1, bottom), consisting of the cyclic shifter (CS), the policy-dependent masker (PM), and the priority encoder (PrE). In the SA policy, the mask is all 0 except for a single location: since CS rotationally shifts the Δ vector by a random number of bits, an arbitrary Δj is selected. The mask in the DA mode is all 1: PrE determines a top flipped (Δj=1) location from the randomly shifted Δ vector. SCA is different from DA only in the control of PrE: it generates the flipped indices as far as they exist. For the RPA policy, the random mask contains 512×ε 1’s. PrE works exactly the same as SCA. The MPDS datapath, in this way, processes multi-policy annealing in a unified manner by customizing the PM/PrE controls for each policy.

The pipeline diagram (Fig. 2.3.3, bottom) shows how the main datapath executes SA/DA (left) and SCA/RPA (right). SA/DA employs replica parallelism, since there is only one flip per step. On the other hand, the parallel spin flips in SCA/RPA make the pipeline busy even for handling just one replica. The N-way parallel datapath (Fig. 2.3.2) is clock-gated finely for reducing power consumption.

Since a single annealer cannot handle arbitrarily large CO problems, extension to multi-chip is inevitable. Even though a single chip handles a 1/C domain of the whole system (Fig. 2.3.4), full-spin connectivity requires each chip to hold: 1) the coupling weights deep into another chip’s domains and 2) the spin states of all the domains (as is different from [2]). Thus, Amorphica incorporates a spin-exchange mechanism that exchanges Δs with the outside via a zero-run-length (ZRL) encoding/decoding scheme. In a 2K-spin test run on a 4-chip Amorphica system (Fig. 2.3.4, middle-left), we observed that the Δ vectors become more heavily zero-dominant as annealing proceeds. Hence, they become better encoded with longer code length: point (c) in the graph achieves a 94% bandwidth reduction (8b-encoding), for example. The timing chart indicates that both on-chip computation and inter-chip communication decrease as the annealing proceeds (fewer flips occur). Furthermore, the communication overhead is cleanly hidden behind the computation, since double Δ buffers allow production of Δs externally and simultaneous internal consumption.

Figure 2.3.5 explores the behavior of max-cut problems with random {1, 0, -1} weights (top-left). Examining the full range of its weight distribution shows the essential and intriguing nature of CO. Each dot in the chart (top-middle) records the best among the four policies. The p+ and p- axes correspond to the positive and negative weight ratios among all the spin couplings (the p++p-=1 line corresponds to the fully-coupled models). Observe that DA is best in the dense and anti-ferromagnetic region (from an analogy to the physical magnetic system), where negative couplings dominate heavily. As couplings become more inter-mixed, a.k.a. the spin-glass state, RPA works broadly better. All the policies perform equally in the lower-right half, i.e. in the ferromagnetic area. The chart to the right compares only SCA and SA for clarity: in the anti-ferromagnetic region, SCA loses since it does not converge well to minimum energy states, a finding that led us to RPA. The best-ε 3D-chart (top-right) shows ε approaches 0 to assure convergence in the deep anti-ferromagnetic region.

We sampled four configurations (2K-spin) in the max-cut space. Two are known as the K2000 and G22 benchmarks, and the others are intentionally tested from the middle to the dense anti-ferromagnetic region. We measured energy vs. annealing-time curves for the four configurations on the 4-Amorphica system (Fig. 2.3.4) by employing the four policies (Fig. 2.3.5, left). The plots indicate different policy-of-choice as expected: SCA is advantageous in K2000, RPA is better than the others in G22, and 4-replica DA is the best for A2000/B2000 (1-replica DA is inferior). To demonstrate the metamorphic features, we have experimentally reconfigured the policy from RPA to DA in the middle of A2000/B2000. This approach has exhibited the fastest convergence among examined policies, as shown in the plots. Additionally, changing the temperature control from the conventional one to a heat-then-cool approach provides better annealing speed. These results support the potential of metamorphic annealing. We have also compared a GPU (250W-class) and Amorphica (less than 500mW, Fig. 2.3.7) by testing the same problem set, employing the best annealing policies on each (RPA is the best for the GPU). Amorphica has achieved 58× (A2000) to 6.7× (K2000) speed-up, with around 1/500 the power consumption.

Figure 2.3.6 compares Amorphica with prior annealing chips. The distinguishing features of this work are multi-chip extensibility and utilization of replica-parallel and multi-policy annealing. Furthermore, it is superior to both the local-connection [2-4] and the full-connection [6] chips in the number of spin couplings processed and stored on-chip. Fig. 2.3.7 includes a chip microphotograph, a specification table, and frequency and power measurements. The chip has been fabricated in 40nm (LP) technology with an 8Mb SRAM on a 3×3mm2 die. At 1.1V, the clock frequency ranges from 336-to-369MHz. The average power consumption at 0.8V is 44-to-95mW, depending on the policy.

In summary, Amorphica has been designed to solve real-world CO problems, featuring a full-connection architecture, multi-chip extensibility with reduced inter-chip data transfer, and multi-disciplinary policies with flexibly tunable annealing parameters. This work also offers run-time reconfigurability and replica-parallel annealing, crucial features for deploying silicon annealers to practical applications.

Acknowledgement:
This work was partially supported by JST CREST Grant Number JPMJCR18K3, Japan.

References:
[1] C. McGeoch et al., “Advantage Processor Overview,” D-Wave Technical Report Series 14-1058A-A, 2022.
[2] T. Takemoto et al., “A 144Kb Annealing System Composed of 9×16Kb Annealing Processor Chips with Scalable Chip-to-Chip Connections for Large-Scale Combinatorial Optimization Problems,” ISSCC, pp. 64-65, 2021.
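The inter-chip Δ-exchange compression can be illustrated with a small zero-run-length codec; the 8b field width mirrors the “8b-encoding” point quoted above, but the framing and escape handling here are illustrative, not the chip's actual wire format:

import random

def zrl_encode(delta, width=8):
    """Encode a 0/1 flip vector as zero-run lengths, one code per flip."""
    out, run, cap = [], 0, (1 << width) - 1
    for bit in delta:
        if bit:
            out.append(run)
            run = 0
        else:
            run += 1
            if run == cap:          # escape code: a maximal run of zeros, no flip
                out.append(cap)
                run = 0
    return out

def zrl_decode(codes, n, width=8):
    delta, cap = [], (1 << width) - 1
    for c in codes:
        delta.extend([0] * c)
        if c != cap:
            delta.append(1)
    return (delta + [0] * n)[:n]    # pad trailing zeros

# Late in annealing only a few of 2048 spins flip per step.
delta = [0] * 2048
for i in random.sample(range(2048), 16):
    delta[i] = 1
codes = zrl_encode(delta)
assert zrl_decode(codes, len(delta)) == delta
print(f"{len(codes) * 8} coded bits vs. {len(delta)} raw bits")

With ~16 flips spread over 2048 spins the coded stream is roughly an order of magnitude smaller than the raw Δ vector, in line with the ~94% bandwidth reduction observed on the 4-chip system.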




Figure 2.3.1: Combinatorial optimization and its annealing solution. Three major contributions of this paper are also summarized.
Figure 2.3.2: The top-level view of the Amorphica architecture and the LFU-DCU circuitry.
Figure 2.3.3: MPDS circuitry and its multi-annealing-policy mechanism for flipped-spin selection. Example pipeline diagrams for the main datapath are also depicted.
Figure 2.3.4: Multi-chip configuration and the inter-chip bandwidth reduction by the on-chip ZRL decoding.
Figure 2.3.5: Numerical simulations in the max-cut problem space and measurement results on the 4-Amorphica-chip system for four 2K-spin max-cut problems.
Figure 2.3.6: Comparison with recent annealing chips*. This work represents a multi-replica, multi-chip, and multi-policy full-connection annealer.




Additional References:
[3] J. Mu et al., “A 20×28 Spins Hybrid In-Memory Annealing Computer Featuring
Voltage-Mode Analog Spin Operator for Solving Combinatorial Optimization Problems,”
IEEE Symp. VLSI Circuits, 2021.
[4] Y. Su et al., “FlexSpin: A Scalable CMOS Ising Machine with 256 Flexible Spin
Processing Elements for Solving Complex Combinatorial Optimization Problems,”
ISSCC, pp. 274-475, 2022.
[5] M. Aramon et al., “Physics-Inspired Optimization for Quadratic Unconstrained
Problems Using a Digital Annealer,” Frontiers in Physics, vol. 7, no. 48, 2019.
[6] K. Yamamoto et al., “STATICA: A 512-Spin 0.25M-Weight Full-Digital Annealing
Processor with a Near-Memory All-Spin-Updates-at-Once Architecture for
Combinatorial Optimization with Complete Spin-Spin Interactions,” ISSCC, pp. 138-
139, 2020.

Figure 2.3.7: Chip micrograph, specification table and measured results.



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.4
2.4 A Fully Integrated End-to-End Genome Analysis Accelerator for Next-Generation Sequencing

Yen-Lung Chen*1, Chung-Hsuan Yang*1, Yi-Chung Wu*1, Chao-Hsi Lee2, Wen-Ching Chen3, Liang-Yi Lin3, Nian-Shyang Chang3, Chun-Pin Lin3, Chi-Shi Chen3, Jui-Hung Hung2,4, Chia-Hsiang Yang1,2

1National Taiwan University, Taipei, Taiwan; 2GeneASIC Technologies, Hsinchu, Taiwan; 3Taiwan Semiconductor Research Institute, Hsinchu, Taiwan; 4National Yang Ming Chiao Tung University, Hsinchu, Taiwan
*Equally Contributing Authors (ECA)

Next-generation sequencing (NGS) has revolutionized biological sciences and clinical practices. It has become an essential technology for various emerging applications, such as cancer screening and inherited-disease diagnosis. Fig. 2.4.1 shows an overview of an NGS pipeline. An NGS pipeline includes sample preparation, sequencing, data analysis and tertiary analysis. A sequencer first generates a massive amount of DNA segments (short-reads) from samples. Short-reads are used as the inputs for data analysis. The outputs (genetic variants) of the data analysis can then be sent to facilities for further tertiary analysis. The data analysis is very time consuming and has become the bottleneck in the entire NGS pipeline [1]. The high computational complexity comes from hundreds of millions of short-reads for reconstructing a DNA sequence with three billion nucleotides. A complete data analysis workflow includes four steps: short-read mapping, haplotype calling, variant calling and genotyping. Data analysis accelerators have been proposed to reduce the processing time [2][3]. They support the first three steps of the workflow, but genotyping, the dominant step [4], is not supported. Additionally, only single-end-based short-read mapping is adopted in previous works, so the achieved analysis accuracy is limited. This work presents a fully integrated data analysis accelerator that handles the complete analysis workflow. Mapping with paired-end short-reads along with rescue is utilized to enhance the analysis accuracy.

Figure 2.4.2 shows the data analysis workflow, which includes short-read mapping, haplotype calling, variant calling and genotyping. In short-read mapping, short-read pairs are cut into seeds first. The seeds are used to query candidate mapping locations from a pre-built FM-index [2][3]. The candidate mapping locations of both paired short-reads are then found. The seeds are then extended to the short-read length to extract the corresponding fragments from the reference DNA. By aligning the short-reads to the fragments, the similarity scores of candidates can be obtained. After considering both candidate locations and scores from both paired short-reads, the mapping locations can be found. In haplotype calling, all the short-reads in a specific region are identified and associated sub-sequences are extracted. The extracted sub-sequence is called a k-mer, which has a length of k base-pairs (bp). A de Bruijn graph can be generated by recording k-mers and transitions of each k-mer. The haplotype sequences can be obtained by tracing the de Bruijn graph. In variant calling, the variants of the subject are located by aligning haplotype sequences to the reference DNA by performing the Smith-Waterman (SW) algorithm. The scenario where variants appear at the same location is marked as one event. In genotyping, all possible genotypes are found to evaluate their quality scores. A pre-trained pair-hidden Markov model (Pair-HMM) and the Viterbi algorithm are used to generate read-to-haplotype likelihood matrices. Then, the likelihood values of the variants (known as alleles) and the genotype quality scores can be derived.

Figure 2.4.3 shows the system architecture, which includes two Aligner Engines and one Caller Engine. The Aligner Engine performs compute-intensive short-read mapping. The Caller Engine performs haplotype calling, variant calling and genotyping. In the Aligner Engine, a Read Reverser preprocesses paired-end reads. Seed Generators are designed to cut paired-end reads into seeds. Two Exact Matchers perform FM-index queries and find the candidate mapping locations of paired-end reads. A Candidates Pairer filters the candidates by their locations. A Quality Calculator evaluates the mapping quality. A Rescuer is deployed to re-align the paired reads to the locations next to that of the target candidate to reduce the number of unmapped short-reads. A Dynamic Programming Engine includes multiple sets of processing element (PE) arrays that perform parallel sequence alignment. In the Caller Engine, a Parallel k-mer Processing Engine constructs the de Bruijn graph and performs queries in a massively parallel manner. A Variant Discovering Engine locates the variants and executes event determination. A Genotype Likelihood Computing Engine generates read-to-haplotype likelihood matrices and calculates the allele likelihood values for genotype quality scores.

Figure 2.4.4 illustrates the techniques for rapid similarity-score calculation in short-read mapping and its corresponding hardware architecture in the Aligner Engines. In order to evaluate the similarity between two sequences, the SW algorithm is applied. The penalties in similarity score of Match, Mismatch, Gap open, and Gap extension (Sm, Smis, Gopen, and Gext) are introduced in the score matrix. In this example, in which the fragment is the same as the target short-read, the similarity score is updated to the sequence length multiplied by the Match score in (N+N-1) cycles. The deducted score for one mismatch (Dmis) is subtracted from the exactly matched score when there is one different bp in the middle subsequence of the short-read compared to the fragment. Based on our experiments, for similarity scores greater than or equal to the exactly matched score minus the deducted score for one mismatch, the final similarity score of a candidate can be calculated directly by comparing each bp between the two sequences, replacing time-consuming dynamic programming. Dmis is set to 5 as an example. In the comparison of the first 5 bp (head of the sequence) and the last 5 bp (tail of the sequence) in the outward direction, the short-read locations of the first bp difference from the fragments (Phead and Ptail) both contribute to the total deducted score. The number of the different bps between the middle subsequences of the short-read and fragments (Cmid) represents the frequency with which the deducted score for one mismatch should be subtracted from the exactly matched score. The total deducted score can be aggregated accordingly. The hardware architecture in the Sequence Sender includes N bp comparators to compute the total deducted score. For one candidate, the succeeding operations associated with dynamic programming can be skipped without accuracy loss, reducing the latency by 96.6%.

Figure 2.4.5 shows the optimized dataflow and hardware mapping for two compute-intensive steps: haplotype calling and genotyping. In haplotype calling, a Parallel Graph Constructor that comprises 32 Graph Arrays, each containing 128 Graph Cells, is included to maximize the throughput. The latency for building the de Bruijn graph is reduced by 97% when compared to [3]. In genotyping, a Viterbi Decoding Engine is used to compute likelihood matrices and to generate the likelihood values for the succeeding modules: a Max Likelihood Finder, a PL Calculator and a Write Data Controller. The Viterbi Decoding Engine includes four Likelihood Computing Arrays, each composed of Quantized Fixed-Point PEs. The proposed Quantized Fixed-Point PE is tailored based on the Pair-HMM to minimize the bit-widths of both the base quality score and operator, while maintaining precision. The Max Likelihood Finder and PL Calculator compute the final genotype quality scores from the likelihood matrices. Overall, the area of the PE is reduced by 83% compared to a direct-mapped implementation.

Figure 2.4.6 shows the chip performance and comparison table. Fabricated in 28nm CMOS, the chip size is 16.14mm2, and the power consumption is 2.73W at 400MHz from a 0.9V supply. The chip supports paired-end short-read mapping. An FDA dataset (the PrecisionFDA 50× NA12878) is used for benchmarking. This work achieves a precision [sensitivity] of 99.79% [99.03%] in 28.2 minutes on average. For the ASIC solution, [3] only supports single-end short-read mapping and fails to support time-consuming genotyping. This work still achieves a 2× improvement in runtime compared to [3]. End-to-end genome analysis on a high-end server, studied in [5], is selected for performance comparison. The server includes 64 AMD CPU cores, 512GB DRAM and an Illumina DRAGEN3 FPGA acceleration system. The mainstream software packages (BWA/GATK4 from the Broad Institute and DeepVariant (DV) from Google) and the DRAGEN3 pipeline are tested. The precision [sensitivity] of the mainstream software packages and the DRAGEN3 ranges from 98.54% [98.14%] to 99.79% [98.75%]. The runtime for the same dataset varies from 90-to-1680 minutes. This work delivers a comparable precision/sensitivity and achieves a 3-to-59× higher throughput. The proposed end-to-end genome analysis accelerator achieves a 935× [4,910×] higher energy efficiency than DRAGEN3 [BWA/GATK4]. With the initiation of several population-level DNA analysis projects, the demand for massive DNA data analysis is becoming stronger. This work provides a promising solution for such an ambitious goal in an energy-saving way. Fig. 2.4.7 shows the chip micrograph and chip summary.

Acknowledgement:
This work is supported by GeneASIC Technologies and the National Science and Technology Council (NSTC) of Taiwan. The authors also thank Taiwan Semiconductor Research Institute (TSRI) for technical support on chip design and fabrication.

References:
[1] K. R. Franke et al., “Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms,” Genomics Inform, vol. 18, no. 1, pp. 1-9, 2020.
[2] Y.-C. Wu et al., “A 135mW Fully Integrated Data Processor for Next-Generation Sequencing,” ISSCC, pp. 252-253, 2017.
[3] Y.-C. Wu et al., “A Fully Integrated Genetic Variant Discovery SoC for Next-Generation Sequencing,” ISSCC, pp. 322-323, 2020.
[4] A. Xiao et al., “ADS-HCSpark: A Scalable HaplotypeCaller Leveraging Adaptive Data Segmentation to Accelerate Variant Calling on Spark,” BMC Bioinformatics, vol. 20, no. 76, pp. 1-13, 2019.
[5] S. Zhao et al., “Accuracy and Efficiency of Germline Variant Calling Pipelines for Human Genome Data,” Scientific Reports, vol. 10, no. 20222, 2020.
[6] Illumina, Illumina DRAGEN Server v3 Site Prep and Installation Guide, No. 1000000097923 v02, 2020.
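A behavioral sketch of the dynamic-programming bypass described above is shown below; the match score and Dmis value follow the example in the text, and the code is a simplification for illustration, not the Aligner Engine's RTL:

MATCH = 1     # score contribution of one matching bp
DMIS = 5      # deducted score for one mismatch (example value from the text)

def direct_similarity(read, fragment):
    """Return the similarity score if the mismatch-only shortcut applies,
    or None to fall back to full Smith-Waterman dynamic programming."""
    assert len(read) == len(fragment)
    exact = MATCH * len(read)                 # score of a perfect match
    mismatches = sum(r != f for r, f in zip(read, fragment))
    score = exact - DMIS * mismatches
    # The shortcut is only trusted when the score stays at or above the
    # exactly-matched score minus one mismatch deduction.
    return score if score >= exact - DMIS else None

print(direct_similarity("ACGTACGT", "ACGTACGT"))   # 8   (exact match)
print(direct_similarity("ACGTACGT", "ACGAACGT"))   # 3   (one substitution)
print(direct_similarity("ACGTACGT", "TTTTACGT"))   # None -> run full SW

Candidates that clear this test never enter the PE arrays of the Dynamic Programming Engine, which is where the reported 96.6% latency reduction for such candidates comes from.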




Figure 2.4.1: Overview of next-generation sequencing (NGS) data analysis and applications.
Figure 2.4.2: NGS data analysis workflow.
Figure 2.4.3: System architecture of the end-to-end genome analysis accelerator.
Figure 2.4.4: Operation and circuits for rapid similarity calculation in short-read mapping.
Figure 2.4.5: Dataflow and hardware mapping for haplotype calling and genotyping.
Figure 2.4.6: Performance comparison.




Figure 2.4.7: Chip micrograph and summary.
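For readers unfamiliar with the haplotype-calling step of Paper 2.4, the toy sketch below builds a de Bruijn graph from read k-mers and walks it to recover a sequence; real pipelines handle branching, sequencing errors and edge weights, all of which this sketch omits:

from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in the reads."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start, steps):
    """Trace the graph greedily; assumes an unambiguous (non-branching) path."""
    node, seq = start, start
    for _ in range(steps):
        nxt = graph.get(node)
        if not nxt:
            break
        node = nxt[0]
        seq += node[-1]
    return seq

reads = ["ACGTTGCA", "GTTGCATT", "GCATTACC"]
g = de_bruijn(reads, k=4)
print(walk(g, "ACG", steps=20))     # -> ACGTTGCATTACC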



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.5
2.5 A 28nm 142mW Motion-Control SoC for Autonomous Mobile trigonometric functions for the physics model. It also supports the exponential function
Robots for weight evaluation. In a baseline design, idle cycles for data transfer between
submodules in the TU are inevitable because submodules need to access data from the
same buffers. In this work, a programmable forwarding unit transfers the data directly
I-Ting Lin1, Zih-Sing Fu1, Wen-Ching Chen2, Liang-Yi Lin2, Nian-Shyang Chang2, from a submodule to the others, reducing idle cycles. For example, the latency for
Chun-Pin Lin2, Chi-Shi Chen2, Chia-Hsiang Yang1 computing the kinematics model of a robot arm can be reduced by 63%. For TU
scheduling, a typical way is to evaluate the cost of a trajectory after updating all states,
1
National Taiwan University, Taipei, Taiwan which requires storing the entire trajectory. In this work, the TU updates states and
2
Taiwan Semiconductor Research Institute, Hsinchu, Taiwan evaluates costs in an interleaved manner, in which only the latest state is stored, reducing
memory usage by up to 99%. For the GRNG, a conventional implementation is to
Autonomous mobile robots (AMRs) have proven useful for smart factories and have accumulate outputs of several random number generators (RNGs) in fixed-point
potential to revolutionize critical missions, such as disaster rescue [1]. As illustrated in arithmetic. An area-efficient GRNG is proposed by accumulating fewer floating-point
Fig. 2.5.1, AMRs can perceive the environment, plan for assigned tasks and act on the random numbers. The floating-point random numbers are obtained by converting the
plan [2]. Motion control is critical to trajectory adjustment for joint control or intelligent outputs of RNGs with a look-up table. The sample distribution is verified by the Shapiro-
navigation, especially when AMRs are operated in a fast-changing environment. This is Wilk test when using only two FP16, instead of six INT16, RNGs. The proposed GRNG
accomplished through trajectory optimization to refine the robot states using a physics has 59% less area compared to the conventional design, according to the synthesis
model [3]. A command sequence for motion control is generated by taking the next estimates.
possible states into account. The next states of the trajectory are predicted by robots’
dynamics. As one would expect, robots can respond faster and act more resiliently when Figure 2.5.5 shows the NoC for workload balancing, data dispatching and data collection.
the control rate and the number of trajectory time steps increase, respectively. However, Workloads (evaluated by the number of processed trajectories) of PEs may be
there is a fundamental tradeoff between the control rate and the number of trajectory imbalanced because of trajectory pruning, which lowers the average utilization of PEs.
time steps. A motion control accelerator is demonstrated in [4] to improve the motion To address this issue, PEs in the same row/column can transfer the workloads through
control capability. It achieves a 1kHz maximum control rate for up to 30 trajectory time the NoC that consists of an arbiter and trajectory buffer for each row/column. The
steps, but the control rate decreases to <250Hz for the supported maximum of 130 workload-balancing process is as follows: Initially, workloads of PEs are equal. After
trajectory time steps. This limits its applicability in robot applications that demand low trajectory pruning, workloads may become imbalanced and the rollout latency is bounded
response time (with a >1kHz control rate) [5], while maintaining high resiliency (for >130 by the PE with the highest workload. Row-wise balancing is performed first. For each
trajectory time steps). row, the workloads are transferred from the PEs with the high workloads to those with
low workloads. Column-wise balancing is then performed in a similar way. Compared to
The command sequence can be optimized by minimizing the trajectory cost (the sum of a terminal cost and a running cost), given the constraint set of the physics model (as shown in Fig. 2.5.1). The trajectory cost is evaluated from the states and commands at each time step to assess how close the trajectory is to the target. Trajectory optimization can be categorized into two types: gradient-based and sampling-based. The gradient-based method iteratively updates a single command sequence with the gradient of the physics model, as adopted by [4]. In contrast, the sampling-based method evaluates multiple sampled command sequences and combines them with proper weights to find the final sequence [6]. The sampling-based method is amenable to parallel hardware acceleration. In this work, a sampling-based motion-control accelerator is developed to maximize both the control rate and the number of trajectory time steps.

Figure 2.5.2 illustrates the workflow of the sampling-based motion control. The command sequences are generated by sampling Gaussian noise with different random seeds. The generated sequences are then used to update the states to evaluate the trajectory costs. After adjusting the range and scale of all the trajectory costs, the individual trajectory costs are transformed into probability weights of the noise sequences using the softmax function. The weighted sum of the noise sequences is smoothed by a Savitzky-Golay (SG) filter and then added to the initial command sequence to obtain the final one. In this work, two techniques are proposed to minimize the computational complexity. First, trajectory pruning is applied. For high-cost trajectories, the corresponding noise sequences make a negligible (<0.01%) contribution to the weighted sum. To reduce the computations required for such trajectories, a trajectory is discarded when its cost exceeds an adaptive threshold before full evaluation. Trajectory pruning reduces up to 94% of computations with only 1.6% accuracy loss. Second, a physics-model transformation is applied by setting the sequence of joint accelerations (instead of torques) as the command sequence. Therefore, a complex dynamics model can be replaced with a simpler kinematics one. Post-processing is added to transform the updated commands back to output torques. The computational complexity is reduced by 82% via this model transformation. Overall, a 99% complexity reduction is achieved by leveraging the two techniques.
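For clarity, the update described above can be summarized in a minimal software sketch (plain Python/NumPy, in the style of sampling-based MPC); the kinematics model, cost functions, sample count, pruning margin and Savitzky-Golay window below are illustrative placeholders rather than the accelerator's actual fixed-point implementation.

import numpy as np
from scipy.signal import savgol_filter

def sampling_based_update(u_init, x0, step_fn, cost_fn, terminal_cost_fn,
                          K=64, sigma=0.1, lam=1.0, prune_margin=50.0, seed=0):
    # u_init: (T, n_joints) initial command sequence (joint accelerations).
    # Assumes at least one rollout survives pruning.
    T, n_u = u_init.shape
    rng = np.random.default_rng(seed)
    noise = sigma * rng.standard_normal((K, T, n_u))       # K noise sequences (GRNG role)
    costs = np.full(K, np.inf)
    best = np.inf                                          # basis of the adaptive pruning threshold
    for k in range(K):                                     # rollouts (parallelized across PEs on-chip)
        x, c = x0, 0.0
        for t in range(T):
            u = u_init[t] + noise[k, t]
            x = step_fn(x, u)                              # kinematics-model state update
            c += cost_fn(x, u)                             # running cost
            if c > best + prune_margin:                    # trajectory pruning: discard before full evaluation
                c = np.inf
                break
        if np.isfinite(c):
            c += terminal_cost_fn(x)                       # terminal cost
            best = min(best, c)
        costs[k] = c
    finite = np.isfinite(costs)
    w = np.zeros(K)
    w[finite] = np.exp(-(costs[finite] - costs[finite].min()) / lam)   # softmax weights
    w /= w.sum()
    du = np.tensordot(w, noise, axes=1)                    # weighted sum of the noise sequences
    du = savgol_filter(du, window_length=9, polyorder=3, axis=0)       # SG smoothing along time
    return u_init + du                                     # final command sequence

On the SoC, the per-rollout loop roughly corresponds to the work distributed over the 4×4 PE array, and a separate post-processing step converts the accelerations back to output torques.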
Figure 2.5.3 shows the system architecture of the motion control SoC, consisting of a trajectory optimization accelerator, a post-processing unit and an MCU. In the trajectory optimization accelerator, 4×4 processing elements (PEs) compute in parallel to reduce the latency. Each PE contains a trajectory unit (TU), a trajectory pruner (TP) and a Gaussian random number generator (GRNG). The TU updates the states of trajectories and evaluates their costs and weights. The TP prunes trajectories with high cost. The GRNG samples noise to generate command sequences. A network-on-chip (NoC) transfers data across PEs for workload balancing, data dispatching and data collection. The MCU (ARM Cortex-M3 core) is used for system configuration and scheduling.
Figure 2.5.4 shows design techniques and architecture optimizations for the PE. In the
TU, a rotation unit performs 3D vector rotation for coordinate transformation in the
physics model. Two vector units support vector functions in parallel to reduce the latency
of both state update and cost evaluation. A nonlinear function unit computes the nonlinear functions required by the physics model.

Figure 2.5.5 shows the NoC for workload balancing, data dispatching and data collection. Compared to an implementation without workload balancing, the rollout latency is reduced by 56%. For data moved between the PEs and the collection unit/sequence memory, data dispatch and collection are designed to minimize the routing complexity of the interconnect fabric. For data dispatch, the data are transferred to the top row of the NoC and then sent to the PEs through the columns of the NoC. For data collection, the data are transferred from the PEs (selected from four rows individually) to the collection unit through the rows of the NoC. This reduces the routing complexity of the interconnect fabric by 84%.

Fabricated in 28nm CMOS, the motion control SoC integrates 5.3M logic gates in a core area of 3.56mm2. The SoC dissipates 142mW at a clock frequency of 200MHz from a 1.0V supply. The chip achieves a 4.935kHz maximum control rate for 130 trajectory time steps for a 7-DoF robot arm. With a 1kHz control rate, it supports up to 750 trajectory time steps. The chip also delivers a 35Hz/mW [1386Hz/mm2] maximum energy [area] efficiency. Figure 2.5.6 demonstrates the functionality of the chip and gives the performance comparison. The robot arm is able to track the ground-truth trajectory with a <1.6% error for joint control. Compared to the prior art [4], this work achieves a 22× [26×] improvement in the maximum control rate [number of trajectory time steps]. It also delivers 350× [66×] higher energy [area] efficiency at the same technology node. The motion control SoC provides a promising solution for future agile humanoid robots that demand ultra-fast and robust control. Figure 2.5.7 shows the chip micrograph and summary.

Acknowledgement:
This work is supported by the National Science and Technology Council (NSTC) of Taiwan and the Intelligent & Sustainable Medical Electronics Research Fund of National Taiwan University. The authors also thank the Taiwan Semiconductor Research Institute (TSRI) for technical support on chip design and fabrication.

References:
[1] M. Spenko et al., The DARPA Robotics Challenge Finals: Humanoid Robots to the Rescue, Springer Tracts in Advanced Robotics, vol. 121, 2018.
[2] R. Siegwart et al., Introduction to Autonomous Mobile Robots, Cambridge, MA: MIT Press, 2011.
[3] J. Koenemann et al., “Whole-body Model-Predictive Control Applied to the HRP-2 Humanoid,” IROS, pp. 3346-3351, 2015.
[4] S. M. Neuman et al., “Robomorphic Computing: A Design Methodology for Domain-specific Accelerators Parameterized by Robot Morphology,” ASPLOS, pp. 674-686, 2021.
[5] A. Mueller, “Modern Robotics: Mechanics, Planning, and Control,” IEEE Control Systems Magazine, vol. 39, no. 6, pp. 100-102, 2019.
[6] G. Williams et al., “Information-theoretic Model Predictive Control: Theory and Applications to Autonomous Driving,” IEEE Trans. on Robotics, vol. 34, no. 6, pp. 1603-1622, 2018.

Figure 2.5.1: Motion control for autonomous mobile robots.
Figure 2.5.2: Sampling-based motion control workflow and complexity minimization.

Figure 2.5.3: System architecture of the proposed motion control SoC.
Figure 2.5.4: Design techniques and architecture optimizations for PE.

Figure 2.5.5: NoC for workload balancing, data dispatching and data collection.
Figure 2.5.6: Experimental verification and performance comparison.




Figure 2.5.7: Chip micrograph and summary.



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.6
2.6 VISTA: A 704mW 4K-UHD CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration

Kai-Ping Lin, Jia-Han Liu, Jyun-Yi Wu, Hong-Chuan Liao, Chao-Tsung Huang

National Tsing Hua University, Hsinchu, Taiwan

Video convolutional neural networks (CNNs) have achieved great success in high-resolution imaging applications, such as video super-resolution (VSR), and have demonstrated superior quality and temporal consistency by leveraging time information. In particular, as shown in Fig. 2.6.1, video CNNs can also support applications like video-frame interpolation (VFI), which is difficult to achieve with single-image CNNs. Therefore, video CNNs have enormous potential for next-generation imaging/display technology. However, there are three design challenges in inferencing high-throughput video CNNs. Firstly, massive external memory access (EMA) and computation complexity are incurred, since both grow as the number of input frames (N) increases. Secondly, tremendous memory usage of feature maps (FMs) is required to support cross-frame alignment with in-order frame scheduling. Thirdly, supporting deformable convolution (DC) for alignment costs extra line buffers for irregular samples and computation for bilinear interpolation (BI). In this work, we present a video CNN processor supporting diverse-application video CNN inference at 4K-UHD resolution and address the challenges through three key features: 1) a cuboid-based layer-fusion (CBLF) inference flow to reduce EMA and computation complexity; 2) an alignment-aware memory optimization technique to save the FM memory size; 3) a hardware-model co-design of tile-based offset-confined deformable convolution (TODC) to alleviate the overheads of the induced FM line buffers and computation logic for DC.

Figure 2.6.2 shows the proposed CBLF inference flow. Single-image CNN inference can be considered as 3D processing composed of layer depth, feature height and width. Previous works [1-4] show that adopting depth-first layer fusion (LF), i.e. depth-block 3D LF, can significantly avoid EMA of intermediate FMs. However, video CNN inference is 4D processing with an additional time dimension. Directly extending the idea of depth-block LF to 4D processing, namely depth-block-time 4D LF, would sequentially generate image blocks in time t with depth-first computation and then generate an output frame in time t (Ft) integrated with those blocks. This approach, however, re-accesses the overlapped input frames of each neighboring output frame. In contrast, CBLF applies depth-time-block 4D LF to generate cuboid outputs (Cb) and reduces overlapped-input re-access bandwidth. In other words, depth-block-time 4D LF performs inference in a frame-by-frame manner, while CBLF processes in a cuboid-by-cuboid manner. A cuboid output consists of across-time blocks whose receptive fields overlap in the time direction. As a consequence, CBLF allows depth-first computation to generate a first block, e.g. t1b0, and caches across-time overlapped FMs in a temporal reuse buffer (TRB). Then, a following block, e.g. t2b0, can reuse these overlapped FMs to save EMA and computation. Regarding the cuboid length (L), there is a trade-off between input EMA, in terms of the normalized input frame number (NIN, (N+L)/(L+1)), and TRB usage. This work selects L=2, which decreases the NIN by 33-57% with 63KB of TRB usage for N from 2 to 7. As a result, CBLF saves 33-53% of input EMA, further reducing 19-42% of the computations when inferencing 35-layer VSR×4 models at 4K-UHD resolution.
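As a quick sanity check of the quoted NIN reduction, a few lines of Python reproduce the 33-57% figures, assuming NIN takes the form (N+L)/(L+1) (i.e., the N+L distinct input frames of a cuboid are fetched once and shared among its overlapping output frames); the helper name is ours.

# Assumed: a cuboid shares N+L distinct input frames across L+1 output-frame slots,
# so NIN = (N+L)/(L+1); frame-by-frame processing (L=0) refetches all N inputs per output frame.
def nin(N, L):
    return (N + L) / (L + 1)

for N in (2, 7):
    reduction = 1 - nin(N, L=2) / nin(N, L=0)
    print(f"N={N}: NIN reduced by {reduction:.0%}")   # ~33% for N=2, ~57% for N=7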
Figure 2.6.3 illustrates the alignment-aware memory optimization via two techniques: hybrid-workflow processing and reference-frame-first scheduling (RFFS). The first technique considers the overlapped regions of FMs between neighboring blocks in both the along- and across-processing directions (Along-Dir and Across-Dir), which place a heavy demand on on-chip memory. In particular, the amount for the alignment layer is proportional to the factor N×(L+1) due to channel concatenation, which is magnified by video CNNs and CBLF. If they are all cached on-chip as in [2-3], in a two-direction-reused (UU) way, 7428KB of on-chip memory is required. Instead, if we only cache the Along-Dir region and recompute the Across-Dir one as in [4], in a reused-recomputed (UC) way, the on-chip memory can be reduced by 90% to 720KB. In this work, we further decrease the memory size with a hybrid workflow: a two-direction-recomputed (CC) flow for the alignment layer, to eliminate the multiplicative need for on-chip memory, and the UC flow for the rest of the layers, reducing memory usage by 14%. The other technique, RFFS, alleviates the memory traffic by processing-order rearrangement. With naïve scheduling, the input image blocks for one cuboid output would be processed in time order. In this case, deformed FMs (DFMs, Krn) for a later aligned FM (At) need to be stored before generating a former one, e.g. K31, K32, and K33 for A3 before generating A2. In contrast, RFFS processes reference frames first and calculates DFMs in the order of aligned FMs, such that the storage size for DFMs can be minimized. Combined with the hybrid workflow, the FM memory size is finally reduced by 25% compared to the pure UC flow. Fig. 2.6.3 also shows the overall system architecture implementing the proposed techniques; a system control unit performs memory arbitration and system scheduling accordingly. In addition, a ring-tensor convolutional unit is adopted to leverage algebraic sparsity [5], and an interpolation unit implements the BI in TODC for feature alignment.

Figure 2.6.4 shows the hardware-model co-design of TODC. DC can provide 0.23-to-0.35dB of PSNR gain compared to plain networks. However, its unbounded offset fields (OF) entail irregular samples from neighboring FMs (NFMs), thereby demanding an FM line buffer of 12400KB for 4K-UHD resolution. In addition, a large amount of BI is used to warp NFMs with OF, e.g. a 2×2-tile result requires a 6×6-tile DFM with 36 arbitrary OF, resulting in 1379K additional logic gates. To reduce the line buffer and BI computation, CoDeNet [6] limits the OF to integer values within a 7-pixel range for object detection; nevertheless, this constraint causes 0.49-to-0.55dB of PSNR drop for VSR applications. In this work, TODC resolves the memory and computation issues while maintaining image quality for video CNNs. It consists of offset-field confinement (OFC) for limiting the OF range and tile-based interpolation (TBI) for reducing the computation overhead. The OFC bounds the OF range within only one pixel and thus limits the NFM to a 6×6-tile region for a 2×2-tile result, which leads to 4KB of line-buffer usage. The TBI computes a 4×4-tile DFM, instead of a 6×6-tile one, followed by stride-1 CONV3×3 operations and, therefore, reduces the number of BI computations from 36 to 16 for a 2×2-tile result. To keep up with the high-throughput convolution unit, we design an interpolation engine (IE) comprising 192 BI modules, each including 3 multipliers, 7 adders and 4 multiplexers. The implementation shows that TODC substantially reduces the size of the line buffer by nearly 100% (from 12400KB to 4KB) and reduces the gate count of the IE by 56%, with only 0.06-to-0.18dB of PSNR degradation compared to DC.
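To make the OFC/TBI arithmetic concrete, below is a minimal NumPy sketch of the deformed-FM generation for one 2×2-tile result (a single channel, floating-point values); the ±1 confinement range, the 6×6/4×4 tile sizes and the 36-to-16 interpolation count follow the description above, while the array shapes and the function name are illustrative.

import numpy as np

def todc_deform_tile(nfm_tile, offsets):
    # nfm_tile: (6, 6) neighboring-FM region fetched for one 2x2-tile result
    # offsets:  (4, 4, 2) learned offsets for the 4x4-tile DFM, confined by OFC
    offsets = np.clip(offsets, -1.0, 1.0)           # offset-field confinement to +/-1 pixel
    dfm = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            y = i + 1 + offsets[i, j, 0]            # nominal sample sits at (i+1, j+1) in the
            x = j + 1 + offsets[i, j, 1]            # 6x6 tile, so a +/-1 offset never leaves it
            y0 = min(int(np.floor(y)), 4)
            x0 = min(int(np.floor(x)), 4)
            dy, dx = y - y0, x - x0
            top = nfm_tile[y0, x0] + dx * (nfm_tile[y0, x0 + 1] - nfm_tile[y0, x0])
            bot = nfm_tile[y0 + 1, x0] + dx * (nfm_tile[y0 + 1, x0 + 1] - nfm_tile[y0 + 1, x0])
            dfm[i, j] = top + dy * (bot - top)      # factored bilinear form: 3 multiplies per sample
    return dfm                                      # 16 BI samples; a stride-1 CONV3x3 then yields the 2x2 tile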


This chip is fabricated in 40nm CMOS and the measurement results are shown in Fig. 2.6.5. It achieves a peak energy efficiency of 6.4TOPS/W, dissipating 110mW at 0.64V and 50MHz. It reaches a peak frequency of 200MHz with 4.0TOPS/W, dissipating 704mW at 1.06V. Regarding different N for VSR and VFI applications with the same model size at 4K-UHD resolution, there is a tradeoff between energy consumption and image quality. For VSR×4, the chip consumes 11.6mJ/frame for N=3 with 27.9dB of PSNR, and 13.9mJ/frame for N=5 with 0.1dB of PSNR gain. For VFI, it consumes 27.9mJ/frame for N=2 with 30.5dB of PSNR, and dissipates 36.1mJ/frame for N=4 with 0.4dB of PSNR improvement. Qualitative results are shown at the bottom of Fig. 2.6.5. Video CNNs not only provide vivid image quality, but also enhance temporal consistency as N increases.

Figure 2.6.6 shows a comparison table with state-of-the-art computational-imaging CNN accelerator chips [1-4]. Our chip occupies 12.6mm2 with a total of 994KB of on-chip SRAM and computes in 8b dynamic fixed-point precision. The chip achieves comparable energy efficiency, ranging from 4.0-6.4TOPS/W, and an area efficiency of 222.2GOPS/mm2, while supporting the advanced feature of multi-image processing. Also, the required EMA increases modestly with N, and performance is similar to prior work even for VSR, which requires a deep model with many layers. Fig. 2.6.7 shows the chip micrograph and specifications. This work represents an efficient video CNN processor for VSR×2, VSR×4, VFI and video-denoising (VDn) applications with maximum throughputs of 4K-UHD at 30, 60, 50, and 22fps, respectively, which facilitates video CNNs in next-generation imaging/display technology.

Acknowledgement:
The authors would like to thank TSRI, MOST and TSMC for manufacturing and financial support.

References:
[1] J. Lee et al., “A Full HD 60 fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices,” IEEE Symp. VLSI Circuits, pp. C302-C303, 2019.
[2] K. Goetschalckx et al., “DepFiN: A 12nm, 3.8TOPs Depth-First CNN Processor for High Res. Image Processing,” IEEE Symp. VLSI Circuits, 2021.
[3] Z. Li et al., “An 0.92 mJ/frame High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache,” IEEE CICC, 2022.
[4] Y.-C. Ding et al., “A 4.6-8.3 TOPS/W 1.2-4.9 TOPS CNN-based Computational Imaging Processor with Overlapped Stripe Inference Achieving 4K Ultra-HD 30fps,” IEEE ESSCIRC, pp. 81-84, 2022.
[5] C.-T. Huang, “RingCNN: Exploiting Algebraically-Sparse Ring Tensors for Energy-Efficient CNN-Based Computational Imaging,” ACM/IEEE ISCA, pp. 1096-1109, 2021.
[6] Q. Huang et al., “CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs,” ACM/SIGDA FPGA, pp. 206-216, 2021.

Figure 2.6.1: Video convolutional neural networks, design challenges and key features.
Figure 2.6.2: Cuboid-based layer-fusion (CBLF) inference flow.

Figure 2.6.3: Alignment-aware memory optimization and overall system architecture.
Figure 2.6.4: Hardware-model co-design of tile-based offset-confined deformable convolution (TODC).

Figure 2.6.5: Chip measurement and video CNN results of super-resolution and frame interpolation.
Figure 2.6.6: Comparison table with state-of-the-art designs.




Figure 2.6.7: Chip micrograph and specifications.



ISSCC 2023 / SESSION 2 / DIGITAL PROCESSORS / 2.7
2.7 MetaVRain: A 133mW Real-Time Hyper-Realistic 3D-NeRF Processor with 1D-2D Hybrid-Neural Engines for Metaverse on Mobile Devices

Donghyeon Han, Junha Ryu, Sangyeob Kim, Sangjin Kim, Hoi-Jun Yoo

Korea Advanced Institute of Science and Technology, Daejeon, Korea

A neural radiance field (NeRF) [1] uses a deep neural network (DNN) to create 3D models by training the DNN to memorize 3D scene geometry from a few photos. Prior work uses conventional computer graphics algorithms, such as ray-tracing or SLAM, for the same purpose. With NeRF, the generated model can display hyper-realistic 3D content in the metaverse, with quality better than or equal to 3D images rendered by complicated ray-tracing. The 3D model can also be shared with other metaverse users via low-bandwidth communication, because the transaction requires <1MB of parameters. NeRF is promising not only for 3D reconstruction but also for a wide range of applications from depth estimation to 3D style transfer [2]; however, its heavy computational demands stand in the way of its applicability to mobile and wearable devices.

Figure 2.7.1 shows an overview of NeRF and the proposed Bundle-Frame-Familiarity (BuFF) architecture of this paper. In general, NeRF consists of 3 main processes: 1) positional encoding, 2) DNN inference (INF), and 3) volumetric rendering, among which DNN INF accounts for 96.7% of the execution time. In this application, the DNN parameters require only 0.6MB of memory, but compared with ResNet-50, the DNN requires 18000× more operations, resulting in slow rendering speed (0.03fps) even with high-end GPU servers [1]. Moreover, low-power NeRF acceleration is needed for use in mobile devices with limited battery capacity.

In order to realize real-time and energy-efficient 3D rendering on mobile devices, the BuFF architecture is proposed. It consists of 3 visual perception stages: 1) Spatial Attention (SA), 2) Temporal Familiarity (TF), and 3) Top-Down Attention (TDA). The SA stage utilizes an attention map with low-resolution voxels to collect only meaningful samples inside the attention map. The TF stage first transforms RGB-D pixels of the previous frame into RGB-D pixels of the current frame and then, with partial INF, evaluates the familiarity of the two frames by measuring color differences before and after the transformation. A small color difference implies high familiarity, allowing RGB-D pixels of the previous frame to be reused in the current frame's rendering. Conversely, if familiarity is low, a new RGB-D value is computed through a new INF. In this way, more than 95% of DNN INFs can be skipped with <1dB PSNR loss, compared to the 9.7dB loss shown by naïve RGB-D reuse without familiarity consideration. The TDA stage skips the last few layers of the DNN when the density obtained from the prior layers is low (short-term feedback). It also skips computing the remaining samples along a ray when the cumulative density of prior samples is high enough (long-term feedback). This reduces the total number of samples by 95.7% without additional PSNR loss.
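A compact way to see how the three stages interact is the per-pixel flow sketched below (plain Python/NumPy); the thresholds, the precomputed per-sample densities/colors and the warped-color input are illustrative stand-ins for the on-chip partial and full inferences.

import numpy as np

def buff_render_pixel(warped_rgb, probe_rgb, sigmas, rgbs, dt=0.01,
                      familiarity_th=0.02, density_th=1e-3, opacity_th=0.99):
    # warped_rgb: (3,) previous-frame color reprojected into the current view (None if unavailable)
    # probe_rgb:  (3,) color estimate from the cheap partial inference of the current frame
    # sigmas, rgbs: per-sample densities (S,) and colors (S, 3) for samples that fall
    #               inside the low-resolution spatial-attention map (SA stage)
    if warped_rgb is not None and np.abs(probe_rgb - warped_rgb).mean() < familiarity_th:
        return warped_rgb                       # TF stage: familiar pixel, reuse and skip the full INF

    color, trans = np.zeros(3), 1.0
    for sigma, rgb in zip(sigmas, rgbs):        # each iteration corresponds to one sample's DNN INF
        if sigma < density_th:                  # TDA short-term feedback: density is low,
            continue                            # so the remaining (color) layers are skipped
        alpha = 1.0 - np.exp(-sigma * dt)
        color += trans * alpha * rgb            # standard volumetric-rendering accumulation
        trans *= 1.0 - alpha
        if 1.0 - trans > opacity_th:            # TDA long-term feedback: ray is already opaque,
            break                               # skip the remaining samples along the ray
    return color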
The proposed MetaVRain has 3 key hardware blocks: 1) a Visual Perception Core (VPC) to realize the BuFF-based computational cost reduction, 2) 1D-2D Hybrid Neural Engines (HNE) using Dynamic Neural Network Allocation (DNNA) for fast and efficient DNN processing, and 3) a Modulo-based Positional Encoding Unit (Mod-PEU) to minimize the HW cost of the sinusoidal functions necessary for NeRF operation.

Figure 2.7.2 shows the overall chip architecture of MetaVRain. It is composed of 4 HNEs, the VPC, a Centrifugal-Search-based DNNA (CS-DNNA) Core, a global memory, and a top controller. Firstly, the VPC performs the 3 visual perception stages of BuFF to determine the workload of the HNEs after removing useless DNN INFs. The remaining DNN INFs are accelerated by the HNEs, and their tasks are dynamically allocated to the two heterogeneous engines, the 1D and 2D NEs, by the CS-DNNA Core. The 1D NE consists of 8 PE units, and each PE unit receives a single nonzero (NZ) input activation (IA) to compute 32 outputs simultaneously. It maximizes zero skipping but sacrifices data reuse, resulting in significant performance degradation if the sparsity ratio is <50%. On the other hand, the 2D NE maximizes data reuse, making it more favorable for dense inputs. It receives 32 IAs from I/O memory (IOMEM) or the PEU regardless of sparsity and generates 32 outputs using adder trees. All intermediate partial-sums are stored in partial-sum memory (PSMEM) and transferred to IOMEM after post-processing, such as ReLU.

Figure 2.7.3 shows the details of the TF Unit (TFU) and SA Unit (SAU) in the VPC. The TFU consists of a 3D transformation unit, I/O buffers and a tile-wise TF handler (TTH). The TFU projects the previous RGB-D pixels into the current frame via 3D transformation and receives results of partial INF (0.72% of full INF) from the HNE to evaluate whether pixel data can be reused. Since the pixels within a tile show data locality even after 3D transformation, the TFU adopts relative-addressing-based buffering to reduce 97.9% of the external memory accesses (EMA). The TTH selects only the one RGB-D pixel closest to the camera origin using a depth comparator during intra-tile handling. In inter-tile handling, it skips the depth comparison but, by a bitwise ‘AND’ operation, determines overlapping pixels that need a new INF. The TFU reduces EMA by 99.5% and enables the VPC to reuse RGB-D pixels of the previous frame, resulting in a 48.6× speed-up on average. The SAU, which consists of the Sample Coordinate Generator (SCG), Voxel Cache, and Task Controller, generates sample coordinates using the binary map (Bmap) received from the TFU and the accumulated density of the TDA logic as inputs. The Voxel Cache uses the coordinates to fetch new voxels only in the case of cache misses. The voxel is used to evaluate whether the sample is inside the attention map or not, and then the tile is transferred to either the dense or sparse buffer of the Task Controller, depending on the number of valid samples per tile. The Task Controller manages the I/O data for the HNE with out-of-order execution (OoOE). The input controller transfers dense tiles directly to the HNE, but in the case of sparse tiles, it waits until input coordinates are fully accumulated in a buffer. The output controller decodes the HNE output with either the Masking Unit or the Reorder Buffer (ROB) to generate the final RGB-D results. The SAU offers 23.1× higher throughput on average, and as a result, the VPC realizes the BuFF architecture with 1120× faster rendering than vanilla NeRF acceleration [1].

Figure 2.7.4 shows the architecture and operation of the HNE and CS-DNNA Core. The HNE utilizes DNNA to divide DNN channels into two groups according to the sparsity ratio. The 1D NE receives channels where sparsity is higher than a threshold, and the remaining channels are allocated to the 2D NE. The CS-DNNA Core predicts the sparsity of the next layer with CS, which searches the 4 outermost empty pixels of every tile in zigzag order from the center to the periphery of 4 sub-tiles. The selected 4 pixels are used to perform 8-sample INFs by the HNE and generate the CS-Bmap indicating NZ outputs. They are aggregated by ‘OR’ operations to generate a CS bit-mask for every input channel. The CS-DNNA Core prefetches weights from global memory into either the 1D or 2D NE according to the number of zeros in the bit-mask. The efficiency of the HNE can be further improved by both task-offloading and clock-gating (CG). Since the CS-DNNA Core can detect underutilization of the 2D NE in advance, it realizes both 1D-to-2D and 2D-to-1D task-offloading to improve throughput by 14.5%. Furthermore, bit-rotator-based CG circuits in the 1D and 2D NEs use the CS bit-mask to generate CG signals and reduce the dynamic power of PE operation by up to 24.6%. The HNE achieves 3.7× higher throughput than acceleration without exploiting sparsity and at least 2.4× higher energy efficiency compared with using a 1D or 2D NE alone.
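The sparsity-driven channel split can be illustrated in a few lines of Python; here the exact per-channel zero count stands in for the centrifugal-search prediction, and the threshold value is an assumption.

import numpy as np

def allocate_channels(activations, sparsity_th=0.5):
    # activations: (C, H, W) input-activation tile for the next layer
    sparsity = (activations == 0).mean(axis=(1, 2))      # fraction of zeros per channel
    to_1d_ne = np.flatnonzero(sparsity >= sparsity_th)   # sparse channels -> zero-skipping 1D NE
    to_2d_ne = np.flatnonzero(sparsity < sparsity_th)    # dense channels -> reuse-oriented 2D NE
    return to_1d_ne, to_2d_ne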
Figure 2.7.5 shows detailed circuits for the Mod-PEU, which adopts a periodic polynomial approximation of the sinusoidal function. It consists of a Modulo Circuit, a Sign Calculation Circuit (SCC), and a Magnitude Calculation Circuit (MCC). The Modulo Circuit computes two modulo residues using an arithmetic shifter and a 2’s-complement unit. Both the SCC and MCC receive the modulo residues to produce 30 positional encoding values in parallel. The SCC determines their sign values using ‘XOR’ logical operators and 2b adders. The MCC determines exponents and mantissas by multiplying the two modulo residues. Since all 30 values use the same LSB multiplication results, the LSB portion of the multiplier is shared to reduce power by 38.2%. Overall, this approach consumes 95.9% lower power and requires 90.0% smaller area compared with a conventional sinusoidal circuit.
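As a rough functional analogue (not the chip's exact datapath), the sketch below computes NeRF's sinusoidal positional encoding by modulo-based range reduction followed by a simple periodic polynomial; the parabolic approximation, the number of frequencies and the function names are illustrative assumptions.

import numpy as np

def positional_encoding_mod(p, n_freq=15):
    # Returns 2*n_freq values: approximations of sin(2^k*pi*p) and cos(2^k*pi*p), k = 0..n_freq-1.
    def sin_pi(x):                            # approximates sin(pi*x) via periodic reduction
        r = x - np.floor(x)                   # modulo residue: position within a half-period
        sign = 1.0 - 2.0 * (np.floor(x) % 2)  # which half-period -> sign bit (the SCC's role)
        return sign * 4.0 * r * (1.0 - r)     # parabola approximating |sin(pi*r)| (the MCC's role)

    out = []
    for k in range(n_freq):
        x = (2.0 ** k) * p                    # scaling by 2^k is an arithmetic shift in hardware
        out.append(sin_pi(x))                 # ~sin(2^k * pi * p)
        out.append(sin_pi(x + 0.5))           # ~cos(2^k * pi * p), i.e. sin shifted by a quarter period
    return np.array(out)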

Figure 2.7.6 shows measurement results for MetaVRain. When evaluated with both synthetic NeRF/NSVF [3] and a forward-facing dataset [1], MetaVRain shows 4174× higher rendering throughput with <1dB PSNR loss. CS-based CG and the Mod-PEU additionally reduce overall power consumption by 23.7%. As a result, MetaVRain achieves 32.8fps (@50MHz) and 61.9fps (@100MHz) real-time 3D rendering, dissipating 135mW (@50MHz) and 310mW (@100MHz). This is at least 99.95% lower power compared with modern GPUs or a TPU.

MetaVRain is fabricated in 28nm CMOS technology, integrating 5K FP8-FP16 configurable MACs with 2MB of SRAM. Its power and throughput can be varied dynamically by selecting operating modes through SW programming, and there are two typical operating modes: power-efficient and high-speed. In power-efficient mode, the chip consumes 133mW while maintaining >30fps. In high-speed mode, it can achieve a maximum of 118fps with 899mW power consumption. In summary, we present an efficient NeRF architecture, BuFF, and we design and fabricate MetaVRain with the proposed BuFF architecture for low-power and real-time 3D rendering. Measurement results of MetaVRain successfully demonstrate 911× faster rendering and 26400× lower energy consumption vs. a GPU. MetaVRain realizes not only real-time 3D rendering [1] but also hyper-realistic 3D model editing [2, 4], at the same quality as ray-tracing-based rendering, enabling the metaverse on mobile devices.

References:
[1] B. Mildenhall et al., “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” European Conf. on Computer Vision, pp. 405-421, 2020.
[2] K. Zhang et al., “ARF: Artistic Radiance Fields,” European Conf. on Computer Vision, pp. 717-733, 2022.
[3] L. Liu et al., “Neural Sparse Voxel Fields,” Conf. on Neural Information Processing Systems, pp. 15651-15663, 2020.
[4] R. Martin-Brualla et al., “NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections,” Computer Vision and Pattern Recognition, pp. 7210-7219, 2021.

Figure 2.7.1: Overview of neural radiance fields (NeRF) and the proposed Bundle-Frame-Familiarity (BuFF) architecture.
Figure 2.7.2: Overall chip architecture.

Figure 2.7.3: Details of visual perception core (VPC) with temporal familiarity unit (TFU) and spatial attention unit (SAU).
Figure 2.7.4: 1D and 2D hybrid neural engine (HNE) with centrifugal search (CS)-based dynamic neural network allocation (DNNA) core.

Figure 2.7.5: Modulo-based positional encoding unit (Mod-PEU) adopting periodic polynomial approximation of sinusoidal function.
Figure 2.7.6: Measurement results and performance comparison table.




Figure 2.7.7: Chip micrograph and performance summary.

