Session 2 Digital Processor
In this year’s conference, mainstream processors designed in advanced process technologies share the stage with domain-specific
processors to address a range of applications from high-performance general-purpose computing through to genomics. The
session leads off with AMD’s latest-generation “Zen 4” core and MediaTek’s flagship 5G mobile SoC, followed by university researchers
demonstrating simulated-annealing processors and data-flow computing SoCs for AR/VR, robotics and next-generation genomic
sequencing.
1:30 PM
2.1 “Zen 4”: The AMD 5nm 5.7GHz x86-64 Microprocessor Core
Benjamin Munger, AMD, Boxborough, MA
In Paper 2.1, AMD highlights a 5.7GHz “Zen 4” 8-core complex fabricated in a 5nm FinFET process, occupying 55mm². A 13% IPC
improvement is accomplished through architectural enhancements including increased structure sizes and conflict reduction. This,
coupled with physical design innovations including Vth management, selective path tuning and intelligent power grid optimization,
drives a 6% reduction in switching capacitance, a 16% increase in frequency, and an FMAX of 5.7GHz. An increase of over 30% in
iso-power performance vs. the prior generation in desktop products is demonstrated.
2:00 PM
2.2 A 5G Mobile Gaming-Centric SoC with High-Performance Thermal Management in 4nm FinFET
Bo-Jr Huang, MediaTek, Hsinchu, Taiwan
In Paper 2.2, MediaTek demonstrates a high-performance thermal management system for a 110mm² 5G mobile gaming SoC
featuring a tri-gear CPU with GPU, designed in 4nm. With CPUs running up to 3.35GHz, a power-predictor and calculator system
fed by multiple on-chip sensors, combined with energy-thermal-aware task reallocation, facilitates an average 10°C increase in
throttling threshold, enabling a record AnTuTu score of 1.146M.
2:30 PM
2.3 Amorphica: 4-Replica 512 Fully Connected Spin 336MHz Metamorphic Annealer with Programmable
Optimization Strategy and Compressed-Spin-Transfer Multi-Chip Extension
Kazushi Kawamura, Tokyo Institute of Technology, Yokohama, Japan
In Paper 2.3, Tokyo Institute of Technology presents a 40nm programmable multi-policy simulated-annealing processor
integrating 4-replica 512 fully connected spins, extensible across 4 chips. The 9mm² die operates at 336MHz, consuming
151-to-474mW at 1.1V.
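This overview does not describe the chip's programmable annealing policies, so the following is only a generic illustration of the underlying idea: a single-spin-flip simulated annealer for max-cut on a fully connected graph. It is a minimal sketch under stated assumptions. The function names, the geometric cooling schedule and the default parameters are all hypothetical and are not taken from Paper 2.3.

```python
import math
import random


def cut_value(weights, spins):
    """Total weight of edges whose two endpoints carry opposite spins."""
    n = len(spins)
    return sum(weights[i][j]
               for i in range(n) for j in range(i + 1, n)
               if spins[i] != spins[j])


def anneal_max_cut(weights, steps=5000, t_start=2.0, t_end=0.05, seed=0):
    """Generic simulated annealing (not the Amorphica policies).

    weights: symmetric matrix with a zero diagonal (fully connected graph).
    Returns (best cut found, spin assignment that achieves it).
    """
    rng = random.Random(seed)
    n = len(weights)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    # field[i] = sum_j w_ij * s_j; flipping spin i changes the cut by s_i * field[i].
    field = [sum(weights[i][j] * spins[j] for j in range(n) if j != i)
             for i in range(n)]
    cut = cut_value(weights, spins)
    best_cut, best_spins = cut, spins[:]
    for step in range(steps):
        # Assumed geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        gain = spins[i] * field[i]
        # Always accept improving flips; accept worsening ones with Boltzmann probability.
        if gain >= 0 or rng.random() < math.exp(gain / temp):
            spins[i] = -spins[i]
            for j in range(n):
                if j != i:
                    field[j] += 2 * weights[j][i] * spins[i]
            cut += gain
            if cut > best_cut:
                best_cut, best_spins = cut, spins[:]
    return best_cut, best_spins
```

Keeping the per-spin local field up to date makes each flip decision a constant-time lookup, at the price of an O(n) update after every accepted flip.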
3:15 PM
2.4 A Fully Integrated End-to-End Genome Analysis Accelerator for Next-Generation Sequencing
Yen-Lung Chen, National Taiwan University, Taipei, Taiwan
In Paper 2.4, National Taiwan University researchers present a processor for next-generation genomic sequencing
supporting an end-to-end workflow from short-read mapping through to genotyping. The 16mm² die is designed in a TSMC 28nm
CMOS process and operates at 400MHz at 0.9V. The chip delivers 59× higher throughput and 935-to-4910× higher energy
efficiency compared to state-of-the-art cloud-based solutions.
3:45 PM
2.5 A 28nm 142mW Motion-Control SoC for Autonomous Mobile Robots
I-Ting Lin, National Taiwan University, Taipei, Taiwan
In Paper 2.5, National Taiwan University researchers present a 28nm 4.39mm² 200MHz SoC for autonomous robot control that
incorporates sampling-based motor control, which enables high parallelization. Optimizations such as trajectory pruning and use
of an acceleration-based model facilitate a maximum control rate of ~5kHz, with less than 1.6% tracking error, and over 350×
improvement in energy efficiency compared to the prior state of the art.
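The summary above names sampling-based control, trajectory pruning and an acceleration-based model without giving details. The sketch below shows one generic way these ideas can fit together for a 1D point mass: candidate acceleration sequences are sampled, rolled out through a double-integrator model, and abandoned early once their partial cost exceeds the best cost so far. Everything here (function names, cost function, early-exit pruning rule, parameter values) is an assumption for illustration and is not the method of Paper 2.5.

```python
import random


def rollout_cost(pos, vel, accels, target, dt, bound=float("inf")):
    """Squared tracking error of one candidate under an acceleration-input model.

    Returns None if the partial cost exceeds `bound` (assumed pruning rule).
    """
    cost = 0.0
    for a in accels:
        vel += a * dt          # acceleration drives velocity
        pos += vel * dt        # velocity drives position
        cost += (pos - target) ** 2
        if cost > bound:
            return None        # pruned: cannot beat the current best
    return cost


def sample_control(pos, vel, target, horizon=10, samples=64,
                   a_max=2.0, dt=0.05, seed=0):
    """Pick the first acceleration of the cheapest sampled sequence."""
    rng = random.Random(seed)
    # Start from the zero-acceleration sequence so the result is never worse than coasting.
    best_seq = [0.0] * horizon
    best_cost = rollout_cost(pos, vel, best_seq, target, dt)
    for _ in range(samples):
        seq = [rng.uniform(-a_max, a_max) for _ in range(horizon)]
        cost = rollout_cost(pos, vel, seq, target, dt, bound=best_cost)
        if cost is not None and cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0], best_cost
```

Each candidate rollout depends only on the current state, which is why this family of controllers parallelizes well. In this sequential sketch the pruning bound is shared across candidates, so a parallel version would need a different bound-sharing scheme.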
4:15 PM
2.6 VISTA: A 704mW 4K-UHD CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration
Kai-Ping Lin, National Tsing Hua University, Hsinchu, Taiwan
In Paper 2.6, National Tsing Hua University presents a video CNN chip for 4K-UHD imaging/display applications, providing peak
throughput of 60/50fps for spatial/temporal interpolation with 704mW power dissipation. The 40nm 12.6mm² chip achieves an
energy efficiency of 4.0-to-6.4TOPS/W, comparable to prior work, and an area efficiency of 222.2GOPS/mm², while
supporting the advanced feature of multi-image processing.
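As a quick consistency check on the figures quoted for Paper 2.6, the short calculation below multiplies the area efficiency by the die area to estimate peak throughput, then divides by the quoted power. It assumes that area efficiency is defined as peak throughput over the full die area and that 704mW is drawn at that same operating point. The summary states neither, so the result is an estimate only.

```python
# Figures quoted in the Paper 2.6 summary.
DIE_AREA_MM2 = 12.6
AREA_EFF_GOPS_PER_MM2 = 222.2
POWER_W = 0.704


def implied_peak_gops(area_mm2, gops_per_mm2):
    """Peak throughput implied by area efficiency (assumed definition)."""
    return area_mm2 * gops_per_mm2


def implied_tops_per_watt(peak_gops, power_w):
    """Energy efficiency in TOPS/W, assuming power is measured at peak throughput."""
    return peak_gops / 1000.0 / power_w


peak = implied_peak_gops(DIE_AREA_MM2, AREA_EFF_GOPS_PER_MM2)
efficiency = implied_tops_per_watt(peak, POWER_W)
print(f"implied peak throughput: {peak:.1f} GOPS")
print(f"implied energy efficiency: {efficiency:.2f} TOPS/W")
```

Under these assumptions the implied efficiency is just under 4 TOPS/W, which lines up with the lower end of the 4.0-6.4TOPS/W range quoted in the summary.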
4:45 PM
2.7 MetaVRain: A 133mW Real-Time Hyper-Realistic 3D-NeRF Processor with 1D-2D Hybrid-Neural Engines for
Metaverse on Mobile Devices
Donghyeon Han, Korea Advanced Institute of Science and Technology, Daejeon, Korea
In Paper 2.7, KAIST presents a real-time hyper-realistic-3D-NeRF processor, MetaVRain, for metaverse on mobile devices,
which can create 3D models by training a DNN to memorize 3D scene geometry from a few photos. The 28nm chip, integrating
5K FP8-FP16 configurable MACs with 2MB of SRAM, demonstrates a maximum of 118fps, and consumes at least 99.95% lower
power compared with modern GPUs and a TPU.
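For readers unfamiliar with NeRF, the sketch below shows the standard volume-rendering step that turns per-sample network outputs (density and color along a camera ray) into one pixel. This is the textbook formulation, not the MetaVRain datapath. The function name and argument layout are hypothetical.

```python
import math


def composite_ray(densities, colors, deltas):
    """Front-to-back alpha compositing of samples along one ray.

    densities: volume density sigma_i at each sample
    colors:    (r, g, b) predicted at each sample
    deltas:    distance between consecutive samples
    Returns (pixel color, remaining transmittance).
    """
    transmittance = 1.0            # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

A ray passing through empty space (all densities zero) leaves the pixel black with transmittance 1.0, while a very dense first sample fully determines the pixel color.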
“Zen 4,” a high-performance x86 CPU core, is fabricated
in an optimized version of TSMC’s 5nm FinFET process [2] with a 15-metal-layer
telescoping stack, designed for density on the lower layers and for speed on the upper
layers. Design technology co-optimization techniques applied to the power grid design,
as well as the standard cell library, enable minimal disruptions in cell placement, allowing
for overall area and frequency improvements. Exposing standard cell pins on the local
interconnect layer (M0) improves placement density and timing. The design team worked
closely with TSMC to enable interconnect optimizations beyond the foundry platform
offering to reduce wire capacitance by 4%, resulting in a 1.5% frequency improvement.
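The last figures above (4% less wire capacitance, 1.5% more frequency) can be related through a simple first-order model. Assume that a fraction x of the critical-path delay scales linearly with wire capacitance and the remainder is unaffected, so that new delay = old delay × (1 − x × 0.04), with frequency the inverse of delay. This model is an assumption for illustration and is not from the paper. Solving for x gives a rough sense of how wire-dominated the critical paths would have to be.

```python
def implied_wire_delay_fraction(cap_reduction, freq_gain):
    """Fraction of critical-path delay that scales with wire capacitance.

    Assumed model (not from the source):
        delay_new / delay_old = 1 - x * cap_reduction
        freq_new / freq_old   = 1 + freq_gain = delay_old / delay_new
    """
    delay_ratio = 1.0 / (1.0 + freq_gain)
    return (1.0 - delay_ratio) / cap_reduction


x = implied_wire_delay_fraction(cap_reduction=0.04, freq_gain=0.015)
print(f"implied wire-dependent delay fraction: {x:.2f}")
```

Under this model roughly a third of the critical-path delay would be wire-capacitance dependent. Real paths also include wire resistance and gate loading effects that the model ignores.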
Figure 2.1.1: Die photo of “Zen 4” CCX.
Figure 2.1.2: “Zen 4” architecture.
Figure 2.1.5: Desktop performance improvement.
Figure 2.1.6: Average performance vs. power for one eight-core 32M L3 CCX.
Figure 2.2.1: Thermal challenges of upgrading specifications for gaming and the 5G SoC block diagram.
Figure 2.2.2: Thermal challenges arising from performance/power increase.
Figure 2.2.3: Proposed thermal management system of the SoC.
Figure 2.2.4: Tthr determination based on power prediction.
Figure 2.2.5: Operation of E/TAS and the CPU frequency distribution wi./wo. E/TAS in a gaming test case.
Figure 2.2.6: Smart FPS control, real time temperature in benchmark and AnTuTu benchmark score.
Figure 2.3.1: Combinatorial optimization and its annealing solution. Three major contributions of this paper are also summarized.
Figure 2.3.2: The top-level view of the Amorphica architecture and the LFU-DCU circuitry.
Figure 2.3.3: MPDS circuitry and its multi-annealing policy mechanism for flipped-spin selection. Example pipeline diagrams for the main datapath are also depicted.
Figure 2.3.4: Multi-chip configuration and the inter-chip bandwidth reduction by the on-chip ZRL decoding.
Figure 2.3.5: Numerical simulations in the max-cut problem space and measurement results on the 4-Amorphica-chip system for four 2K-spin max-cut problems.
Figure 2.3.6: Comparison with recent annealing chips*. This work represents a multi-replica, multi-chip, and multi-policy full-connection annealer.
Additional References:
[3] J. Mu et al., “A 20×28 Spins Hybrid In-Memory Annealing Computer Featuring
Voltage-Mode Analog Spin Operator for Solving Combinatorial Optimization Problems,”
IEEE Symp. VLSI Circuits, 2021.
[4] Y. Su et al., “FlexSpin: A Scalable CMOS Ising Machine with 256 Flexible Spin
Processing Elements for Solving Complex Combinatorial Optimization Problems,”
ISSCC, pp. 274-475, 2022.
[5] M. Aramon et al., “Physics-Inspired Optimization for Quadratic Unconstrained
Problems Using a Digital Annealer,” Frontiers in Physics, vol. 7, no. 48, 2019.
[6] K. Yamamoto et al., “STATICA: A 512-Spin 0.25M-Weight Full-Digital Annealing
Processor with a Near-Memory All-Spin-Updates-at-Once Architecture for
Combinatorial Optimization with Complete Spin-Spin Interactions,” ISSCC, pp. 138-
139, 2020.
Figure 2.4.3 shows the system architecture, which includes two Aligner Engines and one Caller Engine. The Aligner Engine
performs compute-intensive short-read mapping. The Caller Engine performs haplotype calling, variant calling and genotyping.
In the Aligner Engine, a Read Reverser preprocesses paired-end reads. Seed Generators are designed to cut paired-end reads
into seeds. Two Exact Matchers perform FM-index queries and find the candidate mapping locations of paired-end reads. A
Candidates Pairer filters the candidates by their locations. A Quality Calculator evaluates the mapping quality. A Rescuer is
deployed to re-align the paired reads to the locations next to that of the target candidate to reduce the number of unmapped
short-reads. A Dynamic Programming Engine includes multiple sets of processing element (PE) arrays that perform parallel
sequence alignment. In the Caller Engine, a Parallel k-mer Processing Engine constructs the de Bruijn graph and performs
queries in a massively-parallel manner. A Variant Discovering Engine locates the variants and executes event determination. A
Genotype Likelihood Computing Engine generates read-to-haplotype likelihood matrices and calculates the allele likelihood
values for genotype quality scores.

Figure 2.4.4 illustrates the techniques for rapid similarity score calculation in short-read mapping and its corresponding
hardware architecture in Aligner Engines. In order to evaluate the similarity between two sequences, the SW algorithm is
applied. The penalties in similarity score of Match, Mismatch, Gap open, and Gap extension (Sm, Smis, Gopen, and Gext) are
introduced in the score matrix. In this example, in which the fragment is

Acknowledgement:
This work is supported by GeneASIC Technologies and National Science and Technology Council (NSTC) of Taiwan. The authors
also thank Taiwan Semiconductor Research Institute (TSRI) for technical support on chip design and fabrication.

References:
[1] K. R. Franke et al., “Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for
Genome Analysis Toolkit algorithms,” Genomics Inform, vol. 18, no. 1, pp. 1-9, 2020.
[2] Y.-C. Wu et al., “A 135mW Fully Integrated Data Processor for Next-Generation Sequencing,” ISSCC, pp. 252-253, 2017.
[3] Y.-C. Wu et al., “A Fully Integrated Genetic Variant Discovery SoC for Next-Generation Sequencing,” ISSCC, pp. 322-323,
2020.
[4] A. Xiao et al., “ADS-HCSpark: A scalable HaplotypeCaller Leveraging Adaptive Data Segmentation to Accelerate Variant
Calling on Spark,” BMC Bioinformatics, vol. 20, no. 76, pp. 1-13, 2019.
[5] S. Zhao et al., “Accuracy and Efficiency of Germline Variant Calling Pipelines for Human Genome Data,” Scientific Reports,
vol. 10, no. 20222, 2020.
[6] Illumina, Illumina DRAGEN Server v3 Site Prep and Installation Guide, No. 1000000097923 v02, 2020.
Figure 2.4.3: System architecture of the end-to-end genome analysis accelerator.
Figure 2.4.4: Operation and circuits for rapid similarity calculation in short-read mapping.
Figure 2.4.5: Dataflow and hardware mapping for haplotype calling and genotyping.
Figure 2.4.6: Performance comparison.
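The Paper 2.4 excerpt above names the Smith-Waterman (SW) algorithm with Match, Mismatch, Gap open and Gap extension scores (Sm, Smis, Gopen, Gext) but gives no values and breaks off before its worked example. As a software reference point only, the sketch below computes the best local alignment score with affine gap penalties (the Gotoh formulation). The default score values and the convention that Gopen is charged for the first gap position and Gext for each further one are assumptions. The chip's hardware mapping is not represented here.

```python
def smith_waterman_affine(read, ref, s_m=2, s_mis=-3, g_open=5, g_ext=1):
    """Best local alignment score between `read` and `ref` with affine gaps.

    s_m / s_mis: score added for a match / mismatch (hypothetical values).
    g_open:      penalty for the first position of a gap (hypothetical value).
    g_ext:       penalty for each further gap position (hypothetical value).
    """
    neg_inf = float("-inf")
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]        # best score ending at (i, j)
    E = [[neg_inf] * cols for _ in range(rows)]  # alignments ending in a gap in `read`
    F = [[neg_inf] * cols for _ in range(rows)]  # alignments ending in a gap in `ref`
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(E[i][j - 1] - g_ext, H[i][j - 1] - g_open)
            F[i][j] = max(F[i - 1][j] - g_ext, H[i - 1][j] - g_open)
            diag = H[i - 1][j - 1] + (s_m if read[i - 1] == ref[j - 1] else s_mis)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Cells on the same anti-diagonal have no mutual dependencies, which is the usual reason this recurrence maps well onto arrays of processing elements.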
Figure 2.5.1: Motion control for autonomous mobile robots.
Figure 2.5.2: Sampling-based motion control workflow and complexity minimization.
Figure 2.5.3: System architecture of the proposed motion control SoC.
Figure 2.5.4: Design techniques and architecture optimizations for PE.
Figure 2.5.5: NoC for workload balancing, data dispatching and data collection.
Figure 2.5.6: Experimental verification and performance comparison.
Figure 2.6.1: Video convolutional neural networks, design challenges and key features.
Figure 2.6.2: Cuboid-based layer-fusion (CBLF) inference flow.
Figure 2.6.5: Chip measurement and video CNNs results of super-resolution and frame interpolation.
Figure 2.6.6: Comparison table with state-of-the-art designs.
Figure 2.7.1: Overview of neural radiance fields (NeRF) and the proposed Bundle-Frame-Familiarity (BuFF) architecture.
Figure 2.7.2: Overall chip architecture.
Figure 2.7.3: Details of visual perception core (VPC) with temporal familiarity unit (TFU) and spatial attention unit (SAU).
Figure 2.7.4: 1D and 2D hybrid neural engine (HNE) with centrifugal search (CS)-based dynamic neural network allocation (DNNA) core.