memory and device memory. CUDA execution is based on threads, and a collection of threads is called a block. A group of blocks can be assigned to a single multiprocessor and timeshare its execution. The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps; the Tesla C2050 GPU has a warp size of 32. A grid consists of a collection of blocks in a single execution. Each thread and block has a unique identifier that can be accessed within the thread. The kernel is the code executed on each thread; using the thread and block identifiers, each thread performs the kernel task on its part of the data. Multiple kernels can be called by an algorithm, and the kernels share data through the global memory.

C. HPC-Based Optimized NEXT 2-D LFSR

The NEXT 2-D LFSR synthesis algorithm [10], written and simulated using Matlab software, was computationally intensive, and the computational time to reach an optimal solution was prohibitively large. An automated synthesis algorithm (Fig. 2) was therefore written in CUDA for optimal usage of the GPU resources. The BIST algorithm was optimized for GPU operation to maximize independent parallelism and arithmetic intensity. As shared memory is much faster than global memory, shared memory was exploited to the fullest by caching the test patterns so that they could be reused by multiple threads. Furthermore, as the device-to-host memory bandwidth is much lower than the device-to-device bandwidth, host-device data transfers were minimized by de-allocating the device memory for each data set right after it was operated on. In general, the latency associated with memory-bound operations can be hidden by increasing the utilization (threads per multiprocessor) of the GPU. As the BIST optimization algorithm includes several memory-bound operations that involve comparison of bits in the test patterns, it is critical to maintain high occupancy on the GPU. However, GPU occupancy is limited by the availability of registers and shared memory for the BIST synthesis operations. To ensure that the GPU resources were utilized efficiently and thereby obtain the best timing performance, the algorithm was run iteratively while varying the number of registers and the amount of shared memory partitioned among concurrent thread blocks. The first iteration of the BIST hardware synthesis utilized 64 threads per block along with the associated registers and shared memory, and the compute time was recorded after the iteration. Next, the number of threads and the associated registers and shared memory were increased, the synthesis procedure was repeated, and the compute time was recorded again. In subsequent iterations, the number of threads per block and the associated registers and shared memory were increased or reduced based on a comparison of the current and previous compute times (Fig. 2). This optimization procedure was continued until all iterations were completed.
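As an illustration of the shared-memory caching and per-thread bit-comparison work described above, a minimal CUDA kernel sketch follows. The kernel name (comparePatterns), the data layout (PATTERN_WORDS 32-bit words per test pattern, TILE_PATTERNS patterns cached per block), and the particular comparison performed are illustrative assumptions and are not the synthesis kernels of [10].

// Hypothetical sketch (not the kernel of [10]): each block caches a tile of
// test patterns in shared memory, then each thread compares the bits of one
// pattern against the next pattern in the tile. Assumes the kernel is
// launched with TILE_PATTERNS threads per block.
#include <cuda_runtime.h>
#include <stdint.h>

#define TILE_PATTERNS 64   // patterns cached per block (assumed)
#define PATTERN_WORDS 4    // 32-bit words per test pattern (assumed)

__global__ void comparePatterns(const uint32_t* __restrict__ patterns,
                                int numPatterns,
                                int* __restrict__ mismatchCount)
{
    // Shared-memory cache holding this block's tile of test patterns.
    __shared__ uint32_t tile[TILE_PATTERNS * PATTERN_WORDS];

    int tid = threadIdx.x;                               // id within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;     // global pattern index

    // Cooperative, strided load of the tile from global to shared memory.
    for (int w = tid; w < TILE_PATTERNS * PATTERN_WORDS; w += blockDim.x) {
        int src = blockIdx.x * TILE_PATTERNS * PATTERN_WORDS + w;
        tile[w] = (src < numPatterns * PATTERN_WORDS) ? patterns[src] : 0u;
    }
    __syncthreads();

    // Each thread compares its pattern with the next one, bit by bit.
    if (gid + 1 < numPatterns && tid + 1 < TILE_PATTERNS) {
        int mismatches = 0;
        for (int w = 0; w < PATTERN_WORDS; ++w) {
            uint32_t diff = tile[tid * PATTERN_WORDS + w] ^
                            tile[(tid + 1) * PATTERN_WORDS + w];
            mismatches += __popc(diff);                  // count differing bits
        }
        mismatchCount[gid] = mismatches;
    }
}

Because the tile is loaded once per block, every thread reuses the cached patterns from fast shared memory rather than re-reading global memory, which is the reuse effect exploited in the optimized algorithm.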
In order to compare the computational time of the BIST synthesis procedure using Matlab on a Sun Ultra 40 M2 workstation (2 x AMD Opteron 3.0 GHz dual-core processors and 4 GB DDR memory) against CUDA on the GPU, the Tesla C2050 GPU was installed in the Sun Ultra 40 M2 workstation. BIST hardware was then synthesized for the benchmark circuits shown in Table II of [10], and the results were compared. The number of test patterns per configuration and the number of configurations were kept the same as the numbers in Table II of [10]. The synthesis procedure was performed 1000 times for each circuit, and the average of the computation times is reported. The comparison of the BIST hardware synthesis computation times is given in Table 1. It can be observed that: a) utilizing GPUs drastically reduces the BIST hardware synthesis time, which is significant because the BIST synthesis time for large SoC cores is prohibitively large; and b) as the number of stages in the FFA is increased from 1 to 3, the computation time for BIST synthesis decreases. This is primarily because as the FFA stages are
[Figure: example test-pattern sets (bits 1-6 of test patterns 1-18), showing the SoC core 1 patterns to be embedded in configuration 1 and the SoC core 2 patterns to be embedded in configuration 2.]
[Fig. 2: Flowchart of the CUDA-based automated NEXT 2-D LFSR BIST synthesis. User inputs: PPC (number of patterns per configuration), m (number of inputs to the gates), s (number of stages in the FFA, maximum 3), and I (total number of iterations); 64 GPU threads per block (plus the associated registers and shared memory) are used initially, up to a maximum of PI * (PPC - s) threads per block. The first PPC patterns are taken for configuration no. 1 (starting with m = 1, s = 1), m and s are incremented as required, the procedure is repeated for the second PPC patterns in configuration no. 2, the CUDA timer is stopped, and the number of GPU threads per block and its associated registers and shared memory are revised (reduced or increased) based on the recorded compute times from the current and previous simulations, iterating until the iteration count reaches I.]
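The iterative thread-count tuning captured in Fig. 2 can be sketched as a host-side loop such as the one below. This is a minimal, hypothetical sketch: the placeholder kernel (runSynthesisIteration), the step size, and the bounds are assumptions for illustration and do not represent the actual synthesis kernels of [10]; timing with cudaEvent objects is simply the standard CUDA way to record per-iteration compute time.

// Hypothetical host-side tuning loop, following the flow of Fig. 2:
// run the synthesis with a given threads-per-block setting, time it with
// CUDA events, and increase or reduce the thread count depending on whether
// the compute time improved relative to the previous iteration.
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the actual synthesis kernels of [10] (assumption).
__global__ void runSynthesisIteration(int dummy) { (void)dummy; }

int main()
{
    const int totalIterations = 10;   // I, the user-supplied iteration count
    int threadsPerBlock = 64;         // initial setting, as in the text
    int step = 32;                    // tuning step size (assumed)
    float prevTimeMs = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < totalIterations; ++iter) {
        cudaEventRecord(start);
        // Launch the (placeholder) synthesis work with the current setting.
        runSynthesisIteration<<<128, threadsPerBlock>>>(0);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float timeMs = 0.0f;
        cudaEventElapsedTime(&timeMs, start, stop);
        printf("iter %d: %d threads/block, %.3f ms\n",
               iter, threadsPerBlock, timeMs);

        // Revise the thread count based on current vs. previous compute time.
        if (timeMs < prevTimeMs) {
            threadsPerBlock += step;   // improvement: keep increasing
        } else {
            threadsPerBlock -= step;   // regression: reduce instead
        }
        if (threadsPerBlock < 32)   threadsPerBlock = 32;
        if (threadsPerBlock > 1024) threadsPerBlock = 1024;
        prevTimeMs = timeMs;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Since registers are allocated per thread and shared memory per block, varying threadsPerBlock in such a loop indirectly varies the register and shared-memory partition among concurrent thread blocks, which is the resource trade-off described in the text.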