memory and device memory. CUDA execution is based on threads, and a collection of threads is called a block. A group of blocks can be assigned to a single multiprocessor and timeshare its execution. The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps; the Tesla C2050 GPU has a warp size of 32. A grid consists of a collection of blocks in a single execution. Each thread and block has a unique identifier that can be accessed within the thread. The kernel is the code executed on each thread; using the thread and block identifiers, each thread performs the kernel task on its part of the data. Multiple kernels can be called by an algorithm, and the kernels share data through the global memory.

C. HPC-Based Optimized NEXT 2-D LFSR

The NEXT 2-D LFSR synthesis algorithm [10], written and simulated using Matlab software, was computationally intensive, and the computational time to reach an optimal solution was prohibitively large. An automated synthesis algorithm (Fig. 2) was therefore written in CUDA for optimal usage of the GPU resources. The BIST algorithm was optimized for GPU operation to maximize independent parallelism and arithmetic intensity. As shared memory is much faster than global memory, shared memory was exploited to the fullest by caching the test patterns so that they could be reused by multiple threads. Furthermore, as the device-to-host memory bandwidth is much lower than the device-to-device bandwidth, host-device data transfers were minimized by de-allocating the device memory for each data set right after it was operated on. In general, the latency associated with memory-bound operations can be hidden by increasing the utilization (threads per multiprocessor) of the GPU. As the BIST optimization algorithm includes several memory-bound operations that involve comparison of bits in the test patterns, it is critical to maintain high occupancy on the GPU. However, GPU occupancy is limited by the availability of registers and shared memory for the BIST synthesis operations. To ensure that the GPU resources were utilized efficiently and thereby obtain the best timing performance, the algorithm was run iteratively while varying the number of registers and the amount of shared memory partitioned among concurrent thread blocks. The first iteration of the BIST hardware synthesis utilized 64 threads per block along with the associated registers and shared memory, and the compute time was recorded after the iteration. Next, the number of threads and the associated registers and shared memory were increased, the synthesis procedure was repeated, and the compute time was recorded again. In subsequent iterations, the number of threads per block and the associated registers and shared memory were increased or reduced based on a comparison of the current and previous compute times (Fig. 2). This optimization procedure was continued until all iterations were completed.
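As an illustration of the shared-memory caching and per-thread bit-comparison work described above, a minimal CUDA kernel sketch follows. The kernel name (comparePatterns), the data layout (PATTERN_WORDS 32-bit words per test pattern, TILE_PATTERNS patterns cached per block), and the particular comparison performed are illustrative assumptions and are not the synthesis kernels of [10].

// Hypothetical sketch (not the kernel of [10]): each block caches a tile of
// test patterns in shared memory, then each thread compares the bits of one
// pattern against the next pattern in the tile. Assumes the kernel is
// launched with TILE_PATTERNS threads per block.
#include <cuda_runtime.h>
#include <stdint.h>

#define TILE_PATTERNS 64   // patterns cached per block (assumed)
#define PATTERN_WORDS 4    // 32-bit words per test pattern (assumed)

__global__ void comparePatterns(const uint32_t* __restrict__ patterns,
                                int numPatterns,
                                int* __restrict__ mismatchCount)
{
    // Shared-memory cache holding this block's tile of test patterns.
    __shared__ uint32_t tile[TILE_PATTERNS * PATTERN_WORDS];

    int tid = threadIdx.x;                               // id within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;     // global pattern index

    // Cooperative, strided load of the tile from global to shared memory.
    for (int w = tid; w < TILE_PATTERNS * PATTERN_WORDS; w += blockDim.x) {
        int src = blockIdx.x * TILE_PATTERNS * PATTERN_WORDS + w;
        tile[w] = (src < numPatterns * PATTERN_WORDS) ? patterns[src] : 0u;
    }
    __syncthreads();

    // Each thread compares its pattern with the next one, bit by bit.
    if (gid + 1 < numPatterns && tid + 1 < TILE_PATTERNS) {
        int mismatches = 0;
        for (int w = 0; w < PATTERN_WORDS; ++w) {
            uint32_t diff = tile[tid * PATTERN_WORDS + w] ^
                            tile[(tid + 1) * PATTERN_WORDS + w];
            mismatches += __popc(diff);                  // count differing bits
        }
        mismatchCount[gid] = mismatches;
    }
}

Because the tile is loaded once per block, every thread reuses the cached patterns from fast shared memory rather than re-reading global memory, which is the reuse effect exploited in the optimized algorithm.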
In order to compare the computational time of the BIST synthesis procedure using Matlab on a Sun Ultra 40 M2 workstation (2 x AMD Opteron 3.0 GHz dual-core processors and 4 GB DDR memory) against CUDA on the GPU, the Tesla C2050 GPU was installed in the Sun Ultra 40 M2 workstation. BIST hardware was then synthesized for the benchmark circuits shown in Table II of [10], and the results were compared. The number of test patterns per configuration and the number of configurations were kept the same as the numbers in Table II of [10]. The synthesis procedure was performed 1000 times for each circuit, and the average of the computation times is reported. The comparison of the BIST hardware synthesis computation times is given in Table 1. It can be observed that: a) utilizing GPUs drastically reduces the BIST hardware synthesis time, which is significant because the BIST synthesis time for large SoC cores is prohibitively large; and b) as the number of stages in the FFA is increased from 1 to 3, the computation time for BIST synthesis decreases. This is primarily because as the FFA stages are
[Figure: example test-pattern sets (bits 1-6 of test patterns 1-18), showing the SoC core 1 patterns to be embedded in configuration 1 and the SoC core 2 patterns to be embedded in configuration 2.]
[Fig. 2: Flowchart of the CUDA-based automated NEXT 2-D LFSR BIST synthesis. User inputs: PPC (number of patterns per configuration), m (number of inputs to the gates), s (number of stages in the FFA, maximum 3), and I (total number of iterations); 64 GPU threads per block (plus the associated registers and shared memory) are used initially, up to a maximum of PI * (PPC - s) threads per block. The first PPC patterns are taken for configuration no. 1 (starting with m = 1, s = 1), m and s are incremented as required, the procedure is repeated for the second PPC patterns in configuration no. 2, the CUDA timer is stopped, and the number of GPU threads per block and its associated registers and shared memory are revised (reduced or increased) based on the recorded compute times from the current and previous simulations, iterating until the iteration count reaches I.]
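The iterative thread-count tuning captured in Fig. 2 can be sketched as a host-side loop such as the one below. This is a minimal, hypothetical sketch: the placeholder kernel (runSynthesisIteration), the step size, and the bounds are assumptions for illustration and do not represent the actual synthesis kernels of [10]; timing with cudaEvent objects is simply the standard CUDA way to record per-iteration compute time.

// Hypothetical host-side tuning loop, following the flow of Fig. 2:
// run the synthesis with a given threads-per-block setting, time it with
// CUDA events, and increase or reduce the thread count depending on whether
// the compute time improved relative to the previous iteration.
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the actual synthesis kernels of [10] (assumption).
__global__ void runSynthesisIteration(int dummy) { (void)dummy; }

int main()
{
    const int totalIterations = 10;   // I, the user-supplied iteration count
    int threadsPerBlock = 64;         // initial setting, as in the text
    int step = 32;                    // tuning step size (assumed)
    float prevTimeMs = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < totalIterations; ++iter) {
        cudaEventRecord(start);
        // Launch the (placeholder) synthesis work with the current setting.
        runSynthesisIteration<<<128, threadsPerBlock>>>(0);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float timeMs = 0.0f;
        cudaEventElapsedTime(&timeMs, start, stop);
        printf("iter %d: %d threads/block, %.3f ms\n",
               iter, threadsPerBlock, timeMs);

        // Revise the thread count based on current vs. previous compute time.
        if (timeMs < prevTimeMs) {
            threadsPerBlock += step;   // improvement: keep increasing
        } else {
            threadsPerBlock -= step;   // regression: reduce instead
        }
        if (threadsPerBlock < 32)   threadsPerBlock = 32;
        if (threadsPerBlock > 1024) threadsPerBlock = 1024;
        prevTimeMs = timeMs;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Since registers are allocated per thread and shared memory per block, varying threadsPerBlock in such a loop indirectly varies the register and shared-memory partition among concurrent thread blocks, which is the resource trade-off described in the text.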