Performance-Efficient Integration and Programming Approach of DCT Accelerator for HEVC in MANGO Platform

Igor Piljić, Leon Dragić and Mario Kovač
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Automatika, 60:2, 245-252, DOI: 10.1080/00051144.2019.1618526
validation of HEVC encoded bitstreams were conducted using the StreamEye software [17]. Results were obtained using the Intel Amplifier tool and are shown in Figure 2. As can be seen from Figure 2, the most time-consuming tasks are forward2DTransform and its counterpart forward2DInverseTransform. The main kernel used in both functions is the 2-dimensional Discrete Cosine Transform (DCT), which is why DCT was chosen as the kernel used to demonstrate the benefits of offloading to a custom hardware accelerator. Several modifications to the application were made to meet the requirements of the DCT accelerator. Residual samples had to be stored in a 16-bit format, which halved the amount of data transferred between the host and the accelerator. However, this modification leads to additional changes, such as packing and unpacking of samples for the AVX DCT implementation, which had a minor impact on its performance.
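As a minimal illustration of the 16-bit storage change (the helper names and buffer layout below are illustrative, not taken from the encoder source), residuals produced as 32-bit integers are narrowed to int16_t before transfer and widened again where the AVX DCT path expects the original type:

#include <cstddef>
#include <cstdint>
#include <vector>

// Narrow 32-bit residuals to the 16-bit transfer format expected by the DCT
// accelerator. For the bit depths considered here the residuals fit into
// 16 bits, and the narrowing halves the number of bytes moved per block.
std::vector<int16_t> pack_residuals(const std::vector<int32_t>& residuals) {
    std::vector<int16_t> packed(residuals.size());
    for (std::size_t i = 0; i < residuals.size(); ++i) {
        packed[i] = static_cast<int16_t>(residuals[i]);
    }
    return packed;
}

// Widen 16-bit samples back to 32 bits for code paths (such as the AVX DCT
// implementation) that still operate on the original integer type.
std::vector<int32_t> unpack_samples(const std::vector<int16_t>& samples) {
    return std::vector<int32_t>(samples.begin(), samples.end());
}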
Accelerator – DCT
Based on the HEVC encoder analysis, a custom hardware accelerator for DCT was designed and implemented in FPGA. Symmetry properties of the 2D DCT transform were exploited to design an area-optimized 1D DCT architecture that can be reused to implement the full 2D core transform. The accelerator architecture is fully pipelined and applicable to all transform sizes used in HEVC (from 4 × 4 to 32 × 32 blocks). After evaluation as a stand-alone module, the DCT accelerator core was integrated as a MANGO HN processing core.
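The separability that the design exploits can be shown with a small software reference (a sketch only: it models the 2D core transform as two passes of the same 1D transform, Y = T × X × T^T, and omits the intermediate shift and rounding stages of the HEVC specification; the transform matrix T is assumed to be supplied by the caller):

#include <cstdint>
#include <vector>

// Reference model of the N x N 2D core transform as two 1D passes.
// Y = T * X * T^T, so the same 1D transform (multiplication by T) is applied
// first along the columns and then along the rows, which is the property that
// lets the hardware reuse a single 1D DCT unit for the full 2D transform.
// T is the N x N integer core-transform matrix (row-major); X holds residuals.
std::vector<int64_t> dct2d_reference(const std::vector<int32_t>& T,
                                     const std::vector<int16_t>& X, int N) {
    std::vector<int64_t> tmp(N * N, 0), Y(N * N, 0);
    // First pass: tmp = T * X (1D transform of the columns of X).
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                tmp[i * N + j] += static_cast<int64_t>(T[i * N + k]) * X[k * N + j];
    // Second pass: Y = tmp * T^T (the same 1D transform applied to the rows).
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                Y[i * N + j] += tmp[i * N + k] * T[j * N + k];
    return Y;
}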
Benchmarks
In the test scenario, the HW DCT accelerator was used for processing DCT tasks. The baseline for the benchmarks consisted of two implementations of the DCT kernel running on the Intel GN:

• Single thread
• Single thread + AVX

In addition to the different accelerator types, the MANGO platform enables the use of a group of units working in parallel, either isolated or in collaboration, on a given task. Therefore, we can also identify and exercise the following working modes in MANGO:

• Standalone mode – units work in standalone mode running parts of the tasks. No communication or collaboration is exercised between the host and the unit or between concurrent units.
• Host iterative mode – the unit works in an iterative mode with the host computing parts of the task. The kernel running on the unit is completely synchronized with the host application. All the units work in the same fashion, but there are no direct interactions between units.

Due to the nature of the DCT algorithm and the design of the DCT accelerator, the unit collaborative mode, also available in MANGO, was not exercised, and the focus was put on the standalone and host iterative modes.

Integration and programming approach

Standalone approach

The standalone mode was the first working mode in which the HW DCT accelerator was integrated into the MANGO platform. In standalone mode, a single task represents a single matrix with sizes ranging from 32 × 32 to 4 × 4, which is then offloaded to the accelerator that processes the data and returns the transformed matrix to the host. For each task, initialization procedures such as registration of kernels, buffers and events, allocation of resources or transfer of arguments are repeated. In the first phase, initialization of the mango context and kernel function takes place. Depending on the matrix size, input, config and output buffers are registered and added to the task graph shown in Figure 3. Input data is then written to the buffers and the kernel is started. After this, the host waits (or does some useful work) for the end event, which indicates that the accelerator has processed the input data. On the accelerator
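The per-task host-side sequence described above can be summarized with the following sketch. The identifiers are placeholders that mirror the phases named in the text, not the actual mangolibs API, and the whole sequence, including setup and teardown, is repeated for every offloaded matrix:

#include <cstddef>
#include <cstdint>

// Hypothetical host-side flow for the standalone mode. None of these
// identifiers are real mangolibs symbols; they only mirror the phases
// described in the text.
struct Context; struct Kernel; struct Buffer; struct TaskGraph; struct Event;

Context*   init_context();                                 // mango context initialization
Kernel*    register_kernel(Context*, const char* binary);  // DCT kernel on an HN unit
Buffer*    register_buffer(Context*, std::size_t bytes);   // input / config / output buffers
TaskGraph* build_task_graph(Kernel*, Buffer* in, Buffer* cfg, Buffer* out);
void       allocate_resources(TaskGraph*);
void       write_buffer(Buffer*, const void* data, std::size_t bytes);
Event*     start_kernel(Kernel*);
void       wait_for(Event*);                               // end event from the accelerator
void       read_buffer(Buffer*, void* data, std::size_t bytes);
void       release(TaskGraph*);

void offload_one_block_standalone(const int16_t* in, int16_t* out, std::size_t bytes,
                                  const void* cfg, std::size_t cfg_bytes) {
    Context*   ctx  = init_context();                      // repeated for every task
    Kernel*    k    = register_kernel(ctx, "dct_kernel");
    Buffer*    bin  = register_buffer(ctx, bytes);
    Buffer*    bcfg = register_buffer(ctx, cfg_bytes);
    Buffer*    bout = register_buffer(ctx, bytes);
    TaskGraph* tg   = build_task_graph(k, bin, bcfg, bout);
    allocate_resources(tg);
    write_buffer(bin, in, bytes);
    write_buffer(bcfg, cfg, cfg_bytes);
    Event* end = start_kernel(k);
    wait_for(end);                                         // host may do useful work meanwhile
    read_buffer(bout, out, bytes);
    release(tg);                                           // teardown also repeats per block
}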
The kernel is now started before the data is written to the buffer. After the host transfers the input data to the MANGO memory, it issues the interrupt_start, which triggers the accelerator to start loading and processing the data. After all the data is processed and stored, the accelerator issues the interrupt back to the host, signifying that the output data is available for reading. The host then reads the data and the cycle ends. This cycle is repeated if the host has more data to process; when there is no more data, the host issues the end event, which indicates that the accelerator is no longer used by the application. Figure 6 shows the time distribution between the different stages when processing three 32 × 32 blocks in the iterative working mode. As can be seen, the mangolibs overhead is in this case lowered and will continue to drop as the number of blocks for processing increases.

Figure 6. Time distribution for iterative integration.
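The interrupt-driven cycle described above can be sketched in the same placeholder style as the standalone example (none of the identifiers are real mangolibs symbols); setup happens once and only the write / interrupt_start / interrupt_end / read cycle repeats, which is what reduces the mangolibs overhead per block:

#include <cstddef>
#include <cstdint>

// Hypothetical host-side loop for the iterative mode; placeholder names only.
struct Kernel; struct Buffer; struct Event;
struct BlockSource { bool has_more(); const int16_t* next(); };
struct BlockSink   { int16_t* next(); };

Event* start_kernel(Kernel*);
void   write_buffer(Buffer*, const void*, std::size_t);
void   read_buffer(Buffer*, void*, std::size_t);
void   send_interrupt_start(Kernel*);   // accelerator starts loading and processing
void   wait_interrupt_end(Kernel*);     // output data is available for reading
void   signal_end_event(Event*);        // accelerator no longer used by the application

void run_iterative(Kernel* k, Buffer* bin, Buffer* bout,
                   BlockSource& src, BlockSink& dst, std::size_t bytes) {
    Event* end = start_kernel(k);             // kernel started before any data is written
    while (src.has_more()) {
        write_buffer(bin, src.next(), bytes);  // host -> MANGO memory
        send_interrupt_start(k);               // accelerator loads and processes the data
        wait_interrupt_end(k);                 // accelerator has stored the results
        read_buffer(bout, dst.next(), bytes);  // host reads the transformed blocks
    }
    signal_end_event(end);                     // no more data: release the accelerator
}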
To benchmark the performance of the iterative working mode, 10,000 tasks of 32 × 32, 16 × 16, 8 × 8 and 4 × 4 blocks were offloaded in series to the HW DCT accelerator. The results are shown in the following table. It is interesting to note that block size does not affect the time needed to process the data, even though the pipeline for processing 4 × 4 matrices is much shorter than the pipeline for processing 32 × 32 matrices. The reason for this is that the bottleneck for processing time is the time spent on loading data by the accelerator and then storing it once the processing is finished. This is a limitation introduced by the fact that the MANGO platform is an architecture exploration platform for HPC and not a processing-efficient HPC platform itself. The memory bandwidth impacts the performance of

Optimizations

Table 1 shows the average time distribution per phase when processing 10,000 instances of 32 × 32 blocks. Write and read time correspond to the time needed to transfer data from the host to the MANGO memory or from the MANGO memory to the host, respectively. Processing time is the duration from the moment when interrupt_start has been sent to the accelerator to the moment when interrupt_end has been received by the host. Processing time thus includes the time needed for the accelerator to load data from the memory, process it, and then store it back. Here we do not address how the total processing time of the accelerator is distributed between memory accesses and data passing through the pipeline. Usually, small buffer sizes suffer from the problem of significant memory transfer overhead, but in this case almost 90% of the time is attributed to the processing time of the accelerator, as shown in Table 1. However, a detailed analysis of the different buffer sizes showed that processing time is significantly affected by the time the accelerator spends on loading and storing the data. The results of this analysis are shown in Table 2.

Total time spent on transferring and processing data grows linearly with the increase of buffer size from 2 kB up to 64 kB. However, when the buffer becomes 64 kB or larger, the average time needed to transfer and process data shortens significantly and we witness considerably higher efficiency. The explanation for this lies in the fact that for buffer sizes smaller than 64 kB, data is transferred from the host to MANGO using the item network, while for larger buffers, shared buffers are used. Figure 7 shows the time cost per single block for buffer sizes ranging from 2 to 20,000 kB. The total time cost for transferring and processing a single block in a 60 kB buffer is 14.67 milliseconds, while for a 64 kB buffer it is 0.61 milliseconds. The time cost continues to decline to 0.44 milliseconds for a buffer size of 2000 kB, after which it flattens out. This analysis shows that, to exploit the accelerator efficiently, buffer sizes should be greater than 2 MB. In the
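The buffer sizes in Table 2 map directly onto block counts: one 32 × 32 block of 16-bit residuals occupies 32 × 32 × 2 = 2048 bytes, so a 2 kB buffer carries a single block, a 64 kB buffer carries 32 blocks and a 2 MB buffer roughly 1000, matching the "32 × 32 blocks processed" column. A minimal sketch of batching blocks into one large transfer buffer along these lines is given below (illustrative only; the helper is not part of the encoder or mangolibs):

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack whole 32 x 32 residual blocks contiguously into one large transfer
// buffer so that a single offload moves megabytes instead of kilobytes.
// One block of 16-bit residuals is 32 * 32 * 2 = 2048 bytes, so a 2 MB
// buffer holds roughly a thousand blocks.
constexpr std::size_t kBlockSamples = 32 * 32;
constexpr std::size_t kBlockBytes   = kBlockSamples * sizeof(int16_t); // 2048 B

std::vector<int16_t> pack_blocks(const std::vector<std::vector<int16_t>>& blocks) {
    std::vector<int16_t> buffer;
    buffer.reserve(blocks.size() * kBlockSamples);
    for (const auto& block : blocks) {
        // Each block is expected to hold exactly 32 * 32 residual samples.
        buffer.insert(buffer.end(), block.begin(), block.end());
    }
    return buffer; // handed to the accelerator as one input buffer
}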
Table 2. Benchmarked performance of the DCT kernel for different I/O buffer sizes; all times are shown in milliseconds.
Buffer size # of runs Processing time Read data time Write data time Total time 32 × 32 blocks processed Total per block
2 kB 100 15.5345 1.8409 0.4142 18.8748 1 18.8748
4 kB 100 27.7040 2.1224 0.7477 32.1012 2 16.0506
20 kB 100 138.3400 4.1918 3.3358 150.1410 10 15.0141
60 kB 100 414.2460 7.6073 9.5798 440.0900 30 14.6697
64 kB 100 2.6461 8.9927 7.1893 19.4663 32 0.6083
200 kB 100 4.5282 25.1933 19.0508 49.4060 100 0.4941
2 MB 100 29.9313 233.6350 176.7760 440.9960 1000 0.4410
10 MB 100 143.0100 1233.7600 897.8580 2275.2900 5000 0.4551
20 MB 100 284.5090 2461.4800 1790.8300 4537.5000 10000 0.4538
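As a consistency check on the "Total per block" column, the per-block cost in Table 2 is the tabulated total time divided by the number of 32 × 32 blocks processed, which reproduces the figures quoted in the text:

60 kB:  440.0900 ms / 30 blocks   = 14.67 ms per block
64 kB:   19.4663 ms / 32 blocks   =  0.61 ms per block
2 MB:   440.9960 ms / 1000 blocks =  0.44 ms per block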
Table 3. Comparison of the DCT kernel on different processing cores; all times are shown in milliseconds.
Core Block size Process Read Write Total per block
DCT 32 × 32 284.50 2461.48 1790.83 0.4538
DCT 16 × 16 283.10 2465.58 1790.20 0.1135
DCT 8 × 8 279.32 2451.88 1791.44 0.0283
DCT 4 × 4 286.05 2459.17 1789.57 0.0071
CPU 32 × 32 1981.51 – – 0.1985
CPU 16 × 16 1016.66 – – 0.0254
CPU 8 × 8 541.27 – – 0.0033
CPU 4 × 4 331.89 – – 0.0005
AVX 32 × 32 580.54 – – 0.0580
AVX 16 × 16 363.17 – – 0.0090
AVX 8 × 8 181.70 – – 0.0011
AVX 4 × 4 220.33 – – 0.0003

which is enough to support use cases of transcoding full HD video sequences.

Performance evaluation

Performance evaluation was performed for three types of accelerator kernels: the HW DCT accelerator, the CPU and the CPU + AVX. In all tests, a buffer size of 20 MB was used and block sizes ranging from 32 × 32 to 4 × 4 were transferred. The results are shown in Table 3.

If we take into consideration the time spent on average for processing a single block of data, AVX is the fastest, followed by the CPU and the HW DCT accelerator. Total time consists of processing time, read time and write time. Read and write time represent the time spent transferring data from the host to MANGO and vice versa, and are equal to 0 for the CPU and AVX since in this case the data never leaves the host. However, read and write time account for a share of over 90% when it comes to the HW DCT accelerator. This can be explained by the MANGO platform being an exploration platform for HPC: the implemented data transfer buses do not exercise real-world scenarios in which high-bandwidth buses are fully exploited. If we take only processing time into consideration, for 32 × 32 and 16 × 16 blocks the HW accelerator, even running at 40 MHz, outperforms the Intel AVX-optimized implementation (running at 3.3 GHz). However, for smaller blocks, 8 × 8 and 4 × 4, the HW DCT accelerator does not provide the fastest results. The main issue that explains this behaviour is slow memory access from the accelerator to the MANGO memory, which is the bottleneck of the accelerator processing time. Because of this bottleneck, the processing times of the HW DCT accelerator for all block sizes are approximately the same, even though the pipeline for processing 4 × 4 blocks is much shorter than the one for processing 8 × 8, 16 × 16 or 32 × 32 blocks. The second fact that needs to be considered is that the accelerator runs on
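Reading only the Process column of Table 3 gives the comparison referred to above (all values in milliseconds; lower is better):

32 × 32:  DCT 284.50  vs  AVX 580.54  (accelerator about 2.0× faster)
16 × 16:  DCT 283.10  vs  AVX 363.17  (accelerator about 1.3× faster)
8 × 8:    DCT 279.32  vs  AVX 181.70  (AVX faster)
4 × 4:    DCT 286.05  vs  AVX 220.33  (AVX faster)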
[16] Flich J, et al. The MANGO FET-HPC project: an overview. 2015 IEEE 18th International Conference on Computational Science and Engineering; Porto; 2015. p. 351–354.
[17] Elecard: StreamEye software. Available from: https://www.elecard.com/products/video-analysis/streameye.
[18] Massari G, Libutti S, Fornaciari W, et al. Resource-aware application execution exploiting the BarbequeRTRM. Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT); CEUR; 2016. p. 3–7.