Performance-Efficient Integration and Programming Approach of DCT Accelerator for HEVC in MANGO Platform

Igor Piljić, Leon Dragić and Mario Kovač
Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Automatika, 60:2, 245-252, DOI: 10.1080/00051144.2019.1618526
validation of HEVC encoded bitstreams were conducted using the StreamEye software [17]. Results were obtained using the Intel Amplifier tool and are shown in Figure 2. As can be seen from Figure 2, the most time-consuming tasks are forward2DTransform and its counterpart forward2DInverseTransform. The main kernel used in both functions is the 2-dimensional Discrete Cosine Transform (DCT), which is why DCT was chosen as the kernel used to demonstrate the benefits of offloading to a custom hardware accelerator. Several modifications to the application were made to meet the requirements of the DCT accelerator. Residual samples had to be stored in a 16-bit format, which halved the amount of data transferred between the host and the accelerator. However, this modification leads to additional changes, such as packing and unpacking of samples for the AVX DCT implementation, which had a minor impact on its performance.
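As a minimal illustration of the 16-bit storage change (the helper names and buffer layout below are illustrative, not taken from the encoder source), residuals produced as 32-bit integers are narrowed to int16_t before transfer and widened again where the AVX DCT path expects the original type:

#include <cstddef>
#include <cstdint>
#include <vector>

// Narrow 32-bit residuals to the 16-bit transfer format expected by the DCT
// accelerator. For the bit depths considered here the residuals fit into
// 16 bits, and the narrowing halves the number of bytes moved per block.
std::vector<int16_t> pack_residuals(const std::vector<int32_t>& residuals) {
    std::vector<int16_t> packed(residuals.size());
    for (std::size_t i = 0; i < residuals.size(); ++i) {
        packed[i] = static_cast<int16_t>(residuals[i]);
    }
    return packed;
}

// Widen 16-bit samples back to 32 bits for code paths (such as the AVX DCT
// implementation) that still operate on the original integer type.
std::vector<int32_t> unpack_samples(const std::vector<int16_t>& samples) {
    return std::vector<int32_t>(samples.begin(), samples.end());
}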
Accelerator – DCT
Based on the HEVC encoder analysis, a custom hardware accelerator for DCT was designed and implemented in FPGA. Symmetry properties of the 2D DCT transform were exploited to design an area-optimized 1D DCT architecture that can be reused to implement the full 2D core transform. The accelerator architecture is fully pipelined and applicable to all transform sizes used in HEVC (from 4 × 4 to 32 × 32 blocks). After evaluation as a stand-alone module, the DCT accelerator core was integrated as a MANGO HN processing core.
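The separability that the design exploits can be shown with a small software reference (a sketch only: it models the 2D core transform as two passes of the same 1D transform, Y = T × X × T^T, and omits the intermediate shift and rounding stages of the HEVC specification; the transform matrix T is assumed to be supplied by the caller):

#include <cstdint>
#include <vector>

// Reference model of the N x N 2D core transform as two 1D passes.
// Y = T * X * T^T, so the same 1D transform (multiplication by T) is applied
// first along the columns and then along the rows, which is the property that
// lets the hardware reuse a single 1D DCT unit for the full 2D transform.
// T is the N x N integer core-transform matrix (row-major); X holds residuals.
std::vector<int64_t> dct2d_reference(const std::vector<int32_t>& T,
                                     const std::vector<int16_t>& X, int N) {
    std::vector<int64_t> tmp(N * N, 0), Y(N * N, 0);
    // First pass: tmp = T * X (1D transform of the columns of X).
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                tmp[i * N + j] += static_cast<int64_t>(T[i * N + k]) * X[k * N + j];
    // Second pass: Y = tmp * T^T (the same 1D transform applied to the rows).
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                Y[i * N + j] += tmp[i * N + k] * T[j * N + k];
    return Y;
}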
Benchmarks
In the test scenario, the HW DCT accelerator was used for processing DCT tasks. The baseline for the benchmarks consisted of two implementations of the DCT kernel running on the Intel GN:

• Single thread
• Single thread + AVX

In addition to the different accelerator types, the MANGO platform enables the use of a group of units working in parallel, either isolated or in collaboration, on a given task. Therefore, we can also identify and exercise the following working modes in MANGO:

• Standalone mode – units work in standalone mode running parts of the tasks. No communication or collaboration is exercised between the host and the unit or between concurrent units.
• Host iterative mode – the unit works in an iterative mode with the host computing parts of the task. The kernel running on the unit is completely synchronized with the host application. All the units work in the same fashion, but there are no direct interactions between units.

Due to the nature of the DCT algorithm and the design of the DCT accelerator, the unit collaborative mode, also available in MANGO, was not exercised, and the focus was put on the standalone and host iterative modes.

Integration and programming approach

Standalone approach

The standalone mode was the first working mode in which the HW DCT accelerator was integrated into the MANGO platform. In standalone mode, a single task represents a single matrix with sizes ranging from 32 × 32 to 4 × 4, which is then offloaded to the accelerator that processes the data and returns the transformed matrix to the host. For each task, initialization procedures such as registration of kernels, buffers and events, allocation of resources or transfer of arguments are repeated. In the first phase, initialization of the mango context and kernel function takes place. Depending on the matrix size, input, config and output buffers are registered and added to the task graph shown in Figure 3. Input data is then written to the buffers and the kernel is started. After this, the host waits (or does some useful work) for the end event, which indicates that the accelerator has processed the input data. On the accelerator
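The per-task host-side sequence described above can be summarized with the following sketch. The identifiers are placeholders that mirror the phases named in the text, not the actual mangolibs API, and the whole sequence, including setup and teardown, is repeated for every offloaded matrix:

#include <cstddef>
#include <cstdint>

// Hypothetical host-side flow for the standalone mode. None of these
// identifiers are real mangolibs symbols; they only mirror the phases
// described in the text.
struct Context; struct Kernel; struct Buffer; struct TaskGraph; struct Event;

Context*   init_context();                                 // mango context initialization
Kernel*    register_kernel(Context*, const char* binary);  // DCT kernel on an HN unit
Buffer*    register_buffer(Context*, std::size_t bytes);   // input / config / output buffers
TaskGraph* build_task_graph(Kernel*, Buffer* in, Buffer* cfg, Buffer* out);
void       allocate_resources(TaskGraph*);
void       write_buffer(Buffer*, const void* data, std::size_t bytes);
Event*     start_kernel(Kernel*);
void       wait_for(Event*);                               // end event from the accelerator
void       read_buffer(Buffer*, void* data, std::size_t bytes);
void       release(TaskGraph*);

void offload_one_block_standalone(const int16_t* in, int16_t* out, std::size_t bytes,
                                  const void* cfg, std::size_t cfg_bytes) {
    Context*   ctx  = init_context();                      // repeated for every task
    Kernel*    k    = register_kernel(ctx, "dct_kernel");
    Buffer*    bin  = register_buffer(ctx, bytes);
    Buffer*    bcfg = register_buffer(ctx, cfg_bytes);
    Buffer*    bout = register_buffer(ctx, bytes);
    TaskGraph* tg   = build_task_graph(k, bin, bcfg, bout);
    allocate_resources(tg);
    write_buffer(bin, in, bytes);
    write_buffer(bcfg, cfg, cfg_bytes);
    Event* end = start_kernel(k);
    wait_for(end);                                         // host may do useful work meanwhile
    read_buffer(bout, out, bytes);
    release(tg);                                           // teardown also repeats per block
}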
The kernel is now started before the data is written to the buffer. After the host transfers the input data to the MANGO memory, it issues the interrupt_start, which triggers the accelerator to start loading and processing the data. After all the data is processed and stored, the accelerator issues the interrupt back to the host, signifying that the output data is available for reading. The host then reads the data and the cycle ends. This cycle is repeated if the host has more data to process; when there is no more data, the host issues the end event, which indicates that the accelerator is no longer used by the application. Figure 6 shows the time distribution between the different stages when processing three 32 × 32 blocks in the iterative working mode. As can be seen, the mangolibs overhead is in this case lowered and will continue to drop as the number of blocks for processing increases.

Figure 6. Time distribution for iterative integration.
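The interrupt-driven cycle described above can be sketched in the same placeholder style as the standalone example (none of the identifiers are real mangolibs symbols); setup happens once and only the write / interrupt_start / interrupt_end / read cycle repeats, which is what reduces the mangolibs overhead per block:

#include <cstddef>
#include <cstdint>

// Hypothetical host-side loop for the iterative mode; placeholder names only.
struct Kernel; struct Buffer; struct Event;
struct BlockSource { bool has_more(); const int16_t* next(); };
struct BlockSink   { int16_t* next(); };

Event* start_kernel(Kernel*);
void   write_buffer(Buffer*, const void*, std::size_t);
void   read_buffer(Buffer*, void*, std::size_t);
void   send_interrupt_start(Kernel*);   // accelerator starts loading and processing
void   wait_interrupt_end(Kernel*);     // output data is available for reading
void   signal_end_event(Event*);        // accelerator no longer used by the application

void run_iterative(Kernel* k, Buffer* bin, Buffer* bout,
                   BlockSource& src, BlockSink& dst, std::size_t bytes) {
    Event* end = start_kernel(k);             // kernel started before any data is written
    while (src.has_more()) {
        write_buffer(bin, src.next(), bytes);  // host -> MANGO memory
        send_interrupt_start(k);               // accelerator loads and processes the data
        wait_interrupt_end(k);                 // accelerator has stored the results
        read_buffer(bout, dst.next(), bytes);  // host reads the transformed blocks
    }
    signal_end_event(end);                     // no more data: release the accelerator
}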
To benchmark the performance of the iterative working mode, 10,000 tasks of 32 × 32, 16 × 16, 8 × 8 and 4 × 4 blocks were offloaded in series to the HW DCT accelerator. The results are shown in the following table. It is interesting to note that block size does not affect the time needed to process the data, even though the pipeline for processing 4 × 4 matrices is much shorter than the pipeline for processing 32 × 32 matrices. The reason for this is that the bottleneck for processing time is the time spent on loading data by the accelerator and then storing it once the processing is finished. This is a limitation introduced by the fact that the MANGO platform is an architecture exploration platform for HPC and not a processing-efficient HPC platform itself. The memory bandwidth impacts the performance of

Optimizations

Table 1 shows the average time distribution per phase when processing 10,000 instances of 32 × 32 blocks. Write and read time correspond to the time needed to transfer data from the host to the MANGO memory or from the MANGO memory to the host, respectively. Processing time is the duration from the moment when interrupt_start has been sent to the accelerator to the moment when interrupt_end has been received by the host. Processing time thus includes the time needed for the accelerator to load data from the memory, process it, and then store it back. Here we do not address how the total processing time of the accelerator is distributed between memory accesses and data passing through the pipeline. Usually, small buffer sizes suffer from the problem of significant memory transfer overhead, but in this case almost 90% of the time is attributed to the processing time of the accelerator, as shown in Table 1. However, a detailed analysis of the different buffer sizes showed that processing time is significantly affected by the time the accelerator spends on loading and storing the data. The results of this analysis are shown in Table 2.

Total time spent on transferring and processing data grows linearly with the increase of buffer size from 2 kB up to 64 kB. However, when the buffer becomes 64 kB or larger, the average time needed to transfer and process data shortens significantly and we witness considerably higher efficiency. The explanation for this lies in the fact that for buffer sizes smaller than 64 kB, data is transferred from the host to MANGO using the item network, while for larger buffers, shared buffers are used. Figure 7 shows the time cost per single block for buffer sizes ranging from 2 to 20,000 kB. The total time cost for transferring and processing a single block in a 60 kB buffer is 14.67 milliseconds, while for a 64 kB buffer it is 0.61 milliseconds. The time cost continues to decline to 0.44 milliseconds for a buffer size of 2000 kB, after which it flattens out. This analysis shows that, to exploit the accelerator efficiently, buffer sizes should be greater than 2 MB. In the
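The buffer sizes in Table 2 map directly onto block counts: one 32 × 32 block of 16-bit residuals occupies 32 × 32 × 2 = 2048 bytes, so a 2 kB buffer carries a single block, a 64 kB buffer carries 32 blocks and a 2 MB buffer roughly 1000, matching the "32 × 32 blocks processed" column. A minimal sketch of batching blocks into one large transfer buffer along these lines is given below (illustrative only; the helper is not part of the encoder or mangolibs):

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack whole 32 x 32 residual blocks contiguously into one large transfer
// buffer so that a single offload moves megabytes instead of kilobytes.
// One block of 16-bit residuals is 32 * 32 * 2 = 2048 bytes, so a 2 MB
// buffer holds roughly a thousand blocks.
constexpr std::size_t kBlockSamples = 32 * 32;
constexpr std::size_t kBlockBytes   = kBlockSamples * sizeof(int16_t); // 2048 B

std::vector<int16_t> pack_blocks(const std::vector<std::vector<int16_t>>& blocks) {
    std::vector<int16_t> buffer;
    buffer.reserve(blocks.size() * kBlockSamples);
    for (const auto& block : blocks) {
        // Each block is expected to hold exactly 32 * 32 residual samples.
        buffer.insert(buffer.end(), block.begin(), block.end());
    }
    return buffer; // handed to the accelerator as one input buffer
}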
Table 2. Benchmarked performance of the DCT kernel for different I/O buffer sizes; all times are shown in milliseconds.
Buffer size # of runs Processing time Read data time Write data time Total time 32 × 32 blocks processed Total per block
2 kB 100 15.5345 1.8409 0.4142 18.8748 1 18.8748
4 kB 100 27.7040 2.1224 0.7477 32.1012 2 16.0506
20 kB 100 138.3400 4.1918 3.3358 150.1410 10 15.0141
60 kB 100 414.2460 7.6073 9.5798 440.0900 30 14.6697
64 kB 100 2.6461 8.9927 7.1893 19.4663 32 0.6083
200 kB 100 4.5282 25.1933 19.0508 49.4060 100 0.4941
2 MB 100 29.9313 233.6350 176.7760 440.9960 1000 0.4410
10 MB 100 143.0100 1233.7600 897.8580 2275.2900 5000 0.4551
20 MB 100 284.5090 2461.4800 1790.8300 4537.5000 10000 0.4538
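As a consistency check on the "Total per block" column, the per-block cost in Table 2 is the tabulated total time divided by the number of 32 × 32 blocks processed, which reproduces the figures quoted in the text:

60 kB:  440.0900 ms / 30 blocks   = 14.67 ms per block
64 kB:   19.4663 ms / 32 blocks   =  0.61 ms per block
2 MB:   440.9960 ms / 1000 blocks =  0.44 ms per block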
Table 3. Comparison of the DCT kernel on different processing cores; all times are shown in milliseconds.
Core Block size Process Read Write Total per block
DCT 32 × 32 284.50 2461.48 1790.83 0.4538
DCT 16 × 16 283.10 2465.58 1790.20 0.1135
DCT 8 × 8 279.32 2451.88 1791.44 0.0283
DCT 4 × 4 286.05 2459.17 1789.57 0.0071
CPU 32 × 32 1981.51 – – 0.1985
CPU 16 × 16 1016.66 – – 0.0254
CPU 8 × 8 541.27 – – 0.0033
CPU 4 × 4 331.89 – – 0.0005
AVX 32 × 32 580.54 – – 0.0580
AVX 16 × 16 363.17 – – 0.0090
AVX 8 × 8 181.70 – – 0.0011
AVX 4 × 4 220.33 – – 0.0003

which is enough to support use cases of transcoding full HD video sequences.

Performance evaluation

Performance evaluation was performed for three types of accelerator kernels: the HW DCT accelerator, the CPU and the CPU + AVX. In all tests, a buffer size of 20 MB was used and block sizes ranging from 32 × 32 to 4 × 4 were transferred. The results are shown in Table 3.

If we take into consideration the time spent on average for processing a single block of data, AVX is the fastest, followed by the CPU and the HW DCT accelerator. Total time consists of processing time, read time and write time. Read and write time represent the time spent transferring data from the host to MANGO and vice versa, and are equal to 0 for the CPU and AVX since in this case the data never leaves the host. However, read and write time account for a share of over 90% when it comes to the HW DCT accelerator. This can be explained by the MANGO platform being an exploration platform for HPC: the implemented data transfer buses do not exercise real-world scenarios in which high-bandwidth buses are fully exploited. If we take only processing time into consideration, for 32 × 32 and 16 × 16 blocks the HW accelerator, even running at 40 MHz, outperforms the Intel AVX-optimized implementation (running at 3.3 GHz). However, for smaller blocks, 8 × 8 and 4 × 4, the HW DCT accelerator does not provide the fastest results. The main issue that explains this behaviour is slow memory access from the accelerator to the MANGO memory, which is the bottleneck of the accelerator processing time. Because of this bottleneck, the processing times of the HW DCT accelerator for all block sizes are approximately the same, even though the pipeline for processing 4 × 4 blocks is much shorter than the one for processing 8 × 8, 16 × 16 or 32 × 32 blocks. The second fact that needs to be considered is that the accelerator runs on
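Reading only the Process column of Table 3 gives the comparison referred to above (all values in milliseconds; lower is better):

32 × 32:  DCT 284.50  vs  AVX 580.54  (accelerator about 2.0× faster)
16 × 16:  DCT 283.10  vs  AVX 363.17  (accelerator about 1.3× faster)
8 × 8:    DCT 279.32  vs  AVX 181.70  (AVX faster)
4 × 4:    DCT 286.05  vs  AVX 220.33  (AVX faster)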
[16] Flich J, et al. The MANGO FET-HPC project: an overview. 2015 IEEE 18th International Conference on Computational Science and Engineering; Porto; 2015. p. 351–354.
[17] Elecard: StreamEye software. Available from: https://www.elecard.com/products/video-analysis/streameye.
[18] Massari G, Libutti S, Fornaciari W, et al. Resource-aware application execution exploiting the BarbequeRTRM. Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT); CEUR; 2016. p. 3–7.