
Automatika: Journal for Control, Measurement, Electronics, Computing and Communications

ISSN: 0005-1144 (Print) 1848-3380 (Online)
Journal homepage: https://www.tandfonline.com/loi/taut20

Performance-efficient integration and programming approach of DCT accelerator for HEVC in MANGO platform

Igor Piljić, Leon Dragić & Mario Kovač

To cite this article: Igor Piljić, Leon Dragić & Mario Kovač (2019) Performance-efficient integration and programming approach of DCT accelerator for HEVC in MANGO platform, Automatika, 60:2, 245–252, DOI: 10.1080/00051144.2019.1618526

To link to this article: https://doi.org/10.1080/00051144.2019.1618526

© 2019 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Published online: 20 May 2019.

AUTOMATIKA
2019, VOL. 60, NO. 2, 245–252
https://doi.org/10.1080/00051144.2019.1618526

REGULAR PAPER

Performance-efficient integration and programming approach of DCT accelerator for HEVC in MANGO platform

Igor Piljić, Leon Dragić and Mario Kovač

Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

CONTACT Igor Piljić [email protected], Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

ABSTRACT
Video encoding based on the novel HEVC standard is an extremely computationally expensive process, and achieving efficient encoding requires intelligent utilization of all available resources, from both the software and the hardware perspective. Profiling and analysis of the encoding process identified the Discrete cosine transform (DCT) as one of the key kernels that consume most of the application's runtime. Therefore, a high-throughput, fully-pipelined hardware accelerator was designed in FPGA and integrated into the MANGO platform. The MANGO platform is a heterogeneous HPC system that consists of different types of nodes, from general-purpose nodes (GN) to heterogeneous nodes (HN). While executing specific kernels on GN nodes is a straightforward process, executing kernels on accelerator-based HNs is a more complex procedure and requires specific integration to successfully exploit the heterogeneous architecture. This paper presents a performance-efficient integration of the DCT hardware accelerator in the MANGO platform, focusing on the performance of the encoder while maintaining coding efficiency and video quality of the encoded bitstream. Several approaches were considered, tested and compared, from a standalone integration where series of single tasks were offloaded to the DCT accelerator, to more complex solutions based on smart buffer utilization.

ARTICLE HISTORY
Received 20 February 2019
Accepted 4 April 2019

KEYWORDS
Video encoding; HEVC; DCT; hardware accelerators; heterogeneous computing; MANGO

Introduction

Latest analyses and statistics show that 82% of global IP traffic will be video traffic by 2022, an increase from 75% in 2017 [1]. Handling and transferring this huge amount of data requires systems that are efficient in terms of performance, power and predictability, and that are able to deliver video content with the desired Quality of Experience (QoE). High-efficiency video coding (HEVC) is one of the latest video coding standards and can achieve up to 50% bitrate reduction compared to the previous standard, Advanced Video Coding (AVC) [2]. However, the computational complexity and resource requirements of HEVC are increased by up to 10 times [3]. To deal with the increased computational complexity it is necessary to intelligently utilize all software and hardware components of the system. Although software algorithms can lead to significant improvements, heterogeneous accelerator-based architectures on high-performance computers can drastically improve the power-performance ratio of the system, so their analysis and exploitation, especially for large-scale video content providers, are necessary.

Custom hardware accelerators can improve performance and lower power consumption; however, efficient utilization of such architectures is a challenging task, and the overhead of data transfer often suppresses the performance benefits gained by the accelerator.

In this paper, we present and compare several approaches for performance-efficient integration of a hardware accelerator for one of the key kernels in the HEVC encoder/decoder: the discrete cosine transform (DCT). We describe the HEVC DCT implementation in the heterogeneous, accelerator-based architecture developed as a part of the MANGO project.

The rest of the paper is organized as follows: the second section describes the motivation for this research, the third section describes the system on which all integrations and tests were conducted, and the fourth section presents the integration and programming approaches. Finally, the performance evaluation is presented in the fifth section.

Motivation and related work

Heterogeneous, accelerator-based architectures provide great potential for increasing the efficiency of compute- and data-intensive applications such as video encoding, from both the performance and the power perspective. However, exploiting such architectures requires a deep understanding of both application requirements and system design, as well as intelligent utilization of all available resources.

Previous research in this field rarely covers integration of accelerators in heterogeneous systems containing different types of processing cores, from general-purpose processing nodes to heterogeneous nodes.


In [4–6], a CPU + GPU heterogeneous platform is used to accelerate the HEVC decoder, either by deploying a subset of functions (IDCT and de-quantization) or by implementing the entire decoder on the GPU. The decoder, however, is computationally much less demanding than the HEVC encoder, which is why our focus is set on improving the encoding process. GPU-based accelerators can offer large performance improvements over the CPU, especially for highly parallelizable algorithms; however, FPGA-based accelerators could provide even more performance- and energy-efficient solutions.

There is a lot of research on specialized hardware accelerators for several different kernels used in HEVC. FPGA-based solutions for intra coding are presented in [7,8]. Architectures for HEVC DCT acceleration introduced in [9–12] show several different approaches for optimizing the integer-based DCT. Interpolation [13] and the deblocking filter [14], as other computationally demanding algorithms, are also subjects of research in this area. However, all of these papers focus on a specific accelerator as a stand-alone module, without measuring the impact of its integration within an existing heterogeneous system.

System

The MANGO heterogeneous high-performance computing system was used as an architecture exploration platform for this research: the high-level encoding algorithm was instantiated on general-purpose processor cores, while the custom-designed HEVC DCT accelerator cores were used for DCT computations. Below we give a more detailed overview of the complete system and the approach.

Platform – MANGO platform

The MANGO platform [15,16] is a heterogeneous high-performance computing system consisting of general-purpose compute nodes (GN) that are intertwined with heterogeneous acceleration nodes (HN), linked by an across-boundary homogeneous interconnect. GNs are based on CPUs (i.e. Intel Xeon), while HNs are FPGA-based next-generation manycore chips coupled with deeply customized heterogeneous computing resources. HNs are open and can be tailored for a specific purpose, which was used to incorporate the DCT accelerator into the MANGO architecture. Figure 1 shows the scheme of one cluster in the MANGO platform with the integrated DCT accelerator.

Figure 1. MANGO platform scheme.

The CPU host is a general node (GN) connected with the heterogeneous node (HN) deployed on four FPGAs. The HN consists of different processing nodes, such as a RISC multicore, a GPU-like core, different accelerator cores and a hardware accelerator for DCT. The topology and number of the accelerators can be adapted to the requirements of the system. For each FPGA, there is one memory controller (MC) that connects the processing cores with local DDR memory. A resource manager located on the CPU host side manages the allocation of computing resources (GNs and HNs) to the multiple concurrent applications that can run on the MANGO platform.

Host – HEVC encoder application

An in-house "clean-room" implementation of the HEVC encoder was used as a base for integration on the host side. The encoder was profiled and analysed to determine the most time-consuming tasks of the application's runtime. The analysis was done on a CPU-only single-core implementation configured for fast encoding, aiming at just-in-time video encoding.

Verification and validation of the HEVC encoded bitstreams were conducted using the StreamEye software [17]. Profiling results were obtained using the Intel Amplifier tool and are shown in Figure 2.

Figure 2. HEVC encoder profiling results.

As can be seen from the figure, the most time-consuming tasks are forward2DTransform and its counterpart forward2DInverseTransform. The main kernel used in both functions is the 2-dimensional Discrete Cosine Transform, which is why the DCT was chosen as the kernel to demonstrate the benefits of offloading to a custom hardware accelerator. Several modifications to the application were made to meet the requirements of the DCT accelerator. Residual samples had to be stored in a 16-bit format, which halved the amount of data transferred between host and accelerator. However, this modification led to additional changes, such as packing and unpacking of samples for the AVX DCT implementation, which had a minor impact on its performance.
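To make the 16-bit packing concrete, the following is a minimal sketch of one way to narrow 32-bit residuals to the 16-bit format with AVX2. It is our illustration, not the authors' encoder code; the function name and loop structure are assumptions.

```cpp
#include <immintrin.h>
#include <cstdint>

// Narrow 32-bit residuals to 16-bit with signed saturation (AVX2 sketch).
// Illustrative only -- the paper does not publish its encoder source.
void pack_residuals_avx2(const int32_t* src, int16_t* dst, int n /* multiple of 16 */) {
    for (int i = 0; i < n; i += 16) {
        __m256i lo = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i));
        __m256i hi = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i + 8));
        // packs works within each 128-bit lane, so the 64-bit groups come out
        // as [lo0-3, hi0-3, lo4-7, hi4-7] ...
        __m256i packed = _mm256_packs_epi32(lo, hi);
        // ... and a cross-lane permute (groups 0,2,1,3) restores linear order.
        packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), packed);
    }
}
```

The extra cross-lane shuffle is exactly the kind of packing/unpacking work the authors note as having a minor impact on the AVX DCT implementation's performance.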
Accelerator – DCT

Based on the HEVC encoder analysis, a custom hardware accelerator for the DCT was designed and implemented in FPGA. Symmetry properties of the 2D DCT transform were exploited to design an area-optimized 1D DCT architecture that can be reused to implement the full 2D core transform. The accelerator architecture is fully pipelined and applicable to all transform sizes used in HEVC (from 4 × 4 to 32 × 32 blocks). After evaluation as a stand-alone module, the DCT accelerator core was integrated as a MANGO HN processing core.
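To illustrate the row-column decomposition the accelerator exploits, below is a plain C++ sketch of the 4 × 4 forward core transform using the HEVC integer coefficient matrix; the shift values assume 8-bit video, as in the HEVC reference model. This sketches the algorithm only, not the pipelined FPGA design.

```cpp
#include <cstdint>

// HEVC 4x4 forward core transform as two identical 1D passes (row-column
// decomposition). kT4 is the HEVC integer DCT matrix; shifts assume 8-bit
// video: shift1 = log2(N) + bitDepth - 9 = 1, shift2 = log2(N) + 6 = 8.
static const int32_t kT4[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

// One 1D pass: transforms each row of `in` and writes the result transposed,
// so applying the same pass twice yields T * X * T^t.
static void dct4_pass(const int32_t in[4][4], int32_t out[4][4], int shift) {
    const int32_t round = 1 << (shift - 1);
    for (int r = 0; r < 4; ++r)
        for (int k = 0; k < 4; ++k) {
            int32_t acc = 0;
            for (int j = 0; j < 4; ++j)
                acc += kT4[k][j] * in[r][j];
            out[k][r] = (acc + round) >> shift;  // transposed store
        }
}

// Full 2D transform of a 4x4 residual block into coefficients.
void dct4x4_forward(const int32_t residual[4][4], int32_t coeff[4][4]) {
    int32_t tmp[4][4];
    dct4_pass(residual, tmp, 1);  // first (horizontal) pass, shift1
    dct4_pass(tmp, coeff, 8);     // second (vertical) pass, shift2
}
```

Reusing one 1D pass with a transposed store is the symmetry that lets a single 1D DCT datapath implement the full 2D core transform in hardware.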
Benchmarks

In the test scenario, the HW DCT accelerator was used for processing DCT tasks. The baselines for the benchmarks were two implementations of the DCT kernel on the Intel GN:

• Single thread
• Single thread + AVX

In addition to the different accelerator types, the MANGO platform enables the use of a group of units working in parallel, either isolated or in collaboration, on a given task. Therefore, we can also identify and exercise the following working modes in MANGO:

• Standalone mode – units work in standalone mode running parts of the tasks. No communication or collaboration is exercised between the host and the unit or between concurrent units.
• Host iterative mode – the unit works in an iterative mode with the host computing parts of the task. The kernel running on the unit is completely synchronized with the host application. All the units work in the same fashion but there are no direct interactions between units.

Due to the nature of the DCT algorithm and the design of the DCT accelerator, the unit collaborative mode, also available in MANGO, was not exercised, and the focus was put on the standalone and host iterative modes.

Integration and programming approach

Standalone approach

The standalone mode was the first working mode in which the HW DCT accelerator was integrated into the MANGO platform. In standalone mode, a single task represents a single matrix with sizes ranging from 32 × 32 down to 4 × 4, which is offloaded to the accelerator that processes the data and returns the transformed matrix to the host. For each task, initialization procedures such as registration of kernels, buffers and events, allocation of resources and transfer of arguments are repeated. In the first phase, initialization of the mango context and kernel function takes place. Depending on the matrix size, input, config and output buffers are registered and added to the task graph shown in Figure 3.

Figure 3. Task graph for standalone integration.

Input data is then written to the buffers and the kernel is started. After this, the host waits (or does some useful work) for the end event which indicates that the accelerator has processed the input data. On the accelerator side, starting the kernel initiates the loop in which the accelerator fetches the data through a 512-bit data bus from MANGO memory. When data is fetched, it is put into the pipeline of the accelerator. With each clock cycle, data propagates through the pipeline until it is written into the local buffer, from which it is stored back to MANGO memory. When all data is stored, the task is finalized by triggering the end event, which notifies the host application that the data is ready to be retrieved from MANGO memory. The application then initiates the memory transfer and, if needed, restarts the above-described cycle with a new set of input data.
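In host code, the standalone cycle looks roughly as follows. This is an illustrative, pseudocode-style sketch: the mango_* identifiers are stand-ins for the mangolibs API described below, whose exact names and signatures are not reproduced here.

```cpp
// Standalone offload of ONE residual block (illustrative sketch; mango_*
// identifiers are stand-ins for mangolibs calls, not the verbatim API).
void dct_standalone(const int16_t* residual, int16_t* transformed, size_t bytes) {
    // Phase 1: repeated for every single task in standalone mode.
    mango_context ctx = mango_init("hevc_encoder", "dct_recipe");   // context
    mango_kernel  dct = mango_register_kernel(ctx, DCT_KERNEL_ID);  // kernel
    mango_buffer  in  = mango_register_buffer(ctx, bytes);          // sized per block
    mango_buffer  cfg = mango_register_buffer(ctx, CFG_BYTES);      // transform config
    mango_buffer  out = mango_register_buffer(ctx, bytes);
    mango_event   end = mango_register_event(ctx);
    mango_allocate_resources(ctx);               // task graph -> HN resources

    mango_write(in, residual, bytes);            // host -> MANGO memory
    mango_start_kernel(dct, in, cfg, out, end);  // accelerator loads, transforms, stores
    mango_wait(end);                             // or do useful host work meanwhile
    mango_read(out, transformed, bytes);         // MANGO memory -> host

    mango_release(ctx);
}
```

Because the whole first phase repeats for every block, its fixed cost dominates; this is the roughly 50% mangolibs share visible in Figure 4.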


To benchmark the performance, the application was run with the DCT kernel offloaded to the HW DCT accelerator operating in the standalone working mode. A single 32 × 32 matrix of 16-bit residuals was transferred to MANGO memory, processed by the accelerator and then returned to the host application. Figure 4 shows the time distribution for processing a single 32 × 32 residual block. The accelerator part corresponds to the time required by the accelerator to load, process and store data to MANGO memory. The host part corresponds to the time spent executing the application on the Intel GN, which consists of filling the buffers with data intended for processing on the accelerator and transferring the data from host to MANGO memory and vice versa.

Figure 4. Time distribution for standalone integration.

As can be seen from the figure, almost 50% of the runtime is spent on the mangolibs initialization, which makes the standalone working mode performance-inefficient. Mangolibs is a software stack that consists of:

• the API provided by the MANGO application library,
• the HN library that implements the communication functions between applications and the resource manager with the FPGA subsystem,
• BOSP – the Barbeque Run-Time Resource Manager [18], and several MANGO unit-specific modules.

Iterative approach

To tackle the problem of the mangolibs overhead, the iterative working mode of the accelerator was proposed. In the iterative working mode, resources are registered and allocated only once for any number of tasks offloaded to the accelerator. Several modifications were necessary to adapt the accelerator to the iterative working mode. Since buffers now get registered only once, buffer sizes have been increased to max_buffer_size, which is defined as 20 MB for the current version of the accelerator. The iterative working mode also required additional synchronization mechanisms, which were implemented in the form of interrupts. The introduced changes are reflected in the task graph structure shown in Figure 5.

Figure 5. Task graph for iterative integration.

The kernel is now started before the data is written to the buffer. After the host transfers the input data to MANGO memory, it issues the interrupt_start which triggers the accelerator to start loading and processing the data. After all the data is processed and stored, the accelerator issues an interrupt back to the host, signifying that the output data is available for reading. The host then reads the data and the cycle ends. This cycle is repeated as long as the host has more data to process; when there is no more data, the host issues the end event which indicates that the accelerator is no longer used by the application. Figure 6 shows the time distribution between the different stages when processing three 32 × 32 blocks in the iterative working mode. As can be seen, the mangolibs overhead is lowered in this case and will continue to drop as the number of blocks for processing increases.

Figure 6. Time distribution for iterative integration.
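The iterative cycle can be sketched in the same illustrative style (the mango_* and interrupt identifiers are again stand-ins, not the verbatim API): registration and allocation are hoisted out of the loop, and only writes, interrupts and reads remain per batch.

```cpp
// Iterative offload of a stream of blocks (illustrative sketch; identifiers
// are stand-ins for mangolibs calls, and BlockStream is a hypothetical
// host-side helper that fills and drains the shared buffers).
void dct_iterative(BlockStream& stream, size_t max_buffer_size /* 20 MB here */) {
    mango_context ctx = mango_init("hevc_encoder", "dct_recipe");
    mango_kernel  dct = mango_register_kernel(ctx, DCT_KERNEL_ID);
    mango_buffer  in  = mango_register_buffer(ctx, max_buffer_size);  // registered once
    mango_buffer  out = mango_register_buffer(ctx, max_buffer_size);  // registered once
    mango_allocate_resources(ctx);
    mango_start_kernel(dct, in, out);                // kernel starts before any data

    while (stream.has_blocks()) {
        size_t n = stream.write_batch(in);           // host -> MANGO memory
        mango_send_interrupt(dct, INTERRUPT_START);  // accelerator loads + processes
        mango_wait_interrupt(dct, INTERRUPT_END);    // output stored in MANGO memory
        stream.read_batch(out, n);                   // MANGO memory -> host
    }
    mango_signal_end(dct);   // end event: accelerator no longer used
    mango_release(ctx);
}
```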
To benchmark the performance of the iterative working mode, 10,000 tasks of 32 × 32, 16 × 16, 8 × 8 and 4 × 4 blocks were offloaded in series to the HW DCT accelerator. The results are shown in Table 1.

Table 1. Time distribution – 10,000 × 2 kB tasks.

Buffer   Block size   Total time   Processing time   Read time   Write time
2 kB     32 × 32      17.68        14.45             1.800       0.406
2 kB     16 × 16      17.67        14.45             1.798       0.406
2 kB     8 × 8        17.69        14.46             1.801       0.406
2 kB     4 × 4        17.69        14.46             1.804       0.407

It is interesting to note that the block size does not affect the time needed to process the data, even though the pipeline for processing 4 × 4 matrices is much shorter than the pipeline for processing 32 × 32 matrices. The reason for this is that the bottleneck for processing time is the time the accelerator spends loading the data and then storing it once the processing is finished. This is a limitation introduced by the fact that the MANGO platform is an architecture exploration platform for HPC rather than a processing-efficient HPC platform itself. The memory bandwidth thus limits the performance of the accelerator, which is an obvious and expected conclusion. However, additional optimizations which can lead to better performance of the accelerator are possible; these are explained in the next section. Since the matrix size does not impact the performance, the rest of the paper considers only 32 × 32 matrices.

Optimizations

Table 1 shows the average time distribution per phase when processing 10,000 instances of 32 × 32 blocks. Write and read time correspond to the time needed to transfer data from the host to MANGO memory and from MANGO memory to the host, respectively. Processing time is the duration from the moment the interrupt_start has been sent to the accelerator to the moment the interrupt_end has been received by the host. Processing time thus includes the time needed for the accelerator to load the data from memory, process it, and store it back. Here we do not address how the total processing time of the accelerator is distributed between memory accesses and data passing through the pipeline. Small buffer sizes usually suffer from significant memory transfer overhead, but in this case almost 90% of the time is attributed to the processing time of the accelerator, as shown in Table 1. However, a detailed analysis of different buffer sizes showed that the processing time is significantly affected by the time the accelerator spends loading and storing the data. The results of this analysis are shown in Table 2.

The total time spent transferring and processing data grows linearly with the increase of the buffer size from 2 kB up to 64 kB. However, when the buffer becomes 64 kB or larger, the average time needed to transfer and process the data shortens significantly, and thus we witness significantly higher efficiency. The explanation lies in the fact that for buffer sizes smaller than 64 kB, data is transferred from the host to MANGO using the item network, while for larger buffers, shared buffers are used. Figure 7 shows the time cost per single block for buffer sizes ranging from 2 to 20,000 kB. The total time cost for transferring and processing a single block in a 60 kB buffer is 14.67 milliseconds, while for a 64 kB buffer it is 0.61 milliseconds. The time cost continues to decline to 0.44 milliseconds for a buffer size of 2000 kB, after which it flatlines. This analysis shows that, to efficiently exploit the accelerator, buffer sizes should be greater than 2 MB. In the accelerator, the maximum buffer size is limited to 20 MB, which is enough to support use cases of transcoding full HD video sequences.
Table 2. Benchmarked performance of the DCT kernel for different I/O buffer sizes; all times are shown in milliseconds.

Buffer size   # of runs   Processing time   Read data time   Write data time   Total time   32 × 32 blocks processed   Total per block
2 kB          100         15.5345           1.8409           0.4142            18.8748      1                          18.8748
4 kB          100         27.7040           2.1224           0.7477            32.1012      2                          16.0506
20 kB         100         138.3400          4.1918           3.3358            150.1410     10                         15.0141
60 kB         100         414.2460          7.6073           9.5798            440.0900     30                         14.6697
64 kB         100         2.6461            8.9927           7.1893            19.4663      32                         0.6083
200 kB        100         4.5282            25.1933          19.0508           49.4060      100                        0.4941
2 MB          100         29.9313           233.6350         176.7760          440.9960     1000                       0.4410
10 MB         100         143.0100          1233.7600        897.8580          2275.2900    5000                       0.4551
20 MB         100         284.5090          2461.4800        1790.8300         4537.5000    10000                      0.4538

Figure 7. Total time for transferring and processing per block.
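In the host encoder, this finding translates into batching residual blocks into one large transfer before invoking the iterative cycle. The following is a sketch under stated assumptions: the submit_batch callback is a hypothetical stand-in for the write/interrupt/read cycle above, and the sizes follow Table 2 and Figure 7.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Batch 32x32 residual blocks (2 kB each as int16) into a >= 2 MB buffer so
// one iterative-mode transfer amortizes the per-transfer cost. submit_batch
// is a hypothetical wrapper for the write/interrupt_start/interrupt_end/read
// cycle; 2 MB is past the knee observed in Figure 7 (1024 blocks per batch).
constexpr size_t kBlockBytes  = 32 * 32 * sizeof(int16_t);  // 2 kB per block
constexpr size_t kBufferBytes = 2 * 1024 * 1024;            // 2 MB staging buffer

void offload_blocks(const std::vector<const int16_t*>& blocks,
                    void (*submit_batch)(const uint8_t*, size_t)) {
    std::vector<uint8_t> staging(kBufferBytes);
    size_t used = 0;
    for (const int16_t* blk : blocks) {
        std::memcpy(staging.data() + used, blk, kBlockBytes);
        used += kBlockBytes;
        if (used == kBufferBytes) {   // buffer full: one large transfer
            submit_batch(staging.data(), used);
            used = 0;
        }
    }
    if (used > 0) submit_batch(staging.data(), used);  // flush the tail
}
```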

Performance evaluation

Performance evaluation was performed for three types of kernels: the HW DCT accelerator, the CPU and the CPU + AVX. In all tests, a buffer size of 20 MB was used and block sizes ranging from 32 × 32 to 4 × 4 were transferred. The results are shown in Table 3.

Table 3. Comparison of the DCT kernel on different processing cores; times in milliseconds.

Core   Block size   Process   Read      Write     Total per block
DCT    32 × 32      284.50    2461.48   1790.83   0.4538
DCT    16 × 16      283.10    2465.58   1790.20   0.1135
DCT    8 × 8        279.32    2451.88   1791.44   0.0283
DCT    4 × 4        286.05    2459.17   1789.57   0.0071
CPU    32 × 32      1981.51   –         –         0.1985
CPU    16 × 16      1016.66   –         –         0.0254
CPU    8 × 8        541.27    –         –         0.0033
CPU    4 × 4        331.89    –         –         0.0005
AVX    32 × 32      580.54    –         –         0.0580
AVX    16 × 16      363.17    –         –         0.0090
AVX    8 × 8        181.70    –         –         0.0011
AVX    4 × 4        220.33    –         –         0.0003

Considering the average time spent processing a single block of data, AVX is the fastest, followed by the CPU and the HW DCT accelerator. Total time consists of processing time, read time and write time. Read and write time represent the time spent transferring data from the host to MANGO and vice versa; they are equal to 0 for the CPU and AVX since in those cases the data never leaves the host. However, read and write time account for a share of over 90% of the total when it comes to the HW DCT accelerator. This can be explained by the MANGO platform being an exploration platform for HPC: the implemented data transfer buses do not exercise real-world scenarios in which high-bandwidth buses are fully exploited. If we take only the processing time into consideration, for 32 × 32 and 16 × 16 blocks the HW accelerator, even running at 40 MHz, outperforms the Intel (running at 3.3 GHz) AVX-optimized implementation. However, for the smaller blocks, 8 × 8 and 4 × 4, the HW DCT accelerator does not provide the fastest results. The main issue that explains this behaviour is the slow memory access from the accelerator to MANGO memory, which is the bottleneck of the accelerator processing time. Because of this bottleneck, the processing times of the HW DCT accelerator for all block sizes are approximately the same, even though the pipeline for processing 4 × 4 blocks is much shorter than the one for processing 8 × 8, 16 × 16 or 32 × 32 blocks. The second fact that needs to be considered is that the accelerator runs on the MANGO architecture with a clock of 40 MHz, while in a real-world scenario it could run at a frequency one order of magnitude higher for an FPGA implementation, and even higher for an ASIC implementation.
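For context on how such per-block numbers can be gathered, the following is a minimal timing harness in the style of Tables 2 and 3. Here run_kernel is a placeholder for any of the three variants (HW DCT offload, CPU, CPU + AVX); it is our illustration, not code from the paper.

```cpp
#include <chrono>

// Average cost per block in milliseconds: run the kernel `runs` times over a
// buffer holding `blocks_per_run` blocks (illustrative harness, assuming a
// caller-supplied run_kernel; mirrors the "Total per block" columns).
template <typename F>
double ms_per_block(F run_kernel, int runs, int blocks_per_run) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        run_kernel();                      // one full write/process/read cycle
    auto t1 = std::chrono::steady_clock::now();
    double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return total_ms / (static_cast<double>(runs) * blocks_per_run);
}
```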
Conclusion

In this paper, we have investigated several approaches for the integration of a custom-designed hardware accelerator for the discrete cosine transform into a novel heterogeneous architecture developed as part of the Horizon 2020 project MANGO: exploring Manycore Architectures for Next-generation HPC systems. The fully pipelined, area-optimized DCT accelerator is used to improve the performance of the compute- and data-intensive video encoding process based on the HEVC standard. Two approaches to accelerator utilization were identified: a standalone and an iterative approach. Each was analysed and benchmarked. The iterative mode proved to be more efficient than the standalone mode due to the high mangolibs initialization overhead. Analysis of the data transfer from MANGO to the host and vice versa showed that, for efficient data transfer, buffer sizes should be 2 MB or larger.

Performance comparison of the different processing cores showed that the DCT HW accelerator, even running at 40 MHz, outperforms the Intel (running at 3.3 GHz) AVX-optimized implementation for block sizes of 16 × 16 or larger when it comes to processing time. However, due to the MANGO platform being an exploration platform for HPC, data transfers and memory access present a bottleneck for maximum utilization of the HW DCT accelerator.

Our future work in this area will include the development and integration of other types of accelerators designed for HEVC video encoding and transcoding. Different types of processing units, besides custom accelerator-based cores, will also be investigated, such as RISC-V and GPU-like cores. The resource manager will be adapted and improved to facilitate management of all integrated modules.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This project has received funding from the European Union's H2020 Future and Emerging Technologies under [grant agreement No. 671668].

ORCID

Igor Piljić http://orcid.org/0000-0003-2345-0322
Leon Dragić http://orcid.org/0000-0002-4558-7269
Mario Kovač http://orcid.org/0000-0002-8365-7002

References

[1] Cisco. Cisco visual networking index: forecast and methodology, 2017–2022. [cited 2018 Nov 26]. Available from: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html
[2] Sullivan GJ, Ohm J-R, Han W-J, et al. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans Circuits Syst Video Technol. 2012;22(12):1649–1668.
[3] Bossen F, Bross B, Sühring K, et al. HEVC complexity and implementation analysis. IEEE Trans Circuits Syst Video Technol. 2012;22(12):1685–1696.
[4] de Souza DF, Roma N, Sousa L. OpenCL parallelization of the HEVC de-quantization and inverse transform for heterogeneous platforms. 2014 22nd European Signal Processing Conference (EUSIPCO); Lisbon; 2014. p. 755–759.
[5] Wang B, et al. Efficient HEVC decoder for heterogeneous CPU with GPU systems. 2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP); Montreal, QC; 2016. p. 1–6.
[6] Ma A, Guo C. Parallel acceleration of HEVC decoder based on CPU + GPU heterogeneous platform. 2017 Seventh International Conference on Information Science and Technology (ICIST); Da Nang; 2017. p. 323–330.
[7] Amish F, Bourennane E. A novel hardware accelerator for the HEVC intra prediction. 2015 IEEE 13th International New Circuits and Systems Conference (NEWCAS); Grenoble; 2015. p. 1–4.
[8] Sjövall P, Viitamäki V, Vanne J, et al. FPGA-powered 4K120p HEVC intra encoder. 2018 IEEE International Symposium on Circuits and Systems (ISCAS); Florence; 2018. p. 1–5.
[9] Meher PK, Park SY, Mohanty BK, et al. Efficient integer DCT architectures for HEVC. IEEE Trans Circuits Syst Video Technol. 2014 Jan;24(1):168–178.
[10] Chatterjee S, Sarawadekar KP. A low cost, constant throughput and reusable 8X8 DCT architecture for HEVC. Proceedings of the IEEE 59th International Midwest Symposium on Circuits and Systems (MWSCAS); Abu Dhabi, United Arab Emirates; 2016 Oct 16–19. p. 1–4.
[11] Bolaños-Jojoa JD, Velasco-Medina J. Efficient hardware design of N-point 1D-DCT for HEVC. Proceedings of the 20th Symposium on Signal Processing, Images and Computer Vision (STSIVA); Bogota, Colombia; 2015 Sept 2–4. p. 1–6.
[12] Abdelrasoul M, Sayed MS, Goulart V. Scalable integer DCT architecture for HEVC encoder. Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI); Pittsburgh, Pennsylvania; 2016 Jul 11–13. p. 314–318.
[13] Diniz CM, Shafique M, Bampi S, et al. A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding. IEEE Trans Comput-Aided Des Integr Circuits Syst. 2015 Feb;34(2):238–251.
[14] Diniz CM, Shafique M, Dalcin FV, et al. A deblocking filter hardware architecture for the high efficiency video coding standard. 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE); Grenoble; 2015. p. 1509–1514.
[15] Flich J, Agosta G, Ampletzer P, et al. MANGO: exploring manycore architectures for next-generation HPC systems. 2017 Euromicro Conference on Digital System Design (DSD); Vienna; 2017. p. 478–485.
[16] Flich J, et al. The MANGO FET-HPC project: an overview. 2015 IEEE 18th International Conference on Computational Science and Engineering; Porto; 2015. p. 351–354.
[17] Elecard. StreamEye software. Available from: https://www.elecard.com/products/video-analysis/streameye
[18] Massari G, Libutti S, Fornaciari W, et al. Resource-aware application execution exploiting the BarbequeRTRM. Proceedings of the 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT); CEUR; 2016. p. 3–7.
