1920 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 12, DECEMBER 2013
Performance Estimation Techniques With MPSoC
Transaction-Accurate Models
De Ma, Rongjie Yan, Kai Huang, Min Yu, Siwen Xiu, Haitong Ge,
Xiaolang Yan, and Ahmed Amine Jerraya
Abstract—Efficient design of multiprocessor system-on-chip
(MPSoC) requires early, fast, and accurate performance estimation
techniques. In this paper, we present new techniques based on
fine-grained code analysis to estimate performance accurately
during simulation of MPSoC transaction-accurate models. First,
a GCC profiling tool is applied in the native simulation process.
Based on the profiling result, an instruction analyzer of the
target CPU architecture is proposed to analyze the cycle cost
of the C code under estimation. In addition, a memory analyzer is
used to further estimate memory access latency, including both
instruction/data cache time cost and global memory access cycles.
Both data and instruction cache models are proposed to estimate
cache miss penalty, and a segment-based strategy is adopted
to update the cache models more efficiently. Furthermore, an
equalized access model is presented to imitate the memory access
behavior of processors for estimating global memory access
latency caused by bus contention and memory bandwidth. We
have applied these techniques to an H.264 decoder application
with different hardware architectures. The experimental results
show that these techniques bring the estimation accuracy of
transaction-accurate models close to that of virtual prototype
models, with a tolerable overhead on simulation speed.
Index Terms—Instruction, memory, multiprocessor system-on-
chip (MPSoC), performance estimation, profiling, transaction-
accurate model.
Manuscript received October 23, 2012; revised July 2, 2013; accepted July
5, 2013. Date of current version November 18, 2013. This work was supported
in part by National Science Foundation of China under Grant 61100074
and Fundamental Research Funds for the Central Universities. This paper
was recommended by Associate Editor Y. Xie. (Corresponding author: Kai
Huang).
D. Ma is with the Key Laboratory of RF Circuits and Systems, Ministry of
Education, Hangzhou Dianzi University, Institute of VLSI Design, Zhejiang
R. J. Yan is with the State Key Laboratory of Computer Science, Institute
K. Huang is with the Institute of VLSI Design, Zhejiang University,
X. L. Yan, M. Yu, and S. W. Xiu are with the Institute of VLSI Design,
H. T. Ge is with Hangzhou C-Sky Micro-system Company, Hangzhou
310012, China (e-mail: haitong_ge@c-sky.com).
A. A. Jerraya is with CEA-LETI, MINATEC, Grenoble Cedex F38054,
Digital Object Identifier 10.1109/TCAD.2013.2275252
I. Introduction
The fast increase of embedded applications makes heterogeneous
multithread multiprocessor system-on-chip (MPSoC) more attractive
to embedded system designers. The integration of more processor
components brings high performance through concurrency and a long
market period through flexible programmability [1], [2]. Because
MPSoC design is naturally processor-centric, and thus
software-centric, the most difficult design challenge in the
programming model is to map application software onto efficient
hardware implementations [3]. The work in [4] introduces a feasible
solution: a programming model with multiple levels of abstraction,
ranging from very abstract, specification-oriented models to
very concrete, cycle-accurate models. As an important abstraction
model, the transaction accurate (TA) level of modeling is
regarded as a way to achieve a good tradeoff between
result accuracy and time cost, which also helps to find the
best match between hardware and software to improve
whole-system performance [5], [6].
A TA model details the local architecture of each subsystem
in the MPSoC and makes the communication protocol explicit
[7]. It allows us to estimate the performance of the whole
system through hardware and software cosimulation. As shown
in Fig. 1, an executable binary of the software stack is built on a
host machine by linking the thread codes and main code with
a hardware-dependent software (HdS) library. On the hardware side,
all components except the CPUs are implemented as
cycle-accurate models in SystemC, using a bus functional
model (BFM) and Linux shared memory (IPC Linux
shm) for interaction, data exchange, and synchronization
between hardware and software elements. The execution time
between two read/write operations is back-annotated to the
BFM and finally accumulated, together with the communication
overhead, into the total clock cycle cost.
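The accumulation described above can be illustrated with a minimal Python sketch. It is not the authors' SystemC implementation: the `Segment` record, its field names, and the cycle values are hypothetical and only show how annotated computation time and bus access time add up to a total cycle count.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of software execution that ends in a read/write access.

    Both fields are hypothetical illustration values, not data from the paper.
    """
    compute_cycles: int  # annotated execution time before the access
    access_cycles: int   # cycles the bus model reports for the access itself


def total_cycles(segments):
    """Add annotated computation time and communication overhead."""
    return sum(s.compute_cycles + s.access_cycles for s in segments)


if __name__ == "__main__":
    trace = [Segment(120, 8), Segment(300, 12), Segment(45, 8)]
    print("total cycles:", total_cycles(trace))
```

In this sketch an error in `compute_cycles` shifts the total directly; in a real cosimulation it would also shift the moment at which each access reaches the bus, which is why a wrong annotation can distort communication time as well.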
For a timed simulation of a TA model, the execution time
(e.g., time1 and time2 in Fig. 1) is obtained in advance
from low-level simulation on a cycle-accurate simulator and
statically inserted into the corresponding read/write function
during software code generation. However, this static time
annotation technique does not account for any variation of
execution time when the stimulus changes [7], [8]. Moreover,
the accuracy of the performance result depends on the given
architecture, i.e., memory architecture, bus protocol, processor
architecture, thread mapping strategy, and so on. The static
technique therefore has a more serious disadvantage: it lacks the
flexibility to estimate different MPSoC architectures efficiently,
which severely limits the design space that can be explored. For
example, the memory architecture is a key factor in the data
access latency that enters the execution time. Without considering
memory architecture details, the execution time of a given
application cannot be evaluated well with the static time
annotation technique alone. When the wrong execution time is
annotated, it further leads to inaccurate communication time,
even if the communication model is cycle-accurate. Therefore,
it is still a challenge to estimate performance more accurately
on a TA model while keeping simulation fast enough for
efficient design space exploration.
0278-0070 © 2013 IEEE
Fig. 1. Hardware and software cosimulation with transaction accurate models.
In this paper, we focus on how to improve the accuracy
of TA models with little loss of speed. The performance of
an application depends on both static and dynamic aspects
[9]. Static timing sources, which can be analyzed without
simulation, are mainly determined by the instruction
types of the program and the memory type of the system. The
dynamic aspect depends on various factors, e.g., loops, branches,
and cache hits/misses, which can only be measured by simulation.
This paper presents new techniques that consider both static
sources and dynamic factors to estimate performance accurately,
based on fine-granularity code analysis during MPSoC TA model
simulation. For the static aspect, we use gcov [10], a
standard utility shipped with GCC, to test code coverage in the
application software and collect basic performance statistics. We
also take advantage of native simulation to handle dynamic
factors. Finally, the analyzing process combines the dynamic
factors with the static statistics to generate an accurate
performance estimate from TA model simulation, allowing fast and
accurate exploration of hardware and software architectures.
The main contribution of this paper is a method that combines
dynamic simulation with statistical analysis to evaluate the
performance of the target MPSoC platform in a TA model. The
method generates the transaction model with profiling API
functions from a Simulink system-level model and measures the
performance of the whole system with accurate execution time and
communication overhead. The second contribution is the use of a
GCC profiling tool together with an instruction analyzer of the
target CPU architecture to calculate the accurate cycle cost of
the given C code during dynamic simulation. The third
contribution is a memory analyzer that further estimates memory
access latency, including instruction and data cache access time
cost. Furthermore, we propose an equalized access model (EAM)
that imitates the memory access behavior of processors to
estimate the global memory access latency caused by bus
contention and memory bandwidth. Experimental results with an
H.264 decoder application on target MPSoC platforms demonstrate
the efficiency of the proposed methods.
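To make the idea of a cache model that yields a miss penalty concrete, the following Python sketch simulates a small direct-mapped cache over an address trace. It is an assumption-laden illustration, not the segment-based cache model or the EAM proposed in the paper: the cache geometry, `hit_cycles`, and `miss_penalty` are hypothetical values.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model (hypothetical geometry)."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # one stored tag per cache line

    def access(self, addr):
        """Return True on a hit; on a miss, load the line and return False."""
        block = addr // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False


def memory_cycles(cache, addresses, hit_cycles=1, miss_penalty=20):
    """Estimate access latency for a trace; the cycle costs are assumed."""
    cycles = 0
    for addr in addresses:
        hit = cache.access(addr)
        cycles += hit_cycles if hit else hit_cycles + miss_penalty
    return cycles


if __name__ == "__main__":
    cache = DirectMappedCache(num_lines=4, line_size=16)
    trace = [0, 4, 8, 64, 0]  # addresses 0 and 64 map to the same line
    print("estimated memory cycles:", memory_cycles(cache, trace))
```

The trace deliberately includes two addresses that conflict on one cache line, so the estimate depends on access order and not only on the number of accesses.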
II. Related Work
The trend in MPSoC architecture is to integrate more
heterogeneous processors, which greatly extends the design space.
A key step of architecture exploration is to efficiently estimate
the performance of an application running on those heterogeneous
processor architectures. The current literature offers a large
set of references dealing with fast and accurate performance
estimation techniques. Most of these techniques fall
into two categories: static analysis and dynamic simulation.
Static analysis is able to provide fast estimation with low
execution effort. There are many analytical techniques based
on static analysis of software codes or models, which consider
all possible paths in the control flow graph (CFG) and use
formal analytical models to represent a system as a network
of nodes exchanging streams. They are usually employed to
calculate the worst-case execution time (WCET) [11] for
real-time systems. The model of Li and Malik [12] computes a
tight bound of WCET for the instruction cache for embedded
software performance estimation. Even though a purely analytical
method guarantees system performance, the estimation results
obtained by analytical techniques are usually too pessimistic,
thus leading to over-provisioning or under-utilization of
resources.
Simulation-based techniques are widely used for both
functional verification and performance estimation. A common
approach for execution-driven simulation is to employ a
cycle-accurate architecture model with an instruction set
simulator (ISS) (e.g., ConvergenSC [13], Realview [14], MPARM [15]).
An ISS reads the code compiled for a target platform and
executes the instructions using the target processor model.
The ISS model can have several levels of accuracy according to
the level of the model, e.g., instruction level model,
transaction level model, or register transfer model. Even though
execution-driven simulation is reasonably accurate, it is often
too slow to be used for MPSoC design space exploration.

Fig. 2. Overview of the estimation method.
To speed up simulation, some hardware-assisted simulation
approaches take advantage of field-programmable gate
array (FPGA) techniques by splitting the simulation tasks
between an FPGA and collaborating software modules [16]–[18].
With fast simulation speed, these approaches are able to
run an unmodified OS to obtain more accurate performance
results for the whole system. However, compared with a software
simulation approach, a hardware-assisted simulator suffers
from hardware limitations (FPGA resources and frequency),
and is hard to debug and validate.
Recently, native simulation approaches have been proposed
to achieve a good tradeoff among simulation speed, accuracy,
and retargetability. In native simulation, software
runs natively on the host machine and its execution time is
calculated through annotation or analysis techniques. Some
static annotation-based methods [7] first collect execution
samples that contain architectural events (e.g., execution cycles,
memory accesses) of each processor under a given architecture
configuration. The samples are then used as inputs and
annotated to the native simulation. However, the estimation
results of these approaches are not accurate when the stimulus
inputs differ. Some improved approaches [19]–[21] insert
delay functions into software codes and calculate the total
delay during native simulation. These approaches use the
assembly code from a target cross-compiler and the processor
datasheet to generate delay information for each statement or
basic block. The delay information is then annotated to the
raw software model to generate a delay-annotated software
model. They can provide more accurate performance results than
static annotation and run faster than an ISS. However, the
cache and pipeline models are coarse and hardware behaviors
are not handled well. In addition, the software codes have to be
modified extensively with fine-grained delay annotation functions.
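The delay-annotation style used by these earlier approaches can be sketched as follows. This is a generic illustration in Python rather than code from [19]–[21]: the `DelayCounter` class, the block labels, and the per-block cycle costs are all hypothetical stand-ins for values that would come from target assembly code and a processor datasheet.

```python
class DelayCounter:
    """Accumulates the annotated delay during native execution."""

    def __init__(self):
        self.cycles = 0

    def consume(self, n):
        self.cycles += n


def annotated_abs_sum(values, clock):
    """Sum of absolute values, with one delay call per basic block.

    The cycle costs below are made-up illustration values.
    """
    clock.consume(2)          # block B0: initialise the accumulator
    total = 0
    for v in values:
        clock.consume(4)      # block B1: loop test, load, add
        if v < 0:
            clock.consume(1)  # block B2: negate (taken only for negatives)
            v = -v
        total += v
    clock.consume(1)          # block B3: return
    return total


if __name__ == "__main__":
    for stimulus in ([3, -1, 4], [-1, -2]):
        clock = DelayCounter()
        result = annotated_abs_sum(stimulus, clock)
        print(stimulus, "->", result, "in", clock.cycles, "cycles")
```

Because the delay calls sit on the executed path, the total follows the stimulus, which is the advantage over static annotation. The sketch also shows the cost noted above: every basic block of the original code has to be touched.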
Our work combines dynamic simulation with statistical
analysis for efficient performance estimation, with high
accuracy at a low cost in simulation speed. The proposed method
is similar to the native simulation approaches mentioned above.
Compared with previous work, only a small amount of extra code
is added to the original software model to dynamically collect
delay information during native simulation. Meanwhile, two
techniques for instruction and memory analysis are used to
correct the delay result obtained from the dynamic information,
which improves the accuracy of the final estimate of software
execution cycles. Furthermore, we feed the code execution cycles
to the TA model to bring its performance closer to that of its
cycle-accurate virtual prototype (VP).
III. Proposed Estimation Method
To achieve faster estimation speed, we adopt a new native
simulation strategy for a TA model that calculates an accurate
performance result dynamically, based on fine-granularity code
analysis. The basic idea is to use gcov [10] to obtain the
profiling result of C statements during simulation, and then to
analyze the recorded number of executions of each statement
against the model of its target platform to calculate the
execution time. The total cycle count of a given code is fed to
the TA model in SystemC to estimate the performance of the whole
system. Throughout this paper, we use the TA model generated in
[5] and [8] to explain how the proposed estimation method works.
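The basic idea can be sketched in a few lines of Python: read per-line execution counts from gcov-style annotated output and weight them with a per-line cycle cost. This is only an illustration under stated assumptions. The sample text imitates the usual `count:line:source` layout of a `.gcov` file, and the `CYCLES_PER_LINE` table is hypothetical; in the paper these costs come from the instruction analyzer of the target CPU, which is not reproduced here.

```python
SAMPLE_GCOV = """\
        -:    0:Source:demo.c
        1:    1:int sum(int n) {
        1:    2:    int s = 0;
        1:    3:    if (n < 0)
    #####:    4:        return -1;
      101:    5:    for (int i = 0; i < n; i++)
      100:    6:        s += i;
        1:    7:    return s;
        -:    8:}
"""

# Hypothetical target-cycle cost per source line (not from the paper).
CYCLES_PER_LINE = {2: 1, 3: 2, 4: 2, 5: 3, 6: 2, 7: 2}


def parse_gcov(text):
    """Map source line number -> execution count.

    Lines marked '-' (no code) or '#####' (never executed) are skipped.
    """
    counts = {}
    for raw in text.splitlines():
        parts = raw.split(":", 2)
        if len(parts) < 3:
            continue
        count = parts[0].strip().rstrip("*")
        lineno = parts[1].strip()
        if count.isdigit() and lineno.isdigit():
            counts[int(lineno)] = int(count)
    return counts


def estimate_cycles(counts, cycles_per_line):
    """Weight each execution count by the assumed cost of its line."""
    return sum(n * cycles_per_line.get(line, 0) for line, n in counts.items())


if __name__ == "__main__":
    counts = parse_gcov(SAMPLE_GCOV)
    print("estimated cycles:", estimate_cycles(counts, CYCLES_PER_LINE))
```

The execution counts capture the dynamic behavior of one particular run, while the cost table captures the static, target-dependent part, so retargeting in this sketch only requires a different table.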
The process of execution time estimation consists of two
stages (shown in Fig. 2): 1) a profiling stage, which generates C
code with the profiling API and collects the profiled results of
the code during simulation on a host machine (steps P1–P4 of
Fig. 2), and 2) an analyzing stage, which analyzes the profiled
results according to the architecture of the target platform and
calculates the execution time of the code under estimation
(steps A1–A7 of Fig. 2). In the first stage, the
code generator of the Simulink-based MPSoC platform [5]
generates multithread codes from the Simulink model for the TA
model (step P1). During code generation, four profiling API
functions are inserted into the generated multithread codes to
support run-time performance estimation. Function ppapi_init
is used to fork a child process for performance profiling.