1920 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 12, DECEMBER 2013

Performance Estimation Techniques With MPSoC Transaction-Accurate Models

De Ma, Rongjie Yan, Kai Huang, Min Yu, Siwen Xiu, Haitong Ge, Xiaolang Yan, and Ahmed Amine Jerraya

Abstract—Efficient design of multiprocessor system-on-chip (MPSoC) requires early, fast, and accurate performance estimation techniques. In this paper, we present new techniques based on fine-grained code analysis to estimate accurate performance during simulation of MPSoC transaction-accurate models. First, a GCC profiling tool is applied in the native simulation process. Based on the profiling result, an instruction analyzer of the target CPU architecture is proposed to analyze the cycle cost of the C code under estimation. In addition, a memory analyzer is used to further estimate memory access latency, including both instruction/data cache time cost and global memory access cycles. Both data and instruction cache models are proposed to estimate cache miss penalty, and a segment-based strategy is adopted to update the cache models more efficiently. Furthermore, an equalized access model is presented to imitate the memory access behavior of processors for estimating global memory access latency caused by bus contention and memory bandwidth. We have applied these techniques on an H.264 decoder application with different hardware architectures. The experimental results show that applying these techniques brings the estimation accuracy of transaction-accurate models close to that of the virtual prototype models, with a tolerable overhead on simulation speed.

Index Terms—Instruction, memory, multiprocessor system-on-chip (MPSoC), performance estimation, profiling, transaction-accurate model.

I. Introduction

The fast increase of embedded applications makes heterogeneous multithread multiprocessor system-on-chip (MPSoC) more attractive to embedded system designers. The integration of more processor components brings high performance through concurrency and a long market period through flexible programmability [1], [2]. Because MPSoC design is naturally processor-centric, and thus software-centric, the most difficult design challenge in the programming model is to map application software into efficient hardware implementations [3]. The work in [4] introduces a feasible solution: a programming model with multiple levels of abstraction, ranging from very abstract, specification-oriented models to very concrete, cycle-accurate models. As an important abstraction model, the transaction-accurate (TA) level of modeling is thought to achieve a good tradeoff between result accuracy and time cost, which also helps to find the best matches between hardware and software to improve the whole system performance [5], [6].

Manuscript received October 23, 2012; revised July 2, 2013; accepted July 5, 2013. Date of current version November 18, 2013. This work was supported in part by National Science Foundation of China under Grant 61100074 and Fundamental Research Funds for the Central Universities. This paper was recommended by Associate Editor Y. Xie. (Corresponding author: Kai Huang.)

D. Ma is with the Key Laboratory of RF Circuits and Systems, Ministry of Education, Hangzhou Dianzi University, Institute of VLSI Design, Zhejiang University, Hangzhou 310037, China (e-mail: [email protected]).

R. J. Yan is with the State Key Laboratory of Computer Science, Institute of Software, Beijing 100090, China (e-mail: [email protected]).

K. Huang is with the Institute of VLSI Design, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

X. L. Yan, M. Yu, and S. W. Xiu are with the Institute of VLSI Design, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]; [email protected]).

H. T. Ge is with Hangzhou C-Sky Micro-system Company, Hangzhou 310012, China (e-mail: haitong_ge@c-sky.com).

A. A. Jerraya is with CEA-LETI, MINATEC, Grenoble Cedex F38054, France (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCAD.2013.2275252

A TA model details the local architecture of each subsystem in MPSoC and makes the communication protocol explicit [7]. It allows us to estimate the performance of the whole system through hardware and software cosimulation. As shown in Fig. 1, a software stack executable binary is built on a host machine by linking the thread codes and main code with a hardware-dependent software (HdS) library. For hardware, all components except the CPUs are implemented with cycle-accurate models in SystemC, making use of a bus functional model (BFM) and Linux shared memory (IPC Linux shm) for the interaction, data, and synchronization exchange between hardware and software elements. The execution time between two read/write operations is back annotated to the BFM and finally added to the total clock cycle cost together with the communication overhead.
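The back-annotation idea can be illustrated with a minimal sketch. It is not the SystemC BFM used in the paper: the class name, the method names, and the fixed per-word transfer cost are all assumptions made for illustration, and a real TA model would obtain the communication cost from cycle-accurate hardware models rather than from a constant.

```python
class BusFunctionalModel:
    """Toy stand-in for a BFM that accumulates clock cycles (hypothetical)."""

    def __init__(self, cycles_per_word):
        # Assumption: every transferred word costs the same number of cycles.
        self.cycles_per_word = cycles_per_word
        self.total_cycles = 0

    def annotate(self, compute_cycles):
        # Back-annotate the computation time spent since the last transfer.
        self.total_cycles += compute_cycles

    def transfer(self, words):
        # Account for the communication overhead of one read or write.
        self.total_cycles += words * self.cycles_per_word


bfm = BusFunctionalModel(cycles_per_word=4)
bfm.annotate(120)  # "time1": computation before the first write
bfm.transfer(8)    # write 8 words
bfm.annotate(300)  # "time2": computation before the next read
bfm.transfer(2)    # read 2 words
print(bfm.total_cycles)
```

The total mixes annotated computation cycles with communication cycles, so an error in the annotated values shifts every later transfer in time.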

For a timed simulation of a TA model, the execution time (e.g., time1, time2 shown in Fig. 1) is obtained in advance from low-level simulation on a cycle-accurate simulator and statically inserted into the corresponding read/write function during software code generation. However, this static time annotation technique for performance estimation does not consider any variation of execution time when the stimulus changes [7], [8]. Moreover, the accuracy of the performance result depends on the given architecture, i.e., memory architecture, bus protocol, processor architecture, thread mapping strategy, and so on. Thus, a more serious disadvantage of this static technique is its lack of flexibility to efficiently estimate different MPSoC architectures, which severely limits the design space. For example, the memory architecture is a key factor in the data access latency used to calculate the execution time. Without considering memory architecture details, the execution time of a given application cannot be evaluated well with the static time annotation technique alone. When the wrong execution time is annotated, it further leads to inaccurate communication time even if the communication model is cycle-accurate. Therefore, it is still a challenge to estimate more accurate performance on a TA model while keeping fast simulation speed for more efficient design space exploration.

Fig. 1. Hardware and software cosimulation with transaction accurate models.

0278-0070 © 2013 IEEE

In this paper, we focus on how to improve the accuracy of TA models with less speed loss. The performance of an application depends on both static and dynamic aspects [9]. Static timing sources, which can be analyzed without simulation, are mainly determined by the instruction types of the program and the memory type of the system. The dynamic aspect depends on various factors, e.g., loops, branches, and cache hits/misses, which can only be measured with simulation. This paper presents new techniques that consider both static sources and dynamic factors to estimate accurate performance based on fine-granularity code analysis during MPSoC TA model simulation. For the static aspect, we use gcov [10], a standard utility shipped with GCC, to test code coverage in application software and obtain some basic performance statistics. We also take advantage of native simulation to handle dynamic factors. Finally, the analyzing process combines the dynamic factors with the static statistics to generate an accurate performance estimate from TA model simulation, allowing fast and accurate hardware and software architecture exploration.

The main contribution of this paper is the introduction of a method combining dynamic simulation with statistical analysis to evaluate the performance of the target MPSoC platform in a TA model. It is used to generate the transaction model with profiling API functions from a Simulink system-level model and to measure the performance of the whole system with accurate execution time and communication overhead. The second contribution is the use of a GCC profiling tool with an instruction analyzer of the target CPU architecture to calculate the accurate cycle cost of the given C code during dynamic simulation. The third contribution is the application of a memory analyzer to further estimate memory access latency, including instruction and data cache access time cost. Furthermore, we propose an equalized access model (EAM) to imitate the memory access behavior of processors and estimate the global memory access latency caused by bus contention and memory bandwidth. Experimental results with an H.264 decoder application on target MPSoC platforms demonstrate the efficiency of the proposed methods.
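To make the notion of a cache miss penalty concrete, the sketch below counts misses of a generic direct-mapped cache over an address trace and multiplies them by a fixed penalty. It is only an illustration under stated assumptions: the paper's own data and instruction cache models and its segment-based update strategy are not described in this excerpt, and the function name, cache geometry, and penalty value here are hypothetical.

```python
def estimate_cache_penalty(addresses, num_lines, line_size, miss_penalty):
    """Return (misses, penalty_cycles) for a direct-mapped cache (illustrative)."""
    tags = [None] * num_lines  # one stored tag per cache line, None means empty
    misses = 0
    for addr in addresses:
        block = addr // line_size   # memory block containing this address
        index = block % num_lines   # cache line the block maps to
        tag = block // num_lines    # identifies which block occupies the line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag       # replace the line on a miss
    return misses, misses * miss_penalty


# Hand-written address trace; 0x00 and 0x80 conflict on the same cache line.
trace = [0x00, 0x04, 0x20, 0x00, 0x80, 0x00]
misses, penalty = estimate_cache_penalty(
    trace, num_lines=4, line_size=32, miss_penalty=10
)
print(misses, penalty)
```

The same access sequence can yield a different penalty under another cache geometry, which is one reason a fixed, pre-measured execution time does not transfer between memory architectures.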

II. Related Work

The trend of MPSoC architecture is to integrate more heterogeneous processors, which greatly extends the design space. A key step of architecture exploration is to efficiently estimate the performance of an application running on those heterogeneous processor architectures. Current literature offers a large set of references dealing with fast and accurate performance estimation techniques. Most of these techniques can be divided into two categories: static analysis and dynamic simulation.

Static analysis is able to provide fast estimation with low execution effort. There are many analytical techniques based on static analysis of software codes or models, which consider all possible paths in the control flow graph (CFG) and use formal analytical models to represent a system as a network of nodes exchanging streams. They are usually employed to calculate the worst-case execution time (WCET) [11] for real-time systems. The model of Li and Malik [12] computes a tight bound of WCET for the instruction cache for embedded software performance estimation. Even though a pure analysis method guarantees system performance, the estimation results obtained by analytical techniques are usually too pessimistic, thus leading to over-provisioning or under-utilization of resources. Simulation-based techniques are widely used for both functional verification and performance estimation. A common approach for execution-driven simulation is to employ a cycle-accurate architecture model with an instruction set simulator (ISS) (e.g., ConvergenSC [13], Realview [14], MPARM [15]). An ISS reads the code compiled for a target platform and executes the instructions by using the target processor model. The ISS model can have several levels of accuracy according to different levels of models, e.g., instruction level model, transaction level model, or register transfer model. Even though the execution-driven simulation method is reasonably accurate, it is often too slow to be used for MPSoC design space exploration.

Fig. 2. Overview of the estimation method.

To speed up simulation, some hardware-assisted simulation approaches take advantage of field-programmable gate array (FPGA) techniques by splitting the simulation tasks between FPGA and collaborating software modules [16]–[18]. With fast simulation speed, these approaches are able to run an unmodified OS to obtain more accurate performance results of the whole system. However, compared with the software simulation approach, a hardware-assisted simulator suffers from hardware limitations (FPGA resources and frequency), and is hard to debug and validate.

Recently, native simulation approaches have been proposed to achieve a good tradeoff among simulation speed, accuracy, and retargeting ability. In native simulation, software runs on the host machine natively and its execution time is calculated through annotation or analysis techniques. Some static annotation-based methods [7] first collect execution samples that contain architectural events (e.g., execution cycles, memory accesses) of each processor under a given architecture configuration. Then, the samples are used as inputs and annotated to native simulation. However, the estimation results of these approaches are not accurate when the stimulus inputs differ. Some improved approaches [19]–[21] insert delay functions into software codes and calculate the total delay during native simulation. These approaches use the assembly code from a target cross-compiler and its processor datasheet to generate delay information for each statement or basic block. Then, the delay information is annotated to the raw software model to generate a delay-annotated software model. They can provide more accurate performance than static annotation and run faster than an ISS. However, the cache and pipeline models are coarse and hardware behaviors are not handled well. In addition, the software code has to be heavily modified with fine-grained delay annotation functions.
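A delay-annotated software model of the kind described for [19]–[21] can be sketched as follows. The per-block cycle costs are invented for illustration (in those approaches they would come from the cross-compiled assembly and the processor datasheet), and the function and table names are hypothetical.

```python
# Hypothetical per-basic-block cycle costs, assumed to be derived offline.
BLOCK_DELAY = {"entry": 3, "loop_body": 7, "odd_branch": 4, "exit": 2}


def annotated_sum_of_odds(values):
    """Sum the odd values while accumulating the annotated delay."""
    delay = BLOCK_DELAY["entry"]
    total = 0
    for v in values:
        delay += BLOCK_DELAY["loop_body"]       # executed once per iteration
        if v % 2:
            delay += BLOCK_DELAY["odd_branch"]  # executed only on this branch
            total += v
    delay += BLOCK_DELAY["exit"]
    return total, delay


print(annotated_sum_of_odds([1, 2, 3]))
print(annotated_sum_of_odds([2, 4, 6]))
```

The accumulated delay follows the executed path, so it changes with the input. The sketch also shows the cost noted above: every basic block of the original code needs an inserted annotation statement.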

Our work takes advantage of a method combining dynamic simulation with statistical analysis for efficient performance estimation with high accuracy at a low cost in simulation speed. The proposed method is similar to the native simulation approaches mentioned above. Compared with previous work, only a small amount of extra code is added to the original software model for dynamically collecting delay information during native simulation. Meanwhile, two techniques for instruction and memory analysis are used to correct the delay result obtained from the dynamic information, which improves the accuracy of the final estimation of software execution cycles. Furthermore, we feed the code execution cycles to the TA model to bring its performance closer to that of its cycle-accurate virtual prototype (VP).

III. Proposed Estimation Method

To achieve faster estimation speed, we adopt a new native simulation strategy for a TA model to calculate its accurate performance result dynamically based on fine-granularity code analysis. The basic idea is to use gcov [10] to obtain the profiling result of C statements during simulation, and then analyze the resulting execution count of each statement with its target platform model to calculate the execution time. The total cycle count of a given code is fed to the TA model in SystemC to estimate the performance of the whole system. Throughout this paper, we use the TA model generated in [5] and [8] to explain how the proposed estimation method works.
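The basic idea, execution counts from gcov multiplied by a per-statement cost on the target, can be sketched as below. This is a simplified illustration, not the paper's instruction analyzer: the sample listing is hand-written in the annotated-source layout that gcov produces (execution count, line number, source text), and the per-line cycle costs and function names are assumptions.

```python
import re

# One annotated-source line: "<count>:<line number>:<source text>"
GCOV_LINE = re.compile(r"^\s*([^:]+):\s*(\d+):(.*)$")


def parse_gcov(text):
    """Return {line_number: execution_count} for executable source lines."""
    counts = {}
    for raw in text.splitlines():
        m = GCOV_LINE.match(raw)
        if not m:
            continue
        count, lineno = m.group(1).strip(), int(m.group(2))
        if count == "-" or lineno == 0:
            continue  # non-executable line or listing header
        digits = count.rstrip("*")
        # Markers such as "#####" denote lines that were never executed.
        counts[lineno] = int(digits) if digits.isdigit() else 0
    return counts


def estimate_cycles(counts, cycles_per_line):
    """Sum count * cost over all lines; lines without a known cost add zero."""
    return sum(n * cycles_per_line.get(line, 0) for line, n in counts.items())


SAMPLE = """\
        -:    0:Source:demo.c
        -:    1:int sum(int n) {
        1:    2:    int s = 0;
       11:    3:    for (int i = 0; i < n; i++)
       10:    4:        s += i;
        1:    5:    if (n < 0)
    #####:    6:        s = -1;
        1:    7:    return s;
        -:    8:}
"""

# Hypothetical cycle cost of each source line on the target CPU.
COST = {2: 1, 3: 3, 4: 2, 5: 2, 6: 1, 7: 2}

counts = parse_gcov(SAMPLE)
print(counts)
print(estimate_cycles(counts, COST))
```

A per-line cost table is the crudest possible target model. The paper instead analyzes the instructions of the target CPU and adds memory access latency on top.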

The process of execution time estimation consists of two stages (shown in Fig. 2): 1) a profiling stage, which generates C code with the profiling API and collects the profiled results of the code during simulation on a host machine, as shown in steps P1–P4 of Fig. 2, and 2) an analyzing stage, which analyzes the profiled results according to the architecture of the target platform and calculates the execution time of the code under estimation, as shown in steps A1–A7 of Fig. 2. In the first stage, the code generator of the Simulink-based MPSoC platform [5] generates multithread codes from the Simulink model for the TA model (Step P1). During code generation, four profiling API functions are inserted into the generated multithread codes to support run-time performance estimation. Function ppapi_init is used to fork a child process for performance profiling.
