Embedded System
Outline
Thumb Design Philosophy
Code Density
Switching from ARM to Thumb
ARM Organization and Implementation
Pipeline
Performance using pipelining
Assume that the times required for the five functional units, which operate in each of the five cycles,
are as follows: 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Assume that pipelining adds 1 ns of overhead. How
much speedup in the instruction execution rate will be gained from a pipeline?
(Assume pipeline is full)
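A sketch of the standard calculation (assuming the non-pipelined machine needs the sum of the five stage times per instruction):
Pipeline clock cycle = slowest stage + overhead = 10 + 1 = 11 ns
Non-pipelined time per instruction = 10 + 8 + 10 + 10 + 7 = 45 ns
With the pipeline full, one instruction completes every 11 ns, so speedup = 45 / 11 ≈ 4.1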
Performance using pipelining
Assume an unpipelined machine takes 100 ns to execute an instruction. Assume that the time required
by each of the five functional units, which operate in each of the five cycles, is 20 ns. Determine the
speedup ratio of the pipeline for executing 1000 instructions.
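One way to work it (using the same (k + n – 1) formula as the D1/D2 solution later in these slides):
Pipeline clock cycle = 20 ns
Pipelined time for 1000 instructions = (5 + 1000 – 1) × 20 = 20,080 ns
Non-pipelined time for 1000 instructions = 1000 × 100 = 100,000 ns
Speedup = 100,000 / 20,080 ≈ 4.98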
Performance using pipelining
• Consider two designs D1 and D2 for a synchronous pipelined processor. D1 has a 5-stage pipeline with stage
execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has an 8-stage pipeline, each stage with a 2 ns
execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?
• The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage (with delay 800
picoseconds) is replaced with a functionally equivalent design involving two stages with respective delays 600
and 350 picoseconds. Calculate the throughput increase of the pipeline in percent.
• Consider a pipeline having 4 phases with durations of 60, 50, 90 and 80 ns. The latch delay is 10 ns. Calculate the
following:
a) Pipeline cycle time
b) Non-pipeline execution time
c) Speed up ratio
d) Pipeline time for 1000 tasks
e) Sequential time for 1000 tasks
f) Throughput
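Worked sketches for the last two problems (the D1/D2 problem is solved on the next slide):
Throughput problem: original clock cycle = slowest stage = 800 ps; after splitting the first stage the delays are 600, 350, 500, 400 and 300 ps, so the new clock cycle = 600 ps. Throughput increase = (800 / 600 – 1) × 100 ≈ 33.3%.
Four-phase problem:
a) Pipeline cycle time = max(60, 50, 90, 80) + 10 = 100 ns
b) Non-pipeline execution time per task = 60 + 50 + 90 + 80 = 280 ns
c) Speedup ratio = 280 / 100 = 2.8 (for 1000 tasks: 280,000 / 100,300 ≈ 2.79)
d) Pipeline time for 1000 tasks = (4 + 1000 – 1) × 100 = 100,300 ns
e) Sequential time for 1000 tasks = 1000 × 280 = 280,000 ns
f) Throughput = 1000 / 100,300 ≈ 0.00997 tasks per ns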
Performance using pipelining
• Consider two designs D1 and D2 for a synchronous pipelined processor. D1 has a 5-stage pipeline with stage
execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has an 8-stage pipeline, each stage with a 2 ns
execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?
For D1:
k = 5 and n = 100
Maximum stage delay = 4 ns, so the clock cycle = 4 ns
Total execution time = (5 + 100 – 1) * 4 = 416 ns
For D2:
k = 8 and n = 100
Each clock cycle = 2 ns
Total execution time = (8 + 100 – 1) * 2 = 214 ns
Time saved with D2 = 416 – 214 = 202 ns
Structural Hazard
Data Hazard
The use of the result of the ADD instruction causes a hazard, since the register
is not written until after the following instructions read it.
Data Hazards
• Eliminate the stalls for the hazard involving SUB and AND instructions
using a technique called forwarding
Data Hazards: Forwarding
• Principle of locality: Data most recently used is very likely to be accessed again in the near future
• Keep recently-used data in fastest memory (smaller memory close to CPU)
• Keep data not-used-recently in slower memory (larger memory farther away from CPU)
Memory Hierarchy
Working of Cache Memory
• Cache is first level of memory hierarchy
• Cache memory: Small, fast memory
close to CPU (holding most recently
used data/code)
• Cache hit: If CPU finds requested data
(referenced by a program) in cache
memory
• Cache miss: If CPU doesn’t find
requested data in cache memory
• When a cache miss happens, the block of
data containing the requested data is
retrieved from main memory and placed
in the cache
Cache hit, cache miss, page fault
• Temporal locality: retrieved data is likely to be used
again in near future
• Spatial locality: high probability that other data within
the block will be used soon
• Cache miss handled by hardware. CPU is stalled until
requested data available
• Page fault: If CPU doesn’t find requested data in
either the cache or main memory
• Virtual address space broken into multiple pages
• When page fault happens, page containing the
requested data is retrieved from disc memory and
placed in main memory
• Page fault is handled by software. CPU is not stalled but
is switched to another task until the requested data is available
Cache design issues
Properties of Cache Memory
Cache mapping techniques with main memory
Cache is volatile and empty when power is switched on.
Based upon locality of reference, instructions and data are picked up and placed
in the cache using one of three main placement/mapping techniques.
Direct Mapped: a block from main memory (MM) is always mapped to one particular place in the
cache
Fully Associative: any block of MM can be mapped to any place (block frame) in the
cache
Set Associative: combines the effects of direct mapping and fully associative mapping; a block
maps to one particular set, but may occupy any block frame within that set
Where can a block be placed in a cache?
Definitions
dirty bit: This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is
clean, the block is not written on a miss, since the lower level has identical information to the cache.
valid bit: When power is first turned on, the cache contains no valid data. A control bit, usually called the valid bit, must be
provided for each cache block to indicate whether the data in that block are valid.
The valid bits of all cache blocks are set to 0 when power is initially applied to the system. The processor fetches data from a
cache block only if its valid bit is equal to 1. The use of the valid bit in this manner ensures that the processor will not fetch
stale data from the cache.
Reference or use bit: To help the operating system estimate LRU, many machines provide a use bit or reference bit, which is
set whenever a page is accessed.
Locality of reference, also known as the principle of locality, is a term for the phenomenon in which the same values, or
related storage locations, are frequently accessed, depending on the memory access pattern.
There are two basic types of reference locality – temporal and spatial locality.
Temporal locality refers to the reuse of specific data, and/or resources, within a relatively small time duration.
Spatial locality refers to the use of data elements within relatively close storage locations.
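A small C sketch (illustrative only, not from these slides) of where the two kinds of locality show up in ordinary code:

/* 'sum' is touched on every iteration: temporal locality.            */
/* a[i][0], a[i][1], ... are adjacent in memory: spatial locality.    */
long sum_rows(const int a[100][100])
{
    long sum = 0;
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++)
            sum += a[i][j];
    return sum;
}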
Which block should be replaced on a cache miss?
When a miss occurs, the cache controller must select a block to be replaced with the desired data. A benefit of
direct-mapped placement is that hardware decisions are simplified—in fact, so simple that there is no choice:
Only one block frame is checked for a hit, and only that block can be replaced.
Least-recently used (LRU)—The block replaced is the one that has been unused for the longest time.
LRU makes use of a corollary of locality: If recently used blocks are likely to be used again, then the best
candidate for disposal is the least-recently used block.
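A minimal C sketch of LRU victim selection for one set of a 4-way cache (illustrative only; the structure and field names are assumptions, not part of any real cache controller):

#include <stdint.h>

#define WAYS 4

typedef struct {
    uint32_t tag[WAYS];
    int      valid[WAYS];
    uint64_t last_used[WAYS];   /* timestamp of the most recent access to each way */
} cache_set_t;

/* On a miss, choose an invalid way if one exists; otherwise evict the way
   whose last_used timestamp is oldest, i.e. the least-recently used block. */
static int pick_victim(const cache_set_t *set)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set->valid[w])
            return w;
        if (set->last_used[w] < set->last_used[victim])
            victim = w;
    }
    return victim;
}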
Write through (or store through)—The information is written to both the block in the cache and to the block in
the lower-level memory.
Write back (also called copy back or store in)—The information is written only to the block in the cache.
The modified cache block is written to main memory only when it is replaced.
Mapping of cache
• Transfer of information (data/code) from main memory to cache memory
• Fully associative mapping, direct mapping, set associative mapping
Example:
• Assume a memory with 16 address lines: 2^16 = 65536 possible addresses, capable of
holding 65536 words
• Assume there are 4096 blocks in the memory, each having 16 words (4096 x 16 =
65536)
• Assume there are 128 cache blocks (block frames/ cache lines)
• How to map 4096 memory blocks to 128 cache blocks (or cache lines) ?
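One way to split the 16-bit address for this example (the 2-way figure in the last case is an assumption for illustration):
Direct mapping: Tag = 5 bits, Block/index = 7 bits (128 = 2^7 cache lines), Word/offset = 4 bits (16 = 2^4 words per block); memory block j goes to cache line j mod 128.
Fully associative: Tag = 12 bits, Word/offset = 4 bits; any memory block can go into any cache line.
Set associative (e.g., 2-way, giving 64 sets): Tag = 6 bits, Set/index = 6 bits, Word/offset = 4 bits.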
Associative or Fully Associative Mapping
[Figure: fully associative mapping between a 64K-word main memory (4K blocks of 16 words each) and a 2K-word cache (128 blocks of 16 words each)]
Direct Mapping or One-Way Set Associative Mapping
[Figure: direct mapping between a 64K-word main memory (4K blocks of 16 words each) and a 2K-word cache (128 blocks of 16 words each)]
o Physical address = 34 bits
o Tag = 10 bits
o Offset = 12 bits
o Index = 34 – (12 + 10) = 12 bits
Cache size = block size × number of blocks = 2^12 × 2^12 = 2^24 bytes = 16 MB
Ques: Consider a 16-way set-associative cache in which data words are 64 bits long and words are
addressed to the half-word. The cache holds 2 MB of data and each block holds 16 data words.
Physical addresses are 64 bits long. How many bits of tag, index, and offset are needed to support
references to this cache?
Given: set associativity = 16, cache size = 2 MB, block size = 16 data words, physical address = 64 bits, data word size = 64 bits = 8 bytes.
There are 16 data words per block, which implies that at least 4 bits of offset are needed.
Because data is addressable to the half-word, an additional bit of offset is needed;
therefore the offset is 5 bits.
To calculate the index, use the information given regarding the total capacity of the cache:
2 MB = 2^21 bytes.
2^21 bytes × (1 word / 8 bytes) × (1 block / 16 words) = 2^14 blocks = 16K blocks.
Now, with 16 blocks per set, there are 2^14 / 2^4 = 2^10 = 1024 sets; thus 10 bits of index are needed.
Finally, the remaining bits form the tag: 64 – 5 – 10 = 49 bits of tag.
To summarize: Tag: 49 bits; Index: 10 bits; Offset: 5 bits
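A small C sketch (illustrative, not from the slides) that reproduces this tag/index/offset calculation; the variable names and the 4-byte half-word addressing unit are taken from the question above:

#include <stdio.h>

/* Derive tag/index/offset widths for a set-associative cache.
   All sizes are assumed to be powers of two. */
static int log2u(unsigned long long x)
{
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned long long cache_bytes = 2ULL * 1024 * 1024;  /* 2 MB of data          */
    unsigned long long block_bytes = 16 * 8;              /* 16 words x 64 bits    */
    unsigned long long addr_unit   = 4;                   /* half-word = 4 bytes   */
    int ways      = 16;
    int addr_bits = 64;

    int offset = log2u(block_bytes / addr_unit);              /* 5 bits  */
    int index  = log2u(cache_bytes / (block_bytes * ways));   /* 10 bits */
    int tag    = addr_bits - index - offset;                  /* 49 bits */

    printf("tag=%d index=%d offset=%d\n", tag, index, offset);
    return 0;
}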
Ques: Consider a 2 MB 4-way set-associative write-back cache with a 16-byte line (block) size
and a 32-bit byte-addressable address. Assume a random replacement policy and a single-core
system. How many bits of the address are used for the cache index, tag and offset?
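One way to work it out:
Offset: 16-byte lines, so log2(16) = 4 bits
Number of lines = 2 MB / 16 B = 2^21 / 2^4 = 2^17; with 4 ways per set, sets = 2^17 / 2^2 = 2^15, so Index = 15 bits
Tag = 32 – 15 – 4 = 13 bits
(The write-back and random-replacement properties do not change the address breakdown.)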
Associative or Fully Associative Mapping
Tag: 27 bits
Word/offset = 5 bits
Direct Mapping or One-Way Set Associative Mapping
Tag: 19 bits
block/index=8 bits
Word/offset=5 bits
Set-Associative Mapping or Two-Way Set Associative Mapping
Tag: 20 bits
Set/index=7 bits
Word/offset=5 bits
Peripherals
Embedded Systems
(UEC513)
Accessing of I/O Devices
More than one I/O device may be connected through a set
of three buses (address, data and control).
Each device needs to be assigned a unique address.
Two mapping techniques
Memory mapped I/O
I/O mapped I/O
I/O Mapping Techniques
Two techniques are used to assign addresses to I/O devices:
Memory mapped I/O
I/O mapped I/O
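A brief C sketch of the memory-mapped I/O style (the register addresses and names below are hypothetical, purely for illustration): the device registers sit at ordinary memory addresses, so normal load/store instructions reach them.

#include <stdint.h>

#define UART_BASE   0x40001000u                           /* hypothetical device base address */
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x4))

void uart_send(uint32_t ch)
{
    UART_DATA = ch;   /* an ordinary store writes the device's data register */
}

With I/O-mapped (port-mapped) I/O, by contrast, devices live in a separate address space reached only through dedicated I/O instructions (e.g., IN/OUT on x86), so plain C pointer accesses cannot address them directly.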
Accessing of I/O through polling
Normally, the data transfer rate of I/O devices is slower than the
speed of the processor. This creates the need for mechanisms to
synchronize data transfers between them.
Program-controlled I/O: The processor continuously checks the status flag
to achieve the necessary synchronization. This is called polling.
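A minimal C sketch of program-controlled (polled) input, reusing the hypothetical UART registers from the earlier sketch (the status-flag bit is also an assumption):

#include <stdint.h>

#define UART_BASE   0x40001000u                           /* hypothetical device base address */
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x4))
#define RX_READY    0x1u                                  /* hypothetical "data ready" flag   */

uint32_t uart_read_polled(void)
{
    while ((UART_STATUS & RX_READY) == 0)
        ;                     /* busy-wait: the CPU does nothing else while polling */
    return UART_DATA;
}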
5) The I/O device puts a word on the data bus (for a memory write)
9) Word count register = 0: the DMAC checks the DMA request from the I/O device
Operation of DMA with CPU
DMA Modes:
a. Burst Mode: In this mode the DMA controller hands the buses back to the CPU only after completion of
the whole data transfer. Meanwhile, if the CPU requires the bus it has to stay idle and wait for the
transfer to finish.
b. Cycle Stealing Mode: In this mode, the DMA controller gives control of the buses back to the CPU after the
transfer of every byte. It continuously issues a request for bus control, makes the transfer of one byte
and returns the bus. Because of this, the CPU doesn’t have to wait for a long time if it needs the bus for a
higher-priority task.
c. Transparent Mode: Here, the DMA controller transfers data only when the CPU is executing instructions
that do not require the use of the buses.
AMBA Architecture (AHB, ASB and APB)
Outline
References
1. ARM System on Chip Architecture, Second Edition,
Steve Furber.