
Thumb State of ARM Processor

Outline
Thumb Design Philosophy
Code Density
Switching from ARM to Thumb
ARM Organization and Implementation
Pipeline
Performance using pipelining
Assume that the times required for the five functional units, which operate in each of the five cycles,
are as follows: 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Assume that pipelining adds 1 ns of overhead. How
much speedup in the instruction execution rate will be gained from a pipeline?
(Assume pipeline is full)
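
A worked solution (assuming the pipeline is full, so one instruction completes per clock):
Unpipelined instruction time = 10 + 8 + 10 + 10 + 7 = 45 ns
Pipelined clock cycle = max(10, 8, 10, 10, 7) + 1 ns overhead = 11 ns
Speedup = 45 / 11 ≈ 4.1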
Performance using pipelining
Assume an unpipelined machine takes 100 ns to execute an instruction. Assume that the time required
for each of the five functional units, which operate in each of the five cycles, is 20 ns. Determine the
speedup ratio of the pipeline for executing 1000 instructions.
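
A worked solution, using the same (k + n – 1) formula applied in the D1/D2 example below:
Unpipelined time for 1000 instructions = 1000 × 100 = 100000 ns
Pipelined time = (5 + 1000 – 1) × 20 = 20080 ns
Speedup ratio = 100000 / 20080 ≈ 4.98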
Performance using pipelining
• Consider 2 designs D1 and D2 for a synchronous pipeline processor. D1 has a 5-stage pipeline with stage
execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages each with 2 ns execution
time. How much time can be saved using design D2 over design D1 for executing 100 instructions?

• The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage (with delay 800
picoseconds) is replaced with a functionally equivalent design involving two stages with respective delays 600
and 350 picoseconds. Calculate the throughput increase of the pipeline in percent.

• Consider a pipeline having 4 phases with duration 60, 50, 90 and 80 ns. Given latch delay is 10 ns. Calculate the
following
a) Pipeline cycle time
b) Non-pipeline execution time
c) Speed up ratio
d) Pipeline time for 1000 tasks
e) Sequential time for 1000 tasks
f) Throughput
Performance using pipelining
• Consider 2 designs D1 and D2 for a synchronous pipeline processor. D1 has a 5-stage pipeline with stage
execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages each with 2 ns execution
time. How much time can be saved using design D2 over design D1 for executing 100 instructions?

For D1:
k = 5 and n = 100
Maximum clock cycle = 4 ns
Total execution time = (5 + 100 – 1) × 4 = 416 ns

For D2:
k = 8 and n = 100
Each clock cycle = 2 ns
Total execution time = (8 + 100 – 1) × 2 = 214 ns

Thus, time saved using D2 over D1 = 416 – 214 = 202 ns
Performance using pipelining
• The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage (with delay 800
picoseconds) is replaced with a functionally equivalent design involving two stages with respective delays 600
and 350 picoseconds. Calculate the throughput increase of the pipeline in percent.
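
A worked solution:
Original cycle time = max(800, 500, 400, 300) = 800 ps, so throughput = 1/800 per ps
New 5-stage delays are 600, 350, 500, 400, 300 ps, so cycle time = 600 ps and throughput = 1/600 per ps
Throughput increase = (1/600 – 1/800) / (1/800) = 800/600 – 1 = 33.3%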
Performance using pipelining
• Consider a pipeline having 4 phases with duration 60, 50, 90 and 80 ns. Given latch delay is 10 ns. Calculate the following:
a) Pipeline cycle time
b) Non-pipeline execution time
c) Speed up ratio
d) Pipeline time for 1000 tasks
e) Sequential time for 1000 tasks
f) Throughput
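
A worked solution:
a) Pipeline cycle time = max(60, 50, 90, 80) + 10 (latch delay) = 100 ns
b) Non-pipeline execution time = 60 + 50 + 90 + 80 = 280 ns
c) Speed up ratio = 280 / 100 = 2.8
d) Pipeline time for 1000 tasks = (4 + 1000 – 1) × 100 = 100300 ns
e) Sequential time for 1000 tasks = 1000 × 280 = 280000 ns
f) Throughput = 1000 tasks / 100300 ns ≈ 9.97 × 10^6 tasks per second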
Structural Hazard
Data Hazard

 The use of the result of the ADD instruction by the following instructions causes a hazard, since the
register is not written until after those instructions read it.
Data Hazards

• How many stalls are needed to avoid data hazards?

• Pipeline stall clock cycles = 2 to avoid data hazards


• Forwarding/ Bypassing to eliminate stalls: hardware technique
Minimizing Data Hazard Stalls By Forwarding

 Hardware technique called forwarding


 Sometimes also called Bypassing or short-circuiting
If the result can be moved from where the ADD produces it (the EX/MEM register) to where the
SUB needs it (the ALU input latches), then the need for a stall can be avoided. Using this
observation, forwarding works as follows:
 The ALU result from the EX/MEM register is always fed back to the ALU input latches.
If the forwarding hardware detects that the previous ALU operation has written the register
corresponding to a source for the current ALU operation, control logic selects the forwarded
result as the ALU input rather than the value read from the register file.
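
As a minimal sketch of this control decision (the latch and register names here are illustrative, not the slides' hardware):

#include <stdio.h>

typedef struct {
    int reg_write;   /* did the previous ALU op write a register? */
    int rd;          /* its destination register number           */
    int alu_result;  /* the value it produced in EX               */
} ExMemLatch;

/* Select one ALU input: forward from EX/MEM when the previous ALU
 * operation wrote the register this operand reads; otherwise use
 * the value read from the register file. */
int alu_input(const ExMemLatch *ex_mem, int src_reg, int regfile_value)
{
    if (ex_mem->reg_write && ex_mem->rd == src_reg)
        return ex_mem->alu_result;   /* forwarded result          */
    return regfile_value;            /* normal register file read */
}

int main(void)
{
    ExMemLatch add = { 1, 1, 42 };           /* ADD just wrote 42 to R1 */
    printf("%d\n", alu_input(&add, 1, 0));   /* SUB reads R1: prints 42 */
    return 0;
}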
Data Hazard: Forwarding

• Eliminate the stalls for the hazard involving SUB and AND instructions
using a technique called forwarding
Data Hazards: Forwarding

Forward from EX/MEM to ALU input

Forward from MEM/WB to ALU input


Forward through register file

How Forwarding works?

• If the forwarding hardware detects that the previous ALU instruction has written the register
corresponding to a source for the current instruction, control logic selects the forwarded
result as the ALU input rather than the value read from the register file
• The result is forwarded not only from the immediately previous instruction, but possibly
from an instruction started 3 clock cycles earlier
Data Hazards Requiring Stalls

LW  R1,0(R2)        (ARM: LDR R1, [R2,#0])
SUB R4,R1,R5
AND R6,R1,R7
OR  R8,R1,R9
Data Hazards Requiring Stalls

LDR R1, [R2,#0]
LDR R4, [R1,#0]
STR R4, [R1,#12]

Data Hazard
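
A minimal sketch of the load-use stall check implied by the sequences above (latch names are illustrative): the loaded value is only available after MEM, so an immediately dependent instruction must be stalled one cycle even with forwarding.

typedef struct {
    int is_load;  /* instruction now in EX is a load (LW/LDR)  */
    int rd;       /* its destination register                  */
} IdExLatch;

typedef struct {
    int rs, rt;   /* source registers of the instruction in ID */
} IfIdLatch;

/* Returns 1 when the pipeline must insert a one-cycle stall. */
int must_stall(const IdExLatch *id_ex, const IfIdLatch *if_id)
{
    return id_ex->is_load &&
           (id_ex->rd == if_id->rs || id_ex->rd == if_id->rt);
}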
Operation
These instructions load optionally sign-extended bytes or halfwords, and store halfwords. The
THUMB assembler syntax is shown on next slide.
Instruction Set Examples
Embedded Systems
Memory Hierarchy Design
Memory hierarchy
• Memory is a very important part of a computer
• Simple axiom in hardware design: smaller is faster (smaller hardware is usually faster than larger
hardware); this applies to memory design
• Faster memories are available with a smaller number of bits per chip

• Principle of locality: data used most recently is very likely to be accessed again in the near future
• Keep recently-used data in the fastest memory (smaller memory, close to the CPU)
• Keep data not used recently in slower memory (larger memory, farther from the CPU)
Memory Hierarchy
Working of Cache Memory
• Cache is the first level of the memory hierarchy
• Cache memory: small, fast memory
close to the CPU (holding the most
recently used data/code)
• Cache hit: the CPU finds the requested
data (referenced by a program) in cache
memory
• Cache miss: the CPU doesn't find the
requested data in cache memory
• When a cache miss happens, the block
of data containing the requested data is
retrieved from main memory and placed
in the cache
Cache hit, cache miss, page fault
• Temporal locality: retrieved data is likely to be used
again in the near future
• Spatial locality: high probability that other data within
the block will be used soon
• A cache miss is handled by hardware. The CPU is stalled
until the requested data is available
• Page fault: the CPU doesn't find the requested data in
cache or main memory
• The virtual address space is broken into multiple pages
• When a page fault happens, the page containing the
requested data is retrieved from disk and placed in
main memory
• A page fault is handled by software. The CPU is not
stalled but switches to another task until the requested
data is available
Cache design issues
Properties of Cache Memory
Cache mapping techniques with main memory
Cache is volatile and empty when power is switched on.

Cache block placement: where to place a block in the cache

Based upon locality of reference, instructions and data are picked up and placed
in the cache according to three main placement/mapping techniques:

Direct Mapped: a block from main memory is always mapped to exactly one place in the
cache
Fully Associative: any block of main memory may be mapped to any place (block) in
the cache
Set Associative: combines the effects of direct mapping and fully associative
mapping
Where can a block be placed in a cache?
Definitions

dirty bit: This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is
clean, the block is not written on a miss, since the lower level has identical information to the cache.

valid bit: When power is first turned on, the cache contains no valid data. A control bit, usually called the valid bit, must be
provided for each cache block to indicate whether the data in that block are valid.

The valid bits of all cache blocks are set to 0 when power is initially applied to the system. The processor fetches data from a
cache block only if its valid bit is equal to 1. The use of the valid bit in this manner ensures that the processor will not fetch
stale data from the cache.

Reference or use bit: To help the operating system estimate LRU, many machines provide a use bit or reference bit, which is
set whenever a page is accessed.

Locality of reference, also known as the principle of locality, is a term for the phenomenon in which the same values, or
related storage locations, are frequently accessed, depending on the memory access pattern.

There are two basic types of reference locality – temporal and spatial locality.

Temporal locality refers to the reuse of specific data, and/or resources, within a relatively small time duration.

Spatial locality refers to the use of data elements within relatively close storage locations.
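
An illustrative C fragment (not from the slides) in which a single loop exhibits both kinds of locality:

long sum_rows(const int a[][1024], int rows)
{
    long sum = 0;                  /* 'sum' is reused every iteration:
                                      temporal locality                 */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < 1024; j++)
            sum += a[i][j];        /* consecutive addresses in row-major
                                      order: spatial locality            */
    return sum;
}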
Which block should be replaced on a cache miss?

When a miss occurs, the cache controller must select a block to be replaced with the desired data. A benefit of
direct-mapped placement is that hardware decisions are simplified—in fact, so simple that there is no choice:
Only one block frame is checked for a hit, and only that block can be replaced.

Random—To spread allocation uniformly, candidate blocks are randomly selected.

Least-recently used (LRU)—The block replaced is the one that has been unused for the longest time.
LRU makes use of a corollary of locality: If recently used blocks are likely to be used again, then the best
candidate for disposal is the least-recently used block.

FIFO- First in first out


What happens on a write?

Write through (or store through)—The information is written to both the block in the cache and to the block in
the lower-level memory.

Write back (also called copy back or store in)—The information is written only to the block in the cache.
The modified cache block is written to main memory only when it is replaced.
Mapping of cache
• Transformation of information (data/code) from main memory to cache memory
• Fully associative mapping, direct mapping, set associative mapping
Example:
• Assume a memory with 16 address lines: 2^16 = 65536 possible addresses, capable of
holding 65536 words
• Assume there are 4096 blocks in the memory, each having 16 words (4096 × 16 = 65536)
• Assume there are 128 cache blocks (block frames / cache lines)
• How to map 4096 memory blocks to 128 cache blocks (or cache lines)?
Associative or Fully Associative Mapping

Main memory: 64K words (4K blocks of 16 words each)
Cache: 2048 or 2K words (128 blocks of 16 words each)
Direct Mapping or One-Way Set Associative Mapping

Main memory: 64K words (4K blocks of 16 words each)
Cache: 2048 or 2K words (128 blocks of 16 words each)

Cache line = (Block address) MOD (Number of blocks in cache)
Direct Memory Mapping

Example 16-bit addresses: 0000000000000000 (block 0), 0000000000010000 (block 1),
0000100000000000 (block 128)

A 16-bit incoming address is divided into three fields: 5 bits for the tag, 7 bits
for the cache line/block identification, and 4 bits for the block/line offset.

A tag of 5 bits is attached to each of the 128 cache lines to identify which of the
2^5 = 32 main-memory blocks is currently mapped to that particular cache line
(block address MOD number of blocks in cache: 0 mod 128, 1 mod 128, and so on).

From the incoming 16-bit address, the 7 middle bits identify the cache line/block.
The 5 most significant bits are then compared to that particular line's tag.

If a match is found, the 4 least significant bits identify the word (out of
2^4 = 16 possible words) in that particular line.

MM blocks 0, 128, 256, …, 3968 map to CM line 0
MM blocks 1, 129, 257, …, 3969 map to CM line 1, and so on
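
This 5/7/4-bit split can be checked with a small C program (the sample address is arbitrary):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t addr   = 0x1234;               /* any 16-bit word address */
    unsigned offset =  addr        & 0xF;   /* bits 3..0: 16 words     */
    unsigned line   = (addr >> 4)  & 0x7F;  /* bits 10..4: 128 lines   */
    unsigned tag    = (addr >> 11) & 0x1F;  /* bits 15..11: 32 tags    */
    printf("tag=%u line=%u offset=%u\n", tag, line, offset);
    return 0;
}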
Problem
Assume a processor has a direct mapped cache with data words 8 bits long. The number of address lines
required to generate a physical address for each word is 20. The tag of the cache is 11 bits and each block
holds 16 bytes of data. How many blocks are in this cache?
Problem
Assume a processor has a direct mapped cache with data words 8 bits long. The number of address lines
required to generate a physical address for each word is 20. The tag of the cache is 11 bits and each block
holds 16 bytes of data. How many blocks are in this cache?
Solution:
Block size = 16 bytes, so block offset = 4 bits
Index = physical address – (tag + offset) = 20 – (11 + 4) = 5 bits
Number of blocks in the cache = 2^5 = 32
In general: number of sets = cache size / set size = cache size / (set associativity × block size);
for a direct mapped cache the associativity is 1, so the number of sets equals the number
of blocks (here 2^5 = 32).
Example Problem 1: Consider a direct mapped cache with block size 4 KB. The size of
main memory is 16 GB and there are 10 bits in the tag. Find the index and block offset
fields, and also the size of the cache memory.
Example Problem 1: Consider a direct mapped cache with block size 4 KB. The size of
main memory is 16 GB and there are 10 bits in the tag. Find the index and block offset
fields, and also the size of the cache memory.

 Sol: Direct Mapped Cache

 Block Size = 4 KB
 Main memory Size = 16 GB
 Block offset = log2(block size) = log2(4 KB) = 12 bits
 Physical address = log2(main memory size) = log2(16 GB) = 34 bits
   (the number of address lines m satisfies 2^m = n, the memory size)
 Tag = 10 bits
 Index = 34 – (12 + 10) = 12 bits
 Cache size = block size × number of blocks = 2^12 × 2^12 = 16 MB
Ques: Consider a 16-way set-associative cache with data words 64 bits long and words
addressed to the half-word. The cache holds 2 Mbytes of data and each block holds 16 data words.
Physical addresses are 64 bits long. How many bits of tag, index, and offset are needed to support
references to this cache?
Ques: Consider a 16-way set-associative cache with data words 64 bits long and words
addressed to the half-word. The cache holds 2 Mbytes of data and each block holds 16 data words.
Physical addresses are 64 bits long. How many bits of tag, index, and offset are needed to support
references to this cache?
 Set associativity = 16
 Cache size = 2 MB
 Block size = 16 data words
 Physical address = 64 bits
 Data word size = 64 bits = 8 bytes
 There are 16 data words per block, which implies that at least 4 offset bits are needed.
Because data is addressable to the half-word, an additional bit of offset is needed;
therefore the offset is 5 bits.
 To calculate the index, use the information given about the total capacity of the cache:
2 MB = 2^21 bytes. 2^21 bytes × (1 word / 8 bytes) × (1 block / 16 words) = 2^14 blocks
= 16K blocks.
 With 2^14 blocks in the cache, there are 2^14 blocks × (1 set / 2^4 blocks) = 2^10 = 1024
sets. Thus, 10 bits of index are needed.
 Finally, the remaining bits form the tag: 64 – 5 – 10 = 49. Thus, there are 49 bits of tag.
 To summarize: Tag: 49 bits | Index: 10 bits | Offset: 5 bits
Ques: Consider a 2MB 4-way set-associative write back cache with 16 byte line (block) size
and a 32 bit byte-addressable address. Assume a random replacement policy and a single
core system. How many bits of the address are used for the cache index, tag and offset?
Ques: Consider a 2MB 4-way set-associative write back cache with 16 byte line (block) size
and a 32 bit byte-addressable address. Assume a random replacement policy and a single
core system. How many bits of the address are used for the cache index, tag and offset?

 Cache Size = 2 MB
 Set Associativity = 4
 Physical Address = 32 bits
 Block size = 16 bytes
 Block offset = 4 bits
 Index = log2(number of sets in cache) = log2(2^21 / (2^2 × 2^4)) = 15 bits
 Tag = 32 – (15 + 4) = 13 bits

 Tag (13 bits) | Index (15 bits) | Offset (4 bits)
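
The same bit-field arithmetic from these worked examples can be reproduced with a small helper (illustrative only):

#include <stdio.h>

/* log2 for power-of-two values */
static unsigned log2u(unsigned long long x)
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned long long cache_bytes = 2ULL << 20;  /* 2 MB          */
    unsigned long long block_bytes = 16;          /* 16-byte line  */
    unsigned ways = 4, addr_bits = 32;

    unsigned offset = log2u(block_bytes);
    unsigned index  = log2u(cache_bytes / (ways * block_bytes));
    unsigned tag    = addr_bits - index - offset;
    printf("tag=%u index=%u offset=%u\n", tag, index, offset); /* 13 15 4 */
    return 0;
}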
Associative or Fully Associative Mapping

Tag: 27 bits
Word/offset = 5 bits
Direct Mapping or One-Way Set Associative Mapping

Tag: 19 bits
Block/index = 8 bits
Word/offset = 5 bits
Set-Associative Mapping or Two-Way Set Associative Mapping

Tag: 20 bits
Set/index = 7 bits
Word/offset = 5 bits
Peripherals

Embedded Systems
(UEC513)
Accessing of I/O Devices
 More than one I/O device may be connected through a set
of three buses (address, data and control)
 Each device needs to be assigned a unique address
 Two mapping techniques
 Memory mapped I/O
 I/O mapped I/O
I/O Mapping Techniques
 Two techniques are used to assign addressing to I/O
 Memory mapped I/O
 I/O mapped I/O
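
A minimal sketch of memory-mapped I/O in C (the device register address below is hypothetical, not from the slides): with memory-mapped I/O an ordinary load/store reaches the device, whereas I/O mapped I/O needs dedicated instructions (e.g. IN/OUT on x86).

#include <stdint.h>

#define UART_DATA (*(volatile uint8_t *)0x40001000u)  /* assumed address */

void uart_send(uint8_t byte)
{
    UART_DATA = byte;   /* a plain store: the device shares the
                           processor's memory address space       */
}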
Accessing of I/O through polling
 Normally, the data transfer rate of I/O devices is slower than the
speed of the processor. This creates the need for mechanisms to
synchronize data transfers between them.
 Program-controlled I/O: the processor continuously checks a status flag
to achieve the necessary synchronization. This is called polling (see the
sketch below).

Two other mechanisms are used for synchronizing data transfers between
the processor and I/O devices:
 Interrupt-driven I/O
 Direct Memory Access (DMA).
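
The polling sketch referred to above (register addresses and the READY bit position are assumptions for illustration):

#include <stdint.h>

#define UART_STATUS (*(volatile uint8_t *)0x40001004u)  /* assumed */
#define UART_DATA   (*(volatile uint8_t *)0x40001000u)  /* assumed */
#define RX_READY    0x01u

uint8_t uart_receive(void)
{
    while ((UART_STATUS & RX_READY) == 0)
        ;                      /* busy-wait: the CPU polls the flag */
    return UART_DATA;          /* flag set: read the received byte  */
}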
Interrupt driven I/O
 I/O devices send a request
indicating the readiness of data
 The processor completes the
current instruction and sends an
acknowledgement to the
respective I/O device
 Example: suppose the processor
is executing a program and is at
the instruction located at address i
when an interrupt occurs
 The routine executed in response
to an interrupt request is called the
interrupt-service routine (ISR)
 When an interrupt occurs, control
must be transferred to the
interrupt service routine
 After completion of the ISR,
control returns to the main program
Interrupt service routine (ISR)
 CPU suspends execution of the current program
 Saves the address of the next instruction to be executed
(current contents of PC) and any other data
• CPU sets the PC to the starting address of an ISR
• CPU proceeds to the fetch cycle and fetches the first
instruction in ISR which is generally a part of the OS
 ISR typically determines the nature of the interrupt and performs
whatever actions are needed.
For example, ISR determines which I/O module
generated the interrupt and may branch to a program
that will write more data out to that I/O module.
Once ISR is completed, CPU will resume the execution
of the user program at the point of interruption.
Daisy Chain in Interrupt
 All interrupting devices are connected in series (daisy chain)
 The first device has the highest priority
 A single interrupt request (IR) line is shared
 INTA (interrupt acknowledge) is used to respond to the devices
 Scanning starts from device 1 and proceeds onwards
Bus Contention & Arbitration Priority Resolving Schemes
Multiple Interrupts
Two methods can be used to handle multiple interrupts:
 Sequential execution
 Execution as per the priority of the interrupts
Direct Memory Access
 A special control unit provides transfer of a block of data directly
between an I/O device and memory, bypassing the processor
 It uses the buses of the processor
 It is not a processor, so it has no instruction set
Direct Memory access
 The hardware device used for direct memory access is called the DMA controller.
 The DMA controller is a control unit, part of the I/O device's interface circuit, which
can transfer blocks of data between I/O devices and main memory with minimal
intervention from the processor
 DMA can transfer blocks of data from I/O to memory, memory to I/O, or memory to
memory without any intervention from the processor
 To initiate a DMA transfer, the processor loads the following information into the DMA
controller:
 Starting address
 Number of words to be transferred
 Direction of transfer
 Mode of transfer
 After completion of the DMA transfer, it informs the processor by raising an interrupt
signal
The DMA controller contains an address unit for generating addresses and
selecting the I/O device for transfer.
It also contains a control unit and a data count register for keeping count of
the number of blocks transferred and indicating the direction of
transfer of data.
When the transfer is completed, the DMA controller informs the processor by
raising an interrupt.

Typical Block Diagram of DMA Controller


Different Steps
1) The I/O device sends a DMA request

2) The DMAC activates the BR (bus request) line

3) The CPU responds with the BG (bus grant) line

4) The DMAC sends a DMA acknowledgement to the I/O device

5) The I/O device puts a word on the data bus (for a memory write)

6) The DMAC writes the data to the address specified by the address register

7) The word count register is decremented

8) When the word count register reaches 0, an EOT (end of transfer) interrupt is sent to the CPU

9) While the word count register is not 0, the DMAC checks for the next DMA request from the I/O device
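
A sketch of how a processor might program such a transfer (the register map, base address, and bit names below are hypothetical, not a real DMAC):

#include <stdint.h>

typedef struct {
    volatile uint32_t address;     /* starting address             */
    volatile uint32_t word_count;  /* number of words to transfer  */
    volatile uint32_t control;     /* direction, mode, start bit   */
} DmaController;

#define DMA ((DmaController *)0x40002000u)   /* assumed base address */
#define DMA_DIR_MEM_TO_IO  (1u << 0)
#define DMA_MODE_BURST     (1u << 1)
#define DMA_START          (1u << 31)

void dma_start_transfer(uint32_t src, uint32_t nwords)
{
    DMA->address    = src;
    DMA->word_count = nwords;
    DMA->control    = DMA_DIR_MEM_TO_IO | DMA_MODE_BURST | DMA_START;
    /* completion is later signalled by the DMAC's EOT interrupt */
}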
Operation of DMA with CPU
DMA Modes:

a. Burst Mode: In this mode, the DMA hands the buses back to the CPU only after completion of
the whole data transfer. Meanwhile, if the CPU requires the bus, it has to stay idle and wait for
the transfer to finish.

b. Cycle Stealing Mode: In this mode, the DMA gives control of the buses back to the CPU after the
transfer of every byte. It continuously issues a request for bus control, makes the transfer of one
byte and returns the bus. This way the CPU doesn't have to wait a long time if it needs the bus for a
higher-priority task.

c. Transparent Mode: Here, the DMA transfers data only when the CPU is executing instructions
which do not require the use of the buses.
AMBA Architecture (AHB, ASB and APB)
Outline
References
1. Steve Furber, ARM System-on-Chip Architecture, Second Edition.

2. Video Lectures of Prof. Mouli Sankaran on ARM-Based Development.
   https://nptel.ac.in/courses/117106111/
ARM Development Environment
