LECTURE NOTES ON COMPUTER SYSTEM ARCHITECTURE
COURSE CODE: 24MC1TCA105
MODULE-2
PREPARED BY Dr. P. R. Tripathy
MODULE-II
Hierarchical Memory Technology:
Computer memory can be divided into five major levels that are based on use as well as
speed. A processor can easily move from any one level to another on the basis of its
requirements. These five levels in a system's memory are registers, cache memory, main
memory, magnetic disk, and magnetic tape.
The Memory Hierarchy, in computer system design, is an enhancement that helps in organizing
the memory so as to minimize the access time. The design of the Memory Hierarchy rests on a
behavior of programs known as locality of reference. Here is a figure that demonstrates the
various levels of the memory hierarchy clearly:
The Memory Hierarchy Design is characterized by the following parameters:
1. Capacity:
It refers to the total volume of data that a system’s memory can store. The capacity
increases moving from the top to the bottom in the Memory Hierarchy.
2. Access Time:
It refers to the time interval present between the request for read/write and the data
availability. The access time increases as we move from the top to the bottom in the
Memory Hierarchy.
3. Performance:
Earlier, when computer systems were designed without a Memory Hierarchy, the speed gap
between the CPU registers and the Main Memory kept growing because of the large
difference in the access times. This resulted in lower system performance, and an
enhancement was required. That enhancement was introduced in the form of the Memory
Hierarchy Design, and because of it the system's performance increased. One of the primary
ways to increase the performance of a system is to minimize how far down the memory
hierarchy the processor has to go to manipulate data.
4. Cost per bit:
The cost per bit increases as one moves from the bottom to the top in the Memory
Hierarchy, i.e. External Memory is cheaper than Internal Memory.
A memory unit is an essential component in any digital computer since it is needed for storing
programs and data.
1. The memory unit that establishes direct communication with the CPU is called Main
Memory. The main memory is often referred to as RAM (Random Access Memory).
2. The memory units that provide backup storage are called Auxiliary Memory. For
instance, magnetic disks and magnetic tapes are the most commonly used auxiliary
memories.
Apart from the basic classifications of a memory unit, the memory hierarchy consists of all the
storage devices available in a computer system, ranging from the slow but high-capacity
auxiliary memory to the relatively faster main memory.
Auxiliary Memory:
A magnetic disk is a digital computer memory that uses a magnetization process to write,
rewrite and access data. For example, hard drives, zip disks, and floppy disks.
Magnetic tape is a storage medium that allows for data archiving, collection, and backup for
different kinds of data.
Main Memory:
The main memory in a computer system is often referred to as Random Access Memory
(RAM). This memory unit communicates directly with the CPU and with auxiliary memory
devices through an I/O processor.
The programs that are not currently required in the main memory are transferred into auxiliary
memory to provide space for currently used programs and data.
I/O Processor:
The primary function of an I/O Processor is to manage the data transfers between auxiliary
memories and the main memory.
Cache Memory:
The data or contents of the main memory that are used frequently by the CPU are stored in the
cache memory so that the processor can access that data in a shorter time. Whenever
the CPU needs to access memory, it first checks for the required data in the cache memory. If
the data is found in the cache memory, it is read from the fast memory. Otherwise, the CPU
moves on to the main memory for the required data.
Main Memory:
The main memory acts as the central storage unit in a computer system. It is a relatively large
and fast memory which is used to store programs and data during the run time operations.
The primary technology used for the main memory is based on semiconductor integrated
circuits. The integrated circuits for the main memory are classified into two major units: RAM
chips and ROM chips.
The RAM integrated circuit chips are further classified into two possible operating
modes, static and dynamic.
A static RAM consists essentially of flip-flops that store the binary information. The stored
information is volatile, i.e. it remains valid only as long as power is applied to the
system. Static RAM is easy to use and takes less time to perform read and write
operations than dynamic RAM.
A dynamic RAM stores the binary information in the form of electric charges on capacitors.
The capacitors are provided inside the chip by MOS transistors. Dynamic RAM consumes
less power and provides a larger storage capacity in a single memory chip.
RAM chips are available in a variety of sizes and are used as per the system requirement. The
following block diagram demonstrates the chip interconnection in a 128 * 8 RAM chip.
A 128 * 8 RAM chip has a memory capacity of 128 words of eight bits (one byte) per
word. This requires a 7-bit address and an 8-bit bidirectional data bus.
The 8-bit bidirectional data bus allows the transfer of data either from memory to
CPU during a read operation or from CPU to memory during a write operation.
The read and write inputs specify the memory operation, and the two chip select
(CS) control inputs are for enabling the chip only when the microprocessor selects it.
The bidirectional data bus is constructed using three-state buffers.
The output generated by three-state buffers can be placed in one of the three
possible states which include a signal equivalent to logic 1, a signal equal to logic 0,
or a high-impedance state.
Note: The logic 1 and 0 are standard digital signals whereas the high-impedance state behaves like an open
circuit, which means that the output does not carry a signal and has no logic significance.
The following function table specifies the operations of a 128 * 8 RAM chip.
From the functional table, we can conclude that the unit is in operation only when CS1 = 1
and CS2 = 0. The bar on top of the second select variable indicates that this input is enabled
when it is equal to 0.
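The behavior summarized in the function table can be sketched in C as follows. This is only a
behavioral model of the chip-select logic, not a hardware description; the function name, the
signal encoding and the use of -1 to represent the high-impedance state are illustrative
assumptions:

    #include <stdio.h>
    #include <stdint.h>

    /* Behavioral sketch of the 128 x 8 RAM chip's function table. The chip
       operates only when CS1 = 1 and CS2 = 0; otherwise the data bus floats
       (high impedance, modelled here as -1). */
    static uint8_t mem[128];        /* 128 words of 8 bits -> 7-bit address */

    int ram_access(int cs1, int cs2, int rd, int wr, uint8_t addr, uint8_t data_in)
    {
        if (!(cs1 == 1 && cs2 == 0))
            return -1;              /* chip not selected: high impedance */
        addr &= 0x7F;               /* keep only the 7 address bits */
        if (wr) { mem[addr] = data_in; return data_in; }
        if (rd) return mem[addr];
        return -1;                  /* selected, but no operation requested */
    }

    int main(void)
    {
        ram_access(1, 0, 0, 1, 5, 0xAB);                          /* write 0xAB to word 5 */
        printf("read: %02X\n", ram_access(1, 0, 1, 0, 5, 0));     /* prints AB */
        printf("deselected: %d\n", ram_access(0, 0, 1, 0, 5, 0)); /* prints -1 */
        return 0;
    }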
The primary component of the main memory is RAM integrated circuit chips, but a portion of
memory may be constructed with ROM chips.
A ROM memory is used for keeping programs and data that are permanently resident in the
computer.
Apart from the permanent storage of data, the ROM portion of main memory is needed for
storing an initial program called a bootstrap loader. The primary function of the bootstrap
loader program is to start the computer software operating when power is turned on.
ROM chips are also available in a variety of sizes and are also used as per the system
requirement. The following block diagram demonstrates the chip interconnection in a 512 * 8
ROM chip.
A ROM chip has a similar organization as a RAM chip. However, a ROM can only
perform read operation; the data bus can only operate in an output mode.
The 9-bit address lines in the ROM chip specify any one of the 512 bytes stored in it.
The value for chip select 1 and chip select 2 must be 1 and 0 for the unit to operate.
Otherwise, the data bus is said to be in a high-impedance state.
Auxiliary Memory:
Magnetic Disks:
A magnetic disk is a type of memory constructed using a circular plate of metal or plastic coated
with magnetized materials. Usually, both sides of the disks are used to carry out read/write
operations. However, several disks may be stacked on one spindle with read/write head
available on each surface.
The following image shows the structural representation for a magnetic disk.
The memory bits are stored in the magnetized surface in spots along the concentric
circles called tracks.
The concentric circles (tracks) are commonly divided into sections called sectors.
Magnetic Tape:
Magnetic tape is a storage medium that allows data archiving, collection, and backup for
different kinds of data. The magnetic tape is constructed using a plastic strip coated with a
magnetic recording medium.
The bits are recorded as magnetic spots on the tape along several tracks. Usually, seven or nine
bits are recorded simultaneously to form a character together with a parity bit.
Magnetic tape units can be halted, started to move forward or in reverse, or can be rewound.
However, they cannot be started or stopped fast enough between individual characters. For
this reason, information is recorded in blocks referred to as records.
Associative Memory:
An associative memory can be considered as a memory unit whose stored data can be
identified for access by the content of the data itself rather than by an address or memory
location.
When a word is to be read from an associative memory, the content of the word, or a part of
the word, is specified. The words which match the specified content are located by the
memory and are marked for reading.
From the block diagram, we can say that an associative memory consists of a memory array and
logic for 'm' words with 'n' bits per word.
The functional registers like the argument register A and key register K each have n bits, one for
each bit of a word. The match register M consists of m bits, one for each memory word.
The words which are kept in the memory are compared in parallel with the content of the
argument register.
The key register (K) provides a mask for choosing a particular field or key in the argument word.
If the key register contains a binary value of all 1's, then the entire argument is compared with
each memory word. Otherwise, only those bits in the argument that have 1's in their
corresponding position of the key register are compared. Thus, the key provides a mask for
identifying a piece of information which specifies how the reference to memory is made.
The following diagram can represent the relation between the memory array and the external
registers in an associative memory.
The cells present inside the memory array are marked by the letter C with two subscripts. The
first subscript gives the word number and the second specifies the bit position in the word. For
instance, the cell Cij is the cell for bit j in word i.
A bit Aj in the argument register is compared with all the bits in column j of the array, provided
that Kj = 1. This is done for all columns j = 1, 2, ..., n.
If a match occurs between all the unmasked bits of the argument and the bits in word i, the
corresponding bit Mi in the match register is set to 1. If one or more unmasked bits of the
argument and the word do not match, Mi is cleared to 0.
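This masked match operation can be sketched in C. The word width, the stored words and the
key value below are assumed for illustration:

    #include <stdio.h>

    /* Sketch of the associative-memory match: bit Mi is set to 1 when every
       unmasked bit (Kj = 1) of the argument A agrees with word i. */
    #define M 4                     /* number of words */

    int main(void)
    {
        unsigned char word[M] = {0xA5, 0xA7, 0x15, 0xE5};
        unsigned char A = 0xA5;     /* argument register */
        unsigned char K = 0xF0;     /* key register: compare only the high nibble */
        int Mreg[M];                /* match register, one bit per word */

        for (int i = 0; i < M; i++)
            /* A ^ word[i] has 1s where bits differ; masking with K keeps
               only the unmasked bit positions */
            Mreg[i] = ((A ^ word[i]) & K) == 0;

        for (int i = 0; i < M; i++)
            printf("M%d = %d\n", i, Mreg[i]);  /* words 0 and 1 match on the high nibble */
        return 0;
    }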
1. Registers
Registers are small SRAM (static RAM) storage elements in the computer processor that hold
the data word, which is typically 64 bits or 128 bits. A majority of processors make use of a
status word register and an accumulator. The accumulator is primarily used to hold the data
for arithmetic operations, and the status word register is primarily used for decision
making.
2. Cache Memory
The cache basically holds chunks of information that are used frequently from the main
memory. Cache memory is also found in the processor. If the processor has a single core, it
will rarely have multiple cache levels; present-day multi-core processors typically have two
to three cache levels for every individual core, and one of the levels is shared among the
cores.
3. Main Memory
In a computer, the main memory is the memory unit that communicates directly with the CPU.
It is the primary storage unit of a computer system. The main memory is a fast and fairly
large memory that is used for storing the information throughout the computer's
operations. This type of memory is made up of ROM as well as RAM.
4. Magnetic Disks
In a computer, the magnetic disks are circular plates fabricated from plastic or metal and
coated with a magnetizable material. Both faces of a disk are frequently used, and many disks
can be stacked on a single spindle, with read/write heads available on every surface. The disks
in a computer rotate together at high speed.
5. Magnetic Tape
Magnetic tape is a magnetic recording medium made of a thin magnetizable coating over a
long, narrow strip of plastic film. It is used mainly to back up huge chunks of data. When a
computer needs to access a tape, it will first mount it to access the information; once the
information has been used, the tape is then unmounted. The access time for a magnetic tape
is much slower than for the other levels of memory, and it can take a few minutes to access
data on a tape.
In various fields of computer science and information systems, properties like Inclusion,
Coherence, and Locality are foundational principles that ensure data consistency, efficiency,
and logical organization. Understanding these properties is crucial for designing effective
algorithms, databases, distributed systems, and more.
Inclusion Property:
Definition:
The inclusion property ensures that elements of a subset or substructure are included within a
larger set or structure.
Applications:
Set Theory:
o Example: For sets A= {1,2} and B={1,2,3,4}, A⊆B because all elements of A are in
B.
File Systems:
o Example: In a Unix-like OS, if a directory grants read/write permissions to a user
group, all files within should inherit these permissions unless overridden.
Hierarchical Data Models:
o Example: In an organizational structure, if an employee is part of the
"Engineering Department," they are also part of the "Technology Division" by
inclusion.
Key Points:
The inclusion property typically refers to the idea that all elements of a subset or substructure
should be included within a larger set or structure. In the context of databases or data models,
this could mean ensuring that a record or an entity is properly included in its respective group
or category.
If a set A is a subset of a set B, the inclusion property ensures that every element of A is
also an element of B; mathematically, A ⊆ B.
Inclusion might also mean that if a user has access to a higher-level category, they should
also have access to all subcategories within it.
Coherence Property:
Definition:
The coherence property ensures consistency and logical alignment of data or processes
within a system, so that all elements work together harmoniously.
Key Points:
The coherence property is concerned with the consistency and logical alignment of data or
processes within a system. It ensures that the elements of a system work together in a
consistent and logically coherent manner.
In computer architecture, coherence refers to cache coherence, where the multiple cached
copies of the same data must remain consistent with each other.
In data modeling, it ensures that the relationships between entities are logically sound
and that the data is consistent across the model.
Coherence can refer to the logical consistency of a set of propositions or proofs. For instance,
all components of a proof or argument should support and not contradict each other.
Locality Property:
Definition: The locality property involves confining operations, computations, or data access to
local resources to reduce the need for extensive communication or data transfer across a
system.
Key Points:
Locality improves system performance by minimizing latency and reducing the load on
network resources.
It is essential in memory management, distributed systems, and networking.
Properly leveraging locality can significantly enhance user experience and system
efficiency.
The locality property relates to the idea that operations, computations, or data should be
confined to or make use of local resources or data whenever possible. This reduces the need for
extensive communication or data transfer across a system, thereby improving efficiency.
In computer science:
Locality of reference refers to the tendency of a processor to access the same set of
memory locations repetitively over a short period. This is key to cache memory
performance.
The locality property ensures that processes or data are kept close to where they are
needed, minimizing the latency and overhead associated with remote access or
communication.
In networking:
Locality can refer to the principle that local communications (within the same network
or geographic area) should be optimized, reducing the load on broader network
infrastructure.
These properties are foundational in ensuring that systems are designed efficiently, logically,
and in a manner that maximizes performance and consistency.
Summary:
Inclusion ensures that substructures are correctly represented within a larger structure,
maintaining hierarchy and integrity.
Coherence ensures that different components of a system work together logically and
consistently, which is crucial for reliability.
Locality optimizes system performance by keeping operations local, thereby reducing
unnecessary overhead and improving efficiency.
Inclusion is widely used in hierarchical databases, access control systems, and nested
structures in programming.
Coherence is critical in distributed computing, caching systems, and software design for
ensuring consistent operation across all components.
Locality is leveraged in memory management, distributed computing, and content
delivery systems to enhance performance and reduce delays.
Cache Memory:
The data or contents of the main memory that are used frequently by the CPU are stored in the
cache memory so that the processor can access that data in a shorter time. Whenever the CPU
needs to access memory, it first checks the cache memory. If the data is not found in the cache
memory, then the CPU accesses the main memory.
Cache memory is placed between the CPU and the main memory. The block diagram for a cache
memory can be represented as:
The cache is the fastest component in the memory hierarchy and approaches the speed of CPU
components.
Cache memory is organized as distinct sets of blocks, where each set contains a small fixed
number of blocks.
As shown in the figure above, the sets are represented by the rows. The example contains N sets and each
set contains four blocks. Whenever an access is made to cache, the cache controller does not
search the entire cache in order to look for a match. Rather, the controller maps the address to
a particular set of the cache and therefore searches only the set for a match.
If a required block is not found in that set, the block is not present in the cache and cache
controller does not search it further. This kind of cache organization is called set associative
because the cache is divided into distinct sets of blocks. As each set contains four blocks, the
cache is said to be four-way set associative.
When the CPU needs to access memory, the cache is examined. If the word is found in
the cache, it is read from the fast memory.
If the word addressed by the CPU is not found in the cache, the main memory is
accessed to read the word.
A block of words containing the one just accessed is then transferred from main memory to
cache memory. The block size may vary from one word (the one just accessed) to about 16
words adjacent to the one just accessed.
The performance of the cache memory is frequently measured in terms of a quantity
called hit ratio.
When the CPU refers to memory and finds the word in cache, it is said to produce a hit.
If the word is not found in the cache, it is in main memory and it counts as a miss.
The ratio of the number of hits divided by the total CPU references to memory (hits plus
misses) is the hit ratio.
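A small C sketch of the hit-ratio calculation and the resulting average access time follows; the
reference counts and access times are assumed figures, not values from these notes:

    #include <stdio.h>

    int main(void)
    {
        long hits = 9500, misses = 500;         /* assumed reference counts    */
        double t_cache = 10.0, t_main = 100.0;  /* assumed access times, in ns */

        double hit_ratio = (double)hits / (hits + misses);
        /* simple model: a miss pays the cache probe plus the main-memory access */
        double t_avg = hit_ratio * t_cache + (1.0 - hit_ratio) * (t_cache + t_main);

        printf("hit ratio = %.3f\n", hit_ratio);          /* 0.950   */
        printf("average access time = %.1f ns\n", t_avg); /* 15.0 ns */
        return 0;
    }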
Levels of memory:
Level 1 (Registers)
This is the memory in which data is stored and accepted immediately by the CPU. The
most commonly used registers are the accumulator, the program counter, the address
register, etc.
Level 2 (Cache memory)
This is the fastest memory after the registers; data is temporarily stored here for faster
access.
Level 3 (Main memory)
This is the memory on which the computer works currently. It is small in size, and once
power is off the data no longer stays in this memory.
Level 4 (Secondary memory)
This is external memory, which is not as fast as main memory, but data stays permanently
in this memory.
Cache Mapping:
As we know, the cache memory bridges the mismatch in speed between the main memory
and the processor. Whenever a cache miss occurs,
The word that is required is not present in the cache memory.
The block that consists of the required word has to be mapped from the main memory.
We can perform such mapping using various different techniques of cache mapping.
Let us discuss the different techniques of cache mapping below.
In simpler words, cache mapping refers to a technique by which we bring the contents of main
memory into the cache memory. Here is a diagram that illustrates the actual process of mapping:
Important Note:
The main memory gets divided into multiple partitions of equal size, known as the
frames or blocks.
The cache memory is actually divided into various partitions of the same sizes as that of
the blocks, known as lines.
During the process of cache mapping, the main memory block is simply copied to the
cache; the block is not removed from the main memory.
1. Direct Mapping:
In the case of direct mapping, a certain block of the main memory is able to map to only one
particular line of the cache. The line number of the cache to which any distinct block can map
is given by the following:
Cache line number = (Address of the Main Memory Block ) Modulo (Total number of lines in
Cache)
For example,
Let us consider that a particular cache memory is divided into a total of 'n' lines.
Then, block 'j' of the main memory would be able to map only to line number (j mod n) of
the cache.
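This placement rule can be sketched in C as follows; the number of cache lines and the block
numbers are arbitrary example values:

    #include <stdio.h>

    /* Direct-mapping placement rule:
       cache line = (main-memory block number) mod (number of cache lines) */
    int main(void)
    {
        int n = 8;                              /* assumed number of cache lines */
        int blocks[] = {0, 5, 8, 13, 21};

        for (int i = 0; i < 5; i++)
            printf("block %2d -> line %d\n", blocks[i], blocks[i] % n);
        /* blocks 5 and 13 both map to line 5, so they conflict in a
           direct-mapped cache even when other lines are empty */
        return 0;
    }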
Direct Mapping:
In direct mapping, the cache consists of ordinary high-speed random-access memory. Each
location in the cache holds data from a specific main memory address, and the cache location is
given by the lower significant bits of that main memory address. This enables the block to be
selected directly from the lower significant bits of the memory address. The remaining higher
significant bits of the address are stored in the cache with the data to complete the
identification of the cached data.
As shown in the above figure, the address from the processor is divided into two fields: a tag
and an index.
The tag consists of the higher significant bits of the address, and these bits are stored with the
data in the cache. The index consists of the lower significant bits of the address. Whenever the
memory is referenced, the index part of the address is used to access the cache, and the stored
tag is compared with the tag of the required address.
For a memory read operation that misses in the cache, the word is transferred into the cache
from main memory. It is possible to pass the information to the cache and the processor
simultaneously.
Where a cache line holds several words, the main memory address consists of a tag, an index
and a word position within a line. All the words within a line in the cache have the same stored
tag.
For a read operation, if the tags are the same, the word within the block is selected for transfer
to the processor. If the tags are not the same, the block containing the required word is first
transferred to the cache. In direct mapping, blocks of the main memory with the same index
will map into the same block in the cache, and hence only blocks with different indices can be
in the cache at the same time. It is important that all words in the cache have different indices;
the tags may be the same or different.
Fully Associative Mapping:
In fully associative mapping, a main memory block is capable of mapping to any line of the
cache that is available freely at that particular moment. This makes fully associative mapping
more flexible than direct mapping.
For Example
Let us consider the scenario given as follows:
The replacement algorithm suggests a block that is to be replaced whenever all the
cache lines happen to be occupied.
So, replacement algorithms such as LRU Algorithm, FCFS Algorithm, etc., are employed.
Division of Physical Address:
In the case of fully associative mapping, the physical address is divided into just a tag and a
word field.
Set-Associative Mapping:
In k-way set-associative mapping,
The cache lines are grouped into various sets, where every set consists of k lines.
Any given main memory block can map only to a particular cache set.
However, within that very set, the block of memory can map to any cache line that is freely
available.
The cache set to which a certain main memory block can map is given as follows:
Cache set number = (Block Address of the Main Memory) Modulo (Total Number of sets
present in the Cache)
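A C sketch of this placement rule follows; the number of sets and the block numbers are
assumed example values:

    #include <stdio.h>

    /* Set-associative placement rule:
       cache set = (main-memory block number) mod (number of sets) */
    int main(void)
    {
        int sets = 4;                  /* assumed: 8 lines as 4 sets of 2 (two-way) */
        int blocks[] = {0, 4, 8, 13, 21};

        for (int i = 0; i < 5; i++)
            printf("block %2d -> set %d\n", blocks[i], blocks[i] % sets);
        /* blocks 0, 4 and 8 all map to set 0, but a two-way set can hold two
           of them at once before a replacement is needed */
        return 0;
    }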
For Example
Let us consider the example given as follows of a two-way set-associative mapping:
In this case,
The k-way set-associative mapping is a combination of the direct mapping and the fully
associative mapping.
It makes use of fully associative mapping within each set.
Therefore, the k-way set-associative mapping needs a replacement algorithm of some
type.
Division of Physical Address:
In the case of k-way set-associative mapping, the physical address is divided into a tag, a set
field and a word field.
Special Cases:
If k = 1, the k-way set-associative mapping becomes direct mapping. Thus,
Direct Mapping = one-way set-associative mapping.
If k equals the total number of lines present in the cache, the k-way set-associative
mapping becomes fully associative mapping.
In set-associative mapping, a cache is divided into sets of blocks. The number of blocks in a set
is known as the associativity, or set size. Each block in each set has a stored tag. This tag,
together with the index, completely identifies the block.
Thus, set-associative mapping allows a limited number of blocks with the same index and
different tags. An example of a four-way set-associative cache having four blocks in each set is
shown in the following figure.
In this type of cache, the following steps are used to access the data from a cache:
1. The index of the address from the processor is used to access the set.
2. Then the comparators are used to compare all tags of the selected set with the incoming
tag.
3. If a match is found, the corresponding location is accessed.
4. If no match is found, an access is made to the main memory.
The tag address bits are always chosen to be the most significant bits of the full address, the
block address bits are the next significant bits and the word/byte address bits are the least
significant bits. The number of comparators required in the set associative cache is given by the
number of blocks in a set. The set can be selected quickly and all the blocks of the set can be
read out simultaneously with the tags before waiting for the tag comparisons to be made. After
a tag has been identified, the corresponding block can be selected.
In fully associative type of cache memory, each location in cache stores both memory address
as well as data.
If a match is found, the corresponding data is read out. Otherwise, the main memory is
accessed if the address is not found in the cache.
This method is known as fully associative mapping approach because cached data is related to
the main memory by storing both memory address and data in the cache. In all organisations,
data can be more than one word as shown in the following figure.
A line constitutes four words, each word being 4 bytes. In such a case, the least significant part
of the address selects the particular byte, the next part selects the word, and the remaining bits
form the tag. These address bits are compared to the tags stored in the cache. The whole line
can be transferred to and from the cache in one transaction if there are sufficient data paths
between the main memory and the cache. With only one data word path, the words of the line
have to be transferred in separate transactions.
The main advantage of a fully associative mapped cache is that it provides the greatest flexibility
in holding combinations of blocks in the cache, with minimal conflict for a given cache.
The fully associative mechanism is usually employed by microprocessors with a small internal
cache.
Reducing cache misses is crucial for improving the performance of programs, especially in
systems where memory access time is a significant bottleneck. Cache misses occur when the
data needed by the CPU is not found in the cache, requiring the data to be fetched from a
slower level of memory. Here are some techniques for reducing cache misses:
1. Exploiting Locality of Reference
Temporal Locality: This principle suggests that if a data item is accessed once, it is likely
to be accessed again soon. To exploit temporal locality:
o Reusing Data: Write programs that reuse recently accessed data, so it stays in
the cache longer.
o Loop Fusion: Combine loops that access the same data to enhance temporal
locality.
Spatial Locality: This principle indicates that if a data item is accessed, nearby data
items are likely to be accessed soon. To exploit spatial locality:
o Data Structuring: Organize data structures (like arrays) so that related data is
stored contiguously in memory.
o Loop Interchange: Change the nesting order of loops to access data in memory
sequentially, improving spatial locality.
2. Blocking (Tiling)
Technique: Break down large problems (e.g., matrix multiplication) into smaller blocks
or tiles that fit into the cache. Process these smaller blocks fully before moving to the
next, ensuring that data loaded into the cache is reused multiple times.
Example: In matrix multiplication, rather than multiplying entire rows by columns, the
matrices are divided into smaller submatrices (blocks), and these blocks are multiplied,
reducing cache misses by keeping a small part of the matrices in cache.
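A C sketch of blocked matrix multiplication follows; the matrix size N and the tile size B are
assumed values, and B should be chosen so that the tiles in use fit in the cache:

    #include <stdio.h>

    #define N 64
    #define B 16

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }

        for (int ii = 0; ii < N; ii += B)          /* tile row    */
            for (int jj = 0; jj < N; jj += B)      /* tile column */
                for (int kk = 0; kk < N; kk += B)  /* tile depth  */
                    /* multiply one pair of B x B tiles; the tiles are reused
                       from the cache instead of being refetched from memory */
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double s = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }

        printf("c[0][0] = %f\n", c[0][0]);         /* 128.0 for these inputs */
        return 0;
    }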
3. Loop Optimizations
Loop Interchange: Change the order of nested loops to access data in a cache-friendly
manner, i.e., row-major order if the array is stored in row-major format.
o Example: Instead of accessing matrix elements column-wise (which can cause
cache misses due to the way data is stored in memory), access them row-wise.
Loop Fusion (Jamming): Combine two adjacent loops that operate on the same data set
into a single loop to reduce the cache misses.
o Example: If two separate loops both iterate over the same array, fusing them
into one loop can ensure that data remains in cache between operations.
Loop Unrolling: Increase the number of operations performed within a single iteration
of the loop, reducing the overhead of loop control and improving cache usage.
o Example: Instead of processing one array element per loop iteration, process
multiple elements, reducing the number of iterations and increasing data reuse.
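The effect of loop interchange, the first of these optimizations, can be sketched in C; the array
size is an assumed value. C stores two-dimensional arrays in row-major order, so the second
version walks consecutive addresses:

    #include <stdio.h>

    #define N 512
    static double m[N][N];

    int main(void)
    {
        double sum = 0.0;

        /* cache-hostile: the inner loop strides N*sizeof(double) bytes,
           touching a different cache line on every access */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];

        /* cache-friendly after interchange: the inner loop visits
           consecutive addresses, so each fetched line is fully used */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];

        printf("%f\n", sum);
        return 0;
    }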
4. Prefetching
Technique: Fetch data into the cache before it is actually needed, either automatically by a
hardware prefetcher or explicitly through software prefetch instructions, so that the data is
already present in the cache when the access occurs.
5. Data Alignment
Align Data to Cache Line Boundaries: Aligning data structures to cache line boundaries
can reduce the number of cache lines required to store the data, thus reducing cache
misses.
o Example: Ensure that arrays or structures start at memory addresses that are
multiples of the cache line size, avoiding unnecessary cache line splits.
Array Padding: Add padding to data structures to avoid conflict misses due to multiple
frequently accessed data items mapping to the same cache set.
o Example: In some cases, adding a small amount of padding between array
elements or structure members can prevent multiple elements from mapping to
the same cache line.
Use of Smaller Data Types: Smaller data types occupy less space in the cache,
potentially allowing more data to fit within the cache and reducing misses.
o Example: Use float instead of double if precision is not a concern, or use packed
data structures.
Compact Data Structures: Use data structures that minimize memory overhead (like
tightly packed arrays or custom data structures) to maximize cache efficiency.
o Example: Replace a linked list (which has poor spatial locality due to pointer-
based storage) with a contiguous array or vector for better cache performance.
Structure of Arrays (SoA) vs. Array of Structures (AoS): Depending on access patterns,
using SoA instead of AoS can be more cache-friendly, particularly when you need to
process one field across many structures.
o Example: If you frequently access the x field of a list of 3D points (where each
point has x, y, z coordinates), storing all x values contiguously (SoA) might reduce
cache misses compared to interleaving x, y, z (AoS).
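The AoS and SoA layouts can be contrasted in a short C sketch; the element count and field
names are illustrative:

    #include <stdio.h>

    #define N 8

    struct PointAoS  { float x, y, z; };           /* AoS: x, y, z interleaved */
    struct PointsSoA { float x[N], y[N], z[N]; };  /* SoA: each field packed   */

    int main(void)
    {
        struct PointAoS  aos[N];
        struct PointsSoA soa;

        for (int i = 0; i < N; i++) {
            aos[i].x = soa.x[i] = (float)i;
            aos[i].y = soa.y[i] = 0.0f;
            aos[i].z = soa.z[i] = 0.0f;
        }

        float s1 = 0.0f, s2 = 0.0f;
        for (int i = 0; i < N; i++) s1 += aos[i].x;  /* strides over y and z too  */
        for (int i = 0; i < N; i++) s2 += soa.x[i];  /* touches only the x values */

        printf("%f %f\n", s1, s2);                   /* both print 28.0 */
        return 0;
    }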
Optimize Algorithms: Choose or design algorithms that reduce the amount of data that
needs to be simultaneously in memory (working set), fitting more of the working set
into the cache.
o Example: Replace a brute-force algorithm that processes large datasets with a
more efficient algorithm that reduces the memory footprint.
Initialize Data: Properly initialize data structures so that when they are first accessed,
they do not cause unnecessary cache misses.
Use Warm-Up Phases: In some applications, a warm-up phase can be used to pre-load
important data into the cache before the actual computation starts.
Partitioning Cache: If the workload is known to access specific data patterns, cache
partitioning can be used to allocate certain parts of the cache to different data sets,
minimizing conflict misses.
o Example: In real-time systems, dedicating specific cache regions to critical tasks
can ensure consistent cache performance.
Virtual Memory:
In a VM implementation, a process looks at the resources with a logical view and the CPU looks
at it from a Physical or real view of resources. Every program or process begins with its starting
address as ‘0’ (Logical view). However, there is only one real '0' address in Main Memory.
Further, at any instant, many processes reside in Main Memory (Physical view). A Memory
Management Hardware provides the mapping between logical and physical view.
VM is a hardware implementation assisted by the OS's Memory Management task. The basic
facts of VM are:
All memory references by a process are logical and are dynamically translated by
hardware into physical addresses.
There is no need for the whole program code or data to be present in Physical
memory, and neither the data nor the program needs to be present in contiguous locations
of Physical Main Memory. Similarly, every process may also be broken up into pieces
and loaded as necessitated.
The storage in secondary memory need not be contiguous. (Remember your single
file may be stored in different sectors of the disk, which you may observe while
doing defrag).
However, the Logical view is contiguous. Rests of the views are transparent to the
user.
Any VM design has to address the following factors, choosing among the options available.
Segmentation:
In segmentation, a program is divided into variable-size segments, and a location within a
segment is addressed by specifying the base address of the segment and the offset within the
segment as in figure.
A segment table is required to be maintained with the details of those segments in MM and
their status. The figure shows typical entries in a segment table. A segment table resides in the
OS area in MM. The part of a segment that is sharable with other programs/processes is created
as a separate segment, and the access rights for the segment are set accordingly. The Presence
bit indicates that the segment is available in MM. The Change bit indicates that the content of
the segment has been changed after it was loaded in MM and is no longer a copy of the Disk
version.
Please recall that in a multilevel hierarchical memory, the lower level has to be in coherence
with the immediately higher level. The address translation in the segmentation implementation
is as shown in the figure. The virtual address generated by the program is required to be
converted into a physical address in MM. The segment table helps achieve this translation.
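A C sketch of this translation follows; the segment-table contents, base addresses and limits are
assumed values:

    #include <stdio.h>

    /* Segment-table translation: physical address = segment base + offset,
       after the presence and limit checks */
    struct SegEntry { int present; unsigned base, limit; };

    int main(void)
    {
        struct SegEntry segtab[3] = {
            {1, 0x4000, 0x1000},  /* segment 0: in MM at 0x4000, 4 KB long   */
            {0, 0,      0x2000},  /* segment 1: not present -> segment fault */
            {1, 0x9000, 0x0800},
        };
        unsigned seg = 0, offset = 0x0123;  /* virtual address = (segment, offset) */

        if (!segtab[seg].present)
            printf("segment fault: the OS must load the segment from disk\n");
        else if (offset >= segtab[seg].limit)
            printf("access violation: offset lies outside the segment\n");
        else
            printf("physical address = 0x%X\n", segtab[seg].base + offset);  /* 0x4123 */
        return 0;
    }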
Generally, a Segment size coincides with the natural size of the program/data. Although this is
an advantage on many occasions, there are two problems to be addressed in this regard.
1. Identifying a contiguous area in MM for the required segment size is a complex process.
2. As we see, chunks are identified and allotted as per requirement. There is a possibility
that there may be some gaps of memory in small chunks which are too small to be
allotted to a new segment. At the same time, the sum of such gaps may become large
enough to be considered undesirable. These gaps are called external fragmentation.
External fragments are cleared by a special OS process such as Compaction.
Paging:
Paging is another implementation of Virtual Memory. The logical storage is marked as Pages of
some size, say 4KB. The MM is viewed and numbered as page frames. Each page frame equals
the size of a Page. The Pages from the logical view are fitted into the empty Page Frames in MM.
This is analogous to placing a book in a bookshelf. Also, the concept is similar to cache blocks
and their placement. The figure explains how the pages of two programs are fitted in Page
Frames in MM. As you can see, any page can be placed into any available Page Frame.
Unallotted Page Frames are shown in white.
This mapping must be maintained in a Page Table, which is used during address translation.
Typically a page table entry contains the virtual page address, the corresponding physical frame
number where the page is stored, a Presence bit, a Change bit and Access rights (refer to the
figure). This Page Table is referred to in order to check whether the desired Page is available in
the MM. The Page Table resides in a part of MM. Thus every memory access requested by the
CPU will reference memory twice – once for the Page Table and a second time to get the data
from the accessed location. This is called the Address Translation Process and is detailed in the
figure.
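The translation step itself can be sketched in C; the page size and page-table contents are
assumed values:

    #include <stdio.h>

    #define PAGE_SIZE 4096u     /* assumed 4 KB pages */

    int main(void)
    {
        /* page_table[virtual page] = frame number; -1 marks a page not in MM */
        int page_table[4] = {7, -1, 3, 0};
        unsigned va = 2 * PAGE_SIZE + 0x2A;  /* virtual page 2, offset 0x2A */

        unsigned vpn    = va / PAGE_SIZE;    /* virtual page number      */
        unsigned offset = va % PAGE_SIZE;    /* position within the page */

        if (page_table[vpn] < 0)
            printf("page fault: the OS must load the page from disk first\n");
        else
            printf("physical address = 0x%X\n",
                   (unsigned)page_table[vpn] * PAGE_SIZE + offset);  /* 0x302A */
        return 0;
    }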
Page size determination is an important factor in obtaining maximum page hits and minimum
thrashing. Thrashing is very costly in VM, as it means getting data from the Disk, which is likely
to be about 1000 times slower than MM.
In the Paging mechanism, Page Frames of fixed size are allotted. There is a possibility that some
of the pages may have contents smaller than the page size, as we see in our printed books. This
causes unutilized space (a fragment) in a page frame, which cannot be used for any other
purpose. Since these fragments are inside the allotted Page Frame, this is
called Internal Fragmentation.
o The Change bit indicates that the segment/page in main memory is not a true
copy of that in Disk; if this segment/page is a candidate for replacement, it is to
be written onto the disk before replacement. This logic is part of the Address
Translation mechanism.
o Segment/Page access rights are checked to verify any access violation. For example,
one with the Read-only attribute cannot be allowed WRITE access.
If the requested Segment/Page is not in the respective table, it is not available in
MM, and a Segment/Page Fault is generated. Subsequently, what happens is:
o The OS takes over to READ the segment/page from the DISK.
o A segment needs to be allotted from the available free space in MM. In the case of
paging, an empty Page Frame needs to be identified.
o In case the free space/Page Frame is unavailable, a replacement algorithm
plays its role to identify the candidate Segment/Page Frame.
o The data from the Disk is written onto the MM.
o The Segment/Page Table is updated with the necessary information that a new
block is available in MM.
The Translation Lookaside Buffer (TLB) is a hardware feature designed to speed up the Page
Table lookup by avoiding one extra access to MM. A TLB is a fully associative cache of the Page
Table. The entries in the TLB correspond to the recently used translations. The TLB is sometimes
referred to as an address cache. The TLB is part of the Memory Management Unit (MMU), and
the MMU is present in the CPU block.
TLB entries are similar to that of Page Table. With the inclusion of TLB, every virtual address is
initially checked in TLB for address translation. If it is a TLB Miss, then the page table in MM is
looked into. Thus, a TLB Miss does not cause Page fault. Page fault will be generated only if it is
a miss in the Page Table too but not otherwise. Since TLB is an associative address cache in CPU,
TLB hit provides the fastest possible address translation; Next best is the page hit in Page Table;
worst is the page fault.
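This lookup order can be sketched in C; the TLB size, table contents and frame numbers are
assumed values:

    #include <stdio.h>

    /* Lookup order: TLB first; on a TLB miss, the page table in MM; only a
       miss in the page table as well is a page fault. */
    #define TLB_SIZE 4

    struct TlbEntry { int valid; unsigned vpn, frame; };

    int translate(struct TlbEntry *tlb, int *page_table, unsigned vpn)
    {
        for (int i = 0; i < TLB_SIZE; i++)      /* fully associative search */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                printf("TLB hit\n");
                return (int)tlb[i].frame;
            }
        printf("TLB miss -> page table in MM\n");
        if (page_table[vpn] >= 0)
            return page_table[vpn];             /* page hit in the Page Table */
        printf("page fault -> OS loads the page from disk\n");
        return -1;
    }

    int main(void)
    {
        struct TlbEntry tlb[TLB_SIZE] = { {1, 2, 9} };  /* one cached translation */
        int page_table[8] = {5, -1, 9, 4, -1, -1, -1, -1};

        printf("frame = %d\n", translate(tlb, page_table, 2)); /* TLB hit          */
        printf("frame = %d\n", translate(tlb, page_table, 3)); /* TLB miss, PT hit */
        printf("frame = %d\n", translate(tlb, page_table, 1)); /* page fault       */
        return 0;
    }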
Having discussed the various individual Address translation options, it is to be understood that
in a Multilevel Hierarchical Memory all the functional structures coexist, i.e. the TLB, Page
Tables, Segment Tables, Cache (multiple levels), Main Memory and Disk. There can be many
Page Tables, and of many levels too, in which case a few Page Tables may reside in Disk. In this
scenario, what is the hierarchy of verification of tables for address translation and data service
to the CPU?
Address Translation verification sequence starts from the lowest level i.e.
TLB -> Segment / Page Table Level 1 -> Segment / Page Table Level n
Once the address is translated into a physical address, then the data is serviced to CPU. Three
possibilities exist depending on where the data is.
Case 1 - TLB or PT hit and also Cache Hit - Data returned from Cache to CPU
Case 2 - TLB or PT hit and Cache Miss - Data returned from MM to CPU and Cache
Case 3 - Page Fault - Data from disk loaded into a segment / page frame in MM; MM returns
data to CPU and Cache
It is simple, in case of Page hit either Cache or MM provides the Data to CPU readily. The
protocol between Cache and MM exists intact. If it is a Segment/Page fault, then the routine is
handled by OS to load the required data into Main Memory. In this case, data is not in the
cache too. Therefore, while returning data to CPU, the cache is updated treating it as a case of
Cache Miss.
Advantages of Virtual Memory:
Generality - ability to run programs that are larger than the size of physical memory.
Storage management - allocation/deallocation either by Segmentation or Paging
mechanisms.
Protection - regions of the address space in MM can selectively be marked as Read Only,
Execute
Flexibility - portions of a program can be placed anywhere in Main Memory without
relocation
Storage efficiency - retain only the most important portions of the program in memory
Concurrent I/O - execute other processes while loading/dumping pages. This increases
the overall performance.
Expandability - Programs/processes can grow in virtual address space.
Seamless and better Performance for users.
Mapping Functions:
The mapping functions are used to map a particular block of main memory to a particular block
of cache. This mapping function is used to transfer the block from main memory to cache
memory. Three different mapping functions are available:
Direct mapping:
A particular block of main memory can be brought to a particular block of cache memory. So, it
is not flexible.
Associative mapping:
In this mapping function, any block of Main memory can potentially reside in any cache block
position. This is a much more flexible mapping method.
Block-set-associative mapping:
In this method, blocks of cache are grouped into sets, and the mapping allows a block of main
memory to reside in any block of a specific set. From the flexibility point of view, it is in
between the other two methods.
All these three mapping methods are explained with the help of an example.
Consider a cache of 4096 (4K) words with a block size of 32 words. Therefore, the cache is
organized as 128 blocks. For 4K words, required address lines are 12 bits. To select one of the
block out of 128 blocks, we need 7 bits of address lines and to select one word out of 32 words,
we need 5 bits of address lines. So the total 12 bits of address is divided for two groups, lower 5
bits are used to select a word within a block, and higher 7 bits of address are used to select any
block of cache memory.
Let us consider a main memory system consisting of 64K words. The size of the address bus is
16 bits. Since the block size of the cache is 32 words, the main memory is also organized with a
block size of 32 words. Therefore, the total number of blocks in main memory is 2048 (2K x 32
words = 64K
words). To identify any one block of 2K blocks, we need 11 address lines. Out of 16 address
lines of main memory, lower 5 bits are used to select a word within a block and higher 11 bits
are used to select a block out of 2048 blocks.
Number of blocks in cache memory is 128 and number of blocks in main memory is 2048, so at
any instant of time only 128 blocks out of 2048 blocks can reside in cache memory. Therefore,
we need a mapping function to put a particular block of main memory into an appropriate block
of cache memory.
Since more than one main memory block is mapped onto a given cache block position,
contention may arise for that position. This situation may occur even when the cache is not full.
Contention is resolved by allowing the new block to overwrite the currently resident block. So
the replacement algorithm is trivial.
When a new block is first brought into the cache, the high order 4 bits of the main memory
address are stored in four TAG bits associated with its location in the cache. When the CPU
generates a memory request, the 7-bit block address determines the corresponding cache
block. The TAG field of that block is compared to the TAG field of the address. If they match, the
desired word specified by the low-order 5 bits of the address is in that block of the cache.
If there is no match, the required word must be accessed from the main memory; that is, the
contents of that block of the cache are replaced by the new block specified by the new address
generated by the CPU, and correspondingly the TAG bits are changed to the high-order 4 bits of
the address. The whole arrangement for the direct mapping technique is shown in
the figure below.
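The address breakdown in this example can be sketched in C; the example address is arbitrary:

    #include <stdio.h>

    /* Direct mapping example above: a 16-bit address split into a 4-bit TAG,
       a 7-bit block (index) field and a 5-bit word field */
    int main(void)
    {
        unsigned addr = 0xA3F7;                 /* arbitrary 16-bit address */

        unsigned word  =  addr        & 0x1F;   /* low 5 bits  */
        unsigned block = (addr >> 5)  & 0x7F;   /* next 7 bits */
        unsigned tag   = (addr >> 12) & 0x0F;   /* high 4 bits */

        printf("tag = %u, cache block = %u, word = %u\n", tag, block, word);
        /* the block field selects one of the 128 cache blocks; the stored
           TAG must equal 'tag' for a hit */
        return 0;
    }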
In associative mapping, the main memory address is divided into two groups: the low-order 5
bits identify a word within a block, and the remaining high-order 11 bits form the TAG. The TAG
bits of an address received from the CPU must be compared to the TAG bits of each block of the
cache to see if the desired block is present.
In the associative mapping, any block of main memory can go to any block of cache, so it has
complete flexibility, and we have to use a proper replacement policy to replace a block from the
cache if the currently accessed block of main memory is not present in the cache. It might not
be practical to use this complete flexibility of the associative mapping technique due to
searching overhead, because the TAG field of the main memory address has to be compared
with the TAG fields of all the cache blocks. In this example, there are 128 blocks in the cache
and the size of the TAG is 11 bits. The whole arrangement of the Associative Mapping Technique
is shown in the figure below.
Consider the same cache memory and main memory organization as in the previous example,
but organize the cache with 4 blocks in each set. The TAG field of the associative mapping
technique is now divided into two groups: one is termed the SET bits and the second one the
TAG bits. Since each set contains 4 blocks, the total number of sets is 32. The main memory
address is grouped into three parts: the low-order 5 bits are used to identify a word within a
block; since there are 32 sets present, the next 5 bits are used to identify the set; and the
high-order 6 bits are used as TAG bits.
The 5-bit set field of the address determines which set of the cache might contain the desired
block. This is similar to the direct mapping technique: in direct mapping the index looks up a
block, whereas in block-set-associative mapping it looks up a set. The TAG field of the address must
then be compared with the TAGs of the four blocks of that set. If a match occurs, then the block
is present in the cache; otherwise the block containing the addressed word must be brought to
the cache. This block will potentially come to the corresponding set only.
Since there are four blocks in the set, we have to choose appropriately which block is to be
replaced if all the blocks are occupied. Since the search is restricted to only four blocks, the
searching complexity is reduced. The whole arrangement of the block-set-associative mapping
technique is shown in the figure below.
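The corresponding address breakdown can be sketched in C, using the same arbitrary example
address as before:

    #include <stdio.h>

    /* Four-way set-associative example above: a 16-bit address split into a
       6-bit TAG, a 5-bit SET field (32 sets) and a 5-bit word field */
    int main(void)
    {
        unsigned addr = 0xA3F7;                 /* arbitrary 16-bit address */

        unsigned word =  addr        & 0x1F;    /* low 5 bits  */
        unsigned set  = (addr >> 5)  & 0x1F;    /* next 5 bits */
        unsigned tag  = (addr >> 10) & 0x3F;    /* high 6 bits */

        printf("tag = %u, set = %u, word = %u\n", tag, set, word);
        /* the set field selects one of the 32 sets; the TAG is then compared
           with the four stored TAGs of that set in parallel */
        return 0;
    }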
It is clear that if we increase the number of blocks per set, the number of bits in the SET field is
reduced. With the increase in blocks per set, the complexity of the search is also increased. The
extreme condition of 128 blocks per set requires no set bits and corresponds to the fully
associative mapping technique with 11 TAG bits. The other extreme of one block per set is the
direct mapping method.
Replacement Algorithms:
When a new block must be brought into the cache and all the positions that it may occupy are
full, a decision must be made as to which of the old blocks is to be overwritten. In general, a
policy is required that keeps blocks in the cache when they are likely to be referenced in the
near future. However, it is not easy to determine directly which of the blocks in the cache are
about to be referenced. The property of locality of reference gives some clue for designing a
good replacement policy.
Consider a specific example of a four-block set. It is required to track the LRU (least recently
used) block of this four-block set. A 2-bit counter may be used for each block.
When a hit occurs, that is, when a read request is received for a word that is in the cache, the
counter of the block that is referenced is set to 0. All counters whose values were originally
lower than that of the referenced block are incremented by 1, and all other counters remain
unchanged.
When a miss occurs, that is, when a read request is received for a word and the word is not
present in the cache, we have to bring the block to cache.
1. If the set is not full, the counter associated with the new block loaded from the main
memory is set to 0, and the values of all other counters are incremented by 1.
2. If the set is full and a miss occurs, the block with the counter value 3 is removed, the
new block is put in its place, and its counter is set to zero. The counters of the other
three blocks are incremented by 1.
In other words, when a miss occurs and the set is not full, the new block is put into an empty
block position, its counter is set to 0, and the counter values of the occupied blocks are
incremented by one. When a miss occurs and the set is full, the block with the highest counter
value is replaced by the new block, whose counter is set to 0, and the counter values of all other
blocks of that set are incremented by 1. The overhead of the policy is small, since the counters
are only 2 bits wide and are updated with simple logic.
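The counter scheme can be sketched in C; the tags, initial counter values and reference pattern
are assumed for illustration:

    #include <stdio.h>

    /* 2-bit LRU counters for one four-block set: on a hit, counters lower
       than the referenced block's are incremented and its own goes to 0;
       on a miss, the block whose counter is 3 is replaced. */
    #define WAYS 4

    static int tag[WAYS] = {10, 20, 30, 40};   /* blocks currently in the set */
    static int ctr[WAYS] = {0, 1, 2, 3};       /* 2-bit LRU counters          */

    static void access_block(int t)
    {
        int i, hit = -1;
        for (i = 0; i < WAYS; i++)
            if (tag[i] == t) hit = i;

        if (hit >= 0) {                        /* hit: age the younger blocks */
            for (i = 0; i < WAYS; i++)
                if (ctr[i] < ctr[hit]) ctr[i]++;
            ctr[hit] = 0;
            printf("hit  %2d\n", t);
        } else {                               /* miss: evict counter == 3    */
            for (i = 0; i < WAYS; i++)
                if (ctr[i] == 3) hit = i;
            printf("miss %2d (evicts block %d)\n", t, tag[hit]);
            tag[hit] = t;
            for (i = 0; i < WAYS; i++) ctr[i] = (ctr[i] + 1) & 3;
            ctr[hit] = 0;
        }
    }

    int main(void)
    {
        access_block(30);   /* hit: blocks 10 and 20 age by one      */
        access_block(50);   /* miss: evicts block 40 (counter was 3) */
        access_block(10);   /* hit: block 10 is still resident       */
        return 0;
    }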