Computer Organization UNIT5
Topics covered in this unit:
RISC and CISC
Parallel Processing and Pipelining
Vector Processing
Multiprocessors and Interconnection Structures (time-shared bus, multiport memory, crossbar switch, multistage switching network, hypercube)
Cache Coherence
Inter-Process Communication and Synchronization
Computer Organization | RISC and CISC
Reduced instruction set architecture (RISC)
The main idea is to make the hardware simpler by using an instruction set composed of a few basic steps for loading, evaluating, and storing operations: a load command loads data, and a store command stores data.
Complex instruction set architecture (CISC)
The main idea is that a single instruction performs all loading, evaluating, and storing operations, just as a multiplication command loads the data, evaluates it, and stores the result; hence it is complex.
CISC: the CISC approach attempts to minimize the number of instructions per program, but at the cost of an increase in the number of cycles per instruction.
Characteristics of RISC:
Simpler instructions, hence simple instruction decoding.
Characteristics of CISC:
An instruction may take more than a single clock cycle to get executed.
A smaller number of general-purpose registers, as operations get performed in memory itself.
Advantages of RISC:
Simpler instructions: RISC processors use a smaller set of simple instructions, which makes them easier to decode and execute quickly. This results in faster processing times.
Faster execution: Because RISC processors have a simpler instruction set, they can
execute instructions faster than CISC processors.
Lower power consumption: RISC processors consume less power than CISC processors,
making them ideal for portable devices.
Disadvantages of RISC:
More instructions required: RISC processors require more instructions to perform complex
tasks than CISC processors.
Increased memory usage: RISC processors require more memory to store the additional
instructions needed to perform complex tasks.
Higher cost: Developing and manufacturing RISC processors can be more expensive than
CISC processors.
Advantages of CISC:
Reduced code size: CISC processors use complex instructions that can perform multiple
operations, reducing the amount of code needed to perform a task.
More memory efficient: Because CISC instructions are more complex, they require fewer
instructions to perform complex tasks, which can result in more memory-efficient code.
Widely used: CISC processors have been in use for a longer time than RISC processors, so
they have a larger user base and more available software.
Disadvantages of CISC:
Slower execution: CISC processors take longer to execute instructions because they have
more complex instructions and need more time to decode them.
More complex design: CISC processors have more complex instruction sets, which makes
them more difficult to design and manufacture.
Higher power consumption: CISC processors consume more power than RISC processors
because of their more complex instruction sets.
What is Parallel Processing ?
The term 'parallel processing' denotes a large class of techniques that perform simultaneous data-processing operations in order to increase the computational speed of a computer system. A parallel processing system is capable of concurrent data processing to achieve faster execution times.
As an example, the next instruction can be read from memory while the current instruction is being executed in the ALU. The system can have two or more ALUs and be able to execute two or more instructions at the same time. Using two or more processors likewise increases the processing capacity of the computer, but with parallel processing the cost of the system also increases. However, technological development has reduced hardware costs to the point where parallel processing methods are economically feasible.
Parallel processing can occur at multiple levels of complexity.
At the lowest level, parallel and serial operations are distinguished by the type of registers used: shift registers work one bit at a time in a serial fashion, while parallel registers work with all bits of the word simultaneously.
At higher levels of complexity, parallel processing derives from having a plurality of functional units that perform identical or different operations simultaneously.
Parallel processing is established by distributing data among several functional units.
As an example, arithmetic, shift, and logic operations can be divided among three units, and the operands are routed to each unit under the supervision of a control unit.
One possible method of dividing the execution unit into eight functional units operating in parallel is shown in figure.
Depending on the operation specified by the instruction, operands in the registers are transferred to one of the units,
associated with the operands. In each functional unit, the operation performed is denoted in each block of the diagram.
The arithmetic operations with integer numbers are performed by the adder and integer multiplier.
Floating-point operations can be divided into three circuits operating in parallel. Logic, shift, and increment
operations are performed concurrently on different data. All units are independent of each other, so one
number can be shifted while another number is being incremented. Generally, a multi-functional organization is
associated with a complex control unit that coordinates all the activities among the several components.
The main advantage of parallel processing is that it provides better utilization of system resources by increasing
resource multiplicity, which improves overall system throughput.
Pipelining : Type 1
To improve the performance of a CPU we have two options: 1) Improve the hardware by introducing faster circuits. 2) Arrange the
hardware such that more than one operation can be performed at the same time. Since there is a limit on the speed of hardware and
the cost of faster circuits is quite high, we have to adopt the second option.
Pipelining is a process of arrangement of hardware elements of the CPU such that its overall performance is increased.
Simultaneous execution of more than one instruction takes place in a pipelined processor.
Space-time diagram for a 4-segment pipeline executing instructions I1 and I2:
Cycle:  1    2    3    4    5
S1      I1   I2
S2           I1   I2
S3                I1   I2
S4                     I1   I2
ETpipeline = k + n – 1 cycles = (k + n – 1) Tp
In the same case, for a non-pipelined processor, the execution time of ‘n’ instructions will be:
ETnon-pipeline = n * k * Tp
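These two formulas can be sanity-checked with a short sketch (the stage count, instruction count, and cycle time below are illustrative values, not from the text):

```python
# Sketch: pipeline vs. non-pipeline execution time, using the formulas
# above (k = number of stages, n = instructions, Tp = cycle time).

def et_pipeline(k, n, tp):
    """Pipelined execution time: (k + n - 1) cycles of Tp each."""
    return (k + n - 1) * tp

def et_non_pipeline(k, n, tp):
    """Non-pipelined execution time: each instruction takes k cycles."""
    return n * k * tp

def speedup(k, n):
    """Speedup = (n * k) / (k + n - 1); approaches k as n grows."""
    return (n * k) / (k + n - 1)

# Example: 4-stage pipeline, 100 instructions, 10 ns cycle time.
print(et_pipeline(4, 100, 10))      # 1030 ns
print(et_non_pipeline(4, 100, 10))  # 4000 ns
print(round(speedup(4, 100), 2))    # 3.88, close to the ideal k = 4
```

Note how the speedup tends toward k (the number of stages) as n becomes large, which is why deep pipelines pay off only for long instruction streams.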
Arithmetic Pipeline and Instruction Pipeline
1. Arithmetic Pipeline :
An arithmetic pipeline divides an arithmetic problem into various sub problems for execution in various pipeline segments. It is used
for floating point operations, multiplication and various other computations. The process or flowchart arithmetic pipeline for floating
point addition is shown in the diagram.
Floating point addition using arithmetic pipeline :
2. Instruction Pipeline :
In an instruction pipeline, a stream of instructions is executed by overlapping the fetch, decode, and execute phases of the instruction
cycle. This technique is used to increase the throughput of the computer system. An instruction pipeline reads the next instruction from
memory while previous instructions are being executed in other segments of the pipeline, so multiple instructions can be executed
simultaneously. The pipeline is more efficient if the instruction cycle is divided into segments of equal duration.
In the most general case, the computer needs to process each instruction in the following sequence of steps:
Fetch the instruction from memory (FI)
Decode the instruction (DA)
Calculate the effective address
Fetch the operands from memory (FO)
Execute the instruction (EX)
Store the result in the proper place
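The overlap of these steps can be sketched as a small simulation; the four-segment FI/DA/FO/EX division follows the list above, while the helper names are my own:

```python
# Sketch: building the space-time table of an ideal 4-segment instruction
# pipeline (FI, DA, FO, EX). Instruction i enters segment s in clock
# cycle i + s (1-indexed).

STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n):
    """Return {cycle: {stage: instruction}} for n instructions."""
    table = {}
    for i in range(n):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s + 1, {})[stage] = f"I{i + 1}"
    return table

table = schedule(3)
# In cycle 3 the pipeline overlaps three instructions:
print(table[3]["FI"])  # I3 is being fetched
print(table[3]["DA"])  # I2 is being decoded
print(table[3]["FO"])  # I1 is fetching its operands
print(max(table))      # 6 cycles total = k + n - 1 = 4 + 3 - 1
```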
Vector processing
Depending on where the operands are retrieved, pipelined vector computers are classified into two architectural
configurations:
1.Memory to memory architecture –
In memory to memory architecture, source operands, intermediate and final results are retrieved (read) directly from the main
memory. For memory to memory vector instructions, the information of the base address, the offset, the increment, and the vector
length must be specified in order to enable streams of data transfers between the main memory and pipelines. The processors like TI-
ASC, CDC STAR-100, and Cyber-205 have vector instructions in memory to memory formats. The main points about memory to
memory architecture are:
1. There is no limitation of size
2. Speed is comparatively slow in this architecture
2.Register to register architecture –
In register to register architecture, operands and results are retrieved indirectly from the main memory through the use of a large
number of vector registers or scalar registers. Processors like the Cray-1 and the Fujitsu VP-200 use vector instructions in register
to register formats. The main points about register to register architecture are:
1. Register to register architecture has limited size.
2. Speed is very high as compared to the memory to memory architecture.
3. The hardware cost is high in this architecture.
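As a rough illustration of the "limited size" point, a register-to-register machine must process a long vector in register-length strips; the sketch below assumes a hypothetical 64-element vector register (the length the Cray-1 actually used):

```python
# Sketch (hypothetical register length): in a register-to-register vector
# machine the vector registers hold a fixed number of elements, so a long
# vector is processed in register-sized chunks ("strip mining").

VLEN = 64  # vector register length (machine dependent)

def vector_add(a, b):
    """Add two equal-length vectors one register load at a time."""
    result = []
    for start in range(0, len(a), VLEN):
        # load a strip of each operand into the 'vector registers'
        va = a[start:start + VLEN]
        vb = b[start:start + VLEN]
        # one vector instruction adds the whole strip
        result.extend(x + y for x, y in zip(va, vb))
    return result

# A 100-element add takes two strips: one of 64 elements, one of 36.
print(vector_add([1] * 100, [2] * 100) == [3] * 100)  # True
```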
These are two types of Array Processors: Attached Array Processor, and SIMD Array Processor.
1. Attached Array Processor :
An auxiliary processor is attached to a general-purpose host computer to improve its performance in numerical computational tasks.
Attached array processor has two interfaces:
One interface connects with the host computer through an I/O controller, so the computer treats the array
processor as an external interface; the other connects with a local memory, which in turn interconnects with the
main memory. The host computer is a general-purpose computer, and the attached processor is a back-end
machine driven by the host computer.
2. SIMD array processor : This is a computer with multiple processing units operating in parallel. Both types of array processors
manipulate vectors, but their internal organization is different.
The processing units are synchronized to perform the same operation under the control of a common control unit, thus providing a single instruction stream, multiple
data stream (SIMD) organization. As shown in the figure, a SIMD array processor contains a set of identical processing elements (PEs), each having a local memory M.
Processing Element
Each PE includes
ALU
Floating point arithmetic unit
Working Register
• The master control unit controls the operation in the PEs. Its function is to decode the instruction and determine how the instruction
is to be executed. If the instruction is a scalar or program control instruction, it is executed directly within the master control unit.
• The main memory is used for storage of the program, while each PE uses operands stored in its local memory.
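A minimal sketch of this lockstep behavior, with hypothetical PE and instruction names:

```python
# Sketch: SIMD organization. A master control unit broadcasts one decoded
# instruction; every processing element (PE) applies it to operands in
# its own local memory.

class PE:
    def __init__(self, local_memory):
        self.mem = dict(local_memory)   # each PE has a local memory M

    def execute(self, op, dst, src1, src2):
        if op == "ADD":
            self.mem[dst] = self.mem[src1] + self.mem[src2]

# Four PEs, each holding a different element of vector A and B = 10.
pes = [PE({"A": i, "B": 10}) for i in range(4)]

# Master control unit: decode once, then broadcast to all PEs in lockstep.
for pe in pes:
    pe.execute("ADD", "C", "A", "B")

print([pe.mem["C"] for pe in pes])  # [10, 11, 12, 13]
```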
Multiprocessor:
In shared memory multiprocessors, all the CPUs share the common memory, but in a
distributed memory multiprocessor, every CPU has its own private memory.
Applications of Multiprocessor
Enhanced performance.
Multiple applications.
Multi-tasking inside an application.
High throughput and responsiveness.
Hardware sharing among CPUs.
Interconnection Structures
The processors must be able to share a set of main memory modules & I/O devices in a
multiprocessor system. This sharing capability can be provided through interconnection
structures.
The interconnection structures that are commonly used are: time-shared common bus, multiport memory, crossbar switch, multistage switching network, and hypercube system.
1. Time-shared Common Bus
In a multiprocessor system, the time-shared bus interconnection provides a common communication path connecting all the functional units, such as processors, I/O processors,
and memory units. The figure below shows multiple processors with a common communication path (single bus).
To communicate with any functional unit, a processor needs the bus to transfer the
data. To do so, the processor first checks whether the bus is available by examining
its status: busy if the bus is being used by some other functional unit, free otherwise.
A processor can use the bus only when the bus is free. The sender processor puts the
address of the destination on the bus & the destination unit identifies it. In order to
communicate with any functional unit, a command is issued to tell that unit what
work is to be done. The other processors at that time will either be busy with internal
operations or sit idle, waiting for the bus.
We can use a bus controller to resolve conflicts, if any.
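The busy/free bus check can be modeled with a lock standing in for the single shared bus; the names here are illustrative:

```python
# Sketch: time-shared bus arbitration modeled with a lock. A processor
# must find the bus free before transferring; only one transfer proceeds
# at a time, so transfers are serialized.

import threading

bus = threading.Lock()        # busy/free status of the single shared bus
log = []                      # completed transfers, in the order granted

def transfer(processor, destination, data):
    with bus:                 # wait until the bus is free, then seize it
        log.append((processor, destination, data))

threads = [threading.Thread(target=transfer, args=(f"P{i}", "M0", i))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(len(log))  # 4 -> all four processors eventually got the bus
```

This also mirrors the main disadvantage listed below: however many processors exist, only one transfer at a time crosses the bus.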
Advantages
Inexpensive as no extra hardware is required such as switch.
Simple & easy to configure as the functional units are directly connected to the bus.
Disadvantages
The major drawback of this kind of configuration is that if a malfunction occurs in any of the bus interface circuits, the complete
system will fail.
Decreased throughput
At a time, only one processor can communicate with any other functional unit.
2. Crossbar Switch
The crossbar switch organization consists of a number of crosspoints placed at the intersections between processor buses and memory module
paths. The diagram shows a crossbar switch interconnection between four CPUs and four memory modules.
The small square in each crosspoint is a switch that determines the path from a processor to a memory module. Each switch point has
control logic to set up the transfer path between a processor and memory.
It examines the address placed on the bus to determine whether its particular module is being addressed. It also resolves multiple
requests for access to the same memory module on a fixed priority basis.
3. Multiport Memory
A multiport memory system employs separate buses between each memory module and each CPU. A processor bus comprises the address, data, and
control lines necessary to communicate with memory. Each memory module connects to each processor bus. At any given time, the memory module
must have internal control logic to determine which port will have access to memory.
Each memory module can be said to have four ports, and each port accommodates one of the buses. Memory access conflicts are resolved by assigning
fixed priorities to each memory port: the priority for memory access associated with each processor is established by the physical port position that its
bus occupies in each module. Thus CPU 1 has priority over CPU 2, CPU 2 has priority over CPU 3, and CPU 4 has the lowest
priority.
Advantage:-
High transfer rate can be achieved because of multiple paths
Disadvantage:-
It requires expensive memory control logic and a large number of cables and
connectors.
It is only good for systems with a small number of processors.
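The fixed-priority rule at a single memory module can be sketched in a few lines; the `grant` helper is hypothetical:

```python
# Sketch: fixed-priority conflict resolution at one multiport memory
# module. The port number encodes priority: port 1 (CPU 1) beats port 2,
# and so on, matching the physical-port-position rule described above.

def grant(requests):
    """Given the set of requesting port numbers, grant the lowest one."""
    return min(requests) if requests else None

print(grant({2, 4}))     # 2    -> CPU 2 wins over CPU 4
print(grant({1, 2, 3}))  # 1    -> CPU 1 always has the highest priority
print(grant(set()))      # None -> no request this cycle
```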
4. Multistage Switching Network
The 2×2 crossbar switch is the building block of the multistage network. It has 2 inputs (A & B) and 2 outputs (0 & 1). To establish the connection between the input
& output terminals, the control inputs CA & CB are used.
2 * 2 Crossbar Switch
We can construct a multistage network using 2×2 switches, in order to control the communication between a number of sources & destinations.
A binary tree of crossbar switches accomplishes the connections from an input to one of the 8 possible destinations.
In the above diagram, PA & PB are 2 processors connected to 8 memory modules, numbered in binary from 000 (0) to 111 (7), through switches. There are three levels from
a source to a destination, and one bit of the destination number selects the output at each level: the 1st bit determines the switch output in the
1st level, the 2nd bit in the 2nd level, & the 3rd bit in the 3rd level.
Example: If the source is PB & the destination is memory module 011 (as in the figure): a path is formed from PB to output 0
in the 1st level, output 1 in the 2nd level & output 1 in the 3rd level.
Usually, the processor acts as the source and the memory unit acts as a destination in a tightly coupled system. The
destination is a memory module. But, processing units act as both, the source and the destination in a loosely coupled
system.
Many patterns can be made using 2×2 switches such as Omega networks, Butterfly Network, etc.
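The bit-per-level routing rule can be sketched as follows (the `route` helper name is my own):

```python
# Sketch: routing through a 3-level network of 2x2 switches. Each bit of
# the 3-bit destination address selects output 0 or 1 at the
# corresponding level, exactly as in the PB -> module 011 example above.

def route(destination):
    """Return the switch output taken at each level for a destination 0-7."""
    bits = format(destination, "03b")   # e.g. 3 -> '011'
    return [int(b) for b in bits]       # outputs at levels 1, 2, 3

print(route(0b011))  # [0, 1, 1]: output 0, then 1, then 1 (module 3)
print(route(0b111))  # [1, 1, 1]: always the lower output (module 7)
```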
Conclusion :
The interconnection structure can decide the overall system's performance in a multiprocessor environment. The multistage
switching network was introduced to overcome the disadvantage of the common bus system (only one path is available) and to
reduce the complexity of other interconnection structures (a crossbar has complexity O(n²)). It uses smaller switches,
i.e., 2×2 switches, to reduce the complexity. Routing algorithms can be used to set the switches. Its complexity and cost are
less than those of the crossbar interconnection network.
5. Hypercube System
The hypercube (or binary n-cube) multiprocessor structure represents a loosely coupled system made up of N = 2^n processors interconnected in an n-dimensional binary cube.
Each processor forms a node of the cube; in effect, each node contains not only a CPU but also local memory and an I/O interface. Each processor has direct communication
paths to n neighbor processors. These paths correspond to the edges of the cube.
There are 2^n distinct n-bit binary addresses that can be assigned to the processors. Each processor's address differs from that of each of its n neighbors in exactly one bit position.
Hypercube structure for n= 1, 2 and 3.
A one-cube structure has n = 1 and 2^1 = 2.
It has two processors interconnected by a single path.
A two-cube structure has n = 2 and 2^2 = 4.
It has four nodes interconnected as a square.
An n-cube structure has 2^n nodes with a processor residing in each node.
Each node is assigned a binary address in such a manner, that the addresses of two neighbors differ in exactly one bit position. For example, the three neighbors of the node with
address 100 are 000, 110, and 101 in a three-cube structure. Each of these binary numbers differs from address 100 by one bit value.
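The one-bit-difference property means a node's neighbors can be computed by XOR-ing its address with each single-bit mask; a brief sketch:

```python
# Sketch: in an n-cube, the neighbors of a node are found by flipping
# each of the n address bits in turn (XOR with a one-bit mask).

def neighbors(node, n):
    """Addresses of the n neighbors of 'node' in an n-cube."""
    return [node ^ (1 << bit) for bit in range(n)]

# Three-cube example from the text: neighbors of node 100 (binary).
result = sorted(format(x, "03b") for x in neighbors(0b100, 3))
print(result)  # ['000', '101', '110'], matching the text
```

The same XOR trick also gives a routing rule: to travel between two nodes, flip the differing address bits one at a time, each flip crossing one cube edge.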
Interprocess communication and synchronization.
Interprocess communication is the mechanism provided by the operating system that allows processes to communicate with each other. This
communication could involve a process letting another process know that some event has occurred or the transferring of data from one process
to another.
Synchronization is a necessary part of Interprocess communication. It is either provided by the interprocess control mechanism or
handled by the communicating processes.
Semaphore A semaphore is a variable that controls the access to a common resource by multiple processes. The two types of
semaphores are binary semaphores and counting semaphores.
Mutual Exclusion Mutual exclusion requires that only one process thread can enter the critical section at a time. This is useful for
synchronization and also prevents race conditions.
Barrier A barrier does not allow individual processes to proceed until all the processes reach it. Many parallel languages and
collective routines impose barriers.
Spinlock This is a type of lock. The processes trying to acquire this lock wait in a loop while checking if the lock is available or not.
This is known as busy waiting because the process is not doing any useful operation even though it is active.
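A spinlock's busy-wait loop can be sketched in Python, using a non-blocking lock acquire as a stand-in for an atomic test-and-set instruction (the class and worker names are illustrative):

```python
# Sketch: a spinlock built on an atomic test-and-set, here modeled with
# Lock.acquire(blocking=False) as the atomic primitive.

import threading

class SpinLock:
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # busy wait: loop until the test-and-set succeeds
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(1000):
        lock.acquire()        # enter critical section: mutual exclusion
        counter += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 4000: no updates lost, thanks to mutual exclusion
```

The `while` loop is exactly the busy waiting described above: the process stays active but does no useful work until the lock becomes free.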
Cache coherence : In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy.
In a shared memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand:
one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be
changed also.
Example : Cache and the main memory may have inconsistent copies of the same object.
Suppose there are three processors, each having a cache, and the following scenario:
Processor 1 reads X: obtains 24 from the memory and caches it.
Processor 2 reads X: obtains 24 from memory and caches it.
Then processor 1 writes X = 64: its locally cached copy is updated, while main memory and processor 2's cache still hold the stale value 24.
As multiple processors operate in parallel, and independently multiple caches may possess different copies of the same memory block,
this creates a cache coherence problem. Cache coherence is the discipline that ensures that changes in the values of shared operands are
propagated throughout the system in a timely fashion.
A cache block can be in one of the following coherence states (as in the MOESI protocol). These are:
Modified – the value in the cache is dirty; that is, the value in the current cache differs from that in main memory.
Exclusive – the value present in the cache is the same as that in main memory; that is, the value is clean.
Shared – the cache holds the most recent data copy, which is shared among all the caches and main memory as well.
Owned – the current cache holds the block and is its owner; that is, it has all rights over that particular block.
Invalid – the current cache block is invalid and must be fetched from another cache or from main memory.
Coherency mechanisms
1. Directory-based – In a directory-based system, the data being shared is placed in a common directory that maintains the coherence
between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory
to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry.
2. Snooping – First introduced in 1983, snooping is a process where the individual caches monitor the address lines for accesses to memory
locations that they have cached. In a write-invalidate protocol, when a write operation is observed to a location that a cache has a
copy of, the cache controller invalidates its own copy of the snooped memory location.
3. Snarfing – A mechanism where a cache controller watches both address and data in an attempt to update its own copy of a memory
location when a second master modifies main memory. When a write operation is observed to a location that a cache has a
copy of, the cache controller updates its own copy of the snarfed memory location with the new data.
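The write-invalidate behavior can be sketched with the X = 24 / 64 scenario from the example above (the cache model is deliberately simplified: write-through, with no coherence states):

```python
# Sketch: write-invalidate snooping. On a write, every other cache that
# holds a copy of the block invalidates it; the writer's copy stays valid.

caches = [dict() for _ in range(3)]   # one cache per processor
memory = {"X": 24}

def read(p, addr):
    if addr not in caches[p]:
        caches[p][addr] = memory[addr]      # miss: fetch from memory
    return caches[p][addr]

def write(p, addr, value):
    for q, cache in enumerate(caches):      # snoop: invalidate other copies
        if q != p:
            cache.pop(addr, None)
    caches[p][addr] = value                 # writer keeps the valid copy
    memory[addr] = value                    # write-through for simplicity

read(0, "X"); read(1, "X")                  # both caches hold X = 24
write(0, "X", 64)                           # P0 writes; P1's copy invalidated
print("X" in caches[1])                     # False: stale copy is gone
print(read(1, "X"))                         # 64: fresh copy fetched again
```

Unlike the incoherent scenario in the example, processor 2 can never read the stale 24 here, because its copy is discarded the moment processor 1's write is snooped.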
Thank You