Chapter 08 - Pipeline and Vector Processing
Parallel Processing
Parallel processing is a term used to denote a large class of techniques that provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system. Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve faster execution time. For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The system may have two or more ALUs and be able to execute two or more instructions at the same time. Furthermore, the system may have two or more processors operating concurrently. The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput, that is, the amount of processing that can be accomplished during a given interval of time. Parallel processing increases the amount of hardware required, and with it the cost of the system. However, technological developments have reduced hardware costs to the point where parallel processing techniques are economically feasible.
Parallel processing can be viewed at various levels of complexity. At the lowest level, we distinguish between parallel and serial operations by the type of registers used: shift registers operate in serial fashion one bit at a time, while registers with parallel load operate on all the bits of the word simultaneously. Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. Parallel processing is established by distributing the data among the multiple functional units. For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit. All units are independent of each other. A multifunctional organization is usually associated with a complex control unit that coordinates all the activities among the various components.
There are a variety of ways that parallel processing can be classified. It can be considered from the internal organization of the processors, from the interconnection structure between processors, or from the flow of information through the system. One classification, introduced by M. J. Flynn, considers the organization of a computer system by the number of instructions and data items that are manipulated simultaneously. The normal operation of a computer is to fetch instructions from memory and execute them in the processor. The sequence of instructions read from memory constitutes an instruction stream, and the operations performed on the data in the processor constitute a data stream. Parallel processing may occur in the instruction stream, in the data stream, or in both.
Flynn's classification divides computers into four major groups as follows:
1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction streams, single data stream (MISD)
4. Multiple instruction streams, multiple data stream (MIMD)
Single Instruction Stream, Single Data Stream (SISD)
In computing, SISD (single instruction, single data) is a term referring to a computer architecture in which a single processor, a uniprocessor, executes a single instruction stream to operate on data stored in a single memory.
SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit. Instructions are executed sequentially, and the system may or may not have internal parallel processing capabilities. Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing. Instruction fetching and pipelined execution of instructions are common examples found in most modern SISD computers.
Single Instruction Stream, Multiple Data Stream (SIMD)
Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data-level parallelism. SIMD is particularly applicable to common tasks like adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions in order to improve the performance of multimedia use. SIMD represents an organization that includes many processing units under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data. The shared memory unit must contain multiple modules so that it can communicate with all the processors simultaneously.
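As an illustrative sketch (plain Python standing in for hardware processing elements, not a real SIMD instruction set), the idea of one instruction broadcast to many data items can be modeled like this:

```python
# SIMD sketch: one instruction is broadcast to every processing
# element (PE); each PE holds a different data item and all apply
# the same operation in lock step.
def simd_broadcast(instruction, data_items):
    # Real SIMD hardware updates all elements simultaneously; the
    # comprehension models that lock-step behavior in software.
    return [instruction(x) for x in data_items]

# Example: double the "volume" of eight audio samples at once.
samples = [3, -1, 4, 1, -5, 9, 2, -6]
louder = simd_broadcast(lambda x: x * 2, samples)  # one op, many data
```

The single `lambda` plays the role of the common instruction from the control unit; the list plays the role of the data distributed across the processing units.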
Multiple Instruction Streams, Single Data Stream (MISD)
In computing, MISD (multiple instruction, single data) is a type of parallel computing architecture where many functional units perform different operations on the same data. Pipeline architectures belong to this type.
The MISD structure is mainly of theoretical interest, since few practical systems have been constructed using this organization. One prominent example of MISD in computing is the Space Shuttle flight control computers, and a systolic array is another example of an MISD structure.
Multiple Instruction Streams, Multiple Data Stream (MIMD)
In computing, MIMD (multiple instruction, multiple data) is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD organization refers to a computer system capable of processing several programs at the same time. Most multiprocessor and multicomputer systems can be classified in this category. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modeling, and communication switches. MIMD machines fall into either shared-memory or distributed-memory categories, based on how the processors access memory. Shared-memory machines may be of the bus-based, extended, or hierarchical type. Distributed-memory machines may have hypercube or mesh interconnection schemes.
Flynn's classification depends on the distinction between the performance of the control unit and the data processing unit. It emphasizes the behavioral characteristics of the computer system rather than its operational and structural interconnections. One type of parallel processing that does not fit Flynn's classification is pipelining. Here we consider parallel processing under the following main topics:
1. Pipeline processing
2. Vector processing
3. Array processors
Pipeline Processing
Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. A pipeline can be visualized as a collection of processing segments through which binary information flows. Each segment performs partial processing dictated by the way the task is partitioned. The result obtained from the computation in each segment is transferred to the next segment in the pipeline, and the final result is obtained after the data have passed through all segments. The name "pipeline" implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. This overlapping of computations is made possible by associating a register with each segment in the pipeline. The registers provide isolation between segments so that each can operate on distinct data simultaneously.
The pipeline organization will be demonstrated by means of a simple example. Suppose that we want to perform the combined multiply and add operation with a stream of numbers:
Ai * Bi + Ci    for i = 1, 2, 3, ..., 7
Each suboperation is to be implemented in a segment within a pipeline. Each segment has one or two registers and a combinational circuit, as shown in the figure below. R1 through R5 are registers that receive new data with every clock pulse.
The multiplier and adder are combinational circuits.
The suboperations performed in each segment of the pipeline are as follows:
R1 ← Ai,  R2 ← Bi          Input Ai and Bi
R3 ← R1 * R2,  R4 ← Ci     Multiply and input Ci
R5 ← R3 + R4               Add Ci to the product
The five registers are loaded with new data every clock pulse. The effect of each clock pulse is shown in the table below.
The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4; the same clock pulse transfers A2 and B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously: it places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From then on, each clock pulse produces a new output and moves the data one step down the pipeline. This continues as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges from the pipeline.
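This clock-by-clock behavior can be sketched in Python. The code below is a software model, not hardware: the two inter-segment register groups are held as tuples, and new register values are computed from the old ones because every register loads on the same clock edge.

```python
def pipeline_multiply_add(A, B, C):
    # seg1 models registers R1, R2 (plus the Ci travelling with the pair);
    # seg2 models registers R3, R4. R5 is the emitted result.
    seg1 = seg2 = None
    results = []
    for clock in range(len(A) + 2):            # n inputs + 2 clocks to drain
        # Compute all "next" register values from the current ones,
        # since in hardware every register loads simultaneously.
        new_seg2 = (seg1[0] * seg1[1], seg1[2]) if seg1 else None
        if seg2:
            results.append(seg2[0] + seg2[1])  # R5 <- R3 + R4
        new_seg1 = (A[clock], B[clock], C[clock]) if clock < len(A) else None
        seg1, seg2 = new_seg1, new_seg2
    return results
```

As the text states, the first output appears on the third clock pulse, and the clock keeps running for two extra pulses after the last input to drain the pipe.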
Arithmetic Pipeline
Pipeline arithmetic units are usually found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. A pipeline multiplier is essentially an array multiplier, as described in the figure below, with special adders designed to minimize the carry propagation time through the partial products.
We will now show an example of a pipeline unit for floating-point addition and subtraction. The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers:
X = A × 2^a
Y = B × 2^b
A and B are two fractions that represent the mantissas, and a and b are the exponents. The floating-point addition and subtraction can be performed in four segments. The registers labeled R are placed between the segments to store intermediate results. The suboperations performed in the four segments are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
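A behavioral sketch of the four segments, using Python floats as mantissas in the range 0.5 ≤ |mantissa| < 1 (a simplified model that ignores register widths and rounding):

```python
def float_add_pipeline(A, a, B, b):
    """Model the four segments of the floating-point adder:
    compare exponents, align mantissas, add, normalize."""
    # Segment 1: compare the exponents; their difference sets the shift.
    diff = a - b
    # Segment 2: align the mantissa belonging to the smaller exponent
    # by shifting it right (dividing by a power of 2).
    if diff >= 0:
        exp = a
        B = B / (2 ** diff)
    else:
        exp = b
        A = A / (2 ** -diff)
    # Segment 3: add the mantissas.
    mant = A + B
    # Segment 4: normalize so that 0.5 <= |mantissa| < 1.
    while abs(mant) >= 1:
        mant /= 2
        exp += 1
    while 0 < abs(mant) < 0.5:
        mant *= 2
        exp -= 1
    return mant, exp
```

For example, X = 0.5 × 2^3 (= 4) plus Y = 0.5 × 2^2 (= 2) yields 0.75 × 2^3 (= 6), with the second mantissa shifted right by one place in segment 2.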
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well. An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. One possible difficulty with such a scheme is that an instruction may cause a branch out of sequence. In that case the pipeline must be emptied, and all the instructions that have been read from memory after the branch instruction must be discarded.
Consider a computer with an instruction fetch unit and an instruction execution unit designed to provide a two-segment pipeline. The instruction fetch segment can be implemented by means of a first-in, first-out (FIFO) buffer. This is a type of unit that forms a queue rather than a stack. Whenever the execution unit is not using memory, the control increments the program counter and uses its address value to read consecutive instructions from memory. The instructions are inserted into the FIFO buffer so that they can be executed on a first-in, first-out basis. Thus an instruction stream can be placed in a queue, waiting for decoding and processing by the execution segment. This queuing mechanism provides an efficient way of reducing the average access time to memory for reading instructions. Whenever there is space in the FIFO buffer, the control unit initiates the next instruction fetch phase. The buffer acts as a queue from which the control then extracts the instructions for the execution unit.
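The two-segment scheme can be sketched with `collections.deque` standing in for the FIFO buffer; the instruction names and the two-entry buffer size are made up for illustration:

```python
from collections import deque

memory = ["LOAD", "ADD", "STORE", "JUMP"]   # hypothetical instruction stream
fifo = deque(maxlen=2)                      # two-entry FIFO fetch buffer
pc = 0                                      # program counter

def fetch():
    """Fetch segment: whenever there is space in the FIFO buffer,
    read the next instruction from memory and advance the PC."""
    global pc
    if len(fifo) < fifo.maxlen and pc < len(memory):
        fifo.append(memory[pc])
        pc += 1

def execute():
    """Execute segment: extract instructions first-in, first-out."""
    return fifo.popleft() if fifo else None

executed = []
for clock in range(6):      # both segments act on every clock
    fetch()
    instr = execute()
    if instr is not None:
        executed.append(instr)
```

Because the buffer is a queue and not a stack, instructions leave the execute segment in exactly the order the fetch segment read them from memory.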
Computers with complex instructions require other phases in addition to fetch and execute. Such computers process an instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
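The overlap of these six phases in an ideal pipeline can be sketched as a space-time table (an illustrative Python model, not any particular machine):

```python
# The six steps above, abbreviated one phase per segment.
PHASES = ["FI", "DE", "CA", "FO", "EX", "ST"]

def schedule(n_instructions):
    """Ideal-pipeline space-time table: instruction i (0-based)
    occupies phase s at clock i + s. Returns {clock: [(instr, phase)]}."""
    table = {}
    for i in range(n_instructions):
        for s, phase in enumerate(PHASES):
            table.setdefault(i + s, []).append((i + 1, phase))
    return table

# With k = 6 segments and n instructions, an ideal pipeline completes
# in k + n - 1 clocks instead of the k * n a sequential machine needs.
t = schedule(4)
total_clocks = max(t) + 1       # 6 + 4 - 1 = 9
```

At clock 1, for instance, instruction 1 is being decoded while instruction 2 is already being fetched, which is exactly the overlap the pipeline provides.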
RISC Pipeline
Instruction pipelining is often used to enhance performance. Let us consider this in the context of RISC architecture. In RISC machines most of the operations are register-to-register. Therefore, such instructions can be executed in two phases:
F: Instruction Fetch to get the instruction.
E: Instruction Execute on register operands and store the result in a register.
In general, memory access in RISC is performed through LOAD and STORE operations. For such instructions the following steps may be needed:
F: Instruction Fetch to get the instruction.
E: Effective address calculation for the desired memory operand.
D: Memory-to-register or register-to-memory data transfer through the bus.
Let us explain pipelining in RISC with an example program execution. Take the following program (R indicates a register):
LOAD RA          (Load from memory location A)
LOAD RB          (Load from memory location B)
ADD  RC, RA, RB  (RC = RA + RB)
SUB  RD, RA, RB  (RD = RA - RB)
MUL  RE, RC, RD  (RE = RC * RD)
STOR RE          (Store in memory location C)
Return to main.
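The program's intended effect can be checked with a plain Python model of the register file; the memory contents for locations A and B are assumed for illustration:

```python
# Hypothetical contents of memory locations A and B.
memory = {"A": 6, "B": 2}
reg = {}

reg["RA"] = memory["A"]            # LOAD RA        (RA <- M[A])
reg["RB"] = memory["B"]            # LOAD RB        (RB <- M[B])
reg["RC"] = reg["RA"] + reg["RB"]  # ADD  RC, RA, RB
reg["RD"] = reg["RA"] - reg["RB"]  # SUB  RD, RA, RB
reg["RE"] = reg["RC"] * reg["RD"]  # MUL  RE, RC, RD
memory["C"] = reg["RE"]            # STOR RE        (M[C] <- RE)
```

With A = 6 and B = 2, the program leaves RC = 8, RD = 4, and stores RE = 32 in location C. Note also the data dependencies: ADD and SUB both need RA and RB, and MUL needs both of their results, which is what makes this program interesting for a pipeline.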
Vector Processing
Vector processing is arithmetic or logical computation applied to vectors, whereas in scalar processing only one datum, or one pair of data, is processed at a time. Therefore, vector processing is faster than scalar processing. When scalar code is converted to vector form, this is called vectorization. A vector processor is a special coprocessor designed to handle vector computations. A vector is an ordered set of scalar data items of the same type. The scalar item can be a floating-point number, an integer, or a logical value. Vector instructions can be classified as below:
(1) Vector-Vector Instructions
In this type, vector operands are fetched from a vector register and the result is stored in another vector register. These instructions are denoted with the following function mappings:
F1 : V → V
F2 : V × V → V
(2) Vector-Scalar Instructions
In this type, a combination of a scalar and a vector is fetched and the result is stored in a vector register. These instructions are denoted with the following function mapping:
F3 : S × V → V,  where S is the scalar item
(3) Vector Reduction Instructions
When operations on vectors are reduced to scalar items as the result, these are vector reduction instructions. These instructions are denoted with the following function mappings:
F4 : V → S
F5 : V × V → S
(4) Vector-Memory Instructions
When vector operations involving memory M are performed, these are vector-memory instructions. These instructions are denoted with the following function mappings:
F6 : M → V  (vector load)
F7 : V → M  (vector store)
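The four classes of mappings can be illustrated in Python, with lists standing in for vector registers; the function names are illustrative, not a real vector ISA:

```python
# Vector-vector (F2 : V x V -> V): elementwise add of two vectors.
def vv_add(v1, v2):
    return [x + y for x, y in zip(v1, v2)]

# Vector-scalar (F3 : S x V -> V): scale every element by a scalar.
def sv_mul(s, v):
    return [s * x for x in v]

# Vector reduction (F5 : V x V -> S): dot product reduces two
# vectors to a single scalar result.
def vv_dot(v1, v2):
    return sum(x * y for x, y in zip(v1, v2))

# Vector-memory (F6 : M -> V): a vector load of `length` consecutive
# items starting at `base` from a list standing in for memory.
def vload(memory, base, length):
    return memory[base:base + length]
```

Each function consumes and produces exactly the operand types its mapping names: vectors in and a vector out, a scalar and a vector in and a vector out, vectors in and a scalar out, and memory in and a vector out.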
Array Processing
There is another method for vector operations. If we have an array of n processing elements (PEs), i.e., multiple ALUs for storing multiple operands of the vector, then an instruction, for example vector addition, is broadcast to all PEs so that they add all operands of the vector at the same time. That is, all PEs perform the computation in parallel, synchronized under one control unit. This organization of a synchronous array of PEs for vector operations is called an array processor. The organization is the same as in SIMD: an array processor handles one instruction stream and multiple data streams, as we have seen in the SIMD organization. Therefore, array processors are also called SIMD array computers. The organization of an array processor is shown in the figure below. The following components are organized in an array processor:
Figure: Organization of SIMD Array Processor
Control Unit (CU): All PEs are under the control of one control unit. The CU controls the intercommunication between the PEs. There is also a local memory of the CU, called the CU memory. The user programs are loaded into the CU memory. The vector instructions in the program are decoded by the CU and broadcast to the array of PEs. Instruction fetch and decoding are done by the CU only.
Processing Elements (PEs): Each processing element consists of an ALU, its registers, and a local memory for storage of distributed data. The PEs are interconnected via an interconnection network. All PEs receive instructions from the control unit, and the different component operands are fetched from their local memories. Thus, all PEs perform the same function synchronously in a lock-step fashion under the control of the CU. It is possible that not all PEs need participate in the execution of a vector instruction. Therefore, a masking scheme is required to control the status of each PE. A masking vector is used to control the status of all the PEs such that only enabled PEs participate in the execution while the others are disabled.
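The masking scheme can be sketched as follows, with plain Python lists standing in for the PE array; only PEs whose mask bit is 1 update their result:

```python
def masked_vector_add(a, b, mask):
    """The 'add' instruction is broadcast to all PEs; a PE whose
    mask bit is 0 is disabled and keeps its old value unchanged."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

a = [10, 20, 30, 40]
b = [1, 2, 3, 4]
mask = [1, 0, 1, 0]            # enable PEs 0 and 2 only
result = masked_vector_add(a, b, mask)
```

All four PEs receive the same broadcast instruction, but the masking vector ensures that only the enabled ones contribute to the result.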
Interconnection Network (IN): The IN performs data exchange among the PEs, data routing, and manipulation functions. This IN is under the control of the CU.
Host Computer: An array processor may be attached to a host computer through the control unit. The purpose of the host computer is to broadcast a sequence of vector instructions through the CU to the PEs. Thus, the host computer is a general-purpose machine that acts as a manager of the entire system.
Array processors are special-purpose computers which have been adopted for various scientific applications, matrix algebra, matrix eigenvalue calculations, and real-time scene analysis. A large-scale SIMD array processor was developed by NASA for earth resources satellite image processing. This computer was named the Massively Parallel Processor (MPP) because it contains 16,384 processors that work in parallel. The MPP provides real-time analysis of time-varying scenes. However, array processors are not commercially popular and are not commonly used: they are difficult to program compared to pipelined machines, and there are problems in vectorization. Array processing is used in radar, sonar, seismic exploration, anti-jamming, and wireless communications.
References:
J. P. Hayes, Computer Architecture and Organization, McGraw-Hill, 3rd Ed., 1998.
M. Morris Mano, Computer System Architecture, Pearson, 3rd Ed., 2004.
www.google.com
en.wikipedia.org
Assignments:
(1) Differentiate between an instruction pipeline and an arithmetic pipeline.
(2) Identify the factors that limit the speed of pipelining.
(3) What is the difference between scalar and vector processing?
A Gentle Advice:
Please go through your textbooks and reference books for detailed study! Thank you all.