Unit 4
Large problems can often be divided into smaller ones, which can then be solved at
the same time.
Serial Computation
Traditionally, software has been written for serial computation:
To be run on a single computer having a single Central Processing Unit (CPU)
What is Parallel Computing?
In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem.
Advantages of Parallel Computing
Types of Parallelism
▪ Parallelism in Hardware
▪ Parallelism in a Uniprocessor
– Pipelining
– Superscalar
▪ SIMD instructions, Vector processors, GPUs
▪ Multiprocessor
– Symmetric shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip-multiprocessors a.k.a. Multi-cores
▪ Multicomputers a.k.a. clusters
▪ Parallelism in Software
▪ Instruction level parallelism
▪ Thread or Task-level parallelism
▪ Data parallelism
▪ Bit level parallelism
Taxonomy of Parallel Computers
▪ According to instruction and data streams (Flynn):
– Single instruction single data (SISD): this is the standard uniprocessor
– Single instruction multiple data (SIMD): one instruction stream operates on many data items at once (vector and array processors, GPUs)
– Multiple instruction single data (MISD): multiple instruction streams on a single data stream; rarely used in practice
– Multiple instruction multiple data (MIMD): independent instruction streams operate on independent data (multiprocessors and multicomputers)
Taxonomy of Parallel Computers
▪ According to memory communication model
– Shared address or shared memory: it emphasizes control parallelism more than data parallelism. In the shared-memory model, multiple processes execute on different processors independently, but they share a common memory space. If any processor changes any memory location, the change is visible to the rest of the processors.
▪ Processes in different processors can use the same virtual address space
▪ Any processor can directly access memory in another processor node
▪ Communication is done through shared memory variables
▪ Explicit synchronization with locks and critical sections
▪ Arguably easier to program??
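Below is a minimal sketch of this model using POSIX threads (an assumed choice of API; the shared counter and the mutex are hypothetical names used only for illustration). Both threads see the same counter variable, and the lock marks the critical section so no update is lost.

/* Minimal sketch of the shared-memory model with POSIX threads. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                         /* shared memory variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);               /* enter critical section */
        counter++;                               /* update shared variable */
        pthread_mutex_unlock(&lock);             /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);          /* prints 200000 */
    return 0;
}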
Taxonomy of Parallel Computers
– Physically centralized memory, uniform memory access (UMA)
▪ All memory is located at the same distance from all processors
▪ Also called symmetric multiprocessors (SMP)
▪ Memory bandwidth is fixed and must be shared by all processors, so it does not
scale to a large number of processors
▪ Used in CMPs today (single-socket ones)
(Figure: UMA organization, with all processors connected through an interconnection to a single main memory.)
Taxonomy of Parallel Computers
– Physically distributed memory, non-uniform memory access (NUMA)
▪ A portion of memory is allocated with each processor (node)
▪ Accessing local memory is much faster than remote memory
▪ If most accesses are to local memory, then the overall memory bandwidth increases
linearly with the number of processors
▪ Used in multi-socket CMPs, e.g., Intel Nehalem
(Figure: NUMA organization of a two-socket system such as Intel Nehalem: each socket has its own DDR3 memory channels and I/O, and the sockets communicate over an interconnection.)
Types of Parallelism in Applications
▪ Instruction-level parallelism (ILP)
– Multiple instructions from the same instruction stream can be executed
concurrently
– Generated and managed by hardware (superscalar) or by compiler (VLIW)
– Limited in practice by data and control dependences
– Ex. Pipelining, superscalar architecture, Very Long Instruction Word
(VLIW), etc.
Example
A = B + C // Instruction I1
D = E - F // Instruction I2
G = A * D // Instruction I3
Here I1 and I2 are independent and can execute concurrently, while I3 depends on the results of both and must wait for them to complete.
Types of Parallelism in Applications
▪ Data-level parallelism (DLP)
– Instructions from a single stream operate concurrently on several data
– Limited by non-regular data manipulation patterns and by memory
bandwidth.
– Ex. SIMD: Vector processing, array processing
– Let’s take an example: summing the contents of an array of size N.
For a single-core system, one thread would simply sum the elements
[0] . . . [N − 1]. For a dual-core system, however, thread A, running
on core 0, could sum the elements [0] . . . [N/2 − 1] while thread
B, running on core 1, sums the elements [N/2] . . . [N − 1]. The
two threads would then run in parallel on separate computing
cores, as sketched below.
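A minimal sketch of this dual-core sum, assuming POSIX threads; the array size N, the range struct, and the partial_sum helper are illustrative names chosen for this example, not part of any standard API.

/* Sketch: thread A sums the first half of the array, thread B the second. */
#include <pthread.h>
#include <stdio.h>

#define N 1000

static int data[N];

struct range { int lo, hi; long sum; };           /* [lo, hi) and its partial sum */

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    r->sum = 0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) data[i] = 1;      /* sample data */

    struct range a = { 0, N / 2, 0 };             /* thread A: elements [0 .. N/2-1] */
    struct range b = { N / 2, N, 0 };             /* thread B: elements [N/2 .. N-1] */
    pthread_t ta, tb;
    pthread_create(&ta, NULL, partial_sum, &a);
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("total = %ld\n", a.sum + b.sum);       /* combine the partial sums */
    return 0;
}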
Types of Parallelism in Applications
▪ Thread-level or task-level parallelism (TLP)
– Multiple threads or instruction sequences from the same application
can be executed concurrently
– Generated by compiler/user and managed by compiler and hardware
– Limited in practice by communication/synchronization overheads
and by algorithm characteristics.
– For example, task parallelism might involve two threads, each
performing a different statistical operation on the same array of
elements. The threads again operate in parallel on separate
computing cores, but each performs a different operation, as sketched below.
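A minimal sketch, again assuming POSIX threads; choosing the sum and the maximum as the two statistical operations is purely an assumption for illustration.

/* Sketch: two threads run different operations on the same array at the same time. */
#include <pthread.h>
#include <stdio.h>

#define N 1000

static int data[N];
static long total;                                 /* result of task 1 */
static int maximum;                                /* result of task 2 */

static void *compute_sum(void *arg)
{
    long s = 0;
    for (int i = 0; i < N; i++) s += data[i];
    total = s;
    return NULL;
}

static void *compute_max(void *arg)
{
    int m = data[0];
    for (int i = 1; i < N; i++) if (data[i] > m) m = data[i];
    maximum = m;
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) data[i] = i % 7;   /* sample data */

    pthread_t t1, t2;
    pthread_create(&t1, NULL, compute_sum, NULL);  /* task 1: sum     */
    pthread_create(&t2, NULL, compute_max, NULL);  /* task 2: maximum */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %ld, max = %d\n", total, maximum);
    return 0;
}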
Types of Parallelism in Applications
▪ Bit-level parallelism
– Bit-level parallelism is a form of parallel computing based on increasing
the processor word size, enabled by very-large-scale integration (VLSI)
technology. Historically, much of the improvement in computer design
came from increasing bit-level parallelism.
– For example, consider a case where an 8-bit processor must add two 16-bit
integers. The processor must first add the 8 lower-order bits of each
integer, then add the 8 higher-order bits together with the carry, requiring
two instructions to complete a single operation. A 16-bit processor can
complete the operation with a single instruction, as the sketch below illustrates.
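A small C sketch of this idea; add16_on_8bit is a hypothetical helper that mimics the two-step, byte-at-a-time addition an 8-bit processor would perform, while a 16-bit-wide addition handles the same operation in one step.

#include <stdint.h>
#include <stdio.h>

/* Mimics an 8-bit processor: add the low bytes, propagate the carry,
 * then add the high bytes - two steps for one logical 16-bit addition. */
static uint16_t add16_on_8bit(uint16_t x, uint16_t y)
{
    uint8_t lo    = (uint8_t)(x & 0xFF) + (uint8_t)(y & 0xFF);     /* step 1: low bytes   */
    uint8_t carry = lo < (uint8_t)(x & 0xFF);                      /* carry out of step 1 */
    uint8_t hi    = (uint8_t)(x >> 8) + (uint8_t)(y >> 8) + carry; /* step 2: high bytes  */
    return ((uint16_t)hi << 8) | lo;
}

int main(void)
{
    uint16_t x = 0x1234, y = 0x0FCD;
    uint16_t narrow = add16_on_8bit(x, y);    /* two 8-bit additions  */
    uint16_t wide   = (uint16_t)(x + y);      /* one 16-bit addition  */
    printf("0x%04X 0x%04X\n", narrow, wide);  /* both print 0x2201    */
    return 0;
}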