DS1822 - Parallel Computing - Unit 1
1. Importance of Parallelism:
Serial Computing:
Traditionally, software has been written for serial computation:
A problem is broken into a discrete series of instructions.
Instructions are executed sequentially one after another.
Executed on a single processor.
Only one instruction may execute at any moment in time.
Parallel Computing
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve
a computational problem:
A problem is broken into discrete parts that can be solved concurrently.
Each part is further broken down to a series of instructions.
Instructions from each part execute simultaneously on different processors.
An overall control/coordination mechanism is employed.
Parallel Computers:
Virtually all stand-alone computers today are parallel from a hardware perspective:
Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point,
graphics processing (GPU), integer, etc.)
Multiple execution units/cores
Multiple hardware threads
Networks connect multiple stand-alone computers (nodes) to make larger parallel computer
clusters.
In the natural world, many complex, interrelated events are happening at the same time,
yet within a temporal sequence.
Compared to serial computing, parallel computing is much better suited for modeling,
simulating and understanding complex, real-world phenomena.
Parallelism:
Parallelism is used to build ever more powerful computers.
Rather than building ever-faster, more complex, monolithic processors, the
industry has decided to put multiple, relatively simple, complete processors
on a single chip.
Such integrated circuits are called multicore processors, and core has
become synonymous with central processing unit, or CPU.
In this setting a conventional processor with one CPU is often called a
single-core system.
Basic idea of parallelism is partitioning the work to be done among
the cores.
Types of Parallelism:
There are two types of parallelism:
task-parallelism
data-parallelism.
In task-parallelism, we partition the various tasks carried out in solving the
problem among the cores.
In data-parallelism, we partition the data used in solving the problem among
the cores, and each core carries out more or less similar operations on its
part of the data (a short sketch follows below).
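As a concrete illustration of data-parallelism, here is a minimal sketch (not taken from the text) using OpenMP, one of the APIs introduced later in this unit. Each thread performs the same summing operation on its own block of the array, and the partial sums are combined.

    #include <stdio.h>

    /* Data-parallel sketch: the loop iterations (and hence the elements of a)
       are partitioned among the threads, and each thread performs the same
       operation on its own part of the data.
       Compile with an OpenMP-capable C compiler, e.g. gcc -fopenmp. */
    int main(void) {
        double a[1000], sum = 0.0;

        for (int i = 0; i < 1000; i++)
            a[i] = 0.001 * i;                /* some hypothetical data */

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += a[i];                     /* each thread handles a block of i values */

        printf("sum = %f\n", sum);
        return 0;
    }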
The aim is to learn the basics of programming parallel computers using the C
language and four different APIs or application program interfaces: the
Message-Passing Interface or MPI, POSIX threads or Pthreads, OpenMP, and
CUDA.
MPI and Pthreads are libraries of type definitions, functions, and macros that can be used in C
programs.
OpenMP consists of a library and some modifications to the C compiler.
CUDA consists of a library and modifications to the C++ compiler.
Concurrent, parallel, distributed computing:
1. In concurrent computing, a program is one in which multiple tasks can be in progress at any
instant.
2. In parallel computing, a program is one in which multiple tasks cooperate closely to solve a
problem.
3. In distributed computing, a program may need to cooperate with other programs to solve a
problem.
The operating system, or OS, is a major piece of software whose purpose is to manage hardware
and software resources on a computer.
It determines which programs can run and when they can run. It also controls the allocation of
memory to running programs and access to peripheral devices, such as hard disks and network
interface cards.
When a user runs a program, the operating system creates a process—an instance of a computer
program that is being executed.
A process consists of several entities:
• The executable machine language program.
• A block of memory, which will include the executable code, a call stack that keeps track
of functions, a heap that can be used for memory explicitly allocated by the user
program, and some other memory locations.
• Descriptors of resources that the operating system has allocated to the process, for
example, file descriptors.
• Security information—for example, information specifying which hardware and software
resources the process can access.
• Information about the state of the process, such as whether the process is ready to run or
is waiting on some resource, the content of the registers, and information about the
process’s memory.
Multitasking:
Modern operating systems are multitasking: even on a single-core system, the OS gives the
illusion that multiple programs are running simultaneously by rapidly switching among them,
giving each a small slice of processor time. When a running program needs to wait for a
resource, it blocks, and the OS can run another program.
Threading:
Threading provides a mechanism for programmers to divide their programs into more or less
independent tasks, with the property that when one thread is blocked, another thread can be run.
Furthermore, in most systems it is possible to switch between threads much faster than it’s
possible to switch between processes.
This is because threads are “lighter weight” than processes.
When a thread is started, it forks off the process; when a thread terminates, it joins the process.
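A minimal Pthreads sketch of this fork/join pattern (illustrative only; the thread body is a hypothetical placeholder task):

    #include <pthread.h>
    #include <stdio.h>

    /* Function executed by each thread; the argument identifies the thread. */
    void *do_work(void *arg) {
        long my_rank = (long) arg;
        printf("Thread %ld doing its share of the work\n", my_rank);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];

        /* "Fork": start four threads off of the main process. */
        for (long t = 0; t < 4; t++)
            pthread_create(&threads[t], NULL, do_work, (void *) t);

        /* "Join": wait for each thread to terminate and rejoin the process. */
        for (long t = 0; t < 4; t++)
            pthread_join(threads[t], NULL);

        return 0;
    }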
Caching:
Caching is one of the most widely used methods of addressing the von Neumann bottleneck,
i.e., the fact that the interconnect between the CPU and main memory limits how quickly
instructions and data can be delivered to the CPU.
Cache is a collection of memory locations that can be accessed in less time than some other
memory locations.
CPU cache is a collection of memory locations that the CPU can access more quickly than it
can access main memory.
A CPU cache can either be located on the same chip as the CPU or it can be located on a
separate chip that can be accessed much faster than an ordinary memory chip.
Once we have a cache, an obvious problem is deciding which data and instructions should
be stored in the cache.
The universally used principle is based on the idea that programs tend to use data and
instructions that are physically close to recently used data and instructions.
The principle that an access of one location is followed by an access of a nearby location is
often called locality.
After accessing one memory location (instruction or data), a program will typically access a
nearby location (spatial locality) or the same location again (temporal locality) in the near future.
A memory access will effectively operate on blocks of data and instructions instead of
individual instructions and individual data items. These blocks are called cache
blocks or cache lines.
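A short C sketch of why locality matters (an assumed illustration, not from the text): traversing a matrix row by row touches consecutive memory locations, so most accesses find their data already loaded as part of a cache line, while traversing it column by column jumps through memory and wastes most of each line.

    #define N 1000
    double A[N][N];    /* C stores arrays row-major: A[i][0..N-1] are adjacent in memory */

    /* Good spatial locality: consecutive addresses, one cache line serves many accesses. */
    double sum_by_rows(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += A[i][j];
        return sum;
    }

    /* Poor spatial locality: each access is N doubles away from the previous one. */
    double sum_by_columns(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += A[i][j];
        return sum;
    }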
The cache is usually divided into levels: the first level (L1) is the smallest and the fastest, and
higher levels (L2, L3, . . . ) are larger and slower.
When the CPU needs to access an instruction or data, it works its way down the cache
hierarchy: First it checks the level 1 cache, then the level 2, and so on. Finally, if the
information needed isn’t in any of the caches, it accesses main memory.
When a cache is checked for information and the information is available, it’s called
a cache hit or just a hit. If the information isn’t available, it’s called a cache miss or a miss.
In write-through caches, the line is written to main memory when it is written to the cache.
In write-back caches, the data isn’t written immediately. Rather, the updated data in the
cache is marked dirty, and when the cache line is replaced by a new cache line from memory,
the dirty line is written to memory.
Virtual memory:
Caches make it possible for the CPU to quickly access instructions and data that are in main
memory.
However, if we run a very large program or a program that accesses very large data sets, all of
the instructions and data may not fit into main memory.
Virtual memory was developed so that main memory can function as a cache for secondary
storage.
It exploits the principle of spatial and temporal locality by keeping in main memory only the
active parts of the many running programs; those parts that are idle are kept in a block of
secondary storage called swap space.
Like CPU caches, virtual memory operates on blocks of data and instructions. These blocks are
commonly called pages, and since secondary storage access can be hundreds of thousands of
times slower than main memory access, pages are relatively large; most systems have a fixed
page size that currently ranges from 4 to 16 kilobytes.
When the program is run, a table is created that maps the virtual page numbers to physical
addresses.
When the program is run and it refers to a virtual address, this page table is used to translate
the virtual address into a physical address.
A drawback to the use of a page table is that it can double the time needed to access a location in
main memory.
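A small sketch of the translation arithmetic (assuming, for illustration, a 4-kilobyte page size; the address value below is hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                       /* 4 KB pages: 2^12 bytes */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    int main(void) {
        uint64_t vaddr  = 0x7ffd1234;              /* hypothetical virtual address */
        uint64_t vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number: the index
                                                      looked up in the page table */
        uint64_t offset = vaddr & (PAGE_SIZE - 1); /* byte offset within the page */

        /* A real system would map vpn to a physical frame number via the page
           table (or TLB) and combine that frame number with the offset. */
        printf("page number = %llu, offset = %llu\n",
               (unsigned long long) vpn, (unsigned long long) offset);
        return 0;
    }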
Translation-lookaside buffer:
Although multiple programs can use main memory at more or less the same time, using a
page table has the potential to significantly increase each program’s overall run-time.
In order to address this issue, processors have a special address translation cache called
a translation-lookaside buffer, or TLB. It caches a small number of entries (typically 16–
512) from the page table in very fast memory.
When we look for an address and the virtual page number is in the TLB, it’s called a
TLB hit. If it’s not in the TLB, it’s called a miss.
If we attempt to access a page that’s not in memory, that is, the page table doesn’t have a
valid physical address for the page and the page is only stored on disk, then the attempted
access is called a page fault.
4. Instruction-level parallelism:
Instruction-level parallelism, or ILP, attempts to improve processor performance by having
multiple processor components or functional units simultaneously executing instructions.
There are two main approaches to ILP:
o Pipelining, in which functional units are arranged in stages,
o Multiple issue, in which multiple instructions can be simultaneously initiated. Both
approaches are used in virtually all modern CPUs.
Pipelining
The principle of pipelining is similar to a factory assembly line: while one team is bolting a car’s
engine to the chassis, another team can connect the transmission to the engine and the driveshaft
of a car that’s already been processed by the first team, and a third team can bolt the body to the
chassis in a car that’s been processed by the first two teams.
As an example involving computation, suppose we want to add the floating point numbers
9.87 × 10^4 and 6.54 × 10^3. Then we can use the following steps:
Fetch the operands.
Compare the exponents.
Shift one operand (6.54 × 10^3 becomes 0.654 × 10^4).
Add (9.87 + 0.654 = 10.524, giving 10.524 × 10^4).
Normalize the result (10.524 × 10^4 becomes 1.0524 × 10^5).
Round the result.
Store the result.
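To see the payoff, suppose (hypothetically) that the addition is broken into 7 such stages and each stage takes 1 nanosecond. Adding 1000 pairs of floats one at a time would take about 7 × 1000 = 7000 nanoseconds. With a pipelined adder, the first result is ready after 7 nanoseconds, and one new result is produced every nanosecond after that, so all 1000 additions finish in roughly 7 + 999 = 1006 nanoseconds.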
Multiple issue:
Pipelines improve performance by taking individual pieces of hardware or functional units and
connecting them in sequence.
Multiple issue processors replicate functional units and try to simultaneously execute different
instructions in a program.
For example, if we have two complete floating point adders, we can approximately halve the
time it takes to execute the loop
While the first adder is computing z[0], the second can compute z[1]; while the first is
computing z[2], the second can compute z[3]; and so on.
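The loop being discussed is not reproduced above; presumably it has the following form (a sketch, with x, y, and z assumed to be arrays of n floats):

    /* Hypothetical loop for the multiple issue example: with two floating point
       adders, the processor can work on z[0] and z[1] (then z[2] and z[3], and
       so on) at the same time. */
    void add_arrays(int n, const float x[], const float y[], float z[]) {
        for (int i = 0; i < n; i++)
            z[i] = x[i] + y[i];
    }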
If the functional units are scheduled at compile time, the multiple issue system is said to
use static multiple issue.
If they’re scheduled at run-time, the system is said to use dynamic multiple issue. A processor
that supports dynamic multiple issue is sometimes said to be superscalar.
In order to make use of multiple issue, the system must find instructions that can be executed
simultaneously.
One of the most important techniques is speculation.
In speculation, the compiler or the processor makes a guess about an instruction, and then
executes the instruction on the basis of the guess.
As a simple example, in the following code, the system might predict that the outcome of z = x +
y will give z a positive value, and, as a consequence, it will assign w = x.
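The code being referred to is not shown above; it is presumably something like the following sketch:

    /* Hypothetical code for the speculation example. The system may guess that
       z will be positive and begin executing w = x before the comparison has
       completed; if the guess turns out to be wrong, the speculative result is
       discarded and w = y is executed instead. */
    double speculate(double x, double y) {
        double z, w;
        z = x + y;
        if (z > 0)
            w = x;
        else
            w = y;
        return w;
    }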
5. Thread-level parallelism:
Thread-level parallelism, or TLP, attempts to provide parallelism through the simultaneous
execution of different threads, so it provides a coarser-grained parallelism than ILP; that is, the
program units that are being simultaneously executed (threads) are larger or coarser than the
finer-grained units (individual instructions).
Hardware multithreading provides a means for systems to continue doing useful work when
the task being currently executed has stalled, for example, if the current task has to wait for data
to be loaded from memory. Instead of looking for parallelism in the currently executing thread, it
may make sense to simply run another thread.
In fine-grained multithreading, the processor switches between threads after each instruction,
skipping threads that are stalled. While this approach has the potential to avoid wasted machine
time due to stalls, it has the drawback that a thread that’s ready to execute a long sequence of
instructions may have to wait to execute every instruction.
Coarse-grained multithreading attempts to avoid this problem by only switching threads that are
stalled waiting for a time-consuming operation to complete (e.g., a load from main memory).
This has the virtue that switching threads doesn’t need to be nearly instantaneous. However, the
processor can be idled on shorter stalls, and thread switching will also cause delays.
Simultaneous multithreading, or SMT, is a variation on fine-grained multithreading. It
attempts to exploit superscalar processors by allowing multiple threads to make use of the
multiple functional units. If we designate "preferred" threads, that is, threads that have many
instructions ready to execute, we can somewhat reduce the problem of thread slowdown.
6. Flynn's Classification:
There are a number of different ways to classify parallel computers. Examples are available in
the references.
One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. It
classifies parallel computers along the two independent dimensions of instruction stream and
data stream, each of which can be single or multiple, giving four classes: SISD, SIMD, MISD,
and MIMD.
8. Cache Coherence:
CPU caches are managed by system hardware: programmers don’t have direct control over them.
This has several important consequences for shared-memory systems.
To understand these issues, suppose we have a shared-memory system with two cores, each of
which has its own private data cache.
As long as the two cores only read shared data, there is no problem.
For example, suppose that x is a shared variable that has been initialized to 2, y0 is private and
owned by core 0, and y1 and z1 are private and owned by core 1.
Now suppose the following statements are executed at the indicated times:
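The statements themselves are not reproduced here; a plausible reconstruction, consistent with the values discussed next, is:

    Time    Core 0                           Core 1
    0       y0 = x;                          y1 = 3*x;
    1       x = 7;                           statement(s) not involving x
    2       statement(s) not involving x     z1 = 4*x;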
Then the memory location for y0 will eventually get the value 2, and the memory location
for y1 will eventually get the value 6.
However, it’s not so clear what value z1 will get.
It might at first appear that since core 0 updates x to 7 before the assignment to z1, z1 will get
the value 4 × 7 = 28.
However, at time 0, x is in the cache of core 1.
So unless for some reason x is evicted from core 1's cache and then reloaded into core 1's cache
with the updated value, it actually appears that the original value x = 2 may be used, and z1 will
get the value 4 × 2 = 8.
Now suppose we want to repeatedly update the elements of an array y of m doubles, dividing the
work between the two cores of our shared-memory system. Suppose also that m = 8, doubles are
eight bytes, cache lines are 64 bytes, and y[0] is stored at the beginning of a cache line.
A cache line can store eight doubles, so y takes exactly one full cache line.
What happens when core 0 and core 1 simultaneously execute their codes?
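The per-core code is not shown above; a plausible sketch, assuming core 0 is assigned the first half of y and core 1 the second half, and f(i, j) stands for some computation whose result is accumulated into y[i], is:

    /* Hypothetical computation accumulated into y[i]. */
    double f(int i, int j) { return 0.5 * i + 0.25 * j; }

    /* Core 0 updates y[0], ..., y[3]. */
    void core0_work(double y[8], int n) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < n; j++)
                y[i] += f(i, j);
    }

    /* Core 1 updates y[4], ..., y[7]. */
    void core1_work(double y[8], int n) {
        for (int i = 4; i < 8; i++)
            for (int j = 0; j < n; j++)
                y[i] += f(i, j);
    }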
Since all of y is stored in a single cache line, each time one of the cores executes the
statement y[i] += f(i,j), the line will be invalidated, and the next time the other core tries
to execute this statement it will have to fetch the updated line from memory!
So if n is large, we would expect that a large percentage of the assignments y[i] += f(i,j) will
access main memory—in spite of the fact that core 0 and core 1 never access each others’
elements of y.
This is called false sharing, because the system is behaving as if the elements of y were being
shared by the cores.
9.Parallel Software:
Performance
The main purpose in writing parallel programs is usually increased performance. How, then,
can we evaluate the performance of our programs?
1. Speedup and efficiency:
Usually the best we can hope to do is to equally divide the work among the cores, while at the
same time introducing no additional work for the cores.
If we succeed in doing this, and we run our program with p cores, one thread or process on each
core, then our parallel program will run p times faster than the serial program.
If we call the serial run-time Tserial and our parallel run-time Tparallel, then the best we can hope for
is Tparallel = Tserial/p.
When this happens, we say that our parallel program has linear speedup.
If we define the speedup of a parallel program to be S = Tserial / Tparallel, then linear speedup
corresponds to S = p, which is unusual in practice.
Furthermore, as p increases, we expect S to become a smaller and smaller fraction of the ideal,
linear speedup p.
Another way of saying this is that the ratio S/p, which is called the efficiency E of the parallel
program, will probably get smaller and smaller as p increases.
Parallel programs usually incur overhead from coordinating the processes/threads; if Toverhead
denotes this parallel overhead, it's often the case that Tparallel = Tserial/p + Toverhead.
As the problem size is increased, Toverhead often grows more slowly than Tserial.
When this is the case the speedup and the efficiency will increase.
A larger problem gives the processes/threads more work to do, so the relative amount of time
spent coordinating their work should be smaller.
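As a hypothetical worked example (the numbers are illustrative, not from the text): if Tserial = 24 seconds and the parallel program runs in Tparallel = 4 seconds on p = 8 cores, then the speedup is S = 24/4 = 6 and the efficiency is E = S/p = 6/8 = 0.75, i.e. on average each core spends 75% of its time doing useful work.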
2. Amdahl’s law
Back in the 1960s, Gene Amdahl made an observation [2] that’s become known as Amdahl’s
law.
It says, roughly, that unless virtually all of a serial program is parallelized, the possible speedup
is going to be very limited—regardless of the number of cores available.
Suppose, for example, that we are able to parallelize 90% of a serial program and that the
parallelization is perfect. If the serial run-time is Tserial = 20 seconds, then the run-time of the
parallelized part will be 0.9 × Tserial/p = 18/p and the run-time of the "unparallelized" part will
be 0.1 × Tserial = 2. The overall parallel run-time will be
Tparallel = 0.9 × Tserial/p + 0.1 × Tserial = 18/p + 2.
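Completing the arithmetic (this last step is implied rather than shown above): the speedup is S = Tserial/Tparallel = 20/(18/p + 2). As p grows, 18/p shrinks toward 0, so S can never exceed 20/2 = 10, no matter how many cores are available. This is exactly Amdahl's point: the 10% that remains serial caps the speedup at 10.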
3. Scalability:
Roughly speaking, a parallel program is scalable if its efficiency can be kept constant as the
number of processes/threads increases, provided the problem size is also allowed to increase. If
the efficiency stays constant for a fixed problem size, the program is said to be strongly scalable;
if it stays constant only when the problem size grows at the same rate as the number of
processes/threads, the program is weakly scalable.
Types of GPUs:
Discrete GPUs
Discrete GPUs, or dGPUs, are graphics processors that are separate from a device's CPU (the
chip where information is taken in and processed, allowing a computer to function).
Discrete GPUs are typically used in advanced applications with special requirements, such as
editing, content creation or high-end gaming.
They are distinct chips mounted on their own circuit boards and attached to the system through
a PCI Express slot.
One example of a discrete GPU product line is the Intel Arc brand, built for the PC gaming
market.
Integrated GPUs
An integrated GPU, or iGPU, is built into a computer or device's infrastructure and typically
slotted in next to the CPU.
Designed in the 2010s by Intel, integrated GPUs became more popular as manufacturers such as
MSI, ASUS and Nvidia noticed the power of combining GPUs with CPUs rather than requiring
users to add GPUs by a PCI express slot themselves.
They remain a popular choice for laptop users, gamers and others who are running compute-
intensive programs on their PCs.
Virtual GPUs
Virtual GPUs, or vGPUs, have the same capabilities as discrete or integrated GPUs but without
the hardware.
They are software-based versions of GPUs built for cloud instances and can be used to run the
same workloads.
Also, because they have no hardware, they are simpler and cheaper to maintain than their
physical counterparts.
12. GPGPUs:
General-Purpose computing on Graphics Processing Units (GPGPUs) is a fascinating area of
parallel computing.
Essentially, GPGPUs leverage the massive parallel processing power of GPUs, which were
originally designed for rendering graphics, to perform a wide range of computational tasks
traditionally handled by CPUs.
Here are some key points about GPGPUs in parallel computing:
Architecture and Components
Multiple Cores: GPUs have hundreds or thousands of smaller, efficient cores, whereas CPUs
have a few more powerful cores designed for sequential processing.
Memory Hierarchy: Includes registers, shared memory, and global memory, allowing efficient
data handling and access.
Programming Models
CUDA (Compute Unified Device Architecture): Developed by NVIDIA, it provides a C-like
language and runtime system for writing parallel code.
OpenCL (Open Computing Language): An open standard that supports heterogeneous
computing across different platforms and devices.
Applications
Scientific Research: Accelerates complex simulations in fields like physics, biology, and
environmental science.
Machine Learning and AI: Essential for training and running AI models due to their ability to
handle large-scale parallel processing.
Medical Imaging: Processes medical images like MRI and CT scans for faster diagnostics.
Financial Modeling: Used for risk analysis, algorithmic trading, and real-time data processing.
Benefits
Enhanced Computational Speed: Parallel processing capabilities offer significant speed
advantages over traditional CPUs.
Energy Efficiency: More energy-efficient for parallel tasks, providing a greener alternative for
high-performance computing.