
DS1822 - PARALLEL COMPUTING

UNIT I PARALLEL COMPUTING BASICS

Importance of Parallelism – Processes, Tasks and Threads – Modifications to von Neumann model –
ILP – TLP – Parallel Hardware – Flynn's Classification – Shared Memory and Distributed Memory
Architectures – Cache Coherence – Parallel Software – Performance – Speedup and Scalability –
Massive Parallelism – GPUs – GPGPUs.

1. Importance of Parallelism:
Serial Computing:
Traditionally, software has been written for serial computation:
 A problem is broken into a discrete series of instructions.
 Instructions are executed sequentially one after another.
 Executed on a single processor.
 Only one instruction may execute at any moment in time.

Parallel Computing
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve
a computational problem:
 A problem is broken into discrete parts that can be solved concurrently.
 Each part is further broken down to a series of instructions.
 Instructions from each part execute simultaneously on different processors.
 An overall control/coordination mechanism is employed.
Parallel Computers:

 Virtually all stand-alone computers today are parallel from a hardware perspective:
 Multiple functional units (L1 cache, L2 cache, branch, prefetch, decode, floating-point,
graphics processing (GPU), integer, etc.)
 Multiple execution units/cores
 Multiple hardware threads
 Networks connect multiple stand-alone computers (nodes) to make larger parallel computer
clusters.

Why Use Parallel Computing?

 In the natural world, many complex, interrelated events are happening at the same time,
yet within a temporal sequence.
 Compared to serial computing, parallel computing is much better suited for modeling,
simulating and understanding complex, real-world phenomena.

Parallelism:
 Parallelism is used to build ever more powerful computers.
 Rather than building ever-faster, more complex, monolithic processors, the
industry has decided to put multiple, relatively simple, complete processors
on a single chip.
 Such integrated circuits are called multicore processors, and core has
become synonymous with central processing unit, or CPU.
 In this setting a conventional processor with one CPU is often called a
single-core system.
 Basic idea of parallelism is partitioning the work to be done among
the cores.
Types of Parallelism:
 There are two types of parallelism:
 task-parallelism
 data-parallelism
 In task-parallelism, we partition the various tasks carried out in solving the problem among the cores.
 In data-parallelism, we partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data (see the sketch after this list).
 The aim is to learn the basics of programming parallel computers using the C
language and four different APIs or application program interfaces: the
Message-Passing Interface or MPI, POSIX threads or Pthreads, OpenMP, and
CUDA.
 MPI and Pthreads are libraries of type definitions, functions, and macros that can be used in C
programs.
 OpenMP consists of a library and some modifications to the C compiler.
 CUDA consists of a library and modifications to the C++ compiler.
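The sketch below is a minimal, hypothetical illustration of data-parallelism in C (it is not code from MPI, Pthreads, OpenMP, or CUDA): an array sum is partitioned into blocks of n/core_count elements, and the same operation is applied to each block. The loop over my_rank stands in for the work each core would do; in a real parallel program each block would be handled by a separate thread or process.

    /* Hypothetical sketch of data-parallelism: partition an array sum among
       core_count "cores".  Each core's share is computed here in a serial
       loop just to show the partitioning of the data.                      */
    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], total = 0.0;
        int core_count = 4;                      /* assumed number of cores */

        for (int i = 0; i < N; i++) a[i] = 1.0;  /* assumed input data */

        for (int my_rank = 0; my_rank < core_count; my_rank++) {
            int block    = N / core_count;
            int my_first = my_rank * block;
            int my_last  = (my_rank == core_count - 1) ? N : my_first + block;
            double my_sum = 0.0;
            for (int i = my_first; i < my_last; i++)
                my_sum += a[i];                  /* same operation on each core's data */
            total += my_sum;
        }
        printf("total = %f\n", total);
        return 0;
    }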
Concurrent, parallel, distributed computing:
1 In concurrent computing, a program is one in which multiple tasks can be in progress at any
instant.
2 In parallel computing, a program is one in which multiple tasks cooperate closely to solve a
problem.
3 In distributed computing, a program may need to cooperate with other programs to solve a
problem.

2. Processes, Tasks and Threads:

 An operating system, or OS, is a major piece of software whose purpose is to manage hardware and software resources on a computer.
 It determines which programs can run and when they can run. It also controls the allocation of
memory to running programs and access to peripheral devices, such as hard disks and network
interface cards.
 When a user runs a program, the operating system creates a process—an instance of a computer
program that is being executed.
 A process consists of several entities:
• The executable machine language program.
• A block of memory, which will include the executable code, a call stack that keeps track
of functions, a heap that can be used for memory explicitly allocated by the user
program, and some other memory locations.
• Descriptors of resources that the operating system has allocated to the process, for
example, file descriptors.
• Security information—for example, information specifying which hardware and software
resources the process can access.
• Information about the state of the process, such as whether the process is ready to run or
is waiting on some resource, the content of the registers, and information about the
process’s memory.

Multitasking:

 Modern operating systems are multitasking.


 This means that the operating system provides support for the apparent simultaneous execution
of multiple programs.
 This is possible even on a system with a single core, since each process runs for a small interval
of time (typically a few milliseconds), often called a time slice.

Threading:
 Threading provides a mechanism for programmers to divide their programs into more or less
independent tasks, with the property that when one thread is blocked, another thread can be run.
 Furthermore, in most systems it is possible to switch between threads much faster than it’s
possible to switch between processes.
 This is because threads are “lighter weight” than processes.
 When a thread is started, it forks off the process; when a thread terminates, it joins the process.
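A minimal, hedged Pthreads sketch (illustrative, not code from the original notes) showing a thread being forked from a process and later joined:

    /* Minimal Pthreads sketch: the process forks a thread that runs hello(),
       then joins (waits for) the thread before continuing.
       Compile with something like:  gcc prog.c -lpthread                  */
    #include <pthread.h>
    #include <stdio.h>

    void *hello(void *arg) {
        printf("Hello from the forked thread\n");
        return NULL;
    }

    int main(void) {
        pthread_t thread;
        pthread_create(&thread, NULL, hello, NULL);  /* fork the thread   */
        pthread_join(thread, NULL);                  /* join: wait for it */
        printf("Thread has joined the process\n");
        return 0;
    }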

3. Parallel Hardware & Parallel Software

Modifications to Von-Neumann model:


Von Neumann Architecture:
 Describes a computer system as a CPU (or core) connected to the main memory through an
interconnection network
 Executes only one instruction at a time, with each instruction operating on only a few pieces of
data.
 Main memory has a set of addresses where you can store both instructions and data.
 CPU is divided into a control unit and an ALU.
 Control unit decides which instructions in a program need to be executed.
 ALU executes the instructions selected by control unit.
 CPU stores temporary data and some other information in registers.
 A special register, the program counter (PC), is present in the control unit and holds the address of the next instruction to be executed.
 Interconnect/bus used to transfer instructions and data between CPU and memory.
 Data/instructions fetched/read from memory to CPU.
 Data/results stored/written from CPU to memory.

 The separation of memory and CPU is known as the von Neumann bottleneck.
 This is a problem because CPUs can execute instructions more than a hundred times faster than they can fetch items from main memory.

Modifications to the von Neumann Model

• Achieved by caching, virtual memory, and low-level parallelism.

The basics of caching:

 Caching is one of the most widely used methods of addressing the von Neumann bottleneck.
 Cache is a collection of memory locations that can be accessed in less time than some other
memory locations.
 CPU cache is a collection of memory locations that the CPU can access more quickly than it
can access main memory.
 A CPU cache can either be located on the same chip as the CPU or it can be located on a
separate chip that can be accessed much faster than an ordinary memory chip.
 Once we have a cache, an obvious problem is deciding which data and instructions should
be stored in the cache.
 The universally used principle is based on the idea that programs tend to use data and
instructions that are physically close to recently used data and instructions.
 The principle that an access of one location is followed by an access of a nearby location is
often called locality.
 After accessing one memory location (instruction or data), a program will typically access a
nearby location (spatial locality) in the near future (temporal locality).
 A memory access will effectively operate on blocks of data and instructions instead of
individual instructions and individual data items. These blocks are called cache
blocks or cache lines.
 The cache is usually divided into levels: the first level (L1) is the smallest and the fastest, and higher levels (L2, L3, ...) are larger and slower.
 When the CPU needs to access an instruction or data, it works its way down the cache
hierarchy: First it checks the level 1 cache, then the level 2, and so on. Finally, if the
information needed isn’t in any of the caches, it accesses main memory.
 When a cache is checked for information and the information is available, it’s called
a cache hit or just a hit. If the information isn’t available, it’s called a cache miss or a miss.
 In write-through caches, the line is written to main memory when it is written to the cache.
 In write-back caches, the data isn’t written immediately. Rather, the updated data in the
cache is marked dirty, and when the cache line is replaced by a new cache line from memory,
the dirty line is written to memory.
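A hedged illustration of locality and cache lines (a sketch, not from the original notes): because a C array is stored row by row, the first pair of loops below accesses consecutive memory locations and reuses each cache line it fetches, while the second pair jumps a whole row between accesses and so tends to miss in the cache far more often.

    /* Illustrative sketch: the same sum computed with cache-friendly and
       cache-unfriendly traversal orders of a row-major 2-D array.        */
    #include <stdio.h>

    #define N 1000

    double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* Row-major order: consecutive memory accesses, good spatial locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major order: consecutive accesses are N doubles apart,
           so most of them land on a different cache line                 */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }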

Virtual memory:

 Caches make it possible for the CPU to quickly access instructions and data that are in main
memory.
 However, if we run a very large program or a program that accesses very large data sets, all of
the instructions and data may not fit into main memory.
 Virtual memory was developed so that main memory can function as a cache for secondary
storage.
 It exploits the principle of spatial and temporal locality by keeping in main memory only the
active parts of the many running programs; those parts that are idle are kept in a block of
secondary storage called swap space.
 Like CPU caches, virtual memory operates on blocks of data and instructions. These blocks are commonly called pages, and since secondary storage access can be hundreds of thousands of times slower than main memory access, pages are relatively large: most systems have a fixed page size that currently ranges from 4 to 16 kilobytes.
 When the program is run, a table is created that maps the virtual page numbers to physical
addresses.
 When the program is run and it refers to a virtual address, this page table is used to translate
the virtual address into a physical address.
 A drawback to the use of a page table is that it can double the time needed to access a location in
main memory.
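A minimal sketch (illustrative assumptions: 4-kilobyte pages, so the low 12 bits of an address form the offset, and 32-bit addresses) of how a virtual address splits into a virtual page number, which the page table maps to a physical frame, plus an unchanged offset:

    /* Illustrative only: splitting a virtual address into a virtual page
       number and an offset, assuming a 4 KiB page size (12 offset bits). */
    #include <stdio.h>

    #define PAGE_SIZE   4096u
    #define OFFSET_BITS 12u

    int main(void) {
        unsigned int virt_addr = 0x00403a7cu;              /* assumed example address   */
        unsigned int page_num  = virt_addr >> OFFSET_BITS; /* index into the page table */
        unsigned int offset    = virt_addr & (PAGE_SIZE - 1u);

        printf("virtual page number = 0x%x, offset = 0x%x\n", page_num, offset);

        /* A real page table would map page_num to a physical frame number;
           the physical address is then (frame << OFFSET_BITS) | offset.   */
        return 0;
    }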

Translation-lookaside buffer:

 Although multiple programs can use main memory at more or less the same time, using a
page table has the potential to significantly increase each program’s overall run-time.
 In order to address this issue, processors have a special address translation cache called
a translation-lookaside buffer, or TLB. It caches a small number of entries (typically 16–
512) from the page table in very fast memory.
 When we look for an address and the virtual page number is in the TLB, it’s called a
TLB hit. If it’s not in the TLB, it’s called a miss.
 If we attempt to access a page that’s not in memory, that is, the page table doesn’t have a
valid physical address for the page and the page is only stored on disk, then the attempted
access is called a page fault.

4. Instruction-level parallelism:
 Instruction-level parallelism, or ILP, attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions.
 There are two main approaches to ILP:
o Pipelining, in which functional units are arranged in stages,
o Multiple issue, in which multiple instructions can be simultaneously initiated. Both
approaches are used in virtually all modern CPUs.
Pipelining
 The principle of pipelining is similar to a factory assembly line: while one team is bolting a car’s
engine to the chassis, another team can connect the transmission to the engine and the driveshaft
of a car that’s already been processed by the first team, and a third team can bolt the body to the
chassis in a car that’s been processed by the first two teams.
 As an example involving computation, suppose we want to add the floating point numbers 9.87 × 10^4 and 6.54 × 10^3. Then we can use the following steps:
1. Fetch the operands from memory.
2. Compare the exponents.
3. Shift one operand so that the exponents match.
4. Add the significands.
5. Normalize the result.
6. Round the result.
7. Store the result in memory.
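A hedged worked example with assumed timings (not figures from the original notes): if each of the seven steps above takes one nanosecond and a program must perform 1000 such additions, an unpipelined functional unit needs roughly 7 × 1000 = 7000 nanoseconds, whereas a unit pipelined into seven stages produces its first result after 7 nanoseconds and one further result every nanosecond thereafter, finishing in roughly 7 + 999 = 1006 nanoseconds.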

Multiple issue:

 Pipelines improve performance by taking individual pieces of hardware or functional units and
connecting them in sequence.
 Multiple issue processors replicate functional units and try to simultaneously execute different
instructions in a program.
 For example, if we have two complete floating point adders, we can approximately halve the time it takes to execute a loop of the form

    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];

 While the first adder is computing z[0], the second can compute z[1]; while the first is
computing z[2], the second can compute z[3]; and so on.
 If the functional units are scheduled at compile time, the multiple issue system is said to
use static multiple issue.
 If they’re scheduled at run-time, the system is said to use dynamic multiple issue. A processor
that supports dynamic multiple issue is sometimes said to be superscalar.
 In order to make use of multiple issue, the system must find instructions that can be executed
simultaneously.
 One of the most important techniques is speculation.
 In speculation, the compiler or the processor makes a guess about an instruction, and then
executes the instruction on the basis of the guess.
 As a simple example, in the following code, the system might predict that the outcome of z = x +
y will give z a positive value, and, as a consequence, it will assign w = x.
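A minimal sketch of the kind of code being described (reconstructed from the description above; the values of x and y are assumed):

    #include <stdio.h>

    int main(void) {
        double x = 1.0, y = 2.0, z, w;   /* assumed values */

        z = x + y;          /* the system may speculate that z > 0 ...          */
        if (z > 0)
            w = x;          /* ... and start this assignment before it knows    */
        else
            w = y;          /* if the guess was wrong, the speculative work is
                               discarded and this assignment is used instead    */
        printf("w = %f\n", w);
        return 0;
    }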

5. Thread-level parallelism:
 Thread-level parallelism, or TLP, attempts to provide parallelism through the simultaneous execution of different threads, so it provides a coarser-grained parallelism than ILP: the program units that are executed simultaneously (threads) are larger or coarser than ILP's finer-grained units (individual instructions).
 Hardware multithreading provides a means for systems to continue doing useful work when the task being currently executed has stalled, for example, if the current task has to wait for data to be loaded from memory. Instead of looking for parallelism in the currently executing thread, it may make sense to simply run another thread.
 In fine-grained multithreading, the processor switches between threads after each instruction,
skipping threads that are stalled. While this approach has the potential to avoid wasted machine
time due to stalls, it has the drawback that a thread that’s ready to execute a long sequence of
instructions may have to wait to execute every instruction.
 Coarse-grained multithreading attempts to avoid this problem by only switching threads that are
stalled waiting for a time-consuming operation to complete (e.g., a load from main memory).
This has the virtue that switching threads doesn’t need to be nearly instantaneous. However, the
processor can be idled on shorter stalls, and thread switching will also cause delays.
 Simultaneous multithreading, or SMT, is a variation on fine-grained multithreading. It attempts to exploit superscalar processors by allowing multiple threads to make use of the multiple functional units. If we designate “preferred” threads (threads that have many instructions ready to execute), we can somewhat reduce the problem of thread slowdown.

6. Flynn's Classification:
 There are a number of different ways to classify parallel computers. Examples are available in
the references.

 One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.

 Flynn's taxonomy distinguishes multi-processor computer architectures according to how they


can be classified along the two independent dimensions of Instruction Stream and Data
Stream. Each of these dimensions can have only one of two possible states: Single or Multiple.

 The matrix below defines the 4 possible classifications according to Flynn:

                            Single Data (SD)   Multiple Data (MD)
Single Instruction (SI)     SISD               SIMD
Multiple Instruction (MI)   MISD               MIMD

Single Instruction, Single Data (SISD)


 A serial (non-parallel) computer
 Single Instruction: Only one instruction stream is being acted on by the CPU during any one
clock cycle
 Single Data: Only one data stream is being used as input during any one clock cycle
 Deterministic execution
 This is the oldest type of computer
 Examples: older generation mainframes, minicomputers, workstations and single processor/core
PCs.
Single Instruction, Multiple Data (SIMD)
 A type of parallel computer
 Single Instruction: All processing units execute the same instruction at any given clock cycle
 Multiple Data: Each processing unit can operate on a different data element
 Best suited for specialized problems characterized by a high degree of regularity, such as
graphics/image processing.
 Synchronous (lockstep) and deterministic execution
 Two varieties: Processor Arrays and Vector Pipelines
 Examples:
 Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
 Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi
S820, ETA10
 Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD
instructions and execution units.
Multiple Instruction, Single Data (MISD)
 A type of parallel computer
 Multiple Instruction: Each processing unit operates on the data independently via separate
instruction streams.
 Single Data: A single data stream is fed into multiple processing units.
 Few (if any) actual examples of this class of parallel computer have ever existed.
 Some conceivable uses might be:
 multiple frequency filters operating on a single signal stream
 multiple cryptography algorithms attempting to crack a single coded message.

Multiple Instruction, Multiple Data (MIMD)


 A type of parallel computer
 Multiple Instruction: Every processor may be executing a different instruction stream
 Multiple Data: Every processor may be working with a different data stream
 Execution can be synchronous or asynchronous, deterministic or non-deterministic
 Currently, the most common type of parallel computer - most modern supercomputers fall into
this category.
 Examples: most current supercomputers, networked parallel computer clusters and "grids",
multi-processor SMP computers, multi-core PCs.
 Note: Many MIMD architectures also include SIMD execution sub-components.

7. Shared Memory and Distributed Memory Architectures:

 There are two principal types of MIMD systems:


o Shared-memory systems
o Distributed-memory systems.

Shared Memory system:


 In a shared-memory system a collection of autonomous processors is connected to a memory
system via an interconnection network, and each processor can access each memory location.
 In a shared-memory system, the processors usually communicate implicitly by accessing shared
data structures.
Distributed Memory system:
 In a distributed-memory system, each processor is paired with its own private memory, and the
processor-memory pairs communicate over an interconnection network.
 So in distributed-memory systems the processors usually communicate explicitly by sending
messages or by using special functions that provide access to the memory of another processor.
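A minimal, hedged sketch of explicit message passing in a distributed-memory program, written with MPI (one of the APIs listed earlier); the variable names and the value sent are illustrative, and the program assumes it is run with at least two processes (e.g. mpiexec -n 2):

    /* Minimal sketch: process 0 sends an int held in its private memory
       to process 1; communication is explicit, via messages.            */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 42;                                          /* data in process 0's memory */
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit communication     */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received x = %d\n", x);
        }

        MPI_Finalize();
        return 0;
    }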

8. Cache Coherence:
 CPU caches are managed by system hardware: programmers don’t have direct control over them.
 This has several important consequences for shared-memory systems.
 To understand these issues, suppose we have a shared-memory system with two cores, each of
which has its own private data cache.
 As long as the two cores only read shared data, there is no problem.
 For example, suppose that x is a shared variable that has been initialized to 2, y0 is private and
owned by core 0, and y1 and z1 are private and owned by core 1.
 Now suppose the following statements are executed at the indicated times:

Time   Core 0                          Core 1
0      y0 = x;                         y1 = 3*x;
1      x = 7;                          statement(s) not involving x
2      statement(s) not involving x    z1 = 4*x;
 Then the memory location for y0 will eventually get the value 2, and the memory location
for y1 will eventually get the value 6.
 However, it’s not so clear what value z1 will get.
 It might at first appear that since core 0 updates x to 7 before the assignment to z1, z1 will get the value 4 × 7 = 28.
 However, at time 0, x is in the cache of core 1.
 So unless for some reason x is evicted from core 0’s cache and then reloaded into core 1’s cache, it actually appears that the original value x = 2 may be used, and z1 will get the value 4 × 2 = 8.

Fig : Shared memory system with 2 cores and 2 caches


 Note that this unpredictable behavior will occur regardless of whether the system is using a
write-through or a write-back policy.
 If it’s using a write-through policy, the main memory will be updated by the assignment x = 7.
However, this will have no effect on the value in the cache of core 1.
 If the system is using a write-back policy, the new value of x in the cache of core 0 probably
won’t even be available to core 1 when it updates z1.
Cache Coherence Problem:
 Clearly, this is a problem.
 The programmer doesn’t have direct control over when the caches are updated, so her program
cannot execute these apparently innocuous statements and know what will be stored in z1.
 There are several problems here, but the one we want to look at right now is that the caches we
described for single processor systems provide no mechanism for insuring that when the caches
of multiple processors store the same variable, an update by one processor to the cached variable
is “seen” by the other processors.
 That is, that the cached value stored by the other processors is also updated. This is called
the cache coherence problem.
Two main approaches:
 There are two main approaches to insuring cache coherence:
o snooping cache coherence
o directory-based cache coherence.
Snooping Cache Coherence:
 The idea behind snooping comes from bus-based systems:
 When the cores share a bus, any signal transmitted on the bus can be “seen” by all the cores
connected to the bus.
 Thus, when core 0 updates the copy of x stored in its cache, if it also broadcasts this information
across the bus, and if core 1 is “snooping” the bus, it will see that x has been updated and it can
mark its copy of x as invalid.
 This is more or less how snooping cache coherence works.
Directory-based cache coherence:
 Snooping cache coherence requires a broadcast every time a variable is updated.
 So snooping cache coherence isn’t scalable, because for larger systems it will cause
performance to degrade.
 Directory-based cache coherence protocols attempt to solve this problem through the use of a
data structure called a directory.
 The directory stores the status of each cache line. Typically, this data structure is distributed.
 Thus, when a line is read into, say, core 0’s cache, the directory entry corresponding to that line
would be updated indicating that core 0 has a copy of the line.
 When a variable is updated, the directory is consulted, and the cache controllers of the cores that have that variable’s cache line in their caches invalidate their copies.
False sharing:
 As an example, suppose we want to repeatedly call a function f(i,j) and add the computed values into a vector y:

    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            y[i] += f(i,j);
 We can parallelize this by dividing the iterations in the outer loop among the cores (a sketch of this partitioning follows this list).
 If we have core_count cores, we might assign the first m/core_count iterations to the first core, the next m/core_count iterations to the second core, and so on.

 Now suppose our shared-memory system has two cores, m = 8, doubles are eight bytes, cache
lines are 64 bytes, and y[0] is stored at the beginning of a cache line.
 A cache line can store eight doubles, and y takes one full cache line.
 What happens when core 0 and core 1 simultaneously execute their codes?
 Since all of y is stored in a single cache line, each time one of the cores executes the
statement y[i] += f(i,j), the line will be invalidated, and the next time the other core tries
to execute this statement it will have to fetch the updated line from memory!
 So if n is large, we would expect that a large percentage of the assignments y[i] += f(i,j) will access main memory, in spite of the fact that core 0 and core 1 never access each other’s elements of y.
 This is called false sharing, because the system is behaving as if the elements of y were being
shared by the cores.
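A hedged sketch of the partitioning described above (illustrative, assuming m = 8, two cores, 8-byte doubles, 64-byte cache lines, and a hypothetical function f). The two cores' loops are written one after the other here; in a real program they would run concurrently, and every update by one core would invalidate the single cache line holding y in the other core's cache:

    /* Illustrative: with m = 8 and two cores, core 0 updates y[0..3] and
       core 1 updates y[4..7].  All eight doubles fit in one 64-byte cache
       line, so the cores falsely share that line even though they never
       touch the same element.                                            */
    #include <stdio.h>

    #define M 8
    #define N 100
    #define CORE_COUNT 2

    double f(int i, int j) { return i + 0.001 * j; }   /* hypothetical f */

    int main(void) {
        double y[M] = {0.0};

        /* Core 0's share: i = 0 .. M/CORE_COUNT - 1 */
        for (int i = 0; i < M / CORE_COUNT; i++)
            for (int j = 0; j < N; j++)
                y[i] += f(i, j);

        /* Core 1's share: i = M/CORE_COUNT .. M - 1 */
        for (int i = M / CORE_COUNT; i < M; i++)
            for (int j = 0; j < N; j++)
                y[i] += f(i, j);

        printf("y[0] = %f, y[%d] = %f\n", y[0], M - 1, y[M - 1]);
        return 0;
    }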

9. Parallel Software:
Performance
 The main purpose in writing parallel programs is usually increased performance, so it makes sense to ask how we can evaluate our programs.
1. Speedup and efficiency:
 Usually the best we can hope to do is to equally divide the work among the cores, while at the
same time introducing no additional work for the cores.
 If we succeed in doing this, and we run our program with p cores, one thread or process on each
core, then our parallel program will run p times faster than the serial program.
 If we call the serial run-time Tserial and our parallel run-time Tparallel, then the best we can hope for
is Tparallel = Tserial/p.
 When this happens, we say that our parallel program has linear speedup.
 If we define the speedup of a parallel program to be S = Tserial / Tparallel, then linear speedup has S = p, which is unusual.

 Furthermore, as p increases, we expect S to become a smaller and smaller fraction of the ideal, linear speedup p.
 Another way of saying this is that S/p, which is called the efficiency E of the parallel program, will probably get smaller and smaller as p increases (a short worked example follows this list).

 If Toverhead denotes this parallel overhead (for example, time the processes/threads spend communicating and coordinating), it's often the case that Tparallel = Tserial/p + Toverhead.
 As the problem size is increased, Toverhead often grows more slowly than Tserial.
 When this is the case, the speedup and the efficiency will increase.
 There is more work for the processes/threads to do, so the relative amount of time spent coordinating their work should be smaller.
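A hedged worked example with assumed numbers (not from the original notes): suppose Tserial = 24 seconds and a run on p = 8 cores takes Tparallel = 4 seconds. Then the speedup is S = 24/4 = 6 and the efficiency is E = S/p = 6/8 = 0.75, noticeably short of the linear-speedup ideal S = p = 8.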

2. Amdahl’s law

 Back in the 1960s, Gene Amdahl made an observation [2] that’s become known as Amdahl’s
law.
 It says, roughly, that unless virtually all of a serial program is parallelized, the possible speedup
is going to be very limited—regardless of the number of cores available.
 For example, suppose we can parallelize 90% of a serial program. If the serial run-time is Tserial = 20 seconds, then the run-time of the parallelized part will be 0.9 × Tserial / p = 18/p and the run-time of the “unparallelized” part will be 0.1 × Tserial = 2. The overall parallel run-time will be Tparallel = 18/p + 2, so the speedup S = 20/(18/p + 2) can never exceed 20/2 = 10, no matter how many cores we use.
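A hedged sketch (illustrative, not from the original notes) of the general form of Amdahl's law, S(p) = 1 / ((1 − r) + r/p), where r is the fraction of the serial run-time that can be parallelized; with r = 0.9 this reproduces the limit of 10 from the example above:

    #include <stdio.h>

    /* Amdahl's law: speedup with p cores when a fraction r of the serial
       run-time is (perfectly) parallelized and the rest stays serial.   */
    double amdahl_speedup(double r, int p) {
        return 1.0 / ((1.0 - r) + r / p);
    }

    int main(void) {
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d  speedup = %.2f\n", p, amdahl_speedup(0.9, p));
        return 0;
    }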

3. Scalability:

 The word “scalable” has a wide variety of informal uses; in parallel computing it is given a more precise meaning.


 Suppose we run a parallel program with a fixed number of processes/threads and a fixed input
size, and we obtain an efficiency E. Suppose we now increase the number of processes/threads
that are used by the program. If we can find a corresponding rate of increase in the problem size
so that the program always has efficiency E, then the program is scalable.
 As an example, suppose that Tserial = n, where the units of Tserial are microseconds and n is also the problem size. Also suppose that Tparallel = n/p + 1. Then the efficiency is E = Tserial / (p × Tparallel) = n / (n + p). If we increase p by a factor of k and also increase n by the same factor k, the efficiency becomes (k·n) / (k·n + k·p) = n / (n + p), which is unchanged; hence the program is scalable.

10. Massive Parallelism:


 Massive parallelism in parallel computing is a technique that leverages the power of numerous
processors to perform multiple computations simultaneously, significantly enhancing
computational speed and efficiency.
 This method is crucial for solving complex computational problems that are too intensive for a
single processor to handle. Here are some key points about massive parallelism:
Key Characteristics:
1. Scalability: Systems can scale from a few processors to thousands or even millions, depending
on the computational needs.
2. Performance: By distributing tasks across many processors, systems can achieve much higher
performance than traditional single-processor systems.
3. Efficiency: Ideal for tasks that can be broken down into smaller, concurrent tasks, such as
simulations, data analysis, and real-time processing.
Applications:
 Scientific Research: Enabling detailed simulations and analyses, such as weather forecasting
and molecular modeling.
 Finance: Speeding up complex algorithms for trading and risk management.
 Artificial Intelligence: Enhancing the training and inference of large neural networks.
 Healthcare: Facilitating genome sequencing and other bioinformatics tasks.
Technologies Involved:
 GPUs (Graphics Processing Units): Often used in conjunction with CPUs for high-throughput
tasks.
 FPGAs (Field-Programmable Gate Arrays): Customizable to specific tasks for optimal performance.
 Supercomputers: Highly parallel systems with thousands of processors working in tandem.

11. Graphics processing units:


 A graphics processing unit (graphical processing unit, GPU) is an electronic circuit designed to
speed computer graphics and image processing on various devices.
 These devices include video cards, system boards, mobile phones and personal computers (PCs).
 By performing mathematical calculations rapidly, a GPU reduces the time needed for a computer
to run multiple programs.
 This makes it an essential enabler of emerging and future technologies such as machine learning
(ML), artificial intelligence (AI) and blockchain.
 Before the invention of GPUs in the 1990s, graphics controllers in PCs and video game consoles relied on a computer's central processing unit (CPU) to run tasks.
 Since the early 1950s, CPUs have been the most important processors in a computer, running all
instructions necessary to run programs, such as logic, controlling and input/output (I/O).
 In the 2010s, GPU technology gained even more capabilities, perhaps most significantly ray
tracing (the generation of computer images by tracing the direction of light from a camera)
and tensor cores (designed to enable deep learning).

How does a GPU work?


 A GPU has its own random-access memory (RAM), an electronic memory used to store code and data that the chip can access and alter as needed.
 Advanced GPUs typically have RAM that has been built to hold the large data volumes required
for compute-intensive tasks such as graphics editing, gaming or AI/ML use cases.
 Two popular kinds of GPU memory are Graphics Double Data Rate 6 Synchronous Dynamic
Random-Access Memory (GDDR6) and GDDR6X, a later generation. GDDR6X uses 15% less
power per transferred bit than GDDR6, but its overall power consumption is higher because it is
faster.
Types of GPUs:
There are three types of GPUs:
 Discrete GPUs
 Integrated GPUs
 Virtual GPUs
Discrete GPUs

 Discrete GPUs, or dGPUs, are graphics processors separate from the device's CPU, the chip where information is taken in and processed to allow a computer to function.

 Discrete GPUs are typically used in advanced applications with special requirements, such as
editing, content creation or high-end gaming.
 They are distinct chips, mounted on their own circuit boards and typically attached to the CPU through a PCI Express slot.

 One example of a discrete GPU brand is Intel Arc, built for the PC gaming market.

Integrated GPUs
 An integrated GPU, or iGPU, is built into a computer or device's infrastructure and typically
slotted in next to the CPU.
 Popularized in the 2010s by Intel, integrated GPUs became more common as manufacturers such as MSI, ASUS and Nvidia recognized the advantages of combining GPUs with CPUs rather than requiring users to add a GPU via a PCI Express slot themselves.
 They remain a popular choice for laptop users, gamers and others who are running compute-
intensive programs on their PCs.
Virtual GPUs
 Virtual GPUs, or vGPUs, have the same capabilities as discrete or integrated GPUs but without
the hardware.
 They are software-based versions of GPUs built for cloud instances and can be used to run the
same workloads.
 Also, because they have no hardware, they are simpler and cheaper to maintain than their
physical counterparts.
12. GPGPUs:
 General-Purpose computing on Graphics Processing Units (GPGPUs) is a fascinating area of
parallel computing.
 Essentially, GPGPUs leverage the massive parallel processing power of GPUs, which were
originally designed for rendering graphics, to perform a wide range of computational tasks
traditionally handled by CPUs.
Here are some key points about GPGPUs in parallel computing:
Architecture and Components
 Multiple Cores: GPUs have hundreds or thousands of smaller, efficient cores compared to
CPUs, which are designed for sequential processing.
 Memory Hierarchy: Includes registers, shared memory, and global memory, allowing efficient
data handling and access.
Programming Models
 CUDA (Compute Unified Device Architecture): Developed by NVIDIA, it provides a C-like
language and runtime system for writing parallel code.
 OpenCL (Open Computing Language): An open standard that supports heterogeneous
computing across different platforms and devices.
Applications
 Scientific Research: Accelerates complex simulations in fields like physics, biology, and
environmental science.
 Machine Learning and AI: Essential for training and running AI models due to their ability to
handle large-scale parallel processing.
 Medical Imaging: Processes medical images like MRI and CT scans for faster diagnostics.
 Financial Modeling: Used for risk analysis, algorithmic trading, and real-time data processing.
Benefits
 Enhanced Computational Speed: Parallel processing capabilities offer significant speed
advantages over traditional CPUs.
 Energy Efficiency: More energy-efficient for parallel tasks, providing a greener alternative for
high-performance computing.
