HPC Unit 456
UNIT 4
Idling:
Idling refers to a situation where processors or threads remain inactive because they
have no work to perform during parallel execution.
This can happen due to load imbalance, resulting in wasted resources and decreased
efficiency.
Excess computation:
Excess computation overhead arises from redundant or unnecessary calculations in a
parallel program.
It can occur due to improper task decomposition, redundant calculations, or inefficient
algorithms.
It leads to additional computational resources being used and longer execution times.
Performance metrics:
• Execution time:
The total time taken to execute a parallel program.
• Total overhead:
The additional time or resources consumed by a parallel program beyond the
actual computation.
• Speedup:
The performance improvement achieved by parallelizing a program compared to
its sequential version.
• Efficiency:
A measure of how effectively computational resources are utilized to solve a
problem (see the formulas after this list).
• Scalability:
The ability of a parallel system to maintain or improve its performance as the
number of processors/threads increases.
• Communication decrease:
When scaling down from n to p processors, the communication cost decreases by a
factor of n/p.
• Computation increase:
Computation at each processing element increases by a factor of n/p when scaling
down, because the total workload is divided among a reduced number of
processors.
Total cost: Θ(n log p)
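For reference, these metrics are related by the standard formulas (using the usual convention that TS is the best sequential runtime and TP the parallel runtime on p processors):

Speedup: S = TS / TP
Efficiency: E = S / p = TS / (p × TP)
Total overhead: To = p × TP − TS
Cost: p × TP (a formulation is cost-optimal when p × TP = Θ(TS))

Applied to the scaling-down bullets above (assuming the usual adding-n-numbers example): TP = Θ((n/p) log p), so cost = p × TP = Θ(n log p); since TS = Θ(n), this naive scaled-down formulation is not cost-optimal.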
https://youtu.be/gOv7t5yYvmo
4. Describe minimum execution time and minimum cost-optimal execution time
=>
Minimum Execution Time (TPmin):
To find the minimum execution time, differentiate the expression for TP (the parallel
runtime) with respect to p (the number of processors) and set the derivative to zero:
d(TP)/dp = 0
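A standard worked example (the adding-n-numbers case with TP = n/p + 2 log p is assumed here for illustration; it is not given in the notes):

d(TP)/dp = −n/p² + 2/p = 0  =>  p = n/2
TPmin = n / (n/2) + 2 log(n/2) = 2 + 2 log n − 2 = 2 log n = Θ(log n)

Minimum Cost-Optimal Execution Time (TPcost_opt):
The smallest TP attainable while the formulation remains cost-optimal, i.e., while p × TP = Θ(TS). For the same example, choosing p = Θ(n / log n) keeps the cost at Θ(n) and gives TPcost_opt = Θ(log n).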
Or
https://youtu.be/Vk-mJcj7y_I
Advantages:
Efficient
High performance
Faster processing
Limitations:
Requires significant memory
High computational complexity for large matrices
Not suitable for sparse matrices with mostly zero elements
Applications:
Scientific simulations, data analytics, and machine learning
Image/signal processing, data compression, and recommendation systems
UNIT 5
Load Balancing:
Load balancing involves distributing the computational workload evenly across the
available processing elements.
Load balancing techniques aim to minimize the impact of workload imbalance and
varying computational complexity.
Data Locality:
Data locality refers to the proximity of data to the processing element that operates on it.
Efficient exploitation of data locality can minimize data transfer and communication
overhead, leading to improved performance.
Techniques like data partitioning and data replication are used to enhance data locality
in parallel sorting.
Scalability:
Scalability refers to the ability of a parallel sorting algorithm to handle larger problem
sizes and utilize increasing computational resources effectively.
This requires the algorithm to be designed appropriately.
Combining Phase:
The sorted sub-arrays need to be combined into one sorted array.
Use a parallel merge algorithm, or communication/synchronization between processors
(see the sketch below).
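A minimal CUDA C sketch of this sort-then-combine pattern (the sizes, names, and the deliberately simple one-thread-per-block insertion sort are illustrative assumptions, not an efficient production sort):

#include <cstdio>
#include <algorithm>

#define N 1024          // total elements (illustrative)
#define CHUNKS 8        // number of sub-arrays sorted in parallel
#define CHUNK (N / CHUNKS)

// Sorting phase: each block sorts its own chunk; one thread per block
// keeps the sketch simple.
__global__ void sortChunk(int *a) {
    int base = blockIdx.x * CHUNK;
    for (int i = base + 1; i < base + CHUNK; ++i) {   // insertion sort
        int key = a[i], j = i - 1;
        while (j >= base && a[j] > key) { a[j + 1] = a[j]; --j; }
        a[j + 1] = key;
    }
}

int main() {
    int h[N], *d;
    for (int i = 0; i < N; ++i) h[i] = (i * 7919) % 1000;  // arbitrary data
    cudaMalloc(&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);
    sortChunk<<<CHUNKS, 1>>>(d);            // chunks sorted in parallel
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    // Combining phase: merge sorted runs of doubling width on the host.
    for (int w = CHUNK; w < N; w *= 2)
        for (int lo = 0; lo + w < N; lo += 2 * w)
            std::inplace_merge(h + lo, h + lo + w, h + std::min(lo + 2 * w, N));
    printf("sorted? %d\n", std::is_sorted(h, h + N));
    cudaFree(d);
    return 0;
}

The kernel realizes the parallel sorting phase; the host loop plays the role of the combining phase described above.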
Advantages:
Improved performance
Scalability
Efficient resource utilization
Challenges:
Load balancing
Data partitioning
Communication and synchronization overhead
https://youtu.be/UO5cQ5G9DFI
a) Source partitioned:
The source vertices are partitioned among the processors (p = n, one source per
processor), and each processor runs the sequential single-source algorithm for its
assigned source.
Parallel Execution Time (TP): Θ(n²)
Advantages:
Clearly defined partitioning
Each processor can compute shortest paths independently
Suitable for distributed-memory systems
Limitations:
Overhead of exchanging information between processors
Load imbalance
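A minimal CUDA C sketch of the source-partitioned idea (an illustration, not from the notes: one thread plays the role of one processor, running an independent sequential Dijkstra from its own source over a shared adjacency matrix, so no communication is needed during the computation):

#include <cstdio>

#define V 8            // number of vertices (illustrative)
#define INF 1000000000

// Thread t computes single-source shortest paths from source t.
// adj is a V*V adjacency matrix (INF = no edge); dist is the V*V output.
__global__ void apspDijkstra(const int *adj, int *dist) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= V) return;
    bool done[V];
    int *d = dist + s * V;                  // this thread's private output row
    for (int v = 0; v < V; ++v) { d[v] = adj[s * V + v]; done[v] = false; }
    d[s] = 0;
    for (int iter = 0; iter < V - 1; ++iter) {
        int u = -1, best = INF;
        for (int v = 0; v < V; ++v)         // pick closest unfinished vertex
            if (!done[v] && d[v] < best) { best = d[v]; u = v; }
        if (u < 0) break;
        done[u] = true;
        for (int v = 0; v < V; ++v)         // relax edges out of u
            if (!done[v] && best + adj[u * V + v] < d[v])
                d[v] = best + adj[u * V + v];
    }
}

int main() {
    int h_adj[V * V], h_dist[V * V];
    for (int i = 0; i < V * V; ++i) h_adj[i] = INF;
    for (int v = 0; v < V; ++v) {           // a small ring graph (illustrative)
        h_adj[v * V + (v + 1) % V] = 1;
        h_adj[((v + 1) % V) * V + v] = 1;
    }
    int *d_adj, *d_dist;
    cudaMalloc(&d_adj, sizeof(h_adj));
    cudaMalloc(&d_dist, sizeof(h_dist));
    cudaMemcpy(d_adj, h_adj, sizeof(h_adj), cudaMemcpyHostToDevice);
    apspDijkstra<<<1, V>>>(d_adj, d_dist);  // p = n: one thread per source
    cudaMemcpy(h_dist, d_dist, sizeof(h_dist), cudaMemcpyDeviceToHost);
    printf("dist(0,%d) = %d\n", V / 2, h_dist[V / 2]);  // expect V/2 on a ring
    cudaFree(d_adj); cudaFree(d_dist);
    return 0;
}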
b) Source parallel:
Used when p > n.
Here, each single-source problem is itself parallelized: groups of processors work
concurrently to explore different parts of the graph.
Advantages:
Parallelism is achieved by distributing the workload among multiple processors.
Improved performance.
Suitable for large graphs
Limitations:
Communication and synchronization overhead
Load imbalance
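For reference, the standard analysis of this formulation (stated here as an assumption, since the notes omit it): each of the n single-source problems is assigned a group of p/n processors, and the parallel Dijkstra run per source gives TP = Θ(n³/p) + Θ(n log p) (computation plus communication).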
Dijkstra Algorithm
https://youtu.be/84y-fHI008M
Random polling:
An idle processor requests work from a randomly selected processor. This approach
spreads work requests evenly and helps prevent any particular processor from being
overloaded.
Shared Memory:
Processes access a shared memory region.
Communication through reading and writing to shared memory.
Simple implementation, high performance on shared memory systems.
Requires synchronization.
Publish-Subscribe:
Publishers send messages to a central broker, subscribers receive relevant messages.
Loosely coupled communication.
Simplifies system design, dynamic scalability.
Used in event-driven systems and publish-subscribe frameworks.
https://youtu.be/Sazh4Y-WlDk
https://youtu.be/embRDiiH-ts
UNIT 6
Components:
• Host (CPU):
It is responsible for managing the overall execution of the program and
coordinating its tasks
• Device (GPU):
It performs parallel computations
• Kernels:
Kernels are parallel functions that are executed on the GPU
Written in the CUDA C/C++ language
• Thread Hierarchy:
Threads are lightweight, independent units of execution that run on the GPU.
Threads are organized hierarchically: individual threads are grouped into thread
blocks, and thread blocks are grouped into a grid
• Grid:
A grid is a collection of thread blocks that execute independently on the GPU.
• Memory Hierarchy:
The CUDA architecture includes various memory types accessible by threads:
registers (private to each thread), shared memory (shared within a thread block),
and global memory (accessible by all threads)
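A minimal sketch tying these components together (a standard vector-add example; names like vecAdd and the sizes are illustrative): the host manages memory and launches a kernel over a grid of thread blocks on the device.

#include <cstdio>
#include <cstdlib>

// Kernel: runs on the device; each thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = (float)(2 * i); }
    float *da, *db, *dc;                                  // device (global) memory
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
    int threads = 256;                                    // threads per block
    int blocks = (n + threads - 1) / threads;             // blocks in the grid
    vecAdd<<<blocks, threads>>>(da, db, dc, n);           // kernel launch on GPU
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);    // device -> host
    printf("c[5] = %.0f (expected 15)\n", hc[5]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}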
CUDA Applications
Medical Imaging
Computational Finance
Oil and Gas Exploration
Data science and analytics
Deep learning and machine learning
Benefits:
Massive Parallelism
Accelerated Performance
Heterogeneous Computing
Programming Flexibility
Wide Adoption and Support
Limitations:
Hardware Dependency
Learning Curve
Memory Limitations
Limited Software Support
Development Complexity
https://youtu.be/Ongct-wmYxo
• Thread Hierarchy:
It organizes threads into blocks, and blocks into a grid
• Memory Management:
It offers explicit memory management for global memory (accessible by all
threads) and shared memory (accessible within a block)
Benefits of CUDA C:
High Performance
Flexibility
Broad GPU Support
Limitations of CUDA C:
Dependency on NVIDIA GPUs
Steeper learning curve
Limited Portability
Inter-Thread Communication:
Mechanisms such as shared memory facilitate data sharing and coordination between
threads within a block.
__syncthreads():
A barrier synchronization primitive in CUDA: every thread in a block waits at the barrier
until all threads of that block have reached it. It is typically used when threads cooperate
through shared memory, e.g., between writing shared data and reading it back (see the
sketch below).
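A minimal sketch of both ideas together (the kernel name blockSum and the sizes are illustrative assumptions): threads of a block stage data in shared memory, synchronize with __syncthreads(), and cooperatively reduce it to one partial sum per block.

#include <cstdio>

#define THREADS 256   // threads per block (illustrative)

__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int buf[THREADS];                // visible to all threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0;     // stage one element in shared memory
    __syncthreads();                            // wait until every write is done
    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                        // finish this step before the next
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}

int main() {
    const int n = 1024, blocks = n / THREADS;
    int h[n], hout[blocks], *d, *dout;
    for (int i = 0; i < n; ++i) h[i] = 1;
    cudaMalloc(&d, n * sizeof(int)); cudaMalloc(&dout, blocks * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<blocks, THREADS>>>(d, dout, n);
    cudaMemcpy(hout, dout, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    printf("block 0 sum = %d (expected %d)\n", hout[0], THREADS);
    cudaFree(d); cudaFree(dout);
    return 0;
}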
Components:
• HDFS:
Distributed file system for storing and accessing large data files.
• MapReduce:
Programming model for parallel processing and analysis of large datasets.
• YARN:
Resource management framework for scheduling jobs and allocating resources in
a Hadoop cluster.
• Hadoop Ecosystem:
Collection of tools and frameworks that extend Hadoop's capabilities.
Features:
• Speed:
Fast in-memory processing.
• Distributed computing:
It distributes data and computation across machines for efficient parallel
processing.
• Scalability:
Handles large data and scales from single machines to clusters.
• Integration:
Works well with popular big data tools.
• Ease of use:
User-friendly API supporting multiple languages.
Features:
• Stream Processing:
It enables real-time data processing and analysis
• Batch Processing:
It efficiently executes complex batch jobs.
• Fault Tolerance:
It allows recovery from failures without data loss.
• Scalability:
It scales horizontally to handle large data volumes, automatically parallelizing
computations
• Integration:
It seamlessly integrates with other big data ecosystems, like Kafka, Hadoop, etc.
• Dynamic Updates:
It supports dynamic updates to running jobs.
Features:
• Platform and Device Independence:
It allows developers to write code that runs on different hardware platforms and
devices.
• Heterogeneous Computing:
Utilize multiple devices simultaneously for efficient processing.
• Parallel Execution:
A task is divided into multiple smaller tasks, which are executed concurrently
• Portability:
Write code once and run it on various hardware platforms.