Unit 2: HPC Content

CO1 Comprehend the concept of distributed memory in computing and recognize its significance in parallel and distributed systems.

What is Distributed Computing?

Distributed computing is a computing concept that leverages the combined power of multiple interconnected computers to collaborate on a shared task. Unlike traditional computing, which relies on a single central machine, distributed systems distribute the workload across numerous interconnected nodes.

This approach brings several benefits, including heightened processing capabilities, improved resilience against failures, and an enhanced ability to handle larger workloads. By breaking down tasks into smaller components and distributing them across the network, distributed computing enables swifter and more efficient processing.

It finds extensive application in high-performance computing, big data processing, and content delivery networks, revolutionizing our approach to complex computational challenges.

Types of Distributed Systems

Common arrangements of distributed systems include:

• Client/server systems: In client-server systems, the client requests a resource or file and the server fetches that resource. Clients and servers usually communicate over a computer network, so they form a distributed system. A client is typically in contact with just one server at a time.

• Peer-to-peer systems: Peer-to-peer systems contain nodes that are equal participants in data sharing. The nodes communicate with each other as needed to share resources over a network, and tasks are divided equally among all the nodes.

• Middleware: Middleware can be thought of as an application that sits between two separate applications and provides services to both. It works as a base for interoperability between applications running on different operating systems, allowing data to be transferred between them.

• Three-tier: A three-tier system uses a separate layer and server for each function of a program. Client data is stored in the middle tier rather than on the client system or on a dedicated server, which simplifies development. It includes an Application Layer, Data Layer, and Presentation Layer, and is mostly used in web or online applications.

• N-tier: N-tier, also called a multitier distributed system, can contain any number of functional tiers in the network. N-tier systems have a structure similar to three-tier architecture, with one application sending requests to another to perform a task or provide a service. N-tier is commonly used in web applications and data systems.

Benefits of Distributed Computing

Distributed computing presents numerous advantages that make it a valuable approach across diverse fields. Let's explore a few of the significant benefits it offers:

• Increased Processing Power: By harnessing the collective computing power of multiple machines, distributed computing enables faster and more efficient processing of complex tasks. This enhanced processing capability allows for quicker data analysis, simulations, and computations, empowering industries to tackle large-scale problems and achieve faster results.
• Improved Fault Tolerance: Distributed systems are designed with
redundancy and fault tolerance in mind. If one machine or node fails, the
workload can be automatically rerouted to other functioning nodes, ensuring
uninterrupted operation. This resilience minimizes the impact of hardware
failures, software glitches, or network disruptions, resulting in increased
system availability and reliability.
• Enhanced Scalability: Distributed computing offers excellent scalability,
allowing systems to handle growing workloads and adapt to changing
demands. Additional machines or nodes can be easily added to the network,
expanding the system’s processing capacity without requiring major
architectural changes. This scalability enables businesses to accommodate
increasing data volumes, user traffic, and computational requirements
without compromising performance.
• Resource Efficiency: By distributing tasks across multiple machines,
distributed computing optimizes resource utilization. Each machine can
contribute its processing power, memory, and storage capacity to the overall
system, maximizing efficiency and reducing idle resources. This resource
optimization leads to cost savings as organizations can achieve high-
performance levels without needing expensive dedicated hardware.
• Support for Large-Scale Data Processing: In the era of big data,
distributed computing is essential for processing and analysing massive
datasets. Distributed frameworks and algorithms, such as MapReduce and
parallel processing, enable efficient data handling and analysis, unlocking
valuable insights from vast volumes of information. This capability is
instrumental in industries like finance, healthcare, and e-commerce, where
data-driven decision-making is critical.
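To make the MapReduce idea above concrete, here is a toy, single-machine sketch of the map, shuffle, and reduce phases for a word count; real frameworks such as Hadoop run each phase in parallel across cluster nodes, and the documents and names below are made up for illustration only.

```python
from collections import defaultdict

# Toy MapReduce-style word count (single process, for illustration only).
docs = ["to be or not to be", "to see or not to see"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group all emitted counts by key (the word).
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in shuffled.items()}
print(counts)  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```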

Message Passing in Distributed Systems


Message passing in distributed systems refers to the communication mechanism used by nodes (computers or processes) to exchange information and coordinate their actions. It involves sending and receiving messages between nodes to achieve goals such as coordination, synchronization, and data sharing.
Message passing is a flexible and scalable method for inter-node communication in
distributed systems. It enables nodes to exchange information, coordinate activities, and
share data without relying on shared memory or direct method invocations. Models like
synchronous and asynchronous message passing offer different synchronization and
communication semantics to suit system requirements. Synchronous message passing
ensures sender and receiver synchronization, while asynchronous message passing
allows concurrent execution and non-blocking communication.
Types of Message Passing
1. Synchronous message passing
2. Asynchronous message passing
3. Hybrids
1. Synchronous Message Passing
Synchronous message passing is a communication mechanism in concurrent programming where processes or threads exchange messages in a synchronous manner. The sender blocks until the receiver has received and processed the message, ensuring coordination and predictable execution. This approach is generally implemented through blocking method calls or procedure invocations, where a process or thread blocks until the called routine returns a result or completes its execution. This blocking behavior ensures that the caller waits until the message is processed before proceeding. However, synchronous message passing has potential downsides, such as delays or halts in the system if the receiver takes too long to process the message or gets stuck. To ensure the proper functioning of synchronous message passing in concurrent systems, it is crucial to design carefully and to consider timeouts, fallbacks, and error handling.
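As a minimal sketch of this blocking behavior, the following Python fragment implements a rendezvous-style channel between two threads, where send() does not return until the receiver has processed the message. The SyncChannel class and its names are hypothetical, for illustration only.

```python
import threading
import queue

class SyncChannel:
    """Rendezvous channel: send() blocks until the receiver has processed."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._done = threading.Event()

    def send(self, msg):
        self._done.clear()
        self._slot.put(msg)     # hand the message to the receiver
        self._done.wait()       # block until the receiver finishes processing

    def recv(self, handler):
        msg = self._slot.get()  # block until a message arrives
        handler(msg)            # process while the sender is still waiting
        self._done.set()        # release the sender

ch = SyncChannel()
t = threading.Thread(target=lambda: ch.recv(lambda m: print("processed:", m)))
t.start()
ch.send("hello")  # returns only after the receiver printed the message
t.join()
```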

2. Asynchronous Message Passing

Asynchronous message passing is a communication mechanism in concurrent and distributed systems that enables processes or components to exchange messages without requiring synchronization in time. It involves sending a message to a receiving process or component and continuing execution without waiting for a response. Key characteristics of asynchronous message passing include its asynchronous nature, which allows the sender and receiver to operate independently without waiting for each other. Communication occurs through the exchange of messages, which can be one-way or include a reply address for the receiver to respond. Asynchronous message passing also allows for loose coupling between the sender and receiver, as they can be running in separate processes, threads, or on different machines.

Message buffering is frequently used in asynchronous message passing, allowing the sender and receiver to operate at their own pace. Asynchronous message passing is extensively used in scenarios like distributed systems, event-driven architectures, message queues, and actor models, enabling concurrency, scalability, and fault tolerance.
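By contrast, a minimal asynchronous sketch decouples sender and receiver with a message buffer, so the sender returns immediately. This is again a hypothetical illustration, using a Python queue as the mailbox.

```python
import queue
import threading
import time

mailbox = queue.Queue()  # unbounded buffer decouples sender and receiver

def receiver():
    while True:
        msg = mailbox.get()   # block until a message is available
        if msg is None:       # sentinel value signals shutdown
            break
        time.sleep(0.1)       # simulate slow processing
        print("handled:", msg)

t = threading.Thread(target=receiver)
t.start()

for i in range(3):
    mailbox.put(f"event-{i}")  # returns immediately; no waiting
    print(f"sent event-{i}, continuing other work")

mailbox.put(None)              # tell the receiver to stop
t.join()
```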

3. Hybrids

Hybrid message passing combines elements of both synchronous and asynchronous message passing. It gives the sender the flexibility to choose whether to block and wait for a response or to continue execution asynchronously. The choice between synchronous and asynchronous behavior can be made based on the specific requirements of the system or the nature of the communication. Hybrid message passing allows for optimization and customization in different scenarios, enabling a balance between the synchronous and asynchronous paradigms.

Asynchronous and synchronous computation and communication are two fundamental paradigms in distributed computing, each with its own advantages and use cases.

CO3 Utilize appropriate models and frameworks for specific workloads in parallel and distributed computing environments

Synchronous Computation/Communication

In synchronous distributed computing:

1. Computation: All processes or nodes in the system operate in lockstep, progressing through their tasks in a coordinated manner (see the barrier sketch after this list).

2. Communication: Processes communicate with each other at predefined synchronization points, often waiting for messages from other processes before proceeding.

Advantages:

• Simplifies coordination and synchronization between processes.
• Easier to reason about system behavior and correctness.
• Well-suited for applications with tightly coupled components that require deterministic behavior.

Use Cases:

• Parallel algorithms where tasks are tightly coordinated, such as matrix multiplication.
• Real-time systems where strict timing constraints must be met.
• Consensus protocols like Paxos and Raft used in distributed consensus.
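To illustrate the lockstep idea on a single machine, the sketch below uses a thread barrier so that no worker starts step n+1 until every worker has finished step n; a distributed system would achieve the same effect with a message-based barrier (for example, MPI's barrier primitive). The worker count and step count are illustrative.

```python
import threading

N_WORKERS = 3
barrier = threading.Barrier(N_WORKERS)  # all workers must arrive before any proceeds

def worker(wid):
    for step in range(2):
        print(f"worker {wid}: computing step {step}")
        barrier.wait()                   # synchronization point: wait for everyone
        print(f"worker {wid}: all workers finished step {step}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```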


Asynchronous Computation/Communication
In asynchronous distributed computing:
1. Computation: Processes operate independently of each other, progressing
through tasks at their own pace without strict coordination.
2. Communication: Processes communicate with each other asynchronously,
sending and receiving messages independently of each other's execution.
Advantages:
• Offers greater flexibility and scalability, as processes can operate independently
and asynchronously.
• Tolerates varying latencies and failures more gracefully than synchronous
systems.
• Well-suited for loosely coupled systems and applications with unpredictable
workloads.
Use Cases:
• Web services and microservices architectures where components interact
asynchronously.
• Large-scale distributed systems where components are geographically
distributed and latency varies.
• Event-driven architectures where events trigger asynchronous processing.

Comparison

• Complexity: Synchronous systems are often simpler to design and reason about
due to their deterministic behavior, while asynchronous systems require careful
consideration of concurrency and potential race conditions.

• Performance: Synchronous systems may suffer from performance bottlenecks and increased latency due to the need for coordination and synchronization, whereas asynchronous systems can offer better scalability and responsiveness.

• Fault Tolerance: Asynchronous systems are often more resilient to failures and
network partitions, as they can continue operating independently even if some
components fail or become unreachable.

Hybrid Approaches

In practice, many distributed systems employ hybrid approaches that combine elements
of both synchronous and asynchronous computation and communication. For example,
systems may use asynchronous messaging for communication between loosely coupled
components while coordinating certain tasks synchronously when necessary.

CO4 Utilize appropriate models and frameworks for specific workloads in parallel
and distributed computing environments

Concurrency Control Mechanisms in Distributed Systems


Concurrency is the ability of a system to execute multiple tasks simultaneously.
There are two main mechanisms for concurrency control.
• Optimistic Concurrency Control (OCC):
OCC is a concurrency control mechanism that allows concurrent execution of
transactions without acquiring locks upfront. It assumes that conflicts between
transactions are infrequent, and transactions proceed optimistically. During the commit
phase, conflicts are detected, and if conflicts occur, appropriate actions such as aborting
and retrying the transaction are taken.
In a distributed system, OCC can be implemented by maintaining version information
for each data item. Each transaction reads a consistent snapshot of the database at the
beginning, and during the commit phase, it checks if any other transaction has modified
the same data items it has read. If conflicts are detected, the transaction is rolled back
and retried with a new snapshot.
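A minimal sketch of this validate-at-commit idea, assuming a single-process store with per-key version numbers; the VersionedStore class and its methods are hypothetical, not a real database API.

```python
import threading

class VersionedStore:
    """Toy OCC store: a commit succeeds only if every version read is current."""
    def __init__(self):
        self._data = {}                # key -> (value, version)
        self._lock = threading.Lock()  # serializes commit validation only

    def read(self, key):
        return self._data.get(key, (None, 0))  # (value, version)

    def commit(self, reads, writes):
        # reads: {key: version_seen_at_read_time}, writes: {key: new_value}
        with self._lock:
            for key, seen in reads.items():
                if self._data.get(key, (None, 0))[1] != seen:
                    return False       # conflict detected: caller must retry
            for key, value in writes.items():
                version = self._data.get(key, (None, 0))[1]
                self._data[key] = (value, version + 1)
            return True                # validated and applied atomically

store = VersionedStore()
value, version = store.read("balance")
ok = store.commit({"balance": version}, {"balance": 100})
print("committed" if ok else "conflict: retry with a fresh snapshot")
```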
• Pessimistic Concurrency Control (PCC):

PCC is a concurrency control mechanism that assumes conflicts are likely to occur and
takes a pessimistic approach by acquiring locks on resources upfront to prevent
conflicts. It ensures that transactions acquire exclusive access to resources, preventing
other transactions from modifying or accessing them until the locks are released.

In a distributed system, PCC can be implemented by using distributed locks or lock managers. When a transaction wants to access a resource, it requests a lock on that resource from the lock manager. If the lock is available, it is granted, and the transaction proceeds. If the lock is not available, the transaction waits until the lock is released.
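A correspondingly minimal PCC sketch, using an in-process lock manager that grants exclusive locks up front; a real distributed lock manager would coordinate over the network, and the LockManager class here is hypothetical.

```python
import threading

class LockManager:
    """Toy lock manager: exclusive lock per resource; others block until release."""
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, resource):
        with self._guard:
            return self._locks.setdefault(resource, threading.Lock())

    def acquire(self, resource, timeout=5.0):
        return self._lock_for(resource).acquire(timeout=timeout)

    def release(self, resource):
        self._lock_for(resource).release()

lm = LockManager()
if lm.acquire("row:42"):       # blocks if another transaction holds the lock
    try:
        pass                   # read/modify row 42 with exclusive access
    finally:
        lm.release("row:42")   # always release so others can proceed
else:
    print("timed out waiting for lock; abort or retry the transaction")
```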

Concurrency Control Mechanisms

Optimistic Concurrency Control (OCC):

1. Snapshot Isolation — Snapshot Isolation ensures that each transaction sees a consistent snapshot of the database at the start of the transaction. MVCC and timestamp-ordering methods help achieve snapshot isolation.

2. MVCC — Multi-Version Concurrency Control maintains multiple versions of data and allows transactions to proceed without acquiring locks upfront. Example: In a banking system, multiple users can concurrently transfer funds between accounts without blocking each other. Each transaction operates on its own version of the account balances, ensuring consistency upon commit.

3. Timestamp Ordering — Assigns unique timestamps to transactions and enforces a total order on their execution. Example: In a distributed system, transactions for processing customer orders are assigned timestamps. The system ensures that order processing follows the order of timestamps to prevent conflicts and maintain consistency.

4. CRDT (Conflict-Free Replicated Data Type) is a distributed data structure that enables concurrent updates in a distributed system without the need for centralized coordination or consensus algorithms. CRDTs are designed to handle conflicts that may arise when multiple users concurrently modify the same piece of data. One common use case for CRDTs is collaborative real-time editing applications, where multiple users can simultaneously edit a shared document; a minimal CRDT sketch follows below.
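As a sketch of the CRDT idea, here is a grow-only counter (G-Counter), one of the simplest CRDTs: each replica increments only its own slot, and merging takes the element-wise maximum, so all replicas converge regardless of message order. Class and variable names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: merge is element-wise max, so replicas converge."""
    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.counts = [0] * n_nodes      # one slot per replica

    def increment(self):
        self.counts[self.node_id] += 1   # each replica touches only its own slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas update concurrently, then exchange and merge state.
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3       # both replicas converge to the same total
```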

Pessimistic Concurrency Control (PCC):

1. Two-Phase Locking (2PL) — Acquires locks on data resources upfront and releases them at the end of the transaction. Example: In a shared database, when a user wants to update a specific row of data, 2PL ensures that other users cannot access or modify the same row until the lock is released, preventing conflicts.

2. Strict Two-Phase Locking (Strict 2PL) — A variant of 2PL where all locks
acquired during a transaction are held until the transaction is committed or rolled back.
Example: In a distributed database, a transaction locks all the necessary resources (e.g.,
tables, rows) at the beginning and holds the locks until the transaction is completed,
ensuring no other transactions can access or modify the locked resources.

3. Multiple Granularity Locking — Allows acquiring locks at various levels of granularity, such as table level, page level, or row level. Example: In a database system, a transaction can acquire a lock at the row level for a specific record it wants to update, preventing other transactions from modifying the same record but allowing concurrent access to other records in the table.

4. Distributed Lock Manager (DLM) — A distributed file system provides access to files across multiple nodes in a network; a Distributed Lock Manager coordinates access to shared files to prevent conflicts. For example, in a distributed file storage system, the DLM ensures that only one client holds an exclusive lock on a file at a time, avoiding data corruption or inconsistencies caused by concurrent modifications.

The choice between OCC and PCC depends on factors such as workload characteristics, contention level, and the desired level of concurrency and performance. OCC is often favoured when conflicts are expected to be infrequent, allowing for greater concurrency, while PCC is preferred when conflicts are anticipated to be frequent, at the cost of potentially more locking and blocking. For instance, an e-commerce solution may opt for OCC under normal conditions and switch to PCC when there is a burst in demand for an item on sale (a "hot" SKU), or use PCC only when inventory for an item reaches a certain low threshold.

Fault tolerance in distributed systems

Fault tolerance in distributed systems is a critical feature designed to ensure that a system continues to operate correctly even when some of its components fail. Distributed systems, which consist of multiple interconnected computers working together, are inherently complex and prone to various types of failures. Fault tolerance mechanisms help maintain system reliability, availability, and data integrity despite these failures.

Here are key concepts and techniques involved in achieving fault tolerance in
distributed systems:

1. Redundancy

Replication: Storing copies of data or services on multiple nodes. If one node fails, the
system can retrieve data or services from another node.

Backup: Regularly creating copies of data that can be restored in case of failure.

2. Consensus Algorithms

Paxos: A protocol for achieving consensus among distributed nodes, ensuring agreement on a single value even in the presence of failures.

Raft: A more understandable alternative to Paxos, designed to ensure a distributed system agrees on state changes.
3. Failover and Recovery
Failover: Automatically switching to a standby component upon the failure of the active
component.
Recovery: Restoring a failed component to normal operation, often involving restarting
services or reloading data from backups.
4. Fault Detection
Heartbeats: Regular signals sent between nodes to indicate they are operational. If a
node fails to send a heartbeat, it is assumed to be down.
Watchdogs: Processes that monitor the health of other processes and take corrective
actions if necessary.
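A minimal heartbeat-based failure detector might look like the sketch below, where a node is suspected as down after a period of silence. The timeout value and node names are illustrative assumptions.

```python
import time

FAILURE_TIMEOUT = 3.0  # seconds of silence before a node is suspected down

last_seen = {}         # node_id -> time of last heartbeat received

def record_heartbeat(node_id):
    last_seen[node_id] = time.monotonic()

def suspected_down():
    now = time.monotonic()
    return [n for n, t in last_seen.items() if now - t > FAILURE_TIMEOUT]

# Simulation: node "a" keeps sending heartbeats, node "b" goes silent.
record_heartbeat("a")
record_heartbeat("b")
time.sleep(3.5)
record_heartbeat("a")
print("suspected down:", suspected_down())  # -> ['b']
```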
5. Load Balancing
Distributing tasks and workloads across multiple nodes to prevent overloading any
single node and to provide redundancy.
6. Partition Tolerance
Ensuring the system can continue to operate even when network partitions occur,
isolating some nodes from others.
7. Data Consistency Models
Eventual Consistency: Ensuring that, given enough time, all nodes will converge to the
same state, even though they may be temporarily inconsistent.
Strong Consistency: Ensuring that all nodes see the same data at the same time, often at
the cost of performance.
8. Self-Healing
Automatic detection and correction of faults, such as restarting failed nodes or
redistributing data.
9. Atomic Operations
Ensuring that operations are completed entirely or not at all, preventing partial updates
that could lead to inconsistencies.
10. Quorums
Using a subset of nodes (a quorum) to agree on operations or state changes, ensuring
system reliability even if some nodes fail.
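The standard quorum arithmetic can be checked in a few lines: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap (so a read sees the latest committed write) whenever R + W > N. A quick sketch:

```python
def quorums_overlap(r, w, n):
    # R + W > N guarantees every read quorum intersects every write quorum,
    # so at least one replica in any read holds the latest committed write.
    return r + w > n

N = 3
print(quorums_overlap(2, 2, N))  # True: a common setting for 3 replicas
print(quorums_overlap(1, 1, N))  # False: a read may miss recent writes
```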
Examples of Fault Tolerance Techniques in Real Systems
Google File System (GFS): Uses data replication and master-slave architecture to
handle failures.
Amazon DynamoDB: Employs consistent hashing, replication, and quorum-based
protocols for fault tolerance.
Apache Cassandra: Uses peer-to-peer replication, gossip-based failure detection, and tunable consistency levels to tolerate node failures.
CO5 Devise effective solutions to address challenges and problems encountered in
parallel and distributed computing scenarios.

OpenMPI (Open Message Passing Interface)

OpenMPI (Open Message Passing Interface) is an open-source Message Passing Interface implementation that is designed for high performance on both large-scale and small-scale parallel systems. It is widely used in high-performance computing (HPC) environments for running parallel applications. OpenMPI provides a set of libraries and tools that enable communication between processes in a distributed system.

Key Features of OpenMPI

• High Performance: Optimized for low-latency and high-throughput communication.
• Portability: Runs on a wide variety of platforms, including various Unix flavors,
Linux, and Windows.
• Flexibility: Supports a range of communication networks such as Ethernet,
InfiniBand, Myrinet, and more.
• Scalability: Suitable for both small clusters and large supercomputers.
• Fault Tolerance: Includes features to help applications recover from failures.

Components of OpenMPI

MPI Libraries: Core libraries that provide the MPI standard functions.

Runtime Environment: Manages the execution of MPI applications, including process launching and communication setup.

Resource Managers: Integrates with job schedulers and resource managers like Slurm,
PBS, and Torque.

Utilities: Tools for compiling, running, and debugging MPI applications.
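A minimal MPI program, sketched here with the mpi4py Python bindings (assuming mpi4py is installed on top of an Open MPI installation); the same scatter/gather pattern exists in the C API. It could be launched with, for example, mpirun -np 4 python demo.py.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the communicator
size = comm.Get_size()   # total number of processes launched by mpirun

if rank == 0:
    chunks = [i * i for i in range(size)]  # root prepares one item per rank
else:
    chunks = None

item = comm.scatter(chunks, root=0)        # distribute one item to each rank
result = comm.gather(item + rank, root=0)  # collect partial results at root

if rank == 0:
    print("gathered:", result)
```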

Clusters play a critical role in distributed systems and high-performance computing (HPC) by aggregating multiple computers (nodes) to work together as a single system. Clusters provide several advantages, including improved performance, fault tolerance, and scalability. Here are some key aspects of clusters and their roles in distributed systems:

Key Roles and Benefits of Clusters


High Performance and Parallel Processing
Clusters enable parallel processing by dividing a task into smaller sub-tasks that run
concurrently on multiple nodes. This can significantly reduce the time required to
perform large computations.
High Performance Computing (HPC) clusters are widely used in scientific research,
simulations, data analysis, and other compute-intensive tasks.
Scalability
Clusters can scale horizontally by adding more nodes to handle increased workloads.
This scalability allows clusters to grow in capacity and performance as needed.
Horizontal scalability is more cost-effective compared to vertical scaling (increasing
the power of a single machine).
Fault Tolerance and High Availability
Clusters provide redundancy, ensuring that the failure of one or more nodes does not
bring down the entire system. Workloads can be redistributed to functioning nodes.
High Availability (HA) clusters ensure continuous operation of services by providing
failover mechanisms. When a node fails, another node can take over its tasks seamlessly.
Resource Sharing and Load Balancing
Clusters facilitate resource sharing, allowing multiple users or applications to utilize
shared computing resources efficiently.
Load balancing distributes workloads evenly across the nodes in a cluster, optimizing
resource use and preventing any single node from being overloaded.
Cost-Effectiveness
Building clusters using commodity hardware can be more cost-effective than investing
in expensive, high-end servers.
Clusters can be built incrementally, allowing organizations to spread out the cost over
time and upgrade hardware as needed.
Types of Clusters
Beowulf Clusters

These are typically built using commodity hardware and open-source software. They
are cost-effective solutions for parallel computing and are often used in academic and
research environments.
Load Balancing Clusters

These clusters focus on distributing incoming network traffic or workloads across multiple nodes to ensure no single node is overwhelmed. Common in web hosting and database management.
database management.
High Availability Clusters

Also known as failover clusters, these ensure that services remain available even when
one or more nodes fail. They are crucial for critical applications that require continuous
uptime.
Grid Computing Clusters

Grid computing involves loosely coupled clusters that work together on a common task
but are geographically dispersed. They leverage resources from multiple administrative
domains.
HPC Clusters

High-performance computing clusters are designed for intensive computational tasks. They use specialized hardware and software optimizations to achieve maximum performance.

Definition of Clusters
A cluster in computing is a group of interconnected computers that work together as a
single system. These computers, often referred to as nodes, collaborate to perform tasks,
share resources, and provide redundancy. Clusters are used to enhance performance,
increase availability, and ensure scalability and fault tolerance.
Taxonomy of Clusters
Clusters can be categorized based on various criteria such as their purpose, architecture,
and management. Here is a detailed taxonomy of clusters:
Based on Purpose
1. High Performance Computing (HPC) Clusters
• Designed for computationally intensive tasks.
• Used in scientific research, simulations, and complex calculations.
• Examples: NASA's Pleiades, DOE's Summit.
2. High Availability (HA) Clusters
• Ensure continuous operation by providing failover capabilities.
• Critical for applications that require minimal downtime.
• Examples: Financial services, e-commerce platforms.
3. Load Balancing Clusters
• Distribute workloads across multiple nodes to optimize resource use and
avoid overload.
• Common in web services and databases.
• Examples: Web servers, application servers.
4. Grid Computing Clusters
• Combine resources from multiple locations to work on a common task.
• Often geographically distributed and managed by different organizations.
• Examples: SETI@home, CERN's LHC Computing Grid.
5. Storage Clusters
• Focus on providing scalable and reliable data storage.
• Ensure data redundancy and quick access.
• Examples: Amazon S3, Google File System.
Based on Architecture
1. Homogeneous Clusters
• All nodes have similar or identical hardware and software configurations.
• Easier to manage and maintain.
• Examples: Beowulf clusters.
2. Heterogeneous Clusters
• Nodes have different hardware and software configurations.
• Flexible and can utilize a variety of resources.
• Examples: Computational grids.
Based on Management
1. Centralized Management
• Managed by a single entity or organization.
• Simplifies administration and resource allocation.
• Examples: Corporate data centers, university research clusters.
2. Decentralized Management
• Managed by multiple entities.
• Often found in grid computing and volunteer computing projects.
• Examples: BOINC projects, federated cloud services.
Based on Deployment
1. On-Premises Clusters
• Physically located and managed within an organization's own facilities.
• Provides control over hardware and security.
• Examples: Private data centers, university labs.
2. Cloud-Based Clusters
• Deployed and managed in the cloud.
• Offers flexibility, scalability, and often cost savings.
• Examples: AWS EC2 clusters, Google Cloud Kubernetes Engine.
3. Hybrid Clusters
• Combine on-premises and cloud-based resources.
• Allow for bursting into the cloud during peak demand.
• Examples: Enterprises with both local and cloud resources.
Examples of Cluster Implementations
1. Beowulf Clusters
• Built using commodity hardware and open-source software.
• Example: A university research cluster using Linux and inexpensive PCs.
2. Apache Hadoop Clusters
• Designed for big data processing.
• Uses HDFS for distributed storage and MapReduce for distributed
processing.
• Example: Data analytics platforms used by large enterprises.
3. Kubernetes Clusters
• Orchestrate containerized applications.
• Provide automated deployment, scaling, and management.
• Example: Cloud-native applications running on Google Kubernetes
Engine (GKE).
4. OpenMPI Clusters
• Facilitate communication in parallel computing environments.
• Example: Scientific simulations using MPI for message passing.
5. SLURM (Simple Linux Utility for Resource Management) Clusters
• Job scheduling and workload management.
• Example: Managing job queues and resources in an HPC environment.

CO4 Utilize appropriate models and frameworks for specific workloads in parallel
and distributed computing environments

Distributed Computing Limitations


Distributed systems, while offering numerous advantages such as scalability, fault
tolerance, and resource sharing, also come with several limitations and challenges.
Understanding these limitations is crucial for designing, implementing, and maintaining
efficient and reliable distributed systems. Here are some key limitations:
1. Complexity
• Design and Implementation: Distributed systems are inherently more complex
to design and implement than centralized systems. The need to coordinate and
synchronize multiple nodes adds to the complexity.
• Debugging and Testing: Identifying and fixing bugs can be more difficult in
distributed systems due to their non-deterministic nature and the complexity of
interactions between components.
2. Network Issues
• Latency: Communication between nodes can introduce significant latency,
especially if nodes are geographically dispersed. This can affect the performance
of the system.
• Bandwidth: Limited network bandwidth can become a bottleneck, restricting
the amount of data that can be transferred between nodes.
• Partitioning: Network partitions can isolate nodes, leading to challenges in
maintaining consistency and availability (as highlighted by the CAP theorem).
3. Fault Tolerance
• Consistency vs. Availability: Ensuring data consistency across all nodes while maintaining availability can be challenging. The CAP theorem states that a distributed system cannot simultaneously guarantee all three of consistency, availability, and partition tolerance.
• Partial Failures: Unlike centralized systems, distributed systems can experience
partial failures where some nodes fail while others continue to operate. Handling
these partial failures gracefully is complex.
4. Security
• Data Security: Ensuring data security across multiple nodes involves complex
encryption, authentication, and authorization mechanisms.
• Attack Surface: The distributed nature of these systems increases the attack
surface, making them more vulnerable to security threats such as distributed
denial-of-service (DDoS) attacks.
5. Data Management
• Data Consistency: Maintaining consistent data across multiple nodes is
difficult, particularly in real-time applications.
• Data Replication: Efficiently replicating data and ensuring synchronization
without significant overhead is challenging.
6. Coordination and Synchronization
• Concurrency: Managing concurrent access to shared resources can lead to
issues such as deadlocks and race conditions.
• Clock Synchronization: Ensuring that all nodes have a consistent view of time
is difficult, yet critical for certain applications.
7. Resource Management
• Load Balancing: Efficiently distributing workloads to avoid overloading any
single node while maximizing resource utilization can be complex.
• Heterogeneity: Managing resources across nodes with different performance
characteristics, operating systems, and configurations adds to the complexity.
8. Scalability Challenges
• Scaling Complexity: While distributed systems are designed to scale, ensuring
that the system scales efficiently as the number of nodes increases is not trivial.
• Overhead: The overhead associated with managing the distributed aspects (e.g.,
coordination, fault tolerance mechanisms) can limit scalability.
9. Interoperability
• Compatibility: Ensuring compatibility and seamless communication between
different systems, technologies, and protocols can be a significant challenge.
10. Economic Costs
• Infrastructure Costs: The cost of setting up and maintaining a distributed
infrastructure can be high.
• Operational Costs: Continuous monitoring, maintenance, and management of
distributed systems require substantial operational efforts and resources.
Examples in Real-World Systems
• Google File System (GFS): While GFS provides high fault tolerance and
performance, it faces challenges related to maintaining consistency across
replicated data and handling network partitions.
• Amazon Web Services (AWS): AWS provides scalable and reliable services but
dealing with latency, network issues, and ensuring data security across its
distributed infrastructure remains complex.
• Apache Hadoop: Hadoop is designed for distributed data processing but
managing data consistency, handling node failures, and dealing with the
complexity of MapReduce programming are significant challenges.
Cluster Computing Architecture:
• A cluster is designed as an array of interconnected individual computers operating collectively as a single standalone system.
• It is a group of workstations or computers working together as a single, integrated computing resource connected via high-speed interconnects.
• A node is either a single-processor or multiprocessor system with memory, input/output functions, and an operating system.
• Two or more nodes may be connected on a single line, or every node may be connected individually through a LAN connection.
[Figure: Cluster Computing Architecture]

Components of a Cluster Computer:

1. Cluster Nodes
2. Cluster Operating System
3. The switch or node interconnect
4. Network switching hardware

[Figure: Cluster Components]
Advantages of Cluster Computing:

1. High Performance:
The systems offer better and enhanced performance than mainframe computer networks.
2. Easy to manage:
Cluster computing is manageable and easy to implement.
3. Scalable:
Resources can be added to the clusters as needed.
4. Expandability:
Computer clusters can be expanded easily by adding additional computers to the network. Cluster computing can combine additional resources or networks with the existing computer system.
5. Availability:
If one node fails, the other nodes remain active and act as a proxy for the failed node, ensuring enhanced availability.
6. Flexibility:
Nodes can be upgraded to higher specifications, or additional nodes can be added.
Disadvantages of Cluster Computing:

1. High cost:
Cluster computing is not very cost-effective due to its expensive hardware and design.
2. Difficulty in fault finding:
It is difficult to determine which component has a fault.
3. More space is needed:
Infrastructure requirements grow, since more servers are needed for management and monitoring.

Applications of Cluster Computing:
• Various complex computational problems can be solved.
• It can be used in applications such as aerodynamics, astrophysics, and data mining.
• Weather forecasting.
• Image rendering.
• Various e-commerce applications.
• Earthquake simulation.
• Petroleum reservoir simulation.
CO4 Utilize appropriate models and frameworks for specific workloads in parallel
and distributed computing environments

Design Decisions
Designing a cluster system involves several critical decisions that can significantly
impact the system's performance, scalability, reliability, and cost. These decisions cover
a wide range of aspects, from hardware selection to software configuration and network
design. Here are some key design decisions to consider when building a cluster system:
1. Purpose and Use Case
• Define Objectives: Clearly define the primary goals of the cluster, such as high
performance computing (HPC), high availability (HA), load balancing, or big
data processing.
• Workload Characteristics: Understand the types of workloads the cluster will
handle (e.g., computational tasks, data-intensive applications) to tailor the design
accordingly.
2. Node Hardware Configuration
• Node Specifications: Choose the CPU, memory, storage, and network
capabilities of each node based on the anticipated workload requirements.
• Homogeneous vs. Heterogeneous: Decide whether to use homogeneous nodes
(identical hardware) for simplicity and predictability, or heterogeneous nodes
(varied hardware) for flexibility and potentially better resource utilization.
• Scalability: Ensure the hardware is scalable to allow for future expansion
without significant redesign.
3. Network Architecture
• Network Topology: Select an appropriate network topology (e.g., star, mesh,
tree) based on performance, fault tolerance, and cost considerations.
• Interconnect Technology: Choose high-speed interconnects like InfiniBand,
Ethernet, or specialized HPC networks to minimize latency and maximize
throughput.
• Redundancy and Failover: Design the network with redundancy to handle
failures gracefully and maintain connectivity.
4. Storage Solutions
• Storage Type: Decide between local storage, shared storage, or a combination.
Local storage is faster for node-specific data, while shared storage is essential
for data that needs to be accessible across the cluster.
• Distributed File Systems: Implement distributed file systems (e.g., HDFS,
GlusterFS, Ceph) to ensure data redundancy, scalability, and availability.
• Data Management: Plan for efficient data distribution, replication, and access
patterns to minimize bottlenecks and ensure data integrity.
5. Software Stack
• Operating System: Select an operating system that is optimized for cluster
environments, typically a variant of Linux.
• Middleware: Choose middleware that supports resource management, job
scheduling, and communication (e.g., Slurm, Torque, OpenMPI).
• Application Frameworks: Incorporate application frameworks suited for the
workload, such as Apache Hadoop or Spark for big data, or TensorFlow for
machine learning.
6. Resource Management and Scheduling
• Scheduler: Implement a job scheduler (e.g., Slurm, Kubernetes) to manage job
queues, allocate resources, and optimize workload distribution.
• Resource Allocation: Develop policies for resource allocation, prioritizing tasks
based on urgency, resource requirements, and user priorities.
7. Fault Tolerance and High Availability
• Redundancy: Design the system with redundant components (e.g., power
supplies, network paths) to ensure high availability.
• Failover Mechanisms: Implement failover mechanisms to automatically handle
node or component failures without significant disruption.
• Monitoring and Alerts: Set up monitoring tools (e.g., Nagios, Prometheus) to
continuously track the health of the cluster and alert administrators to potential
issues.
8. Security
• Access Control: Implement robust authentication and authorization mechanisms
to ensure that only authorized users can access the cluster.
• Data Security: Use encryption for data in transit and at rest to protect sensitive
information.
• Network Security: Employ firewalls, VPNs, and other network security
measures to protect against unauthorized access and attacks.
9. Scalability and Flexibility
• Modular Design: Design the cluster to be modular, allowing for easy addition
or removal of nodes without significant reconfiguration.
• Elasticity: Consider using cloud-based resources or hybrid cloud approaches to
dynamically scale the cluster based on workload demands.
10. Cost Considerations
• Initial Investment: Balance the cost of high-performance hardware and network
components with the budget constraints.
• Operational Costs: Factor in the ongoing costs of power, cooling, maintenance,
and administration.
• Total Cost of Ownership (TCO): Evaluate the long-term costs and benefits,
including potential savings from increased efficiency and productivity.
11. Performance Optimization
• Load Balancing: Implement load balancing strategies to ensure even
distribution of workloads and avoid bottlenecks.
• Performance Tuning: Continuously monitor and tune the system for optimal
performance, adjusting configurations as needed based on real-world usage
patterns.
Example Scenario: Designing an HPC Cluster
1. Define Objectives
• Goal: Perform large-scale scientific simulations.
• Workloads: CPU-intensive with occasional GPU acceleration.
2. Node Hardware Configuration
• Compute Nodes: High-end CPUs, 128GB RAM, SSDs for local storage,
optional GPUs.
• Head Node: High-performance CPU, 256GB RAM, large SSD.
3. Network Architecture
• Topology: Fat-tree topology for low-latency, high-bandwidth
communication.
• Interconnect: InfiniBand for inter-node communication, Ethernet for
management traffic.
4. Storage Solutions
• Local Storage: SSDs on compute nodes for scratch space.
• Shared Storage: Lustre file system for large data sets and project files.
5. Software Stack
• OS: CentOS or Ubuntu.
• Middleware: Slurm for resource management and scheduling, OpenMPI
for inter-process communication.
• Application Frameworks: Libraries and tools for scientific computing
(e.g., MATLAB, TensorFlow).
6. Resource Management and Scheduling
• Scheduler: Slurm configured with fair-share scheduling and job
prioritization.
• Resource Allocation: Policies for CPU/GPU time, memory usage based
on project and user priorities.
7. Fault Tolerance and High Availability
• Redundancy: Dual power supplies, redundant network paths.
• Failover: Hot-swappable components, automatic failover configurations.
• Monitoring: Nagios for health checks, alerting system administrators of
issues.
8. Security
• Access Control: LDAP integration for user management, role-based
access controls.
• Data Security: SSL/TLS for data in transit, encryption for sensitive data.
• Network Security: Firewalls, VPN for remote access.
9. Scalability and Flexibility
• Modular Design: Easy addition of new compute nodes.
• Elasticity: Hybrid cloud setup with AWS for peak load handling.
10. Cost Considerations
• Initial Investment: High upfront cost for hardware and infrastructure.
• Operational Costs: Budgeting for power, cooling, maintenance.
• TCO: Evaluated over 5 years, considering hardware lifecycle and performance
gains.
11. Performance Optimization
• Load Balancing: Slurm configuration for even workload distribution.
• Performance Tuning: Regular benchmarking and tuning based on usage patterns.

CO1 Comprehend the concept of distributed memory in computing and recognize its significance in parallel and distributed systems.
Network Hardware and Software of Cluster-Based Systems
The network is a critical component of a cluster-based system, as it facilitates
communication between nodes and ensures data transfer and synchronization. Both
hardware and software aspects of the network need to be carefully chosen and
configured to meet the performance, reliability, and scalability requirements of the
system.
Network Hardware
1. Network Interface Cards (NICs)
• Ethernet NICs: Commonly used for general-purpose clusters. Gigabit
Ethernet (1 Gbps) and 10 Gigabit Ethernet (10 Gbps) are standard, with
25 Gbps, 40 Gbps, and 100 Gbps options available for higher performance
needs.
• InfiniBand NICs: Preferred for high-performance computing (HPC)
clusters due to their low latency and high throughput capabilities. Speeds
can range from 40 Gbps to 200 Gbps.
• Fibre Channel NICs: Used in storage area networks (SANs) for high-speed data transfer, typically in data storage clusters.
2. Switches
• Ethernet Switches: Managed or unmanaged switches with varying port
counts and speeds (1 Gbps, 10 Gbps, etc.). Managed switches allow for
better control and configuration of network traffic.
• InfiniBand Switches: Provide high-speed, low-latency connectivity
between nodes in HPC clusters. Examples include Mellanox and Cisco
InfiniBand switches.
• Top-of-Rack (ToR) Switches: Placed at the top of server racks to connect
the nodes within the rack to the network backbone.
3. Cables
• Ethernet Cables: Cat5e, Cat6, Cat6a, Cat7, and Cat8 cables, with
increasing levels of performance and shielding to reduce interference.
• InfiniBand Cables: Direct attach copper (DAC) cables and fiber optic
cables, providing high-speed connections with minimal latency.
4. Routers
• Used to connect different network segments and direct traffic between
clusters, data centers, or external networks.
5. Firewalls
• Protect the cluster from unauthorized access and cyber threats by
controlling incoming and outgoing network traffic based on security
rules.
6. Load Balancers
• Distribute incoming network traffic across multiple nodes to ensure no
single node is overwhelmed, improving availability and performance.
Network Software
1. Network Operating Systems
• Switch OS: Software running on network switches (e.g., Cisco IOS,
Juniper Junos) to manage network traffic and configurations.
• Router OS: Software for routers (e.g., Cisco IOS, MikroTik RouterOS)
to manage routing protocols and traffic flow.
2. Network Management Software
• Monitoring Tools: Nagios, Zabbix, Prometheus for monitoring network
health, performance, and alerting administrators to issues.
• Configuration Management: Tools like Ansible, Puppet, and Chef
automate the configuration and management of network devices.
3. Communication Libraries
• MPI (Message Passing Interface): Libraries like OpenMPI and MPICH
facilitate communication between processes running on different nodes,
essential for parallel computing tasks.
• RDMA (Remote Direct Memory Access): Allows high-throughput, low-
latency networking by enabling direct memory access from one computer
to another without involving the CPU, used with InfiniBand.
4. File Transfer Protocols
• NFS (Network File System): Enables nodes to access files over a
network as if they were on local storage, commonly used for shared
storage in clusters.
• SMB/CIFS (Server Message Block/Common Internet File System):
File sharing protocols that allow applications to read and write to files and
request services from server programs in a computer network.
5. Security Software
• Firewalls: Software firewalls (e.g., iptables, firewalld) manage and filter
network traffic to enhance security.
• VPN (Virtual Private Network): Provides secure connections between
nodes and remote users, ensuring data privacy and integrity.
6. Load Balancing Software
• HAProxy: Open-source software for TCP/HTTP load balancing.
• Nginx: Can be used as a reverse proxy and load balancer for HTTP and
other protocols.
• Apache Traffic Server: Used for caching, load balancing, and serving
web content.
Core Components of HPC Architecture
In an HPC architecture, a group of computers (nodes) collaborates on shared tasks. Each
node in this structure accepts and processes tasks and computations independently. The
nodes coordinate and synchronize execution tasks, ultimately producing a combined
result.
The HPC architecture has mandatory and optional components.
Mandatory Components
The compute, storage, and network components are the basis of an HPC architecture.
The following sections elaborate on each component.
Compute
The compute component is dedicated to processing data, executing software
or algorithms, and solving problems. Compute consists of computer clusters called
nodes. Each node has processors, local memory, and other storage that collaboratively
perform computations. The common types include:
• Headnode or login nodes. Entry points where users log in to access the cluster.
• Regular compute nodes. Locations for executing computational tasks.
• Specialized data management nodes. Methods for efficient data transfer within
the cluster.
• Fat compute nodes. Handlers for memory-intensive tasks, as they have large
memory capacity, typically exceeding 1TB.
Storage
The high-performance computing storage component stores and retrieves data
generated and processed by the computing component.
HPC storage types are:
• Physical. Traditional HPC systems often use physical, on-premises storage. On-premises storage enables the inclusion of high-performance parallel file systems, storage area networks (SANs), or network-attached storage (NAS) systems. Physical storage is directly connected to the HPC infrastructure, providing low-latency access to data for compute nodes within the local environment.
• Cloud Storage. Cloud-based HPC storage solutions offer scalability and
flexibility. In contrast to traditional external storage, typically the slowest
computer system component, cloud storage within an HPC system operates at a
high speed.
• Hybrid. A hybrid HPC storage solution combines both on-premises physical
storage and cloud storage to create a flexible and scalable infrastructure. This
approach allows organizations to address specific requirements, optimize costs,
and achieve a balance between on-site control and the scalability offered by the
cloud.
Network
The HPC network component enables communication and data exchange among the
various nodes within the HPC system.
HPC networks focus on achieving high bandwidth and low latency. Different
technologies, topologies, and optimization strategies are utilized to support the rapid
transfer of large volumes of data between nodes.
HPC Scheduler
Task requests from the headnode are directed to the scheduler.
A scheduler is a vital HPC component. This utility monitors available resources and
allocates requests across the nodes to optimize throughput and efficiency.
The job scheduler balances workload distribution and ensures nodes are not overloaded
or underutilized.
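The load-balancing role of the scheduler can be sketched with a greedy least-loaded policy: each incoming job goes to the node with the smallest current load. Real schedulers such as Slurm add queues, priorities, and reservations on top of this; the job names and costs below are made up for illustration.

```python
import heapq

def assign_jobs(jobs, n_nodes):
    """Greedy least-loaded placement: each job goes to the lightest node."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    placement = {}
    for job, cost in jobs:
        load, node = heapq.heappop(heap)           # lightest node right now
        placement[job] = node
        heapq.heappush(heap, (load + cost, node))  # account for the new job
    return placement

jobs = [("simulation", 4), ("etl", 2), ("training", 3), ("render", 1)]
print(assign_jobs(jobs, n_nodes=2))  # e.g. {'simulation': 0, 'etl': 1, ...}
```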
Optional Components
Optional components in HPC environments are based on specific
requirements, applications, and budget considerations. Optional components
organizations choose to include in their HPC setups are:
• GPU-Accelerated systems. Boost computations for tasks that can be
parallelized on both CPU cores and Graphics Processing Units (GPUs), such as
simulations, machine learning, and scientific modeling. GPU acceleration
operates in the background, facilitating large-scale processing within the broader
system.
• Data management software. Systems that handle data storage, retrieval,
organization, and movement within an HPC environment. Data management
programs optimize system resource management according to specific needs.
• InfiniBand switch. Connects and facilitates communication between all nodes
in the cluster.
• Facilities and power. Physical space required to accommodate HPC.
• FPGAs (Field-Programmable Gate Arrays). Customizable hardware
acceleration is used in environments where highly efficient and low-latency
processing is essential.
• High-performance storage accelerators. Parallel file systems or high-speed
storage controllers enhance data access and transfer speeds.
• Remote visualization nodes. Help maintain computational efficiency when data
visualization is critical to HPC workflows. The nodes offload visualization tasks
from the main compute nodes.
• Energy-efficient components. Energy-efficient processors, memory, and power
supplies minimize the environmental impact and operational costs and
improve data center sustainability.
• Scalable and flexible network fabric. Enhances node communication,
improving overall system performance.
• Advanced security mechanisms. Include hardware-based encryption,
secure boot processes, and intrusion detection systems.

CO4 Utilize appropriate models and frameworks for specific workloads in parallel
and distributed computing environments

Protocols in Cluster-Based Systems

In a cluster-based system, various protocols are employed to facilitate communication, data transfer, synchronization, and overall coordination between nodes. These protocols ensure that the cluster operates efficiently and reliably. Here are some of the key protocols used in cluster-based systems:
Communication Protocols
1. Message Passing Interface (MPI)
• Purpose: MPI is a standardized and portable message-passing system
designed to function on parallel computing architectures.
• Key Features: It provides a set of library routines for parallel processing,
enabling processes to communicate with each other by sending and
receiving messages.
• Implementations: OpenMPI, MPICH.
• Use Cases: Scientific computations, simulations, and any application
requiring high-performance parallel processing.
2. Remote Direct Memory Access (RDMA)
• Purpose: RDMA allows direct memory access from the memory of one
computer into that of another without involving the processor, cache, or
operating system of either computer.
• Key Features: Low latency, high throughput, reduced CPU overhead.
• Protocols: InfiniBand, RoCE (RDMA over Converged Ethernet), iWARP
(Internet Wide Area RDMA Protocol).
• Use Cases: High-performance computing (HPC), databases, storage
networks.
3. Transmission Control Protocol (TCP) / Internet Protocol (IP)
• Purpose: TCP/IP is the fundamental suite of protocols for communication
over the internet and local networks.
• Key Features: Reliable, connection-oriented communication (TCP);
addressing and routing of packets (IP).
• Use Cases: General network communication, web services, data transfer.
4. User Datagram Protocol (UDP)
• Purpose: UDP is a connectionless protocol used for situations where low
latency is crucial and packet loss is acceptable.
• Key Features: Fast, minimal overhead, no guarantee of delivery.
• Use Cases: Real-time applications, streaming, VoIP.
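The TCP/UDP contrast above can be seen in a few lines of socket code: TCP requires a connection and guarantees ordered delivery, while UDP sends stand-alone datagrams with no handshake and no delivery guarantee. A self-contained sketch; the ports and payloads are illustrative.

```python
import socket
import threading

def tcp_echo_once(server):
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))  # echo the request back to the client
    conn.close()

# TCP: connection-oriented, reliable, ordered byte stream.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
server.listen(1)
threading.Thread(target=tcp_echo_once, args=(server,)).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client.sendall(b"status?")
print("TCP reply:", client.recv(1024))  # arrives reliably and in order
client.close()

# UDP: connectionless datagram; fire-and-forget, may be lost or reordered.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"heartbeat", ("127.0.0.1", 9999))  # no handshake, no guarantee
udp.close()
```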
File Transfer and Storage Protocols
1. Network File System (NFS)
• Purpose: NFS allows a user on a client computer to access files over a
network in a manner similar to how local storage is accessed.
• Key Features: File sharing across networked computers, transparent file
access.
• Use Cases: Shared file storage in UNIX/Linux environments.
2. Server Message Block (SMB) / Common Internet File System (CIFS)
• Purpose: SMB/CIFS is a network protocol for providing shared access to
files, printers, and serial ports.
• Key Features: File and resource sharing, network browsing.
• Use Cases: File sharing in Windows environments, cross-platform file
access.
3. Hadoop Distributed File System (HDFS)
• Purpose: HDFS is designed to store large data sets reliably and to stream
those data sets at high bandwidth to user applications.
• Key Features: Fault-tolerance, high throughput, large-scale data
processing.
• Use Cases: Big data applications, Hadoop ecosystems.
Data Synchronization and Coordination Protocols
1. Distributed Lock Manager (DLM)
• Purpose: DLM is used to manage locks in a distributed environment to
ensure that multiple processes do not access shared resources
simultaneously in a conflicting manner.
• Key Features: Locking mechanisms, synchronization, coordination.
• Use Cases: Clustered databases, file systems, resource management.
2. Zookeeper
• Purpose: Zookeeper is a centralized service for maintaining
configuration information, naming, providing distributed
synchronization, and group services.
• Key Features: High availability, consistency, distributed configuration
management.
• Use Cases: Distributed applications, coordination of cluster activities.
Load Balancing and Resource Management Protocols
1. HTTP/HTTPS
• Purpose: Hypertext Transfer Protocol (Secure) is the foundation of data
communication on the World Wide Web.
• Key Features: Stateless, request-response protocol, secure
communication (HTTPS).
• Use Cases: Web services, REST APIs, load balancing HTTP traffic.
2. DNS (Domain Name System)
• Purpose: DNS translates human-readable domain names to IP addresses
required for locating and identifying computer services and devices.
• Key Features: Name resolution, load balancing, fault tolerance.
• Use Cases: Domain name resolution, distributing traffic across multiple
servers.
3. Dynamic Host Configuration Protocol (DHCP)
• Purpose: DHCP automates the assignment of IP addresses, subnet masks,
gateways, and other IP parameters.
• Key Features: Dynamic IP address assignment, reduced administrative
overhead.
• Use Cases: Managing IP addresses in a network, configuring devices.
Security Protocols
1. Secure Shell (SSH)
• Purpose: SSH provides a secure channel over an unsecured network for
command-line interface access and network services.
• Key Features: Encrypted communication, secure remote login, secure
file transfer.
• Use Cases: Remote administration, secure access to cluster nodes, secure
file transfers.
2. Transport Layer Security (TLS) / Secure Sockets Layer (SSL)
• Purpose: TLS/SSL protocols provide secure communication over a
computer network.
• Key Features: Encryption, authentication, data integrity.
• Use Cases: Secure web transactions (HTTPS), secure email (SMTP over
TLS), VPNs.
Monitoring and Management Protocols
1. Simple Network Management Protocol (SNMP)
• Purpose: SNMP is used for collecting and organizing information about
managed devices on IP networks and modifying that information to
change device behavior.
• Key Features: Network management, monitoring, alerts.
• Use Cases: Network device monitoring, performance management.
2. IPMI (Intelligent Platform Management Interface)
• Purpose: IPMI is a set of standardized specifications for hardware-based
platform management systems that allows monitoring, logging, recovery,
inventory, and control of hardware.

CO5 Devise effective solutions to address challenges and problems encountered in parallel and distributed computing scenarios.

Issues in cluster design


1. Performance
• Challenge: Ensuring optimal performance across all nodes while minimizing
latency and maximizing throughput.
• Solution: Employ high-speed interconnects, optimize network topology, utilize
parallel processing techniques, and tune software and hardware configurations
for performance.
2. Single System-Image
• Challenge: Providing a unified view of the cluster to users and applications,
abstracting the underlying distributed nature of the system.
• Solution: Implement distributed file systems and resource management software
to present a single, coherent view of resources and data. Utilize virtualization
and containerization technologies for transparent access to resources.
3. Fault Tolerance
• Challenge: Dealing with hardware failures, network issues, and software errors
without disrupting system operation or data integrity.
• Solution: Implement redundancy at various levels (hardware, network, data), use
fault-tolerant software, employ replication and backup strategies, and design
failover mechanisms to maintain system availability.
4. Manageability
• Challenge: Simplifying system administration, configuration, monitoring, and
troubleshooting in a complex distributed environment.
• Solution: Utilize centralized management tools, automation, and orchestration
frameworks for deployment, configuration, and monitoring. Implement
comprehensive logging and alerting systems for proactive management.
5. Programmability
• Challenge: Providing easy-to-use interfaces and APIs for developers to interact
with the cluster and leverage its resources effectively.
• Solution: Offer high-level programming models (e.g., MapReduce, MPI) and
frameworks (e.g., Apache Hadoop, Spark) for distributed computing. Provide
well-documented APIs and SDKs for resource management and job scheduling.
6. Load Balancing
• Challenge: Distributing workloads evenly across cluster nodes to prevent
resource bottlenecks and optimize resource utilization.
• Solution: Implement dynamic load balancers, scheduling algorithms, and
partitioning strategies to evenly distribute tasks. Utilize horizontal scaling and
elastic computing to dynamically allocate resources based on workload demand.
7. Security
• Challenge: Protecting the cluster from unauthorized access, data breaches, and
cyber-attacks.
• Solution: Implement strong authentication and access control mechanisms,
encrypt data in transit and at rest, and deploy firewalls, intrusion
detection/prevention systems, and security patches. Regularly audit and update
security policies and configurations.
8. Storage
• Challenge: Managing large volumes of data efficiently, ensuring data
consistency, availability, and durability.
• Solution: Employ distributed file systems (e.g., HDFS, GlusterFS) for scalable
and fault-tolerant storage. Utilize replication, erasure coding, and data
deduplication techniques for data redundancy and efficiency. Implement data
lifecycle management policies for tiered storage.
Each of these issues represents a critical aspect of cluster design that requires careful
consideration and implementation to build a robust, efficient, and reliable cluster
system. Addressing these challenges effectively is essential for achieving the desired
performance, scalability, and resilience of the cluster.
