DC-1
INTRODUCTION
Definition:
- Distributed computing involves multiple interconnected devices (nodes) collaborating to
achieve a common goal or perform complex computations.
- It aims to enhance performance, reliability, and scalability of computational tasks beyond
the capabilities of individual machines.
- Resources like processors, memory, and storage are distributed across the network.
Key Components:
2. Memory:
● Each node has its own memory (RAM, cache, or secondary storage).
● Proper data distribution and access management are crucial for efficient
performance.
3. Storage:
● Distributed systems require vast data sharing and access among nodes.
● Distributed file systems and databases manage and replicate data for fault
tolerance and quick access.
4. Networking:
● Facilitates communication and data exchange between nodes.
● High-speed and reliable networks minimize latency and enable efficient data
transfer.
5. Operating Systems:
● Distributed operating systems manage resource allocation, scheduling, data
distribution, and communication among nodes.
● Handle complexities of coordinating tasks across multiple devices.
6. Middleware:
● Acts as an intermediary between application software and the operating system.
● Provides services and APIs to abstract complexities of distributed systems.
Importance:
● Enables parallelism and resource pooling, making computations more efficient and
faster.
● Enhances fault tolerance as tasks can be rerouted to available nodes if one fails.
● Scalability allows for easy expansion of resources to handle increased workloads.
● Supports various applications like cloud computing, big data processing, and
high-performance computing.
Challenges:
● Complex design and implementation due to distributed nature.
● Ensuring data consistency and avoiding conflicts among nodes.
● Network reliability and latency issues affecting overall performance.
● Handling node failures and maintaining fault tolerance.
● Synchronization and load balancing to prevent bottlenecks.
Applications:
● Cloud computing: Distributes resources and services to users over the internet.
● Big data processing: Distributes data processing across nodes to handle large
datasets.
● High-performance computing: Parallel processing for intensive computational
tasks.
● Internet of Things (IoT): Distributes processing in IoT networks for data analysis
and decision making.
Future Trends:
● Advancements in networking technologies for faster and more reliable
communication.
● Research in distributed algorithms for improved efficiency and fault tolerance.
● Integration of AI and machine learning in distributed computing for intelligent
decision making.
● Expanding applications in various domains with the growing demand for scalable
and efficient systems.
Need for Distributed Computing:
2. Scalability:
● As computing demands grow, it becomes essential to scale resources efficiently.
● Distributed systems allow easy expansion by adding more nodes to handle
increasing workloads, making them highly scalable.
3. Fault Tolerance and Reliability:
● Single-point failures in centralized systems can lead to complete breakdown.
● Distributed computing ensures fault tolerance as tasks can be redistributed to
available nodes if one fails, increasing system reliability.
5. Geographical Distribution:
● Modern applications often serve users globally, necessitating distributed
infrastructure for lower latency and improved user experience.
● Content delivery networks (CDNs) use distributed computing to cache and deliver
content from servers closer to end-users.
6. Cost Efficiency:
● Distributed computing allows organizations to use commodity hardware, which is
more cost-effective than investing in high-end, single machines.
● Resources can be dynamically allocated based on demand, optimizing resource
utilization and reducing overall costs.
7. Parallel Processing:
● Certain computational tasks are inherently parallelizable.
● Distributed computing enables dividing these tasks into smaller sub-tasks and
processing them concurrently, leading to faster results.
12. Decentralization:
● Distributed computing promotes decentralization, reducing reliance on a single
authority or central entity.
● This can lead to improved security and privacy by distributing sensitive data across
multiple nodes.
14. Future-Proofing:
● As technology evolves, distributed computing remains adaptable and capable of
integrating new advancements efficiently.
● It is well-suited to address the challenges of future computing requirements.
Importance of Messages:
● Facilitate communication between nodes: Messages allow nodes to share data
and information with each other, enabling collaboration in distributed environments.
● Support synchronization: Messages help synchronize actions and timing of tasks
across the distributed system, ensuring orderly execution.
● Enable fault tolerance: Messages aid in detecting failures and initiating recovery
mechanisms when a node experiences issues, enhancing system reliability.
● Contribute to scalability: Efficient messaging protocols help the system handle
increased communication traffic as the number of nodes grows.
Fault Tolerance: Messages help detect failures and initiate recovery mechanisms, such as
redistributing tasks or replicating data to ensure system reliability.
Message Queues: In some systems, messages are stored in message queues, which manage
message flow so that processing is orderly and efficient.
Remote Procedure Calls (RPCs): Messages can implement RPCs, enabling nodes to
invoke functions on remote nodes as if executing local functions, fostering seamless
interaction.
Challenges: Message delivery guarantees, dealing with message loss, and ensuring
message order are challenges in distributed messaging systems.
Real-World Examples: Message brokers like RabbitMQ and Apache Kafka are commonly
used in distributed systems for reliable message communication.
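To make the message-queue idea concrete, here is a minimal in-process sketch using Python's standard library; a real broker such as RabbitMQ or Kafka adds network transport, persistence, and delivery guarantees. The names mq, producer, and consumer are illustrative.

```python
# Minimal producer/consumer sketch of a message queue using the Python
# standard library (an in-process stand-in for a real message broker).
import queue
import threading

mq = queue.Queue()              # the "message queue": buffers messages in FIFO order

def producer():
    for i in range(3):
        mq.put(f"task-{i}")     # enqueue a message
    mq.put(None)                # sentinel signalling "no more messages"

def consumer():
    while True:
        msg = mq.get()          # blocks until a message is available
        if msg is None:
            break
        print("processed", msg)

threading.Thread(target=producer).start()
consumer()
```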
1. Message Passing Systems:
Message Passing Systems are based on the concept of exchanging messages between
processes or computing nodes. In this paradigm, communication is achieved through explicit
message passing: one process sends a message to another to share data, request
computation, or synchronize actions. The communication typically occurs over a network or
interconnect, enabling distributed computing across multiple machines.
c. Scalability: Message passing systems can scale well, as the overhead of message passing
can be managed efficiently, and adding more nodes to the system is generally straightforward.
d. Fault Tolerance: Message passing systems can handle node failures by employing
techniques such as redundancy and message acknowledgments.
e. Message Buffering: Messages are typically buffered at both the sender and receiver
ends until they are processed, ensuring that data integrity and delivery are maintained.
Examples of message passing systems include the Message Passing Interface (MPI), which is
widely used in high-performance computing (HPC) for parallel processing, and various
communication libraries and frameworks used in distributed systems.
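As a hedged, concrete example, the sketch below shows explicit send/receive message passing with mpi4py, a common Python binding for MPI. It assumes an MPI runtime and mpi4py are installed, and would be launched with something like `mpirun -n 2 python demo.py` (the filename is hypothetical).

```python
# Explicit message passing with mpi4py: rank 0 sends, rank 1 receives.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=0)   # explicit send to rank 1
elif rank == 1:
    data = comm.recv(source=0, tag=0)                  # blocking receive from rank 0
    print("rank 1 received", data)
```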
2. Shared Memory Systems:
Shared Memory Systems, by contrast, are based on multiple processes or threads sharing
the same address space or memory region. In this paradigm, different processes can access
shared variables and data structures as if they were operating on their local memory,
leading to seamless communication and resource sharing.
c. Data Sharing: Shared memory systems facilitate easy data sharing among processes,
enabling efficient collaboration and communication.
d. Coherence and Consistency: Shared memory systems implement mechanisms to
maintain cache coherence and memory consistency across different processors or cores to
ensure that all processes see a consistent view of shared data.
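A minimal sketch of the shared-memory model, assuming Python 3.8+ and its multiprocessing.shared_memory module: two processes on the same machine read and write a common buffer instead of exchanging messages.

```python
# Shared-memory sketch: parent and child processes access the same buffer.
from multiprocessing import Process, shared_memory

def writer(name):
    shm = shared_memory.SharedMemory(name=name)  # attach to the existing region
    shm.buf[0] = 42                              # write directly into shared memory
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=4)
    p = Process(target=writer, args=(shm.name,))
    p.start(); p.join()
    print(shm.buf[0])   # the parent sees the child's write: 42
    shm.close(); shm.unlink()
```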
Comparison of Message Passing and Shared Memory Systems:
1. Communication Model:
● Message Passing Systems: Explicit message passing.
● Shared Memory Systems: Implicit communication through shared memory access.
2. Communication Overhead:
● Message Passing Systems: Overhead associated with message serialization,
deserialization, and network communication.
● Shared Memory Systems: Lower communication overhead, as data access is
within the same memory address space.
3. Synchronization:
● Message Passing Systems: Synchronization is typically achieved through explicit
message-based synchronization.
● Shared Memory Systems: Synchronization is required to ensure data consistency
when multiple processes access shared resources.
4. Scalability:
● Message Passing Systems: Generally scale well due to efficient message passing
mechanisms.
● Shared Memory Systems: May face scalability challenges due to contention for
shared resources.
5. Fault Tolerance:
● Message Passing Systems: Can handle node failures through redundancy and
message acknowledgments.
● Shared Memory Systems: Do not inherently handle node failures; additional
mechanisms are needed.
Primitives for distributed communication are fundamental building blocks or basic operations
that enable processes or nodes in a distributed computing system to communicate,
exchange messages, and interact with each other. These communication primitives abstract
the underlying complexities of network communication and provide a higher-level interface
for developers to implement distributed applications efficiently. They play a pivotal role in
coordinating actions, sharing data, and synchronizing processes in a distributed
environment. Let's delve into the details of some essential primitives for distributed
communication:
The "send" and "receive" primitives form the cornerstone of message passing systems.
These primitives enable processes to exchange messages between each other. When a
process sends a message, it specifies the destination process, the message data, and any
necessary parameters for communication. The recipient process, in turn, uses the "receive"
primitive to specify the source process from which it expects to receive the message. This
mechanism enables one-to-one or one-to-many communication patterns.
Importance:
- Core communication mechanism in many distributed systems.
- Facilitates data sharing and coordination.
- Supports communication across different nodes in a distributed network.
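A minimal illustration of the send/receive pattern using a multiprocessing Pipe; in a real distributed system the two endpoints would live on different machines, connected by sockets or a messaging library. The worker function is illustrative.

```python
# "send" and "receive" primitives sketched with a multiprocessing Pipe.
from multiprocessing import Process, Pipe

def worker(conn):
    msg = conn.recv()            # "receive": blocks until a message arrives
    conn.send(msg.upper())       # "send": reply to the other endpoint
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    Process(target=worker, args=(child,)).start()
    parent.send("hello")         # one-to-one message to the worker
    print(parent.recv())         # prints "HELLO"
```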
2. Broadcast:
The "broadcast" primitive allows a process to send a message to all other processes in the
distributed system simultaneously. As a result, every process receives the same information.
Broadcasts are instrumental for disseminating global information or coordinating actions
across the entire distributed system.
Use Cases:
- Dissemination of critical system updates.
- Coordinating system-wide actions.
- Sharing common data among all nodes.
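A hedged mpi4py sketch of the broadcast primitive (assuming an MPI runtime, launched with e.g. `mpirun -n 4 python demo.py`): the root's value is delivered to every process.

```python
# Broadcast with mpi4py: rank 0's dict reaches every rank.
from mpi4py import MPI

comm = MPI.COMM_WORLD
config = {"version": 7} if comm.Get_rank() == 0 else None
config = comm.bcast(config, root=0)   # every process now holds the same dict
print(comm.Get_rank(), config)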
3. Multicast:
The "multicast" primitive allows a process to send a message to a selected group of
processes rather than to all of them, as broadcast does; only members of the target group
receive the message (a sketch using MPI sub-communicators follows the applications below).
Applications:
- Coordinating tasks among a subset of nodes.
- Distributing relevant information to a specific group.
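MPI has no single "multicast" call; one common pattern, sketched below with mpi4py, is to split processes into sub-communicators and broadcast within a group. The even/odd grouping is purely illustrative.

```python
# Multicast-style group communication with mpi4py sub-communicators.
from mpi4py import MPI

comm = MPI.COMM_WORLD
color = comm.Get_rank() % 2                    # group id: 0 = even ranks, 1 = odd
group = comm.Split(color, key=comm.Get_rank()) # sub-communicator per group

msg = f"hello group {color}" if group.Get_rank() == 0 else None
msg = group.bcast(msg, root=0)                 # delivered only within the subgroup
print(comm.Get_rank(), msg)
```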
4. Scatter and Gather:
Scatter and gather primitives are employed for distributing and aggregating data across
group of processes. In a scatter operation, a single process sends different portions of a
data set to various processes, and each process receives a specific portion of the data. In a
gather operation, processes send their local data back to a single process, which aggregates
the data from all processes. Scatter and gather are commonly used in parallel processing to
divide work among processes and combine results afterward.
Significance:
- Effective data distribution in parallel processing.
- Simplifies data aggregation from distributed sources.
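A hedged mpi4py sketch of scatter and gather: the root splits a list across the ranks, each rank works on its share, and the root collects and combines the partial results.

```python
# Scatter/gather with mpi4py: divide work, compute locally, aggregate at root.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Root prepares one chunk per rank; other ranks pass None.
chunks = [list(range(r * 3, r * 3 + 3)) for r in range(size)] if rank == 0 else None
mine = comm.scatter(chunks, root=0)    # each rank receives its own chunk
partial = sum(mine)                    # local work on the received portion
totals = comm.gather(partial, root=0)  # root aggregates all partial results
if rank == 0:
    print("grand total:", sum(totals))
```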
5. Barrier:
The "barrier" primitive serves as a synchronization point for processes. When processes
encounter a barrier, they wait until all other processes also reach that barrier before
proceeding. Barriers are crucial for ensuring that processes do not move forward until a
specific collective action, such as data exchange, is completed by all participating
processes.
Use Cases:
- Ensuring coordinated actions among processes.
- Achieving synchronization before critical tasks.
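A minimal mpi4py sketch of the barrier primitive (assuming an MPI runtime): no rank proceeds to phase 2 until every rank has reached the barrier.

```python
# Barrier synchronization with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()}: phase 1 done")
comm.Barrier()   # wait here until all ranks arrive
print(f"rank {comm.Get_rank()}: phase 2 starts only after everyone finished")
```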
The "Remote Procedure Call" (RPC) primitive facilitates seamless communication between
processes running on different nodes. It allows a process to invoke a procedure or function
on a remote process as if it were executing a local procedure. RPC abstracts the
complexities of network communication, enabling transparent interaction between distributed
components.
Applications:
- Developing distributed applications with a familiar programming model.
- Calling functions on remote nodes without explicit message passing.
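As one concrete realization, Python's standard-library XML-RPC shows the RPC idea in a few lines; the host/port localhost:8000 and the add function are illustrative. The server snippet is started first, then the client calls the remote procedure as if it were local.

```python
# RPC server: registers a procedure that remote clients may invoke.
from xmlrpc.server import SimpleXMLRPCServer

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(lambda a, b: a + b, "add")
server.serve_forever()
```

```python
# RPC client: the remote procedure is invoked like an ordinary function call.
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
print(proxy.add(2, 3))   # executed on the server; prints 5
```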
7. Publish-Subscribe:
In the publish-subscribe model, processes publish messages to named topics or channels,
and other processes subscribe to the topics they care about. Publishers and subscribers
are decoupled: a publisher does not need to know which processes will receive its messages.
Use Cases:
- Real-time information dissemination.
- Event-driven applications, such as financial trading systems.
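A minimal in-process sketch of publish-subscribe (the PubSub class and topic name are illustrative): subscribers register callbacks for a topic, and publishers never need to know who is listening. A real system would add network transport, persistence, and delivery guarantees.

```python
# In-process publish-subscribe sketch: topic -> list of subscriber callbacks.
from collections import defaultdict

class PubSub:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for cb in self.subscribers[topic]:   # deliver to every subscriber
            cb(message)

bus = PubSub()
bus.subscribe("trades", lambda m: print("trader A saw:", m))
bus.subscribe("trades", lambda m: print("trader B saw:", m))
bus.publish("trades", {"symbol": "XYZ", "price": 101.5})
```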
8. Read and Write:
In shared memory systems, processes use read and write primitives to access and modify
shared variables in a memory region accessible by all processes. These primitives ensure
data consistency and prevent race conditions when multiple processes access shared
resources concurrently.
Significance:
- Efficient data sharing in shared memory environments.
- Prevention of data corruption and synchronization issues.
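A short sketch, using multiprocessing.Value with an explicit Lock, of how guarding a read-modify-write on shared data prevents a race condition among concurrent processes.

```python
# Guarded read/write on shared memory: the Lock makes the increment atomic.
from multiprocessing import Process, Value, Lock

def increment(counter, lock, times):
    for _ in range(times):
        with lock:                 # only one process inside at a time
            counter.value += 1     # read-modify-write on the shared counter

if __name__ == "__main__":
    counter, lock = Value("i", 0), Lock()
    procs = [Process(target=increment, args=(counter, lock, 1000)) for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value)           # reliably 4000 thanks to the lock
```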
Conclusion:
Primitives for distributed communication are essential tools for enabling effective interaction,
data sharing, and synchronization among processes in distributed computing systems. By
providing a standardized and abstracted way to communicate, these primitives enhance the
development of distributed applications and contribute to the efficient operation of
large-scale distributed systems.
Synchronous Execution:
In synchronous execution, tasks or processes are tightly coordinated and progress together
in a lockstep fashion. A task waits for a specific condition or event before proceeding to the
next step. Synchronous execution ensures that all processes are aligned and make progress
in a synchronized manner. This approach is often used in scenarios where strict order of
operations, coordination, and data consistency are crucial.
1. Blocking: Processes wait for each other before continuing, causing potential delays if any
process is slower or experiences a delay.
3. Data Consistency: Synchronous execution ensures that data is up-to-date and consistent
across all processes at every step.
4. Coordination: Well-suited for tasks that require synchronized actions or where results
from multiple processes need to be combined.
Asynchronous Execution:
1. Non-Blocking: Processes are not dependent on each other's progress, allowing for
parallelism and efficient resource utilization.
2. Overlap: Tasks can overlap and execute concurrently, leading to better utilization of
system resources.
3. Lack of Strict Order: Processes may complete in different orders, potentially leading to
challenges in maintaining strict data consistency.
4. Flexibility: Well-suited for scenarios where responsiveness and resource efficiency are
important, even if it means sacrificing strict coordination.
Synchronous execution ensures strict coordination and data consistency but may lead to
slower response times. Asynchronous execution offers greater parallelism and
responsiveness but requires managing potential data inconsistencies. Both modes have
their strengths and weaknesses, and the decision should align with the specific goals and
requirements of the distributed computing system.
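A small sketch contrasting the two modes with Python's concurrent.futures (the task function is illustrative): the synchronous variant blocks on each result in submission order, while the asynchronous variant lets tasks overlap and consumes results as they complete.

```python
# Synchronous vs. asynchronous task execution with a thread pool.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def task(n):
    time.sleep(0.1)
    return n * n

with ThreadPoolExecutor() as pool:
    # Synchronous style: wait for each result before moving on.
    sync_results = [pool.submit(task, n).result() for n in range(4)]

    # Asynchronous style: all tasks run concurrently; results arrive in
    # completion order, not submission order.
    futures = [pool.submit(task, n) for n in range(4)]
    async_results = [f.result() for f in as_completed(futures)]

print(sync_results, async_results)
```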
Design Issues and Challenges:
- Node Failures: Dealing with node failures and ensuring fault tolerance is critical.
Designing mechanisms to detect failures, reassign tasks, and replicate data for reliability are
complex tasks.
- Data Integrity: Ensuring data integrity and preventing data corruption during
communication and storage is a challenge. Designing mechanisms for data validation and
error detection is essential.
- Scalability: Designing systems that can scale seamlessly as the number of nodes or
users grows requires careful consideration of resource allocation, data distribution, and
communication patterns.
4. Data Management:
- Data Distribution: Efficiently distributing and replicating data across nodes to ensure
availability and fault tolerance is complex. Deciding how and where to store data requires
careful design.
- Data Privacy: Protecting sensitive data from unauthorized access or exposure during
communication and storage is challenging.
6. Resource Management:
- Resource Allocation: Efficiently allocating and managing resources like CPU, memory,
and storage across nodes while considering varying workloads is a complex task.
7. Programming Complexity:
- Debugging and Testing: Debugging and testing distributed systems can be challenging
due to their inherent complexity and non-deterministic behavior.
8. Heterogeneity:
- Hardware and Software Diversity: Designing systems that can work seamlessly across
heterogeneous hardware and software environments is complex.
9. Dynamic Environments:
- Node Joining and Leaving: Designing systems that can handle nodes joining or leaving
dynamically without disrupting the system's functionality is challenging.
In conclusion, distributed computing offers many benefits, but its design and implementation
are rife with challenges. Addressing issues related to communication, coordination, fault
tolerance, scalability, security, resource management, and programming complexity is
essential for building robust and efficient distributed systems. A thorough understanding of
these challenges and careful design considerations are vital to realizing the full potential of
distributed computing in various application domains.
A Model of Distributed Computations:
4. Data Sharing: Distributed programs often require data to be shared among processes.
Proper data sharing mechanisms, such as shared memory or distributed databases, enable
processes to access and manipulate shared data while maintaining data consistency.
5. Failure Handling: Since distributed systems can experience node failures, a model of
distributed computations includes mechanisms for detecting, handling, and recovering from
failures. Techniques like redundancy, replication, and error detection are used to ensure
system resilience.
6. Global Clocks and Time: Distributed programs often need to reason about time and the
order of events. Global clocks or logical clocks are used to establish a common time
reference across processes, aiding in synchronization and event ordering (see the Lamport
clock sketch after this list).
5. Scalability: Distributed programs can scale by adding more nodes to the system,
enabling the handling of larger workloads and datasets.
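As referenced in point 6 above, here is a minimal sketch of a Lamport logical clock, one common logical-clock scheme: each process keeps a counter, increments it on local and send events, stamps outgoing messages, and advances past received timestamps. The class and method names are illustrative.

```python
# Lamport logical clock: preserves the happened-before order of events.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        self.time += 1
        return self.time            # timestamp attached to the outgoing message

    def receive_event(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
t = p1.send_event()                 # P1 sends a message stamped t
print(p2.receive_event(t))          # P2's clock jumps past t, preserving order
```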
Design Considerations:
1. Task Decomposition: Decompose the computation into smaller tasks that can be
executed concurrently by different processes.
4. Data Sharing: Plan how shared data will be accessed, updated, and maintained while
ensuring data consistency.
5. Fault Tolerance Strategies: Design mechanisms to detect failures, replicate data, and
handle node failures gracefully.
6. Scalability and Load Balancing: Consider strategies for distributing tasks evenly among
processes to achieve scalability and optimal resource utilization.
2. Communication Management:
- Message Passing: Establish communication channels between nodes for message
exchange.
- Routing: Determine the path messages take between sender and receiver nodes.
- Buffering: Manage message buffering to handle asynchronous communication and
prevent data loss.
Models of communication networks are conceptual frameworks that represent the structure,
topology, and behavior of communication networks in distributed computing systems. These
models provide a way to understand, analyze, and design the communication infrastructure
that connects nodes and processes within a distributed system. Different models of
communication networks offer varying levels of abstraction and detail, allowing designers to
choose the most suitable representation for their specific distributed computing application.
Let's delve into the key aspects of communication network models and explore some
examples:
1. Nodes: Represent the computing devices, processes, or endpoints that participate in the
network.
2. Links: Depict the physical or logical connections between nodes. Links can represent
wired or wireless communication channels.
3. Topology: Describes the arrangement of nodes and links in the network, such as star,
ring, mesh, or tree topologies.
4. Communication Protocols: Specify the rules and conventions for data exchange and
communication between nodes.
5. Routing Algorithms: Determine how data packets are directed through the network from
source to destination.
6. Latency and Bandwidth: Represent the delay and capacity of communication channels,
respectively.
7. Network Layers: Divide the communication process into distinct layers, such as the OSI
model's physical, data link, network, transport, and application layers.
1. Point-to-Point Model:
- Description: In this simple model, nodes communicate directly with each other using
dedicated communication links.
- Characteristics: The model is easy to understand and implement but lacks scalability
and redundancy in case of link failures.
2. Star Model:
- Description: All nodes are connected to a central hub, and communication occurs
through the hub.
- Characteristics: Hub failure can disrupt the entire network, but it provides centralized
control and easy management.
3. Ring Model:
- Description: Nodes are connected in a closed loop, with each node linked to its two
neighbors; messages travel around the ring.
- Characteristics: Simple and predictable communication, but failure of any node or link
can disrupt the entire ring.
4. Mesh Model:
- Description: Nodes are richly interconnected with redundant links; in a full mesh, every
node connects to every other node.
- Characteristics: High redundancy and fault tolerance, but complex to manage and
expensive to implement.
5. Tree Model:
- Description: Nodes are arranged in a hierarchical tree structure, with a root node and
branches.
- Characteristics: Scalable and well-suited for large networks, but dependence on the root
node can lead to network failure if it fails.
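To make the fault-tolerance trade-offs above concrete, the sketch below represents a star and a three-node ring as adjacency lists and checks whether the surviving nodes remain connected after a single node failure; all node names are illustrative.

```python
# Topologies as adjacency lists, with a simple connectivity check after failure.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ring = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}   # 3-node ring

def connected_without(topology, failed):
    """Check whether the surviving nodes can still reach one another."""
    nodes = [n for n in topology if n != failed]
    if not nodes:
        return True
    seen, stack = set(), [nodes[0]]
    while stack:                       # depth-first search around the failed node
        n = stack.pop()
        if n in seen or n == failed:
            continue
        seen.add(n)
        stack.extend(topology[n])
    return seen == set(nodes)

print(connected_without(star, "hub"))  # False: hub failure partitions the star
print(connected_without(ring, "a"))    # True: the ring survives one failure
```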
SNAPSHOT Algorithm
A snapshot algorithm attempts to capture a coherent global state of a distributed system (for
the purpose of debugging or checkpointing, for instance).
We’ll say that processes in our system are named P1, P2, etc., and that the channel from
process Pi to process Pj is named C_ij. For instance, the channel from P1
to P2 is C_12, while the channel from P2 to P1 is C_21. A snapshot is a recording of the
state of each process (i.e., what events have happened on it) and each channel (i.e., what
messages have passed through it). The Chandy-Lamport algorithm ensures that when all
these pieces are stitched together, they “make sense”: in particular, it ensures that for any
event that’s recorded somewhere in the snapshot, any events that happened before that
event in the distributed execution are also recorded in the snapshot.
The Chandy-Lamport algorithm is a foundational technique used in distributed systems for
debugging, monitoring, and checkpointing purposes. It allows us to capture a consistent
snapshot of the entire system at a particular point in time, even while the system is running
concurrently.
3. Snapshot Consistency:
The Chandy-Lamport algorithm ensures that the captured snapshot is consistent, meaning
that it reflects a valid global state of the system. Specifically:
- Events recorded in the snapshot on each process are causally related.
- Any message recorded in the snapshot must have been sent before the marker message
that triggered the snapshot.
5. Considerations:
- The algorithm relies on the FIFO behavior of communication channels to ensure
consistency.
- It does not require global synchronization or coordination among processes.
6. Limitations:
- The algorithm may capture more information than necessary for certain use cases.
- It does not guarantee a minimal or efficient snapshot size.
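To illustrate the marker rules, here is a deliberately simplified, single-threaded simulation (not a production implementation): two processes P1 and P2, one incoming channel each, with channels modeled as plain FIFO queues following the C_ij convention above. The scenario shows an in-flight application message being captured as channel state.

```python
# Simplified simulation of the Chandy-Lamport snapshot rules.
from collections import deque

MARKER = "MARKER"
state = {"P1": 10, "P2": 20}              # application state of each process
recorded_state = {}                       # process -> state at snapshot time
recorded_chan = {}                        # channel -> messages caught in flight
chan = {"C_12": deque(), "C_21": deque()} # C_12: P1 -> P2, C_21: P2 -> P1

def take_snapshot(p, outgoing, incoming):
    """Record p's state, start recording its incoming channel, send a marker."""
    recorded_state[p] = state[p]
    recorded_chan[incoming] = []
    chan[outgoing].append(MARKER)

# P2 sends an application message, then P1 initiates the snapshot.
chan["C_21"].append(5)
take_snapshot("P1", outgoing="C_12", incoming="C_21")

# P2 receives the marker on C_12: since it has not yet recorded, it records
# its state (C_12's recorded state stays empty) and sends a marker back.
assert chan["C_12"].popleft() == MARKER
take_snapshot("P2", outgoing="C_21", incoming="C_12")

# P1 drains C_21: the in-flight "5" becomes recorded channel state, and the
# marker closes the recording of C_21.
while chan["C_21"]:
    msg = chan["C_21"].popleft()
    if msg == MARKER:
        break
    recorded_chan["C_21"].append(msg)

print(recorded_state)   # {'P1': 10, 'P2': 20}
print(recorded_chan)    # {'C_21': [5], 'C_12': []}
```

The resulting snapshot is consistent: the message "5", sent before the snapshot but not yet delivered, is captured as channel state rather than lost.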