DS UNIT-5

Unit 5 discusses Distributed Coordination-Based Systems, emphasizing the importance of communication and synchronization among processes in a distributed environment. It categorizes systems based on temporal and referential coupling, and introduces the publisher/subscriber model for efficient communication. The document also covers emerging trends such as grid computing, cloud computing, and virtualization, highlighting their benefits in resource utilization and scalability.

Unit 5

Distributed Coordination-Based Systems


In a distributed system, multiple computers (or processes) work together to
complete a task. However, to function properly, these processes need to
communicate and coordinate with each other.
A Distributed Coordination-Based System is a system that helps different
processes in a network work together smoothly by managing their interactions.

Why is Coordination Needed?

Imagine a team project where different members work on separate tasks. To complete the project successfully, they need:

1. Communication – To share information.

2. Synchronization – To ensure everyone is working in the correct order.

Similarly, in a distributed system, processes must coordinate to avoid conflicts, share resources, and ensure tasks are completed efficiently.

Clean Separation Between Computation and Coordination

In coordination-based systems, the idea is to clearly separate two main aspects:

1. Computation: This is the actual work or tasks that each process performs.
Each process is responsible for its own specific job, like calculating numbers,
processing data, or running algorithms. These tasks are carried out
independently by each process.

2. Coordination: This is about how processes communicate and work together. It involves managing the interactions between processes, ensuring they can share information and cooperate effectively. Coordination is like the glue that holds the different processes together, making sure they can work as a unified system.

By keeping computation and coordination separate, systems can be more flexible
and easier to manage. Processes can focus on their tasks without worrying too
much about how they will communicate with others.

Taxonomy of Coordination-Based Distributed Systems

A taxonomy is a way to classify things. In the context of coordination-based distributed systems, we can classify them based on how processes interact:

Temporal Coupling in Distributed Systems

Temporal coupling refers to whether processes need to be active at the same time for communication to occur.

1. Temporally Coupled (Synchronous Communication)

Both sender and receiver must be active at the same time for
communication to succeed.

If one process is unavailable, communication fails or gets delayed.

Examples:

Live chat or phone call – Both participants must be present to interact.

2. Temporally Decoupled (Asynchronous Communication)

Sender and receiver do not need to be active at the same time.

Messages can be stored and processed later when the recipient is available.

Examples:

Email – The sender and receiver don’t need to be online at the same
time.
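Temporal decoupling can be sketched with a simple mailbox: the sender deposits a message while the receiver is offline, and the receiver picks it up later. This is a minimal Python sketch; the `Mailbox` class is illustrative, not a real messaging API.

```python
from collections import deque

class Mailbox:
    """Stores messages so sender and receiver need not be online together."""
    def __init__(self):
        self.messages = deque()

    def send(self, msg):
        # The sender can deposit a message even if the receiver is offline.
        self.messages.append(msg)

    def receive(self):
        # The receiver processes stored messages whenever it comes online.
        return self.messages.popleft() if self.messages else None

box = Mailbox()
box.send("Meeting moved to 3pm")   # receiver is offline at this point
# ... later, the receiver comes online:
print(box.receive())               # → Meeting moved to 3pm
```

Contrast this with a phone call, where a missing receiver means the communication simply fails.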

Referential Coupling in Distributed Systems

In coordination-based systems, processes can be categorized based on whether they directly know about each other or communicate indirectly.

1. Referentially Coupled (Direct Communication)

Processes explicitly know each other and communicate directly.

There is a fixed connection between sender and receiver.

Example:

A phone call – You dial a specific person's number to communicate.

2. Referentially Decoupled (Indirect Communication)

Processes do not need to know about each other.

They communicate through an intermediary, like a message queue or a shared data space.

Example:

A radio broadcast – The speaker doesn’t know who is listening; listeners tune in as needed.

Publish-Subscribe Model – A YouTube creator uploads a video, and subscribers get notified without direct interaction.

Publisher/Subscriber Model

The publisher/subscriber (pub/sub) model is a popular way to handle communication in distributed systems:

Publishers: These are processes that produce messages or information. They "publish" messages on specific topics or channels.

Subscribers: These are processes that are interested in certain topics. They "subscribe" to these topics to receive messages published on them.

In this model:

Temporal Coupling: Most pub/sub systems require that both publishers and subscribers are active at the same time for communication to happen, so there is a temporal coupling between them.

Referential Decoupling: Publishers and subscribers do not need to know each other explicitly. They only need to know about the topics they are interested in. This makes the system more flexible and scalable.

Example

Imagine a news website:

Publishers: Journalists who write articles and publish them under different
categories like Sports, Technology, etc.

Subscribers: Readers who are interested in specific categories. They subscribe to their preferred categories to get updates.

In this scenario:

The journalists (publishers) do not need to know who the readers (subscribers) are.

The readers do not need to know who the journalists are.

Both need to be active at the same time for the readers to receive the latest
articles as soon as they are published.

Architecture of Distributed Coordination-Based Systems

1. The Principle of Exchanging Data Items Between Publishers and Subscribers

In distributed systems, the publisher/subscriber (pub/sub) model is a common way to exchange data:

Publishers: These are processes that generate and send out data or messages. Think of them as broadcasters.

Subscribers: These are processes that want to receive specific types of data or messages. Think of them as listeners.

Publish/Subscribe middleware: Publishers send data to a central system (like a message broker or a shared data space), and subscribers receive the data they are interested in from this central system.

2. Traditional Architectures Example: Jini and JavaSpaces

Jini: This is a network architecture that allows devices and software components to easily connect and work together. It uses a concept called a "lookup service": services (like printers or databases) register themselves, and clients (users or other services) can find and use these services without needing to know their exact location.

JavaSpaces: This is a technology that provides a shared space (like a bulletin board) where processes can put data (write) and take data (read). It is a way to coordinate activities between different processes by sharing data in a common area. Think of it as a shared whiteboard where everyone can post notes and read notes posted by others.

A JavaSpace acts as a shared memory space where processes can write and read tuples. Its core operations are:

1. "Write A" – Process A inserts a copy of data A into the space.

2. "Write B" – Process B inserts a copy of data B into the space.

3. "Read T" – A process searches for a tuple matching template T and retrieves the matching data (with optional removal).

The tuples exist independently in the space, allowing for asynchronous communication between processes.
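The write/read-by-template behaviour described above can be sketched as a tiny in-memory tuple space in Python. This is an illustrative toy, not the real JavaSpaces API; real JavaSpaces also provides `take`, leases, and transactions.

```python
class TupleSpace:
    """A toy JavaSpaces-style shared dataspace holding tuples."""
    def __init__(self):
        self.tuples = []

    def write(self, tup):
        # Insert a copy of the tuple into the shared space.
        self.tuples.append(tup)

    def read(self, template, remove=False):
        # A template matches a tuple if every non-None field is equal.
        for tup in self.tuples:
            if len(tup) == len(template) and all(
                t is None or t == v for t, v in zip(template, tup)
            ):
                if remove:                  # optional removal ("take" semantics)
                    self.tuples.remove(tup)
                return tup
        return None                         # no matching tuple yet

space = TupleSpace()
space.write(("stock", "AAPL", 150))        # process A writes
space.write(("stock", "TSLA", 700))        # process B writes
# A third process reads by template, without knowing the writers:
print(space.read(("stock", "AAPL", None))) # → ('stock', 'AAPL', 150)
```

Because tuples live in the space itself, the reading process never needs to be active at the same time as the writers.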

3. Publish/Subscribe System Implemented in TIB/Rendezvous

This is a message-oriented coordination model where:

1. Publishers send messages on specific subjects (e.g., subjects A and B)

2. Subscribers register interest in specific subjects

3. Messages are multicast to all relevant subscribers

Each node contains:

RV daemon: Handles message routing

RV lib: Library for pub/sub operations

Subject information (what it publishes or subscribes to)

The network acts as the communication medium.

Messages are multicast, meaning a single message can reach multiple subscribers interested in that subject.

4. Lime (Linda In a Mobile Environment): A Coordination Model for Mobile and Distributed Systems

Designed for Mobile & Distributed Systems – Helps devices and processes
communicate efficiently.

Uses a "Transient Shared Dataspace" – A temporary shared area where data can be stored and retrieved.

Data Exists Only While Connected – When a device disconnects, the data
disappears.

Ideal for Mobile Environments – Useful for systems where devices frequently
connect and disconnect (e.g., mobile networks, IoT).

Enables Seamless Coordination – Processes don’t need to know each other directly but can exchange data through the shared space.
Communication in Publish-Subscribe Systems

In many publish/subscribe systems, communication is relatively straightforward. Here's how it works:

Remote Method Invocation (RMI): In Java-based systems, communication often happens through RMI, where one process (or application) can call a method on another process running on a different machine. This is like one computer asking another computer to perform a task and return the result.

Message Exchange: Publishers send messages (data) to a central system (like a broker), and subscribers receive the messages they are interested in. This is like a radio station broadcasting news, and listeners tuning in to the channels they like.

How it Works?

A publisher sends a message (event).

A broker (middleware) delivers the message to all interested subscribers.

Subscribers receive only the messages they are interested in.

Example:

News App: A user subscribes to "Sports News." When a journalist publishes a sports article, only the subscribed users receive it.

Stock Market Alerts: A stock trading app notifies users about price
changes only for stocks they follow.
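The broker-based flow above can be sketched as a minimal topic-based broker in Python. This is an illustrative sketch; real brokers such as Kafka or MQTT add networking, persistence, and delivery guarantees.

```python
from collections import defaultdict

class Broker:
    """Minimal topic-based pub/sub broker: publishers and subscribers
    only know topic names, never each other (referential decoupling)."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message only to subscribers of this topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("sports", received.append)    # a reader follows Sports
broker.publish("sports", "Local team wins!")   # a journalist publishes
broker.publish("tech", "New phone released")   # not delivered to this reader
print(received)                                # → ['Local team wins!']
```

Note that the journalist never learns who the readers are: both sides interact only with the broker.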

Naming in Publish-Subscribe Systems

Naming in pub/sub systems is how publishers and subscribers identify the topics they care about.

Types of Naming:

1. Topic-Based Naming: Messages are classified under topics.

Example: A publisher posts messages under topics like "Sports", "Politics", "Tech".

Subscribers receive messages for the topics they follow.

2. Content-Based Naming: Messages are categorized by content instead of predefined topics.

Example: A subscriber sets a filter like "Stock > $100" and gets only those updates.

Example:

Topic-Based:

Publisher: "New article in Sports category!"

Subscribers to Sports receive it, but Tech subscribers don’t.

Content-Based:

Publisher: "AAPL stock price is $150."

Subscribers filtering AAPL stocks > $100 get the message.
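Content-based subscriptions like "AAPL stocks > $100" can be expressed as predicates over message attributes. A minimal sketch follows; the message fields `symbol` and `price` are illustrative, not a standard schema.

```python
# Content-based filtering: a subscription is a predicate over message content.
subscriptions = [
    ("alice", lambda m: m["symbol"] == "AAPL" and m["price"] > 100),
    ("bob",   lambda m: m["symbol"] == "TSLA"),
]

def route(message):
    """Deliver the message only to subscribers whose filter matches."""
    return [name for name, matches in subscriptions if matches(message)]

print(route({"symbol": "AAPL", "price": 150}))  # → ['alice']
print(route({"symbol": "AAPL", "price": 90}))   # → []
```

Unlike topic-based naming, the match is computed from the message itself, so no predefined topic list is needed.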

Content-Based Routing
Content-based routing ensures that messages are delivered only to the relevant
subscribers based on content.
How it Works?

Each message carries metadata describing its content.

A broker analyzes the content and sends it only to interested subscribers.

Unnecessary messages are filtered out.

Routing Decisions: Routers (intermediate servers) check this description and
compare it to the interests of the subscribers. If a subscriber is interested in
that content, they receive the message. If not, it gets discarded, saving
resources.

Extreme Cases:

Broadcast to All Servers: One way is to send the message to all servers and let each one check who is interested. This method is simple but wasteful.

Broadcast Subscriptions: Another way is for each server to share its subscription list with all other servers. This helps servers know who is interested in what, making the routing process more efficient.

Example:

A weather update system:

A subscriber in New York only receives messages about New York weather, not Los Angeles.

A stock market alert system:

If a user wants alerts for Tesla stock, they only receive those, not alerts for
other companies.

Security by Decoupling Publishers & Subscribers

Security in pub/sub systems is critical since publishers and subscribers should not communicate directly.

How Security is Achieved?

1. Encryption: Messages are encrypted before publishing.

2. Trusted Service: A third-party system provides encryption and decryption keys.

3. Decoupling: The broker ensures that publishers don’t directly send data to subscribers, enhancing privacy.

Introduction to Emerging Trends in Distributed Systems

Emerging trends in distributed systems refer to new advancements and technologies that are shaping how computers and networks work together to solve complex problems. Staying updated with these trends is crucial because:

Performance Optimization: These new trends often improve how well systems work, making them faster and able to handle more tasks at once.

Security Enhancements: As threats change, new trends bring stronger security measures to protect data and systems.

Cost Efficiency: Innovations help organizations save money by using resources more effectively.

Competitive Advantage: Adopting the latest technologies can give companies an edge in the market.

Adaptability: Knowing about these trends helps organizations stay relevant as technology evolves.

Grid Computing
Grid computing is like connecting a network of computers to work together as if
they were one powerful machine. It allows multiple computers to share their power
(processing, storage, etc.) to solve a large problem that one computer might not
be able to handle alone.

Scalability: When more computational power is needed, more computers can be added to the grid. For example, if a task becomes bigger, additional machines can pitch in to share the workload.

Resource Utilization: Often, computers have unused processing power or storage. Grid computing helps make use of these idle resources, reducing waste.

Complex Problem Solving: Grid computing is used for tough problems that
need a lot of computational power, like climate predictions, drug research, and
genetic analysis.

Collaboration: Scientists or engineers from different locations can work
together on large projects using grid computing, sharing resources and
solving problems faster.

Cost Savings: Instead of investing in new hardware, organizations can make use of existing computers to share resources, cutting down on costs.

Cloud Computing
Cloud computing uses a network of remote servers (computers) to store and
process data instead of relying on a single server or personal computer. It
integrates well with distributed systems and is widely used for scalable and
flexible computing. Here’s how it works:

Scalable Infrastructure: Cloud computing allows resources like storage and computing power to be scaled up or down based on demand. For instance, if a website gets a sudden increase in visitors, the cloud can automatically provide more servers to handle the load.

Load Balancing: Cloud services can distribute incoming requests across many
servers so no single server gets overwhelmed. This ensures that the system
remains fast and reliable.

Data Replication: To keep data safe, cloud storage solutions often replicate
(make copies of) data in multiple locations. This helps ensure data is always
available, even if one server goes down.

Virtualization
Virtualization is the process of creating virtual versions of physical resources
like computers, storage, or networks. It allows multiple users, applications, or
organizations to share the same physical infrastructure efficiently.
How Virtualization Works

A special software called a hypervisor is used to create and manage multiple virtual machines (VMs) on a single physical system.

Hypervisor – A software layer that separates physical hardware from virtual machines. It ensures that multiple VMs can run independently on the same machine.

Virtual Machine (VM) – A software-based computer that runs an operating system and applications like a real computer.

Host Machine – The physical computer running the hypervisor.

Guest Machine – The virtual machines running on top of the hypervisor.


Advantages of Virtualization

1. Cost Savings – Reduces the need for buying more hardware.

2. Resource Efficiency – Maximizes the use of physical resources.

3. Scalability – Easily create more VMs as needed.

Types of Virtualization

1. Hardware Virtualization (Virtual Machines)

Creating multiple virtual computers on a single physical machine.

Each virtual computer (VM) has its own CPU, RAM, storage, and OS.

The hypervisor ensures that all VMs share the physical system's resources efficiently.

Example:

Running Windows and Linux on the same computer using VMware or VirtualBox.

Cloud platforms like AWS, Azure, and Google Cloud use virtualization to provide cloud servers.

2. Application Virtualization

Running an application without installing it directly on a system.

The application runs in a virtual environment, separate from the operating system.

It prevents software conflicts and allows apps to run on different devices without compatibility issues.

Example:

Google Docs runs in a browser without installation.

Microsoft App-V allows Windows applications to run without installing them on each device.

3. Network Virtualization
Creating multiple virtual networks on a single physical network.

It separates different network traffic into isolated virtual networks.

Improves security, scalability, and performance.

Example:

VPN (Virtual Private Network) allows secure internet browsing.

Software-Defined Networking (SDN) lets companies control networks using software.

4. Desktop Virtualization
Using a virtual desktop instead of a physical computer.

Users log into a virtual desktop running on a remote server.

Allows employees to access their desktops from any device.

Example:

Amazon WorkSpaces offers cloud-based desktops.

Remote Desktop Protocol (RDP) lets you access another computer remotely.

5. Storage Virtualization
Combining multiple storage devices into a single virtual storage system.

It improves data management, scalability, and reliability.

Example:

Google Drive, AWS S3, Dropbox store your data across multiple servers.

RAID (Redundant Array of Independent Disks) combines multiple physical
hard drives into one.

Service-Oriented Architecture (SOA)

Service-Oriented Architecture (SOA) is a way of designing software systems where the functionality is broken down into reusable "services." Each service performs a specific task and can be used by different systems.

Characteristics:

Interoperability: Different services can work together, even if they are built on different technologies.

Service Encapsulation: Each service hides its internal workings, so other systems don’t need to know the details of how it works.

Quality of Service (QoS): Defines the expected level of service for things like speed and reliability, often outlined in a contract called a Service Level Agreement (SLA).

Loose Coupling: Services are independent of each other, meaning one service can change without affecting others.

Location Transparency: Services can be used anywhere, allowing for better scalability and availability.

Cost Reduction: Since services are reusable, it reduces the need to build everything from scratch, saving on development and maintenance costs.

Emerging/Future Trends in Distributed Systems

The future of distributed systems will be shaped by several emerging trends:

Ubiquitous Edge Computing: Edge computing involves processing data closer to where it is generated (e.g., on devices like smartphones or sensors), rather than sending it to a centralized cloud. This helps reduce delays and allows faster decisions. When combined with AI, it enables real-time data processing for things like smart homes, autonomous cars, and smart cities.

Enhanced Security and Privacy:

Zero Trust Architectures: This approach assumes that every user, device,
or application is potentially untrusted, so each action is strictly verified
before granting access.

Advanced Cryptography: With the rise of quantum computing, traditional encryption methods may become vulnerable. Quantum-resistant encryption methods are being developed to keep data secure in the future.

Quantum Computing and Networking:

Quantum-Enhanced Systems: Quantum computing will allow systems to solve incredibly complex problems that are impractical for classical computers, such as optimization problems and large-scale simulations.

Quantum Networks: These networks use quantum principles to transmit data securely, leveraging quantum entanglement for highly secure key distribution.

Interoperability and Standards: There will be a growing focus on creating standards that allow different distributed systems to work together seamlessly. Open standards and APIs (Application Programming Interfaces) will help ensure that different systems can easily communicate and integrate with each other.

Big Data

Big Data is data that is so large that traditional tools can't process it efficiently.
We usually work with data in sizes like megabytes (MB) or gigabytes (GB), but Big
Data can be in petabytes (1 petabyte = 10^15 bytes), which is enormous.
Example:
If we have data about every tweet sent on Twitter every day, the amount of data
generated would be huge—likely in the petabytes. Traditional tools like Excel
wouldn't be able to handle that much data, so we need frameworks like Hadoop.
Challenges: This data is often unstructured (e.g., social media posts, logs, sensor
data) and needs special tools to store, process, and analyze it.

Storing, processing, and analyzing huge amounts of unstructured data.

Solution:

Storage: Hadoop uses HDFS (Hadoop Distributed File System) to store data
across distributed clusters.

Processing: MapReduce processes data in parallel across clusters.

Analysis: Tools like Pig and Hive are used for data analysis.

Hadoop Ecosystem
Hadoop is like a super-powered engine that helps in managing, storing, and
analyzing Big Data.
Hadoop:

It is an open-source framework that processes Big Data across a distributed cluster of machines.

Hadoop's Components:

HDFS (Hadoop Distributed File System): A way to store data across many computers.

MapReduce: A system to process the data in parallel (splitting it into smaller pieces).

Tools in the Hadoop Ecosystem

Hive: A data warehouse system built on top of Hadoop that lets you use SQL-
like queries (HiveQL) to interact with data. It converts these queries into
MapReduce jobs for processing the data.

Pig: A platform that makes it easier to write programs for processing data in
Hadoop. It uses a language called Pig Latin, which reduces the amount of
code needed for tasks.

Sqoop: A tool designed to transfer large amounts of data between Hadoop and structured databases, like relational databases.

Flume: A service that collects and transfers large streams of log data into
Hadoop, making it easy to handle big data logs.

HDFS (Hadoop Distributed File System)
HDFS is a system that allows Hadoop to store large amounts of data across
multiple computers (called nodes). This is essential for Big Data because a single
computer can't handle such massive data by itself.

How it works:

Large data files are split into smaller pieces (called blocks) and spread across different computers in a network (cluster).

One computer, called the NameNode, is responsible for managing the location of all these blocks, while the others, called DataNodes, store the actual data.

Example: Imagine a large file, like a 100GB movie. Instead of saving it all on one computer, HDFS breaks it into 1GB chunks and stores each chunk on a different computer, with the NameNode keeping track of where each piece is.
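The NameNode/DataNode split can be sketched as a tiny block-placement table. This is a toy model: real HDFS uses 128 MB blocks by default, replicates each block (typically 3 copies), and tracks DataNodes via heartbeats.

```python
BLOCK_SIZE_GB = 1  # matches the 1GB-chunk example above; real HDFS uses 128 MB

def place_blocks(file_size_gb, datanodes):
    """Split a file into blocks and assign each block to a DataNode
    round-robin. The returned dict plays the role of the NameNode's
    block map (block id -> DataNode holding it)."""
    num_blocks = -(-file_size_gb // BLOCK_SIZE_GB)   # ceiling division
    return {f"block-{i}": datanodes[i % len(datanodes)]
            for i in range(num_blocks)}

# The 100GB movie is split into 100 blocks spread over 4 DataNodes:
block_map = place_blocks(100, ["dn1", "dn2", "dn3", "dn4"])
print(len(block_map))          # → 100
print(block_map["block-0"])    # → dn1
```

To read the file back, a client would ask this block map where each block lives and fetch the blocks from the DataNodes directly.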

MapReduce: Processing Big Data
MapReduce is a method used to process large datasets by breaking the task into
smaller parts and then combining the results.
MapReduce is a way to process data in parallel. It has two steps:

1. Map: Break the problem into smaller parts.

2. Reduce: Combine the results.

MapReduce Process: Word Counting Example:

1. Input Stage:

We start with a big chunk of text: "Deer Bear River Car Car River Deer Car
Bear".

This is the raw data that needs to be processed.

2. Splitting Stage:

The input text is divided into three smaller parts to process them at the same time:

"Deer Bear River"

"Car Car River"

"Deer Car Bear"

Splitting the data like this allows us to process it on different machines, speeding up the task.

3. Mapping Stage:

Each of the smaller chunks is processed to count how many times each
word appears.

For every word, a key-value pair is created where the word is the key and
the number 1 is the value.

Example from the first chunk "Deer Bear River":

Deer -> 1

Bear -> 1

River -> 1

4. Shuffling Stage:

All similar words are grouped together from all chunks.

The words are sent to the same reducer, so we’ll get:

All "Bear" counts grouped together

All "Car" counts grouped together

All "Deer" counts grouped together

All "River" counts grouped together

5. Reducing Stage:

For each word, the reducer adds up all the counts.

For example:

Bear: 1 + 1 = 2

Car: 1 + 1 + 1 = 3

Deer: 1 + 1 = 2

River: 1 + 1 = 2

6. Final Result:

After reducing, we get the final count of each word in the entire text:

Bear: 2

Car: 3

Deer: 2

River: 2
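The stages above can be run end-to-end in a few lines of Python. This is a single-machine sketch of the MapReduce flow; real Hadoop distributes the map and reduce tasks across the cluster.

```python
from collections import defaultdict

splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # splitting stage

# Map stage: emit a (word, 1) pair for every word in every split.
mapped = [(word, 1) for chunk in splits for word in chunk.split()]

# Shuffle stage: group all counts for the same word together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce stage: sum the grouped counts per word.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # → {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```

In Hadoop, each split would be mapped on a different machine, and the shuffle would route all pairs for one word to the same reducer.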

Hive
Hive is a tool that allows you to query and manage large datasets stored in
Hadoop using SQL-like language, called HiveQL. It acts as a bridge between
Hadoop and users, allowing them to interact with Hadoop’s distributed storage
without needing to write complex MapReduce code.
How Hive Works:

Instead of using traditional programming languages to process Big Data, Hive allows you to write queries similar to SQL (Structured Query Language).

When you write a query in HiveQL (like a regular SQL query), Hive automatically converts it into MapReduce jobs to process the data in the Hadoop cluster.

Advantages of Hive:

1. SQL-like Queries: Hive uses a language called HiveQL that is similar to SQL,
making it easier for people familiar with databases to use Hadoop.

2. No Need for MapReduce Code: You don’t need to write complicated MapReduce code. Hive does all the translation for you.

3. Data Warehousing: Hive is great for querying and summarizing large datasets,
which is useful for data warehousing applications.

Limitations of Hive:

Not Real-Time: Hive is not suitable for real-time data processing or low-
latency queries. It works best for batch processing large datasets.

Simple Example in Hive:

Imagine you have a dataset of books with their word counts, and you want to find
how many times the word "happy" appeared. Normally, you would need to write
complex MapReduce code, but in Hive, you can do it with this simple query:

SELECT COUNT(*) FROM books WHERE word = 'happy';

This query is similar to the kind of query you'd run on a regular SQL database, but
it’s executing on Big Data stored in Hadoop.

Pig
Pig is another tool for processing Big Data, designed to handle large datasets with
a more flexible approach than SQL. It uses a language called Pig Latin, which is a
data flow language that allows users to write scripts to process data in Hadoop.

How Pig Works:

Pig Latin is a simple scripting language used in Pig to process data. It allows
for more flexibility than traditional SQL queries.

The scripts you write in Pig Latin are then translated into MapReduce jobs that
Hadoop can process.

Example:
Let’s say you want to calculate the total sales for each product in a file sales_data :

A = LOAD 'sales_data' AS (product_id, sales);
B = GROUP A BY product_id;
C = FOREACH B GENERATE group, SUM(A.sales);
DUMP C;

How This Script Works:

1. LOAD: Load the sales data.

2. GROUP: Group the data by product ID.

3. FOREACH: Calculate the sum of sales for each product.

4. DUMP: Display the result.

Differences Between Hive and Pig:

| Feature | Hive | Pig |
| --- | --- | --- |
| Who Uses It | Data Analysts (SQL users) | Programmers (coding enthusiasts) |
| Language | SQL-like (HiveQL) | Data-flow (Pig Latin) |
| Data Type | Structured (e.g., tables) | Semi-structured (e.g., logs, JSON) |
| Speed | Slower (good for batch processing) | Faster (good for quick scripts) |

Key Differences:

Hive is typically used by data analysts who are comfortable with SQL and are
working with structured data like tables. It’s slower but great for batch
processing tasks.

Pig is preferred by programmers and is more flexible when working with semi-
structured data like logs or JSON. It uses Pig Latin, a data flow language,
which can often execute tasks faster than Hive for certain jobs.
