DS UNIT-5
1. Computation: This is the actual work or tasks that each process performs.
Each process is responsible for its own specific job, like calculating numbers,
processing data, or running algorithms. These tasks are carried out
independently by each process.
2. Coordination: This is the part that handles how processes communicate and
cooperate with each other; it is the glue that binds their separate activities
into a single system.
By keeping computation and coordination separate, systems can be more flexible
and easier to manage. Processes can focus on their tasks without worrying too
much about how they will communicate with others.
Temporal coupling: Both sender and receiver must be active at the same time
for communication to succeed.
Temporal decoupling: The sender and receiver do not need to be active at the
same time; messages are stored until the receiver is ready.
Examples:
Email – The sender and receiver don’t need to be online at the same
time.
Processes explicitly know each other and communicate directly.
Example:
Publisher/Subscriber Model
Subscribers: These are processes that are interested in certain topics. They
"subscribe" to these topics to receive messages published on them.
In this model:
Temporal Coupling: Most pub/sub systems require that both publishers and
subscribers are active at the same time for communication to happen, so
publishers and subscribers are temporally coupled.
Example:
Publishers: Journalists who write articles and publish them under different
categories like Sports, Technology, etc.
In this scenario:
Both need to be active at the same time for the readers to receive the latest
articles as soon as they are published.
1. The Principle of Exchanging Data Items Between Publishers and
Subscribers
Publishers: These are processes that generate and send out data or
messages. Think of them as broadcasters.
Subscribers: These are processes that want to receive specific types of data
or messages. Think of them as listeners.
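As a rough sketch of this exchange, here is a toy topic-based broker in Python (all names are illustrative; a real system adds queues, persistence, and network transport):

```python
# Toy topic-based publish/subscribe broker; illustrative names only.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every process subscribed to this topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe("Sports", lambda msg: print("Sports subscriber got:", msg))
broker.publish("Sports", "Local team wins the final")   # delivered
broker.publish("Technology", "New phone released")      # no subscribers, dropped
```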
JavaSpace acts as a shared memory space where processes can write and
read tuples.
1. "Write A" - Process A can insert a copy of data A into the space.
2. "Write B" - Process B can insert a copy of data B into the space.
3. "Read T" - A process can search for a tuple matching template T and
retrieve data C (a read leaves the tuple in the space, while a take also
removes it).
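This write/read/take pattern can be mimicked with a toy Python tuple space (JavaSpaces itself is a Java API; this sketch only illustrates matching against a template, with None as a wildcard field):

```python
# Toy tuple space mimicking JavaSpaces-style write/read/take.
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def write(self, tup):
        self.tuples.append(tup)  # insert a copy of the data into the space

    def _match(self, tup, template):
        # None in the template acts as a wildcard field.
        return len(tup) == len(template) and all(
            t is None or t == f for f, t in zip(tup, template)
        )

    def read(self, template):
        # Return a matching tuple without removing it.
        for tup in self.tuples:
            if self._match(tup, template):
                return tup
        return None

    def take(self, template):
        # Return a matching tuple and remove it from the space.
        tup = self.read(template)
        if tup is not None:
            self.tuples.remove(tup)
        return tup

space = TupleSpace()
space.write(("A", 1))           # "Write A"
space.write(("B", 2))           # "Write B"
print(space.read(("A", None)))  # "Read T": matches ("A", 1), tuple stays
print(space.take(("B", None)))  # take: matches ("B", 2) and removes it
```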
The network acts as the communication medium
Designed for Mobile & Distributed Systems – Helps devices and processes
communicate efficiently.
Data Exists Only While Connected – When a device disconnects, the data
disappears.
Ideal for Mobile Environments – Useful for systems where devices frequently
connect and disconnect (e.g., mobile networks, IoT).
Communication in Publish-Subscribe Systems
Example:
Stock Market Alerts: A stock trading app notifies users about price
changes only for stocks they follow.
Naming in Publish-Subscribe Systems
Naming in Pub-Sub systems is how publishers and subscribers identify the topics
they care about.
Types of Naming:
Topic-Based: Subscribers subscribe to a named topic (e.g., "Sports") and
receive every message published on that topic.
Example:
Content-Based: Subscribers describe the content they want, and only
messages matching that description are delivered.
Example: A subscriber sets a filter like "Stock > $100" and gets only those
updates.
Content-Based Routing
Content-based routing ensures that messages are delivered only to the relevant
subscribers based on content.
How it Works:
Routing Decisions: Routers (intermediate servers) check this description and
compare it to the interests of the subscribers. If a subscriber is interested in
that content, they receive the message. If not, it gets discarded, saving
resources.
Extreme Cases:
Broadcast to All Servers: One way is to send the message to all servers, and
let each one check who is interested. This method is simple but wasteful.
Example:
If a user wants alerts for Tesla stock, they only receive those, not alerts for
other companies.
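A toy sketch of this filtering step in Python (subscriber names and message fields are illustrative):

```python
# Toy content-based router: a message is forwarded only to subscribers
# whose predicate matches the message content.
subscriptions = []  # list of (name, predicate) pairs

def subscribe(name, predicate):
    subscriptions.append((name, predicate))

def route(message):
    # Compare the message against every subscriber's interest;
    # non-matching subscribers never see it.
    for name, predicate in subscriptions:
        if predicate(message):
            print(f"deliver to {name}: {message}")

subscribe("tesla_fan", lambda m: m["symbol"] == "TSLA")
subscribe("big_moves", lambda m: abs(m["change"]) > 5.0)

route({"symbol": "TSLA", "change": 2.1})   # only tesla_fan receives this
route({"symbol": "AAPL", "change": 7.3})   # only big_moves receives this
```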
Security by Decoupling Publishers & Subscribers
Decoupling: The broker ensures that publishers don’t directly send data to
subscribers, enhancing privacy.
Introduction to Emerging Trends in Distributed Systems
Emerging trends in distributed systems refer to new advancements and
technologies that are shaping how computers and networks work together to
solve complex problems. Staying updated with these trends is crucial for
building systems that remain scalable, reliable, and secure.
Grid Computing
Grid computing is like connecting a network of computers to work together as if
they were one powerful machine. It allows multiple computers to share their power
(processing, storage, etc.) to solve a large problem that one computer might not
be able to handle alone.
Complex Problem Solving: Grid computing is used for tough problems that
need a lot of computational power, like climate predictions, drug research, and
genetic analysis.
Collaboration: Scientists or engineers from different locations can work
together on large projects using grid computing, sharing resources and
solving problems faster.
Cloud Computing
Cloud computing uses a network of remote servers (computers) to store and
process data instead of relying on a single server or personal computer. It
integrates well with distributed systems and is widely used for scalable and
flexible computing. Here’s how it works:
Load Balancing: Cloud services can distribute incoming requests across many
servers so no single server gets overwhelmed. This ensures that the system
remains fast and reliable.
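One common strategy is round-robin; a toy Python sketch (real cloud balancers also account for server health, sessions, and current load):

```python
# Tiny round-robin load balancer sketch; server names are illustrative.
import itertools

servers = ["server-1", "server-2", "server-3"]
next_server = itertools.cycle(servers)

def handle_request(request_id):
    server = next(next_server)  # pick servers in turn so none is overwhelmed
    print(f"request {request_id} -> {server}")

for i in range(6):
    handle_request(i)  # requests spread evenly: server-1, 2, 3, 1, 2, 3
```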
Data Replication: To keep data safe, cloud storage solutions often replicate
(make copies of) data in multiple locations. This helps ensure data is always
available, even if one server goes down.
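A toy sketch of that replication idea (location names are illustrative; real systems also handle consistency between replicas):

```python
# Toy data replication: every write is copied to several locations,
# so a read still succeeds if one location goes down.
replicas = {"us-east": {}, "eu-west": {}, "ap-south": {}}

def write(key, value):
    for store in replicas.values():   # replicate to every location
        store[key] = value

def read(key, failed=()):
    for name, store in replicas.items():
        if name not in failed and key in store:
            return store[key]         # served by the first healthy replica
    raise KeyError(key)

write("user:42", {"name": "Asha"})
print(read("user:42", failed=("us-east",)))  # still available after a failure
```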
Virtualization
Virtualization is the process of creating virtual versions of physical resources
like computers, storage, or networks. It allows multiple users, applications, or
organizations to share the same physical infrastructure efficiently.
How Virtualization Works
A special software called a hypervisor is used to create and manage multiple
virtual machines (VMs) on a single physical system.
Hypervisor – A software layer that separates physical hardware from virtual
machines. It ensures that multiple VMs can run independently on the same
machine.
Virtual Machine (VM) – A software-based computer that runs an operating
system and applications like a real computer.
Host Machine – The physical computer running the hypervisor.
Types of Virtualization
1. Server Virtualization
Running multiple virtual machines on a single physical server.
Each virtual computer (VM) has its own CPU, RAM, storage, and OS.
The hypervisor ensures that all VMs share the physical system resources
efficiently.
Example:
Cloud platforms like AWS, Azure, and Google Cloud use virtualization to
provide cloud servers.
2. Application Virtualization
Running an application without installing it directly on a system.
Example:
3. Network Virtualization
Creating multiple virtual networks on a single physical network.
Example:
4. Desktop Virtualization
Using a virtual desktop instead of a physical computer.
Example:
5. Storage Virtualization
Combining multiple storage devices into a single virtual storage system.
Example:
Google Drive, AWS S3, Dropbox store your data across multiple servers.
RAID (Redundant Array of Independent Disks) combines multiple physical
hard drives into one.
Characteristics:
Quality of Service (QoS): Defines the expected level of service for things
like speed and reliability, often outlined in a contract called a Service Level
Agreement (SLA).
Cost Reduction: Since services are reusable, it reduces the need to build
everything from scratch, saving on development and maintenance costs.
Zero Trust Architectures: This approach assumes that every user, device,
or application is potentially untrusted, so each action is strictly verified
before granting access.
Big Data
Big Data is data that is so large that traditional tools can't process it efficiently.
We usually work with data in sizes like megabytes (MB) or gigabytes (GB), but Big
Data can be in petabytes (1 petabyte = 10^15 bytes), which is enormous.
Example:
If we have data about every tweet sent on Twitter every day, the amount of data
generated would be huge—likely in the petabytes. Traditional tools like Excel
wouldn't be able to handle that much data, so we need frameworks like Hadoop.
Challenges: This data is often unstructured (e.g., social media posts, logs, sensor
data) and needs special tools to store, process, and analyze it.
Solution:
Storage: Hadoop uses HDFS (Hadoop Distributed File System) to store data
across distributed clusters.
Processing: MapReduce processes the data in parallel across the cluster.
Analysis: Tools like Pig and Hive are used for data analysis.
Hadoop Ecosystem
Hadoop is like a super-powered engine that helps in managing, storing, and
analyzing Big Data.
Hadoop: An open-source framework for storing and processing Big Data
across clusters of ordinary computers.
Hadoop's Components:
Hive: A data warehouse system built on top of Hadoop that lets you use SQL-
like queries (HiveQL) to interact with data. It converts these queries into
MapReduce jobs for processing the data.
Pig: A platform that makes it easier to write programs for processing data in
Hadoop. It uses a language called Pig Latin, which reduces the amount of
code needed for tasks.
Flume: A service that collects and transfers large streams of log data into
Hadoop, making it easy to handle big data logs.
HDFS (Hadoop Distributed File System)
HDFS is a system that allows Hadoop to store large amounts of data across
multiple computers (called nodes). This is essential for Big Data because a single
computer can't handle such massive data by itself.
How it works:
Large data files are split into smaller pieces (called blocks) and spread
across different computers in a network (cluster). A master node called
the NameNode keeps track of which block is stored on which machine.
Example: Imagine a large file, like a 100GB movie. Instead of saving it all
on one computer, HDFS breaks it into 1GB chunks and stores each chunk
on a different computer, with the NameNode keeping track of where each
piece is.
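A toy Python sketch of this split-and-track bookkeeping (HDFS's real default block size is 128 MB and blocks are also replicated; this demo uses tiny blocks and made-up node names):

```python
# Toy illustration of HDFS-style block splitting: a big file is cut into
# fixed-size blocks assigned to different nodes, and a "namenode" table
# remembers where each block went.
BLOCK_SIZE = 4  # bytes; tiny on purpose for the demo
nodes = ["node-1", "node-2", "node-3"]

data = b"Deer Bear River Car Car River Deer Car Bear"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# The namenode's metadata: block index -> node that stores it.
namenode = {i: nodes[i % len(nodes)] for i in range(len(blocks))}

for i, block in enumerate(blocks):
    print(f"block {i} ({block!r}) stored on {namenode[i]}")
```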
MapReduce: Processing Big Data
MapReduce is a method used to process large datasets by breaking the task into
smaller parts and then combining the results.
MapReduce is a way to process data in parallel. It has two main steps, Map
and Reduce, shown in the word-count walkthrough below:
1. Input Stage:
We start with a big chunk of text: "Deer Bear River Car Car River Deer Car
Bear".
2. Splitting Stage:
The input text is divided into three smaller parts to process them at the
same time:
3. Mapping Stage:
Each of the smaller chunks is processed to count how many times each
word appears.
For every word, a key-value pair is created where the word is the key and
the number 1 is the value.
Deer -> 1
Bear -> 1
River -> 1
4. Shuffling Stage:
Key-value pairs with the same key are grouped together, so all the 1s
emitted for each word end up in a single list.
5. Reducing Stage:
The grouped values for each key are summed to produce the final count
per word. For example:
Bear: 1 + 1 = 2
Car: 1 + 1 + 1 = 3
Deer: 1 + 1 = 2
River: 1 + 1 = 2
6. Final Result:
After reducing, we get the final count of each word in the entire text:
Bear: 2
Car: 3
Deer: 2
River: 2
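The whole walkthrough can be reproduced with a short Python sketch (a single-process stand-in for what Hadoop does across a cluster):

```python
# Word-count sketch following the stages above: split, map, shuffle, reduce.
from collections import defaultdict

splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # splitting stage

# Mapping stage: emit (word, 1) for every word in every split.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffling stage: group all values belonging to the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reducing stage: sum the grouped values per key.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```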
Hive
Hive is a tool that allows you to query and manage large datasets stored in
Hadoop using SQL-like language, called HiveQL. It acts as a bridge between
Hadoop and users, allowing them to interact with Hadoop’s distributed storage
without needing to write complex MapReduce code.
How Hive Works:
When you write a query in HiveQL (like a regular SQL query), Hive
automatically converts it into MapReduce jobs to process the data in the
Hadoop cluster.
Advantages of Hive:
1. SQL-like Queries: Hive uses a language called HiveQL that is similar to SQL,
making it easier for people familiar with databases to use Hadoop.
2. Data Warehousing: Hive is great for querying and summarizing large datasets,
which is useful for data warehousing applications.
Limitations of Hive:
Not Real-Time: Hive is not suitable for real-time data processing or low-
latency queries. It works best for batch processing large datasets.
Imagine you have a dataset of books with their word counts, and you want to find
how many times the word "happy" appeared. Normally, you would need to write
complex MapReduce code, but in Hive, you can do it with this simple query:
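Assuming, for illustration, a hypothetical table word_counts with columns word and cnt:

```sql
-- Hypothetical table and column names (word_counts, word, cnt)
SELECT SUM(cnt) AS happy_total
FROM word_counts
WHERE word = 'happy';
```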
This query is similar to the kind of query you'd run on a regular SQL database, but
it’s executing on Big Data stored in Hadoop.
Pig
Pig is another tool for processing Big Data, designed to handle large datasets with
a more flexible approach than SQL. It uses a language called Pig Latin, which is a
data flow language that allows users to write scripts to process data in Hadoop.
Pig Latin is a simple scripting language used in Pig to process data. It allows
for more flexibility than traditional SQL queries.
The scripts you write in Pig Latin are then translated into MapReduce jobs that
Hadoop can process.
Example:
Let’s say you want to calculate the total sales for each product in a file sales_data :
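Assuming, for illustration, that each line of sales_data holds a product name and a sale amount separated by a comma, a Pig Latin sketch could be:

```pig
-- Hypothetical field layout: product,amount
sales   = LOAD 'sales_data' USING PigStorage(',') AS (product:chararray, amount:int);
grouped = GROUP sales BY product;
totals  = FOREACH grouped GENERATE group AS product, SUM(sales.amount) AS total_sales;
DUMP totals;
```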
Differences Between Hive and Pig:
Speed: Hive is slower (good for batch processing), while Pig is faster (good
for quick scripts).
Key Differences:
Hive is typically used by data analysts who are comfortable with SQL and are
working with structured data like tables. It’s slower but great for batch
processing tasks.
Pig is preferred by programmers and is more flexible when working with semi-
structured data like logs or JSON. It uses Pig Latin, a data flow language,
which can often execute tasks faster than Hive for certain jobs.