
21CS745 NOSQL Databases
Koustav Biswas, Dept. of CSE, DSATM

MODULE 2
Distribution Models
Single Server:
• Single-server configuration is often recommended for its simplicity, eliminating the
complexities of distributed systems.
• It is easier for both operations and development teams to manage and reason about.
• Single-server setups work best when the data model fits well with NoSQL and doesn't need
extensive distribution, such as with graph databases or when processing aggregates.
Sharding:

• Sharding is a method of dividing data across multiple servers, improving performance and
horizontal scalability by placing different portions of the dataset on separate nodes.
• The key challenge in sharding is to ensure that data is grouped or "clumped" appropriately
(e.g., by user or geographic location) to minimize cross-node requests and maintain load
balance.
• Auto-sharding features in NoSQL databases handle distribution and queries across shards
automatically.
• Sharding is advantageous for both read and write performance but does not inherently
improve resilience—if a shard goes down, its data becomes inaccessible.
• Implementing sharding should be done before it's critically needed to avoid issues with
moving data across shards in production.
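
To make the routing idea concrete, here is a minimal Python sketch of hash-based shard selection, assuming a fixed list of hypothetical nodes; real databases derive this mapping from cluster metadata, so treat the node names and the modulo scheme as illustrative only:

import hashlib

# Hypothetical shard nodes; a real cluster discovers these from metadata.
SHARD_NODES = ["node-a", "node-b", "node-c", "node-d"]

def shard_for(key: str) -> str:
    # Map a shard key (e.g., a user ID) to one node via a stable hash.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_NODES[int(digest, 16) % len(SHARD_NODES)]

# Requests for the same user always land on the same shard, keeping that
# user's data "clumped" on one node.
print(shard_for("user:alice"))  # deterministic: the same node on every call

One caveat of this naive modulo scheme is that changing the number of nodes remaps most keys, which is why production systems usually prefer consistent hashing (sketched later, under combining sharding and replication).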
Master-Slave Replication:
• Master-slave replication involves copying data from a master node to multiple slave nodes.
The master handles all writes, while reads can be distributed among slaves.

• This model is useful for read-intensive applications, but the master still handles all updates,
which can be a bottleneck in write-heavy workloads.
• The system offers some read resilience since slaves can continue servicing read requests
even if the master fails. However, write operations will be unavailable until the master is
restored or a new master is appointed.
• Replication setups can involve manual or automatic master appointment, with automatic
systems reducing downtime by quickly electing a new master in case of failure.
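
A minimal sketch of this routing rule, assuming one master and a round-robin policy over read replicas (the node names are placeholders, not any particular product's API):

import itertools

class MasterSlaveRouter:
    # Sketch: all writes go to the master; reads rotate across slaves.
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)  # round-robin over replicas

    def route_write(self):
        return self.master        # the single node allowed to accept updates

    def route_read(self):
        return next(self.slaves)  # spread read load across the replicas

router = MasterSlaveRouter("master-1", ["slave-1", "slave-2", "slave-3"])
print(router.route_write())  # always master-1
print(router.route_read())   # slave-1, then slave-2, then slave-3, ...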
Peer-to-Peer Replication
Master-slave replication improves read scalability but doesn't help with write scalability or eliminate
the single point of failure of the master. In contrast, peer-to-peer replication distributes read and
write operations across all nodes, avoiding the master-slave hierarchy and offering the following
advantages:

1. Equal Weight Across Nodes:


o All nodes in the peer-to-peer model have equal responsibility, meaning any node can
accept reads and writes.
o The loss of any node doesn’t prevent access to the data since other nodes can still
handle requests.
2. Benefits:

o Resilience: Peer-to-peer replication can handle node failures without losing access to
the data store.
o Scalability: It's easy to add new nodes to the system, which improves overall
performance.
However, peer-to-peer replication also brings some complications, primarily around consistency:
3. Write-write Conflicts:
o When multiple nodes can write data simultaneously, there’s a risk of two clients
attempting to update the same record at the same time, leading to write-write
conflicts.
o Inconsistencies from reads may be temporary, but inconsistencies from writes can
persist indefinitely.
There are two broad strategies to handle write inconsistencies:
• Coordination Approach:
o The replicas coordinate to avoid conflicts during writes. This gives a strong
consistency guarantee, similar to master-slave replication, but with the added
network overhead required to coordinate.
o Coordination typically requires a majority of nodes to agree on the write, allowing
the system to remain operational even if a minority of nodes fail.
• Coping with Inconsistency:
o In some contexts, it is possible to handle inconsistent writes by applying merging
policies to resolve conflicts. This approach allows for maximum performance
because it avoids the overhead of coordination but sacrifices some consistency.
Peer-to-peer replication essentially operates along a spectrum where systems trade off between
consistency and availability, depending on the needs of the application.

Combining Sharding and Replication


Sharding and replication can be combined to further optimize scalability and fault tolerance.
1. Master-Slave Replication with Sharding:
o With sharding, data is split into segments, with each segment (or shard) assigned to a
single master.
o The system can have multiple masters, but each data item has only one master
responsible for it.
o Nodes can be configured to serve as a master for some data shards and a slave for
others, or there can be dedicated nodes for master or slave duties.
2. Peer-to-Peer Replication with Sharding:
o Peer-to-peer replication can also be used with sharding, especially in column-family
databases.

o In such a setup, a common strategy is to have a replication factor of 3, meaning each
shard is stored on three different nodes.
o If a node fails, the shards on that node can be rebuilt on other nodes, ensuring data
redundancy and continued availability.
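
A minimal sketch of how such a placement might be computed with consistent hashing, in the spirit of column-family stores: each key's replicas are the next three distinct nodes clockwise around a hash ring. The node names and ring construction are assumptions for illustration:

import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICATION_FACTOR = 3

def ring_position(name: str) -> int:
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)

# Sort nodes by their position on the hash ring.
ring = sorted((ring_position(n), n) for n in NODES)

def replicas_for(key: str) -> list:
    # Walk clockwise from the key's position, taking the next 3 nodes.
    start = bisect.bisect(ring, (ring_position(key), ""))
    return [ring[(start + i) % len(ring)][1] for i in range(REPLICATION_FACTOR)]

print(replicas_for("user:alice"))  # the three nodes holding this key's replicas

If one of the three nodes fails, its ring segment is served and rebuilt by the next nodes along the ring, which is what makes recovery straightforward.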
Consistency
While relational databases emphasize strong consistency by avoiding concurrency issues, NoSQL
databases take a more flexible approach, often using eventual consistency and allowing for more
complex ways to handle conflicts. Deciding between pessimistic and optimistic concurrency models
depends on the trade-offs between safety (consistency) and liveness (responsiveness), with replication
adding further complexity to the consistency model.
Update Consistency: The example of two users, Martin and Pramod, trying to update a phone
number at the same time demonstrates a write-write conflict. Martin’s update may get overwritten by
Pramod’s due to the lack of concurrency control, which leads to lost updates and is seen as a failure of
consistency.
Concurrency Control:
• Pessimistic concurrency: This approach prevents conflicts by using write locks, ensuring only
one update can occur at a time. In this case, Martin's update would be locked in first,
preventing Pramod from proceeding until Martin's update is complete.
• Optimistic concurrency: This method allows conflicts but detects them when they occur. For
example, conditional updates check whether the data has changed since the last read before
making an update. If Pramod's update fails, he would be notified and could decide whether to
proceed with his update based on the current value.
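
A minimal sketch of the optimistic, conditional-update approach, using an in-memory dictionary as a stand-in for the database (the class and method names are invented for the example):

class ConditionalStore:
    # In-memory stand-in for a store that supports conditional updates.
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def conditional_update(self, key, new_value, expected_version):
        # Apply the write only if nobody updated the key since it was read.
        _, current_version = self.read(key)
        if current_version != expected_version:
            return False  # conflict detected: caller must re-read and decide
        self._data[key] = (new_value, current_version + 1)
        return True

store = ConditionalStore()
_, v = store.read("phone")                               # both users read version 0
print(store.conditional_update("phone", "555-0101", v))  # Martin: True
print(store.conditional_update("phone", "555-0202", v))  # Pramod: False, stale read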
Distributed Systems and Sequential Consistency: In a distributed system, multiple servers may handle
updates, and without proper control, they may apply updates in different orders, leading to
inconsistencies. Sequential consistency ensures that all nodes apply operations in the same order.
Handling Write-Write Conflicts: Another way to handle conflicts is to store both updates and mark
them as conflicted, similar to how version control systems handle concurrent changes. Users might be
asked to merge the conflicting changes manually or the system might automatically resolve simple
cases, such as differences in formatting.
Trade-offs in Concurrency:
• Pessimistic approaches: While they avoid conflicts, they often reduce system liveness
(responsiveness), as locking mechanisms can introduce deadlocks.
• Optimistic approaches: These aim for better performance by allowing conflicts to happen and
resolving them afterward, which is more scalable in systems with high concurrency.
Replication and Write-Write Conflicts: In replicated systems, where multiple nodes hold independent
copies of data, write conflicts become more likely. One way to maintain consistency is to designate a
single node for all writes, minimizing the chances of conflicting updates.

Read consistency
Logical consistency ensures that data makes sense together. For example, in an order system, if a line
item is updated, the corresponding shipping charge should also be updated to reflect the change.
However, in a concurrent scenario, one user may modify a line item while another reads the order
data, causing an inconsistent read. This is called a read-write conflict. To prevent such conflicts,
relational databases use transactions, which ensure that either both updates are applied before the read
or after, maintaining logical consistency.

In contrast, NoSQL databases typically don't support cross-aggregate transactions but do support
atomic updates within a single aggregate, meaning that consistency within a single data object is
guaranteed. However, if an update spans multiple aggregates, there is a risk of inconsistency during a
small time window, referred to as the inconsistency window. For instance, services like Amazon’s
SimpleDB typically have very short inconsistency windows (less than a second).
Replication Consistency:

Replication consistency refers to the challenge of ensuring that different replicas of the same data item
show the same value when read. A classic example is a hotel room booking system replicated across
multiple nodes in different locations. A user in one region may see a room as booked, while another,
due to network delays, still sees it as available. This type of inconsistency is described as eventual
consistency, meaning that while nodes may initially have inconsistent data, they will eventually
synchronize.
Techniques for Mitigating Inconsistencies:
• Read-your-writes consistency: Guarantees that after a user updates data, they can read their
own changes immediately. This is particularly important in systems like blogs, where users
expect to see their comments after posting them.
• Session consistency: Ensures that within a user's session, they will consistently read their own
writes. This can be achieved using sticky sessions (where all requests in a session are routed
to the same server) or version stamps, where each session tracks the most recent update and
ensures the server is up-to-date before reading.
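
A minimal sketch of the version-stamp variant of session consistency, where the session remembers the latest version it wrote and refuses to read from a replica that has not caught up (all names here are invented for the example):

class Replica:
    def __init__(self):
        self.value, self.version = None, 0

class Session:
    # Tracks the most recent version this session has written.
    def __init__(self):
        self.min_version = 0

    def write(self, master, value):
        master.value, master.version = value, master.version + 1
        self.min_version = master.version

    def read(self, replica):
        if replica.version < self.min_version:
            raise RuntimeError("replica is stale for this session; retry elsewhere")
        return replica.value

master, slave = Replica(), Replica()
session = Session()
session.write(master, "my new comment")
print(session.read(master))  # up to date: the user sees their own comment
# session.read(slave) would raise, because the slave has not replicated yet.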
Relaxing Consistency
The Tradeoff of Consistency
The decision to relax consistency is a fundamental tradeoff in system design. While it is always
possible to enforce strong consistency, doing so can slow down the system to an unacceptable degree,
especially in large-scale systems. Relaxing consistency can significantly improve latency and
throughput, which are critical in high-performance applications. The need for such tradeoffs depends
on the specific tolerance of the domain to inconsistencies. For example, banking systems typically
demand high consistency, while social media platforms might be more tolerant of occasional
inconsistencies in data like post counts or likes.
In traditional single-server relational databases, consistency is typically enforced through transactions
that provide ACID properties (Atomicity, Consistency, Isolation, Durability). However, transactions
can be expensive in terms of performance. Most databases provide mechanisms to relax isolation
levels, allowing certain levels of inconsistency in exchange for faster queries. For instance:

• Read-committed isolation ensures that only committed data is read but can still allow some
inconsistent reads.
• Serializable isolation provides the strongest consistency but comes with a significant
performance overhead, making it less practical for high-load systems.
Many applications use the read-committed level as a balance between consistency and performance.
By allowing some degree of inconsistency (e.g., permitting non-repeatable reads between statements),
applications can perform more efficiently, especially in high-concurrency environments.
Forgoing Transactions for Performance
Some systems, particularly high-performance and large-scale web platforms, have found that the cost
of supporting transactions is too high. For example, in the early days of MySQL, transactions were
not supported, and developers favored its speed over strong consistency. Websites were willing to
sacrifice transactions to get faster performance, especially for operations where strict consistency was
less critical.
At the higher end, large systems like eBay have had to avoid transactions entirely due to the
challenges of scaling and performance. Particularly in systems that use sharding (dividing data across
multiple databases for scalability), maintaining strict transactional consistency becomes extremely
difficult. In such cases, applications often rely on mechanisms other than transactions to ensure
acceptable levels of consistency across distributed components.
The CAP Theorem
The CAP theorem, proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy
Lynch in 2002, states that in a distributed system, it is impossible to achieve all three properties
simultaneously: Consistency, Availability, and Partition Tolerance.
• Consistency: Every read receives the most recent write or an error.
• Availability: Every request (to a non-failing node) receives a response, even if it may not
contain the most recent write.
• Partition Tolerance: The system continues to operate despite network partitions.
The theorem essentially claims that in a distributed system that may experience network partitions
(which is common), you must trade off between Consistency and Availability. This tradeoff means
that in the presence of a network partition, you can either:
• Ensure that all data is consistent, but at the cost of availability (the system may refuse
requests while trying to maintain consistency).
• Maintain availability by allowing some inconsistency (i.e., not every request will see the
latest data).
Examples of CAP in Practice:
1. Consistency-Focused (CP): A system prioritizes consistency over availability. For instance,
in a financial system, ensuring that transactions are processed in a strict order is more critical
than immediate availability.
2. Availability-Focused (AP): In some systems like web applications, availability might be
prioritized. For example, a news website may tolerate some inconsistency (showing slightly
stale data) to remain available even during network issues.
3. Mixed Approach: Systems like shopping carts (e.g., in Amazon's Dynamo) often accept
some inconsistency (AP) and resolve conflicts later, merging items added to the cart after a
network partition.
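
A minimal sketch of the merge policy mentioned for shopping carts, assuming a cart is a mapping from item to quantity and that union with the larger quantity is an acceptable business rule:

def merge_carts(cart_a: dict, cart_b: dict) -> dict:
    # Union merge: keep every item either replica saw; when both saw
    # the same item, keep the larger quantity.
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two replicas accepted writes independently during a partition:
replica_1 = {"book": 1, "pen": 2}
replica_2 = {"book": 1, "mug": 1}
print(merge_carts(replica_1, replica_2))  # {'book': 1, 'pen': 2, 'mug': 1}

A known cost of this policy is that a deletion on one replica can be undone by the merge, so the business rule has to tolerate resurrected items.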

Relaxing Durability

While consistency often takes center stage when discussing the ACID properties of databases,
durability is usually seen as non-negotiable—after all, what use is a database if it can lose updates?
However, there are situations where trading some durability for performance might be desirable.
For instance, a database that operates mostly in memory and only periodically flushes data to disk can
provide much faster responsiveness. The tradeoff is that updates made after the last flush may be lost
if the system crashes. This approach is suitable for use cases like user-session states in large websites.
Losing a user session, while annoying, may be preferable to having a slower website. Thus,
nondurable writes can be a valid optimization in such scenarios.
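
A minimal sketch of such a mostly-in-memory store, where writes are acknowledged from memory and a background timer flushes to disk; anything written after the last flush is lost on a crash (the file name and interval are arbitrary choices for the example):

import json
import threading

class MostlyInMemoryStore:
    # Writes are acknowledged from memory; a daemon timer flushes to disk.
    def __init__(self, path="sessions.json", flush_interval=5.0):
        self.path, self.data = path, {}
        self.lock = threading.Lock()
        self.flush_interval = flush_interval
        self._schedule_flush()

    def put(self, key, value):
        with self.lock:
            self.data[key] = value  # durable only after the next flush

    def _schedule_flush(self):
        timer = threading.Timer(self.flush_interval, self._flush)
        timer.daemon = True  # do not keep the process alive just to flush
        timer.start()

    def _flush(self):
        with self.lock:
            with open(self.path, "w") as f:
                json.dump(self.data, f)
        self._schedule_flush()

store = MostlyInMemoryStore()
store.put("session:42", {"user": "alice", "cart": ["book"]})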
Durability can also be relaxed when dealing with telemetric data. In some cases, missing a few
updates due to a server failure is acceptable if it allows for capturing data at a higher rate.
Replication also introduces durability tradeoffs. In a master-slave replication model, if the master fails
before replicating updates to the slaves, those updates may be lost. One solution is to ensure the
master waits for acknowledgments from replicas before confirming an update, but this introduces
latency and reduces availability if replicas are slow or fail.
Quorums
When trading off consistency or durability, there are ways to balance these concerns by involving
multiple nodes in a request. This leads to the concept of quorums.
For data replicated over several nodes, you don’t need all nodes to acknowledge a write to achieve
strong consistency; instead, you need a write quorum—more than half of the nodes must participate
in the write. If conflicting writes happen, only one can achieve a majority. The required number of
nodes for a write quorum is often expressed as W > N/2, where W is the number of nodes
participating in the write and N is the replication factor (the total number of nodes storing replicas
of the data).
There’s also a read quorum, which defines how many nodes need to confirm a read to ensure the
most up-to-date data is retrieved. If writes require a quorum, fewer nodes need to be contacted for a
read to ensure consistency. However, if writes are confirmed by fewer nodes, more nodes must be
contacted for a consistent read.
The relationship between reads, writes, and replication can be expressed as R + W > N, meaning you
can achieve a strongly consistent read if the number of nodes involved in the read (R) and the write
(W) combined exceeds the replication factor (N).
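
These two inequalities are easy to check mechanically; here is a small sketch that evaluates whether a given configuration yields majority writes and strongly consistent reads:

def quorum_properties(n: int, w: int, r: int) -> dict:
    # n = replication factor, w = write quorum, r = read quorum.
    return {
        "write_quorum":    w > n / 2,  # conflicting writes cannot both win a majority
        "consistent_read": r + w > n,  # every read set overlaps every write set
    }

# Common configuration: replication factor 3, majority reads and writes.
print(quorum_properties(n=3, w=2, r=2))  # both properties hold
# Fast writes (w=1) force reads to contact every replica to stay consistent.
print(quorum_properties(n=3, w=1, r=3))  # consistent_read holds; write_quorum does not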
In a master-slave setup, avoiding write-write conflicts requires only writing to the master and reading
from the master, simplifying the consistency process. In contrast, peer-to-peer replication models
benefit more from quorum-based approaches.
A typical replication factor of 3 allows for resilience against node failures, maintaining quorums even
when a single node is lost. Automatic rebalancing can quickly restore a lost replica, minimizing the
risk of further data loss. However, the number of nodes involved in an operation may vary depending
on the importance of consistency versus speed. For instance, an operation requiring fast but strongly
consistent reads might require all nodes to confirm writes, allowing reads from a single node.
Ultimately, database architects can choose from a range of options when balancing consistency,
availability, and durability, making the tradeoff more nuanced than a simple binary choice.
Business and System Transactions
The concept of supporting update consistency in systems without relying on traditional database
transactions is crucial in many business processes. Business transactions, such as browsing products
and making purchases, often span longer periods and involve user interactions. If these were executed
within a traditional system transaction, database elements would remain locked for extended periods,
which is impractical.
Instead, systems often start the actual system transaction only when finalizing the interaction,
minimizing the time database locks are held. However, the challenge arises when decisions, like
pricing or shipping, are based on outdated data that may have changed during the user's session. This
leads to the risk of inconsistencies in the final transaction.
Handling Update Consistency
Optimistic Offline Lock is a key strategy to manage this. It allows the system to check for changes in
the data before updating it, preventing issues when working with stale data. This is done by using
version stamps—metadata associated with a record that changes with every update. When a client
reads the data, it records the version stamp. Before updating, it checks if the version stamp has
changed, ensuring the update is made only if the data is current.
Methods of Implementing Version Stamps
There are various ways to implement version stamps:
1. Counters: Incremented with each update, they allow easy comparison of versions but require
a centralized system to avoid duplication.
2. GUIDs (Globally Unique Identifiers): These are large random numbers ensuring uniqueness
across systems but can't be used to track the order of updates.
3. Content Hashes: A hash of the resource content that is unique to the data. Like GUIDs, they
ensure consistency but don’t allow tracking the sequence of changes.
4. Timestamps: Used to track when data was last updated. They are compact and allow easy
comparison for recentness but require synchronized clocks across systems.
Some systems use composite version stamps, combining approaches like counters and content hashes
for more robust conflict detection. For example, CouchDB uses this approach, enabling the system to
track both the recentness of data and conflicts in distributed environments.
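
As a rough illustration of these options (not CouchDB's actual implementation), here is how each kind of stamp might be produced, plus a composite combining a counter with a content hash:

import hashlib
import time
import uuid

counter = 0

def counter_stamp() -> int:
    # Monotonic counter: easy to compare, but needs a single issuing authority.
    global counter
    counter += 1
    return counter

def guid_stamp() -> str:
    # Globally unique, but two GUIDs cannot be ordered by recentness.
    return str(uuid.uuid4())

def content_hash_stamp(content: str) -> str:
    # Deterministic for identical content; also carries no ordering.
    return hashlib.sha1(content.encode("utf-8")).hexdigest()

def timestamp_stamp() -> float:
    # Orderable and compact, but relies on synchronized clocks.
    return time.time()

record = '{"phone": "555-0101"}'
# Composite stamp in the spirit of counter-plus-hash schemes:
print(f"{counter_stamp()}-{content_hash_stamp(record)}")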
Version stamps are not only helpful for preventing update conflicts but also for ensuring session
consistency, where data remains coherent throughout a user's interaction with the system.
Version Stamps on Multiple Nodes
In distributed systems, it's possible that two different nodes might provide different answers when
queried for the same data. This difference could arise from:
• Replication Lag: One node has received an update that hasn't yet reached another.
• Inconsistent Updates: Both nodes may have processed conflicting updates independently,
leading to divergent versions of the data.

Simple Counter Approach


In a simple replication scenario, you could use a counter-based version stamp where each node
increments a counter whenever it updates data. For example, if one node provides a version stamp of
4 and another provides 6, you know the version with 6 is more recent.
However, this approach breaks down in multi-master or peer-to-peer models, where there is no
single authoritative source to manage updates.
Vector Stamps
To address this challenge, vector stamps (also known as vector clocks or version vectors) are
commonly used. Vector stamps track a version for each node in the system, allowing them to handle
more complex scenarios.
For instance, if you have three nodes (blue, green, and black), a vector stamp would look like:
[blue: 43, green: 54, black: 12]
Each time a node updates data, it increments its own counter. If the green node updates the data, the
stamp would become:
[blue: 43, green: 55, black: 12]


When nodes communicate with each other, they synchronize their vector stamps. This way, it’s
possible to:
1. Detect which version is newer: If one version has all counters greater than or equal to
another, it’s the newer version.
o For example, [blue: 1, green: 2, black: 5] is newer than [blue: 1, green: 1, black: 5]
since the green counter is higher.
2. Detect conflicts: If the stamps have counters that are both higher than the other in different
positions, this signals a write-write conflict.
o For example, [blue: 1, green: 2, black: 5] and [blue: 2, green: 1, black: 5] indicate that
both blue and green nodes have updated the data independently, resulting in a
conflict.
Handling Missing Values
Vector stamps are flexible, allowing you to add new nodes to the system without invalidating existing
version stamps. Missing values are treated as 0, so [blue: 6, black: 2] would be considered as [blue: 6,
green: 0, black: 2].
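
The comparison rules above, including the missing-entries-count-as-zero convention, fit in a few lines; this is a generic sketch rather than any particular database's algorithm:

def compare(v1: dict, v2: dict) -> str:
    # Compare two vector stamps; returns 'newer', 'older', 'equal', or 'conflict'.
    nodes = set(v1) | set(v2)
    greater = any(v1.get(n, 0) > v2.get(n, 0) for n in nodes)
    less = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    if greater and less:
        return "conflict"  # concurrent writes: neither descends from the other
    if greater:
        return "newer"
    if less:
        return "older"
    return "equal"

print(compare({"blue": 1, "green": 2, "black": 5},
              {"blue": 1, "green": 1, "black": 5}))  # newer
print(compare({"blue": 1, "green": 2, "black": 5},
              {"blue": 2, "green": 1, "black": 5}))  # conflict
print(compare({"blue": 6, "black": 2},
              {"blue": 6, "green": 0, "black": 2}))  # equal (missing green = 0)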
Tradeoffs in Using Vector Stamps
While vector stamps help detect inconsistencies, they don’t resolve them. The actual resolution
depends on the application’s domain logic, and it's part of the larger consistency vs. latency tradeoff
inherent in distributed systems:
• You can ensure strong consistency but may have to deal with higher latency or availability
issues during network partitions.
• Alternatively, you can choose to detect inconsistencies (via vector stamps) and handle them
post-factum, allowing for better latency and availability but requiring conflict resolution
mechanisms.
-------------------------------------------------END OF 2ND MODULE--------------------------------------------
