Lecture – 14
CAP Theorem
The CAP theorem was proposed by Eric Brewer of Berkeley and was subsequently proved by Gilbert and
Lynch. In a distributed system you can satisfy at most two out of three guarantees; this is what is
known as the CAP theorem. Let us see what these three guarantees are. The first one is called
‘Consistency’, the second one ‘Availability’, and the third one ‘Partition tolerance’. The CAP theorem
says that, out of these three, at most two can be guaranteed by any system. Consistency means that all
the nodes see the same data at any time, or equivalently that a read returns the latest value written by
any client. So, whenever a read operation is performed, it returns the latest write by any client; if that is
guaranteed at all points of time, then it is called ‘Consistency’. Availability says that the system allows
operations all the time and that operations return quickly. Meaning to say, the system is always
operating, and whenever it is accessed it returns the result of the operation very quickly. The third
criterion, ‘Partition tolerance’, means that the system continues to work in spite of network partitions.
So C, A, P stands for consistency, availability and partition tolerance; in short, it is called the ‘CAP
theorem’. This is a design issue, and we are going to see the implications of the CAP theorem for the
different NoSQL systems that are available at this point of time.
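To make the trade-off concrete, here is a minimal sketch, not taken from the lecture slides. It assumes a toy store with just two replicas, A and B, and a hypothetical `mode` flag that says what the store gives up when the network partitions: availability ("CP") or consistency ("AP"). All names here are made up for illustration.

```python
class TwoReplicaStore:
    """Toy store with two replicas, A and B. `mode` is "CP" or "AP" (hypothetical labels)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = None             # value held by replica A
        self.b = None             # value held by replica B
        self.partitioned = False  # True when A and B cannot talk to each other

    def write_at_a(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Give up availability: refuse the write rather than let replicas diverge.
                raise RuntimeError("write rejected during partition")
            # Give up consistency: accept the write locally, so B is now stale.
            self.a = value
            return
        # No partition: the write reaches both replicas.
        self.a = value
        self.b = value

    def read_at_b(self):
        return self.b


def demo(mode):
    """Write v1, partition the network, try to write v2, then read from B."""
    store = TwoReplicaStore(mode)
    store.write_at_a("v1")
    store.partitioned = True
    try:
        store.write_at_a("v2")
        return store.read_at_b()
    except RuntimeError as err:
        return f"error: {err}"
```

Calling `demo("AP")` returns the stale value `"v1"` (available but not consistent), while `demo("CP")` returns an error string (consistent but not available). With a partition present, the toy store cannot offer both.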
Refer Slide Time: (3:26)
Now, let us look at each of the three properties specified in the CAP theorem. First, let us understand
availability and why it is important. Availability says that reads and writes complete reliably and
quickly at all points of time. Suppose we measure the read and write operation times and the latency
increases by 500 milliseconds; let us see what that implies for a company's operations. It has been
shown that, for companies like Amazon or Google, a 500-millisecond increase in latency can cost a 20%
drop in revenue. Amazon.com deals with the sale of items, so if there is an added latency of 500
milliseconds, customers will churn and revenue will drop. At Amazon, each added millisecond of
latency implies a $6 million yearly loss. There is also user cognitive drift: if more than a second elapses
between the click and the material appearing, the user's mind is already somewhere else, and this leads
to customer churn in any business. So this is the most important parameter in some cases. Companies
that deal with e-commerce and online sales, such as Amazon, or Google, which serves advertisements
and other products and services online, have to ensure high availability: not only reliability, but also
minimal latency, so that users can get the services quickly. Therefore the service level agreements
(SLAs) written by providers predominantly deal with the latencies faced by clients, and availability is
one of the most important criteria in the CAP theorem.
Now, the second criterion, the C of the CAP theorem, is called ‘Consistency’. Consistency says that all
the nodes see the same data at any time, or that a read returns the latest value written by any client.
Meaning to say, whatever updates happen on the nodes through different write operations, the nodes are
kept up to date, so whenever a read is requested it always returns the latest write done by any client. If
that is maintained at all points of time, it is called ‘Consistency’. Now, when you access your bank
account or investment account via multiple clients (laptop, workstation, phone, tablet, etc.), you want
the updates done from one client to be visible to the other clients. If you have made an update through
your mobile phone, and you then read via your laptop and do not get that recent update, then the system
is not consistent. Similarly, when thousands of customers are looking to book a flight ticket, all the
updates from one client should be accessible to the other clients in the airline reservation system. So
consistency is also going to be an important parameter.
So, in the BASE model, the system prefers availability over consistency. Now, let us see the consistency
levels that are supported in Cassandra. Cassandra has different consistency levels, and the client is
allowed to choose the consistency level for each read and write operation. The levels are: first ANY,
second ALL, third ONE and fourth QUORUM. Let us understand these consistency levels one by one.
ANY means that any server, which may not even be a replica, can allow the operation to complete. This
is the fastest level, because the coordinator caches the write and returns quickly to the client. The second
one is ALL, that is, all replicas: the operation has to be applied at all the replicas. This gives strong
consistency, but ALL is the slowest level. The third one is ONE, that is, at least one of the replicas gets
updated. It is faster than ALL but slower than ANY, and it cannot tolerate a failure. The fourth one is
QUORUM, that is, a quorum across all the replicas in the data centers has to be updated. What do we
mean by a quorum? A quorum lies between ALL and ONE: it is some number K of replicas. We are
going to see how many replicas are required to be updated under the quorum system.
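As a rough illustration of how the four levels differ, the sketch below maps each level to the number of replica acknowledgements the coordinator would wait for. The function `required_replica_acks` is a hypothetical helper, not Cassandra's API, and treating ANY as "zero replica acknowledgements" is a simplification of the statement that any server, even a non-replica, may complete the operation.

```python
def required_replica_acks(level: str, n_replicas: int) -> int:
    """Return how many replicas must acknowledge before the client gets a reply.

    Hypothetical helper for illustration only; `level` is one of
    "ANY", "ONE", "QUORUM", "ALL".
    """
    level = level.upper()
    if level == "ANY":
        # Simplification: the coordinator itself may accept the write,
        # so no replica has to confirm before replying.
        return 0
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # Majority: strictly more than half of the replicas.
        return n_replicas // 2 + 1
    if level == "ALL":
        return n_replicas
    raise ValueError(f"unknown consistency level: {level}")


# With 5 replicas, fewer required acknowledgements means a faster reply.
acks_for_five = {lvl: required_replica_acks(lvl, 5) for lvl in ("ANY", "ONE", "QUORUM", "ALL")}
```

With five replicas, `acks_for_five` comes out as 0, 1, 3 and 5 for ANY, ONE, QUORUM and ALL, which matches the ordering from fastest to slowest described above.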
So, quorums for consistency. A quorum means a majority. In this example there are five replicas, so a
majority means more than fifty percent, which is three. So at least three replicas have to be updated.
The property that a quorum system has to satisfy is that, if we take any two quorums, their intersection
must be non-empty: there must be at least one server common to the two. So, any two quorums always
intersect. Suppose client 1 performs a write in the write quorum and client 2 reads from the blue
quorum; then client 2 will get the update, because there is a common server in both quorums, so at least
one server in the blue quorum has the latest write. So quorums are faster than ALL, but they still ensure
strong consistency.
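The intersection property can be checked by brute force for a small number of replicas. The sketch below is only an illustration: it enumerates every majority quorum and tests every pair, and the function names are made up for this example.

```python
from itertools import combinations


def majority_size(n: int) -> int:
    """Smallest number of replicas that is strictly more than half of n."""
    return n // 2 + 1


def all_majority_quorums_intersect(n: int) -> bool:
    """Enumerate every majority quorum of n replicas and check each pair overlaps."""
    quorums = [set(q) for q in combinations(range(n), majority_size(n))]
    # A non-empty set intersection is truthy, so this checks every pair shares a replica.
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


five_replica_quorum = majority_size(5)            # 3, as in the example above
five_replicas_ok = all_majority_quorums_intersect(5)
```

For five replicas the quorum size is 3 and every pair of quorums shares at least one replica, since two groups of three cannot fit into five servers without overlapping.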
Refer Slide Time: (18:46)
So, let us see quorums in more detail. Several key-value (NoSQL) stores, such as Riak and Cassandra,
use quorums. For reads, the client specifies a value R, the size of the read quorum, which is at most the
number of replicas N. So R is the read consistency level. The coordinator waits for R replicas to
respond before sending the result to the client. In the background, the coordinator checks the
consistency of the remaining N - R replicas and initiates a read repair if needed. Meaning to say, the
read itself is satisfied by R replicas, but the remaining N - R replicas also need to be checked for
consistency. That is done in the background, and if they are inconsistent a read repair is initiated, so
that eventually all the replicas become consistent.
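A minimal sketch of this read path is given below. It is a simplification under stated assumptions: each replica is modelled as a dictionary with a `value` and a timestamp `ts`, the coordinator simply contacts the first R replicas, and the "background" repair runs inline. The function `quorum_read` and this data layout are hypothetical, not how Cassandra or Riak implement it.

```python
def quorum_read(replicas, r):
    """Read from the first `r` replicas, then repair stale replicas.

    `replicas` is a list of dicts like {"value": ..., "ts": ...} (assumed layout).
    Returns (value_sent_to_client, number_of_replicas_repaired).
    """
    if not 1 <= r <= len(replicas):
        raise ValueError("R must be between 1 and N")

    # Step 1: wait for R replicas and answer with the newest value among them.
    contacted = replicas[:r]
    result = max(contacted, key=lambda rep: rep["ts"])["value"]

    # Step 2 (background in a real system): compare all N replicas and
    # bring every stale one up to the newest version seen.
    newest = max(replicas, key=lambda rep: rep["ts"])
    repaired = 0
    for rep in replicas:
        if rep["ts"] < newest["ts"]:
            rep["value"], rep["ts"] = newest["value"], newest["ts"]
            repaired += 1
    return result, repaired


# Five replicas; a write with timestamp 2 reached only two of them.
example = [
    {"value": "v2", "ts": 2},
    {"value": "v1", "ts": 1},
    {"value": "v2", "ts": 2},
    {"value": "v1", "ts": 1},
    {"value": "v1", "ts": 1},
]
read_value, repaired_count = quorum_read(example, r=3)
```

Here the read with R = 3 returns `"v2"`, and the repair step updates the three stale replicas, so afterwards all five hold `"v2"`.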
∙ Now, let R be the read replica count and W the write replica count. Then there are two
necessary conditions in the quorum:
1. W+R > N
2. W > N/2
Select values based on application
∙ (W=1, R=1): very few writes and reads
∙ (W=1, R=N): great for write-heavy workloads with mostly one client writing per key
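The two conditions can be evaluated directly for any choice of N, W and R. The helper below is a hypothetical illustration; the labels it uses for the two conditions (read/write overlap and write/write overlap) are an interpretation added here, not terms from the lecture.

```python
def check_quorum(n: int, w: int, r: int) -> dict:
    """Evaluate the two quorum conditions for N replicas, write count W, read count R."""
    return {
        # W + R > N: every read quorum shares a replica with every write quorum.
        "read_write_overlap": w + r > n,
        # W > N/2: any two write quorums share a replica.
        "write_write_overlap": w > n / 2,
    }


n = 5
majority = check_quorum(n, w=3, r=3)      # both conditions hold
w1_r1 = check_quorum(n, w=1, r=1)         # neither condition holds
w1_rn = check_quorum(n, w=1, r=n)         # only W + R > N holds
```

For N = 5, the majority setting W = 3, R = 3 satisfies both conditions. The (W=1, R=1) setting satisfies neither, and (W=1, R=N) satisfies only the first, which is consistent with it being suggested for workloads where mostly one client writes each key.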
So, we have seen the quorum across all the replicas in all the data centers; it ensures global
consistency and is still fast. Then there is the local quorum: a quorum in the coordinator's data center.
It is faster, because it only waits for a quorum in the first data center the client contacts. Finally, each
quorum: a quorum in every data center, which lets each data center do its own quorum and supports
hierarchical replies.
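The difference between these three variants can be sketched as the number of acknowledgements needed per data center. This is only an illustration: the function `acks_needed` and the data center names are hypothetical, and the level names `QUORUM`, `LOCAL_QUORUM` and `EACH_QUORUM` are used here as labels for the three cases described above.

```python
def acks_needed(level: str, replicas_per_dc: dict, coordinator_dc: str) -> dict:
    """Return the acknowledgements required, keyed by where they must come from.

    `replicas_per_dc` maps a data center name to its replica count (assumed layout).
    """
    def majority(n: int) -> int:
        return n // 2 + 1

    if level == "QUORUM":
        # One majority over all replicas, regardless of data center.
        return {"all_datacenters": majority(sum(replicas_per_dc.values()))}
    if level == "LOCAL_QUORUM":
        # Majority only within the coordinator's data center.
        return {coordinator_dc: majority(replicas_per_dc[coordinator_dc])}
    if level == "EACH_QUORUM":
        # A separate majority inside every data center.
        return {dc: majority(n) for dc, n in replicas_per_dc.items()}
    raise ValueError(f"unknown level: {level}")


layout = {"dc1": 3, "dc2": 3}   # hypothetical: two data centers, three replicas each
global_q = acks_needed("QUORUM", layout, "dc1")
local_q = acks_needed("LOCAL_QUORUM", layout, "dc1")
each_q = acks_needed("EACH_QUORUM", layout, "dc1")
```

With three replicas in each of two data centers, the global quorum needs 4 of the 6 replicas, the local quorum needs 2 replicas in the coordinator's data center, and the each-quorum variant needs 2 replicas in every data center.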