
Big Data Computing

Prof. Rajiv Misra

Computer Science and Engineering, IIT Patna

Lecture – 14

CAP Theorem

Refer Slide Time: (0:16)

The CAP theorem was proposed by Eric Brewer at Berkeley and was subsequently proved by Gilbert and Lynch. It states that in a distributed system you can provide at most two out of three different guarantees; this is what is now known as the CAP theorem. Let us see what these three guarantees are. The first is called 'Consistency', the second is called 'Availability', and the third is called 'Partition tolerance'. The CAP theorem says that, out of these three, at most two can be guaranteed by any system. Consistency means that all the nodes see the same data at any time, or that a read returns the latest value written by any client; that is, whenever a read operation is performed, it returns the latest write made by any client. If that is guaranteed at all points of time, the system is consistent. Availability means that the system allows operations all the time and that operations return quickly: the system is always operating, and whenever it is accessed it performs the operation and returns very quickly. The third criterion, 'Partition tolerance', means that the system continues to work in spite of network partitions. CAP stands for Consistency, Availability and Partition tolerance, hence the name 'CAP theorem'. This is a design issue, and we are going to see the implications of the CAP theorem for the different NoSQL systems that are available today.
Refer Slide Time: (3:26)

Now let us look at the three properties specified in the CAP theorem, starting with availability and why it is important. Availability means that reads and writes complete reliably and quickly, at all points of time. Suppose the measured read and write operation times increase by 500 milliseconds of latency, and consider the implication for a company's operations. For companies like Amazon or Google, such a 500 millisecond latency has been shown to cause a huge cost: a drop of about 20% in revenue. Amazon.com deals with the sale of items, so if there is 500 milliseconds of extra latency, customers will churn and revenue will drop; each added millisecond of latency is estimated to imply a six-million-dollar yearly loss. There is also a user cognitive drift: if more than a second elapses between the click and the material appearing, the user's mind is already somewhere else, and this leads to churn among the customers of any business. So availability is one of the most important parameters in such cases. Companies dealing with e-commerce and online sales, such as Amazon, or Google, which serves advertisements and other online products and services, have to ensure high availability, and not only reliability: the latency also has to be minimal, so that users get their services quickly. Therefore the service level agreements written by providers predominantly deal with the latencies faced by clients, and availability is one of the most important criteria in the CAP theorem.

Refer Slide Time: (6:11)

The second criterion, the C of the CAP theorem, is 'Consistency'. Consistency means that all the nodes see the same data at any time, or that reads return the latest value written by any client. Whatever updates happen on the nodes through write operations are applied, so the nodes stay up to date, and whenever a read is requested it always returns the latest write done by any client. If that is maintained at all points of time, it is called 'Consistency'. For example, when you access your bank account or investment account from multiple clients, via laptop, workstation, phone, tablet, etc., you want updates done from one client to be visible to the other clients. If you have made an update through your mobile phone and are now reading via your laptop, the laptop may not see that recent update; if that is the case, the system is not consistent. Similarly, when thousands of customers are trying to book a flight ticket, all updates from one client should be visible to the other clients in the airline reservation system. So consistency is also an important parameter.

Refer Slide Time: (7:49)

Now, the other parameter in CAP is P, which stands for partition tolerance. Partitions can happen across data centers when the Internet gets disconnected; this can be caused by Internet router outages, undersea cable cuts, or DNS not working. In all these scenarios the network gets partitioned, and some of the data centers become disconnected from the rest. Partitions can also occur within a data center, through rack or switch outages, so rack switches too can induce partitions in the network. Still, the desire is for the system to continue functioning normally under a partition; that is why this property is called 'Partition Tolerance'. Partition tolerance ensures functioning even in the face of partitions, which are also called network failures.

Refer Slide Time: (9:24)

Now, the fallout of the CAP theorem: since partition tolerance is essential in today's cloud computing systems, the CAP theorem implies that a system has to choose between consistency and availability. Consider Cassandra: it uses eventual consistency, which is a weak form of consistency, and it ensures availability and partition tolerance. So Cassandra chooses the A and P of CAP, and C is relaxed into what is called 'eventual consistency', a weak form of consistency. Traditional database management systems, by contrast, provide strong consistency and availability; since the database sits on a single system, a partition cannot happen, and hence CAP is always satisfied or guaranteed in traditional RDBMS systems.

Refer Slide Time: (10:40)

Now the CAP trade-off, which is the starting point for NoSQL systems: a distributed storage system can achieve at most two out of the three properties of consistency, availability and partition tolerance. When partition tolerance is important, you have to choose between consistency and availability. This is exactly what happens in Cassandra: it chooses availability together with a weaker form of consistency. Similarly, Riak, Dynamo and Voldemort choose partition tolerance and availability with a weaker form of consistency. As far as RDBMS is concerned, an RDBMS that is not replicated ensures consistency and availability, and there is no partition, so consistency and availability are guaranteed. HBase, on the other hand, prefers partition tolerance and consistency over availability; HBase, Hypertable, BigTable and Spanner choose consistency and partition tolerance rather than availability.

Refer Slide Time: (12:04)

Let us now look at eventual consistency, the weak form of consistency supported in Cassandra, and see in detail how Cassandra trades off consistency to ensure partition tolerance and availability. If all writes to a particular key stop, then all its values at the replicas will eventually converge. In other words, if some writes have not yet taken effect at some of the replicas, those replicas will be updated and become consistent after some time; hence they eventually converge. If writes continue, the system still always tries to keep converging: there is a moving wave of updated values, lagging behind the latest values sent by the clients but always trying to catch up, so some difference will always exist. If no new values arrive, the replicas eventually converge to the latest value. While writes continue, the system keeps trying to converge and may still return some stale values to a client when there are many back-to-back writes, but it works well when there are periods of low writes, during which the system can converge quickly.
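The convergence behaviour can be sketched in a few lines of Python (an illustration only, not Cassandra's actual implementation; all names and numbers are made up): a write initially lands on only some replicas, and a background anti-entropy pass spreads the newest (value, timestamp) pair until every replica agrees once writes have stopped.

# Eventual convergence with last-writer-wins replicas (illustrative sketch).
import random

N = 5
replicas = [{} for _ in range(N)]            # each replica: key -> (value, ts)

def write(key, value, ts, fanout=2):
    # Apply the write to only a few replicas, imitating replication lag.
    for r in random.sample(replicas, fanout):
        if key not in r or r[key][1] < ts:
            r[key] = (value, ts)

def anti_entropy(key):
    # One convergence round: every replica adopts the newest version seen.
    newest = max((r[key] for r in replicas if key in r), key=lambda v: v[1])
    for r in replicas:
        r[key] = newest

write("cart:42", "book", ts=1)
write("cart:42", "book+pen", ts=2)           # later write, applied only partially

anti_entropy("cart:42")                      # writes have stopped: converge
assert all(r["cart:42"] == ("book+pen", 2) for r in replicas)
print("all replicas converged to", replicas[0]["cart:42"])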

Refer Slide Time: (14:34)

Now let us compare RDBMS with key-value stores. RDBMS provides the ACID properties: atomicity, consistency, isolation and durability. A key-value store like Cassandra provides the BASE properties instead: Basically Available, Soft state, Eventual consistency. 'Basically available' means the system emphasizes availability. 'Soft state' we have seen in the form of caching and in-memory operations: much of the table content is kept in memory, in the memtable. 'Eventual consistency' means that everything will finally be updated, not immediately but eventually.

Refer Slide Time: (15:37)

So in the BASE model it prefers availability over strong consistency. Now let us see the consistency levels supported in Cassandra. Cassandra offers several consistency levels, and the client is allowed to choose a consistency level for each operation, read or write. The levels are ANY, ALL, ONE and QUORUM; let us understand them one by one. With the ANY consistency level, any server, which may not even be a replica, can accept the operation; this is the fastest option, because the coordinator caches the write and returns quickly to the client. The second level is ALL: the write has to be applied at all the replicas. This is a kind of strong consistency, so if the consistency level is ALL it is the slowest option. The third level is ONE: at least one replica must be updated. It is faster than ALL but slower than ANY, because unlike ANY it cannot tolerate the failure of all the replicas. The fourth level is QUORUM: a quorum across the replicas in the data center must be updated. What a quorum means we will see next; it sits between ALL and ONE, being some number k of replicas, and we are going to see how many replicas must be updated under the quorum scheme.
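To make this concrete, here is a minimal sketch assuming the DataStax Python driver (cassandra-driver); the contact point, keyspace and table names are hypothetical and not part of the lecture, and the snippet only shows how a per-operation consistency level could be chosen.

# Per-operation consistency levels with the (assumed) cassandra-driver.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])             # contact point is an assumption
session = cluster.connect('demo_ks')         # hypothetical keyspace

# QUORUM for the write: acknowledged once a majority of replicas have it.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (42, 'alice'))

# ONE for the read: fast, but may return a slightly stale value.
read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
row = session.execute(read, (42,)).one()
print(row.name)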

Refer Slide Time: (17:41)

Quorums for consistency: a quorum is a majority. In this example there are five replicas, so the majority, more than fifty percent, is three; at least three replicas must be updated. The key property a quorum system has to satisfy is that any two quorums intersect: if we take the intersection of any two quorums, there must be at least one replica (one server) common to both. So if client 1 performs its write in one quorum and client 2 reads from the blue quorum, client 2 will see the update, because there is a common server in both quorums: at least one server in the blue quorum has the latest write. Quorums are faster than ALL but still ensure strong consistency.
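As a small illustration (not from the lecture), the following sketch enumerates every majority quorum for N = 5 replicas and verifies that any two of them share at least one replica:

# Any two majority quorums over N replicas must intersect.
from itertools import combinations

N = 5
majority = N // 2 + 1                        # 3 out of 5
quorums = [set(q) for q in combinations(range(N), majority)]

assert all(q1 & q2 for q1, q2 in combinations(quorums, 2))
print(f"every pair of the {len(quorums)} majority quorums overlaps")
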
Refer Slide Time: (18:46)

Let us look at quorums in more detail. Several key-value/NoSQL stores, such as Riak and Cassandra, use quorums. For reads, the client specifies a value R, the read consistency level, which is at most the number of replicas N. The coordinator waits for R replicas to respond before sending the result to the client. In the background, the coordinator checks the remaining N − R replicas for consistency and initiates a read repair if any of them are stale. In other words, the read itself is satisfied by R replicas, but the remaining N − R replicas must also be checked for consistency; this is done in the background, and if they are inconsistent a read repair is initiated, so that eventually they become consistent at all levels.
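The mechanism just described can be sketched as a tiny simulation (illustrative only, not Cassandra's code; the replica contents are made up): the coordinator answers the client as soon as R replicas have responded and then repairs the remaining N − R replicas in the background, using last-writer-wins timestamps.

# Simulated quorum read with background read repair.
N, R = 3, 2                                  # 3 replicas, read quorum of 2

# Each replica stores (value, timestamp); the third replica is lagging.
replicas = [("v2", 2), ("v2", 2), ("v1", 1)]

def quorum_read():
    # The coordinator takes the first R responses and answers the client.
    responses = replicas[:R]
    value, ts = max(responses, key=lambda v: v[1])

    # Background read repair: bring the remaining N - R replicas up to date.
    for i in range(R, N):
        if replicas[i][1] < ts:
            replicas[i] = (value, ts)
    return value

print(quorum_read())                         # -> v2
print(replicas)                              # the stale replica has been repaired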

Refer Slide Time: (19:55)

So reads sometimes perform a read repair to ensure eventual consistency. Now, quorums also come into play for writes: when a client writes, a write consistency level W (W ≤ N) has to be specified, and the client's new value is written to W replicas. There are two flavors: either the coordinator blocks until the quorum of W replicas is reached, or the write is asynchronous, meaning the coordinator just issues the write and returns immediately.

Refer Slide Time: (20:34)

∙ Now, if R is the read replica count and W is the write replica count, then there are two necessary conditions for the quorums (a small validation sketch follows this list):
1. W + R > N
2. W > N/2
Select the values based on the application:
∙ (W=1, R=1): very few writes and reads

∙ (W=N, R=1): great for read-heavy workloads

∙ (W=N/2+1, R=N/2+1): great for write-heavy workloads

∙ (W=1, R=N): great for write-heavy workloads with mostly one client writing per key
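A minimal validation sketch (illustrative, not from the lecture):

# The two quorum conditions: a read quorum must overlap every write quorum
# (W + R > N), and any two write quorums must overlap (W > N/2).
def is_strict_quorum(n, w, r):
    return w + r > n and 2 * w > n

N = 5
print(is_strict_quorum(N, w=N // 2 + 1, r=N // 2 + 1))   # True: majority quorums
print(is_strict_quorum(N, w=N, r=1))                     # True: write-all, read-one
print(is_strict_quorum(N, w=1, r=1))                     # False: no overlap guarantee

Note that a configuration such as (W=1, R=1) does not satisfy these strict conditions; it trades the overlap guarantee for speed and relies on eventual consistency.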

Refer Slide Time: (21:54)

So we have seen quorums across all the replicas in all the data centers, which ensure global consistency and are still fast. A local quorum, that is, a quorum within the coordinator's data center, is faster: it only waits for the quorum in the first data center the client contacts. With a quorum in every data center, each data center does its own quorum, which supports hierarchical replies.
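Assuming again the DataStax Python driver (an assumption; the keyspace and table are hypothetical), these data-center-aware options correspond to the LOCAL_QUORUM and EACH_QUORUM consistency levels:

# Data-center-aware quorum levels (names from the cassandra-driver enum).
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Wait only for a quorum in the coordinator's own data center (faster):
local = SimpleStatement("SELECT * FROM demo_ks.users WHERE id = 42",
                        consistency_level=ConsistencyLevel.LOCAL_QUORUM)

# Require a quorum in every data center (each data center does its own quorum):
each = SimpleStatement("UPDATE demo_ks.users SET name = 'bob' WHERE id = 42",
                       consistency_level=ConsistencyLevel.EACH_QUORUM)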

Refer Slide Time: (22:23)

So the type of consistency that Cassandra provides is called 'eventual consistency'. What other weak forms of consistency models are available in practice? That we will see in the next slide.
