Consistency and Replication (Contd.)
Replica Placement
Replica server placement
Web: geographically skewed request patterns
Where to place a proxy?
Permanent replicas
E.g., mirroring: all replicas mirror the same content
Temporary replicas
Server-initiated
- push caches
Client-initiated
Replica Placement
The logical organization of different kinds of copies of a data store into three concentric rings.
Update Propagation
- Propagate only a notification of an update
- Transfer data from one copy to another
- Propagate the update operation to other copies (active replication)
Push-based
- State at server: list of client replicas and caches
- Messages sent: update (and possibly fetch update later)
- Response time at client: immediate (or fetch-update time)
Pull-based
- State at server: none
- Messages sent: poll and update
- Response time at client: fetch-update time
A comparison between push-based and pull-based approaches to updates.
Hybrid approach: leases
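A lease is a promise by the server to push updates to a client for a limited time; once the lease expires, the client has to poll or request a new lease. The following is a minimal sketch of that idea, assuming a hypothetical LeaseServer class and caller-supplied timestamps (none of these names come from the slides):

```python
class LeaseServer:
    """Hypothetical server that pushes updates only to current lease holders."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.value = None
        self.leases = {}  # client id -> expiry time

    def grant(self, client_id, now):
        """Grant or renew a lease and return the current value."""
        self.leases[client_id] = now + self.lease_seconds
        return self.value

    def write(self, value, now):
        """Store the new value and return the clients that must be pushed to."""
        self.value = value
        live = [c for c, expiry in self.leases.items() if expiry > now]
        # Expired leases are dropped; those clients fall back to pulling.
        self.leases = {c: self.leases[c] for c in live}
        return live


server = LeaseServer(lease_seconds=10)
server.grant("a", now=0)   # lease for "a" expires at time 10
server.grant("b", now=5)   # lease for "b" expires at time 15
pushed_to = server.write("v1", now=12)
print(pushed_to)           # ['b']
```

The lease length is the tuning knob: a long lease behaves like push (more server state), a very short one behaves like pull.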
Epidemic Protocols
A class of algorithms to provide update propagation in eventually consistent data stores
- Do not solve any update conflicts; only propagate updates to all replicas in as few messages as possible
- Based on the theory of epidemics (spreading of infectious diseases)
- Upon an update, try to infect other replicas as quickly as possible
- Pair-wise exchange of updates (like pair-wise spreading of a disease)
Terminology:
- Infective store: store with an update it is willing to spread
- Susceptible store: store that is not yet updated
Spreading an Epidemic
Anti-entropy
Server P picks a server Q at random and exchanges updates
Three possibilities: only push, only pull, both push and pull
Claim: A pure push-based approach does not help spread updates quickly (Why?)
Pull or initial push with pull work better
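One way to explore the claim is to simulate anti-entropy rounds and count how long each variant takes to reach every server. The sketch below is an illustration only: the function name, the round-based model (every server contacts one random partner per round) and the fixed seed are assumptions, not part of the slides.

```python
import random


def anti_entropy_rounds(n, mode, seed=0):
    """Count rounds until all n servers hold the update.

    mode is "push", "pull", or "push-pull". In each round every server P
    picks one other server Q uniformly at random.
    """
    rng = random.Random(seed)
    infected = [False] * n
    infected[0] = True  # server 0 receives the original update
    rounds = 0
    while not all(infected):
        snapshot = list(infected)  # who was infective when the round began
        for p in range(n):
            q = rng.randrange(n - 1)
            if q >= p:
                q += 1  # never pick yourself
            if mode in ("push", "push-pull") and snapshot[p]:
                infected[q] = True  # P pushes its update to Q
            if mode in ("pull", "push-pull") and snapshot[q]:
                infected[p] = True  # P pulls the update from Q
        rounds += 1
    return rounds


for mode in ("push", "pull", "push-pull"):
    print(mode, anti_entropy_rounds(1000, mode))
```

The intuition to look for in the numbers: once most servers are infective, a pushing server rarely happens to pick one of the few susceptible servers, whereas a susceptible server that pulls is very likely to contact an infective one.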
Removing Data
Deletion of data items is hard in epidemic protocols
Example: server deletes data item x
- No state information is preserved
- Cannot distinguish between a deleted copy and no copy!
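The problem can be seen in a few lines. In this sketch each store is modelled as a plain dictionary and exchange is a hypothetical pair-wise anti-entropy step in which each side copies the items it lacks; both are assumptions made for illustration.

```python
def exchange(a, b):
    """Pair-wise exchange: each store copies the items it lacks from the other."""
    for key, value in list(b.items()):
        a.setdefault(key, value)
    for key, value in list(a.items()):
        b.setdefault(key, value)


p = {"x": 1}
q = {"x": 1}

del p["x"]      # P deletes x and keeps no record of the deletion
exchange(p, q)  # P sees x at Q and treats it as an update it has missed

print(p)        # {'x': 1}
```

Because P keeps no trace of the deletion, the old copy at Q looks like new data and x reappears.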
Used in the Bayou system from Xerox PARC
Bayou: weakly connected replicas
- Useful in mobile computing (mobile laptops)
- Useful in wide area distributed databases (weak connectivity)
Replicated write protocols
- No primary is assumed for a data item
- Writes can take place at any replica
- Active replication
- Quorum-based protocols
- Cache coherence
Can primary-backup protocols provide an implementation of sequential consistency?
If nonblocking protocols are used, is providing sequential consistency easy?
Primary-backup protocol in which the primary migrates to the process wanting to perform an update.
How can the approach be applied to mobile computers that are able to operate in a disconnected mode?
Active Replication
Each replica has an associated process that carries out update operations
Operations need to be carried out in the same order everywhere
How?
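One common answer is a central sequencer that stamps every operation with a sequence number, so that replicas apply operations in stamp order no matter when the messages arrive. The sketch below assumes that approach; the Sequencer and Replica classes are hypothetical names, and the slides do not say which ordering mechanism is intended.

```python
class Sequencer:
    """Hypothetical coordinator that assigns consecutive sequence numbers."""

    def __init__(self):
        self.next_seq = 0

    def assign(self, op):
        seq = self.next_seq
        self.next_seq += 1
        return seq, op


class Replica:
    """Applies operations strictly in sequence-number order."""

    def __init__(self):
        self.value = 0
        self.expected = 0
        self.pending = {}  # seq -> operation, buffered until its turn

    def deliver(self, seq, op):
        self.pending[seq] = op
        while self.expected in self.pending:
            self.value = self.pending.pop(self.expected)(self.value)
            self.expected += 1


sequencer = Sequencer()
m1 = sequencer.assign(lambda v: v + 5)
m2 = sequencer.assign(lambda v: v * 2)

r1, r2 = Replica(), Replica()
r1.deliver(*m1)
r1.deliver(*m2)
r2.deliver(*m2)  # arrives early and is buffered
r2.deliver(*m1)

print(r1.value, r2.value)  # 10 10
```

The sequencer is a single point of failure and a potential bottleneck; totally ordered multicast based on logical timestamps is the usual decentralized alternative.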
Quorum-Based Protocols
Requires clients to request and acquire the permission of multiple servers before reading or writing a replicated file.
Gifford's scheme:
1. NR + NW > N (prevents read-write conflicts)
2. NW > N/2 (prevents write-write conflicts)
Quorum-Based Protocols
Three examples of the voting algorithm:
a) A correct choice of read and write set
b) A choice that may lead to write-write conflicts
c) A correct choice, known as ROWA (read one, write all)
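Gifford's two constraints can be checked mechanically. In the sketch below the function name and the concrete quorum sizes (N = 12 replicas in each case) are assumptions chosen to match the three situations listed above; the figure itself is not reproduced in this text.

```python
def check_quorum(n, n_r, n_w):
    """Evaluate Gifford's two constraints for n replicas."""
    return {
        "read_write_ok": n_r + n_w > n,  # every read quorum overlaps every write quorum
        "write_write_ok": 2 * n_w > n,   # any two write quorums overlap
    }


# Assumed example values with N = 12 replicas.
print(check_quorum(12, 3, 10))  # {'read_write_ok': True, 'write_write_ok': True}
print(check_quorum(12, 7, 6))   # {'read_write_ok': True, 'write_write_ok': False}
print(check_quorum(12, 1, 12))  # {'read_write_ok': True, 'write_write_ok': True}
```

The second choice fails because two disjoint groups of 6 servers could each accept a different write. The third is ROWA: reads are cheap, but a write needs every replica to be reachable.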
Does a 2-resilient algorithm for 6 processes exist? Write it down or sketch a proof for its non-existence.
Proof by contradiction (assuming Byzantine failures). It is known that consensus for 3 processes, one of them Byzantine, cannot be solved. Assume that consensus for 6 processes, 2 of them Byzantine, could be solved by an algorithm a. We construct an algorithm b that solves consensus for 3 processes using a:
- Each of the 3 processes (hosts) simulates 2 processes (guests).
- The input of each guest equals the input of its host.
- Correct hosts simulate correct guests; the Byzantine host simulates Byzantine guests, so at most 2 guests are Byzantine.
- The 6 guests solve consensus using a.
- Each host copies the decision of its guests (all correct guests make the same decision).
Since b works whenever a works, and b is known not to exist, a cannot exist either. Hence no algorithm solves consensus for 6 processes with 2 Byzantine failures.
Consider a business database/intelligence system for a web retailer such as Amazon, where the data updates are sales orders generated by web servers. Each order includes its contents and the customer ID of the person placing it. The resulting database is used both for:
- Tracking inventory and order status by web applications
- Mining suggestions to customers about what to purchase
Describe the demands of and tradeoffs between consistency and performance in this system.
We use the notation W(x):v to denote a write operation to the variable x with the value v, and R(x):v to denote a read operation to the variable x that returns the value v.
Initially all variables are set to zero.
Is the memory underlying the following two processes sequentially consistent?
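Questions of this kind can be checked by brute force: search for one interleaving of all operations that respects each process's program order and in which every read returns the most recent write. The sketch below does exactly that. The function name, the tuple encoding of W(x):v and R(x):v, and the two histories are hypothetical; they are not the histories from the exercise.

```python
def sequentially_consistent(processes):
    """Return True if some legal interleaving explains all reads.

    processes is a list of per-process operation lists; each operation is
    ("W", var, value) or ("R", var, value). Variables start at zero.
    """

    def search(positions, memory):
        if all(pos == len(ops) for pos, ops in zip(positions, processes)):
            return True  # every operation has been scheduled
        for i, ops in enumerate(processes):
            pos = positions[i]
            if pos == len(ops):
                continue
            kind, var, value = ops[pos]
            if kind == "R" and memory.get(var, 0) != value:
                continue  # this read cannot come next in the interleaving
            new_memory = dict(memory)
            if kind == "W":
                new_memory[var] = value
            new_positions = positions[:i] + (pos + 1,) + positions[i + 1:]
            if search(new_positions, new_memory):
                return True
        return False

    return search((0,) * len(processes), {})


# Hypothetical history 1: both processes read the old value after writing.
h1 = [
    [("W", "x", 1), ("R", "y", 0)],
    [("W", "y", 1), ("R", "x", 0)],
]
# Hypothetical history 2: a reader sees x change from 0 to 1.
h2 = [
    [("W", "x", 1)],
    [("R", "x", 0), ("R", "x", 1)],
]

print(sequentially_consistent(h1))  # False
print(sequentially_consistent(h2))  # True
```

The search is exponential in the number of operations, which is fine for exercise-sized histories.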
Conclusion
- Replication and caching improve performance in distributed systems
- Consistency of replicated data is crucial
- Many consistency semantics (models) are possible
- Need to pick an appropriate model depending on the application
- Example: web caching: weak consistency is OK since humans are tolerant of stale information (can reload browser)
- Implementation overheads and complexity grow if stronger guarantees are desired
Example
In a web indexing system used by companies such as Google, the data updates are web pages crawled by concurrent web spiders. The backend then processes the resulting archived graph of web pages to generate indexes that are used to satisfy search requests. Analyze and describe the demands of and tradeoffs between consistency and performance in this system. Which consistency model should be applied in this application?