System Recovery

The document discusses various approaches for system and process recovery in the event of failures. It describes (1) restoring a system or process to a normal operational state through techniques like backward error recovery which restores a process to a previous checkpoint, and forward error recovery which removes errors from the current state; and (2) increasing availability through replication of hardware, software, and data components. The key approaches covered are operation-based recovery using write-ahead logging and state-based recovery using checkpointing and rollback. Consistent checkpointing across distributed processes is also discussed to avoid issues like orphan messages and livelock during recovery.

Uploaded by

Roshan Raju

Recovery

Computer system recovery:

Restore the system to a normal operational state
Process recovery:
Reclaim resources allocated to process,
Undo modification made to databases, and
Restart the process
Or restart process from point of failure and resume
execution
Distributed process recovery (cooperating processes):
Undo effect of interactions of failed process with other
cooperating processes.
Replication (hardware components, processes, data):
Main method for increasing system availability
System:
Set of hardware and software components
Designed to provide a specified service (i.e., meet a set of requirements)

Recovery (cont.)
System failure:
System does not meet requirements, i.e., does not perform its services as specified

Error could lead to system failure

Erroneous System State:
State which could lead to a system failure by a sequence of valid
state transitions
Error: the part of the system state which differs from its intended
value

Error is a manifestation of a fault

Fault:
Anomalous physical condition, e.g. design errors, manufacturing problems, damage, external disturbances.

Classification of failures
Process failure:
Behaviour: process causes system state to deviate from specification (e.g. incorrect computation, process stops execution)
Errors causing process failure: protection violation, deadlocks, timeout, wrong user input, etc.
Recovery: abort process or restart process from prior state
System failure:
Behaviour: processor fails to execute
Caused by software errors or hardware faults (failure of CPU, memory, bus, etc.)
Recovery: system stopped and restarted in correct state
Assumption: fail-stop processors, i.e. system stops execution, internal state is lost

Secondary Storage Failure:
Behaviour: stored data cannot be accessed
Errors causing failure: parity error, head crash, etc.
Recovery/Design strategies:
Reconstruct content from archive + log of activities
Design mirrored disk system
Communication Medium Failure:
Behaviour: a site cannot communicate with another operational site
Errors/Faults: failure of switching nodes or communication links
Recovery/Design strategies: reroute, error-resistant communication protocols

Backward and Forward Error Recovery
Failure recovery: restore an erroneous state to an error-free state
Approaches to failure recovery:
Forward-error recovery:
Remove errors in process/system state (if errors
can be completely assessed)
Continue process/system forward execution
Backward-error recovery:
Restore process/system to previous error-free
state and restart from there

Comparison: Forward vs. Backward error recovery

Backward-error recovery

(+) Simple to implement

(+) Can be used as general recovery mechanism
(-) Performance penalty
(-) No guarantee that fault does not occur again
(-) Some components cannot be recovered
Forward-error Recovery
(+) Less overhead
(-) Limited use, i.e. only when the impact of faults is understood
(-) Cannot be used as a general mechanism for error recovery

Backward-Error Recovery: Basic approach

Principle: restore process/system to a known, error-free recovery point/checkpoint.

System model:
Figure: a CPU with main memory (MM), secondary storage and stable storage
Secondary storage: objects are brought to MM to be accessed and written back if modified
Stable storage: stores logs and recovery points; it is storage that maintains information in the event of system failure

Approaches:
(1) Operation-based approach
(2) State-based approach

(1) The Operation-based Approach

Principle:
Record all changes made to state of process (audit trail or log) such that process can be returned to a previous state

Example: A transaction-based environment where transactions update a database

It is possible to commit or undo updates on a
per-transaction basis
A commit indicates that the transaction on the
object was successful and changes are
permanent

(1.a) Updating-in-place
Principle: every update (write) operation to an object
creates a log in stable storage that can be used to
undo and redo the operation
Log content: object name, old object state, new
object state
Implementation of a recoverable update operation:
Do operation: update object and write log record
Undo operation: log(old) -> object (undoes the action performed by a do)
Redo operation: log(new) -> object (redoes the action performed by a do)
Display operation: display log record (optional)
Problem: a do cannot be recovered if the system crashes after the object is written but before the log record is written
(1.b) The write-ahead log protocol
Principle: write log record before updating object
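
The do, undo and redo operations and the write-ahead rule can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `RecoverableStore`, the in-memory list standing in for stable storage, and the use of `None` to mean "object did not exist" are all assumptions, not part of the slides.

```python
class RecoverableStore:
    """Toy object store with a write-ahead log (all names are illustrative)."""

    def __init__(self):
        self.objects = {}  # stands in for the database on secondary storage
        self.log = []      # stands in for the log kept on stable storage

    def do(self, name, new_state):
        # Write-ahead rule: append the log record first, then update the object.
        old_state = self.objects.get(name)  # None means "object did not exist"
        self.log.append({"name": name, "old": old_state, "new": new_state})
        self.objects[name] = new_state

    def undo(self, record):
        # log(old) -> object
        if record["old"] is None:
            self.objects.pop(record["name"], None)
        else:
            self.objects[record["name"]] = record["old"]

    def redo(self, record):
        # log(new) -> object
        self.objects[record["name"]] = record["new"]

    def display(self):
        # Optional display operation: one line per log record.
        return [f'{r["name"]}: {r["old"]!r} -> {r["new"]!r}' for r in self.log]


store = RecoverableStore()
store.do("balance", 100)
store.do("balance", 70)
store.undo(store.log[-1])  # balance is back to 100
store.redo(store.log[-1])  # balance is 70 again
```

Because the log record is appended before the object changes, a crash between the two steps leaves a record that undo or redo can still act on.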

(2) State-based Approach

Principle: establish frequent recovery points or checkpoints saving the entire state of process

Actions:
Checkpointing or taking a checkpoint: saving
process state
Rolling back a process: restoring a process to a
prior state
Note: A process should be rolled back to the most
recent recovery point to minimize the overhead
and delays in the completion of the process
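
A minimal sketch of checkpointing and rolling back, assuming the whole process state fits in one Python object that can be deep-copied. The class `CheckpointedProcess` and its method names are hypothetical, chosen only for this illustration.

```python
import copy


class CheckpointedProcess:
    """Keeps full copies of its state as recovery points (illustrative only)."""

    def __init__(self, state):
        self.state = state
        self.checkpoints = []  # oldest first, most recent last

    def take_checkpoint(self):
        # Save the entire process state.
        self.checkpoints.append(copy.deepcopy(self.state))

    def roll_back(self):
        # Restore the most recent recovery point to lose as little work as possible.
        if not self.checkpoints:
            raise RuntimeError("no recovery point available")
        self.state = copy.deepcopy(self.checkpoints[-1])


proc = CheckpointedProcess({"counter": 0})
proc.take_checkpoint()
proc.state["counter"] = 5
proc.take_checkpoint()
proc.state["counter"] = 9  # work done after the last checkpoint
proc.roll_back()           # only that last piece of work is lost
```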

Shadow Pages: Special case of state-based approach
Only a part of the system state is saved to
minimize recovery
When an object is modified, page containing
object is first copied on stable storage
(shadow page)
If process successfully commits: shadow
page discarded and modified page is made
part of the database
If process fails: shadow page used and the
modified page discarded
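
The shadow-page idea can be sketched as follows, assuming pages are immutable values (strings here) held in a dictionary. The class `ShadowPagedStore` and its method names are made up for this example.

```python
class ShadowPagedStore:
    """Toy page store that keeps a shadow copy of each page it modifies."""

    def __init__(self, pages):
        self.pages = dict(pages)  # current pages (the database)
        self.shadow = {}          # page id -> copy saved before first modification

    def write(self, page_id, data):
        # Copy the page to the shadow area before the first modification only.
        if page_id not in self.shadow:
            self.shadow[page_id] = self.pages[page_id]
        self.pages[page_id] = data

    def commit(self):
        # Success: shadow pages are discarded, modified pages stay in the database.
        self.shadow.clear()

    def abort(self):
        # Failure: shadow pages are used, modified pages are discarded.
        self.pages.update(self.shadow)
        self.shadow.clear()


db = ShadowPagedStore({1: "alpha", 2: "beta"})
db.write(1, "ALPHA")
db.abort()   # page 1 is "alpha" again
db.write(2, "BETA")
db.commit()  # page 2 stays "BETA"
```

Only the pages that were actually touched are copied, which is what distinguishes this from saving the entire state.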

Recovery in concurrent systems

Issue: if one of a set of cooperating processes fails and has to be rolled back to a recovery point, all processes it communicated with since the recovery point have to be rolled back.
Conclusion: In concurrent and/or distributed systems all cooperating processes have to establish recovery points

Orphan messages and the domino effect

Figure: time diagram of processes X, Y and Z with recovery points x2, y1, y2, z1 and z2 and message m

Case 1: failure of X after x3 : no impact on Y or Z

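Case 2: failure of Y after sending message m
Y rolled back to y2
m becomes an orphan message (its receipt is recorded, but its sending is not)
X rolled back to x2
Case 3: failure of Z after z2
Y has to roll back to y1
X has to roll back to x1
Z has to roll back to z1
Domino effect: one rollback forces a cascade of further rollbacks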

Lost messages
Figure: time diagram of processes X and Y with recovery points x1 and y1, message m, and a failure
Assume that x1 and y1 are the only recovery points for processes X and Y, respectively

Assume Y fails after receiving message m
Y rolled back to y1, X rolled back to x1
Message m is lost

Problem of livelock

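Livelock: case where a single failure can cause an infinite number of rollbacks
Figure: two time diagrams of processes X and Y with recovery points x1 and y1; (a) shows messages n1 and m1 and the failure, (b) shows messages n2 and m2 and the 2nd roll back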

Process Y fails before receiving message n1 sent by X

Y is rolled back to y1, where there is no record of sending message m1, causing X to roll back to x1
When Y restarts, it sends out m2 and receives n1 (delayed)
When X restarts from x1, it sends out n2 and receives m2
Y has to roll back again, since there is no longer a record of n1 being sent
This causes X to be rolled back again, since it has received m2 and there is no record of sending m2 in Y
The above sequence can repeat indefinitely

Consistent set of checkpoints

Checkpointing in distributed systems requires that all processes (sites) that interact with one another establish periodic checkpoints
All the sites save their local states: local checkpoints
All the local checkpoints, one from each site, collectively form a global checkpoint
The domino effect is caused by orphan messages, which in turn are caused by rollbacks

Strongly consistent set of checkpoints
Establish a set of local checkpoints (one for each process in the set) such that no information flow takes place (i.e., no orphan messages) during the interval spanned by the checkpoints

Consistency of Checkpoint
2. Consistent set of checkpoints

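Similar to the consistent global state
Each message that is received in a checkpoint (state) should also be recorded as sent in another checkpoint (state)
Lost messages (recorded as sent but not as received) still need to be dealt with
Figure: checkpoints x1, y1, z1 and z2, with one set of checkpoints labeled strongly consistent and another labeled consistent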
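
The rule that every received message must also be recorded as sent can be checked mechanically. The sketch below assumes each local checkpoint is summarized by two sets of message identifiers, `sent` and `received`. That representation and the function name `classify_checkpoints` are assumptions made for illustration.

```python
def classify_checkpoints(checkpoints):
    """Classify a set of local checkpoints, one per process.

    checkpoints maps a process name to {"sent": set, "received": set},
    the message ids recorded in that process's checkpoint.
    Returns (verdict, orphan_messages, lost_messages).
    """
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    orphans = received - sent  # recorded as received but never recorded as sent
    lost = sent - received     # recorded as sent but not recorded as received
    if orphans:
        verdict = "inconsistent"
    elif lost:
        verdict = "consistent"  # lost messages still have to be handled
    else:
        verdict = "strongly consistent"
    return verdict, orphans, lost


example = {
    "X": {"sent": {"m1", "m2"}, "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
}
verdict, orphans, lost = classify_checkpoints(example)
# verdict is "consistent": m2 was sent but its receipt is not recorded
```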

Checkpoint/Recovery Algorithm

Synchronous

with global synchronization at checkpointing

Asynchronous

without global synchronization at checkpointing

Preliminary (Assumptions)
Synchronous Checkpointing

Goal
To make a consistent global checkpoint

Assumptions
Communication channels are FIFO
No partition of the network
End-to-end protocols cope with message loss due to rollback recovery and communication failure
No failure during the execution of the algorithm

Preliminary (Two types of checkpoint)
Synchronous Checkpointing

tentative checkpoint :
a temporary checkpoint
a candidate for permanent checkpoint

permanent checkpoint :
a local checkpoint at a process
a part of a consistent global checkpoint

Checkpoint Algorithm
Synchronous Checkpointing
Algorithm
1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint
2. It requests all the processes to take tentative checkpoints
3. It waits to hear from all the processes whether taking a tentative checkpoint has succeeded
4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, they should be discarded
5. It informs all the processes of the decision
6. The processes that receive the decision act accordingly

Supplement
Once a process has taken a tentative checkpoint, it shouldn't send messages until it is informed of the initiator's decision.
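
The two phases above can be simulated in a single Python program. This is a sketch under simplifying assumptions: direct method calls stand in for messages, the process state is a single value, and `can_checkpoint` is a hypothetical switch used only to simulate a process that fails to take its tentative checkpoint.

```python
class Process:
    """One participant in the checkpoint protocol (illustrative names)."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.state = 0
        self.tentative = None
        self.permanent = None
        self.may_send = True

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.may_send = False  # no sends until the decision arrives
        return True

    def receive_decision(self, commit):
        if commit and self.tentative is not None:
            self.permanent = self.tentative  # tentative becomes permanent
        self.tentative = None                # otherwise it is discarded
        self.may_send = True


def synchronous_checkpoint(initiator, others):
    # Phase 1: the initiator and every other process take tentative checkpoints.
    replies = [initiator.take_tentative()] + [p.take_tentative() for p in others]
    # The initiator commits only if every process succeeded.
    commit = all(replies)
    # Phase 2: the decision is sent to everyone, who act accordingly.
    for p in [initiator, *others]:
        p.receive_decision(commit)
    return commit


a, b, c = Process("A"), Process("B"), Process("C")
a.state, b.state, c.state = 1, 2, 3
committed = synchronous_checkpoint(a, [b, c])
```

If any single process reports failure, every tentative checkpoint is dropped and the previous permanent checkpoints remain the recovery line.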

Diagram of Checkpoint Algorithm
Synchronous Checkpointing
Figure: the initiator takes a tentative checkpoint, sends requests to take a tentative checkpoint, and decides to commit; the resulting permanent checkpoints form a consistent global checkpoint. The figure also marks an unnecessary checkpoint.

Optimized Algorithm
Synchronous Checkpointing
Each message is labeled by order of sending
Labeling Scheme
⊥ : smallest label
⊤ : largest label
Figure: time diagram of processes X and Y showing x2, y1 and y2
last_label_rcvdX[Y] :
the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
first_label_sentX[Y] :
the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
ckpt_cohortX :
the set of all processes that may have to take checkpoints when X decides to take a checkpoint.
Checkpoint requests need to be sent only to the processes included in ckpt_cohort

Optimized Algorithm
Synchronous Checkpointing

ckpt_cohortX = { Y | last_label_rcvdX[Y] > ⊥ }

Y takes a tentative checkpoint only if
last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
Figure: last_label_rcvdX[Y] at X compared with first_label_sentY[X] at Y
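
The cohort definition and the test that Y applies can be written as two small functions. In this sketch the smallest label is represented by `BOTTOM = 0` and real message labels are assumed to start at 1; both choices, and the function names, are assumptions for illustration.

```python
BOTTOM = 0  # stands in for the smallest label; real labels are assumed to start at 1


def ckpt_cohort(last_label_rcvd):
    """Processes Y from which X has received a message since its last checkpoint.

    last_label_rcvd maps Y to last_label_rcvd_X[Y].
    """
    return {y for y, label in last_label_rcvd.items() if label > BOTTOM}


def must_take_tentative(last_label_rcvd_x_y, first_label_sent_y_x):
    """Y's test on receiving X's request, which carries last_label_rcvd_X[Y]."""
    return last_label_rcvd_x_y >= first_label_sent_y_x > BOTTOM


# X has received message 4 from Y and nothing from Z since its last checkpoint.
cohort = ckpt_cohort({"Y": 4, "Z": BOTTOM})

# Y sent its first message since its own checkpoint with label 3, so X's
# checkpoint would record a receipt that Y's checkpoint does not record as sent.
y_needs_checkpoint = must_take_tentative(4, 3)
```

When Y's first message since its checkpoint has a label larger than anything X has received, Y's existing checkpoint already covers every message X has seen, so no new checkpoint is needed.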

Optimized Algorithm
Synchronous Checkpointing
Algorithm
1. An initiating process takes a tentative checkpoint
2. It requests every p in ckpt_cohort to take a tentative checkpoint (this message includes the sender's last_label_rcvd[receiver])
3. If a process that receives the request needs to take a checkpoint, it does the same as steps 1 and 2; otherwise, it returns an OK message
4. Each process waits to receive OK from every p in its ckpt_cohort
5. If the initiator learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, they should be discarded
6. It informs every p in ckpt_cohort of the decision
7. The processes that receive the decision act accordingly

Correctness
Synchronous Checkpointing

A set of permanent checkpoints taken by this algorithm is consistent
No process sends messages after taking a tentative checkpoint until the receipt of the decision
New checkpoints include no message from the processes that don't take a checkpoint
The set of tentative checkpoints is either entirely made permanent or entirely discarded.

Recovery Algorithm
Synchronous Recovery
Labeling Scheme
⊥ : smallest label
⊤ : largest label
last_label_rcvdX[Y] :
the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
first_label_sentX[Y] :
the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
roll_cohortX :
the set of all processes that may have to roll back to the latest checkpoint when process X rolls back.
last_label_sentX[Y] :
the label of the last message that X sent to Y before X took its latest permanent checkpoint; a sentinel label if no such message exists.

Recovery Algorithm
Synchronous Recovery

roll_cohortX = { Y | X can send messages to Y }

Y will restart from the permanent checkpoint
only if
last_label_rcvdY[X] > last_label_sentX[Y]

Recovery Algorithm
Synchronous Recovery
Algorithm
1. An initiator requests every p in roll_cohort to prepare to roll back (this message includes the sender's last_label_sent[receiver])
2. If a process that receives the request needs to roll back, it does the same as step 1; otherwise, it returns an OK message
3. Each process waits to receive OK from every p in its roll_cohort
4. If the initiator learns that every p in roll_cohort has succeeded, it decides to roll back; otherwise, it decides not to roll back
5. It informs every p in roll_cohort of the decision
6. The processes that receive the decision act accordingly
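
The restart test that each member of roll_cohort applies (last_label_rcvdY[X] > last_label_sentX[Y]) can be sketched as below. To keep the example small it assumes every pair of processes has already exchanged at least one message, so no sentinel labels are needed, and it evaluates only the first level of requests rather than the recursive propagation. The function name `processes_to_restart` is hypothetical.

```python
def processes_to_restart(last_label_sent_x, last_label_rcvd_from_x):
    """Members of roll_cohort_X that must restart from their permanent checkpoint.

    last_label_sent_x[Y]:      label of the last message X sent to Y before
                               X's latest permanent checkpoint.
    last_label_rcvd_from_x[Y]: last_label_rcvd_Y[X], the label of the last
                               message Y has received from X.
    """
    return {
        y
        for y in last_label_sent_x
        if last_label_rcvd_from_x[y] > last_label_sent_x[y]
    }


# Before its checkpoint X had sent up to label 3 to Y and up to label 2 to Z.
# Y has since received label 5 from X, a message that X's rollback will undo.
to_restart = processes_to_restart({"Y": 3, "Z": 2}, {"Y": 5, "Z": 2})
```

Z has seen nothing from X beyond what X's checkpoint records as sent, so rolling X back leaves Z's state valid.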

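Diagram of Synchronous Recovery
Figure: time diagram of processes X, Y and Z with checkpoints x1, y1, y2, z1 and z2; a failure occurs, and one of the resulting rollbacks is marked as an unnecessary rollback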

Drawbacks of Synchronous Approach

Additional messages are exchanged

Synchronization delay
An unnecessary extra load on the system if
failure rarely occurs

Asynchronous Checkpoint
Characteristic

Each process takes checkpoints independently

No guarantee that a set of local checkpoints is consistent
A recovery algorithm has to search for a consistent set of checkpoints
No additional messages
No synchronization delay
Lighter load during normal execution

Preliminary (Assumptions)
Asynchronous Checkpoint / Recovery

Goal
To find the latest consistent set of checkpoints

Assumptions

Communication channels are FIFO

Communication channels are reliable
The underlying computation is event-driven

Preliminary (Two types of log)
Asynchronous Checkpoint / Recovery

On receipt of a message, the event is saved in memory (volatile log)
The volatile log is periodically flushed to disk (stable log); each flush acts as a checkpoint

volatile log :
quick access
lost if the corresponding processor fails

stable log :
slow access
not lost even if processors fail
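
A small sketch of the two logs, assuming a JSON-lines file plays the role of the stable log and a Python list plays the role of the volatile log. The class `EventLogger`, its method names and the file name `stable.log` are invented for this example.

```python
import json
import os
import tempfile


class EventLogger:
    """Volatile log in memory, stable log in a file (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.volatile = []  # quick access, lost on a crash

    def on_receive(self, sender, payload):
        # Each received message is recorded as an event in the volatile log.
        self.volatile.append({"from": sender, "payload": payload})

    def flush(self):
        # Periodic flush: append the volatile log to the stable log on disk.
        with open(self.path, "a", encoding="utf-8") as f:
            for event in self.volatile:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.volatile.clear()

    def crash(self):
        # A processor failure wipes the volatile log only.
        self.volatile.clear()

    def stable_events(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    logger = EventLogger(os.path.join(tmp, "stable.log"))
    logger.on_receive("Y", "m1")
    logger.flush()                # m1 is now in the stable log
    logger.on_receive("Y", "m2")
    logger.crash()                # m2 was only in the volatile log and is lost
    survivors = logger.stable_events()
```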

Preliminary (Definition)
Asynchronous Checkpoint / Recovery

Definition
CkPti : the checkpoint (stable log) that i rolled back to
when failure occurs
RCVDij (CkPti / e ) :
the number of messages received by processor i from
processor j, per the information stored in the checkpoint
CkPti or event e.

SENTij(CkPti / e ) :
the number of messages sent by processor i to processor
j, per the information stored in the checkpoint CkPti or
event e

Recovery Algorithm
Asynchronous Checkpoint / Recovery
Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt.
2. It broadcasts a message saying that it has failed. The other processes receive this message and roll back to their latest event.
3. Each process sends SENT(CkPt) to its neighboring processes
4. Each process waits for SENT(CkPt) messages from every neighbor
5. On receiving SENTji(CkPtj) from j, if i notices RCVDij(CkPti) > SENTji(CkPtj), it rolls back to the latest event e such that RCVDij(e) = SENTji(CkPtj)
Repeat steps 3, 4 and 5 N times (N is the number of processes)
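
The iteration in steps 3 to 5 can be simulated centrally. In the sketch below each logged event stores cumulative SENT and RCVD counts, event 0 of every process is assumed to be an initial state with all counts zero, and all processes are treated as neighbors. The function `recover`, the data layout and the example logs are assumptions made for illustration, not taken from the slides.

```python
def recover(logs, start):
    """Find a consistent recovery line by repeated rollback.

    logs[i]  : list of events for process i; each event is
               {"sent": {j: count}, "rcvd": {j: count}} with cumulative counts.
    start[i] : index of the event process i initially restarts from
               (latest checkpoint for the failed process, latest event otherwise).
    Returns the event index each process ends up at.
    """
    pos = dict(start)
    for _ in range(len(logs)):
        # Steps 3 and 4: every process announces its SENT counts for this round.
        sent_now = {i: logs[i][pos[i]]["sent"] for i in logs}
        for i in logs:
            for j in logs:
                if i == j:
                    continue
                limit = sent_now[j].get(i, 0)
                # Step 5: roll i back while it records more receipts from j
                # than j reports having sent.
                while logs[i][pos[i]]["rcvd"].get(j, 0) > limit:
                    pos[i] -= 1
    return pos


logs = {
    "X": [
        {"sent": {}, "rcvd": {}},
        {"sent": {"Y": 1}, "rcvd": {}},
        {"sent": {"Y": 1}, "rcvd": {"Y": 1}},
    ],
    "Y": [
        {"sent": {}, "rcvd": {}},
        {"sent": {}, "rcvd": {"X": 1}},
        {"sent": {"X": 1}, "rcvd": {"X": 1}},
        {"sent": {"X": 1, "Z": 1}, "rcvd": {"X": 1}},
    ],
    "Z": [
        {"sent": {}, "rcvd": {}},
        {"sent": {}, "rcvd": {"Y": 1}},
    ],
}

# Y crashes and restarts from its checkpoint at event 1; X and Z start from
# their latest events and are forced back because Y's sends were undone.
line = recover(logs, {"X": 2, "Y": 1, "Z": 1})
```

Since counts grow by one per event, the loop stops at the latest event whose receive count equals the reported send count, which matches the rollback target in step 5.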

Asynchronous Recovery
Figure: example with processes X, Y and Z and events Ex0 to Ex3, Ey0 to Ey3 and Ez0 to Ez2. For each pair i:j the figure checks the condition
RCVDij (CkPti) <= SENTji(CkPtj)
with the values X:Y 3 <= 2 (then 2), Y:X 1 <= 2, X:Z 0 <= 0, Y:Z 1 <= 1, Z:X 0 <= 0 and Z:Y 2 <= 1 (then 1).
