Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner
HP Labs
§: Currently at Meiosys, Inc.
Broad Opportunity for Checkpoint-Restart in Server Management
- Fault tolerance (minimize unplanned downtime)
  - Recover by restarting from checkpoint
- Minimize planned downtime
  - Migrate application before hardware/OS maintenance
- Resource management
  - Manage resource allocation in shared computing environments by migrating applications
Need for General-Purpose Checkpoint-Restart
- Existing checkpoint-restart methods are too limited:
  - No support for many OS resources that commercial applications use (e.g., sockets)
  - Limited to applications using specific libraries
  - Require application source and recompilation
  - Require use of specialized operating systems
- Need a practical checkpoint-restart mechanism that is capable of supporting a broad class of applications
Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux
- Application-transparent: supports applications without modifications or recompilation
- Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)
  - Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state
- Supported on unmodified Linux base kernel: checkpoint-restart is integrated via a kernel module
Cruz Overview
- Builds on Columbia Univ.’s Zap process migration
- Our Key Extensions
  - Support for migrating networked applications, transparent to communicating peers
    - Enables role in managing servers running commercial applications (e.g., databases)
  - General method for checkpoint-restart of TCP/IP-based distributed applications
    - Also enables efficiencies compared to library-specific approaches
Outline
- Zap (Background)
- Migrating Networked Applications
  - Network Address Migration
  - Communication State Checkpoint and Restore
- Checkpoint-Restart of Distributed Applications
- Evaluation
- Related Work
- Future Work
- Summary
Zap (Background)
- Process migration mechanism
  - Kernel module implementation
- Virtualization layer groups processes into Pods with private virtual name space
  - Intercepts system calls to expose only virtual identifiers (e.g., vpid)
  - Preserves resource names and dependencies across migration
- Mechanism to checkpoint and restart pods
  - User and kernel-level state
  - Primarily uses system call handlers
  - File system not saved or restored (assumes a network file system)
[Figure: applications grouped into Pods run above the Zap layer, which sits at the Linux system call interface.]
Migrating Networked Applications
- Migration must be transparent to remote peers to be useful in server management scenarios
  - Peers, including unmodified clients, must not perceive any change in the IP address of the application
  - Communication state of live connections must be preserved
- No prior solution for these (including original Zap)
- Our Solution:
  - Provide unique IP address to each pod that persists across migration
  - Checkpoint and restore the socket control state and socket data buffer state of all live sockets
Network Address Migration
- Pod attached to virtual interface with own IP & MAC addr.
  - Implemented by using Linux’s virtual interfaces (VIFs)
- IP address assigned statically or through a DHCP client running inside the pod (using pod’s MAC address)
- Intercept bind() & connect() to ensure pod processes use pod’s IP address
- Migration: delete VIF on source host & create on new host
  - Migration limited to subnet
[Figure: a pod attached to VIF eth0:1 on host interface eth0 [IP-1, MAC-h1]. The DHCP client in the pod (1) issues ioctl(), (2) obtains MAC-p1, (3) sends dhcprequest(MAC-p1) over the network to the DHCP server, and (4) receives dhcpack(IP-p1).]
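The bind()/connect() interception can be pictured with a small user-space model. This is only a sketch: Cruz does this inside a kernel module, and the names `Pod`, `intercept_bind`, and `source_for_connect`, as well as the policy of rejecting non-pod addresses, are assumptions made for illustration rather than details taken from the slides.

```python
from dataclasses import dataclass

WILDCARD = "0.0.0.0"  # INADDR_ANY as a dotted-quad string


@dataclass(frozen=True)
class Pod:
    """Hypothetical pod record: a name plus the IP of its virtual interface."""
    name: str
    ip: str


def intercept_bind(pod, requested_ip, port):
    """Return the (ip, port) that would be handed to the real bind().

    Assumption of this sketch: a wildcard bind is narrowed to the pod's
    address, and any other foreign address is rejected.
    """
    if requested_ip in (WILDCARD, pod.ip):
        return (pod.ip, port)
    raise ValueError(f"{requested_ip} does not belong to pod {pod.name}")


def source_for_connect(pod, bound_ip=None):
    """Return the local source address to use for an outgoing connect().

    An unbound (or wildcard-bound) socket is pinned to the pod's address so
    that peers always see the pod's IP, not the host's.
    """
    if bound_ip in (None, WILDCARD):
        return pod.ip
    return bound_ip


pod = Pod(name="p1", ip="10.0.0.5")
print(intercept_bind(pod, WILDCARD, 3306))
print(source_for_connect(pod))
```

Because every socket in the pod ends up tied to the pod's own address, deleting the VIF on the source host and recreating it on the destination moves all of those endpoints together.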
Communication State Checkpoint and Restore
- Communication state:
  - Control: Socket data structure, TCP connection state
  - Data: contents of send and receive socket buffers
- Challenges in communication state checkpoint and restore:
  - Network stack will continue to execute even after application processes are stopped
  - No system call interface to read or write control state
  - No system call interface to read send socket buffers
  - No system call interface to write receive socket buffers
  - Consistency of control state and socket buffer state
Communication State Checkpoint
- Acquire network stack locks to freeze TCP processing
- Save receive buffers using socket receive system call in peek mode
- Save send buffers by walking kernel structures
- Copy control state from kernel structures
- Modify two sequence numbers in saved state to reflect empty socket buffers
  - Indicate current send buffers not yet written by application
  - Indicate current receive buffers all consumed by application
- Note: Checkpoint does not change live communication state
[Figure: state for one socket. Live communication state: control (copied_seq = Rh, rcv_nxt = Rt+1, snd_una = Sh, write_seq = St+1, timers, options, etc.), receive buffers Rh..Rt, send buffers Sh..St. Checkpoint state: control copied by direct access, with copied_seq changed from Rh to Rt+1 and write_seq changed from St+1 to Sh; receive buffers saved via receive(); send buffers saved by direct access.]
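The two checkpoint steps that have a user-space analogue can be sketched as follows. The field names follow the figure labels (copied_seq, rcv_nxt, snd_una, write_seq); the `TcpControl` class and the helper functions are hypothetical, and a real implementation works on kernel structures under the network stack locks rather than on a Python object.

```python
import socket
from dataclasses import dataclass, replace

SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


@dataclass(frozen=True)
class TcpControl:
    """Hypothetical subset of the per-socket TCP control state."""
    copied_seq: int  # next receive byte the application will read (Rh)
    rcv_nxt: int     # next byte expected from the peer (Rt+1)
    snd_una: int     # oldest unacknowledged send byte (Sh)
    write_seq: int   # next byte the application will write (St+1)


def buffered_bytes(ctl):
    """Return (receive, send) byte counts implied by the sequence numbers."""
    recv = (ctl.rcv_nxt - ctl.copied_seq) % SEQ_MOD
    send = (ctl.write_seq - ctl.snd_una) % SEQ_MOD
    return recv, send


def checkpoint_control(live):
    """Copy the control state, adjusted so both buffers look empty.

    The live object is left untouched, mirroring the note that the
    checkpoint does not change live communication state.
    """
    return replace(live, copied_seq=live.rcv_nxt, write_seq=live.snd_una)


def peek_receive_buffer(sock, max_bytes):
    """Read pending receive data without consuming it (peek mode)."""
    return sock.recv(max_bytes, socket.MSG_PEEK)


live = TcpControl(copied_seq=1000, rcv_nxt=1300, snd_una=5000, write_seq=5200)
saved = checkpoint_control(live)
print(buffered_bytes(live), buffered_bytes(saved))

a, b = socket.socketpair()
b.sendall(b"in-flight")
snapshot = peek_receive_buffer(a, 4096)  # saved into the checkpoint
still_there = a.recv(4096)               # the application can still read it
a.close()
b.close()
```

The saved control state describes a socket with empty buffers, which is what lets restart refill the send side through an ordinary write and hand the receive data back separately.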
Communication State Restore
- Create a new socket
- Copy control state in checkpoint to socket structure
- Restore checkpointed send buffer data using the socket write call
- Deliver checkpointed receive buffer data to application on demand
  - Copy checkpointed receive buffer data to a special buffer
  - Intercept receive system call to deliver data from special buffer until buffer is emptied
[Figure: state for one socket. Checkpointed control state (copied_seq = rcv_nxt = Rt+1, snd_una = write_seq = Sh, timers, options, etc.) is copied into the new socket by direct update; checkpointed send buffers Sh..St are restored with write(), which advances write_seq to St+1; checkpointed receive data Rh..Rt is delivered to the application by the intercepted receive system call.]
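The on-demand delivery of checkpointed receive data can be sketched as a wrapper around a socket. The class name `RestoredSocket` is hypothetical, and the wrapper stands in for the system call interception that Cruz performs in the kernel; the sketch only shows the ordering rule that saved data is drained before anything new is read.

```python
import socket


class RestoredSocket:
    """Hypothetical wrapper modelling the intercepted receive call."""

    def __init__(self, sock, saved_recv_data):
        self._sock = sock
        # The "special buffer" holding receive data captured at checkpoint.
        self._special = bytearray(saved_recv_data)

    def recv(self, bufsize):
        if self._special:
            # Serve checkpointed bytes first, never mixing in new data.
            chunk = bytes(self._special[:bufsize])
            del self._special[:bufsize]
            return chunk
        # Special buffer is empty: fall through to the real socket.
        return self._sock.recv(bufsize)


# Demonstration with a local socket pair standing in for a TCP connection.
app_side, peer_side = socket.socketpair()
peer_side.sendall(b"new")                 # arrives after the restart
restored = RestoredSocket(app_side, b"old-data")

first = restored.recv(3)     # from the special buffer
second = restored.recv(100)  # rest of the special buffer
third = restored.recv(100)   # now from the live socket
print(first, second, third)

app_side.close()
peer_side.close()
```

Keeping the saved bytes outside the kernel receive queue avoids needing an interface to write receive socket buffers, which the earlier slide lists as unavailable.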
Checkpoint-Restart of Distributed Applications
- State of processes and messages in channel must be checkpointed and restored consistently
- Prior approaches specific to particular library, e.g., modify library to capture and restore messages in channel
- Cruz preserves TCP connection state and IP addresses of each pod, implicitly preserving global communication state
  - Transparently supports TCP/IP-based distributed applications
  - Enables efficiencies compared to library-based implementations
[Figure: three nodes, each running processes over a library and TCP/IP, joined by a communication channel, with the checkpoint boundary marked.]
Checkpoint-Restart of Distributed Applications in Cruz
- Global communication state saved and restored by saving and restoring TCP communication state for each pod
  - Messages in flight need not be saved since the TCP state will trigger retransmission of these messages at restart
  - Eliminates O(N^2) step to flush channel for capturing messages in flight
  - Eliminates need to re-establish connections at restart
- Preserving pod’s IP address across restart eliminates need to re-discover process locations in library at restart
[Figure: three nodes, each running a pod (processes) over a library and TCP/IP, joined by a communication channel, with the checkpoint boundary marked.]
Consistent Checkpoint Algorithm in Cruz (Illustrative)
- Algorithm has O(N) complexity (blocking algorithm shown for simplicity)
- Can be extended to improve robustness and performance, e.g.:
  - Tolerate Agent & Coordinator failures
  - Overlap computation and checkpointing using copy-on-write
  - Allow nodes to continue without blocking for all nodes to complete checkpoint
  - Reduce checkpoint size with incremental checkpoints
[Figure: message sequence between a Coordinator node and the Agent on each node (pod, library, TCP/IP). The Coordinator sends <checkpoint>; each Agent disables pod communication (§), saves pod state, and replies <done>; the Coordinator then sends <continue>; each Agent enables pod communication, resumes the pod, and replies <continue-done>.]
§: using netfilter rules in Linux
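The blocking protocol in the figure can be simulated in a few lines. This is an in-process sketch, not the Cruz implementation: the `Agent` class and `coordinate` function are hypothetical, message passing is replaced by method calls, and disabling communication (done with netfilter rules in Cruz) is reduced to a flag.

```python
class Agent:
    """Hypothetical per-node agent controlling one pod."""

    def __init__(self, name):
        self.name = name
        self.comm_enabled = True
        self.pod_running = True
        self.checkpoints = 0

    def on_checkpoint(self):
        self.comm_enabled = False  # stand-in for installing a drop rule
        self.pod_running = False   # pod is stopped while its state is saved
        self.checkpoints += 1      # stand-in for writing the pod image
        return "done"

    def on_continue(self):
        self.comm_enabled = True   # stand-in for removing the drop rule
        self.pod_running = True
        return "continue-done"


def coordinate(agents):
    """Run one blocking coordinated checkpoint; return messages exchanged."""
    messages = 0
    # Phase 1: every agent must report <done> before anyone continues.
    for agent in agents:
        reply = agent.on_checkpoint()
        assert reply == "done"
        messages += 2  # <checkpoint> plus <done>
    # Phase 2: release all pods.
    for agent in agents:
        reply = agent.on_continue()
        assert reply == "continue-done"
        messages += 2  # <continue> plus <continue-done>
    return messages


for n in (2, 4, 8):
    agents = [Agent(f"node{i}") for i in range(n)]
    print(n, coordinate(agents))
```

In this model the coordinator exchanges four messages per node, so the count grows linearly with N, consistent with the O(N) claim. The barrier between the two phases is what keeps any pod from sending new data while another pod is still saving its state.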
Evaluation
- Cruz implemented for Linux 2.4.x on x86
- Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark
- Cruz incurs negligible runtime overhead (less than 0.5%)
- Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable
Performance Result: Negligible Coordination Overhead
- Checkpoint behavior for Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes
- Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests scheme is scalable
  - Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
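To put the reported figures in proportion, the short calculation below expresses the coordination latency as a percentage of the overall checkpoint latency. It uses only the numbers on the slide and treats "about 1 second" as exactly one second, which is an approximation; the function name is hypothetical.

```python
def coordination_share(coord_us, total_s=1.0):
    """Coordination latency as a percentage of total checkpoint latency."""
    return coord_us / (total_s * 1_000_000) * 100


low = coordination_share(400)
high = coordination_share(500)
print(f"{low:.2f}% to {high:.2f}% of the checkpoint latency")
```

Under that approximation, coordination accounts for a few hundredths of a percent of each checkpoint; nearly all of the time goes to local work on each node.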
Related Work
- MetaCluster product from Meiosys
  - Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)
- Berkeley Lab Checkpoint/Restart (BLCR)
  - Kernel-module based checkpoint-restart for single node
  - No identifier virtualization: restart will fail in the event of an identifier (e.g., pid) conflict
  - No support for handling communication state: relies on application or library changes
- MPVM, CoCheck, LAM-MPI
  - Library-specific implementations of parallel application checkpoint-restart with disadvantages described earlier
Future Work
- Many areas for future work, e.g.,
  - Improve portability across kernel versions by minimizing direct access to kernel structures
  - Recommend additional kernel interfaces when advantageous (e.g., accessing socket attributes)
  - Implement performance optimizations to the coordinated checkpoint-restart algorithm
  - Evaluate performance on a wide range of applications and cluster configurations
  - Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)
Summary
- Cruz, a practical checkpoint-restart system for Linux
  - No change to applications or to base OS kernel needed
- Novel mechanisms to support checkpoint-restart of a broader class of applications
  - Migrating networked applications transparent to communicating peers
  - Consistent checkpoint-restart of general TCP/IP-based distributed applications
- Cruz’s broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management
http://www.hpl.hp.com/research/dca
Zap Virtualization
- Groups processes into a POD (Process Domain) that has a private virtual namespace
  - Uses system call interception to expose only virtual identifiers (e.g., virtual pids, virtual IPC identifiers)
  - Virtual identifiers eliminate conflicts with identifiers already in use within the OS on the restarting node
- All dependent processes (e.g., forked child processes) are assigned to same pod
- Checkpoint and restart operate on an entire pod, which preserves resource dependencies across checkpoint and restart
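Identifier virtualization can be illustrated with a per-pod translation table. The sketch below is a simplified model with hypothetical names (`PodNamespace`, `register`, `rebind`); Zap keeps such mappings in the kernel and applies them inside intercepted system calls.

```python
class PodNamespace:
    """Hypothetical per-pod table mapping virtual pids to real pids."""

    def __init__(self):
        self._virt_to_real = {}
        self._real_to_virt = {}
        self._next_vpid = 1

    def register(self, real_pid):
        """Assign the next free virtual pid to a process joining the pod."""
        vpid = self._next_vpid
        self._next_vpid += 1
        self._virt_to_real[vpid] = real_pid
        self._real_to_virt[real_pid] = vpid
        return vpid

    def to_real(self, vpid):
        """Translate an identifier passed in by a pod process."""
        return self._virt_to_real[vpid]

    def to_virtual(self, real_pid):
        """Translate an identifier before returning it to a pod process."""
        return self._real_to_virt[real_pid]

    def rebind(self, vpid, new_real_pid):
        """After restart, point an existing vpid at its new real pid."""
        old_real = self._virt_to_real[vpid]
        del self._real_to_virt[old_real]
        self._virt_to_real[vpid] = new_real_pid
        self._real_to_virt[new_real_pid] = vpid


ns = PodNamespace()
vpid = ns.register(4321)  # real pid on the original node
ns.rebind(vpid, 977)      # real pid chosen by the OS on the restart node
print(vpid, ns.to_real(vpid))
```

The application keeps seeing the same virtual pid before and after restart, so it does not matter which real pid the restarting node happens to assign.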
Zap Checkpoint and Restart
- Checkpoint:
  - Stops all processes in pod with SIGSTOP
  - Parent-child relationships saved from /proc
  - State of each process is captured by accessing system call handlers and kernel data structures
- Restart:
  - Original forest of processes recreated in a new pod by forking recursively
  - Each process restores most of its resources using system calls (e.g., open files)
  - Kernel module restores sharing relationships (e.g., shared file descriptors) and other key resources (e.g., socket state)
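The recursive recreation of the process forest can be sketched from the saved parent-child relationships. The sketch assumes the checkpoint holds a mapping from each virtual pid to its parent's virtual pid (None for a root); this layout and the function names are hypothetical, and the point where a real restart would call fork() is replaced by recording the visit order.

```python
from collections import defaultdict


def build_forest(parent_of):
    """Turn {vpid: parent_vpid_or_None} into (roots, children) tables."""
    children = defaultdict(list)
    roots = []
    for vpid in sorted(parent_of):
        parent = parent_of[vpid]
        if parent is None:
            roots.append(vpid)
        else:
            children[parent].append(vpid)
    return roots, children


def restore_order(parent_of):
    """Return the order in which processes would be recreated."""
    roots, children = build_forest(parent_of)
    order = []

    def restore(vpid):
        # A real restart would fork() here; the child would then restore
        # its own resources before recreating its own children.
        order.append(vpid)
        for child in children[vpid]:
            restore(child)

    for root in roots:
        restore(root)
    return order


saved = {1: None, 2: 1, 3: 1, 4: 2, 5: None}
print(restore_order(saved))
```

Every parent appears in the order before its children, which is the property that recursive forking provides: each recreated process has the same ancestry it had at checkpoint time.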
Performance Result: Impact of Dropping Packets at Checkpoint
- Benchmark streaming data at maximum rate over a GigE link between 2 nodes
- Shows TCP recovers peak throughput in 100 ms
  - Will be overshadowed by checkpoint latency in real applications
  - Optimizations can overlap TCP recovery entirely with checkpointing

 
Email Marketing Tips and Tricks
Email Marketing Tips and TricksEmail Marketing Tips and Tricks
Email Marketing Tips and Tricks
Mark J. Feldman
 
Email Marketing: Expand Your Reach, Grow Your Business
Email Marketing: Expand Your Reach, Grow Your BusinessEmail Marketing: Expand Your Reach, Grow Your Business
Email Marketing: Expand Your Reach, Grow Your Business
Mark J. Feldman
 

Recently uploaded (20)

Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Building Connected Agents:  An Overview of Google's ADK and A2A ProtocolBuilding Connected Agents:  An Overview of Google's ADK and A2A Protocol
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Suresh Peiris
 
TAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdfTAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdf
Pallavi Sharma
 
AI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptxAI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptx
Shikha Srivastava
 
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCPMCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
Sambhav Kothari
 
John Carmack’s Slides From His Upper Bound 2025 Talk
John Carmack’s Slides From His Upper Bound 2025 TalkJohn Carmack’s Slides From His Upper Bound 2025 Talk
John Carmack’s Slides From His Upper Bound 2025 Talk
Razin Mustafiz
 
Artificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdfArtificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdf
NufiEriKusumawati
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Pushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K HostsPushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K Hosts
ShapeBlue
 
Fully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and ControlFully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and Control
ShapeBlue
 
PSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdfPSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdf
ssuser3d62c6
 
Storage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStackStorage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStack
ShapeBlue
 
Build your own NES Emulator... with Kotlin
Build your own NES Emulator... with KotlinBuild your own NES Emulator... with Kotlin
Build your own NES Emulator... with Kotlin
Artur Skowroński
 
Dr Schwarzkopf presentation on STKI Summit A
Dr Schwarzkopf presentation on STKI Summit ADr Schwarzkopf presentation on STKI Summit A
Dr Schwarzkopf presentation on STKI Summit A
Dr. Jimmy Schwarzkopf
 
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStackProposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
ShapeBlue
 
Planetek Italia Corporate Profile Brochure
Planetek Italia Corporate Profile BrochurePlanetek Italia Corporate Profile Brochure
Planetek Italia Corporate Profile Brochure
Planetek Italia Srl
 
Reducing Bugs With Static Code Analysis php tek 2025
Reducing Bugs With Static Code Analysis php tek 2025Reducing Bugs With Static Code Analysis php tek 2025
Reducing Bugs With Static Code Analysis php tek 2025
Scott Keck-Warren
 
"AI in the browser: predicting user actions in real time with TensorflowJS", ...
"AI in the browser: predicting user actions in real time with TensorflowJS", ..."AI in the browser: predicting user actions in real time with TensorflowJS", ...
"AI in the browser: predicting user actions in real time with TensorflowJS", ...
Fwdays
 
RDM Training: Publish research data with the Research Data Repository
RDM Training: Publish research data with the Research Data RepositoryRDM Training: Publish research data with the Research Data Repository
RDM Training: Publish research data with the Research Data Repository
CSUC - Consorci de Serveis Universitaris de Catalunya
 
Partner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdfPartner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdf
ssuser3d62c6
 
Wondershare Filmora 14.3.2 Crack + License Key Free for Windows PC
Wondershare Filmora 14.3.2 Crack + License Key Free for Windows PCWondershare Filmora 14.3.2 Crack + License Key Free for Windows PC
Wondershare Filmora 14.3.2 Crack + License Key Free for Windows PC
Mudasir
 
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Building Connected Agents:  An Overview of Google's ADK and A2A ProtocolBuilding Connected Agents:  An Overview of Google's ADK and A2A Protocol
Building Connected Agents: An Overview of Google's ADK and A2A Protocol
Suresh Peiris
 
TAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdfTAFs on WebDriver API - By - Pallavi Sharma.pdf
TAFs on WebDriver API - By - Pallavi Sharma.pdf
Pallavi Sharma
 
AI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptxAI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptx
Shikha Srivastava
 
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCPMCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
MCP Dev Summit - Pragmatic Scaling of Enterprise GenAI with MCP
Sambhav Kothari
 
John Carmack’s Slides From His Upper Bound 2025 Talk
John Carmack’s Slides From His Upper Bound 2025 TalkJohn Carmack’s Slides From His Upper Bound 2025 Talk
John Carmack’s Slides From His Upper Bound 2025 Talk
Razin Mustafiz
 
Artificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdfArtificial Intelligence (Kecerdasan Buatan).pdf
Artificial Intelligence (Kecerdasan Buatan).pdf
NufiEriKusumawati
 
Building Agents with LangGraph & Gemini
Building Agents with LangGraph &  GeminiBuilding Agents with LangGraph &  Gemini
Building Agents with LangGraph & Gemini
HusseinMalikMammadli
 
Pushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K HostsPushing the Limits: CloudStack at 25K Hosts
Pushing the Limits: CloudStack at 25K Hosts
ShapeBlue
 
Fully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and ControlFully Open-Source Private Clouds: Freedom, Security, and Control
Fully Open-Source Private Clouds: Freedom, Security, and Control
ShapeBlue
 
PSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdfPSEP - Salesforce Power of the Platform.pdf
PSEP - Salesforce Power of the Platform.pdf
ssuser3d62c6
 
Storage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStackStorage Setup for LINSTOR/DRBD/CloudStack
Storage Setup for LINSTOR/DRBD/CloudStack
ShapeBlue
 
Build your own NES Emulator... with Kotlin
Build your own NES Emulator... with KotlinBuild your own NES Emulator... with Kotlin
Build your own NES Emulator... with Kotlin
Artur Skowroński
 
Dr Schwarzkopf presentation on STKI Summit A
Dr Schwarzkopf presentation on STKI Summit ADr Schwarzkopf presentation on STKI Summit A
Dr Schwarzkopf presentation on STKI Summit A
Dr. Jimmy Schwarzkopf
 
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStackProposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
Proposed Feature: Monitoring and Managing Cloud Usage Costs in Apache CloudStack
ShapeBlue
 
Planetek Italia Corporate Profile Brochure
Planetek Italia Corporate Profile BrochurePlanetek Italia Corporate Profile Brochure
Planetek Italia Corporate Profile Brochure
Planetek Italia Srl
 
Reducing Bugs With Static Code Analysis php tek 2025
Reducing Bugs With Static Code Analysis php tek 2025Reducing Bugs With Static Code Analysis php tek 2025
Reducing Bugs With Static Code Analysis php tek 2025
Scott Keck-Warren
 
"AI in the browser: predicting user actions in real time with TensorflowJS", ...
"AI in the browser: predicting user actions in real time with TensorflowJS", ..."AI in the browser: predicting user actions in real time with TensorflowJS", ...
"AI in the browser: predicting user actions in real time with TensorflowJS", ...
Fwdays
 
Partner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdfPartner Tableau Next Product First Call Deck.pdf
Partner Tableau Next Product First Call Deck.pdf
ssuser3d62c6
 
Wondershare Filmora 14.3.2 Crack + License Key Free for Windows PC
Wondershare Filmora 14.3.2 Crack + License Key Free for Windows PCWondershare Filmora 14.3.2 Crack + License Key Free for Windows PC
Wondershare Filmora 14.3.2 Crack + License Key Free for Windows PC
Mudasir
 

Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

  • 1. Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
      • G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner (HP Labs)
      • §: Currently at Meiosys, Inc.
  • 2. Broad Opportunity for Checkpoint-Restart in Server Management
      • Fault tolerance (minimize unplanned downtime): recover by restarting from a checkpoint
      • Minimize planned downtime: migrate the application before hardware/OS maintenance
      • Resource management: manage resource allocation in shared computing environments by migrating applications
  • 3. Need for General-Purpose Checkpoint-Restart
      • Existing checkpoint-restart methods are too limited:
          • No support for many OS resources that commercial applications use (e.g., sockets)
          • Limited to applications using specific libraries
          • Require application source and recompilation
          • Require use of specialized operating systems
      • Need a practical checkpoint-restart mechanism that is capable of supporting a broad class of applications
  • 4. Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux
      • Application-transparent: supports applications without modifications or recompilation
      • Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps)
      • Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state
      • Supported on unmodified Linux base kernel: checkpoint-restart integrated via a kernel module
  • 5. Cruz Overview
      • Builds on Columbia Univ.’s Zap process migration
      • Our key extensions:
          • Support for migrating networked applications, transparent to communicating peers
              • Enables role in managing servers running commercial applications (e.g., databases)
          • General method for checkpoint-restart of TCP/IP-based distributed applications
              • Also enables efficiencies compared to library-specific approaches
  • 6. Outline
      • Zap (Background)
      • Migrating Networked Applications
          • Network Address Migration
          • Communication State Checkpoint and Restore
      • Checkpoint-Restart of Distributed Applications
      • Evaluation
      • Related Work
      • Future Work
      • Summary
  • 7. Zap (Background)
      • Process migration mechanism
          • Kernel module implementation
      • Virtualization layer groups processes into pods with a private virtual name space
          • Intercepts system calls to expose only virtual identifiers (e.g., vpid)
          • Preserves resource names and dependencies across migration
      • Mechanism to checkpoint and restart pods
          • User-level and kernel-level state
          • Primarily uses system call handlers
          • File system not saved or restored (assumes a network file system)
      • [Diagram labels: Applications, Pods, Zap, System calls, Linux]
  • 9. Migrating Networked Applications
      • Migration must be transparent to remote peers to be useful in server management scenarios
          • Peers, including unmodified clients, must not perceive any change in the IP address of the application
          • Communication state of live connections must be preserved
      • No prior solution for these (including original Zap)
      • Our solution:
          • Provide a unique IP address to each pod that persists across migration
          • Checkpoint and restore the socket control state and socket data buffer state of all live sockets
  • 10. Network Address Migration
      • Pod attached to a virtual interface with its own IP and MAC address
          • Implemented using Linux’s virtual interfaces (VIFs)
      • IP address assigned statically or through a DHCP client running inside the pod (using the pod’s MAC address)
      • Intercept bind() and connect() to ensure pod processes use the pod’s IP address
      • Migration: delete the VIF on the source host and create it on the new host
          • Migration limited to subnet
      • [Diagram: host interface eth0 (IP-1, MAC-h1) with virtual interface eth0:1 for the pod; the pod’s DHCP client and a DHCP server on the network exchange: 1. ioctl(), 2. MAC-p1, 3. dhcprequest(MAC-p1), 4. dhcpack(IP-p1)]
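The bind() and connect() interception on slide 10 can be pictured as an address rewrite applied before the real call runs. The sketch below is only a user-level model in Python, not the Cruz kernel module: the function names `rewrite_bind` and `local_address_for_connect` and the example addresses are hypothetical, and the slides do not say how an unbound socket is handled before connect(), so the "bind to the pod IP with an ephemeral port" rule is an assumption.

```python
from typing import Optional, Tuple

Address = Tuple[str, int]  # (IPv4 address, port)


def rewrite_bind(requested: Address, pod_ip: str) -> Address:
    """Model of an intercepted bind(): keep the requested port,
    but force the local IP to the pod's own address."""
    _requested_ip, port = requested
    return (pod_ip, port)


def local_address_for_connect(bound: Optional[Address], pod_ip: str) -> Address:
    """Model of an intercepted connect(): choose the local address the
    socket should use so that peers see the pod's IP as the source.

    Assumption: an unbound socket is bound to the pod IP with port 0,
    which asks the OS for an ephemeral port.
    """
    if bound is None:
        return (pod_ip, 0)
    return rewrite_bind(bound, pod_ip)


if __name__ == "__main__":
    pod_ip = "10.0.0.5"  # hypothetical pod address
    print(rewrite_bind(("0.0.0.0", 8080), pod_ip))
    print(local_address_for_connect(None, pod_ip))
```

Because the pod's address, rather than the host's, is what peers see, the same rewrite keeps working after the virtual interface is recreated on another host in the same subnet.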
  • 11. Communication State Checkpoint and Restore
      • Communication state:
          • Control: socket data structure, TCP connection state
          • Data: contents of send and receive socket buffers
      • Challenges in communication state checkpoint and restore:
          • Network stack will continue to execute even after application processes are stopped
          • No system call interface to read or write control state
          • No system call interface to read send socket buffers
          • No system call interface to write receive socket buffers
          • Consistency of control state and socket buffer state
  • 12. Communication State Checkpoint
      • Acquire network stack locks to freeze TCP processing
      • Save receive buffers using the socket receive system call in peek mode
      • Save send buffers by walking kernel structures
      • Copy control state from kernel structures
      • Modify two sequence numbers in the saved state to reflect empty socket buffers
          • Indicate current send buffers not yet written by application
          • Indicate current receive buffers all consumed by application
      • Note: checkpoint does not change live communication state
      • [Diagram: live and checkpointed state for one socket. Control state holds copied_seq, rcv_nxt, snd_una, write_seq, timers, options, etc.; receive buffers span Rh..Rt and send buffers span Sh..St. Receive buffers are saved via receive(); send buffers and control state are saved by direct access. In the saved control state, Rh is replaced by Rt+1 and St+1 is replaced by Sh.]
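Two steps on slide 12 lend themselves to a small user-level illustration: reading the receive buffer in peek mode, and rewriting two sequence numbers in the saved copy. The Python sketch below is a model under stated assumptions, not the Cruz implementation. `TcpControl`, `peek_receive_buffer`, and `checkpoint_control` are hypothetical names; the field meanings follow the usual Linux TCP definitions (copied_seq is the next byte the application will read, rcv_nxt the next byte expected from the peer, snd_una the oldest unacknowledged byte, write_seq the next byte the application will write); and a local socket pair stands in for a TCP connection.

```python
import socket
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TcpControl:
    """Hypothetical subset of per-socket TCP control state."""
    copied_seq: int  # next byte the application will read (Rh)
    rcv_nxt: int     # next byte expected from the peer (Rt+1)
    snd_una: int     # oldest sent byte not yet acknowledged (Sh)
    write_seq: int   # next byte the application will write (St+1)


def peek_receive_buffer(sock: socket.socket, limit: int = 65536) -> bytes:
    """Copy queued receive data without consuming it (peek mode)."""
    return sock.recv(limit, socket.MSG_PEEK)


def checkpoint_control(live: TcpControl) -> TcpControl:
    """Return a saved copy whose sequence numbers describe empty buffers.

    copied_seq := rcv_nxt  -> receive data counts as consumed
    write_seq  := snd_una  -> send data counts as not yet written
    The live state object is left untouched.
    """
    return replace(live, copied_seq=live.rcv_nxt, write_seq=live.snd_una)


def demo_peek() -> Tuple[bytes, bytes]:
    """Show that peeking leaves the data available to the application."""
    a, b = socket.socketpair()
    try:
        a.sendall(b"unread data")
        saved = peek_receive_buffer(b)   # checkpoint copy
        still_there = b.recv(65536)      # application read still succeeds
    finally:
        a.close()
        b.close()
    return saved, still_there


if __name__ == "__main__":
    print(demo_peek())
    print(checkpoint_control(TcpControl(100, 150, 200, 260)))
```

Saving the buffers separately and marking them empty in the control state is what lets restore refill them through ordinary write and receive paths instead of writing kernel buffers directly.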
  • 13. Communication State Restore
      • Create a new socket
      • Copy control state in checkpoint to socket structure
      • Restore checkpointed send buffer data using the socket write call
      • Deliver checkpointed receive buffer data to application on demand
          • Copy checkpointed receive buffer data to a special buffer
          • Intercept receive system call to deliver data from special buffer until buffer is emptied
      • [Diagram: state for one socket. Checkpointed control state is applied to the new socket by direct update; checkpointed send buffers (Sh..St) are restored with write(); checkpointed receive data (Rh..Rt) reaches the application through the intercepted receive system call.]
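The receive-side restore on slide 13 (serve checkpointed data from a special buffer first, then fall through to the live socket) can be modeled with a thin wrapper. This is a user-level Python sketch under assumptions: `RestoredSocket` is a hypothetical name, and Cruz performs the interception at the system call layer inside a kernel module rather than in a wrapper class.

```python
import socket


class RestoredSocket:
    """Wrap a live socket and deliver checkpointed receive data first."""

    def __init__(self, sock: socket.socket, saved_recv: bytes) -> None:
        self._sock = sock
        # The "special buffer" holding receive data captured at checkpoint.
        self._pending = bytearray(saved_recv)

    def recv(self, bufsize: int) -> bytes:
        # While checkpointed data remains, serve it and do not touch
        # the live socket.
        if self._pending:
            chunk = bytes(self._pending[:bufsize])
            del self._pending[:bufsize]
            return chunk
        # Special buffer is empty: behave like an ordinary receive.
        return self._sock.recv(bufsize)


def demo_restore() -> list:
    """Read saved data, then new data, through the wrapper."""
    a, b = socket.socketpair()
    try:
        a.sendall(b"new")                      # arrives after restart
        restored = RestoredSocket(b, b"old-data")
        return [restored.recv(3), restored.recv(100), restored.recv(100)]
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo_restore())
```

Delivering saved data on demand avoids needing a way to write into the kernel's receive buffers, which slide 11 lists as a missing interface.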
  • 15. Checkpoint-Restart of Distributed Applications
      • State of processes and messages in channel must be checkpointed and restored consistently
      • Prior approaches specific to a particular library, e.g., modify the library to capture and restore messages in channel
      • Cruz preserves TCP connection state and IP addresses of each pod, implicitly preserving global communication state
          • Transparently supports TCP/IP-based distributed applications
          • Enables efficiencies compared to library-based implementations
      • [Diagram labels: Communication Channel; Checkpoint; per node: Processes, Library, TCP/IP]
  • 16. Checkpoint-Restart of Distributed Applications in Cruz
      • Global communication state saved and restored by saving and restoring TCP communication state for each pod
      • Messages in flight need not be saved since the TCP state will trigger retransmission of these messages at restart
          • Eliminates O(N²) step to flush channel for capturing messages in flight
          • Eliminates need to re-establish connections at restart
      • Preserving pod’s IP address across restart eliminates need to re-discover process locations in library at restart
      • [Diagram labels: Communication Channel; Checkpoint; per node: Pod (processes), Library, TCP/IP]
  • 17. Consistent Checkpoint Algorithm in Cruz (Illustrative)
      • Algorithm has O(N) complexity (blocking algorithm shown for simplicity)
      • Can be extended to improve robustness and performance, e.g.:
          • Tolerate Agent and Coordinator failures
          • Overlap computation and checkpointing using copy-on-write
          • Allow nodes to continue without blocking for all nodes to complete checkpoint
          • Reduce checkpoint size with incremental checkpoints
      • Message sequence between the Coordinator node and the Agent on each node (from the diagram):
          1. Coordinator sends <checkpoint> to each Agent
          2. Agent disables pod communication§, saves pod state, and replies <done>
          3. Coordinator sends <continue> to each Agent
          4. Agent enables pod communication, resumes the pod, and replies <continue-done>
      • §: using netfilter rules in Linux
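The blocking coordinator/agent exchange on slide 17 can be simulated in a few lines to make the O(N) message count concrete. The Python sketch below is an in-process model under assumptions, not the Cruz protocol code: `Agent` and `coordinate_checkpoint` are hypothetical names, method calls stand in for network messages, a boolean flag stands in for the netfilter rules, and the coordinator is assumed to collect every <done> before sending any <continue>, as the blocking variant implies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Agent:
    """Per-node agent that checkpoints one pod (simulated)."""
    name: str
    pod_state: str
    comm_enabled: bool = True
    running: bool = True
    log: List[str] = field(default_factory=list)

    def on_checkpoint(self, store: Dict[str, str]) -> str:
        self.comm_enabled = False          # stand-in for netfilter rules
        self.running = False               # pod is stopped while saved
        store[self.name] = self.pod_state  # save pod state
        self.log += ["disable comm", "save state"]
        return "done"

    def on_continue(self) -> str:
        self.comm_enabled = True
        self.running = True
        self.log += ["enable comm", "resume pod"]
        return "continue-done"


def coordinate_checkpoint(agents: List[Agent]) -> Tuple[Dict[str, str], int]:
    """Run one blocking checkpoint round; return saved state and message count."""
    store: Dict[str, str] = {}
    messages = 0

    # Phase 1: <checkpoint> to every agent, one <done> back from each.
    replies = []
    for agent in agents:
        messages += 1
        replies.append(agent.on_checkpoint(store))
        messages += 1
    if any(reply != "done" for reply in replies):
        raise RuntimeError("an agent failed to checkpoint")

    # Phase 2: only after all <done> replies, <continue> to every agent.
    for agent in agents:
        messages += 1
        if agent.on_continue() != "continue-done":
            raise RuntimeError("an agent failed to resume")
        messages += 1

    return store, messages


if __name__ == "__main__":
    nodes = [Agent(f"node{i}", f"state-{i}") for i in range(3)]
    print(coordinate_checkpoint(nodes))
```

In this model each agent exchanges four messages with the coordinator, so the total grows linearly with the number of nodes, and no agent-to-agent traffic is needed to drain channels.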
  • 19. Evaluation
      • Cruz implemented for Linux 2.4.x on x86
      • Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark
      • Cruz incurs negligible runtime overhead (less than 0.5%)
      • Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable
  • 20. Performance Result: Negligible Coordination Overhead
      • Checkpoint behavior for Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes
      • Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests the scheme is scalable
      • Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second
  • 21. Related Work
      • MetaCluster product from Meiosys
          • Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications)
      • Berkeley Lab Checkpoint/Restart (BLCR)
          • Kernel-module-based checkpoint-restart for a single node
          • No identifier virtualization: restart will fail in the event of an identifier (e.g., pid) conflict
          • No support for handling communication state: relies on application or library changes
      • MPVM, CoCheck, LAM-MPI
          • Library-specific implementations of parallel application checkpoint-restart with disadvantages described earlier
  • 22. Future Work
      • Many areas for future work, e.g.:
          • Improve portability across kernel versions by minimizing direct access to kernel structures
          • Recommend additional kernel interfaces when advantageous (e.g., accessing socket attributes)
          • Implement performance optimizations to the coordinated checkpoint-restart algorithm
          • Evaluate performance on a wide range of applications and cluster configurations
          • Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)
  • 23. Summary
      • Cruz, a practical checkpoint-restart system for Linux
          • No change to applications or to base OS kernel needed
      • Novel mechanisms to support checkpoint-restart of a broader class of applications
          • Migrating networked applications transparent to communicating peers
          • Consistent checkpoint-restart of general TCP/IP-based distributed applications
      • Cruz’s broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management
  • 25. Zap Virtualization
      • Groups processes into a POD (Process Domain) that has a private virtual namespace
          • Uses system call interception to expose only virtual identifiers (e.g., virtual pids, virtual IPC identifiers)
          • Virtual identifiers eliminate conflicts with identifiers already in use within the OS on the restarting node
      • All dependent processes (e.g., forked child processes) are assigned to the same pod
      • Checkpoint and restart operate on an entire pod, which preserves resource dependencies across checkpoint and restart
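The identifier virtualization on slide 25 amounts to a per-pod translation table between virtual and real identifiers. The Python sketch below models that idea for pids only; the `Pod` class, its methods, and the `restart` helper are hypothetical names chosen for illustration and do not reflect Zap's kernel data structures.

```python
from typing import Dict, Iterable, List, Optional


class Pod:
    """Private namespace mapping virtual pids to real pids (simulated)."""

    def __init__(self) -> None:
        self._v2r: Dict[int, int] = {}
        self._r2v: Dict[int, int] = {}
        self._next_vpid = 1

    def add_process(self, real_pid: int, vpid: Optional[int] = None) -> int:
        """Register a process; reuse a given vpid on restart, else allocate one."""
        if vpid is None:
            vpid = self._next_vpid
        if vpid in self._v2r:
            raise ValueError(f"vpid {vpid} already in use in this pod")
        self._v2r[vpid] = real_pid
        self._r2v[real_pid] = vpid
        self._next_vpid = max(self._next_vpid, vpid + 1)
        return vpid

    def to_real(self, vpid: int) -> int:
        """Translate on the way into the kernel (e.g., a kill(vpid) call)."""
        return self._v2r[vpid]

    def to_virtual(self, real_pid: int) -> int:
        """Translate on the way back to the application (e.g., getpid())."""
        return self._r2v[real_pid]

    def snapshot(self) -> List[int]:
        """Virtual pids recorded at checkpoint; real pids are not saved."""
        return sorted(self._v2r)


def restart(vpids: Iterable[int], new_real_pids: Iterable[int]) -> Pod:
    """Rebuild the namespace on a new host with whatever real pids it assigns."""
    pod = Pod()
    for vpid, real_pid in zip(vpids, new_real_pids):
        pod.add_process(real_pid, vpid)
    return pod


if __name__ == "__main__":
    pod = Pod()
    pod.add_process(4321)
    pod.add_process(4400)
    moved = restart(pod.snapshot(), [9001, 9002])
    print(moved.to_real(1), moved.to_virtual(9002))
```

Since applications only ever see the virtual pids, the real pids handed out on the restarting node can differ freely without the conflict described on the slide.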
  • 26. Zap Checkpoint and Restart
      • Checkpoint:
          • Stops all processes in pod with SIGSTOP
          • Parent-child relationships saved from /proc
          • State of each process is captured by accessing system call handlers and kernel data structures
      • Restart:
          • Original forest of processes recreated in a new pod by forking recursively
          • Each process restores most of its resources using system calls (e.g., open files)
          • Kernel module restores sharing relationships (e.g., shared file descriptors) and other key resources (e.g., socket state)
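The restart step "original forest of processes recreated by forking recursively" on slide 26 is a parent-before-child tree walk over the saved parent-child relationships. The Python sketch below shows only that traversal order: `recreate_forest` is a hypothetical name, and the `spawn` callback is a stand-in for the real fork so the example stays a plain simulation.

```python
from typing import Callable, Dict, List, Optional


def recreate_forest(
    children: Dict[int, List[int]],
    roots: List[int],
    spawn: Callable[[int, Optional[int]], None],
) -> List[int]:
    """Recreate each process after its parent, depth first.

    children maps a virtual pid to the virtual pids of its children,
    as saved at checkpoint time. spawn(vpid, parent_vpid) stands in
    for the fork performed by the parent.
    """
    order: List[int] = []

    def recreate(vpid: int, parent: Optional[int]) -> None:
        spawn(vpid, parent)       # parent "forks" this process
        order.append(vpid)
        for child in children.get(vpid, []):
            recreate(child, vpid)  # each process recreates its own children

    for root in roots:
        recreate(root, None)
    return order


if __name__ == "__main__":
    saved = {1: [2, 3], 2: [4]}
    print(recreate_forest(saved, [1], lambda vpid, parent: None))
```

Creating children from their recreated parent is what restores the original parent-child relationships, which a flat list of independently started processes would lose.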
  • 29. Performance Result: Impact of Dropping Packets at Checkpoint
      • Benchmark streaming data at maximum rate over a GigE link between 2 nodes
      • Shows TCP recovers peak throughput in 100 ms
          • Will be overshadowed by checkpoint latency in real applications
          • Optimizations can overlap TCP recovery entirely with checkpointing