SRM Check List
Sponsored by
Sean Clark
VCP 2, 3, 4 and VMware vExpert 2009
Executive summary
Planning and implementing a VMware disaster recovery (DR) plan is not a task to be taken lightly. If you jump in without some minimum planning, you'll end up with surprises that you, your boss or your budget won't find agreeable. Worse yet, if you skip key steps and neglect to collect key information prior to designing, you can end up with an unwieldy DR solution that doesn't meet the needs of the business. This eBook will serve as a checklist that can guide you in the creation of a top-notch VMware disaster recovery plan. The checklist follows industry best practices for implementing complex technology solutions by taking a phased approach to instituting a DR plan. This approach includes the following phases:
Assessment: Gathering key requirements for the DR solution
Design: Creating a DR plan to meet business and technical requirements
Deploy: Stand up the necessary infrastructure. Install, configure and test the solution
Manage: Test your DR plan as frequently as possible
This approach should be reapplied as your business requirements change or to take advantage of technology advancements that can reduce costs and enhance DR capabilities. The result is a DR plan that is flexible enough to adapt with the times and your business.
Since we're talking about recovery of VMware environments, we will focus on leveraging its unique capabilities to the maximum degree. The unique capabilities and properties of VMware environments allow us to create the ultimate test-driven DR plan and can be a catalyst for moving to 100% virtualized environments. This eBook focuses on the planning side of DR rather than on the execution and operation of the DR plan. As such, we'll focus on the Assess and Design phases.
Assess
Business impact analysis
Determine RPO (recovery point objective) and RTO (recovery time objective)
Understand your budget
Understand application dependencies
Automate VMware environment data collection
Design
Virtualize stragglers
Analyze resource requirements
Design for easiest restore
Decide on infrastructure configuration
Test-drive DR plan
What's in a BIA?
A BIA can be as complex and time consuming as you want it to be. Complexity and time are also a direct function of the size and complexity of your business. Either way, all business impact assessments aim to accomplish the same fundamental tasks. For example, the following list was gathered from a great free BIA template published by the U.S. Centers for Disease Control and Prevention for the purpose of helping guide the development of a DR plan. In this template, the following main areas of assessment are addressed:
unless lawsuits are considered, but a high quality of patient care dictates that the best patient health information be available to ensure continuously good care through shift changes and doctor rotations. So, in most healthcare organizations, budget becomes the determining factor for what you'll end up choosing for RTO and RPO. While working within your fixed budget, you may establish RTOs and RPOs that are as low as possible during the times you require. Veeam Backup & Replication, with its ability to back up, replicate and provide near-CDP in one product, provides the flexibility to address all healthcare needs.
Phased Budget
Ideally, you should suggest a phased DR assessment budget to the business or to your customer. With this method, you are given a fixed budget to conduct your assessment and initial planning, with the expectation that the outcome will be a more accurate budget estimate for the final solution. This method builds trust with the business or customer and assures them that you are not making these recommendations on a whim and that the DR plan reflects due diligence. But once the assessment is complete and a preliminary design can be conducted, you'll have a decision to make in determining how much you will ask for.
virtualized DR and communicate the benefits of legacy-free disaster recovery. If business leaders understand the true value of virtualized DR, you should have success funding the project properly to realize the full benefits. A good rule of thumb is to shoot for the plan that can create the lowest TCO (total cost of ownership) over the next 3 to 5 years.
Executive Champion
Virtualization can be a complex topic for business leaders to understand. Throwing DR planning on top of that can sometimes put non-technical leadership into a corner they are uncomfortable with. To make sure you get the DR plan your company needs, you'll need to understand the concerns and realities of the business, and be able to clearly communicate the solution's benefits to the company. If you are dreading these conversations, you should consider identifying an executive champion to get involved. This champion is usually an executive familiar with IT but accustomed to planning with other executives and speaking at a technical level they can understand. For companies without formal technology executive positions like a chief information officer (CIO), an executive champion dedicated to the DR planning project can be a critical component.
DHCP and DNS can be recovered, you need to ensure you have a strategy in place to guarantee their recovery. If Active Directory servers are restored incorrectly, you will waste precious hours conducting manual recovery steps to restore function to this critical service. Ensuring a successful, trouble-free restore actually starts with a VM backup and replication tool that fully supports VSS (Volume Shadow Copy Service) for both backup and restore. Veeam Backup & Replication has provided this functionality since its first release. This VSS-aware technology is the key layer of defense for situations in which you are unable to replicate Active Directory to your disaster recovery facility. Example scenarios include single-site businesses without the budget for a DR site, or businesses that choose to only maintain DR contracts with cloud providers and don't want to maintain active VMs that incur monthly fees.
period of time to ensure you have a good known state to restore to. Then shortly after the software change has been determined to be successful, you would then want to take another series of full and incremental backups to provide a good known state of the VM to restore back to if disaster occurs after the VM change. Once you have proven you have good backups of these stateless VMs, the schedule can be relaxed to save bandwidth and I/O. As long as you continue to test restores of the entire application stack, along with its dependent data tier, this can be an option for certain DR requirements.
Veeam Reporter
Veeam Reporter is a great tool that gives you insight into your environment very quickly. There is both a free version and a full version of Veeam Reporter. Both can be invaluable in quickly collecting inventory information from existing VMware environments. This information helps document your primary virtual datacenter, and can be a guide for setting up the disaster recovery site. Output from this tool includes spreadsheets, Word documents and even Visio diagrams of your VMs, ESX(i) hosts, VMware datastores and networks. Veeam Reporter can grab a lot of data in just a few minutes.
Veeam Monitor
Whether you are looking to build out the bare minimum DR infrastructure or you are looking to determine at what point your DR solution is just getting gaudy, you'll need good statistics on current resource utilization to properly size your DR infrastructure. Veeam Monitor can be used to assist you with this resource utilization collection. Again, there is a free version, but the full version is the way to go if you would like to continue charting virtual infrastructure utilization after the planning phase. When measuring virtual infrastructure utilization, we are trying to capture a few key areas that will determine the ultimate cost of the solution. CPU and memory utilization are important for sizing the servers required for the recovery site. Here we'll focus on CPU GHz used on average, and GB memory consumed.
Virtualized DR Benefits
Snapshots: be able to test critical patches and roll back if failure occurs
Restore entire server image (operating system [OS], application, data) to any hardware without messing with drivers
Refresh hardware with zero downtime
Restores can be automated and tested often
Files and other application items can be restored from VM backup images
VSS integration can ensure application consistency
have the right consolidation ratios given the virtualized workloads and underlying hardware resources. Having the tools to properly plan these consolidation ratios, monitor utilization and give application owners relevant virtual hardware performance statistics is critical. Tools like Veeam Monitor can provide the performance view at the hypervisor level and provide OS level statistics in a single view. Having the right plan and the right visibility into performance will help gain trust and drive more use of virtualization where you need it most.
key virtual infrastructure inventory as well as resource utilization statistics is recommended. It's now time to use that information to properly size your DR solution.
Compute
It's best to start with compute statistics, that is, how much CPU and memory are utilized. If you're planning to restore every single server and maintain identical capacity, this exercise is easy and you'll duplicate your production environment at the DR facility. In more budget-minded organizations, you're going to analyze the statistics of production workloads and identify only the critical workloads that need to be running in order to establish the estimated DR infrastructure required in a disaster. These utilization statistics translate into CPU sockets required.
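As a rough illustration of how utilization statistics turn into a host count, here is a minimal sketch. The VM names, the utilization figures, the host specification and the 25% headroom are all assumptions made up for the example; substitute the averages you collected from your own environment.

```python
import math

def hosts_required(avg_cpu_ghz, mem_gb, host_cpu_ghz, host_mem_gb, headroom=0.25):
    """Estimate how many DR hosts the critical workloads need.

    headroom is the fraction of each host deliberately left free
    (an assumed 25% here, not a vendor recommendation).
    """
    usable_cpu = host_cpu_ghz * (1 - headroom)
    usable_mem = host_mem_gb * (1 - headroom)
    by_cpu = math.ceil(avg_cpu_ghz / usable_cpu)
    by_mem = math.ceil(mem_gb / usable_mem)
    # Whichever resource runs out first decides the host count.
    return max(by_cpu, by_mem)

# Hypothetical critical workloads: (name, average CPU GHz used, GB memory consumed)
critical_vms = [("erp-db", 6.0, 32), ("erp-app", 4.5, 16), ("mail", 3.0, 24), ("dc01", 0.5, 4)]
total_cpu = sum(vm[1] for vm in critical_vms)
total_mem = sum(vm[2] for vm in critical_vms)

# Hypothetical host: 2 sockets x 6 cores x 2.6 GHz = 31.2 GHz, 64 GB RAM
print(hosts_required(total_cpu, total_mem, host_cpu_ghz=31.2, host_mem_gb=64))
```

In this made-up case memory, not CPU, is the limiting resource, which is a common outcome and a reason to size on both figures rather than on sockets alone.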
Storage
When analyzing the storage requirements, we want to make sure that we have enough raw storage space to store what is necessary. But we also want to know what kind of performance characteristics are required to drive your primary workloads. For storage space requirements you will start with the total gigabytes of all the VMs that you plan to recover to the DR site. Add to that the number of full backups or replicas you'll want to keep and the volume of daily incremental backups. Generally speaking, DR storage is based on the VM's configured memory and storage allocation, so calculations could be derived from the statistics gathered with Veeam Reporter. In addition to the raw space, it's important to understand the performance required of your storage systems. This is usually measured as IOPS and storage bandwidth. These two statistics describe how active your VM storage is, and whether you can get by with SATA drives, SAS drives, or whether you'd be a good candidate for an auto-tiered storage system with enterprise flash or SSDs for tier 0, SAS for tier 1 and SATA for tier 2. Many people make the mistake of buying large-capacity SATA for DR because they can save on storage purchase costs. However, when it comes time to rely on that storage in a disaster, the availability of their systems is in jeopardy due to performance. It's understandable to want to save money on your DR, but for an investment of this size, you need to make sure you're not shooting yourself in the foot by taking on too much risk. Analyze the statistics and budget accordingly.
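The raw-space arithmetic described above can be sketched as follows. The function name, the example figures and the simplification that every full copy is as large as the VMs themselves (no compression or deduplication) are assumptions for illustration only.

```python
def dr_storage_gb(vm_total_gb, full_copies, daily_change_rate, retention_days):
    """Rough raw capacity for the DR site, in GB.

    vm_total_gb       -- total size of the VMs to be recovered
    full_copies       -- number of full backups or replicas kept
    daily_change_rate -- fraction of data that changes per day (e.g. 0.05)
    retention_days    -- days of daily incrementals retained
    Assumes no compression or deduplication, so this is a conservative figure.
    """
    recovered_vms = vm_total_gb
    fulls = vm_total_gb * full_copies
    incrementals = vm_total_gb * daily_change_rate * retention_days
    return recovered_vms + fulls + incrementals

# Hypothetical environment: 2 TB of VMs, 2 full copies, 5% daily change, 14 days kept
print(dr_storage_gb(2000, full_copies=2, daily_change_rate=0.05, retention_days=14))
```

A capacity figure like this says nothing about IOPS or bandwidth, so it should be read alongside the performance statistics rather than instead of them.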
Network
We talked about calculating the daily change rate of data earlier. If you decide to use replication over a secured VPN connection across the Internet or a leased WAN circuit, you'll want to know how much data will need to be moved across the network in a single day or replication window. This will help you forecast the Internet bandwidth required to be successful and whether you need to upgrade your WAN circuits. If your options for Internet bandwidth are limited, this analysis will be crucial to understanding whether you are a good candidate for replication or whether whole-VM backups to tape-backed disk archives are a better option for you.
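A back-of-the-envelope version of this forecast is sketched below. The function name, the example change volume and the 70% link efficiency (to allow for protocol and VPN overhead) are assumptions, not measured values.

```python
def required_mbps(daily_change_gb, window_hours, efficiency=0.7):
    """Sustained link speed (Mbps) needed to move the daily change within the window.

    efficiency is the assumed usable share of the nominal link speed.
    Uses decimal units: 1 GB = 8e9 bits, 1 Mbps = 1e6 bits per second.
    """
    bits = daily_change_gb * 8e9
    seconds = window_hours * 3600
    return bits / seconds / 1e6 / efficiency

# Hypothetical: 100 GB of changed data, 8-hour replication window
needed = required_mbps(100, window_hours=8)
available = 45  # hypothetical usable WAN bandwidth in Mbps
print(f"need {needed:.1f} Mbps, have {available} Mbps, fits: {needed <= available}")
```

If the required figure exceeds what you can buy, the choices are a longer window, less data replicated, WAN acceleration, or falling back to whole-VM backups moved by other means.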
of the connection. If your connections suffer from high latency and packet loss, you may not be able to meet your backup windows and, consequently, you will end up with worse (longer) RPOs. Products like HyperIP from Netex offer WAN acceleration technology that is purpose-built for accelerating large data transfers over network links with packet loss and high latency. If you are a Veeam customer, they even offer a 1-year free trial version of HyperIP to allow you to thoroughly kick the tires before purchasing.
Using legacy protection measures on virtualized workloads introduces complexity, cost and risk into your DR plan that your company can't afford. By restoring the whole VM you drastically cut the number of steps required, and you open the door to testing and verifying the recovery of the VM before you need it.
Servers
Any x86 server hardware will do here, but the question is more about what kind of capacity you require in a disaster and what the most cost-effective way is to meet that need at the DR site. Those 5-year-old 2U rackmount servers with 4 total processor cores and 16GB of RAM might do okay in a pinch for a small portion of your DR environment, but only if you don't recover the whole environment. Although those servers might be free, they don't look so good when you're paying for rack space and power for dozens of servers at a co-location facility. It may be more cost-effective to purchase new servers that have 10 times the capacity, which can reduce DR licensing costs for Veeam and VMware while cutting your physical space and power requirements by a factor of 10. Whatever your decision, make sure you provide adequate capacity based on real-world measurements from your production VMware environment and guided by the Business Impact Assessment (BIA).
Storage
If you need vMotion and High Availability at the DR site, you'll need to invest in shared storage to go with the VMware ESX(i) servers that you'll be replicating to or restoring VM backups to. NAS, iSCSI and Fibre Channel are all good choices, and most are valid options. If you are a small business or a small remote office, shared storage for VMware may not always be possible and you may be required to use local storage contained within the recovery servers. Although these setups aren't as efficient to manage in a production environment, they can be good enough in a disaster to allow your business to provide revolutionary DR capability at a bargain price. In configurations with locally attached storage, you'll be happy to know that Veeam Backup & Replication can support that option as well, since it can write to any VMware datastore visible to the ESX(i) server.
Network
We talked earlier about network considerations. Basically, there is enough network bandwidth or there is not. There either is high latency or there is not. If you have the budget, make the investment in high-bandwidth links between your recovery site and your primary datacenter. This allows the most reliability and the lowest operational cost for your backups and replication, since no error-prone manual or physical methods are required to move data to the recovery site. Whether it is due to budget-related or geography-related limitations, not everyone has the network bandwidth available to replicate critical assets. That's why the old adage remains true, "Never underestimate the bandwidth of a van full of tapes driving down the highway." Your network realities may dictate a whole-VM backup to disk or tape that is then trucked off-site for safekeeping or for test restoration at the DR site. In these situations you will be sacrificing the ultimate RTO from the start, so it's not as important to have all servers racked, stacked, powered and ready to go. You might even consider alternate means for provisioning server resources in these scenarios.
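To put the adage about the van full of tapes into numbers, the sketch below compares sending a full backup set over a WAN link with physically transporting it. The backup size, link speed, efficiency and trip time are hypothetical, and the courier figure ignores the time needed to write and restore the media.

```python
def link_transfer_hours(data_tb, link_mbps, efficiency=0.8):
    """Hours to push data_tb (decimal terabytes) over a link at an assumed efficiency."""
    bits = data_tb * 8e12
    return bits / (link_mbps * 1e6 * efficiency) / 3600

def courier_effective_mbps(data_tb, trip_hours):
    """Effective throughput of driving the media to the recovery site."""
    return data_tb * 8e12 / (trip_hours * 3600) / 1e6

full_backup_tb = 10  # hypothetical size of the whole-VM backup set
print(f"100 Mbps link: {link_transfer_hours(full_backup_tb, 100):.0f} hours")
print(f"4-hour drive:  {courier_effective_mbps(full_backup_tb, 4):.0f} Mbps effective")
```

The comparison only covers bulk transfer time. Physical transport still carries the RTO penalty and manual handling risk discussed above.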
DR in the Cloud?
In the case where replication is not an option and you have VM backup images that can be restored to any ESX(i) servers in the world, why not restore to the cloud? There are countless VMware hosting providers available today that can rent you resource pools or whole VMware environments. Rather than investing in expensive, duplicate datacenter locations that will only be used in the unlikely event that an actual disaster occurs, you can instead bank that money and only pay a small portion in the event a disaster happens or in the event you'd like to test your recovery. If you do decide to move forward with restoring to a VMware hosting provider, you may need to do some advance planning on the contract side to help speed your recovery if needed. Although we're getting closer to the dream world that allows you to whip out the company credit card and spin up a DR site in minutes, it's more likely that you'll want to sign a contract in advance in order to get some guarantee that the resources you'll require will be available should you declare a disaster. Of course this insurance will cost you, but it will be much less than if you purchased the resources full time or if you stood up your own DR site.
Test-Driven DR Plan
In the software development world, a popular software development process is test-driven development (TDD). In TDD, developers first create automated unit tests that will only pass successfully if the new piece of code under development fulfills all criteria. By writing the test first and then developing the code, quality is improved since code can't be released until the unit tests pass successfully. This process has proven very successful for software developers, and a derivation of the process can now be applied to virtualized DR. This derivation can be called test-driven DR and it turns traditional DR planning on its head by planning for the restore and verification of the restore before you plan to do your first backup.
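One way to picture test-driven DR is to write the pass/fail criteria for a restore before any backup job exists. The sketch below is purely illustrative: the `RestoreResult` structure, its fields and the thresholds are hypothetical and are not part of any Veeam or VMware API. In practice the values would come from whatever restore-testing tool you use.

```python
from dataclasses import dataclass

@dataclass
class RestoreResult:
    """Outcome of one restore test (hypothetical structure)."""
    vm_name: str
    booted: bool             # the VM powered on and the OS came up
    app_responded: bool      # the application answered a health check
    restore_minutes: float   # time from start of restore to service available
    data_age_minutes: float  # age of the restored data at restore time

def restore_passes(result, rto_minutes, rpo_minutes):
    """The 'unit test' for a VM: defined first, from the BIA's RTO and RPO."""
    return (result.booted
            and result.app_responded
            and result.restore_minutes <= rto_minutes
            and result.data_age_minutes <= rpo_minutes)

# Hypothetical results checked against a 60-minute RTO and a 240-minute RPO
good = RestoreResult("erp-db", True, True, restore_minutes=12.0, data_age_minutes=45.0)
slow = RestoreResult("mail", True, True, restore_minutes=95.0, data_age_minutes=45.0)
for r in (good, slow):
    print(r.vm_name, "PASS" if restore_passes(r, 60, 240) else "FAIL")
```

As in TDD, the backup and replication design is then considered finished only when every critical VM passes its check.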
vPower enables these game-changing capabilities in Veeam Backup & Replication v5:
Instant VM Recovery: restore an entire virtual machine IN MINUTES by running it directly from a backup file
U-AIR (Universal Application-Item Recovery): recover individual objects from ANY application, on ANY OS
SureBackup Recovery Verification: automatically verify the recoverability of EVERY backup, of EVERY virtual machine, EVERY time
A BIA helps you identify critical business systems, their IT and human dependencies, and an estimated disruption impact to your business. You can then determine which applications are most important.
Know your recovery point objective (RPO) and your recovery time objective (RTO).
Develop an intimate understanding of how your business runs and catalog the key resources and processes necessary to enable revenue creation. Translating the findings from the BIA into RTO/RPO requirements for each application helps you focus your resources where they are needed most.
Virtualized DR is much less expensive than traditional physical DR, but it is still an additional cost on top of your already sizable investment in virtualization software, supported server hardware and new storage systems. Avoid the expense of maintaining both legacy and virtualization-aware backup systems by migrating to all virtual DR.
Most applications have dependencies external to the virtual machine (VM) that they run on. In a disaster, it's critical to have cataloged these dependencies because you will have to recover each one to restore end-to-end functioning for that application. Start at the base infrastructure services like DHCP, DNS and Active Directory. But don't forget to account for file shares, databases or other non-virtualized servers recovered through legacy means.
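A cataloged dependency map can also be turned into a recovery order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the service names and their relationships are hypothetical examples, not a prescribed layout.

```python
from graphlib import TopologicalSorter

# Hypothetical catalog: each service maps to what must be recovered before it.
dependencies = {
    "dns": set(),
    "dhcp": set(),
    "active-directory": {"dns"},
    "file-share": {"active-directory"},
    "database": {"active-directory"},
    "erp-app": {"database", "file-share", "dhcp"},
}

# static_order() yields every service after all of its prerequisites.
recovery_order = list(TopologicalSorter(dependencies).static_order())
print(recovery_order)
```

Base infrastructure services come out first and the application last. If the catalog contains a circular dependency, `TopologicalSorter` raises a `CycleError`, which is itself a useful finding during planning.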
VMware inventory automation can quickly and accurately collect information on the virtual environment that's invaluable to designing your DR plan. Tools from Veeam can help ease the task. Veeam Reporter can catalog the configuration of the VMware environment, even providing Visio diagrams to reference. Veeam Monitor can provide the performance statistics you need to size your DR infrastructure. Plus, a Veeam Backup & Replication proof of concept (POC) is a good way to learn what your daily data change rate is so you can appropriately size your network connections to the recovery location.
Virtualize stragglers.
If you still have physical servers, it's time to make the switch. The benefits of virtualized DR are well known and have been written about and practiced for more than 5 years. No matter how good your DR solution for physical servers is, it can't come close to the capabilities and efficiencies of virtualized DR. Virtualize your remaining physical servers to achieve the most benefit from your DR plan.
The data collected in the assessment phase of your DR plan is invaluable for sizing the DR site and creating your DR budget for servers and storage. The largest limiting factor to the ideal DR plan for VMware is the bandwidth required to replicate all necessary VMs. Consider products, such as HyperIP from Netex, that offer WAN acceleration technology purpose-built for accelerating large data transfers over network links with high packet loss and high latency. This can allow for better use of available bandwidth without breaking your budget.
Simplicity is king in a DR situation. Rather than reinvent the wheel by reinstalling operating systems, applications and restoring individual files, you should restore the entire application as a VM to minimize restore time. Replicating the VM with Veeam Backup & Replication can provide the lowest RTO possible since VMs only need to be powered on to restore service.
In today's cost-conscious IT environment, it's good to know there are options for your recovery site configuration. Although you can choose to self-host DR options in your own facilities, you can also take advantage of VMware service providers that could provide your DR infrastructure as an on-demand cloud service.
Setting up a DR plan for VMware environments is not a one-time activity. You need to ensure that you test, test and test. Manual tests are good, but since you're working with VMware technology, there's no reason testing can't be automated and run as often as daily if needed. Using Veeam Backup & Replication's SureBackup automated backup verification feature is a great way to do this.