I wrote this blog article for Rackspace to go on their blog. The original can be found on the Rackspace Blog site.
VMware® vCenter™ Site Recovery Manager™ is used to orchestrate the failover of a group of virtual machines (VMs) from one location to another. It’s a really useful tool to have in your arsenal to tackle the daunting task of disaster recovery (DR), especially if you take a little time to plan for how and when it will be used.
What apps should you recover?
Starting to plan for DR is hard because it’s difficult to know where to begin. Best practices recommend starting with a list of all applications and their integrations. I say list every application you have—from large-footprint, heavy hitters to those quietly running on a single server in the corner. You may rediscover some you forgot because they run in the background.
Once you have a complete list, decide which are essential apps for your DR plan. Don’t be afraid to leave some apps that you consider essential during normal business operations off the list. Why? Because in a real disaster, you probably won’t need to bring up an app used to inventory new office equipment.
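The triage step above can be as simple as a spreadsheet with an "essential for DR" column. As a minimal sketch (the app names and the `dr_essential` flag are illustrative, not anything SRM defines):

```python
# Hypothetical application inventory; "dr_essential" records the triage
# decision: does this app need to come up in a real disaster?
inventory = [
    {"name": "Order Processing", "dr_essential": True},
    {"name": "Email",            "dr_essential": True},
    {"name": "Asset Inventory",  "dr_essential": False},  # can wait out a disaster
]

# The DR scope is simply the essential subset -- these are the apps
# your recovery plans must cover.
dr_scope = [app["name"] for app in inventory if app["dr_essential"]]
print(dr_scope)
```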
Where and how should you recover apps?
When you have an idea of what apps you want to recover, you need to know where you’re going to recover them. If you have a second data center, you could use that as your DR location. If you don’t operate two data centers, you’ll need to lease co-location (CoLo) space or engage with a service provider like Rackspace. How do you decide? Your biggest consideration in choosing CoLo or a service provider is whether you want to run Active/Active. You may want to run everything this way, but it can be cost prohibitive. At a minimum, you’ll need to run your vCenter and Site Recovery Manager servers, as well as their supporting services such as single sign-on or SQL as Active/Active.
In a perfect world, you would have enough resources on-premises and off-premises to run 100% of everything, leaving 50% idle hardware during normal operations. But if cost is an issue, choosing a service provider and using Site Recovery Manager is a better option than CoLo because you can establish an Active/Passive setup where the target data center is a scaled down version of your production environment.
A service provider can save you hardware, power, and cooling costs, and in a disaster, you can operate at 75% of normal capacity. You can also use idle hardware for development or quality-control work because Site Recovery Manager is smart enough to suspend VMs in the DR environment during a disaster. VMs are suspended, not shut down, making it easy for your developers to get back to work when the VMs are restored.
I purchased Site Recovery Manager. Now what?
Site Recovery Manager is a great orchestration tool, and planning to use it from the start is smart because it really helps you determine what's in scope and what's out of scope. Site Recovery Manager can handle bringing up lots of VMs, in a specific order, with specific dependencies. It can even go as far as booting VMware hosts from standby and suspending non-critical VMs at the recovery site. That being said, some vendors don't officially support running their workloads under Site Recovery Manager. We know a failover is no different from power cycling a running physical server, but you don't want to end up in a situation where your software vendor won't support you.
Why don’t some apps support Site Recovery Manager?
Even if you have Site Recovery Manager, some of your apps may not be completely covered in the event of a disaster. This can be frustrating, but until apps are fully tested, vendors aren’t sure how they will behave.
If an application has its own native replication (e.g., Microsoft Active Directory, SQL, Exchange), you're better off using it than placing these apps on Site Recovery Manager. Native replication is tested and supported. Active Directory requires lots of thought, especially around proper testing. Microsoft says not to use Site Recovery Manager for a Domain Controller, but does say you can clone one into your test network for testing. Sound odd? I know, Microsoft almost never says to clone a Domain Controller, but this is one situation where it's recommended, as long as you destroy the clone when you're done.
How should I run Site Recovery Manager?
The best thing to do is have a separate management VMware cluster at both locations where your vCenter & SRM servers will reside. This cluster would house all integrations, too, like the database server, VMware’s SSO, VMware Update Manager, etc. You need your vCenter & SRM server to run in an Active/Active configuration so they’re both online 24/7/365 (except in a disaster, of course). You’ll also want to make sure you’ve enabled both High Availability (HA) and Distributed Resource Scheduler (DRS) on the vSphere cluster so you can tolerate a host failure. A disaster can come in many forms, even bad RAM in your host running your vCenter server.
If you can’t do dedicated management clusters for your vCenter & SRM servers and their corresponding services, you can put them in any vSphere HA/DRS cluster; just make sure they have dedicated resources. This can be done through individual VM shares or reservations, or by creating a Resource Pool. Both ways work, and it's up to you how you want to tackle that one, but the last thing you want is for your vCenter & SRM servers to run short of physical resources during a DR failover. You'll also want to make sure they're not on an SRM-protected datastore; that'd be additional replication traffic you don't need. Once your primary site comes back online, the vCenter managing that site will come back online, too.
How does Site Recovery Manager handle networking?
A huge hurdle when running Site Recovery Manager recovery plans is networking. Ideally, you'll want to span your VLANs so you can fail over the VMs without any network configuration changes. If you can't have your VLANs span data centers, you can designate fail-to port groups in vCenter Server backed by the same IP subnets as the source site. This is the hard part: if you run Active/Active, the DR site's router can't forward traffic for those subnets while the primary site owns them. Of course, you can use policy NATs, or remove the interfaces from the router, so traffic for the DR site network forwards to the correct site. You can also use Site Recovery Manager to change the in-guest IP during failover. There is some GUI functionality for this on a per-VM basis, but dr-ip-customizer.exe, a nifty tool bundled with Site Recovery Manager, can manage all of your protected VMs at once. A CSV file is required for mass import, but the tool can export your list of VMs, so all you have to do is change the IPs and reimport.
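Rewriting that exported CSV by hand gets tedious past a handful of VMs, so a short script can do the edit between export and reimport. This is only a sketch: the real dr-ip-customizer CSV has more columns and its exact headers vary by SRM version, so the "IP Address" column name and the subnet mapping below are assumptions for illustration.

```python
import csv
import io

# Map production subnet prefixes to their DR equivalents (illustrative values).
SUBNET_MAP = {"10.1.": "10.2.", "10.3.": "10.4."}

def remap_ip(ip):
    """Swap a production prefix for its DR equivalent, if one is mapped."""
    for src, dst in SUBNET_MAP.items():
        if ip.startswith(src):
            return dst + ip[len(src):]
    return ip  # unmapped addresses pass through unchanged

def remap_csv(text, ip_column="IP Address"):
    """Rewrite the IP column of an exported CSV; the header name is an assumption."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row[ip_column] = remap_ip(row[ip_column])
    return rows

exported = "VM Name,IP Address\nweb01,10.1.0.15\ndb01,10.3.0.40\n"
print(remap_csv(exported))
```

The same dictionary-of-prefixes approach works however many subnets you fail over; anything not in the map is left alone.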
Testing your Site Recovery Manager recovery plans can be tricky. In the recovery plan, you can designate different destination networks for test versus recovery. The default testing action is to create what some call a "bubble network" for test VMs: Site Recovery Manager automatically creates a VMware vNetwork Standard Switch (vSS) with generic port groups on each host to validate that replication is working and the VMs can power on. This is fine if you simply want to validate that portion and completely isolate the test VMs, but communication between VMs on different hosts will be blocked because the vSS and its port groups have no physical NICs backing them. I recommend taking your testing further and either having test mode use your production networks or having designated test networks in place for the VMs. This way you can test actual integrations. There are lots of things to consider with testing, which I'll cover in a future post.
What about storage?
Sizing your storage appropriately with Site Recovery Manager can also be challenging. You’ll want a 1:1 storage footprint—it’s not something you can scale down. If you have 100TB of protected VM data, then you’ll need 100TB of usable storage at the DR site. However, with EMC RecoverPoint there is additional overhead required for the journal LUN. EMC recommends 20%, but you need to understand how the journal LUN is used during testing because a journal LUN that’s too small can cause your test VMs to fail or replication to stop, forcing a rescan. You’ll also need to tweak the settings to meet your recovery point objective requirements. NetApp SnapMirror works well, too, but if you set an inadequate frequency, it may not be able to keep up.
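The sizing above is simple arithmetic worth writing down: a 1:1 replica of the protected data plus the journal overhead. This sketch uses the 20% RecoverPoint journal guideline mentioned above; your RPO and daily change rate may demand a larger journal.

```python
def dr_storage_tb(protected_tb, journal_overhead=0.20):
    """Usable TB needed at the DR site: a 1:1 replica plus journal overhead.

    journal_overhead follows the 20% RecoverPoint guideline; treat it as a
    starting point, not a guarantee, since RPO and change rate drive it.
    """
    return protected_tb * (1 + journal_overhead)

# 100 TB of protected VM data needs roughly 120 TB usable at the DR site.
print(dr_storage_tb(100))
```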
How often should you test your recovery plans?
You’ll want to test recovery plans regularly. This is where properly sized journal LUNs and network design are crucial. You don't want the journal to fill up, nor do you want a test Windows server talking to Active Directory and inadvertently changing its machine password in the production data center. That would cause your current production Windows server to be out of sync and unable to authenticate, which wouldn't be good.
What about documentation?
Now that you’ve examined everything, you should document every application. By document, I mean put together descriptions with full application diagrams showing all integrations; communication paths and ports; sizing and number of servers; storage—EVERYTHING. Document it in detail! You probably had some idea of dependencies or which apps you wanted up and running first, but this may cause you to reevaluate priority to make sure certain services start first. That’s okay, it’s a good thing to find out now before a disaster occurs, and you may need to be a little flexible.
When diagramming, Tier 0 may be your authentication services, the base of everything. Tier 1 could be data providers or application backends required by the majority of your Tier 2 apps. Tier 3 could simply be reporting or patching services not needed immediately following a disaster.
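Those tiers are really a dependency graph, and a topological sort turns the graph into a recovery boot order. The app names and dependencies below are hypothetical; the point is that the order falls out of the edges, not out of anyone's memory.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each app maps to the set of apps it depends on (hypothetical examples).
deps = {
    "Active Directory": set(),                 # Tier 0: authentication, base of everything
    "SQL Backend":      {"Active Directory"},  # Tier 1: data provider
    "ERP Frontend":     {"SQL Backend"},       # Tier 2: business app
    "Reporting":        {"ERP Frontend"},      # Tier 3: can wait after a disaster
}

# static_order() yields each app only after everything it depends on.
boot_order = list(TopologicalSorter(deps).static_order())
print(boot_order)
```

This also catches circular dependencies early: `TopologicalSorter` raises a `CycleError` instead of silently producing an order that can't actually boot.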
If you use Visio, maybe have page one showing the actual deployment, a second page showing the proposed DR solution, and a third page showing future growth, or even the test, development, or quality-control environments. A DR review board may be useful as an auditor of all of your applications. I spent a year doing these reviews and can vouch for their value. Reviews will help you in the long run as an architect because you'll have a clear picture of integrations and how to deploy new applications. You'll quickly become versed in all aspects of your business, which can be invaluable.
Some of this is non-linear because you won’t have a 100% clear picture of storage and networking requirements until you have all of your application diagrams reviewed. Have a plan for storage and networking, but quickly refine the details as you learn more about your environment.
It’s important to understand this exercise might take months to complete, maybe even a year, but it's worth it. And because your environment is constantly changing, your information and recovery plans need to change with it. Having good information is key: garbage in, garbage out. You can't architect the proper solution if you don't have a clear picture of your entire environment.
You’ve got everything you need. Now what?
Your next step is to architect the proper Site Recovery Manager solution. If you have good data, this part should be easy. The next post in this Site Recovery Manager series will focus on architecting a solid DR solution.
Before I forget!! You can catch me at the following VMUG User Conferences:
May 15, 2014: Philadelphia – Event Details and Registration
June 24, 2014: Boston – Event Details and Registration
September 16, 2014: Dallas/Fort Worth – Event Details
September 24, 2014: Southern California – watch VMUG.com for upcoming details
Ongoing in the San Antonio area: Keep an eye on our VMUG Workspace for any upcoming VMUG meetings!