To include Active Directory or not to include Active Directory, that is the question.
I’ve been reading a lot around VMware’s Site Recovery Manager and considerations surrounding Active Directory. Most of what you will read says ‘NEVER’ protect AD with SRM, only use native AD replication, especially since SRM & vCenter at your Recovery Site require AD to be running anyway.
But what if you have multiple domains for different uses? This is where the lines become blurred. Think about this for a second:
- One AD environment (single forest/domain, no trusts) where vCenter & SRM live, call it infrastructure AD
- A second AD environment (also single forest/domain, no trusts) for your application servers, call it application AD
- You have infrastructure AD at both sites, SRM & vCenter authenticate accordingly
- Protected site has application AD
- Recovery site has nothing
Now here is where I say ‘why wouldn’t you protect AD with SRM?’ In a true disaster, the protected site is gone and no application AD exists anywhere, so using SRM to bring the Domain Controllers up on the recovery site makes sense. Is my logic flawed?
However, if I had my application AD living at both sites, using native replication, I agree 100% with leaving your Domain Controllers out of your SRM Recovery Plan. This leads to my concern…
Testing vs Planned vs Unplanned
This post will cover testing only. I’ll write a follow-up covering planned & unplanned failovers later.
To me, the only way to really test your DR plan (in this instance, your SRM Recovery Plan) is to have nothing differ between the test and the real failover.
Let’s look at it from the perspective of having nothing at the recovery site, basically if I decided to use a DR service provider as my target site. We each have our own vCenter servers, my SRM server is paired with their SRM server, I have AD at my site, and the DRaaS provider has AD at their site. Microsoft doesn’t officially support protecting DCs with SRM, although it’s really no different than losing power at a datacenter and bringing the DCs back up after power has been restored. There are now two main considerations: Active Directory integrated DNS, or standalone DNS.
Active Directory integrated DNS
The main risk here is a slight possibility that Active Directory services enter a race condition when DNS is AD-integrated. It’s the ‘chicken and the egg’ problem, where one depends on the other: AD relies on DNS, and DNS won’t be up unless AD is running.
I’ve talked with Microsoft regarding this, and although their recommendation is not to use SRM for AD DR, they did say it’s a fairly easy fix if you happen to end up in this race condition. You would have to enter Directory Services Restore Mode and basically pull DNS out. I really don’t know how common this is. How many of you have AD labs running where the rug gets pulled out from underneath them? I’ve had multiple labs with AD-integrated DNS and have NEVER had this problem (I bet I will now since I dropped the ‘NEVER’ word, HA!).
When building your SRM Recovery Plan, you’ll want to make sure your PDCE boots up first (for good measure). You could accomplish this in multiple ways:
- Place your PDCE in Priority Group 1, then the rest of the DCs in Priority Group 2, everything else in remaining Priority Groups
- Place all of your DCs in Priority Group 1 and set all of the non-PDCE DCs to depend on the PDCE
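To make the two options concrete, here is a minimal Python sketch that models a Recovery Plan as plain data and works out the resulting power-on order. It does not talk to SRM at all. The VM names (`pdce`, `dc2`, `dc3`, `app01`) and the `boot_order` helper are made up for illustration, and the model assumes dependencies only matter between VMs in the same Priority Group.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def boot_order(vms):
    """Return VM names in start order.

    vms maps a VM name to {"priority": int, "depends_on": [names]}.
    Priority Groups start in ascending order; inside a group, a VM
    starts only after the VMs it depends on. Dependencies that point
    outside the VM's own group are ignored (an assumption of this model).
    """
    order = []
    for group in sorted({vm["priority"] for vm in vms.values()}):
        members = {name for name, vm in vms.items() if vm["priority"] == group}
        sorter = TopologicalSorter()
        for name in sorted(members):
            deps = [d for d in vms[name].get("depends_on", []) if d in members]
            sorter.add(name, *deps)
        # static_order() raises graphlib.CycleError on circular dependencies
        order.extend(sorter.static_order())
    return order


# Option 1: PDCE alone in Priority Group 1, other DCs in Group 2
plan_groups = {
    "pdce": {"priority": 1},
    "dc2": {"priority": 2},
    "dc3": {"priority": 2},
    "app01": {"priority": 3},
}

# Option 2: all DCs in Priority Group 1, non-PDCE DCs depend on the PDCE
plan_deps = {
    "pdce": {"priority": 1},
    "dc2": {"priority": 1, "depends_on": ["pdce"]},
    "dc3": {"priority": 1, "depends_on": ["pdce"]},
    "app01": {"priority": 2},
}

print(boot_order(plan_groups))
print(boot_order(plan_deps))
```

Either way the PDCE comes up before the other DCs. The second option just frees up a Priority Group for the rest of your VMs.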
Active Directory with Standalone DNS
This is desirable because it avoids the possible race condition, but it really depends on your environment. Your SRM Recovery Plan will be similar to the AD-integrated DNS one, but you will need to add an additional step or dependency:
- Place your DNS server & PDCE in Priority Group 1, set the PDCE to depend on the DNS server, then the rest of the DCs in Priority Group 2
- Place all of your DCs in Priority Group 1 and set all of the non-PDCE DCs to depend on the PDCE, and the PDCE to depend on the DNS server
Active Directory in both sites
Now this is truly the best way to handle DR with Active Directory: let AD do its native replication across your sites. If you lose your protected site, your recovery site already has AD running before you ever hit the Recovery button in SRM. I don’t need my Domain Controllers in my Recovery Plans now, right? Wrong, well, maybe, maybe not. It really depends on how you want to do it. The main issue is getting your test VMs to authenticate to AD, but NOT to your Production Active Directory. There are basically two ways to test your recovery plan:
Including Active Directory Domain Controllers in Recovery Plans
- Maintain two different Recovery Plans; one for testing, one for DR
- Requires a separate Protection Group with Active Directory Domain Controllers to facilitate running DR Recovery Plans without the DCs
- Testing recovery plan includes your DCs from the Protected site
- Be certain you’re testing in a network that CANNOT talk to the production DCs
- Your DR Recovery Plan should NOT include your DCs, and the VMs should land on your production network
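One way to sanity-check the ‘CANNOT talk to the production DCs’ point is to run a quick probe from a VM inside the test network before you kick off the test. Below is a rough Python sketch of that idea, not a complete isolation test: it only tries TCP connections to a handful of well-known AD ports, and the host names in the usage comment are placeholders for your own production DCs.

```python
import socket

# Well-known TCP ports a domain member uses to reach a Domain Controller
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unroutable, or name not resolved
        return False


def find_leaks(production_dcs, ports=AD_TCP_PORTS, timeout=2.0):
    """List (dc, port, service) tuples that ARE reachable.

    Run from inside the test network; an empty list is what you want.
    """
    return [
        (dc, port, service)
        for dc in production_dcs
        for port, service in ports.items()
        if is_reachable(dc, port, timeout)
    ]


# Example usage (placeholder host names, run from a VM in the test network):
#   leaks = find_leaks(["prod-dc01.example.local", "prod-dc02.example.local"])
#   if leaks:
#       print("Test network is NOT isolated:", leaks)
```

An empty result only tells you these TCP ports were unreachable at that moment. It does not replace checking the VLAN/port group configuration itself.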
Leaving Active Directory Domain Controllers out of Recovery Plans
- Only need one Recovery Plan with your protected VMs
- Running a test requires you to clone a Global Catalog DC into your test network, then testing your Recovery Plan in that same test network
- Running a DR failover is no different from the test, except you don’t need to clone a DC
So why is testing different than a failover?
Microsoft actually recommends cloning Domain Controllers into the test environment, then destroying them when done. When you’re running a test, you have duplicate computer accounts, and possibly duplicate Domain Controllers, on the network. This poses a problem because if your test VM changes its computer account password in your production AD, that change can replicate, and your production VM now has a bad password. There’s also an issue where AD tries to replicate with a DC that is a test DC, which can cause a USN rollback.
This is why testing SRM with Active Directory is very touchy. You need to make sure you have everything in order, and test, TEST, TEST!
Feel free to comment or email me, this is a touchy subject, so I’m very open to discussion and interested in what others are doing here.
Watch for Part 2!
I would suggest another way to do accurate tests that need AD. Create an AD DC at the recovery side, and have a script that runs once per day, turns off that DC, cold clones it, and puts the clone on a private VLAN (if necessary, it will first delete the already existing clone). Then, during the recovery test, use the same private VLAN. Now you have the DC you need for your testing, and it will not impact production at all. Plus, if you need a new DC with every test, that can easily be triggered during the recovery process. I have helped customers do this with success and no issues at all.
Thanks for the good info, Michael! I’ve thought about doing this same thing, but was trying to keep the post a reasonable length.
This is something I can put in part 2.
Keep the discussion going!
For testing you should have a networking bubble on the recovery site to test in. That bubble would allow you to fire up systems without affecting production systems, since they can’t talk to anything outside of the bubble. Then everything can run without changing IPs, and if you have your DCs/DNS/DHCP in the SRM plan you should get all those services working in the bubble. At least that’s my wish for testing.
Ed,
One limitation of the bubble networks (the “test” networks created when SRM runs the Recovery Plan in test mode) is that physical NICs are not assigned to the port groups, so VMs will never communicate outside the ESXi host. I do agree that you should create a separate test “bubble” network that’s pre-staged with physical network connectivity (VXLAN or regular VLAN), which is what I meant in the article by a “test network”.
Here’s part 2, for reference: http://thephuck.com/virtualization/vmware-site-recovery-manager-active-directory-part-2-domain-controllers-in-test-environment/