Disaster Recovery with VMware Site Recovery Manager (SRM)

No matter how resilient the infrastructure, disasters will happen! They are an inevitable feature of life and any Systems Administrator worth their salt will be prepared to deal with them whatever the root cause.

Today, customer service expectations are sky high, and customers' tolerance for extended unavailability or unreliable, low-quality service is nil. Disruptions lasting mere minutes may be negligible depending on the service characteristics, but lengthy downtime can be fatal for businesses, brand reputation and your job prospects. In a virtualized environment, one way to mitigate the risk and ensure speedy service resumption is near-real-time data replication to an alternate remote site.

However, even with storage replication in place, manual tasks such as editing configuration files, changing network addresses and renaming systems can still slow down the recovery operation. Solutions like Zerto and Vision Solutions' Double-Take exist for server-based replication and tackle this problem, but they still perform replication and recovery at the application level and are only suitable for a small group of systems. Recovering the whole data center and all its constituent virtual switches, storage and servers requires a little more logic. This is especially the case when the remote site cannot be made identical to the primary, or when dependencies between groups of systems require a specific startup and shutdown order. Large-scale recovery needs additional logic and automation to handle these scenarios properly, and this is where VMware's Site Recovery Manager (SRM) comes into play. It brings automation to disaster recovery for VMware-based data centers and virtual infrastructures.

Installation

SRM is an add-on Windows based application that can be installed either on an existing vCenter server or as a standalone server that connects to your vCenter management nodes. One important thing to note is that you will need a single instance of the SRM application deployed at each site.

Up-to-date system requirements are located in VMware's documentation center. In summary, SRM needs a fairly recent version of the Windows Server operating system, deployed as a virtual machine (VM) or on physical hardware (rare these days). SRM also requires a relational database. Compatible databases are Microsoft SQL Server, Oracle RDBMS or PostgreSQL. Another thing to note is that the name of the database schema and the name of the SRM database user should match.

The connection to the database is made using the ODBC protocol, and it supports either a 64-bit or 32-bit Data Source Name (DSN). Instructions for setting up the required DSN on Windows can be found here. The installation process is your typical Windows wizard and basically consists of plugging in values for the database as well as the credentials and ports for the vCenter server. As it is fairly trivial, I will not go through it step by step here.

After the application is successfully deployed and connected to the vCenter server you will need to activate the plugin for the Desktop Client by navigating to Plug-ins.

VI Client Manage Plugins

Select the SRM plug-in from the list and install it.

Install SRM Plug-in

To launch the SRM interface from the Desktop Client navigate to View -> Solutions and Applications -> Site Recovery.

Site Recovery Interface

If you are running vSphere 6, installing SRM instead adds a new entry to the Navigator pane on the left-hand side of the Web Client.

VMware Webclient

Pair Sites

The first action upon logging into SRM is to pair the production and recovery sites in the application. Sites are paired by connecting vCenters between locations and then mapping the corresponding resources (datastore locations, networks, clusters, resource pools) between each cluster that will be recovered at a remote location.

Pair Sites
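To make the mapping idea concrete, here is a minimal Python sketch of the kind of check worth doing once the sites are paired: every protected-site object should have a counterpart at the recovery site. This does not use any SRM API. The inventory names, the dictionary layout and the unmapped_resources helper are all hypothetical and exist purely for illustration.

```python
# Hypothetical protected-site inventory; these names are made up for the example.
protected_inventory = {
    "network": ["VM Network", "DMZ"],
    "resource_pool": ["Prod-Pool"],
    "cluster": ["Prod-Cluster-A", "Prod-Cluster-B"],
}

# Hypothetical mappings from protected-site objects to recovery-site objects.
mappings = {
    "network": {"VM Network": "DR VM Network", "DMZ": "DR DMZ"},
    "resource_pool": {"Prod-Pool": "DR-Pool"},
    "cluster": {"Prod-Cluster-A": "DR-Cluster"},
}


def unmapped_resources(inventory, mappings):
    """Return {kind: [names]} for protected-site objects with no recovery-site mapping."""
    gaps = {}
    for kind, names in inventory.items():
        known = mappings.get(kind, {})
        missing = [name for name in names if name not in known]
        if missing:
            gaps[kind] = missing
    return gaps


print(unmapped_resources(protected_inventory, mappings))
# → {'cluster': ['Prod-Cluster-B']}
```

Anything reported here is a resource whose VMs would have nowhere to land during a failover, so it is worth closing those gaps before moving on.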

Storage Replication

The key to any infrastructure recovery is having access at the secondary site to the production data. Therefore, replication at the storage level is the pillar on which everything rests. To get data over to the other site there are essentially two choices with SRM, Array Based Replication and vSphere Replication.

vSphere Replication performs the data copy at the hypervisor level using the vCenter server as a proxy. In general this has higher latency than replicating directly at the storage array layer. This method can also degrade performance on the vCenter server and hypervisors because of the frequency of I/O interrupts it needs. My recommendation, and frankly VMware's too, is to stick to Array Based Replication and use vSphere Replication only in smaller environments.

For enterprise architectures, a more robust and efficient solution is to perform all replication at the SAN layer. This requires that the array/SAN has replication features (standard on any enterprise-grade disk array hardware) and that the vendor supports what is called a Storage Replication Adapter (SRA). An SRA is a small driver developed by the storage array vendor that allows SRM to communicate with the SAN. The full list of compatible SRAs and storage vendor partners can be found on the SRM Compatibility Matrix.

The SRA should be installed on the same server as SRM. Once installed it needs to be configured and the managed arrays added. To configure the array select the Add Array Manager link.

Add Array Manager

The beauty of array replication is that once the array is added, no further configuration needs to be performed from SRM. SRM will automatically detect any array pairing, replication groups and policies created by the SAN administrator.

Protection Groups

Protection groups are the logical groupings between the datastores and the virtual machines that reside on them. These groups when defined determine which VM datastores are replicated on an ongoing basis and also which VMs will be protected in the event of a disaster. They are the building blocks of Recovery Plans in SRM.

To define a protection group select the Protection Groups tab.

Create Protection Groups

The wizard will guide you through selecting the primary site and then a list of the datastores that are being replicated from the storage level will be shown. Select all the datastores that contain VMs that require protection.

Protection Group Wizard
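The relationship between datastores, VMs and a protection group can be sketched as a small data structure. The following Python is only a model of the concept, not SRM's implementation. The Datastore class, the build_protection_group function and all names in the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Datastore:
    name: str
    replicated: bool  # True if the array replicates this datastore
    vms: list = field(default_factory=list)


def build_protection_group(name, datastores, selected):
    """Collect the VMs living on the selected datastores.

    Raises ValueError if a selected datastore is not replicated, since
    its VMs could not be recovered at the other site.
    """
    by_name = {ds.name: ds for ds in datastores}
    protected = []
    for ds_name in selected:
        ds = by_name[ds_name]
        if not ds.replicated:
            raise ValueError(f"{ds_name} is not replicated and cannot be protected")
        protected.extend(ds.vms)
    return {"name": name, "datastores": list(selected), "vms": sorted(protected)}


# Hypothetical sample inventory.
datastores = [
    Datastore("prod-ds01", True, ["web01", "app01"]),
    Datastore("prod-ds02", True, ["db01"]),
    Datastore("scratch-ds", False, ["test01"]),
]

group = build_protection_group("PG-Web", datastores, ["prod-ds01", "prod-ds02"])
print(group["vms"])
# → ['app01', 'db01', 'web01']
```

The point of the model is that protection follows the datastore: a VM is protected because of where it is stored, which is why VM placement matters when planning replication.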

Recovery Plans

The final step in configuring the recovery process is to define the Recovery Plan. Creating the plan is almost identical to the process for Protection Groups. The wizard can be launched from the Recovery Plan tab, which is below the Protection Group tab.

The exception here is that instead of datastores and VMs, it is the Protection Groups themselves and the networks that are mapped between sites. After a recovery plan is created, be sure to edit the virtual machines so that the IP addresses for the protected site and the recovery site are set as needed; if they are left blank, VMware will auto-assign them.
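When the recovery site uses a different subnet, it helps to work out the target addresses before typing them into each VM's settings. Below is a small Python sketch that keeps the host portion of an address and swaps the network portion. The map_ip function and the subnets are hypothetical examples, and the approach assumes both subnets are the same size.

```python
import ipaddress


def map_ip(address, protected_net, recovery_net):
    """Translate an address into the recovery subnet, keeping its host offset."""
    src = ipaddress.ip_network(protected_net)
    dst = ipaddress.ip_network(recovery_net)
    ip = ipaddress.ip_address(address)
    if ip not in src:
        raise ValueError(f"{address} is not in {protected_net}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("protected and recovery subnets must be the same size")
    # Offset of the host within its subnet, reapplied to the recovery subnet.
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


print(map_ip("10.10.1.25", "10.10.1.0/24", "172.16.1.0/24"))
# → 172.16.1.25
```

Running this over the list of protected VMs gives a table of protected and recovery addresses that can be reviewed before anything is entered in SRM.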

Before you wrap up, manually edit the Recovery Steps to set the startup and shutdown priority, and add any custom scripts that need to be executed as part of the procedure. Finally, select the Test link to simulate the failover and report any errors that would arise.
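The ordering logic behind startup priorities and dependencies can be illustrated with a short Python sketch. This is a model of the idea only, not how SRM computes its steps. The VM names, the priority numbers and the startup_order function are hypothetical, and the sketch assumes Python 3.9 or later for the standard library graphlib module. Lower priority numbers start first, and within a priority a VM starts after the VMs it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical VMs. depends_on only has an effect between VMs of the same priority.
vms = {
    "db01": {"priority": 1, "depends_on": []},
    "app01": {"priority": 2, "depends_on": []},
    "web01": {"priority": 3, "depends_on": []},
    "web02": {"priority": 3, "depends_on": ["web01"]},
}


def startup_order(vms):
    """Order VMs by priority group, then by dependencies inside each group."""
    order = []
    for priority in sorted({spec["priority"] for spec in vms.values()}):
        group = {n: s for n, s in vms.items() if s["priority"] == priority}
        # Keep only dependencies that point at VMs in the same priority group.
        graph = {
            name: {dep for dep in spec["depends_on"] if dep in group}
            for name, spec in group.items()
        }
        order.extend(TopologicalSorter(graph).static_order())
    return order


order = startup_order(vms)
print("startup: ", order)
print("shutdown:", list(reversed(order)))
```

Shutdown here is simply the reverse of startup, which is a reasonable default: front-end systems go down first and databases last. A circular dependency inside a group makes TopologicalSorter raise a CycleError, which is a useful early warning that the plan cannot be ordered.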

There are certainly many permutations but the above gives you an overview of the basic steps that will be performed each time. I find this software to be fairly straightforward to utilize. It is a powerful tool to have in your contingency arsenal that can limit the impact of future incidents when they occur.