
SRM and planned migrations

Be aware: this post has nothing to do with automation or orchestration. It is only about VMware Site Recovery Manager (SRM) and a fix for a problem we hit during a planned migration. Maybe the post is useful for someone else who runs into the same problem.

Last weekend I supported a customer who had to power down one of his two datacenters. For this, we had to migrate the virtual machines from one DC to the other, and at the end all machines had to be migrated back. From my point of view, this should have been an easy thing because the customer had an SRM implementation. The storage is served by an IBM V7000 and the LUNs are replicated between both datacenters. The customer had built the recovery plans and tested them before the migration was due.

So I expected an easy migration.

After everything was cleared and the users were at home, we started a “Planned Migration” from Datacenter 1 (DC1) to Datacenter 2 (DC2). This was quite easy, and at the end we created our “Failback Plan” with SRM.

Now it was time for us to have a drink and wait for the power and air-conditioning guys to finish their jobs.

After a few hours it was time for us to migrate the VMs back to DC1.

The customer had created a recovery plan for his two different clusters. The first cluster held only the normal VMs. The second cluster held the database VMs.

So we started the plan, and the VMs from the first cluster failed back without any problems. Unfortunately, the VMs from the database cluster could not be relocated.

We got the error: No host with hardware version ‘9’ and datastore ‘snap-ef732565ae’ which are powered on and not in maintenance mode are available.

So we checked the vSphere client: the hosts were online (host version 5.5), and the datastore named ‘snap-ef732565ae’ was also present.
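Read literally, the error message names four conditions a host has to meet: it is powered on, it is not in maintenance mode, it supports the VM hardware version, and it can see the datastore. What we checked by hand in the vSphere client can be written down as a small filter. This is only a sketch of the logic in the message, not how SRM does it internally. The `Host` class, its fields, and the host names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical, simplified view of an ESXi host (not an SRM or vSphere API type).
    name: str
    powered_on: bool
    in_maintenance_mode: bool
    max_hw_version: int  # highest VM hardware version the host can run
    datastores: set = field(default_factory=set)


def eligible_hosts(hosts, hw_version, datastore):
    """Return the hosts that meet every condition named in the error text."""
    return [
        h for h in hosts
        if h.powered_on
        and not h.in_maintenance_mode
        and h.max_hw_version >= hw_version
        and datastore in h.datastores
    ]


# Made-up inventory: ESXi 5.5 hosts support hardware versions up to 10.
hosts = [
    Host("esx-db-01", True, False, 10, {"snap-ef732565ae"}),
    Host("esx-db-02", True, True, 10, {"snap-ef732565ae"}),   # in maintenance mode
    Host("esx-db-03", True, False, 10, set()),                # datastore not mounted
]

for h in eligible_hosts(hosts, 9, "snap-ef732565ae"):
    print(h.name)
```

In our case every host looked like the first entry in this list, which is exactly why the error made no sense to us.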

Really strange. A quick search on the web led to this VMware documentation (http://pubs.vmware.com/srm-55/index.jsp#com.vmware.srm.admin.doc/GUID-FE6A85EC-B44E-415A-9C5F-1E17BC846119.html), where the problem is described along with the solution: wait 15 minutes before the next try, because SRM has cached some old information. So we took a coffee break, and after 20 minutes we started the next try. Unfortunately, we had the same problem.

I tried to figure out the problem from the logs, but I couldn't find anything that pointed to the error.

So we tried a lot of things to finish our “Planned Migration”, and every try took a lot of time. One of the last things we did was to restart all ESX hosts of the DB cluster. After all hosts were online again, we made the next try and “voilà”, it worked.

In the last week I did a lot of research into this behavior. I figured out that I can reproduce the error when I power up the ESXi servers quickly, without any delay. So from my point of view it seems to be a “communication” problem between the ESX hosts of a cluster.
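If the missing delay between power-ons really is the trigger (that is my assumption, I have no confirmation from VMware), one way to avoid it is to plan the power-up with a fixed gap per host. The following is only a sketch of such a schedule. The host names, the start time, and the 5-minute gap are invented for the example and are not tested values.

```python
from datetime import datetime, timedelta


def staggered_schedule(hosts, start, delay_minutes):
    """Pair each host with a power-on time, spaced delay_minutes apart."""
    step = timedelta(minutes=delay_minutes)
    return [(host, start + i * step) for i, host in enumerate(hosts)]


# Hypothetical host names, start time, and gap (not from the customer environment).
plan = staggered_schedule(
    ["esx-db-01", "esx-db-02", "esx-db-03"],
    datetime(2015, 1, 1, 18, 0),
    delay_minutes=5,
)

for host, when in plan:
    print(f"{when:%H:%M}  power on {host}")
```

The function only calculates the times. Powering on the hosts is still done by hand or with whatever tooling is in place.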

 

So if anyone has the same problem, try a reboot of the ESX hosts.