Virtualization Pro

Mar 24 2009   6:14PM GMT

VMware HA failure got you down?

By Edward Haletky (Texiwill)

Consider yourself lucky if you’ve never gotten the VMware HA message "An error occurred during configuration of the HA Agent on the host." But if you have, you may know that there are very few ways to fix the error. Here is a method that worked for me.

Current methods

The current methods of troubleshooting this issue involve checking that DNS is working properly, checking that the FT_HOSTS file in /etc/opt/vmware/aam is properly written for the hosts involved in your VMware Cluster, and disabling and re-enabling VMware HA within the VMware Cluster.

New method

The new method assumes that the VMware HA configuration is somehow at fault. I began to think this was the case when I noticed that the /opt/vmware/aam/ha/VMap process was not terminating on a reset of VMware HA. This process, as seen from the output of ps ax issued from the service console command line interface, should not exist when VMware HA is disabled. However, in my configuration it did exist. I also noticed I had problems reestablishing VMware HA after a recent reboot of a server caused by a faulty UPS. DNS was working, FT_HOSTS looked correct, and disabling and re-enabling VMware HA did no good.

Here are the steps that I followed to fix it:

  1. Log in to the service console of each problem host and make sure the VMware HA agent is stopped using: service vmware-aam stop
  2. Ensure there are no VMware HA processes running by using: ps ax | grep aam | grep -v grep
  3. If processes exist, kill them using the Process ID returned by the previous command (first column) as the PID: kill -9 PID
  4. Issue the following command via the service console, including the parentheses: (cd /etc/opt/vmware/aam; mkdir .old; mv * .old; mv .[a-z]* .old)
  5. Using the Virtual Infrastructure Client click on the Host, then the Summary tab, and then Reconfigure for VMware HA.
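
Steps 2 and 4 above can be sketched as a pair of small shell functions. This is only an illustration, not a tested VMware procedure: the function names list_aam_processes and archive_aam_config are hypothetical, and the directory is passed in as an argument rather than hard-coded. Unlike the one-liner in step 4, the loop skips the .old directory explicitly, so mv does not complain about moving .old into itself.

```shell
#!/bin/sh
# Sketch only; both function names are hypothetical helpers.

# Step 2: list any leftover HA (aam) processes; prints nothing if none remain.
list_aam_processes() {
    ps ax | grep aam | grep -v grep
}

# Step 4: move everything in the given directory, dotfiles included,
# into a .old subdirectory of that same directory.
archive_aam_config() {
    dir="$1"
    backup="$dir/.old"
    mkdir -p "$backup" || return 1
    for entry in "$dir"/* "$dir"/.[!.]*; do
        [ -e "$entry" ] || continue           # glob pattern matched nothing
        [ "$entry" = "$backup" ] && continue  # never move .old into itself
        mv "$entry" "$backup"/ || return 1
    done
    return 0
}
```

With the agent stopped and list_aam_processes printing nothing, the call would be archive_aam_config /etc/opt/vmware/aam, followed by Reconfigure for VMware HA from the client as in step 5.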

Voilà, VMware HA restarts and works properly! This solution may be seen as overkill, as it forces VMware HA to recreate all of its configuration files. I might have been able to just remove the .vmware_fdport file and then Reconfigure for VMware HA, but I did not try that option. I bring this possibility up because that file is NOT present on my now-running VMware HA-enabled hosts.

Now I have what looks to be a foolproof way to get VMware HA to start back up and protect my investment.

3 Comments on this Post

  • DownTownMTB
    Thanks for the post! Are your hosts on the same subnet? What happened prior to HA failing, an upgrade? What version of ESX/VC are you running? With ESX 3.5 U2 there is a new feature that has bitten us, which forces all hosts to be on the same VLAN subnet. We have had to re-architect our ESX host networking to accommodate this feature. We may be unique in our setup.
  • Texmansru47
    Hello, ESX v3.5 U3 w/VC 2.5 U4. Service consoles/VC are also on the same subnet. My issue was mostly a data corruption within AAM, but that is also very good information!
  • CJGarrett
    Hi Ed, You could just disable and re-enable HA for the cluster. I've noticed when doing this, that VMware removes the whole aam directory. Back with 3.01 I had one cluster where we could not get HA back up after a motherboard replacement. Hosts files, DNS all OK. Eventually we had to de-populate the cluster host by host and create a new cluster object. Enabled HA on the new cluster and everything worked fine. Go figure? Cheers, Chris