A little-known feature called Virtual Machine Failure Monitoring (VMFM) was introduced in ESX 3.5. VMFM leverages HA to monitor VMs for operating system failures, such as blue screens, and automatically restarts them. Previously, HA dealt only with ESX host failures, automatically restarting VMs on alternate ESX hosts when the host server had a problem.
VMFM extends HA to monitor VMs through a heartbeat that VMware Tools sends every second. This new feature is disabled by default and is considered ‘experimental’ by VMware, which typically means it works but is not yet officially supported for production use. For this feature to function properly, you must first ensure the following conditions are met:
• ESX hosts are version 3.5
• VirtualCenter is version 2.5
• VMware Tools is installed on VMs and is the latest version
• You have a Cluster configured and HA enabled
To enable it, follow the steps below:
• Edit the Settings for your Cluster
• Choose VMware HA and click the Advanced Options button
• Add the following Options and Values
das.vmFailoverEnabled – true (true or false)
das.FailureInterval – 30 (declare virtual machine failure if no heartbeat is received for the specified number of seconds)
das.minUptime – 120 (After a virtual machine has been powered on, its heartbeats are allowed to stabilize for the specified number of seconds. This time should include the guest operating system boot time)
das.maxFailures – 2 (Maximum number of failures and automated resets allowed within the time that das.maxFailureWindow specifies. If das.maxFailureWindow is -1 (no window), das.maxFailures represents the absolute number of failures after which the automated response stops and further investigation is necessary)
das.maxFailureWindow – 86400 (Either -1 or a value in seconds. If das.maxFailures is set to a number and that many automated resets have occurred within the specified failure window, automated restarts stop and further investigation is necessary)
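To make the interaction between these options easier to follow, here is a minimal Python sketch of the decision they describe. It is only a model built from the option descriptions above, not VMware's actual implementation; the names MonitorSettings and should_reset are made up for illustration, and all times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class MonitorSettings:
    # Defaults mirror the example values listed above.
    failure_interval: int = 30        # das.FailureInterval
    min_uptime: int = 120             # das.minUptime
    max_failures: int = 2             # das.maxFailures
    max_failure_window: int = 86400   # das.maxFailureWindow (-1 = no window)


def should_reset(now, powered_on_at, last_heartbeat_at, past_resets, settings):
    """Return True if this model would reset the VM at time `now`.

    Hypothetical helper: `past_resets` is a list of times (in seconds)
    at which earlier automated resets happened.
    """
    # Heartbeats are allowed to stabilize for min_uptime after power-on.
    if now - powered_on_at < settings.min_uptime:
        return False

    # The VM counts as healthy while the last heartbeat is recent enough.
    if now - last_heartbeat_at < settings.failure_interval:
        return False

    # Count earlier resets, either overall (-1) or inside the window.
    if settings.max_failure_window == -1:
        recent = len(past_resets)
    else:
        recent = sum(
            1 for t in past_resets if now - t <= settings.max_failure_window
        )

    # Once max_failures resets have been used up, automation stops.
    return recent < settings.max_failures
```

With the example values, a VM that has been up for more than 120 seconds and silent for 30 seconds is reset, unless it has already been reset twice in the last 86400 seconds (24 hours).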
I enabled this on a Cluster and tested it by simulating a blue screen on a VM running Windows 2003 Server, and it worked perfectly. After 30 seconds the loss of heartbeat was detected and the VM was automatically restarted. Currently no notification alerts can be configured for this, and if you check the events for the VM you will see no evidence that a reset happened. The only mention of it that I found was in the hostd log on the ESX server ([2008-06-26 11:47:22.552 'ha-eventmgr' 3076440992 info] Event 101 : VM1 on Esx1.xyz.com in ha-datacenter is reset). Hopefully this will change in a later version when the feature is no longer considered ‘experimental’. You can read more about this new feature in a white paper that VMware has provided on it.
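Since the hostd log is the only place these resets show up, a small script can pull them out until proper alerts exist. The sketch below is an assumption-heavy example: the regular expression is based solely on the single log line quoted above, so other hostd versions may format the entry differently, and find_resets is a made-up helper name. It takes log lines as input rather than assuming where the log file lives.

```python
import re
from datetime import datetime

# Pattern derived from the one sample line quoted above (an assumption,
# not a documented log format).
RESET_EVENT = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"'ha-eventmgr' \d+ \w+\] "
    r"Event \d+ : (?P<vm>.+?) on (?P<host>\S+) in \S+ is reset"
)


def find_resets(lines):
    """Return (timestamp, vm_name, host_name) for each reset event found."""
    resets = []
    for line in lines:
        match = RESET_EVENT.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            resets.append((ts, match.group("vm"), match.group("host")))
    return resets


if __name__ == "__main__":
    sample = (
        "[2008-06-26 11:47:22.552 'ha-eventmgr' 3076440992 info] "
        "Event 101 : VM1 on Esx1.xyz.com in ha-datacenter is reset"
    )
    for ts, vm, host in find_resets([sample]):
        print(f"{ts} {vm} was reset on {host}")
```

To use it against a real log, pass an open file object to find_resets, since iterating over a file yields its lines.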