Have you ever had VMs mysteriously restart without knowing why? Even after checking the VM’s vmware.log file and the event log in vCenter Server, you could find no evidence of the restart or of any potential problems. Well, the Virtual Machine Monitoring (VMM) feature could be the culprit. VMM extends the High Availability (HA) feature to detect guest OS failures by monitoring a heartbeat provided by VMware Tools. If the guest OS fails (e.g. a Windows blue screen of death), the heartbeat stops and the VM is restarted on the same host. While this is a great feature to have, it can also be troublesome and cause VMs to restart even when there is nothing wrong with the guest OS.
I recently experienced this problem firsthand when VMs were mysteriously resetting even though nothing was wrong with them. The culprit? VMM. As mentioned, this can be difficult to diagnose: the vCenter event log will not have an entry showing that VMM caused the VM to reset, virtual machine state alarms will not be triggered, and vmware.log will make no mention of the event. The only evidence that a VMM event took place will be in the /var/log/vmware/hostd.log file on the ESX host, and it will look similar to what follows.
[2009-03-20 04:44:35.252 'TaskManager' 3076453280 info] Task Created : haTask-512-vim.VirtualMachine.reset-47992
[2009-03-20 04:44:35.323 'ha-eventmgr' 3076453280 info] Event 8420 : Win2003-1 on esx1.xyz.com in ha-datacenter is reset
[2009-03-20 04:44:35.323 'vm:/vmfs/volumes/48331160-05c64c5c-edf0-001e0bd8c708/Win2003-1/Win2003-1.vmx' 3076453280 info] State Transition (VM_STATE_ON -> VM_STATE_RESETTING)
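If you have a copy of hostd.log to search, entries like these can be picked out with a short script. What follows is only a minimal sketch in Python: the function name find_reset_entries and the marker strings are my own choices, taken from the sample entries above, and the entry format is assumed to match that sample. Note that it finds reset entries in general; it cannot tell by itself whether VMM or an administrator triggered the reset, so you still need to correlate the timestamps with the absence of a matching vCenter event.

```python
import re
from typing import List, Tuple

# One hostd.log entry, assumed to look like the sample above:
#   [timestamp 'source' thread-id level] message
# The message runs until the next "[YYYY-MM-DD " header or the end of the text,
# so the pattern works whether entries are on separate lines or run together.
ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"'(?P<source>[^']*)' \d+ \w+\] "
    r"(?P<msg>.*?)(?=\s*\[\d{4}-\d{2}-\d{2} |\Z)",
    re.DOTALL,
)

# Substrings taken from the sample entries; treat them as assumptions.
RESET_MARKERS = (
    "vim.VirtualMachine.reset",
    "VM_STATE_ON -> VM_STATE_RESETTING",
    " is reset",
)


def find_reset_entries(text: str) -> List[Tuple[str, str]]:
    """Return (timestamp, message) for every entry that mentions a VM reset."""
    hits = []
    for match in ENTRY.finditer(text):
        message = match.group("msg").strip()
        if any(marker in message for marker in RESET_MARKERS):
            hits.append((match.group("ts"), message))
    return hits
```

To use it, read the log into a string (for example with open("hostd.log").read() on a copy of the file) and print each timestamp and message that find_reset_entries returns.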
Since this feature relies on monitoring heartbeats through VMware Tools, certain events can cause the heartbeat to stop for longer than the configured failure interval, thereby triggering a false positive and resetting the VM. One example is an upgrade of VMware Tools on a VM, which causes heartbeats to stop temporarily while the upgrade is in progress. For this reason VMware changed the VMM feature in vCenter Server 2.5 Update 4 to also monitor the VM’s disk and network activity. Now, even if no heartbeats are received within the failure interval, the VM is not reset unless there has also been no disk or network activity for a predetermined I/O stats interval. A guest OS that is truly locked up will typically show no disk or network activity in addition to the loss of heartbeat. This added level of monitoring will help eliminate false positives and make this feature even better. Additionally, you can raise the failure detection interval above the default of 30 seconds. This setting is located in the HA advanced options section.
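The reset decision described above can be sketched in a few lines of Python. This is only an illustration of the logic as I understand it, not VMware's implementation: the names VmActivity and should_reset are hypothetical, and the 120-second I/O stats interval is a placeholder I picked for the example, since only the 30-second failure interval default is stated above.

```python
from dataclasses import dataclass


@dataclass
class VmActivity:
    """What the monitor has observed for one VM (hypothetical structure)."""
    seconds_since_heartbeat: float  # time since the last VMware Tools heartbeat
    seconds_since_io: float         # time since the last disk or network activity


def should_reset(
    vm: VmActivity,
    failure_interval: float = 30.0,    # default mentioned in the text
    io_stats_interval: float = 120.0,  # assumed placeholder value
) -> bool:
    """Reset only when heartbeats are lost AND the VM shows no I/O."""
    heartbeat_lost = vm.seconds_since_heartbeat > failure_interval
    io_idle = vm.seconds_since_io > io_stats_interval
    return heartbeat_lost and io_idle


# A VMware Tools upgrade: heartbeats pause, but the disk is still busy.
tools_upgrade = VmActivity(seconds_since_heartbeat=45, seconds_since_io=5)
# A locked-up guest: no heartbeats and no disk or network activity.
locked_up = VmActivity(seconds_since_heartbeat=45, seconds_since_io=300)
```

With these values, should_reset(tools_upgrade) is False and should_reset(locked_up) is True. Before the I/O check was added, the heartbeat test alone would have reset both VMs.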
So if you ever have a VM mysteriously reset itself and you have VMM enabled, be sure to check the /var/log/vmware/hostd.log file on the ESX host to see whether VMM caused it. It would be very helpful if VMware also logged these events in the vCenter Server events view.