Recently I experienced a VMware HA event in my environment which caused the VMs on the affected hosts to be restarted on other servers. While most of the VMs started OK, a few did not. When I tried to start them manually I received the error “Failed to power on VM – No swap file” and the VM would not start. What had happened is that several VMs were in a zombie-like state because they had not been shut down gracefully. Even though their status was displayed as shut down in the VMware Infrastructure Client (VI Client), each still had a process running on an ESX host that prevented it from being started.
In effect, while the VM’s OS was not running, the VM was still in a running state on an ESX host and had a .vswp file that could not be deleted. As a result, when another host tried to start the VM, the .vswp file could not be created because the original host still held a lock on it.
To resolve this situation I had to find the host that still had a running process for the VM and forcibly terminate that process. To do so I had to log in to the service console of each host and run the following command: ps auxfww | grep VM name (substituting the name of the VM). This command returns a list of running processes that contain the name of the VM.
When you run the ps command with the VM name you will normally get one result regardless of whether the VM is actually running on the host. This is because the grep command itself shows up in the result list, since the VM name appears in its own command line. However, if the VM is actually running on the host you will get two results instead of one. The second result is much longer, as it wraps over several lines of text, and it contains the path to the .vmx file of the VM. This second result also contains the process ID (pid) of the VM, which can be used to forcibly terminate it. The pid is located in the second column of the results, right after the username (typically root). As you can see in the example below, the first result with a pid of 25914 is the grep command itself and the second result with a pid of 23896 is the running VM.
[root@esx1 root]# ps auxfww | grep win2003-1
root 25914 0.0 0.2 3688 676 pts/0 S 13:17 0:00 \_ grep win2003-1
root 23896 0.0 0.2 2008 864 ? S< Feb13 4:12 /usr/lib/vmware/bin/vmkload_app /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user -# name=VMware ESX Server;version=3.5.0;licensename=VMware ESX Server;licenseversion=2.0 build-123630; -@ pipe=/tmp/vmhsdaemon-0/vmxd0af4bb011822fc5; /vmfs/volumes/442d541b-cb5a815d-6083-0017a4a9c074/win2003-1/win2003-1.vmx
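Because the grep command matches itself, counting results by eye is easy to get wrong. Here is a minimal sketch of a filter that prints only the pid of the real VM process. It assumes the ps auxfww column layout shown above (pid in the second column) and that the VM's .vmx file is named after the VM. The name vm_pids is a hypothetical helper of my own, not a VMware tool.

```shell
#!/bin/sh
# vm_pids is a hypothetical helper name, not part of ESX.
# Reads `ps auxfww`-style output on stdin and prints the pid (column 2)
# of every vmware-vmx process whose command line references <vm-name>.vmx.
vm_pids() {
    # index() does a plain substring match, so characters in the VM name
    # are never treated as a regular expression. The awk process itself
    # does not match because "<vm-name>.vmx" never appears contiguously
    # in its own command line.
    awk -v name="$1" '
        index($0, "vmware-vmx") && index($0, name ".vmx") { print $2 }
    '
}

# Usage on the service console (win2003-1 is the example VM from this post):
#   ps auxfww | vm_pids win2003-1
```

If nothing is printed, the VM has no process on that host and you can move on to the next one.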
Now that we know the pid of the VM (23896), we can forcibly terminate it by typing kill -9 23896. You can verify that the VM process has been terminated by running the ps command again; only one result, the grep command itself, should be returned. Now that the VM process has been stopped, you can power the VM on using the VI Client and you should have no problems this time.
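Since kill -9 gives the process no chance to clean up, a mistyped pid can take down something you did not intend. The following is a rough sketch of a guard that only sends the signal if the pid's command line mentions vmware-vmx. It assumes a procps-style ps that accepts -p and -o args=, which I have not checked on every ESX build. The name kill_if_vmx is hypothetical.

```shell
#!/bin/sh
# kill_if_vmx is a hypothetical helper name, not part of ESX.
# Sends SIGKILL to <pid> only if that process's command line mentions
# vmware-vmx; otherwise it refuses and returns a non-zero status.
kill_if_vmx() {
    pid=$1
    # -o args= prints just the command line with no header; -ww avoids
    # truncating it. If ps fails, args is empty and the guard refuses.
    args=$(ps -ww -p "$pid" -o args= 2>/dev/null)
    case $args in
        *vmware-vmx*)
            kill -9 "$pid"
            ;;
        *)
            echo "pid $pid does not look like a VM process; not killing" >&2
            return 1
            ;;
    esac
}

# Usage with the pid from the example in this post:
#   kill_if_vmx 23896
```

Afterwards, running the ps command again is still the way to confirm that the VM process is gone.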