We have a multi-cluster VMware environment consisting of HP BL680s, DL580s and DL585s that has been very stable for over a year, and we have made no recent changes to storage, switches, hosts or HBAs. Recently we experienced two incidents in which ESX hosts on different clusters disconnected from vCenter and from their shared EMC Clariion arrays (three separate frames). EMC said the Cisco fiber switch saw them as disconnected at the ESX host port.
The first incident did not happen all at once: one host disconnected, followed by other hosts over a period of two to three hours. In some cases both HBA paths lost connection to the SAN, and in others only one HBA disconnected. Rebooting an ESX host reestablished its connection to vCenter and to the SAN, but in some cases specific LUNs were still not accessible. VMware support found SCSI reservations on multiple hosts, and those hosts were all unable to see the same three LUNs. They had us trespass these LUNs, after which the hosts could access their data.
In the second incident, two days later, one ESX host (not involved in the previous incident) disconnected from vCenter but did not lose connection to the SAN. Within an hour two other hosts from the same cluster also disconnected from vCenter but not from the SAN. Two of the three hosts were rebooted and reconnected to vCenter. The third restored itself without rebooting. Again a specific LUN was inaccessible and did not appear to the host. The vmkernel logs on the affected hosts were showing SCSI reservations. The LUN was trespassed, after which we could browse the LUN but the VMs would not start. The LUN was then trespassed back, and the VMs were able to start and access the data.
VMware has recommended a firmware upgrade to our HP and Emulex HBAs, which they say resolved a similar issue in an environment similar to ours. However, they do not know what condition is causing the problem.
We are looking to hear from anyone with a similar experience who might have a handle on the root cause of this issue.
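For anyone who wants to compare notes on the reservation messages, here is a rough sketch of how the vmkernel logs from several hosts could be tallied to see which hosts were logging reservations. It is only an illustration: the search term `reservation`, the function names and the host names are assumptions, not something VMware or EMC gave us, and the exact wording of the log messages should be checked against your own logs.

```python
import re
from collections import Counter

# Assumption: lines of interest mention "reservation" somewhere in the text.
# Adjust this pattern to match the wording in your own vmkernel logs.
PATTERN = re.compile(r"reservation", re.IGNORECASE)


def read_log(path):
    """Read a log file into a list of lines, tolerating odd bytes."""
    with open(path, errors="replace") as fh:
        return fh.readlines()


def count_reservation_lines(lines_by_host):
    """Return {host: number of log lines that match PATTERN}."""
    counts = Counter()
    for host, lines in lines_by_host.items():
        counts[host] = sum(1 for line in lines if PATTERN.search(line))
    return dict(counts)
```

To use it, copy each host's vmkernel log to one machine, build a dictionary such as `{"esx01": read_log("esx01-vmkernel.log")}` (hypothetical names), and pass it to `count_reservation_lines`. Hosts with unusually high counts around the time of the disconnects would be the ones to look at first.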
Software/Hardware used: VMware ESX 4.0, HP HBA, Emulex HBA, HP BL680, DL580, DL585, EMC Clariion, Cisco SAN switch