Posted by: Eric Siebert
A bug in the latest versions of both VMware ESX and ESXi (3.5 Update 2) has affected many of VMware's customers, and VMware is asking its users to wait 36 hours for a patch.
As the date changed to August 12, 2008, customers found that they could no longer start virtual machines on their ESX hosts or vMotion them to other hosts.
A post about the bug on the VMware Technology Network (VMTN) community drew responses from many customers who were experiencing the same problem and had spent hours trying to figure out what was wrong. The cause was not immediately obvious because the only error displayed was a general system error; the actual error, which could be found only by going through the virtual machine log files, was that the product had expired. Many users contacted VMware support, which eventually realized it had a major issue on its hands.
Currently, the only workaround is to set the host clock back and then restart the virtual machines. However, this workaround is not acceptable for many customers who rely on accurate time for their systems and applications, as well as to satisfy compliance regulations. Virtual machines that are already running are not affected by this bug unless they are rebooted or powered off and back on.
The bug appears to be code left over from the beta version of ESX, designed to make the software stop working on a specific date after the beta ended. Software vendors commonly do this, and the practice is known as "time bombing": once a certain date has passed, the software stops working, forcing users to move to the final (gold) release instead of continuing to run the beta.
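To illustrate why a date-based time bomb behaves this way, and why rolling the clock back works around it, here is a minimal Python sketch. It is a hypothetical illustration, not VMware's actual code: the names `EXPIRY_DATE`, `product_expired` and `power_on_vm`, and the assumption that the date is checked only when a virtual machine starts, are chosen to mirror the behavior described above.

```python
from datetime import date

# Hypothetical cutoff, chosen to match the date the failures began.
# This is an illustrative value, not taken from VMware's code.
EXPIRY_DATE = date(2008, 8, 12)


def product_expired(today: date, expiry: date = EXPIRY_DATE) -> bool:
    """Return True once the host clock reaches the hard-coded expiry date."""
    return today >= expiry


def power_on_vm(name: str, today: date) -> str:
    """Hypothetical power-on path: the date is checked only when a VM starts,
    so VMs that are already running are never re-checked."""
    if product_expired(today):
        return f"{name}: failed (product has expired)"
    return f"{name}: powered on"


if __name__ == "__main__":
    print(power_on_vm("vm01", date(2008, 8, 11)))  # clock before the cutoff
    print(power_on_vm("vm01", date(2008, 8, 12)))  # clock on the cutoff date
```

Because the check compares only the host clock against a fixed date, setting the clock back to any date before the cutoff makes it pass again. That is exactly why the workaround succeeds, and also why it breaks anything that depends on accurate time.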
VMware has published a knowledge base article on this issue and promises to release a fix within 36 hours. For most customers this is not good enough: waiting 36 hours is far too long for a problem of this magnitude. They want an immediate fix that they can apply to their affected hosts. There is also concern about how the fix will be delivered. Presumably it will be released as a new build of ESX, which will require hosts to be taken offline while it is installed and they are rebooted.
Many customers posting to the VMTN thread have expressed anger and frustration at VMware. To make matters worse, VMware's knowledge base went offline shortly after the article was published, presumably because it could not handle the extraordinary number of requests.
It is hard to believe that a company the size of VMware could allow this to happen. A problem like this could not have been caught in beta testing, and it is arguably less a bug than negligence on VMware's part for not removing or disabling this code before the gold version was released. Most software companies have strict processes for developing, testing and performing quality assurance before releasing a new build. How this happened is anyone's guess right now, but it appears that either such processes do not exist or they were simply not followed.
In the meantime, customers continue to wait for VMware to release a fix. Given the severity of the problem and the number of customers affected, there will most likely be some fallout at VMware over this. VMware needs to assure customers that it is taking this very seriously and is committed to doing everything possible to ensure that it never happens again. With Hyper-V now a viable alternative, VMware can't afford major mistakes like this.