Virtualization Pro

May 20 2009   5:13PM GMT

VMware Fault Tolerance: What it is and how it works



Posted by: Eric Siebert
Tags: Eric Siebert, Fault Tolerance, VMware, vSphere

Fault Tolerance (FT) is a new feature in vSphere that takes VMware’s High Availability technology to the next level by providing continuous protection for a virtual machine (VM) in case of a host failure. It is based on the Record and Replay technology introduced with VMware Workstation, which lets you record a VM’s activity and play it back later.

The feature works by creating a secondary VM on another ESX host that shares the same virtual disk file as the primary VM. The CPU and virtual device inputs of the primary VM are recorded and transferred over an FT logging NIC to the secondary VM, which replays them so that it stays in sync with the primary and is ready to take over in case of a failure. While both the primary and secondary VMs receive the same inputs, only the primary VM produces output such as disk writes and network transmits. The secondary VM’s output is suppressed by the hypervisor, and the secondary is not on the network until it becomes the primary VM, so in effect both VMs function as a single VM.
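
The record-and-replay idea above can be illustrated with a toy model. This is only a sketch under stated assumptions, not VMware's implementation: `ToyVM`, `run_demo`, and the `log_channel` list (standing in for the FT logging NIC) are hypothetical names, and a "VM" here is just a deterministic state machine whose state depends only on the inputs it consumes.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """A deterministic toy 'VM': its state depends only on the inputs it consumes."""
    name: str
    is_primary: bool
    state: int = 0
    emitted: list = field(default_factory=list)

    def step(self, value: int) -> None:
        # Deterministic transition: the same inputs in the same order
        # always produce the same state.
        self.state = (self.state * 31 + value) % 1_000_003
        output = f"write:{self.state}"
        if self.is_primary:
            # Only the primary's output reaches disk or network.
            self.emitted.append(output)
        # The secondary computes the same output, but it is suppressed.

    def promote(self) -> None:
        # After a failover the secondary becomes the primary and starts emitting.
        self.is_primary = True


def run_demo(seed: int = 42, steps: int = 5):
    rng = random.Random(seed)
    primary = ToyVM("primary", is_primary=True)
    secondary = ToyVM("secondary", is_primary=False)
    log_channel = []  # hypothetical stand-in for the FT logging NIC

    for _ in range(steps):
        value = rng.randrange(256)       # an external input, e.g. a network packet
        log_channel.append(value)        # record on the primary side
        primary.step(value)
        secondary.step(log_channel[-1])  # replay on the secondary side

    return primary, secondary


primary, secondary = run_demo()
print(primary.state == secondary.state)              # both VMs hold the same state
print(len(primary.emitted), len(secondary.emitted))  # only the primary emitted output
```

Because the secondary already holds the same state, calling `secondary.promote()` is all the toy model needs for a takeover; its next `step` emits output as the primary did.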

FT can be used with any application, because the guest operating system and the applications running on it are completely unaware of FT. This new feature is included only in the Advanced, Enterprise and Enterprise Plus editions of vSphere. For protection against host failures, it can eliminate the need for VMware customers to use Microsoft Cluster Server (MSCS) to provide continuous availability for critical applications. In fact, VMware’s documentation states the following as a use case for FT:

Cases where high availability might be provided through MSCS, but MSCS is too complicated to configure and maintain.

While FT is a very useful feature, it does have some limitations and strict usage requirements. On the host side, it requires specific, newer processor models from AMD and Intel that support Lockstep technology. You might be wondering what Lockstep technology is. Simply put, Lockstep is a technique used to achieve high reliability in a system by using a second identical processor to monitor and verify the operation of the first processor. Both processors receive the same inputs, so their operating states are identical (they operate in “lockstep”), and their results are checked for discrepancies. If a discrepancy is found, the error is flagged and the system performs additional tests to see if a CPU is failing.
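
The lockstep check described above can be sketched in a few lines. This is a conceptual illustration only: the "processors" are plain Python functions, and `run_in_lockstep`, `healthy_cpu`, and `faulty_cpu` are hypothetical names invented for the example, not part of any hardware or VMware interface.

```python
def run_in_lockstep(cpu_a, cpu_b, inputs):
    """Feed identical inputs to two 'processors' and compare results step by step.

    Returns the index of the first discrepancy, or None if all results match.
    """
    for index, value in enumerate(inputs):
        if cpu_a(value) != cpu_b(value):
            # Flag the error; a real system would now run further tests.
            return index
    return None


def healthy_cpu(x: int) -> int:
    return x * x + 1


def faulty_cpu(x: int) -> int:
    # Hypothetical fault: a wrong result for one specific input.
    return x * x + 1 if x != 3 else 0


print(run_in_lockstep(healthy_cpu, healthy_cpu, [1, 2, 3, 4]))  # None
print(run_in_lockstep(healthy_cpu, faulty_cpu, [1, 2, 3, 4]))   # 2
```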

This technology is integrated into certain AMD and Intel CPUs, and the Fault Tolerance feature relies on it to sync the CPU operations of a VM between two hosts so they are in identical states (VMware calls it vLockstep). Compatible CPUs include the AMD Barcelona quad-core processors, first introduced in September 2007, and the Intel Harpertown family processors, first introduced in November 2007. The vSphere Availability Guide references a KB article (#1008027) on compatible processors that will presumably be published when vSphere reaches general availability (GA). More information on compatible processor models can be found at Eric Sloof’s NTPRO.NL blog and at Gabrie van Zanten’s blog, Gabe’s Virtual World. Below are the official requirements from VMware’s documentation:

Host requirements:

  • CPUs: Only recent HV-compatible processors (AMD Barcelona+, Intel Harpertown+), processors must be the same family
  • All hosts must be running the same build of VMware ESX
  • Storage: shared storage (FC, iSCSI, or NAS)
  • Hosts must be in an HA-enabled cluster
  • Network and storage redundancy to improve reliability: NIC teaming, storage multipathing
  • Separate VMotion NIC and FT logging NIC, each Gigabit Ethernet (10 Gb recommended). Hence, a minimum of 4 NICs (VMotion, FT logging, two for VM traffic/Service Console)
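
As a rough illustration, the host requirements above can be turned into a simple pre-check for a pair of hosts. This is a sketch under assumptions: the `Host` fields and the `host_pair_problems` function are hypothetical names, the values are entered by hand rather than read from vCenter, and it does not verify that the CPU model itself is recent enough.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # All fields are hypothetical names chosen for this sketch.
    name: str
    cpu_family: str
    esx_build: str
    has_shared_storage: bool
    in_ha_cluster: bool
    gigabit_nics: int


def host_pair_problems(a: Host, b: Host) -> list:
    """Return the reasons why hosts a and b cannot form an FT pair."""
    problems = []
    if a.cpu_family != b.cpu_family:
        problems.append("processors are not from the same family")
    if a.esx_build != b.esx_build:
        problems.append("hosts run different ESX builds")
    for host in (a, b):
        if not host.has_shared_storage:
            problems.append(f"{host.name}: no shared storage")
        if not host.in_ha_cluster:
            problems.append(f"{host.name}: not in an HA-enabled cluster")
        if host.gigabit_nics < 4:
            problems.append(f"{host.name}: fewer than 4 NICs")
    return problems


esx1 = Host("esx1", "Harpertown", "build-A", True, True, 4)
esx2 = Host("esx2", "Harpertown", "build-B", True, True, 2)
for problem in host_pair_problems(esx1, esx2):
    print(problem)
```

An empty list means the pair passes these particular checks; it says nothing about requirements the sketch does not model.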

VM requirements:

  • VMs must be single-processor (no vSMP)
  • All VM disks must be “thick” (fully-allocated)
  • No non-replayable devices (USB, sound, physical CD-ROM, physical floppy, physical RDMs)
  • Paravirtualization must be disabled (check guests such as Ubuntu Linux 7/8 and SUSE Linux 10, where it may be enabled by default)
  • All applications and guest OSes are supported, both 32-bit and 64-bit
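
The VM requirements above lend themselves to the same kind of checklist. Again, this is only a sketch: `VMConfig`, `vm_problems`, and the device labels in `NON_REPLAYABLE` are hypothetical names for illustration, and a real check would read these settings from the VM's configuration.

```python
from dataclasses import dataclass, field

# Hypothetical labels mirroring the non-replayable devices listed above.
NON_REPLAYABLE = {"usb", "sound", "physical-cdrom", "physical-floppy", "physical-rdm"}


@dataclass
class VMConfig:
    name: str
    vcpus: int
    thick_disks: list                      # one bool per virtual disk: True if thick
    devices: set = field(default_factory=set)
    paravirtualized: bool = False


def vm_problems(vm: VMConfig) -> list:
    """Return the reasons why this VM does not meet the FT requirements."""
    problems = []
    if vm.vcpus != 1:
        problems.append("VM must have exactly one vCPU (no vSMP)")
    if not all(vm.thick_disks):
        problems.append("all virtual disks must be thick")
    blocked = sorted(vm.devices & NON_REPLAYABLE)
    if blocked:
        problems.append("non-replayable devices: " + ", ".join(blocked))
    if vm.paravirtualized:
        problems.append("paravirtualization must be disabled")
    return problems


sql_vm = VMConfig("sql01", vcpus=2, thick_disks=[True, False], devices={"usb", "nic"})
print(vm_problems(sql_vm))
```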

One additional requirement that is not listed is that the CPU clock speeds of the two ESX hosts must be within 400 MHz of each other. This ensures that one CPU does not lag behind the other, so the two can keep up with each other and stay in sync. You can check whether the processors in your hosts support the FT feature by using the CPU Host Info utility that is covered in “VMware vSphere: Got 64-bit hardware?”. You can also read more about this new feature at the following links:

5  Comments on this Post

 
  • Jasonboche
    Very nice Eric. I learned something here.
  • Shaunmit
    Can you use FT to replace Microsoft Clustering? Can FT be started if the SQL agent is hung?
  • Eric Siebert
    You can, depending on how you are using MSCS. FT provides protection against a host failure but not an application failure.
  • GarrettM
    Ran the CPU Host Info utility against a HP DL785 G5 that has AMD 8360 SE which is a Barcelona chip. Says the host is not FT compatible... Hmmm... GarrettM
  • Eric Siebert
    Hi Garrett, do you have the virtualization features (i.e. AMD-V) enabled in the BIOS? In many cases they are disabled by default. You might also try the new SiteSurvey utility (http://itknowledgeexchange.techtarget.com/virtualization-pro/new-sitesurvey-utility-from-vmware-checks-for-fault-tolerance-compatibility/), which might be more accurate.
