Fault Tolerance (FT) is a new feature in vSphere that takes VMware’s High Availability technology to the next level by providing continuous protection for a virtual machine (VM) in case of a host failure. It is based on the Record and Replay technology introduced with VMware Workstation, which lets you record a VM’s activity and play it back later.
The feature works by creating a secondary VM on another ESX host that shares the same virtual disk file as the primary VM. The CPU and virtual device inputs are then transferred from the primary VM (record) to the secondary VM (replay) over an FT logging NIC, so the secondary stays in sync with the primary and is ready to take over if a failure occurs. While both the primary and secondary VMs receive the same inputs, only the primary VM produces output such as disk writes and network transmits. The secondary VM’s output is suppressed by the hypervisor, and the secondary is not on the network until it becomes the primary, so the two VMs essentially function as a single VM.
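The record/replay idea can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how the hypervisor implements FT: the `ToyVM` class, its state-update formula, and the in-memory `log` standing in for the FT logging NIC are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """A deterministic toy 'VM': its state depends only on the inputs it receives."""
    name: str
    is_primary: bool
    state: int = 0
    emitted: list = field(default_factory=list)  # outputs that actually left the VM

    def apply(self, event: int) -> None:
        # Deterministic transition: same inputs in the same order give the same state.
        self.state = (self.state * 31 + event) % 1_000_003
        output = f"write:{self.state}"
        if self.is_primary:
            self.emitted.append(output)  # only the primary produces visible output
        # On the secondary the output is computed but suppressed.


def run_with_logging(primary: ToyVM, secondary: ToyVM, events: list) -> list:
    """Feed events to the primary (record), then replay them on the secondary."""
    log = []
    for event in events:
        primary.apply(event)
        log.append(event)  # hypothetical stand-in for the FT logging channel
    for event in log:
        secondary.apply(event)
    return log


primary = ToyVM("primary", is_primary=True)
secondary = ToyVM("secondary", is_primary=False)
run_with_logging(primary, secondary, [7, 42, 3])

in_sync = primary.state == secondary.state  # both VMs saw identical inputs

# Simulated failover: the secondary is promoted and starts emitting output.
secondary.is_primary = True
secondary.apply(9)
```

Because the secondary replays exactly the inputs the primary recorded, it reaches the same state without ever having emitted output of its own, which is why it can take over without the outside world seeing duplicate writes.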
FT can be used with any application, because the guest operating system and the applications running on it are completely unaware of FT. This new feature is included only in the Advanced, Enterprise and Enterprise Plus editions of vSphere. It eliminates the need for VMware customers to use Microsoft Cluster Server (MSCS) to provide continuous availability for critical applications. In fact, VMware’s documentation lists the following as a use case for FT:
Cases where high availability might be provided through MSCS, but MSCS is too complicated to configure and maintain.
While FT is a very useful feature, it does have some limitations and strict usage requirements. On the host side, it requires specific, newer processor models from AMD and Intel that support Lockstep technology. You might be wondering what Lockstep technology is. Simply put, Lockstep is a technique for achieving high reliability in a system by using a second, identical processor to monitor and verify the operation of the first. Both processors receive the same inputs, so their operating states are identical (they operate in “lockstep”), and the results are checked for discrepancies. If a discrepancy is found, the error is flagged and the system performs additional tests to see if a CPU is failing.
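As a rough illustration of the lockstep idea, the sketch below runs two stand-in "processors" (ordinary Python functions) on the same inputs and reports the first point where their results diverge. The function names and the deliberately injected fault are hypothetical; real lockstep checking happens in hardware.

```python
from typing import Callable, Iterable, Optional


def run_lockstep(
    cpu_a: Callable[[int], int],
    cpu_b: Callable[[int], int],
    inputs: Iterable[int],
) -> Optional[int]:
    """Return the index of the first input where the results differ, or None."""
    for index, value in enumerate(inputs):
        if cpu_a(value) != cpu_b(value):
            return index  # discrepancy: flag it for further diagnosis
    return None


def healthy(x: int) -> int:
    return x * x + 1


def faulty(x: int) -> int:
    # Hypothetical fault: produces a wrong result for one specific input.
    return 0 if x == 3 else x * x + 1


no_fault = run_lockstep(healthy, healthy, [1, 2, 3, 4])
first_fault = run_lockstep(healthy, faulty, [1, 2, 3, 4])
```

Here `no_fault` is `None`, while `first_fault` is `2`, the position of the input `3` on which the two functions disagree.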
This technology is integrated into certain AMD and Intel CPUs and is what the Fault Tolerance feature relies on to sync the CPU operations of a VM between two hosts so they are in identical states (VMware calls it vLockstep). Supported processors include the AMD Barcelona quad-core processors, first introduced in September 2007, and the Intel Harpertown family processors, first introduced in November 2007. The vSphere Availability Guide references a KB article (#1008027) on compatible processors that will presumably be published when vSphere is GA. More information on compatible processor models can be found at Eric Sloof’s NTPRO.NL blog and at Gabrie van Zanten’s blog, Gabe’s Virtual World. Below are the official requirements from VMware’s documentation:
- CPUs: Only recent HV-compatible processors (AMD Barcelona+, Intel Harpertown+), processors must be the same family
- All hosts must be running the same build of VMware ESX
- Storage: shared storage (FC, iSCSI, or NAS)
- Hosts must be in an HA-enabled cluster
- Network and storage redundancy to improve reliability: NIC teaming, storage multipathing
- Separate VMotion NIC and FT logging NIC, each Gigabit Ethernet (10 Gigabit recommended). Hence, a minimum of 4 NICs (VMotion, FT logging, and two for VM traffic/Service Console)
- VMs must be single-processor (no vSMP)
- All VM disks must be “thick” (fully-allocated)
- No non-replayable devices (USB, sound, physical CD-ROM, physical floppy, physical RDMs)
- Make sure paravirtualization is not enabled (it may be enabled by default on Ubuntu Linux 7/8 and SUSE Linux 10)
- All applications and guest OSes are supported, both 32-bit and 64-bit
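Several of the requirements above lend themselves to a simple pre-flight check. The sketch below is purely illustrative: the `Host` and `VM` classes, their field names, and the device labels are assumptions made for this example and do not correspond to any VMware API.

```python
from dataclasses import dataclass

# Hypothetical labels for the non-replayable devices listed above.
NON_REPLAYABLE = {"usb", "sound", "physical-cdrom", "physical-floppy", "physical-rdm"}


@dataclass
class Host:
    cpu_family: str
    esx_build: str
    in_ha_cluster: bool
    has_vmotion_nic: bool
    has_ft_logging_nic: bool


@dataclass
class VM:
    vcpus: int
    thick_disks: bool
    devices: tuple = ()
    paravirtualized: bool = False


def ft_problems(primary: Host, secondary: Host, vm: VM) -> list:
    """Return human-readable reasons why FT could not be enabled (empty if none)."""
    hosts = (primary, secondary)
    problems = []
    if primary.cpu_family != secondary.cpu_family:
        problems.append("hosts must use the same CPU family")
    if primary.esx_build != secondary.esx_build:
        problems.append("hosts must run the same ESX build")
    if not all(h.in_ha_cluster for h in hosts):
        problems.append("hosts must be in an HA-enabled cluster")
    if not all(h.has_vmotion_nic and h.has_ft_logging_nic for h in hosts):
        problems.append("each host needs separate VMotion and FT logging NICs")
    if vm.vcpus != 1:
        problems.append("VM must be single-processor (no vSMP)")
    if not vm.thick_disks:
        problems.append("all VM disks must be thick")
    if NON_REPLAYABLE & set(vm.devices):
        problems.append("VM has non-replayable devices")
    if vm.paravirtualized:
        problems.append("paravirtualization must be disabled")
    return problems


host_a = Host("harpertown", "build-1", True, True, True)
host_b = Host("harpertown", "build-1", True, True, True)
good_vm = VM(vcpus=1, thick_disks=True)
bad_vm = VM(vcpus=2, thick_disks=False, devices=("usb",))

good_result = ft_problems(host_a, host_b, good_vm)
bad_result = ft_problems(host_a, host_b, bad_vm)
```

With these sample values, `good_result` is an empty list and `bad_result` contains three entries: one each for the vSMP, thin-disk, and USB device violations.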
One additional requirement that is not listed is that the CPU clock speeds of the two ESX hosts must be within 400 MHz of each other. This ensures that one CPU does not lag behind the other, so the two can keep up with each other and stay in sync. You can check whether the processors in your hosts support the FT feature by using the CPU Host Info utility that is covered in “VMware vSphere: Got 64-bit hardware?”. You can also read more about this new feature at the following links:
- Protecting Mission-Critical Workloads with VMware Fault Tolerance
- Fault Tolerance Datasheet
- Fault Tolerance Checklist
- Fault Tolerant VMs in VMware Infrastructure: Operation and Best Practices (VMworld 2008 presentation for attendees/subscribers only)