I read both the Virtualization Review performance testing article and the related commentary with quite a bit of interest. I came across this statement in the article:
The key to making this cross-comparison of the hypervisors is ensuring a consistent environment. For each test, every hypervisor was installed on the same server using the same disk system, processor quantity, and memory quantity. The same hardware was used for each test and all software was installed and configured the same way for each test.
I find this confusing. I interpret it to mean that identical hardware was used for testing, along with some sort of software install. But I have to ask: what software? Are we talking about the contents of the virtual machines, or about the hypervisors themselves? What exact versions of the hypervisors were tested? What the paragraph above does not speak to is the layout of the VMs within the disk system. Did the VMs share a LUN, or did each use its own? Hyper-V often requires one LUN per VM if you want Quick Migration support. Were the other hypervisors' VMs laid out the same way? Since disk I/O is a major performance issue, I would have expected this to be spelled out in great detail.
The other details I expected to see explained concerned how the VMs were brought into the hypervisors. Were they stood up as new VMs for each hypervisor? Were they the result of a P2V migration? Or were they imported from a library of VMs? Were paravirtualized drivers installed within the VMs? Just how were the VMs configured and created?
I was looking through the article for several numbers; one I found, the others I did not.
- How many times did the testers run each test? The engineer in me wants to know the math behind the numbers. Perhaps there should have been a graph of runs vs. results; this would help me determine whether caching was involved.
- Were the tests run sequentially? I ask because disk information is cached at several levels, so we need to know whether the tests ran from cache or direct from disk. If you have one VM per LUN, cache comes into play quite a bit. Several disk I/O tools (iozone, for example) can disable caching, but repetitive tests will often live within cache, which is not an accurate disk I/O measurement: on an average system, disk accesses will not come from cache. In essence, while the tests were supposed to exercise just the hypervisor, the disk subsystem is part of the whole. Was disk I/O time included in the equation? It appears it was, which raises the question: how were the VMs laid out on disk? Were they all on the same LUN? If so, that LUN got pummeled and the numbers are all suspect.
- Where are the numbers for running this same test on physical hardware? Otherwise how can we know if the CPU, RAM, and disk operations numbers make any sense or not? Where is the baseline for the tests?
The most interesting number to me was the time clock, but without knowing how many tests were run, the numbers are so close that the differences could amount to nothing more than round-off or truncation error. Ever seen a poll with a documented 3% margin of error whose results are within 2% of each other? Statistically, that means the results are really the same. So where are the statistics behind the time clock values?
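The poll analogy can be made concrete. Given per-run timings for each hypervisor (the figures below are hypothetical, not from the article), you can compute a 95% confidence interval for each mean and check whether the intervals overlap; if they do, the difference is within the noise. A minimal sketch, assuming roughly normal data:

```python
import math
import statistics

def ci95(samples):
    """95% confidence interval for the mean (normal approximation)."""
    m = statistics.mean(samples)
    half = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return (m - half, m + half)

def overlap(a, b):
    """True if the two intervals overlap, i.e. the difference between
    the means could be nothing more than run-to-run noise."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical wall-clock times (seconds) for the same test on two hypervisors
hv_a = [612, 605, 619, 608, 615, 611, 607, 614]
hv_b = [618, 611, 622, 609, 620, 613, 616, 610]

ci_a, ci_b = ci95(hv_a), ci95(hv_b)
print(ci_a, ci_b)
print("statistically distinguishable:", not overlap(ci_a, ci_b))
```

With these made-up numbers the intervals overlap, so even though one mean is a few seconds lower, the test cannot claim a winner without more runs.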
What I do like about this test is that it is an unoptimized test. No one went in and optimized the number of spindles per LUN, the SQL implementation, or the other workloads; the testers did not go out of their way to make anything look particularly good. They instead did exactly what a user would do: installed a workload and let it run. The average user is simply not going to optimize his install; he will just run it and expect the best.
On the other hand, many vendors will optimize everything to garner the best results. They will spend hours if not days tweaking this and that. I even know one vendor who put a special bit in their hardware that, once toggled, would simply repeat whatever was in the hardware's cache 10 million times. The vendor got great results, but they were not real-world results, as you could never pump data to the hardware that fast through the other layers involved. We need more unoptimized performance tests.
If you are going to run performance tests between hypervisors, take care to ensure your VMs are laid out identically with respect to storage, document the use of any paravirtualized drivers, and run the tests many, many times so that you can get your statistics right.
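As a sketch of what "many, many times" might look like in practice: time the same workload repeatedly so the result is a distribution rather than a single number, and flush the cache between runs so repeats measure disk rather than memory. The `true` command below is a stand-in for the real benchmark driver, and the page-cache flush is Linux-specific (root only; silently skipped otherwise):

```shell
# Repeated-run harness sketch. "true" is a placeholder for the actual
# workload; timings (in milliseconds) accumulate in timings.txt so means
# and error bars can be computed afterward.
RUNS=30
: > timings.txt
i=1
while [ "$i" -le "$RUNS" ]; do
  # Write back dirty pages, then try to drop the Linux page cache so this
  # run reads from disk, not from the previous run's cache (needs root).
  sync
  echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true
  start=$(date +%s%N)          # nanoseconds (GNU date)
  true                         # <-- replace with the actual workload
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 )) >> timings.txt
  i=$((i + 1))
done
echo "runs recorded: $(wc -l < timings.txt)"
```

Thirty runs is an arbitrary illustrative count; the point is simply that one run tells you nothing about variance.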
You may find that there is just no clear winner, leaving your choice of hypervisor dependent entirely on the advanced features available within each product.