Posted by: Eric Siebert
Eric Siebert, hardware, Servers, Virtualization
When system administrators receive new servers, they are often anxious to get them unpacked, in the rack and loaded up with ESX so they can start creating virtual machines. But an important first step should be done before proceeding with virtualization software installation on the server: always burn-in the memory to test for defective memory modules.
Defective memory will usually be unnoticed in a newly-deployed server and it may be months before signs of defective memory start to show. In one group of five HP servers, I had to replace seven memory DIMMs over an 18 month time period. Most of these were eventually detected by HP’s Insight Manager agents that reside on the server, but two of them caused hard server crashes of VMware ESX servers commonly known as a PSOD (Purple Screen of Death). A PSOD on one of your production servers, loaded up with important virtual machines, is never a good thing. You can reduce your chances of this happening by burning in your memory.
Most servers do a brief memory test on startup as part of their POST procedure. This is not a very good test and will only detect the most obvious of memory problems. A more thorough test checks the interaction of adjacent memory cells to ensure that writing to one cell does not overwrite an adjacent cell.
A good, free memory test utility is available, called Memtest86+, that performs many different tests to thoroughly test your servers memory. You can download it as a small 2MB ISO file that can be burned to a CD and booted on your new server. Let the memory burn-in for at least 24 hours (the longer the better though). Memtest86+ will run indefinitely and the pass counter will increment as all of the tests are run. The more RAM you have in your system, the longer it will take to complete one pass. A system with 32GB will generally take about one day to complete. Memtest86+ not only tests your system’s RAM but also the CPU L1 and L2 caches. Should it detect an error, the easiest way to identify the memory module that caused it is to simply remove a DIMM and run the test again and repeat until it passes. Documentation on Memtest86+ includes troubleshooting methods, detailed test descriptions and the causes of errors.
If you already have ESX servers running and want to test their memory, you can use the little known Ramcheck service to do this while ESX is running. This service is non-disruptive and runs in the background consuming minimal CPU cycles.
The extra time you spending testing memory before deploying servers helps eliminate potential problems down the road.