Posted by: Ed Tittel
If you allow Windows to track and report on errors, every time your PC experiences some kind of problem it “phones home” to Redmond and reports on what happened. Windows also promises to send you information about any related solutions that turn up, but for most of us, the more typical response to seeking solutions for such problems looks like this in the Action Center interface:
As it happens, however, Microsoft also researches the causes and sources of such problems, thanks to the telemetry that delivers all this information to its tracking servers. The company has just published its first report on this data, entitled “Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs.” The summary is interesting and informative enough to be worth reproducing verbatim, so here goes:
We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
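To put the abstract’s headline numbers side by side, here is a quick back-of-the-envelope calculation (my own arithmetic based on the figures quoted above, not code from the report):

```python
# Odds arithmetic from the abstract: a "1 in N" chance is the probability 1/N.
baseline = 1 / 190   # first CPU-subsystem crash, machines with 30+ days TACT
repeat   = 1 / 3.3   # second crash, given that a first one already happened

ratio = repeat / baseline
print(f"A machine that has crashed once is about {ratio:.0f}x "
      f"more likely to crash again")
# ratio works out to roughly 58 -- approaching the "two orders of
# magnitude" the abstract mentions
```

That factor of roughly 58 is what the authors mean by “up to two orders of magnitude”: hardware faults are strongly recurrent, not one-off events.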
Lest you be inclined to pooh-pooh this report and its contents, it’s worth observing that it received the “Best Paper” award at the ACM EuroSys 2011 conference (ACM is the Association for Computing Machinery, a leading computer-science professional organization to which I have belonged since 1982).
Joel Hruska at ExtremeTech reviews its findings in an excellent story entitled “Microsoft Analyzes over a million PC failures, results shatter enthusiast myths.” I’ll summarize the high points here:
- The longer a CPU runs, the more likely it is to crash. Machines with fewer than 5 days of active use over an 8-month period (what MS calls Total Accumulated CPU Time, aka TACT) have a 1:330 chance of crashing. Machines with over 30 days of TACT over the same 8-month period have a 1:190 chance of crashing.
- Once a hardware fault appears, it is up to 100 times more likely to recur. Of machines that crash again from the same cause, 97% do so within a month of the first such crash.
- Overclocking (no surprise there) makes crashes more likely, while underclocking makes them less likely. Figure 3 from the report summarizes the overall overclocking findings. For underclocking, CPU failure odds improve from 1:330 (stock) to 1:460 (underclocked); DRAM one-bit flip errors drop from 1:2000 (stock) to 1:3600 (underclocked); and disk issues drop from 1:380 to 1:560. This confirms the conventional wisdom that underclocking improves PC reliability (it definitely reduces heat output, which is probably related).
- Surprisingly to some (but not to me, based on lots of hands-on experience), laptops proved more stable than desktops, countering the researchers’ own expectations.
- PCs from major systems vendors (such as Dell, HP, Asus, Lenovo, and so forth — defined as the “Top 20 computer OEMs” in the report) proved more reliable than those from all other vendors: CPU problem odds of 1:120 for OEM machines versus 1:93 for everybody else, and one-bit RAM flip odds of 1:2700 (OEMs) versus 1:950 (everybody else).
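The “1 in N” odds quoted above are easier to compare as relative failure rates. A short sketch (again, my own arithmetic on the quoted figures, not part of the report):

```python
def rate(n):
    """Probability implied by a '1 in n' chance."""
    return 1 / n

# Underclocking: stock odds vs. underclocked odds, per the figures above
cpu_improvement  = rate(330) / rate(460)    # ~1.4x fewer CPU crashes
dram_improvement = rate(2000) / rate(3600)  # ~1.8x fewer one-bit DRAM flips
disk_improvement = rate(380) / rate(560)    # ~1.5x fewer disk issues

# Brand-name (Top 20 OEM) vs. everybody else
cpu_oem  = rate(93) / rate(120)    # OEM machines ~1.3x less prone to CPU crashes
dram_oem = rate(950) / rate(2700)  # OEM machines ~2.8x less prone to DRAM flips

print(f"Underclocking gains -- CPU: {cpu_improvement:.1f}x, "
      f"DRAM: {dram_improvement:.1f}x, disk: {disk_improvement:.1f}x")
print(f"OEM gains -- CPU: {cpu_oem:.1f}x, DRAM: {dram_oem:.1f}x")
```

Seen this way, the biggest single effect in the list is the RAM advantage for brand-name machines, at nearly 3x.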
All in all, the report makes for some interesting reading and suggests that MS may be learning a great deal from this data in the aggregate, however unresponsive its forwarding of problem solutions through the Action Center might seem. It should be interesting to keep an eye out for more such findings in the future.