Why do DIMM cards fail so often in servers?

820 pts.
Tags:
DIMM cards
Recently I have been doing a study of the failure rates of server components. It probably won't shock anyone that the component that fails most often is the disk drive, given its electromechanical nature. What I do find surprising is that DIMM memory cards run a very close second. I would assume that, being completely electronic, they should be fairly high on the reliability scale, but they are not. The disk drives average one failure every 500-plus months of server use, and the DIMM cards average approximately 1,000 months between failures, which is only twice as good as the disk drives. Does anyone have an idea why DIMM cards would be so failure prone? Jim White

Answer Wiki


By your calculations, the average hard drive should last over 41 years (500/12) and a DIMM module over 83 years (1000/12). Considering that the average life of a server is around 4 or 5 years, I don't see either of the numbers you supplied as "failure prone." In my experience, and I believe many others will agree, the first parts that fail are mechanical, i.e. hard drives and fans. A major cause of component failure, other than mechanical wear, is generally heat (oftentimes because of a failed fan).

This is just a theory, but processors are the focus of the cooling systems in a computer, as is the power supply. Often there is a heat sink on the north bridge chip and on video cards as well. These cooling systems keep components at a relatively regulated temperature when operating. RAM, on the other hand, has no cooling system. RAM generally does not run that hot, but repeated hot/cold cycles can cause material fatigue in just about anything. Still, if your average DIMM will last 83 years according to the numbers you have supplied, that is probably the reason companies are not too concerned with putting another noise-making fan in the box.

I have found that most problems with RAM are caused by poor handling, or by heat due to dust buildup. If you keep the components cool, keep them clean, and supply them with clean power, your server hardware failure rate will be extremely low.

———————

I'll have to agree with Flame here. I rarely see RAM, or any other electrical component, fail. However, I have seen hard drives fail on a regular basis. I've probably worked with a couple of thousand servers over the years, and other than RAM that was shipped bad from the vendor, I can probably count on one hand the sticks of RAM that have failed after being in use for more than 30 days.

Discuss This Question: 3  Replies

 
  • Jim4522
    Flame, when I say that the current disks are failing every "500 months of server use", that is not the same as saying that a given disk fails every five hundred months. A facility that has 10,000 servers experiences 10,000 server months each month, which means that on average 20 disk drives will need to be replaced each month if they are IBM-quality disk drives. In four years that facility will experience 480,000 server months of use and will replace 960 drives, again if they are IBM-quality disk drives. But if those servers are from HP, those 480,000 server months would produce 3,117 disk replacements, because HP servers experience a disk drive failure every 154 server months.

    You might be interested in the following replacement rates, in "server use months" per replacement:

    Device             IBM      HP    Dell     Sun Fujitsu
    System board     1,291     742     895     727   1,427
    Memory             407     255   1,044     163     150
    Power supply     3,874   1,293   1,044     283   2,855
    Disk               500     154     261     232     317
    CPU              2,113  10,024   6,267   1,490   2,855
    HBA              2,213   8,019   2,089   2,981     571

    The best composite server (Fujitsu system board, Dell memory, IBM power supply, IBM disk drive, HP CPU, HP HBA) would have 61% fewer failures than the best existing server. If users measured the "Maintenance Rates" of servers and the "Replacement Rates" of the six major components that constitute 90+% of all failures, the vendors would immediately respond by making much more reliable servers and components. And the cost of maintenance, which is higher than either cooling or power costs, would decrease substantially. Jim4522
  • Flame
    Thank you for the clarification; it makes more sense to me now. I was thinking about this on the way to work: as integrated circuits in general become more powerful, they also become more fragile. Perhaps RAM chips are starting to demonstrate the physical limitations of their manufacturing process. I'd imagine that CPUs and RAM are both made on rather densely populated dies with extremely narrow connections. CPUs get a fan, at least. Your question has caused me to look into the chip manufacturing industry, and I'm learning a lot! Thanks! -Flame
  • Meredith Courtemanche
    For anyone landing on this discussion today, SearchDataCenter's senior technology editor Steve Bigelow has written a small series on DIMMs. The articles cover voltage: http://bit.ly/1c2dYun, DIMM types: http://bit.ly/1fCRocu and memory error correction/recovery features: http://bit.ly/1kPjaBP.
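The "server months" arithmetic in Jim4522's reply can be checked with a short calculation. This is only a minimal sketch: the function name `expected_replacements` and its parameters are made up for illustration, and it assumes a constant replacement rate, so that expected replacements are simply total server months divided by the server months between replacements.

```python
def expected_replacements(servers: int, months: int,
                          months_between_failures: float) -> int:
    """Expected number of replacements, rounded to the nearest whole unit.

    Hypothetical helper: assumes a constant replacement rate, expressed
    as server months of use per replacement (the figures in the table).
    """
    server_months = servers * months  # e.g. 10,000 servers * 48 months
    return round(server_months / months_between_failures)


# Figures from the reply: a facility with 10,000 servers.
print(expected_replacements(10_000, 1, 500))   # IBM disks, one month  -> 20
print(expected_replacements(10_000, 48, 500))  # IBM disks, four years -> 960
print(expected_replacements(10_000, 48, 154))  # HP disks, four years  -> 3117
```

Read this way, "500 months of server use" describes replacements across the whole fleet, not the lifetime of any single drive, which is the distinction the reply draws against the 500/12 calculation.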
