Unexpected Win2K Server shutdown.

Tags:
Security
Over a period of four minutes and 36 seconds beginning around 10:20 PM EST, seven servers, all in one rack and all on the same UPS, suffered unexpected shutdowns. All servers are time-synchronized, and according to the Event Viewer only two shut down at exactly the same time. Two other racks of servers, each on its own UPS, did not shut down. The UPS on the rack that shut down has two two-year-old batteries in it and was running at 30% of capacity. While I did not have any UPS management software running on this UPS, I did have it on the other racks' UPS units. Their logs showed no unusual electrical events, e.g., brownouts or spikes. This had never occurred before and has not repeated itself. Does anyone have any idea what could cause this?
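As an aside, the timing pattern described above (a spread of 4m 36s, with only two servers going down at the same instant) is easy to check once the shutdown times have been copied out of each server's event log, for example from the Event ID 6006 or 6008 entries in the System log. The following is only a minimal sketch: the server names and timestamps are hypothetical placeholders chosen to match the shape of the incident, not data from the original post.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical shutdown times, one per server (illustrative only;
# the original post does not list the actual timestamps or the date).
shutdowns = {
    "srv1": "2005-01-10 22:20:04",
    "srv2": "2005-01-10 22:21:10",
    "srv3": "2005-01-10 22:21:55",
    "srv4": "2005-01-10 22:21:55",
    "srv5": "2005-01-10 22:22:30",
    "srv6": "2005-01-10 22:23:47",
    "srv7": "2005-01-10 22:24:40",
}


def summarize(times):
    """Return (ordered events, total spread, groups of simultaneous servers)."""
    parsed = {
        name: datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        for name, ts in times.items()
    }
    # Order the servers by the time they went down.
    ordered = sorted(parsed.items(), key=lambda item: item[1])
    # Time between the first and the last shutdown.
    spread = ordered[-1][1] - ordered[0][1]
    # Group servers that share an identical timestamp.
    groups = defaultdict(list)
    for name, ts in ordered:
        groups[ts].append(name)
    simultaneous = [sorted(names) for names in groups.values() if len(names) > 1]
    return ordered, spread, simultaneous


ordered, spread, simultaneous = summarize(shutdowns)
print("Spread between first and last shutdown:", spread)
print("Servers that went down at the same instant:", simultaneous)
```

A staggered order with one simultaneous pair is the kind of pattern you would expect from a slowly sagging supply, while seven identical timestamps would point to an abrupt loss of output.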

Answer Wiki


Have you tried the ‘Wood’s Hole Observatory for non-repeatable events’?

The only things the servers have in common are the UPS and probably the switch/router they connect to, so my first choice would be the UPS. Wake on LAN is easy; Die on LAN should leave traces in the event logs.

Are they sharing outlets (duplex wiring)?

My best guess is that whatever tripped the first one will not be found, but that the restore of the first tripped out the second, and so forth. You didn't specify the servers, and some are much more sensitive to interruptions than others. That is not necessarily a bad thing: data protection is the primary job.
Good luck!

Discuss This Question: 2  Replies

 
  • GBirmingham
    This is an interesting event you describe, and perhaps one with a reasonable explanation. First, let's examine the facts. You have three racks of servers, each served by a dedicated UPS. During the 4m 36s event, only one rack experienced a problem, and all seven servers on that single UPS went through an unscheduled shutdown. Loading on the affected UPS is stated to be about 30% of maximum.
    You didn't say how long the UPS should have been able to carry the total server load; that's a number you should have available. You said that you had no monitoring on the UPS that experienced the event, but that the other two UPS units were monitored and reported no input power problems. You didn't say whether all three UPSes share a common input power path (from the utility power main entrance to the UPS, including intermediate breaker panels, etc.), or whether all power to your computer room is controlled by you rather than located in a tenant common space in your facility (often found in suburban office parks). You mention the battery age of the UPS that was involved, which suggests the UPS has been in service for about two years or more, but you didn't say anything about the age of the other UPS units.
    First off, battery age would likely affect how long the UPS can carry the load, but is not likely to cause a total loss of output by itself. However, if utility power was completely interrupted or experienced a sustained voltage sag, that would cause the UPS to switch from utility power and try to supply battery-backed power to the servers. If the batteries are really bad and unable to hold a charge, or the trickle-charging circuit in the UPS is bad and the batteries are not being charged, then loss of input power probably results in an almost immediate cessation of power to the servers: a dead rack almost immediately, sensed by all servers at once.
    But if the batteries are merely weak, then your load-carrying time is reduced and power output begins to sag gradually at some steady rate of decline. This is where having the load-carrying time for each UPS handy pays off. If the correct load-carrying time is calculated to be 16 minutes for all seven servers and they die in around four, then the two-year-old batteries become immediate suspects. Also, the switching power supplies in servers will have different thresholds of input power at which they shut down automatically. That is to say, if I could gradually lower the input voltage to all seven servers at once, they would likely shut down at different times. This is due to component aging in the power supply units. Now, if you happened to have two servers of similar or identical configuration, with power supplies of the same or very nearly the same age, they might well shut down at the very same time (as you observed in their logs). Others may take a longer or shorter time to do so, again depending on component aging.
    This really sounds more like what you experienced: a loss of primary AC on the UPS, and either the batteries are very weak or the load-carrying time of this configuration is really only 4m 36s. But you had to have a loss of input power to experience this. So the real question here is: was there a loss of utility power that affected only the one UPS but not the others?
    Since I don't know what kind of UPS units are involved here, let's assume they are units that plug into a wall outlet and likely require a 15A or 20A circuit for input. Most computer room UPS configurations have dedicated outlets for UPS units. But sometimes the UPSes are wired directly to a breaker panel, where their individual supply breakers are well labeled and marked as to their purpose. Three racks of servers sounds pretty heavy duty, so you don't want their power switched off by accident.
    However, non-hardwired units often just get plugged into an outlet whose breaker may not be marked as supporting a UPS, and which can get turned off by accident to service something else on the circuit. Or sometimes a new device is connected to that circuit, which then causes an overload and the breaker trips. No one notices that a server UPS has been taken offline at the same time. So I'd be looking at the source of input power to the affected UPS versus the two that didn't have a problem, to see if the explanation of the event lies there. It really sounds like the unit's AC got disrupted and by chance the power loss was corrected, but not before the servers powered off. You might want to make sure all power panel breakers are clearly marked when they supply a UPS, and that other non-UPS loads are moved off those circuits to ensure that power doesn't get accidentally turned off.
    Seeing that the outage occurred at 10:20 PM: what time does the cleaning crew come in? GBirmingham
  • Kerm
    Don't forget other considerations. Are all the systems that shut down the same model and age? I have a track running with HP to see if the Compaq DL140 power supply is the cause of several unannounced shutdowns.
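GBirmingham's reply suggests comparing the expected load-carrying time of the UPS with how long the servers actually stayed up. A rough way to sketch that comparison is shown below. Everything here is an assumption for illustration: the battery energy, load, and inverter efficiency figures are hypothetical, the linear energy/load estimate ignores the way real battery capacity drops at high discharge rates, and the 50% threshold is arbitrary. The vendor's runtime chart for the specific UPS model is the number to trust; only the 16-minute and 4m 36s figures come from the thread.

```python
def estimated_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.85):
    """Very rough runtime estimate: usable battery energy divided by load.

    All parameters are hypothetical inputs. Real runtime is nonlinear in
    load, so treat the result as an order-of-magnitude figure only.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    hours = battery_wh * inverter_efficiency / load_watts
    return hours * 60


def batteries_suspect(expected_minutes, observed_minutes, tolerance=0.5):
    """Flag the batteries when observed runtime falls well short of expected.

    'tolerance' is an arbitrary threshold chosen for this sketch: anything
    below half of the expected runtime is treated as suspicious.
    """
    return observed_minutes < expected_minutes * tolerance


# Figures from the thread: 16 minutes expected (GBirmingham's example),
# 4 minutes 36 seconds observed.
expected = 16.0
observed = 4 + 36 / 60

if batteries_suspect(expected, observed):
    print("Observed runtime is far below expected: check the batteries.")
else:
    print("Observed runtime is in line with expectations.")
```

Note that this check only makes sense if input power was actually lost; a short runtime says nothing about why the UPS went to battery in the first place.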
