ENDJOBABN

Jul 5 2009   6:00AM GMT

Friday Night Lights



Posted by: Steve Pitcher
AS/400, System i

I had a lovely Friday, working from home in complete silence until my little boy woke up.  I then went to the office and shut my door for a few hours, banging out the finishing touches of a project due at the end of the day.

Around 3PM we get word that there’s going to be a scheduled power outage at 3:30 for repairs.  Evidently there were a couple of explosions earlier in the day and some of the electrical equipment got toasted.  The outage would last between 5 hours and 3 days.  Big window of an estimate, but what can you do?  The poor electricians were probably airing the smoke out of the substation and assessing the problem.

We have a four-person IT shop: a manager, me and two other technicians.  As luck would have it, my two-week after-hours on-call shift started Friday, the manager is on vacation and the two other technicians started their vacation Friday after work.  Guess who gets to hang out after work to make sure things are in good order with the equipment?

At 3:30 the power shuts off and our UPS carries the load for our servers for the minute or so it takes the propane generator to fire up.  At 3:32 the generator kicks in like clockwork, runs along for all of 5 minutes and shuts off again.

I run and find one of our electricians to have a look while I check the UPS.  Cool.  I have 33 minutes of battery time left.  I make a few calls and send a few emails to prep users on the possibility of a total computer shutdown on the 5 companies we support out of our office.

Looks like the generator was toasted by the power surges earlier in the day.  Knowing the generator is a lost cause and that I need to power down 3 AS/400s and maybe 10-12 Windows servers in about 15 minutes, I head back to the server room and check the UPS.  The front panel says “15 minutes battery time remaining” so I have plenty of room to move.

I get 2 AS/400s on the way down (and I need a full 8 minutes for them to shut off) and start working on the 3rd when the UPS starts making the awful fast beeping noise indicating an imminent shutdown.  It’s times like these when you second-guess your ability to restore from backup tapes.  One very short minute later, all machines go down…HARD.  My stomach rolled over like you’d expect.

Colorful and creative cursing ensued at the UPS for telling me I had 15 minutes when I really had 5-6.

More colorful and creative cursing ensued at the flipping generator for failing when I needed it the most.

Six long hours later, after the power was restored, I started to power up the machines, only to find the lovely amber alert light lit on our new AS/400 model 515.  Luckily, after booting into SST, it turned out to be just an indicator of power fluctuation.

Even more colorful cursing ensued at the bloke at IBM who put this feature in the new machines.  Our models 170 and 270 went through the same experience but appeared fine with no system attention light.  Put the message in the QSYSOPR message queue but don’t fire up the “uh oh” light and cry wolf.  I want to see that light come on when I have a DASD failure or something and need to take action.

With all that said, all systems were a go with no hardware or software damage.

I don’t like dodging a bullet, but the alternative is being hit by one.  I had to hunt down one of the technicians in order to put myself on the UPS email alert system in case the systems went to UPS power in the next few days until we get the generator repaired.  In that case I’d have to remote in and power down all systems and bank on only having 20 instead of 33 minutes to get the job done right.  Tethered to the computer room 30 miles away.  I’ll also have to get the UPS checked to ensure it’s giving an accurate representation of battery time based on the load.
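For what it's worth, the arithmetic behind that plan can be sketched in a few lines of Python.  This is only a rough illustration, not a tool we actually run: the function names, the derating factor and the idea of treating every system as needing the same 8 minutes are all my own assumptions.  The only numbers taken from this story are the 33 and 15 minute panel readings, the 5-6 minutes I really got and the 8 minutes an AS/400 needs to shut off.

```python
def usable_minutes(reported_minutes, derate):
    """Scale the UPS panel estimate by a safety factor in (0, 1].

    The derate value is a guess until the UPS has been load tested.
    """
    if not 0 < derate <= 1:
        raise ValueError("derate must be greater than 0 and at most 1")
    return reported_minutes * derate


def plan_fits(shutdown_minutes, budget_minutes, parallel=True):
    """Return True if every shutdown can finish inside the budget.

    parallel=True assumes all systems are told to power down at once,
    so the slowest one sets the total time.  parallel=False assumes they
    are shut down one after another, so the times add up.
    """
    if not shutdown_minutes:
        return True  # nothing to shut down
    if parallel:
        needed = max(shutdown_minutes)
    else:
        needed = sum(shutdown_minutes)
    return needed <= budget_minutes


if __name__ == "__main__":
    as400s = [8, 8, 8]  # hypothetical: 8 minutes for each of 3 AS/400s

    # Friday: the panel said 15 minutes, reality was closer to 5.5.
    print(plan_fits(as400s, 5.5))

    # The plan for next time: treat a 33 minute reading as 20.
    budget = usable_minutes(33, 20 / 33)
    print(plan_fits(as400s, budget))
    print(plan_fits(as400s, budget, parallel=False))
```

The point of the one-at-a-time case is that starting every shutdown at once matters as much as the battery estimate: three 8 minute shutdowns in a row need 24 minutes, which does not fit in a 20 minute budget.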

It’s time to review our systems continuity strategy and schedule more regular testing.  I’d suggest you do the same.
