SbElectric
1600 pts. | Jun 19 2009 10:01PM GMT
One comes to mind, regarding data backup incidents. We set up an elaborate data backup job with multiple steps (and condition code checking for each step) for various types of data. We dutifully tested each step, checked for a step completion code of 0 (zero), and displayed a proper alert message for the operators if a step did not execute properly. The operators were instructed to check for a condition code of “0” for each step. Everything went fine, and we tested the restore process using the backup tapes. As in any large data center, there were always some procedure/process changes or environmental changes. Luckily, most of these changes were caught (or observed) in our large backup job. When a step did not finish with condition code “0”, the operators dutifully caught the anomaly and informed us. We made the necessary modifications and life was back to normal.
One day, due to some security concerns, some databases were moved into a DMZ to isolate them from the main configuration. These databases contained sensitive and critical information. The operators reported that everything went fine with the backup job (all condition codes were “0”). But as a passing comment, they mentioned that the job ran a bit quicker. Being technically savvy (?), I thought this must be OK since there was less contention in the DMZ! This continued for a few more days. It was summertime … the living was easy … and the fish were jumping.
My boss has an uncanny “trust but verify” mentality. So one day he asked us to restore the library from the DMZ onto a different server. Dutifully we loaded the backup tape to restore, and it yielded “0” records restored!! On examining the backup job, we noticed the message “not sufficient authority to access the database”. Job step skipped, condition code “0”. The job did not have the proper access privileges for the DMZ, so the step was skipped without ever raising a non-zero condition code.
Needless to say, we sheepishly learned our lesson of “trust but verify”!!
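
For anyone who wants to avoid the same trap, here is a minimal sketch in Python of a check that looks beyond the condition code. It is not taken from the job described above; StepResult, verify_step, the list of suspicious phrases and the minimum record count are all hypothetical names and values made up for illustration. The idea is simply that a step counts as good only if the condition code is 0 AND it wrote a plausible number of records AND its log contains no authority or skip messages.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one backup step (hypothetical structure for illustration)."""
    name: str
    condition_code: int
    records_written: int
    log_text: str


# Assumed phrases worth flagging even when the condition code is 0.
SUSPICIOUS_PHRASES = ("not sufficient authority", "step skipped")


def verify_step(result: StepResult, expected_min_records: int) -> list[str]:
    """Return a list of problems; an empty list means the step looks healthy."""
    problems = []

    # The classic check: the condition code itself.
    if result.condition_code != 0:
        problems.append(f"{result.name}: condition code {result.condition_code}")

    # A "successful" step that wrote too few records is not a backup.
    if result.records_written < expected_min_records:
        problems.append(
            f"{result.name}: only {result.records_written} records written, "
            f"expected at least {expected_min_records}"
        )

    # Scan the job log for messages that a zero condition code can hide.
    lowered = result.log_text.lower()
    for phrase in SUSPICIOUS_PHRASES:
        if phrase in lowered:
            problems.append(f"{result.name}: log contains '{phrase}'")

    return problems


# Example: a step that ends with condition code 0 but backed up nothing.
dmz_step = StepResult(
    name="DMZ database",
    condition_code=0,
    records_written=0,
    log_text="Not sufficient authority to access the database",
)
for problem in verify_step(dmz_step, expected_min_records=1):
    print(problem)
```

None of this replaces an actual test restore on another server, which is what finally exposed our problem. It only makes the quiet failures noisier.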






