Why did the system crash?

Tags:
AS/400
PTFs
Forgive me if this is hard to read, but the editor looks messed up again (I tried all three browsers). I'm at a loss here. We had a user working remotely through a VPN last night. Around 19:00 they lost the connection while running a query and did not try logging back on. At 19:17 I see this message:

CPF590A 40 INFO Session to device WFHQ_XXXX1 ended normally. QCMNARB06 QSYS 095033 QSWLDFR 0000 06/12/13 19:18:01.401627 QSYS

Then, every 8 seconds or so:

CPI2417 40 INFO Job message queue for 144880/XXXX/WFHQ_XXXX1 has been wrapped. WFHQ_XXXX1 XXXX 144880 QMHPQSEH 0000 06/12/13 19:18:25.611239

It looks like it ran until seven hours later:

CPF1164 00 COMPLETION Job 144880/XXXX/WFHQ_XXXX1 ended on 06/13/13 at 02:19:12; 11115.548 seconds used; end code 0 WFHQ_XXXX1 XXXX 144880 QWTMCEOJ 0000 06/13/13 02:19:12.122111 XXXX

Late last night we started getting storage errors:

CPF0907 80 INFO Serious storage condition may exist. Press HELP. QSYSARB5 QSYS 095008 QWCATARE 0000 06/12/13 23:35:16.943655 QSYS

By early this morning we had maxed out our storage and were dead in the water for 9 hours. Now that we have our system back (with IBM's help), how do we prevent this in the future? Why did this job take so long to end after the connection failed, and why did it eat up all our storage? We are back to using 30% of our total disk. Is there a setting I should look at, or a PTF we may be missing? Open to ideas.

Software/Hardware used:
i-Series, V7R1M0
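If it helps anyone retracing this kind of incident: the messages quoted above all land in the history log, so the relevant window can be pulled out of QHST with DSPLOG. A minimal sketch, using the message IDs and time window from this incident (the date values assume MDY job date format; adjust to your QDATFMT):

    DSPLOG LOG(QHST) +
           PERIOD((190000 061213) (030000 061313)) +
           MSGID(CPF590A CPI2417 CPF1164 CPF0907) +
           OUTPUT(*PRINT)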

Discuss This Question: 6 Replies

 
  • CharlieBrowne
    Just lost everything I keyed in. :-( My guess is this: since disk utilization was OK when the system came back up, the job was looping trying to recover from the disconnect, and the logging was being done to the job's temporary storage. To stop this in the future, look at the system value QDEVRCYACN; it is also on JOBDs. You may also want to set the MAXSTG value on USRPRFs.
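    A rough CL sketch of the settings CharlieBrowne mentions above. The job description QGPL/QRYUSERS, the user profile WFHUSER, and the storage number are placeholders only; pick values that fit your environment, and test the device-recovery action on a non-production system first:

        /* End jobs whose device/connection is lost, rather than     */
        /* leaving them waiting on device recovery.                  */
        CHGSYSVAL SYSVAL(QDEVRCYACN) VALUE('*ENDJOBNOLIST')

        /* The same attribute can be set per job description.        */
        CHGJOBD JOBD(QGPL/QRYUSERS) DEVRCYACN(*ENDJOBNOLIST)

        /* Cap the permanent storage a profile may own (KB units).   */
        CHGUSRPRF USRPRF(WFHUSER) MAXSTG(2000000)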
  • TomLiotta
    First, apparently there is no joblog from the failing job. That's where any problem determination should start (and hopefully end). But the indication is of a 'Normal' termination, so perhaps no spooled joblog was created, possibly due to the job description's logging level. But then there's the element of "...running a query...", and we'd need some definite details about the actual query specification.
    A significant query might take a long time to initiate. That could delay any response back to the remote device, causing it to see the connection as dead. Are there query options that limit what any query is allowed to attempt? E.g., what is the limit set for QUERY_TIME_LIMIT and related options? And do you know the exact query that was requested?
    Have you reviewed WRKPRB?
    Without a joblog, I don't know of a way to be sure what caused the long time to end. A common reason is a long series of messages in the job's external message queue. We're pretty sure there were a lot of them because of the repeating CPI2417 messages, but every wrapping instance should show that the limit on size was reached. The various QJOBMSGQ* system values should help you understand what happens for each instance.
    I'd expect something like a large number of query status messages. It doesn't yet feel like device recovery, simply because of the 'Normal' ending. It's worth looking at those related system values, though, just to be sure that they don't allow excessive retries.
    Without some info from within the job, I can't think of a way to do anything about this. Although a remote device was involved, we don't know if it was relevant. (And any TCP/IP connection is "remote" by definition.) About the only thing that can be done is to monitor for space and have alerts sent out before it reaches a serious level. Of course, someone needs to be responsible for monitoring and reacting to them.
    Tom
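    For what it's worth, the values Tom points at can be checked and set from a command line. A small sketch; the 600-second limit is only an example, CHGQRYA here affects the current job only, and a more permanent QUERY_TIME_LIMIT would normally live in a QAQQINI query options file:

        /* How big a job message queue may grow, and what happens    */
        /* when it fills (*NOWRAP / *WRAP / *PRTWRAP).               */
        DSPSYSVAL SYSVAL(QJOBMSGQMX)
        DSPSYSVAL SYSVAL(QJOBMSGQFL)

        /* Refuse queries the optimizer estimates will run longer    */
        /* than 600 seconds (current job only).                      */
        CHGQRYA QRYTIMLMT(600)

        /* Check for logged problems from the night of the incident. */
        WRKPRB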
  • ToddN2000
    Thanks a lot for the tips. The user was running an interactive WRKQRY when their VPN connection dropped. When our system came back up, with IBM's help, they looked at the QHST log and found the job. They looked at the query, and there was a file being created for them to import into Excel. It had, according to our ops manager, BILLIONS of records. Our ops manager writes a lot of queries, more than she should, mainly for users, VPs, and CFOs. She said it looked fine and should not have created such a massive file. It would have been great to have the system call someone when the storage limit error messages came up, but we don't have that software, or nobody wants the responsibility. I am no longer in that department since I moved to the .NET side of our company, but they still come to me for solutions.
  • CharlieBrowne
    I believe setting MAXSTG would prevent this issue in the future.
  • TomLiotta
    Management Central is installed (even if never used); but if your system doesn't use it (or other monitoring software), the system is running uncontrolled. It can monitor for meeting and exceeding DASD thresholds and run programs when conditions are met.
    A number of years ago, I used an SDLC line and controller description with an old modem configured for auto-dial. When varied on, it would dial whatever number was assigned. It didn't actually send any info, but I had a VRYCFG that could run as needed to send me a signal at desired times. E.g., it could run after long-running upgrades to let me know they had finished, so I wouldn't have to hang around the console checking every few minutes as the night dragged on.
    IOW, a little creative thought can get useful work done.
    Tom
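    If Management Central or third-party monitoring isn't an option, even a small home-grown watcher helps. A minimal CL sketch, assuming a QSYSMSG message queue has been created in QSYS (when that queue exists, the system sends copies of certain critical messages there, CPF0907 among them, as I understand it) and assuming ADMIN is a placeholder user profile to alert:

        /* Minimal storage-alert watcher: waits on QSYSMSG and       */
        /* forwards the serious-storage message to a person.         */
        PGM
          DCL VAR(&MSGID)  TYPE(*CHAR) LEN(7)
          DCL VAR(&MSGTXT) TYPE(*CHAR) LEN(132)
          DOWHILE COND('1')
            /* Wait for the next critical message; QSYSMSG gets      */
            /* copies, so removing here does not disturb QSYSOPR.    */
            RCVMSG MSGQ(QSYS/QSYSMSG) MSGTYPE(*ANY) WAIT(*MAX) +
                   RMV(*YES) MSG(&MSGTXT) MSGID(&MSGID)
            IF COND(&MSGID *EQ 'CPF0907') THEN(DO)
              SNDMSG MSG(&MSGTXT) TOUSR(ADMIN) /* placeholder user */
            ENDDO
          ENDDO
        ENDPGM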
  • TomLiotta
    I think MAXSTG() would work if the query was generating an actual physical result table. But if it was only generating intermediate temporary results, e.g., temporary indexes, etc., then MAXSTG() wouldn't apply. For many kinds of queries, it just doesn't help.
    A more likely limit would be the job class's MAXTMPSTG() attribute. It'd be necessary to determine which *CLS object was relevant to such temporary storage.
    And of course, setting ASP thresholds along with the QSTGLOW* system values should be done when the system is set up.
    Tom
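    A short sketch of the limits Tom describes. QGPL/QINTER and the megabyte figure are only examples; check which *CLS object your interactive query jobs actually run under before changing anything:

        /* Cap the temporary storage any job running under this      */
        /* class may use (megabytes; *NOMAX is the default).         */
        CHGCLS CLS(QGPL/QINTER) MAXTMPSTG(100000)

        /* Check what the system does as auxiliary storage runs low. */
        DSPSYSVAL SYSVAL(QSTGLOWLMT)  /* remaining-storage threshold */
        DSPSYSVAL SYSVAL(QSTGLOWACN)  /* *MSG, *CRITMSG, *REGFAC ... */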
