The iSeries Blog

Nov 24 2008   2:19PM GMT

Performance tuning the AS/400: Fault rates and pools



Posted by: Leah Rosin
Tags:
Operating systems
System management

A recently published tip on performance tuning the AS/400 was created to respond to Search400.com reader feedback. Raymond Johnson answered a series of questions submitted by a reader, and since publication, the reader has provided a few more questions that Johnson has kindly answered.

The article is fantastic, to say the least. Our system has fault rates in the hundreds and page rates in the thousands. I assume that this indicates that the system is thrashing and is spending more time moving data in and out of storage than it does processing it. Is that true? When the high page and faulting rates were brought up, it was mentioned that high fault rates are not a big contributor to poor performance, and so are no longer as much of a concern as they had been in the past. Is that true? It makes sense to me that when the system spends more time moving data around than it does processing the data, the response time will take a hit.

Is there anything you can share about the comment that high fault rates are not as big a concern now as they were in the past?

Ray responded with an explanation of thrashing and page faulting on the AS/400:

Because the answer to these questions is not really straightforward, I have tried to share a little insight about thrashing and page faulting.

High faulting rates can mean thrashing and poor performance. They can also mean that some new task has just started running, none of its code or data was in memory, and everything had to be moved from disk to memory. They can also be “normal” for the particular system, time period and workload.

Thrashing typically occurs on a system when batch and interactive work share the same memory pool. Interactive work typically processes a small amount of information and then sends a response back to the user and then waits for a response. The key point here being that the interactive job has completed a small task and is waiting for the user. On the other hand, batch gets control of the CPU and is processing a file that can be millions of records long. A typical batch process doesn’t relinquish control of the CPU until it is forced to by a parameter called “time slice end.”

What can happen is that a batch program pulls hundreds or thousands of records into memory, starts to process the data, and then hits time slice end. Next, several interactive jobs with higher priority all get to run. These interactive jobs essentially flush memory so the data that the batch job was using has been completely paged out of system memory by the work of the interactive jobs. When the batch job gets the CPU back, it starts loading memory all over again, only to be kicked out at time slice end by more interactive jobs that have a higher priority. Repeat this cycle and this is what I call thrashing. If a batch job shares memory with other batch jobs or similar work, the thrashing typically does not occur or occurs much less frequently.
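
As a rough illustration of the cycle described above, here is a toy simulation. It is only a sketch: it assumes least-recently-used page replacement, which the article does not state, and the pool sizes, the 80-page batch working set and the 100 interactive pages per cycle are made-up numbers. The `Pool` class and `batch_faults` function are hypothetical names, not anything provided by the operating system.

```python
from collections import OrderedDict


class Pool:
    """Toy memory pool that evicts the least recently used page when full."""

    def __init__(self, size):
        self.size = size
        self.pages = OrderedDict()

    def touch(self, page):
        """Reference a page; return 1 if it had to be faulted in, else 0."""
        if page in self.pages:
            self.pages.move_to_end(page)  # mark as most recently used
            return 0
        self.pages[page] = True
        if len(self.pages) > self.size:
            self.pages.popitem(last=False)  # evict the least recently used page
        return 1


def batch_faults(batch_pool, interactive_pool, cycles=50):
    """Count batch page faults over repeated time slices."""
    faults = 0
    fresh_page = 0
    for _ in range(cycles):
        # Batch time slice: the job re-reads its 80-page working set.
        for n in range(80):
            faults += batch_pool.touch(("batch", n))
        # Time slice end: interactive jobs touch 100 pages not seen before.
        for _ in range(100):
            interactive_pool.touch(("interactive", fresh_page))
            fresh_page += 1
    return faults


shared = Pool(100)
shared_faults = batch_faults(shared, shared)        # one pool for everything
separate_faults = batch_faults(Pool(80), Pool(20))  # same 100 pages, split
print(shared_faults, separate_faults)
```

With these made-up numbers, the batch job in the shared pool faults on every page in every cycle (4000 faults), while the batch job with its own pool faults only on its first pass (80 faults). Real workloads will not be this clean, but the shape of the result is the point.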

I recommend that you look at the WRKSYSSTS screen when the system is busy and everyone is happy (i.e. no complaints about a slow system). Press F5 and F10 several times and take a few screen shots. This should be your baseline of good performance. Next, observe the WRKSYSSTS screen when many users are complaining of poor performance. Now you have some real information to work with. Hopefully what you see now will make sense. I think of the WRKSYSSTS screen as the system dashboard. With this information you can start to analyze system performance.
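
One way to put the two sets of screen shots side by side is sketched below. The numbers are hypothetical values typed in by hand for illustration, not data from any real system, and `fault_increase` is an invented helper name. The sketch assumes you have noted the DB and non-DB fault rates for each system pool from both screens.

```python
# Hypothetical values copied by hand from two WRKSYSSTS screens.
# Keys are system pool numbers; values are (DB faults, non-DB faults).
baseline = {1: (0.0, 0.4), 2: (1.2, 3.5), 3: (0.0, 0.1), 4: (2.0, 6.1)}
slow = {1: (0.0, 0.6), 2: (1.5, 4.0), 3: (0.0, 0.1), 4: (35.0, 88.4)}


def fault_increase(baseline, current):
    """Return {pool: rise in total fault rate}, largest rise first."""
    deltas = {}
    for pool, (db, non_db) in current.items():
        base_db, base_non_db = baseline.get(pool, (0.0, 0.0))
        deltas[pool] = round((db + non_db) - (base_db + base_non_db), 1)
    ordered = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered)


increase = fault_increase(baseline, slow)
worst_pool = next(iter(increase))  # pool whose fault rate rose the most
print(worst_pool, increase)
```

The pool at the top of the list is only a starting point for investigation; whether the increase matters still depends on the system and workload.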

An additional metric that I didn’t really address was the ratio of DB page faults to DB pages and the ratio of non-DB page faults to non-DB pages. At first glance, I would say that if the number of “pages” is at least a factor of 10 larger than the number of “page faults,” this could be normal.
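
The factor-of-10 rule of thumb can be written down as a small check. This is a minimal sketch: the function names are invented for illustration, the sample values are hypothetical, and the threshold is only the rough guideline from the paragraph above, not a hard limit.

```python
def pages_per_fault(pages, faults):
    """Return pages moved per page fault, or None when there were no faults."""
    if faults == 0:
        return None
    return pages / faults


def looks_normal(pages, faults, factor=10):
    """Apply the rough 'pages at least 10 times faults' rule of thumb."""
    ratio = pages_per_fault(pages, faults)
    return ratio is None or ratio >= factor


# Hypothetical readings for one pool: (pages, faults).
print(looks_normal(2400.0, 150.0))  # ratio 16
print(looks_normal(300.0, 150.0))   # ratio 2
```

The same check applies separately to the DB columns and the non-DB columns for each pool.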

The age-old answer of “it depends” comes into play here. As the system performs more work, the value of the parameter “pages” increases. This is a very good indicator of the amount of data being read from disk to supply transactions with the requested data. Big numbers in the Pages column are a good indication.

Regarding the question: When the high page and faulting rates were brought up, it was mentioned that high fault rates are not a big contributor to poor performance, and so are no longer a concern as much as they had been in the past. Is that true?

Hopefully you now know the answer; however, I did want to emphasize one point: generally, high faulting rates (high being a relative number) are a big contributor to poor performance. The only reason that high faulting rates are not as big a concern now as they were in the past is that fast machines with lots of brute force can hide horrible performance. New machines have faster disks, faster IOPs/IOAs, faster CPUs and often more memory. Because of the reduced cost of hardware performance, it appears to me that system performance tuning has become a lost art. Both commercial software programs and technicians with knowledge of performance can dramatically improve system performance in some situations with no additional hardware. However, both software and human resources generally are more expensive than more hardware.

Because every machine is truly unique, and every workload and number of users at any given time is also unique, only you can observe what constitutes good performance on your system.

The reader then asked a follow-up question regarding pool data:

When I enter the WRKSBSD command for a particular subsystem, I enter an option #5 for that subsystem to display its parameters. Then I enter option #2 for pool definitions. That screen lists the POOL ID, STORAGE SIZE, and ACTIVITY LEVEL.

Then I enter a WRKSHRPOOL command. That screen lists POOL as the left column, but also has a POOL ID column. I need to find out: What is the relationship of the POOL and POOL ID columns on the WRKSHRPOOL command to the POOL ID column on the WRKSBSD POOL DEFINITION screen? Does the POOL ID column on the WRKSBSD display refer to the POOL ID column on the WRKSHRPOOL command?

I believe what I need to figure out is:

  1. The size of each pool.
  2. Whether that size can be automatically changed or always stays the same.
  3. What subsystems use each pool. In other words, I need to see each pool and what subsystems [and therefore jobs], feed into that pool.

I would think that if I get the total of the DEFINED SIZE column on the WRKSHRPOOL command, it would equal the MAIN STORAGE SIZE amount. On my screen, it does not. In fact, there is a difference of 2433 M. Is that difference normal? Or does the difference represent memory we have physically installed, but not used for anything?

Thank you very much for all of your time in this matter. I realize that faster processors, memory, and disk hide performance issues. But if the performance issues were addressed, we would really see throughput increase without additional hardware expense.

Ray briefly explained how to understand pool numbers on the AS/400:

Pool numbers are one of the most confusing issues when dealing with memory on i. I have added a few notes in his questions and a couple of screen shots. This can get pretty deep pretty quickly for an email. See the two screen shots below. Looking at them together usually helps put the pieces together.

In the WRKSYSSTS screen shot, note the “Sys Pool” numbers 1 – 5. System Pool numbers 3 and greater are assigned arbitrarily when the system IPLs, according to which subsystem starts first. Note that on the second screen shot, of the WRKSBS screen, you see the subsystem pools 1-10.

Performance tuning on AS/400 screenshot

Performance tuning on AS/400 pool data

QINTER and QSPOOL come defined with the OS. Separating batch and interactive is a manual process.

Rule #1 of tuning – all subsystems should have System Pool #2 defined for the first subsystem pool since that is where the “task dispatcher” runs by definition (you can’t change it). Nothing gets done until the task dispatcher dispatches the work.

So you always want pools 1 and 2 to be running well. If they are not running well, no one is running well.

A different reader submitted this question regarding non-DB faults more than 10:

I was interested in the section stating that non-DB faults will all be less than 10.0. Our system regularly sees a much higher figure. Following on from the explanation given later in the article, is the only fix for this adding more memory, or could there be a problem with the way an application has been coded?

Ray explained a quick fix and expanded on page faulting rates and what they mean:

The quickest way (not the only way) to fix this is to add memory. I believe that I discussed moving memory from other pools and changing the Max Active value later in the same article. Adjusting these numbers can often improve performance. Caveat: if the system value QPFRADJ is turned on, all of the changes you just made will be unmade when the performance tuner deems it necessary.

However, we need to first back up and ask: are you experiencing performance issues? A page faulting rate above 10 may provide superb performance for your machine. It all depends on the CPU speed, the amount of memory, the speed of disk access, the workload, and often the network connections.

On my small (P05) system performance starts to slow down when my page faulting rates go above 10. This is a guideline that has worked well for me as a good starting point when analyzing system performance.

6  Comments on this Post

  • Leah Rosin
    The earlier article talked about RAISING the Max Act level until you get Wait->Inel entries. I was taught, and my system seems to confirm, that Wait->Inel occurs because the Max Act is too LOW a number. In database class years ago, they said that one downside of setting the Max Act too high is that the DB optimizer looks at that to divide up the memory: if you have Max Act set to 500 and only really need 200, some of the database optimization gets written to disk, because it figures that there COULD be 500 jobs competing for the memory. Since then, I've heard that this is no longer the case and there really isn't a downside to setting the Max Act too high, although I suspect that might be another case of incredibly fast hardware compensating for less-than-optimal tuning.
  • Leah Rosin
    If you use Ops Navigator, you can also set the tuning priority; typically Machine would be 1, Interactive 2, and spool/share 3.
  • Leah Rosin
    This article makes a lot of sense to me. I'm currently using an external storage system (DS8100) on my i570-MMA. We recently implemented Global Mirroring replication (managed by TPC-R) and immediately noticed a hit on disk response time on the i5/OS partition. Batch and interactive jobs are now running 5 or more times longer than usual. The performance gets better when I suspend the replication process. I ran a performance data report on disk utilization and found that disk write wait time is very high during Global Copy formation. I'm not sure if this is something that i5/OS can fine-tune. CPU and memory are not in question; the system has more than enough of both. Any feedback is greatly appreciated. Thanks a lot. JB
  • TomLiotta
    Just ran across this today. Haven't looked in detail, but scanning had me stopped at this statement: “Rule #1 of tuning – all subsystems should have System Pool #2 defined for the first subsystem pool since that is where the ‘task dispatcher’ runs by definition (you can’t change it).” Where did that idea come from? It's not true, and I don't know how many versions (not to mention releases) have gone by since it was true, if ever. I often have *SBSDs without assigning *BASE. No problem. I think a bit more detail is needed beyond simply saying "(you can’t change it)". Perhaps it is worded to allow easy misinterpretation. Tom
  • TomLiotta
    Whoa! Never mind. I responded to a blog post rather than to the article that the post was about. Tom
  • TomLiotta
    Okay... never mind that "Never mind." After reading the original article, I see that it directs me back to here... (loop v.: see 'loop') Tom
