Posted by: Leah Rosin
Operating systems, System management
A recently published tip on performance tuning the AS/400 was created to respond to Search400.com reader feedback. Raymond Johnson answered a series of questions submitted by a reader, and since publication, the reader has provided a few more questions that Johnson has kindly answered.
The article is fantastic, to say the least. Our system has fault rates in the hundreds and page rates in the thousands. I assume that this indicates that the system is thrashing and it is spending more time moving data in and out of storage than it does processing it. Is that true? When the high page and faulting rates were brought up, it was mentioned that high fault rates are not a big contributor to poor performance, and so are no longer a concern as much as they had been in the past. Is that true? It makes sense to me when the system spends more time moving data around than it does processing the data, then of course, the response time will take a hit.
Is there anything you can share about the comment that high fault rates are not as big a concern now as they were in the past?
Ray responded with an explanation of thrashing and page faulting on the AS/400:
Because the answer to the questions was not really straight forwarded, I have tried to share a little insight about thrashing and page faulting.
High faulting rates can mean thrashing and poor performance. It can also mean that some new task has just started running and none of the code or data was in memory and had to be moved from disk to memory. It can also be “normal”for the particular system, time period and workload.
Thrashing typically occurs on a system when batch and interactive work share the same memory pool. Interactive work typically processes a small amount of information and then sends a response back to the user and then waits for a response. The key point here being that the interactive job has completed a small task and is waiting for the user. On the other hand, batch gets control of the CPU and is processing a file that can be millions of records long. A typical batch process doesn’t relinquish control of the CPU until it is forced to by a parameter called “time slice end.”
What can happen is that a batch program pulls hundreds or thousands of records into memory, starts to process the data, and then hits time slice end. Next, several interactive jobs with higher priority all get to run. These interactive jobs essentially flush memory so the data that the batch job was using has been completely paged out of system memory by the work of the interactive jobs. When the batch job gets the CPU back, it starts loading memory all over again, only to be kicked out at time slice end by more interactive jobs that have a higher priority. Repeat this cycle and this is what I call thrashing. If a batch job shares memory with other batch jobs or similar work, the thrashing typically does not occur or occurs much less frequently.
I recommend that you look at the WRKSYSSTS screen when the system is busy and everyone is happy (i.e. no complaints about a slow system). Press F5 and F10 several times and take a few screen shots. This should be your baseline of good performance. Next, observe the WRKSYSSTS screen when many users are complaining of poor performance. Now you have some real information to work with. Hopefully what you see now will make sense. I think of the WRKSYSSTS screen as the system dashboard. With this information you can start to analyze system performance.
An additional metric that I didn’t really address was the ratio of DB page faults to DB pages and the ratio of non-DB page faults to non-DB pages. At first glance, I would say that if the number of “pages” is at least a factor of 10 larger than the number of “page faults,” this could be normal.
The age old answer of “it depends” comes into play here. As the system performs more work, the value of the parameter “pages” increases. This is a very good indicator of the amount of data being read from disk to supply transactions with the requested data. Big numbers in the Pages column is a good indication.
Regarding the question: When the high page and faulting rates were brought up, it was mentioned that high fault rates are not a big contributor to poor performance, and so are no longer a concern as much as they had been in the past. Is that true?
Hopefully you now know the answer, however I did want to emphasize one point – generally high faulting rates (high being a relative number) are a big contributor to poor performance. The only reason that high faulting rates are not as big of a concern now as they were in the past is that fast machines with lots of brute force can hide horrible performance. New machines have faster disks, faster IOP/IOA’s, faster CPU’s and often more memory. Because of the reduced cost of hardware performance it appears to me that system performance tuning has become a lost art. Both commercial software programs and technicians with knowledge of performance can dramatically improve system performance in some situations with no additional hardware. However both software and human resources generally are more expensive than more hardware.
Because every machine is truly unique, and every workload and number of users at any given time is also unique, only you can observe what constitutes good performance on your system.
The reader then asked a follow-up question regarding pool data:
When I enter the WRKSBSD command for a particular subsystem, I enter an option #5 for that subsystem to display its parameters. Then I enter option #2 for pool definitions. That screen lists the POOL ID, STORAGE SIZE, and ACTIVITY LEVEL.
Then I enter a WRKSHRPOOL command. That screen lists POOL as the left column, but also has a POOL ID column. I need to find out: What is the relationship of the POOL and POOL ID columns on the WRKSHRPOOL command, to the POOL ID column on the WRKSBSD POOL DEFINITION screen? Does the POOL ID columns on the WRKSBSD display referring to the POOL ID column on the WRKSHRPOOL command?
I believe what I need to figure out is:
- The size of each pool.
- Whether that size can be automatically changed or always stays the same.
- What subsystems use each pool. In other words, I need to see each pool and what subsystems [and therefore jobs], feed into that pool.
I would think that if I get the total of the DEFINED SIZE column on the WRKSHRPOOL command, it would equal the MAIN STORAGE SIZE amount. On my screen, it does not. In fact, there is a difference of 2433 M. Is that difference normal? Or does the difference represent memory we have physically installed, but not used for anything?
Thank you very much for all of your time in this matter. I realize that faster processors, memory, and disk, hide performance issues. But, if the performance issues were addressed, we would really see throughput increase without additional hardware expense
Ray’s briefly explained how to understand pool numbers on the AS/400:
Pool numbers are one of the most confusing issues when dealing with memory on i. I have added a few notes in his questions and a couple of screen shots. This can get pretty deep pretty quickly for an email. See the two screen shots below. Looking at them together usually helps put the pieces together.
In the WRKSYSSTS screen shot note that the “Sys Pool” numbers 1 – 5. System Pool numbers 3 and greater are assigned arbitrarily when the system IPL’s by which subsystem starts first. Note on the second screen shot of WRKSBS screen that you see the subsystem pools 1-10.
QINTER and QSPOOL come defined with the OS. Separating batch and interactive is a manual process.
Rule #1 of tuning – all subsystems should have System Pool #2 defined for the first subsystem pool since that is where the “task dispatcher” runs by definition (you can’t change it). Nothing gets done until the task dispatcher dispatches the work.
So you always want pool 1 and 2 to be running well. If they are not running well, no one is running well.
A different reader submitted this question regarding non-DB faults more than 10:
I was interested in the section regarding non-DB faults will all be less than 10.0. Our system regularly see the a much higher figure. Following on from the explanation given later in the article is the only fix for this adding more memory or is it a case that there could be a problem with the way an application has been coded.
Ray explained a quick fix and expands on page faulting rates and what they mean:
The quickest way (not the only way) to fix is to add memory. I believe I that I discussed moving memory from other pools and changing the Max active value later in the same article. Adjusting these numbers can often improve performance. Caveat – if the System Value QPFRADJ is turned on, all of the changes you just made will be unmade when the Performance tuner deems necessary.
However we need to first backup and ask – are you experiencing performance issues. A page faulting rate above 10 may provide superb performance for your machine. It all depends on the CPU speed, the amount of memory, the speed of disk access, the workload, and often the network connections.
On my small (P05) system performance starts to slow down when my page faulting rates go above 10. This is a guideline that has worked well for me as a good starting point when analyzing system performance.