DB2/400 SQL Sub Query Performance?

pts.
Tags:
Data warehousing/Business intelligence
Database Management Systems
DB2
DB2 Universal Database
UDB for iSeries/i5
Here is the problem. Table INVEHH is a (Events header table) with about 500 records and INVEHD (Event Details) with about 22 million records. These tables are on a DB2/400 database. 1) Original Query that takes forever: select EVENT, INUMBR, ISTORE, EVHYY, EVHMM, EVHDD from INVEHD where EVENT in (Select EVENT from INVEHH where (EVHYY*10000 + EVHMM*100 + EVHDD)= 20040617) 2) If I run the subquery first then I get the following. Select EVENT from INVEHH where (EVHYY*10000 + EVHMM*100 + EVHDD)= 20040617 Event Number 19 967 970 972 986 990 991 993 994 998 ******** End of data ******** 3) And then if I re-run the first query with the results from the subquery like this, I get instentaneous results. select EVENT, INUMBR, ISTORE, EVHYY, EVHMM, EVHDD from INVEHD where EVENT in ( 19, 967, 970, 972, 986, 990, 991, 993, 994, 998) As far as I know query one should perform just as well as query 3. But one takes forever and 3 has instentaneous response. Let me know, if you can tell me what is it that I am doing wrong. I had discussed this with my colleagues and we came up with some alternatives that executes pretty fast. But theoretically query one should run just as fast. Ajish

Answer Wiki

Thanks. We'll let you know when a new response is added.

Tip:

The mentioned sub part:
(EVHYY*10000 + EVHMM*100 + EVHDD)= 20040617

Can be set up in a more optimised way via:
EVHYY = 2004 and EVHMM = 06 and EVHDD = 17

In case the first condition is not true, rest of condition(s) is/are not tested anymore…..

Another option to optimise is to introduce a logical on the fields EVHYY, EVHMM and EVHDD for table INVEHH

=======================================================================

The problem is the subquery. It must be run for every row in the table since the rows in INVEHH can change at any time. Any change to INVEHH by any other user or job in the system could change the result set. Are the tables journaled? Do jobs on the system run under commitment control?

If you want the subquery to return a static result, you should be able to use a Common Table Expression:
<pre>
WITH HH as (Select EVENT
from INVEHH
where (EVHYY*10000 + EVHMM*100 + EVHDD)=
20040617 )
select EVENT, INUMBR, ISTORE, EVHYY, EVHMM, EVHDD
from INVEHD
where EVENT in (select EVENT from HH)
</pre>
The CTE should produce a result set that the full-select can rely upon.

Of course, the date calculation should also go. There seems to be no point in executing the multiplication and summation of date values when they could be directly compared.

Tom

—————————– kccrosser
The subquery is indeed the problem. Generally, when you do a computation in the Where clause, you prevent the query optimizer from using available indexes. Direct comparisons with the individual fields is MUCH more efficient.

However, you also need to make sure you have an effective index on the table.

In this case, the optimized index would be:
create unique index ui_myindex on INVEHH ( EVHDD, EVHMM, EVHYY, EVENT );

Then, when you run your subquery, the lookup in the index will be very efficient. This kind of “covering” index allows the query engine to completely avoid doing a fetch against the real table – when it finds an index record with the corresponding YY/MM/DD values, it has the EVENT data in the index itself.

Why do I suggest the index column order be DD, then MM, then YY, then EVENT?

Depending on the database engine and version, the order of the columns in the index can be quite important. In this case, the DD values will range from 01 to 31, the MM values from 01 to 12, and the YY values probably from 2000 to 2009. Thus, the DD values are the most “discriminant” (the widest range of values) and the YY values the least “discriminant”. Ordering the columns from most to least discriminant can significantly boost performance, as the query engine is dividing up the available values “quicker” by using the more discriminant values first.

I would bet that recoding the subquery and adding the above index will solve your performance problems.

Discuss This Question: 2  Replies

 
There was an error processing your information. Please try again later.
Thanks. We'll let you know when a new response is added.
Send me notifications when members answer or reply to this question.

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
  • Apothen
    The problem is not the subquery, that's running over a table that has only about 1000 records. My question really is about how DB2/400 optimizes the query. This is how I think the query optimizer should do: Step 1) Run the subquery over the table with 500 records - should run really fast. Step 2) Now run the main query using the result set - The table is indeed indexed by the field (element) in the where clouse. What seems to be happenning is: DB2/400 runs the subquery for each one of the records in the big table with 22 million records. I know in SQL Server and Oracle, we can look at the query execution path and make changes to it, if RDBMS did not pick the best route. So my question really is, is there any such facility in DB2/400? or Does DB2/400 expect the queries to be structured in some other fashion that is specific to the platform?
    0 pointsBadges:
    report
  • BigKat
    SQL on the iSeries seems to have a couple of quirks from other platforms. It seems that in a statement using "where EVENT in (Select EVENT from INVEHH..." syntax, the "in" select is run for EVERY record in the outer statement. Try using an inner join syntax. I've found it to help most of the time! select EVENT, INUMBR, ISTORE, EVHYY, EVHMM, EVHDD from INVEHD as a inner join (Select distinct EVENT from INVEHH where (EVHYY*10000 + EVHMM*100 + EVHDD)= 20040617) as b on a.EVENT = b.event even better is to use EVHYY = 2004 and EVHMM = 06 and EVHDD = 17 instead of (EVHYY*10000 + EVHMM*100 + EVHDD)= 20040617 because SQL can make use of a logical over the date fields if they exist if you can say that there can NEVER be duplicate EVENTs selected from the INVEHH file, you can eliminate the distinct from the statement. Not knowing the data relationships for sure I added the distinct just in case.
    8,100 pointsBadges:
    report

Forgot Password

No problem! Submit your e-mail address below. We'll send you an e-mail containing your password.

Your password has been sent to:

To follow this tag...

There was an error processing your information. Please try again later.

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy

Thanks! We'll email you when relevant content is added and updated.

Following