Strange Oracle 9i behaviour

765 pts.
Tags:
Oracle 9i
Oracle performance
Solaris 10
SQL Plus
We are seeing some very strange behaviour in the performance of a query on our database. We have run the same query on the actual database server (s1) using SQL*Plus, and from a server (s2) one hop away (to minimise the network factor). The exact same queries, without any bind variables, are kicked off at exactly the same time. The query run on s1 finishes in 30-40 minutes, whilst the query run from s2 takes 4 hours to finish. Even when the queries are started at different times during the day, in any order, or only one at a time, the s1 query always takes less than an hour, whilst the s2 query always takes close to 4 hours. The explain plans for both sessions show exactly the same paths, however they have different costs (I believe this is due to a logon trigger that forces optimizer_index_cost_adj = 10 when the s2 session is created). The OS user used to log in to s1 does not have any specific resource rights.
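
The value actually in effect for each session can be checked from SQL*Plus (a quick sketch - v$parameter reflects session-level settings, so running this in both the s1 and s2 sessions should show whether the logon trigger has fired):

select name, value
  from v$parameter
 where name = 'optimizer_index_cost_adj';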

It has been noted that there are a lot of waits between reads for client side queries, which don't seem to be happening on the server side query. The session statistics don't really tell us much, other than that it is taking a lot longer to run.

Is this behaviour normal for server vs client queries? Does it make a difference that the costs are higher in the explain plan generated, even though the steps are exactly the same? If not, what else can we look at?

Software/Hardware used:
Oracle 9i, Solaris, SQL*Plus

Answer Wiki


"1) Are the servers identical in hardware, in terms of CPU and memory allocation for your SGA, DB buffer cache, shared pool, etc.? Are you using Automatic Memory Management, by the way?"
We have pga_aggregate_target set to 3758096384 (3.5 GB) and workarea_size_policy set to AUTO.
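
A quick way to check whether that target is actually adequate while the query runs is v$pgastat (a sketch - this view came in with automatic PGA management in 9i; a non-zero 'over allocation count' would suggest the target is too small for the workload):

select name, value, unit
  from v$pgastat;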

Discuss This Question: 21  Replies

  • carlosdl
    "The explain plans for both sessions show the exact same paths, however, they have different costs (I believe this is due to a logon trigger that is forcing the optimizer_index_cost_adj = 10 when the s2 session is created)" I agree, that is most likely the reason. Have you tried temporarily removing this trigger to confirm that ? However, since the execution plan being chosen by the optimizer is the same, I don't think the cost is being a factor in the query's execution time. "Is this behaviour normal for server vs client queries?" I don't think so. Have you tried from different clients ? There is most likely a switch between s1 and s2. I would try connecting the server to a different port on the switch, and I would also try changing s1's patch cord. Even, if possible, I would try connecting a client directly to the server (with a crossover cable) without any device in between to completely take the switch out of the picture. I would also make sure s1 NIC's card is properly configured and would perform tests to make sure it is working as expected. Strange behaviour indeed.
    69,510 pointsBadges:
    report
  • carlosdl
    Rethinking this... Depending on the amount of data returned by the query, and taking into consideration that it is just a single query and not a set of many database operations, even if there are network problems the performance impact may not be as huge as in your case. Maybe adding an additional condition to return just one row (rownum, for example) could help rule out any network impact from returning the data, and could offer some more ideas as to where the problem lies.
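    For example, something along these lines (just a sketch, with a made-up table name - wrapping the original statement so only one row ever has to travel back over the network):
    select *
      from (select t.* from some_big_table t where t.some_filter = 'X')  -- hypothetical original query
     where rownum = 1;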
    69,510 pointsBadges:
    report
  • Darryn
    Yes, we have tried from different clients, all with the same behaviour. However, we did run it from the same LAN that the Oracle server is on, and that ran in the same time as on the server itself, so it does look like a network issue. I'm not 100% sure of the network setup, but it looks like we have a VLAN whereby the 1 hop between the servers does go through a switch, but I still don't see how this can be having such a big impact. A tnsping returns 10msec, and our network teams cannot see any issues with either networks, or the switch. I have tried limiting the rows by using rownum < 2, but this hasn't made any difference (other than adding an extra step in the explain plan). We are getting all the cabling, switch ports and NICs checked on Monday, as well as a sniffer put on the switch, so hopefully that will tell us something. I'm not entirely convinced though. Thanks for your suggestions and help.
    765 pointsBadges:
    report
  • Kccrosser
    I wonder if the query from the other computer is being treated as an "ad-hoc" query and thus getting lower priority. There is a good article on Oracle Database Resource Manager at: http://www.databasejournal.com/features/oracle/article.php/3296961/Oracle-Database-Resource-Manager-Part-1Overview.htm You may be encountering a situation where the ORM is dropping the priority on the "remote" query, causing it to run much longer. Following the info in the article above, you should be able to try forcing the priority to be the same for both queries and see if this eliminates much of the additional time.
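    For what it's worth, a quick way to see whether the two sessions land in different consumer groups is the resource_consumer_group column of v$session (a sketch; the username filter is just an example):
    select sid, username, program, resource_consumer_group
      from v$session
     where username = '<your_user>';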
    3,830 pointsBadges:
    report
  • Darryn
    Cables, switches and NICs all seem to be fine, still waiting for sniffers to be set up though. We don't have DRM enabled.
    765 pointsBadges:
    report
  • Kccrosser
    I think the problem is the use of the "optimizer_index_cost_adj". Setting this value to 10 tells Oracle to use the indexes - even when a full-table scan would work better. If your query takes 30-40 minutes locally, it may be doing some full table scans already. Forcing the query optimizer to use the indexes may actually be causing the additional overhead. The fact that the "rownum" condition didn't reduce the time says that the problem is in the query execution at the server - not a network problem. The "optimizer_index_cost_adj" should only be set low (like 10) for OLTP type transactions, and any query that takes 30-40 minutes would not be a candidate. Try setting this value back to 100 (default) from the client and retest. Burleson has a good discussion on this parameter at: http://www.dba-oracle.com/oracle_tips_cost_adj.htm
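    In SQL*Plus that is simply:
    alter session set optimizer_index_cost_adj = 100;
    run in the client session before executing the query, so it overrides whatever the logon trigger set.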
    3,830 pointsBadges:
    report
  • carlosdl
    But according to Darryn, the execution plan looks the same.
    69,510 pointsBadges:
    report
  • Kccrosser
    He says the plans show the same paths "... however, they have different costs...". Setting the optimizer_index_cost_adj value to 10 effectively forces the optimizer to try to use indexes even when a full table scan may be a lot more efficient. Essentially, the cost adjustment from the logon trigger seems to be trying to treat transactions from that source as "FIRST_ROWS". That is appropriate for rapid-response OLTP transactions, but counter-productive for long-running large queries. I would first try forcing the optimizer_index_cost_adj back to the default value of 100 and see if that fixes the problem. The other thing I would be looking at is the PGA/SGA areas and their statistics. If they are setting optimizer parameters in the login scripts, are there other constraints being applied that might reduce the available PGA or SGA resources? Even a fairly slight reduction in available PGA space may result in a lot of additional memory paging overhead in a large query. If the optimizer_index_cost_adj change doesn't fix the problem, try querying the following while the process is running (from each source) and see if there are significant differences:
    select
       program, pga_used_mem, pga_alloc_mem, pga_max_mem 
    from v$process;
    
    Of course, looking at all the performance stats is always a good starting place. In the absence of a good online analysis system, I have used the following to look at the query activity:
    select
     au.username,
     vs.executions as Exes,
     vs.disk_reads as "Disk",
     trunc(vs.disk_reads / greatest(vs.executions,1)) as "Disk/Exe",
     vs.buffer_gets as "Gets",
     trunc(vs.buffer_gets / greatest(vs.executions,1)) as "Gets/Exe",
     vs.sql_text
    from
     sys.v_$sql vs
       left outer join sys.all_users au on (au.user_id = vs.parsing_user_id)
    where
     (vs.buffer_gets / greatest(vs.executions,1)) > 1000
    order by 4 desc;
    
    Caveat - this code worked in Oracle 8 through 10. I haven't run it on newer systems, so there may be some changes needed. This lets you see the SQL queries that have the highest cost. Obviously, you need to tune this to look at different factors. I commonly look for the queries with the highest disk I/O costs first, and then the highest buffer-get costs.
    3,830 pointsBadges:
    report
  • carlosdl
    "Setting the optimizer_index_cost_adj value to 10 effectively forces the optimizer to try to use indexes even when a full table scan may be a lot more efficient." If that parameter is in fact making the optimizer opt for the indexes (which I know is its goal), I would expect that to be reflected in the execution plan. I mean, if the optimizer choose an access path when running the query that is different from the one it told you it was going to use, then looking at execution plans wouldn't be of much help.
    69,510 pointsBadges:
    report
  • Kccrosser
    From Tom Kyte - "Now, if I run the query and ask Oracle "what plan did you use", we get to see 'reality' - not an explain plan (explain plans are many times not representative of reality!) ". I treat explain plans as generally useful guidance - they can be very useful for finding bad/missing indexes when you see "table full scan". However, during execution there is really no guarantee that Oracle will actually execute the query as per the explain plan. For example, if the explain plan assumes that a particular portion of the query will return 1000 rows, it may decide to use those in indexed queries against the next table and show that in both explain plans. However, during execution Oracle finds that that subquery actually returns 100000 rows. When this happens, Oracle may change the subsequent processing. This is the kind of thing where the subsequent query might perform better as a table full scan instead of trying to use indexes, but if the optimizer_index_cost_adj has biased the optimizer to use indexes, the client may be forcing Oracle to use the suboptimal approach.
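    On 9i, one way to see the plan the server actually used (rather than what EXPLAIN PLAN predicts) is to query v$sql_plan for the cursor once it is running - a rough sketch, with the sql_text filter being just an illustrative way to find the cursor:
    select p.id, p.parent_id, p.operation, p.options, p.object_name, p.cost
      from v$sql s, v$sql_plan p
     where p.address      = s.address
       and p.hash_value   = s.hash_value
       and p.child_number = s.child_number
       and s.sql_text like '%<something unique from your query>%'
     order by p.id;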
    3,830 pointsBadges:
    report
  • carlosdl
    That is interesting, and new to me. Thanks. Taking that into account, I agree that the optimizer_index_cost_adj parameter may be causing the problem.
    69,510 pointsBadges:
    report
  • Paladine Solamnia
    Hi Darryn, I think the easiest way to troubleshoot is to find out what the main difference is between your two servers, S1 and S2. As kccrosser suggested, try setting your optimizer_index_cost_adj back to the default value of 100. Ceteris paribus, you should probably ask yourself the following:
    1) Are the servers identical in hardware, in terms of CPU and memory allocation for your SGA, DB buffer cache, shared pool, etc.? Are you using Automatic Memory Management, by the way?
    2) You said the explain plans are the same except for the costs. Which component in the explain plan differs with respect to the cost? Is it the index, the join, the full table scans or the index scans?
    3) Is the amount of data in the tables queried identical? This has a big impact on how the optimizer plans its execution. For instance, querying a small table yields better performance with a full table scan, while an index scan is ideal if only around 5% of the table rows are to be returned.
    4) I cannot emphasise this enough, but it is imperative that the statistics for the tables in the query, along with any related indexes, are gathered correctly and are of high quality. I've seen query performance degrade badly because the statistics were stale and badly skewed. At the very least, try querying dba_tables and dba_segments with this query I wrote, to ascertain when the statistics were last updated:
    set pagesize 1000
    set linesize 300
    spool ana_progress.log
    alter session set NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS';
    select a.owner, a.table_name, b.partition_name, a.last_analyzed
      from dba_tables a, dba_segments b
     where a.table_name = b.segment_name
       and a.owner = b.owner
       and a.owner in ('<owner>')
     order by 1, 2;
    spool off
    Similarly, query dba_indexes and its last_analyzed column to see any differences. If there are any discrepancies, run the following against the tables in your query:
    execute dbms_stats.gather_table_stats(tabname=>'<table_name>', ownname=>'<owner>', cascade=>TRUE);
    I hope this helps.
    170 pointsBadges:
    report
  • Darryn
    This is now very weird (16/03). We tested the cost adjustment at 100 last week and it didn't improve the performance of the query. However, I have retested it this week (yesterday and today), and the query ran in the normal time (40 mins). The execution plan has only changed slightly, with one index from a table being used in place of another (the plan for adjustment = 10 is still the same). Our stats are usually updated over the weekend (I have checked, and they were updated this weekend), so the statistics should not have been stale (although I know plans can change with a stats update).
    (Update 17/03) - This is certainly strange, and I don't know if I am going mad. I set off the query with both optimizer adj = 100 and 10 this morning (two SQL*Plus sessions). Both explain plans are exactly the same in all aspects, cost and plan, both the same as the previous optimizer adj = 100 plan. I thought I might have made a mistake, so I started again (killed the sessions, closed SQL*Plus), but now both explain plans have reverted to the cost adj = 10 plan. I double-checked this by opening another SQL*Plus window and setting cost adj = 100, and it is still the same as the 10. Could it be that we have a problem with our optimizer?
    "Setting the optimizer_index_cost_adj value to 10 effectively forces the optimizer to try to use indexes even when a full table scan may be a lot more efficient." Yes, we were trying to force the optimizer to use the indexes, as we noticed that our indexes weren't being used when they should be (that is another, although related, problem). When we used hints to force the indexes, the queries ran far quicker than the ones that did full table scans, which is why we set the index cost adjustment.
    "I have used the following to look at the query activity" As this is a reporting database, both the disk and buffer fetches are high (running the query returns almost 900 rows). If I run it for just the query in question, with the two runs started 1.5 minutes apart (both with the same explain plan), I get the following:
          EXES    DISK   DISK/EXE      GETS   GETS/EXE
    1)       2  115198      57599   7468269    3734134
    2)       4  165601      41400   8963173    2240793
    I think we need a new database...
    765 pointsBadges:
    report
  • Darryn
    Hi Paladine,
    1) Are the servers identical in hardware, in terms of CPU and memory allocation for SGA, DB buffer cache, shared pool, etc.? Are you using Automatic Memory Management? - I'm not sure, I will need to check.
    2) Which component in the explain plan is different with respect to the cost? - E.g.:
    Adj 100: Table access by local index rowid - cost 226,644; cpu_cost 9,219,635,858; io_cost 225,960
             Index range scan - cost 1,517; cpu_cost 216,554,762; io_cost 1,500
    Adj 10:  Table access by local index rowid - cost 78,757; cpu_cost 921,963,586; io_cost 78,689
             Index range scan - cost 1,517; cpu_cost 216,554,762; io_cost 1,500
    3) Is the amount of data in the tables queried identical? - Yes, they are all fairly large tables (375 mil, 113 mil, 84 mil, 3 mil rows). I noticed the change in performance yesterday when a different index was chosen on the 375 mil table (there are only 2 indexes on that table). The other indexes are on direct table-to-table joins, so no full table scans are done in the query.
    4) It is imperative that the statistics for the tables and related indexes are of high quality - Our stats are updated every weekend, and I have checked that they were kept up to date throughout this process.
    Note: We are experiencing poor performance throughout the database; we are just using this query as a test.
    765 pointsBadges:
    report
  • Darryn
    Sorry, posted in the wrong area: 1) Are the servers hardware in terms of CPU and memory allocation for your various SGA, DB buffer cache, shared pool, etc identical ? Are you using Automatic Memory Management, btw ? We have the pga_aggregate_target set to 3758096384 and the workarea_size_policy set to AUTO.
    765 pointsBadges:
    report
  • carlosdl
    Paladine Solamnia - regarding "1) Are the servers identical in hardware..." and "3) Is the amount of data in the tables queried identical?": the same query is being run against the same database (on the same database server), just from different client machines.
    69,510 pointsBadges:
    report
  • Paladine Solamnia
    "We are experiencing poor performance throughout the database, we are just using this query as a test." Hmm, any changes to the database server recently, such as a schema migration, parameter changes or even hardware changes? Things don't go wrong all of a sudden if this server has been running fine for the last few weeks/months. Are there any batch jobs that run overnight doing lots of DML operations such as inserts, deletes and updates? Any reason why this query in particular? Is it because it appeared in the top 10 SQL statements in your Statspack reports? By the way, I think the buffer gets are really off the charts - that's why the query is so slow.
    "I thought I might have made a mistake, so I started again (killed sessions, closed sql+)" Personally, I think if the problem resides purely in your database server, there is no need to test the queries both locally on the database server and on another machine a hop away (thanks Carlosdl for pointing that out). Just run all the SQL*Plus sessions locally on the database server. Is this machine a critical production database? If not, you might want to do the following to get a "cleaner" reading for the explain plan, instead of just killing sessions:
    SQL> alter system flush shared_pool;
    SQL> alter system checkpoint;
    SQL> alter session set events = 'immediate trace name flush_cache';
    I don't really fancy SQL tuning since you said it's poor performance throughout the database, but having said that, if this query is important, perhaps tuning it as an interim solution is better than no solution at all.
    "The execution plan has only slightly changed though, with 1 index from a table being used in place of another" I'm of the opinion you need to take a look at the clustering factor of the indexes in question. On top of that, a back-to-basics review of the indexes should be based on the following questions:
    1) How many distinct values are there in the indexed column? If the number of distinct values is very low and the index is a normal B-tree index, it would make more sense to use a bitmap index instead, on columns such as Gender (Male/Female).
    2) Are there lots of NULL values in the indexed column? If there are, chances are that index won't be used no matter how hard you try, as a B-tree index does not store entries where the key is entirely NULL, so the optimizer cannot use it to find those rows. Try a function-based index on nvl(<NULL column>, <some value>) to convert the NULL to something instead - that is the silver bullet Burleson always preaches.
    3) Are your indexes composite indexes, i.e. indexes placed on several columns? If they are, the order of the columns in the composite index plays a part in how attractive the index is to the optimizer.
    The only other thing I can think of is to rebuild all the indexes. That would certainly help, but given that the data in the tables is so huge, the rebuild might take a lot of time.
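    To put numbers behind the clustering factor and selectivity questions above, a quick dictionary query like this helps (a sketch - substitute the owner and table name):
    select index_name, clustering_factor, distinct_keys,
           num_rows, leaf_blocks, last_analyzed
      from dba_indexes
     where owner = '<owner>'
       and table_name = '<table_name>';
    As a rough rule of thumb, a clustering_factor close to the number of table blocks favours index range scans, while one close to num_rows means each index entry tends to hit a different table block (lots of scattered single-block reads).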
    170 pointsBadges:
    report
  • Kccrosser
    "poor performance throughout the database" - this is a red flag for me. I have often found that when you are having poor performance throughout, there are LOTS of underlying problems with tables, indexes, queries, and system configurations. This leads to a problem in tuning, since you might fix one problem and not see an improvement in system performance, so it seems like your fix didn't work. Often the system is really overloaded/choked - if it is trying to run with 130% load, and you take out a 10% problem, the system is still trying to run at a 120% load, and everything still runs slow. To address these, I use that query I posted earlier and start working down through the "worst of the worst" - looking at each query to determine what is causing it to behave badly. In one large system, I had to fix over 30 queries (mostly by adding/changing indexes) before the system dropped out of saturation mode and started being responsive to further improvements. The Disk I/O and Buffer Get numbers suggest that the system is doing some serious table (or index) scanning, and that the data of interest isn't cached, so it needs to hit the disk. This implies that the indexes aren't efficient, and may explain why the optimizer doesn't choose to use some of the indexes. Are you using "covering" indexes? (Indexes where the index itself contains the column(s) being retrieved based on the key values.) These can seriously improve system performance, as the system can often avoid fetching the actual data row if the desired retrieval value is in the index. Sometimes a modest increase in the size of an index can cut performance by 10x or more. I would start looking at the overall systems performance issues before focusing on this particular query. 40 minutes to retrieve 900 rows seems excessive by itself - that is the kind of performance time I would expect for a monthly detail financial report for a large corporation.
    3,830 pointsBadges:
    report
  • Darryn
    Quick update on the problem. We seem to have narrowed it down to the client operating system. We changed the query to a quicker one, and when run from Unix systems it completed in under 60 secs, but it took 45-60 mins from Windows systems. We are struggling to find out why there is such a difference, and have opened a case with Metalink. One of our more experienced DBAs has concluded that optimizer_mode=RULE is being forced from Windows (he was able to reproduce the same explain plan and poor performance from Unix by altering the session's optimizer mode). There are no logon triggers or anything else we can find that is forcing the optimizer to use RULE, so now we are waiting on a response from Oracle.
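    For anyone following along, the reproduction was presumably something like this (a sketch), and the mode a cursor was actually parsed under can be checked from v$sql afterwards:
    -- From a Unix client, force the session into rule-based optimisation:
    alter session set optimizer_mode = rule;
    -- Run the query, then check which optimizer mode the shared cursor was built with:
    select sql_text, optimizer_mode
      from v$sql
     where sql_text like '%<something unique from your query>%';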
    765 pointsBadges:
    report
  • carlosdl
    Thanks for the update, Darryn.
    69,510 pointsBadges:
    report
