Greetings from Vijay,
Consider table B as below with columns b1 and b2.
Table B
B1 B2
==========
1 abc
2 xyx
3 abc
4 efg
1 xyx
3 abc
4 car
5 edf
Query 1
=========
select distinct b2 from B
Output :
======
B2
============
abc
car
edf
efg
xyx
Explain Plan: Query 1
=====================
Operation Object Name
=============================================
SELECT STATEMENT Hint=CHOOSE
SORT UNIQUE
TABLE ACCESS FULL B
Query 2 :
=========
SELECT b2
FROM (SELECT b2 ,row_number() OVER (PARTITION BY b2 ORDER BY b2) AS rnum FROM B)
WHERE RNUM <2
Output :
======
B2
=============
abc
car
edf
efg
xyx
Explain Plan: Query 2
=====================
Operation Object Name
===================================================
SELECT STATEMENT Hint=CHOOSE
VIEW
WINDOW SORT PUSHED RANK
TABLE ACCESS FULL B
The Question
==============
When number of rows in table B increases (to the order of a million rows) , would i be right in expecting oracle to perform better with Query2 ?
given the fact :
" Any value in column b2 will not repeat more than 5 times "
Software/Hardware used:
ASKED:
May 13, 2004 12:04 PM
UPDATED:
May 22, 2004 10:41 PM
Vijay,
I have to agree with HOPKIHC, that the only way to be certain of the results it to test it.
In my experience, you cannot rely solely on the explain plans or cost results… often times when comparing two solutions, we would get “costs” that were significantly higher on one sql yet it would actually run faster. So really the only true cost analysis is to try it in your environment and with your own data.
Don’t try to achieve all your results through programming! Work with your DBA to set up appropriate indexing or partitioning on the table so you get the results you need.
Also, from both a developer and manager standpoint, a good rule is to “Keep It Simple”. Even if the simple query takes a bit longer to produce output, the long-term maintenance savings will far outweigh that small cost.
Hope this helps.
S.Holtmeyer
Vijay, you seem a lot cleverer than I – I would never have come up with such a fancy SQL as your second statement (unless I had a very desperate situation) – well done.
Having said that, I am questioning if a more simplistic answer is appropriate. If this sql is a regular feature of you application, it might be appropriate to
a)create an index on B(B2) allowing a fast full scan. Obviously this comes at the cost of maintaining that index for each insert where that column is poulated, update to that column, or delete (of which I hate no knowledge)
b) put serious consideration into tuning sort area size, multi block read counts for both data and sort blocks, etc.
Theses parameters can be altered at a session level prior to executing the statement (if you have that flixibility).
My experience would suggest that the physical (non cached)and logical IO generated by the full table scan, and the physical IO required to write out and read back sort buffers (which is partially addresses by your second query) will have a lot more to do with execution time than the sort algorith (but I will conceed that perhaps my experience was a little simplistic).
In any case, the sort performace will depend very much on the sort area size, and if you are sorting to disk (due to the volume) you should marry the sort area size and the extent size of sort segments, ensuring temporary tablespace is locally manages, etc, and agree that if this is really so important, you must tune iteratively and benchmark carefully to ensure the validity of your results.
If you have access to Metalink I refere you to Note:177334.1 which has links to lots of articles related to sorting and temporary segments.
Good luck, Evan.