DISTINCT or work-around

Tags:
Oracle
Performance/Load
SQL
Greetings from Vijay. Consider table B below, with columns B1 and B2.

Table B
=======
B1   B2
--   ---
1    abc
2    xyx
3    abc
4    efg
1    xyx
3    abc
4    car
5    edf

Query 1
=======
select distinct b2 from B

Output:
=======
B2
---
abc
car
edf
efg
xyx

Explain Plan: Query 1
=====================
Operation                      Object Name
-------------------------------------------
SELECT STATEMENT Hint=CHOOSE
  SORT UNIQUE
    TABLE ACCESS FULL          B

Query 2
=======
SELECT b2
FROM (SELECT b2,
             row_number() OVER (PARTITION BY b2 ORDER BY b2) AS rnum
      FROM B)
WHERE rnum < 2

Output:
=======
B2
---
abc
car
edf
efg
xyx

Explain Plan: Query 2
=====================
Operation                      Object Name
-------------------------------------------
SELECT STATEMENT Hint=CHOOSE
  VIEW
    WINDOW SORT PUSHED RANK
      TABLE ACCESS FULL        B

The Question
============
When the number of rows in table B increases (to the order of a million rows), would I be right in expecting Oracle to perform better with Query 2, given the fact that any value in column b2 will not repeat more than 5 times?

Answer Wiki


Hi Vijay.

I am going to have to recite from the Gospel of Tom Kyte on this one:

benchmark, benchmark, benchmark

Do not just look at the query plan; trace your sessions and run tkprof to see the actual timings of these queries. Populate test tables with tens of thousands of rows (using a PL/SQL script) and see how these two approaches scale.
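For instance, here is a minimal sketch of such a test in SQL*Plus. The row counts and generated values are my assumptions; the only deliberate property is that no b2 value repeats more than 5 times, per your stated constraint:

-- Populate a test table: 200,000 distinct b2 values x 5 repeats = 1,000,000 rows
CREATE TABLE b (b1 NUMBER, b2 VARCHAR2(30));

BEGIN
  FOR i IN 1 .. 200000 LOOP
    FOR j IN 1 .. 5 LOOP
      INSERT INTO b VALUES (j, 'val' || i);
    END LOOP;
  END LOOP;
  COMMIT;
END;
/

-- Time both candidate queries with tracing on, then run tkprof on the trace file
SET TIMING ON
ALTER SESSION SET sql_trace = TRUE;

SELECT DISTINCT b2 FROM b;

SELECT b2
FROM (SELECT b2, ROW_NUMBER() OVER (PARTITION BY b2 ORDER BY b2) AS rnum FROM b)
WHERE rnum < 2;

ALTER SESSION SET sql_trace = FALSE;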

But before even doing that, understand what you want out of your query. Do you want a list of distinct values? If yes, then I’d recommend using the DISTINCT operator. That’s what it’s there for. It’s an operation the Oracle optimizer understands, and based on your table statistics and use of indexes, it will figure out the best way to execute it. Don’t second-guess the optimizer.
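Whichever query you keep, give the optimizer current statistics to work with. A minimal sketch, assuming the table is named B as in your example:

-- Gather table (and index) statistics so the optimizer can cost the DISTINCT
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'B',
    cascade => TRUE);  -- also gather statistics on any indexes on B
END;
/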

More important, Query #1 is simpler to understand. Your successor will appreciate that.

I heartily recommend both of Tom’s recent books (Effective Oracle by Design and Expert One-on-One Oracle). I’d go so far as to say that they’re required reading.

Hope this points you in a useful direction.

Discuss This Question: 2  Replies

 
  • Holtmeyer
    Vijay, I have to agree with HOPKIHC that the only way to be certain of the results is to test it. In my experience, you cannot rely solely on the explain plans or cost figures: often, when comparing two solutions, we would get "costs" that were significantly higher for one SQL statement, yet it would actually run faster. So really, the only true cost analysis is to try it in your environment and with your own data. Don't try to achieve all your results through programming! Work with your DBA to set up appropriate indexing or partitioning on the table so you get the results you need (see the sketch below). Also, from both a developer and a manager standpoint, a good rule is to "Keep It Simple". Even if the simple query takes a bit longer to produce output, the long-term maintenance savings will far outweigh that small cost. Hope this helps. S.Holtmeyer
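    A minimal sketch of the indexing idea (the index name is made up; whether it actually pays off is exactly what testing in your environment should tell you):

    -- A b-tree index on B2; with fresh statistics, the optimizer may read the
    -- index instead of the full table for SELECT DISTINCT b2.
    -- Note: single-column b-tree indexes do not store NULLs, so Oracle can only
    -- substitute the index for the table scan if it knows B2 is NOT NULL.
    CREATE INDEX b_b2_idx ON b (b2);
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'B', cascade => TRUE);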
  • EvanOraDBA
    Vijay, you seem a lot cleverer than I am; I would never have come up with SQL as fancy as your second statement (unless I had a very desperate situation). Well done. Having said that, I question whether a simpler approach is appropriate. If this SQL is a regular feature of your application, it might be appropriate to:
    a) Create an index on B(B2), allowing a fast full scan. Obviously this comes at the cost of maintaining that index for each insert where that column is populated, each update to that column, and each delete (of which I have no knowledge).
    b) Put serious consideration into tuning the sort area size, the multiblock read counts for both data and sort blocks, etc. These parameters can be altered at the session level prior to executing the statement (if you have that flexibility); see the sketch below.
    My experience suggests that the physical (non-cached) and logical I/O generated by the full table scan, and the physical I/O required to write out and read back the sort buffers (which is partially addressed by your second query), will have a lot more to do with execution time than the sort algorithm (though I will concede that perhaps my experience is a little simplistic). In any case, the sort performance will depend very much on the sort area size, and if you are sorting to disk (due to the volume) you should match the sort area size to the extent size of the sort segments, ensure the temporary tablespace is locally managed, and so on. I agree that if this is really so important, you must tune iteratively and benchmark carefully to ensure the validity of your results. If you have access to Metalink, I refer you to Note:177334.1, which has links to lots of articles related to sorting and temporary segments. Good luck, Evan.
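    A minimal sketch of those session-level settings (the values are placeholders to tune for your own system, not recommendations):

    -- Enlarge the in-memory sort area before running the query
    ALTER SESSION SET sort_area_size = 10485760;           -- 10 MB, placeholder
    ALTER SESSION SET db_file_multiblock_read_count = 32;  -- placeholder
    -- Note: on 9i and later, if WORKAREA_SIZE_POLICY = AUTO then
    -- PGA_AGGREGATE_TARGET governs sort memory and SORT_AREA_SIZE is ignored.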
