views:

594

answers:

9

If I have a sql query that uses 'distinct' (in oracle), would it be faster than retrieving the non-distinct then getting the unique results via java programming?

I heard somewhere that oracle sql distinct is heavy, but is it heavier than manual 'distinction' via java programming?

Thanks, Franz

+8  A: 

Two main aspects:

  • If you have to transfer the data over to Java, there's all the overhead of doing that transfer. Doing the work in the database means you don't need to transfer data you don't need.
  • The database will be able to use its indexes, caches etc to speed things up.

I'd be very surprised to find that fetching all the data and then doing a distinct operation in Java was faster than doing it in the database.

Jon Skeet
+20  A: 

The rule of the thumb is that the data is faster to process in the database than in your programming language.

The reason is that the data is already available in the database, and it saves the effort of going to your app:

  • marshalling data for the driver from the database;
  • network transfer;
  • unmarshalling data from the driver to the application;


As for Oracle DISTINCT being heavy, what it can mean is to never simply throw a DISTINCT at a query just because it seems like a good idea - profile it with a realistic data set since it can have serious performance implications.

Robert Munteanu
+2  A: 

In memory data manipulation is normally faster, but. If you know you have lots of duplicates it's better to distinct on DB. Otherwise you'll use much more network bandwidth and memory. So it actually depends on the data.

Robert Koritnik
I think Oracle will do the distinct manipulation in memory too. How else to do that? I don't understand your answer.
tuinstoel
I was talking about GENERAL data manipulation. Not some intrinsic DB functionality like DISTINCT. For instance doing something a bit more complex by using temp tables etc.
Robert Koritnik
A: 

SQL distinct is "heavy" because it has to eliminate multiple occurrences. This can be achieved by first sorting the data and then eliminate runs with equal elements. The heaviness relates to the fact that it costs to do this operation.

The idiomatic solution here would be to let the database do the lifting and then worry about performance if it becomes a problem.

jlouis
+2  A: 

Back in my days, when animals where still talking and playing games on their atari's, creating unneeded Network Traffic was a BAD BAD thing.

Getting more data than you needed was simply not done. The only reason I can see getting all the data and manipulate it in Java to get distinct values is when you will need the other data too, a bit later. So in order to cach it.

Note : that's not the same as saying do everything on the server, doing manipulation on data in an environment with more support for datamanipulation can be a good thing. Just don't ask for more data than you will need (caching as a notable exception)

Peter
+4  A: 

If your application can do that faster than the database, the database is seriously broken (except when the application is running on a considerably faster machine). Of course the database has to do some work (either sorting or using hashtables to eliminate duplicates), but so does your application!

It's almost always wrong to do tasks in the application that the database can do well.

ammoQ
+1  A: 

Oracle 9 and Oracle 10 execute a distinct operation differently, Oracle 9 sorts, Oracle 10 hashes. It is possible that in a few select cases Java can do the distinct faster than Oracle 9 but slower than Oracle 10.

I think overal Oracle can do it faster. See Jon Skeet's answer.

tuinstoel
+2  A: 

general rule: lets database do database's job, you are saving memory, network, CPU on your side. Most of these resources will be spent on database side either, but there have been done large optimizations before going to production, so - give baby to her mother...

ante.sabo
A: 

Oracle, especially in later versions, has several methods it can choose from to implement the DISTINCT operation, including sorts or hashes, and accessing the table in various ways (e.g. using indexes or full scans). It also has more information about the data, including statistics and constraints, which can help it to find optimisations that a java program might not necessarily know about.

This is apart from the main issue here, which is the amount of data being piped around (as has been stated several times by others here).

Jeffrey Kemp