I have a project in which I'm reading huge volumes of data from an Oracle database from Java.

I suspect the application we are writing will process the data far faster than a single-threaded SELECT query can deliver it, so I've been trying to research faster ways of obtaining the data.

Does anyone have anything I could read that would help me with my plight?

+3  A: 

Hi PintSizedCat,

Oracle supports parallel execution for DML and, in particular, for SELECT queries. Ultimately the bottleneck will probably be the IO read speed. Either use faster disks or stripe the data across many disks.

Update

As APC noted in the comments, Parallel Query/DML is an Enterprise Edition feature and is not available in the Standard Edition.

Also, Parallel DML/Query is not the solution to all performance problems. Since more than one process will be used by the query, it may improve throughput, but at the cost of concurrency. The purpose of parallelism is to use more resources to process the query faster. If the query is IO-bound or CPU-bound, there are no extra resources to use, and adding parallelism will only make matters worse.

From the link above:

Parallel execution is not normally useful for:

  • Environments in which the CPU, memory, or I/O resources are already heavily utilized. Parallel execution is designed to exploit additional available hardware resources; if no such resources are available, then parallel execution will not yield any benefits and indeed may be detrimental to performance.
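As a rough illustration of what a parallel SELECT issued from JDBC looks like (the `ORDERS` table name and degree of 4 here are placeholders, not from the question):

```java
// Sketch: issuing a parallel-hinted query from JDBC. The PARALLEL hint
// asks Oracle to use parallel execution servers for the scan; as noted
// above, this is an Enterprise Edition feature.
public class ParallelQuerySketch {
    // Builds a SELECT with a PARALLEL hint for the given table and degree.
    static String parallelSelect(String table, int degree) {
        return "SELECT /*+ PARALLEL(" + table + ", " + degree + ") */ * FROM " + table;
    }

    public static void main(String[] args) {
        String sql = parallelSelect("orders", 4);
        System.out.println(sql);
        // With a live connection one would then run:
        // try (Statement st = conn.createStatement();
        //      ResultSet rs = st.executeQuery(sql)) { ... }
    }
}
```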
Vincent Malgrat
Or if the output is less than all the data you are selecting look at stored procedures running in the Oracle db (in PL/SQL or Java)
Mark
Parallel Query will be limited by CPUs as well. That is, if our server's CPU is already maxed out we won't get any benefit from PQ. In fact it might make things worse.
APC
Also the usual caveat applies regarding licensing. Parallel Query is an Enterprise Edition feature.
APC
I don't think the number of CPUs or the speed of the disks will be an issue. I'm not too worried about this because I know the db is running off solid-state disks. Reading multithreaded seems the best way, so we can use as much of the box as possible and read things into our application as fast as possible.
PintSizedCat
@APC: absolutely, I updated my answer to add a necessary word of caution.
Vincent Malgrat
+3  A: 

You haven't given us a lot of information on why it will be necessary to bring "huge volumes of data" into the Java application instead of processing it on the database side. Although there can be exceptions, usually this is signal to re-think the design. As a general rule with Oracle it is most efficient to do as much work as you can with pure set operations (SQL), followed by procedural processing with the rdbms engine (PL/SQL) before bringing results back to the client application.
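To illustrate that general rule (the `SALES` table and its columns are hypothetical, not from the question), compare shipping every row to Java against letting the database do the set operation first:

```java
// Sketch: pushing aggregation into SQL instead of pulling raw rows.
// With a hypothetical SALES table, the second query returns one summary
// row per region rather than "huge volumes of data".
public class PushDownSketch {
    // Pulls every row; all aggregation then happens client-side in Java.
    static final String RAW =
        "SELECT region, amount FROM sales";

    // The database performs the pure set operation; only the aggregated
    // rows cross the network to the client.
    static final String AGGREGATED =
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region";

    public static void main(String[] args) {
        System.out.println(RAW);
        System.out.println(AGGREGATED);
    }
}
```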

dpbradley
Unfortunately, due to the nature of our system, we need all of the information in it before any processing can be done (aggregation of many different systems' worth of data, not all from Oracle). I will certainly consider doing some processing with stored procedures, but I will have to read in the majority of the data.
PintSizedCat
@PintSizedCat - OK, sounds like a somewhat special situation, but you should do some testing to show that Oracle result set transfer will be the actual bottleneck before going too much further.
dpbradley
@PintSizedCat: Have you demonstrated where the bottleneck is? Have you run your SQL wrapped in DBMS_MONITOR.SESSION_TRACE_ENABLE() and DBMS_MONITOR.SESSION_TRACE_DISABLE() calls to get wait information? Are you sorting the data, intentionally or otherwise?
Adam Musch
I haven't yet, I wanted to get a jump start on the project by learning a bit about oracle and how I might be able to do it. We start tomorrow. I will do what both of you state and find out if we have a bottleneck with oracle.
PintSizedCat
+3  A: 
I'd be very suspicious of *any* number as being a "sweet spot", it probably depends too much on many factors, including the size of each row, and perhaps even the nature of the network transport layer. In the end it's best to pick a starting point and do performance testing with representative data volumes.
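In that spirit, rather than trusting any single number, one could time a handful of candidate JDBC fetch sizes against representative data. This is only a sketch: the candidate values are arbitrary, and the timing loop needs a live Oracle connection and a real query to be meaningful.

```java
import java.sql.*;

// Sketch: probing a few fetch sizes instead of assuming a "sweet spot".
public class FetchSizeProbe {
    // Arbitrary candidates spanning the range discussed below.
    static final int[] CANDIDATES = {75, 500, 1000, 3000};

    // Times one full drain of the result set at the given fetch size.
    static long timeQuery(Connection conn, String sql, int fetchSize) throws SQLException {
        long start = System.nanoTime();
        try (Statement st = conn.createStatement()) {
            st.setFetchSize(fetchSize); // rows fetched per network round trip
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) { /* drain rows */ }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws SQLException {
        // With a live connection and a representative query:
        // try (Connection conn = DriverManager.getConnection(url, user, pass)) {
        //     for (int fs : CANDIDATES) {
        //         System.out.println(fs + ": " + timeQuery(conn, sql, fs) + " ns");
        //     }
        // }
    }
}
```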
Jeffrey Kemp
True, I should have added "your mileage may vary" or a similar disclaimer. Notice I threw out numbers from 75 to 3000, a fairly big range. My guess is that above 75 the performance gains, if any, will get very small. But this is just a guess. My second guess is that one can easily waste more time testing multiple scenarios than will be saved trying to get that last nanosecond of performance. But again, that depends on the situation... One thing that strikes me is that the original question assumes there will be a problem before they have tried anything concrete...
+2  A: 

Firstly, 'huge data' to database people is [at least] gigabytes, in which case I suspect your problems are going to be reading those sorts of volumes into your process's memory and aggregating them there. Why do you think a single-threaded select will be the bottleneck?

If the bottleneck were getting the data from disk, then having multiple threads pulling data from the same disk wouldn't necessarily be faster and might even be slower. But if you could spread the data over separate disks, separate threads would be faster. If, using SSDs, you don't think disks will be a contention point, we can look elsewhere.

If the bottleneck were network bandwidth, again, multiple threads wouldn't fit any more data through the pipe any faster. You might even benefit from unloading the data to a flat file, compressing it, and transferring that.
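A quick way to gauge whether compression could pay off is to gzip a representative sample and compare sizes. This sketch uses a made-up repetitive XML-like payload; real savings depend entirely on how verbose the actual data is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Sketch: measuring the compression ratio of a sample payload.
public class CompressionSketch {
    // Gzips a byte array in memory and returns the compressed bytes.
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Verbose, repetitive data (like XML) compresses dramatically;
        // already-compressed data (like video) would not.
        String verbose = "<row><id>1</id><name>example</name></row>".repeat(10_000);
        byte[] raw = verbose.getBytes();
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " -> " + packed.length + " bytes");
    }
}
```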

If the select is being sorted or comes from a hash-join, you may use memory more efficiently with a single thread. Multiple sessions would have to share the machine's memory.

If there is CPU-intensive processing, then multiple threads may help. That could be as simple as having multiple connections from Java, each getting a different slice of the data (e.g. A-K and L-Z), but it would very much depend on the SELECT.
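The slicing idea can be sketched as generating one WHERE predicate per connection over a key range. The column and table names here are placeholders; a real scheme would slice on whatever key the actual SELECT uses.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: splitting one big SELECT into per-thread key-range slices.
public class SliceSketch {
    // Builds one WHERE predicate per slice over the initial-letter range A-Z.
    static List<String> slicePredicates(String column, int slices) {
        List<String> preds = new ArrayList<>();
        int letters = 26;
        for (int i = 0; i < slices; i++) {
            char lo = (char) ('A' + i * letters / slices);
            char hi = (char) ('A' + (i + 1) * letters / slices);
            if (i == slices - 1) {
                // Last slice is open-ended to catch everything remaining.
                preds.add(column + " >= '" + lo + "'");
            } else {
                preds.add(column + " >= '" + lo + "' AND " + column + " < '" + hi + "'");
            }
        }
        return preds;
    }

    public static void main(String[] args) {
        // Each predicate would back one connection/thread:
        for (String p : slicePredicates("customer_name", 2)) {
            System.out.println("SELECT * FROM customers WHERE " + p);
        }
    }
}
```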

I agree with dpbradley that you should determine the bottleneck first. If you have the data and the select, it should be simple enough to determine how long it takes (both on the local machine and through the network), and a trace would be a necessary starting point to really go into how it could be sped up.
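For that trace starting point, one minimal way to wrap the query from JDBC, using the `DBMS_MONITOR` calls Adam Musch mentioned in the comments above, is sketched below. It needs a live connection and suitable privileges on `DBMS_MONITOR`; the execution calls are shown commented out.

```java
// Sketch: enabling SQL trace with wait information around the big SELECT.
public class TraceSketch {
    // Anonymous PL/SQL blocks to toggle tracing for the current session.
    static final String ENABLE =
        "BEGIN DBMS_MONITOR.SESSION_TRACE_ENABLE(waits => TRUE, binds => TRUE); END;";
    static final String DISABLE =
        "BEGIN DBMS_MONITOR.SESSION_TRACE_DISABLE; END;";

    public static void main(String[] args) {
        // With a live connection:
        // try (CallableStatement cs = conn.prepareCall(ENABLE)) { cs.execute(); }
        // ... run the big SELECT and drain the result set here ...
        // try (CallableStatement cs = conn.prepareCall(DISABLE)) { cs.execute(); }
        System.out.println(ENABLE);
    }
}
```

The resulting trace file on the server can then be summarized with tkprof to see where the time actually goes.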

Gary
Sorry, I should have said: it's more like terabytes' worth of data. You make a lot of sense in terms of pulling the data from disk, though, good point. Thanks, the multithreading point makes a lot of sense.
PintSizedCat
If you are shifting terabytes, I'd consider compression over the network. The effectiveness would depend on whether the data is verbose (e.g. XML) or already compressed (e.g. video files). I suspect the network would be a throttle long before the database.
Gary
OK, cool, good to know. There is potential for having our server on the same box as the database, but that is somewhat down the line.
PintSizedCat