Hi Experts,

I have a requirement where I have to select around 60 million records from a database. Once I have all the records in a ResultSet, I have to format some columns per the client's requirements (date format and number format) and then write all the records to a file (secondary storage).

  • Currently I select records on a per-day basis (7 selects for 7 days) from the DB and put them in a HashMap. I then read from the HashMap, format some columns, and finally write to a file (a separate file for each of the 7 days).
  • Finally I merge all 7 files into a single file.

  • But this whole process takes 6 hours to complete. To speed it up, I created 7 threads for the 7 days, with each thread writing its own file.

  • Finally I merge all 7 files into a single file. This version takes 2 hours, but my program runs out of memory (OutOfMemoryError) after about an hour.

Please suggest the best design for this scenario. Should I use some caching mechanism? If yes, which one, and how?

Note: The client doesn't want to change anything at the database, like creating indexes or stored procedures; they don't want to touch the database at all. Thanks in advance.

A: 

It depends on the database you are using, but if it were SQL Server, I would recommend using something like SSIS to do this rather than writing a program.

Eric Petroelje
The client is not ready to change anything at the database. No indexes and no stored procedures.
Garhwali Bhai
Sometimes you have to tell the client why what they want is a bad idea, and the best way to do something is to make the change at the database level. They don't want a slow-running application locking up their tables either. An SSIS package will not change the structure of the database or add any stored procs or indexes, but it is one of the best ways to import or export data, as it is specifically designed to do just that.
HLGEM
As HLGEM mentioned, using SSIS does not require any database changes. It's just a tool for doing data transformations (which is exactly what you are doing). The caveat is that it only works with SQL Server, so if they aren't using SQL Server, SSIS isn't an option (although there may be similar tools for whatever database they are using).
Eric Petroelje
+4  A: 

Do you need to have all the records in memory to format them? You could try streaming the records through a process and writing straight to the file. If you're able to break the query up further, you might be able to start processing the results while you're still retrieving them.

Depending on your DB backend, there may be tools to help with this, such as SSIS for SQL Server 2005+.

Edit

I'm a .NET developer, so let me suggest what I would do in .NET; hopefully you can convert it into comparable technologies on the Java side.

ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a resultset. It returns data as the query is executing. This is very important. Essentially, my logic would be:

// Open a forward-only, read-only ("firehose") reader for one day's data.
IDataReader reader = GetTheDataReader(dayOfWeek);

// Rows arrive as the query executes; format and write each one immediately
// so nothing accumulates in memory.
while (reader.Read())
{
    file.Write(formatRow(reader));
}

Since this executes while rows are being returned, you're not going to block on network access, which I'm guessing is a huge bottleneck for you. The key here is that we never hold any of this in memory for long: as we cycle through, the reader discards each result and the file writes the row to disk.
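On the Java side, a roughly equivalent JDBC sketch might look like the following. This is a minimal sketch only: the table name, query, and formatRow are hypothetical, and some drivers (e.g. MySQL's) need driver-specific fetch-size settings before results actually stream.

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class DayExporter {

        // Stream one day's rows straight from the cursor to the file.
        static void export(Connection conn, java.sql.Date day, String outFile)
                throws SQLException, IOException {
            try (PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM records WHERE record_day = ?",  // hypothetical query
                     ResultSet.TYPE_FORWARD_ONLY,                   // firehose-style cursor
                     ResultSet.CONCUR_READ_ONLY);
                 BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
                ps.setFetchSize(1000);  // hint: fetch in chunks, don't buffer everything
                ps.setDate(1, day);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        out.write(formatRow(rs));  // format date/number columns per row
                        out.newLine();
                    }
                }
            }
        }

        // Placeholder for the client-specific column formatting.
        static String formatRow(ResultSet rs) throws SQLException {
            return rs.getString(1);
        }
    }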

JoshBerke
Please suggest something on the Java/JDBC side. They don't want to change anything on the DB side; I know this is crazy :):)
Garhwali Bhai
JDBC/ADO are just client access technologies; neither requires a change on the database. It's not crazy that they want to keep their DB in its current state either; it's the important bit where the data is. Surely JDBC has a similar read-only, forward-only cursor.
gbjbaanb
I'd assume JDBC does; I'd be shocked if it didn't, since this isn't a really novel concept. The key is that the cursor allows reading the results while the query is executing. I used this in a similar pattern where I didn't know how many results I would need until my business logic ran (and the logic was basically an expression defined by the user, so it could be anything). Using the DataReader was a huge benefit since we just stopped reading as soon as we had enough data.
JoshBerke
+2  A: 

I think what Josh is suggesting is this:

You have loops where you currently go through all the result records of your query (pseudocode only):

while (rec = getNextRec() )
   {
   put in hash ...
   }

for each rec in (hash)
   {
   format and save back in hash ...
   }

for each rec in (hash)
   {
   write to a file ...
   }

instead, do it like this:

while (rec = getNextRec() )
   {
   format fields ...
   write to the file ...
   }

then you never have more than one record in memory at a time, and you can process an unlimited number of records.
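In Java, that single pass can also reuse one formatter per thread instead of constructing new ones per record, which matters over 60 million rows. Below is a sketch with hypothetical column positions; note that SimpleDateFormat is not thread-safe, so each worker thread needs its own instance:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.text.DecimalFormat;
    import java.text.SimpleDateFormat;

    class RowFormatter {
        // One formatter pair per thread; SimpleDateFormat is not thread-safe.
        private final SimpleDateFormat dateFmt = new SimpleDateFormat("yyyy-MM-dd");
        private final DecimalFormat numFmt = new DecimalFormat("#,##0.00");

        // Hypothetical columns: 1 = id, 2 = a date, 3 = an amount.
        String format(ResultSet rs) throws SQLException {
            return rs.getString(1) + ","
                 + dateFmt.format(rs.getDate(2)) + ","
                 + numFmt.format(rs.getDouble(3));
        }
    }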

Ron

Ron Savage
A: 

Obviously reading 60 million records at once uses up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time, so you can't do that either (i.e. your initial read-to-file model).

So.... you're going to have to compromise and do a bit of both.

Josh has it right: open a cursor to your DB that simply reads the next record, one after the other, in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here, as it imposes the least load on the database. The DB won't let you update the records or go backwards in the recordset, which you don't want anyway, so it doesn't need to hold the records in memory for you.

Now that you have this cursor, the DB gives you one record at a time: read it and write it to a file (or several files). This should finish quite quickly. Your remaining task is to merge the files into one in the correct order, which is relatively easy.
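If each per-day file is already internally ordered and the files just need concatenating in day order, the merge is a straight byte copy; for example, a minimal sketch using java.nio:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    class FileMerger {
        // Concatenate the per-day part files into one output file, in order.
        static void merge(List<Path> parts, Path merged) throws IOException {
            try (OutputStream out = Files.newOutputStream(merged)) {
                for (Path part : parts) {
                    Files.copy(part, out);  // append this part's bytes to the output
                }
            }
        }
    }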

Given the quantity of records you have to process, I think this is the optimum solution for you.

But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.

gbjbaanb