views: 185

answers: 4
I am working with a rather large MySQL database (several million rows) with a column storing blob images. The application grabs a subset of the images and runs some processing algorithms on them. The problem I'm running into is that, because of the size of the dataset, the result set my query returns is too large to store in memory.

For the time being, I have changed the query to not return the images. While iterating over the result set, I run a second select that grabs the individual image for the current record. This works, but the tens of thousands of extra queries have caused an unacceptable performance decrease.

My next idea is to limit the original query to about 10,000 results and then keep querying over spans of 10,000 rows. This seems like a middle-of-the-road compromise between the two approaches, but I suspect there is a better solution I am not aware of. Is there another way to hold only portions of a gigantic result set in memory at a time?

Cheers,

Dave McClelland

+1  A: 

When dealing with datasets this large, it is important to avoid holding it all in memory at once. If you are writing the result out to disk or to a web page, do that as you read in each row; don't wait until you've read all rows before you start writing.

You could also set the images to DelayLoad = true so that they are only fetched when you need them, rather than implementing this functionality yourself.
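Either way, the idea is to stream rows one at a time and keep only a running result, so no row outlives its own iteration. A minimal ADO.NET sketch, where the DSN, table, column names, and scoring routine are all placeholders:

```csharp
using System;
using System.Data.Odbc;

class StreamAndProcess
{
    static void Main()
    {
        // Placeholder DSN and schema -- substitute your own.
        using (var conn = new OdbcConnection("DSN=myMySqlDsn"))
        {
            conn.Open();
            using (var cmd = new OdbcCommand("SELECT id, image_blob FROM images", conn))
            using (var reader = cmd.ExecuteReader())
            {
                double bestScore = double.MinValue;
                long bestId = -1;

                while (reader.Read())
                {
                    long id = Convert.ToInt64(reader["id"]);
                    byte[] image = (byte[])reader["image_blob"];

                    // Process the row immediately; only the best result is kept.
                    double score = Score(image);
                    if (score > bestScore)
                    {
                        bestScore = score;
                        bestId = id;
                    }
                }

                Console.WriteLine("Best match: {0} (score {1})", bestId, bestScore);
            }
        }
    }

    // Stand-in for the real image-analysis routine.
    static double Score(byte[] image)
    {
        return image.Length;
    }
}
```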

Mark Byers
I should have mentioned, I am (currently) using ADO.NET. Is there an equivalent to DelayLoad? I could migrate to LINQ, but would prefer not to. I'll update the question tags. Also, I am not writing the results anywhere; I am running some image analysis algorithms on them to compare each one to an image passed into the function. As a result, I only need to store the best image and can ignore the rest. Thanks for the quick reply!
Dave McClelland
+3  A: 

One option is to use a DataReader. It streams the data, but it's at the expense of keeping an open connection to the database. If you're iterating over several million rows and performing processing for each one, that may not be desirable.

I think you're heading down the right path of grabbing the data in chunks, probably using MySQL's LIMIT clause, correct?
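For example, something along these lines, keyed on an auto-increment id so later chunks don't pay the cost of a large OFFSET (the DSN, table, and column names are placeholders):

```csharp
using System;
using System.Data.Odbc;

class ChunkedRead
{
    const int ChunkSize = 10000;

    static void Main()
    {
        using (var conn = new OdbcConnection("DSN=myMySqlDsn")) // placeholder
        {
            conn.Open();
            long lastId = 0;
            int rowsInChunk;

            do
            {
                rowsInChunk = 0;
                using (var cmd = new OdbcCommand(
                    "SELECT id, image_blob FROM images WHERE id > ? ORDER BY id LIMIT " + ChunkSize,
                    conn))
                {
                    // The parameter name is ignored; ODBC binds by position to the "?".
                    cmd.Parameters.AddWithValue("@lastId", lastId);

                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            lastId = Convert.ToInt64(reader["id"]);
                            byte[] image = (byte[])reader["image_blob"];
                            Process(image); // your analysis here
                            rowsInChunk++;
                        }
                    }
                }
            } while (rowsInChunk == ChunkSize); // a short chunk means we've reached the end
        }
    }

    static void Process(byte[] image) { /* ... */ }
}
```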

Anthony Pegram
Yes, I had planned on using SQL's limit.
Dave McClelland
Also, I am already using a DataReader to read the results (calling OdbcCommand.ExecuteReader() and then iterating with while (dataReader.Read())). It doesn't seem to stream very effectively, as it was still filling up all available system memory. Is there a use of DataReaders that I'm not aware of?
Dave McClelland
@Dave, someone else may be able to offer insight on the specific memory usage of the datareader, but my understanding is that it should only have a single record in memory at a given time. At any rate, it should require far less memory resources than a DataSet or DataTable. Your memory problems could be the result of what you're doing with the data once you've retrieved it, how long you're keeping those objects in scope, etc.
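If it turns out the blobs themselves are what is eating memory, one thing worth trying is CommandBehavior.SequentialAccess, which streams each blob through GetBytes in small buffers instead of buffering it as one large byte[]. A rough sketch with a hypothetical schema:

```csharp
using System;
using System.Data;
using System.Data.Odbc;

class SequentialBlobRead
{
    static void Main()
    {
        using (var conn = new OdbcConnection("DSN=myMySqlDsn")) // placeholder
        {
            conn.Open();
            using (var cmd = new OdbcCommand("SELECT id, image_blob FROM images", conn))
            // SequentialAccess requires reading columns in order and lets large
            // columns be streamed via GetBytes rather than loaded all at once.
            using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
            {
                var buffer = new byte[8192];

                while (reader.Read())
                {
                    long id = Convert.ToInt64(reader.GetValue(0));
                    long offset = 0;
                    long read;

                    while ((read = reader.GetBytes(1, offset, buffer, 0, buffer.Length)) > 0)
                    {
                        // Feed each 8 KB slice to the analysis incrementally,
                        // or append to a stream only if the whole image is needed.
                        offset += read;
                    }
                }
            }
        }
    }
}
```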
Anthony Pegram
A: 

I see 2 options.

1) If this is a Windows app (as opposed to a web app), you can read each image using a data reader and dump it to a temp folder on disk; then you can run whatever processing you need against the physical file (see the sketch below).

2) Read and process the data in small chunks. 10k rows can still be a lot, depending on how large the images are and how much processing you want to do. Returning 5k rows at a time and fetching more on a separate thread when you are down to 1k remaining can make for a seamless process.

Also, while not always recommended, forcing garbage collection before processing the next set of rows can help free up memory.
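A rough sketch of option 1, with placeholder names throughout:

```csharp
using System;
using System.Data.Odbc;
using System.IO;

class TempFileDump
{
    static void Main()
    {
        string tempDir = Path.Combine(Path.GetTempPath(), "image-work");
        Directory.CreateDirectory(tempDir);

        using (var conn = new OdbcConnection("DSN=myMySqlDsn")) // placeholder
        {
            conn.Open();
            using (var cmd = new OdbcCommand("SELECT id, image_blob FROM images", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    long id = Convert.ToInt64(reader["id"]);
                    byte[] image = (byte[])reader["image_blob"];

                    // Write the image to disk and process from the file, so the
                    // byte[] can be collected right away.
                    string path = Path.Combine(tempDir, id + ".img");
                    File.WriteAllBytes(path, image);
                    ProcessFile(path);
                    File.Delete(path);
                }
            }
        }
    }

    static void ProcessFile(string path) { /* run the analysis against the file */ }
}
```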

ProphetBeal
A: 

I've used a solution like the one outlined in this tutorial before: http://www.asp.net/(S(pdfrohu0ajmwt445fanvj2r3))/learn/data-access/tutorial-25-cs.aspx

You could use multi-threading to pre-pull a portion of the next few data sets (at first, pull rows 1-10,000 and, in the background, pull rows 10,001-20,000 and 20,001-30,000), and delete earlier pages to conserve memory if that is an issue (say, once you are at rows 50,000-60,000, delete rows 1-10,000). Use the current "page" position as a pointer to decide when to pull the next range of data or to drop data that has fallen out of range.
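Loosely, the prefetch part might look like the sketch below, where FetchPage is a stand-in for a LIMIT-style page query like the ones above (this assumes .NET 4's Task; a plain background thread works just as well):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks; // assumes .NET 4 or later

class PrefetchPages
{
    class ImageRow
    {
        public long Id;
        public byte[] Data;
    }

    static void Main()
    {
        List<ImageRow> current = FetchPage(0);

        while (current.Count > 0)
        {
            long lastId = current[current.Count - 1].Id;

            // Start pulling the next page in the background while this one is processed.
            Task<List<ImageRow>> prefetch = Task.Factory.StartNew(() => FetchPage(lastId));

            foreach (ImageRow row in current)
                Process(row.Data);

            // Drop the processed page so it can be collected, then swap in the next one.
            current = prefetch.Result;
        }
    }

    // Placeholder: would run something like
    // "SELECT id, image_blob FROM images WHERE id > ? ORDER BY id LIMIT 10000"
    // and return the rows, as in the chunked example above.
    static List<ImageRow> FetchPage(long afterId)
    {
        return new List<ImageRow>(); // stub
    }

    static void Process(byte[] image) { /* analysis */ }
}
```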

Gary