views: 45

answers: 2

I'm working on a database where the number of rows is somewhere above 1,000,000. I have my select statement, but if I fetchall to begin with, I run out of memory quickly. Here are my two questions:

  1. Since I don't know the exact size of the database to start, is there any way to find out the number of rows without doing a fetchall? The computer literally cannot support it.

  2. Is there any way to fetch a certain chunk of the database, maybe 5,000 rows at a time to process, instead of doing an individual fetchrow for each and every line? I just finished running a test, and doing it row by row looks to take almost 4 minutes per 1,000 rows, and the boss isn't looking favorably on a program that is going to take almost 3 days to complete.

This is my code:

while ($i < $rows)
{
    if ($i + $chunkRows <= $rows)   # a full chunk is still available
    {
        for ($j = 0; $j < $chunkRows; $j++)
        {
            @array = $sth->fetchrow();
            ($nameOne, $numberOne, $numberTwo) = someFunction($lineCount, @array, $nameOne, $numberOne, $numberTwo);
        }
    }
    else                            # partial chunk: only $rows % $chunkRows rows left
    {
        for ($j = 0; $j < $rows % $chunkRows; $j++)
        {
            @array = $sth->fetchrow();
            ($nameOne, $numberOne, $numberTwo) = someFunction($lineCount, @array, $nameOne, $numberOne, $numberTwo);
        }
    }
    $i = $i + $j;
}
A: 

To get the number of rows in a table, you can use

SELECT COUNT(*) FROM mytable
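
From Perl/DBI, a single-row query like that can be fetched in one call; a minimal sketch, with mytable as a placeholder table name:

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM mytable');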

To limit the number of rows returned, the syntax may be specific to your database. MySQL, for example, has a LIMIT keyword which will let you pull back only a certain number of rows, as in the snippet below.
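
For illustration, a MySQL-flavored snippet (the table and column names are placeholders, not from the question):

SELECT col1, col2, col3
FROM mytable
LIMIT 5000 OFFSET 10000;   -- returns rows 10,001 through 15,000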

That being said, if you are pulling back all rows, you may want to add some more detail here describing specifically what you are doing, because that's not a common thing in most applications.

If you don't have a LIMIT available in your database, you can do things like flag a column with a boolean to indicate that a row was processed, and then re-run your query for a limited number of rows, skipping those that have been completed. Or record the last row id processed, and then limit your next query to rows with a greater id (see the sketch below). There are a lot of ways around that.
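
A minimal sketch of that last-id idea via DBI, assuming the table has a unique, ordered id column; mytable, the column names, and process_chunk are placeholders, and the row-limiting clause varies by server:

my $last_id = 0;
while (1) {
    my $batch = $dbh->selectall_arrayref(
        'SELECT id, col1, col2 FROM mytable
         WHERE id > ? ORDER BY id LIMIT 5000',   # LIMIT syntax varies by server
        undef, $last_id,
    );
    last unless @$batch;            # no rows left; done
    process_chunk($batch);          # hypothetical per-chunk handler
    $last_id = $batch->[-1][0];     # remember the highest id seen
}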

GrandmasterB
So I ended up messing my way through the count, which makes my life a little easier. I had a select count(1) going, I just never got the syntax right. I don't believe there is a limit-type keyword in DBI, which is what I'm using. Also, I don't have access to update the database, since this data is coming from an outside source. Looking at how the database is sorted, the only thing that is in an order (like a row ID or something) is a timestamp, but there can be multiple identical timestamps in the database. I'm looking for a fetch(0,5000): starting at 0 and taking 5,000.
ThePirateSheep
My thought is that the reason the program takes 3 days is that each fetch is an individual call, and each call is where the bulk of my latency and time is spent. I don't know if I'm explaining it as well as I could :/
ThePirateSheep
DBI is not a database, it is the Perl database *interface*. What you want to look for is whether the actual database server you are connecting to has a LIMIT keyword. What database server are you using? If you can't write to the database... can you write to a file? Query the rows, write the info to a file, and then process each line in the file (a rough sketch follows below). That might keep the memory usage down. Or do the same with a second table that you do have write access to.
GrandmasterB
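
A rough sketch of the spill-to-file idea from the comment above, assuming DBI and tab-free data; the file name and column handling are placeholders, not the commenter's code:

# First pass: stream every row to disk, one tab-separated line each.
open my $out, '>', 'rows.tsv' or die "open: $!";
while (my @row = $sth->fetchrow_array) {
    print {$out} join("\t", @row), "\n";
}
close $out;

# Second pass: process the file line by line; memory use stays flat.
open my $in, '<', 'rows.tsv' or die "open: $!";
while (my $line = <$in>) {
    chomp $line;
    my @row = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    # ... process @row here ...
}
close $in;
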
+1  A: 

Show your fetchrow looping code; there may be ways to improve it, depending on how you are calling it and just what you are doing with the data.

I believe the database drivers for most databases will fetch multiple rows at once from the server; you are going to have to say what underlying type of database you are using to get good advice there. If indeed it is communicating with the server for each row, you are going to have to modify the SQL to get sets of rows at a time, but how to do that varies depending on what database you are using.
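
One DBI-level option, independent of the server's SQL dialect: fetchall_arrayref takes an optional second argument, $max_rows, so batches can be pulled through a single statement handle. A minimal sketch with placeholder query and column names (whether this also cuts network round trips depends on the driver's row cache):

my $sth = $dbh->prepare('SELECT col1, col2, col3 FROM mytable');
$sth->execute;
while (my $batch = $sth->fetchall_arrayref(undef, 5_000)) {
    last unless @$batch;            # statement exhausted
    for my $row (@$batch) {
        # ... process one row ...
    }
}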

Ah, DB2. I'm not sure, but I think you have to do something like this:

SELECT *
FROM (SELECT col1, col2, col3, ROW_NUMBER() OVER () AS RN FROM mytable) AS cols
WHERE RN BETWEEN 1 AND 10000;

and adjust the numbers for each query until you get an empty result (a driving loop is sketched below). Obviously this is more work on the database side, since it has to repeat the query multiple times; I don't know if there are DB2 ways to optimize this (e.g. temporary tables).
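
A hedged sketch of that driving loop via DBI; the names mirror the placeholders above, and process_chunk is hypothetical:

my $page  = 10_000;
my $start = 1;
while (1) {
    my $rows = $dbh->selectall_arrayref(
        'SELECT * FROM (SELECT col1, col2, col3, ROW_NUMBER() OVER () AS RN
                        FROM mytable) AS cols
         WHERE RN BETWEEN ? AND ?',
        undef, $start, $start + $page - 1,
    );
    last unless @$rows;             # empty page: all rows processed
    process_chunk($rows);
    $start += $page;
}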

ysth