views:

187

answers:

3

I'm working on a complex script that could be processing up to 500,000 records. Here's my question.

Basically, my code will parse a text file to get each of those 500,000 or so records. Each record has a category, and my code needs to check whether a row has already been created in the categories table for that category during that particular processing run; if not, it creates one.

So I have 2 options:

1) I store an array of key => value pairs mapping category name to ID, so I could do this:

if (array_key_exists($category, $allCategories)) {
    $id = $allCategories[$category];
} else {
    // Category not seen yet in this run: insert it and cache the new ID
    mysql_query("INSERT INTO categories (procId, category)
                        VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
    $allCategories[$category] = $id;
}

2) Each time this text file is processed, it gets its own process ID. So rather than checking the $allCategories variable, which could grow to 100,000+ entries, I could do this:

SELECT id FROM categories WHERE procId='$procId' AND category='$category'
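
In PHP, that per-record check would look something like this (a sketch; the insert-on-miss branch mirrors option 1):

$result = mysql_query("SELECT id FROM categories
                       WHERE procId='$procId' AND category='$category'");
if ($row = mysql_fetch_assoc($result)) {
    // Category already recorded for this processing run
    $id = $row['id'];
} else {
    // Not found: create it, same as in option 1
    mysql_query("INSERT INTO categories (procId, category)
                        VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
}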

The downside here is that this query would be run for each of the 500,000+ records. The disadvantage of holding all the categories in an array, on the other hand, is that I could run out of memory or the server could crash.

Any thoughts?

+2  A: 

Can you just keep a list of the IDs that you've already inserted? If they are integer IDs, at 4 bytes each, 100,000 entries would only use about 400K of memory.

ETA:

To avoid storing the category name, hash the name and store the hash. With a 128-bit MD5 hash, that's 16 bytes per hash, or only about 1.6 MB of memory + overhead.
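
For example (a sketch reusing the question's variables; md5()'s second argument requests the raw 16-byte digest instead of the 32-character hex string):

// Key the lookup array by the 16-byte raw MD5 of the category name
// instead of the full name
$key = md5($category, true);
if (isset($allCategories[$key])) {
    $id = $allCategories[$key];
} else {
    mysql_query("INSERT INTO categories (procId, category)
                        VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
    $allCategories[$key] = $id;
}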

Eric Petroelje
But I'd also need to store the category names, which are strings such as "Category A", "Category B", "Long name category C", etc. So it could add up. But if it's a better option than doing a query each time, I might go with it
Click Upvote
PHP takes about 68 bytes to store an integer. It isn't C. It stores everything in a C struct called a zval, and... well, you can't take sizeof(int) from C and assume it translates to PHP.
James Socol
@James - yeesh, I knew PHP would have extra overhead, but I had no idea it was that bad :)
Eric Petroelje
Eric, which hash function do you recommend? MD5?
Click Upvote
+1  A: 

One idea would be to add a unique constraint to the table so duplicate inserts are rejected by the database. Then just keep inserting all records and let the DB do the checking.
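
A minimal sketch of the idea, assuming the two-column table from the question (the key name uniq_proc_category is made up; INSERT IGNORE is standard MySQL):

// One-time setup: reject duplicate (procId, category) pairs at the DB level
mysql_query("ALTER TABLE categories
             ADD UNIQUE KEY uniq_proc_category (procId, category)");

// Then just insert every record; duplicates are silently skipped
mysql_query("INSERT IGNORE INTO categories (procId, category)
                    VALUES ('$procId', '$category')");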

Tahir Akhtar
Good idea, but each of the records needs to be linked to the appropriate category ID in the database. Also, there might be more than one record with the same category name but a different userId or procId (process ID)
Click Upvote
You can define a composite unique key on (category, userId, procId) so the database will only reject an insert when the exact same combination of those column values is already present in the table.
Tahir Akhtar
Does that work in mysql?
Click Upvote
CREATE TABLE `categories` (
  `category_id` int(11) NOT NULL default '0',
  `user_id` int(11) NOT NULL default '0',
  `proc_id` int(11) NOT NULL default '0',
  UNIQUE KEY `UniqueCategoryUserProc` (`category_id`, `user_id`, `proc_id`)
) ENGINE=InnoDB
Tahir Akhtar
+1  A: 

Given that your average category name is 30 bytes, you'd only need 30 * 500,000 bytes = 15,000,000 bytes = 15,000 kilobytes = 15 megabytes.

I think you have this much memory.
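
If you'd rather measure than estimate (PHP's per-value overhead makes the raw-string math a lower bound), a quick sketch with made-up category names:

// Rough measurement of a 100,000-entry name => ID array in PHP
$before = memory_get_usage();
$categories = array();
for ($i = 0; $i < 100000; $i++) {
    $categories["Sample category name $i"] = $i;  // ~25-byte keys
}
printf("Array used %.1f MB\n", (memory_get_usage() - $before) / 1048576);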

Georg