Let's put some numbers first: the largest of the lists is about 100M records (but is expected to grow up to 500M). The other lists (5-6 of them) are in the millions but will stay under 100M for the foreseeable future. These are always joined on a single id and never on any other parameter. What's the best algorithm to join such lists?

I was thinking along the lines of distributed computing: use a good hash function (the consistent-hashing kind, where you can add a node without a lot of data movement) to split these lists into several smaller files. Since they are always joined on the common id (which I will be hashing), it boils down to joining pairs of small files. And maybe use the *nix join command for that.
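To make that concrete, here's a rough sketch of what I have in mind. The bucket count, file names, and tab-separated "id, payload" layout are just assumptions for illustration, not the real data:

```python
# Minimal sketch of the bucket-and-join idea, not a tuned implementation.
# Assumptions: each input is a text file of "id<TAB>payload" lines;
# NUM_BUCKETS and the directory layout are hypothetical.
import os
import zlib

NUM_BUCKETS = 256  # assumption: enough buckets that each one fits in memory

def bucket_of(rec_id):
    """Stable hash so the same id lands in the same bucket across runs and lists."""
    return zlib.crc32(rec_id.encode("utf-8")) % NUM_BUCKETS

def partition(path, out_dir):
    """Split one list into NUM_BUCKETS files, routing each row by its id."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"bucket_{b:03d}.tsv"), "w")
            for b in range(NUM_BUCKETS)]
    with open(path) as f:
        for line in f:
            rec_id = line.split("\t", 1)[0]
            outs[bucket_of(rec_id)].write(line)
    for o in outs:
        o.close()

def join_bucket(path_a, path_b):
    """Hash-join two small bucket files; the smaller side is held in memory."""
    lookup = {}
    with open(path_a) as fa:
        for line in fa:
            rec_id, payload = line.rstrip("\n").split("\t", 1)
            lookup[rec_id] = payload
    with open(path_b) as fb:
        for line in fb:
            rec_id, payload = line.rstrip("\n").split("\t", 1)
            if rec_id in lookup:
                yield rec_id, lookup[rec_id], payload
```

Because the same stable hash is applied to every list, bucket N of one list only ever needs to be joined against bucket N of another, so each pairwise join stays small.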

A DB (at least MySQL) would use a merge join (since the join is on the primary key). Is that going to be more efficient than my approach?
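For comparison, this is roughly what I understand a merge join to do when both inputs arrive in primary-key order. Again just a sketch: it assumes a one-to-one join, both files already sorted by id, ids that compare consistently as strings (e.g. zero-padded), and hypothetical file names:

```python
# Sketch of a merge join over two id-sorted "id<TAB>payload" files.
def merge_join(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        line_a, line_b = fa.readline(), fb.readline()
        while line_a and line_b:
            id_a, payload_a = line_a.rstrip("\n").split("\t", 1)
            id_b, payload_b = line_b.rstrip("\n").split("\t", 1)
            if id_a == id_b:
                yield id_a, payload_a, payload_b
                line_a, line_b = fa.readline(), fb.readline()
            elif id_a < id_b:
                line_a = fa.readline()
            else:
                line_b = fb.readline()
```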

I know it's best to test and see, but given the magnitude of these files that's pretty time consuming. I would like to do some theoretical calculation first and then see how it fares in practice.

Any insights on these or other ideas would be helpful. I don't mind if it takes slightly longer, but I would prefer the best utilization of the resources I have. I don't have a huge budget :)

+5  A: 

Use a Database. They are designed for performing joins (with the right indexes of course!)

Mitch Wheat
But not unless I do some sharding. The numbers are pretty high, so I imagine I'll have to do the final merge across DBs externally.
neal aise
100 million rows is not really that big. The usual rule of thumb is to consider table partitioning at around 50 million rows.
Mitch Wheat