Let's put some numbers first: the largest of the lists is about 100M records (but is expected to grow up to 500M). The other lists (5-6 of them) are in the millions but will stay under 100M for the foreseeable future. These are always joined on a single id and never on any other parameter. What's the best algorithm to join such lists?

I was thinking along the lines of distributed computing: use a good hash function (the consistent-hashing kind, where you can add a node without a lot of data movement) to split these lists into several smaller files. Since they are always joined on the common id (which I will be hashing), it boils down to joining pairs of small files. And maybe use the *nix join command for that.
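To make that concrete, here's a rough sketch of what I have in mind. The bucket count, file names, and tab-separated "id, payload" layout are just assumptions for illustration, not the real data:

```python
# Minimal sketch of the bucket-and-join idea, not a tuned implementation.
# Assumptions: each input is a text file of "id<TAB>payload" lines;
# NUM_BUCKETS and the directory layout are hypothetical.
import os
import zlib

NUM_BUCKETS = 256  # assumption: enough buckets that each one fits in memory

def bucket_of(rec_id):
    """Stable hash so the same id lands in the same bucket across runs and lists."""
    return zlib.crc32(rec_id.encode("utf-8")) % NUM_BUCKETS

def partition(path, out_dir):
    """Split one list into NUM_BUCKETS files, routing each row by its id."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"bucket_{b:03d}.tsv"), "w")
            for b in range(NUM_BUCKETS)]
    with open(path) as f:
        for line in f:
            rec_id = line.split("\t", 1)[0]
            outs[bucket_of(rec_id)].write(line)
    for o in outs:
        o.close()

def join_bucket(path_a, path_b):
    """Hash-join two small bucket files; the smaller side is held in memory."""
    lookup = {}
    with open(path_a) as fa:
        for line in fa:
            rec_id, payload = line.rstrip("\n").split("\t", 1)
            lookup[rec_id] = payload
    with open(path_b) as fb:
        for line in fb:
            rec_id, payload = line.rstrip("\n").split("\t", 1)
            if rec_id in lookup:
                yield rec_id, lookup[rec_id], payload
```

Because the same stable hash is applied to every list, bucket N of one list only ever needs to be joined against bucket N of another, so each pairwise join stays small.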

A DB (at least MySQL) would use a merge join (since the join is on the primary key). Is that going to be more efficient than my approach?
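For comparison, this is roughly what I understand a merge join to do when both inputs arrive in primary-key order. Again just a sketch: it assumes a one-to-one join, both files already sorted by id, ids that compare consistently as strings (e.g. zero-padded), and hypothetical file names:

```python
# Sketch of a merge join over two id-sorted "id<TAB>payload" files.
def merge_join(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        line_a, line_b = fa.readline(), fb.readline()
        while line_a and line_b:
            id_a, payload_a = line_a.rstrip("\n").split("\t", 1)
            id_b, payload_b = line_b.rstrip("\n").split("\t", 1)
            if id_a == id_b:
                yield id_a, payload_a, payload_b
                line_a, line_b = fa.readline(), fb.readline()
            elif id_a < id_b:
                line_a = fa.readline()
            else:
                line_b = fb.readline()
```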

I know it's best to test and see, but given the magnitude of these files that's pretty time consuming. I would like to do some theoretical calculation first and then see how it fares in practice.

Any insights on these or other ideas would be helpful. I don't mind if it takes slightly longer, but I would prefer the best utilization of the resources I have. I don't have a huge budget :)

+5  A: 

Use a Database. They are designed for performing joins (with the right indexes of course!)

Mitch Wheat
But not unless I do some sharding. The numbers are pretty high, so I imagine I'll have to do the final merge across DBs externally.
neal aise
100 million rows is not really that big. The usual rule of thumb is to consider table partitioning at around 50 million rows.
Mitch Wheat