Following is the scenario and some proposed solutions. Are there any better approaches?
There is a system A which has to "analyse" lots of URLs. Another system B generates these URLs; there are currently about 10 million of them in a database. Sample schema:
id  URL      has_extracted
1   abc.com  0
2   bit.ly   1
My solutions are as follows:
Naive solution: have a Perl script/process which feeds each URL (from the database) to system A and updates the has_extracted column. The problem with this approach is that it does not scale well.
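For concreteness, here is a minimal sketch of what that naive feeder could look like (Perl with DBI; the DSN, credentials, column names, and the analyse_url() stand-in for the call into system A are all placeholders, not the real interfaces):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details -- substitute the real DSN/credentials.
my $dbh = DBI->connect('dbi:mysql:dbname=urls', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

# Stand-in for whatever call hands a URL to system A for analysis.
sub analyse_url { my ($url) = @_; return 1; }

my $select = $dbh->prepare(
    'SELECT id, url FROM urls WHERE has_extracted = 0 LIMIT 1000');
my $update = $dbh->prepare(
    'UPDATE urls SET has_extracted = 1 WHERE id = ?');

while (1) {
    $select->execute();
    my $rows = $select->fetchall_arrayref({});   # batch of unprocessed URLs
    last unless @$rows;                          # nothing left to do
    for my $row (@$rows) {
        analyse_url($row->{url});                # feed the URL to system A
        $update->execute($row->{id});            # mark it as extracted
    }
}
$dbh->disconnect;
```

Even with batched SELECTs, this is still a single process issuing one UPDATE per URL across 10 million rows, which is where it stops scaling.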
Solution 2: Split the database into five (or n) tables, so that n processes can each work on their own table in parallel; see the sketch below. (I am planning to remove the has_extracted column, because it seems to be such a scalability bottleneck in this scenario.)
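A rough sketch of a per-shard worker for solution 2, assuming the rows get split into tables urls_0 .. urls_4 (for example by id % 5) and that a row is deleted once it has been analysed; the sharding scheme and the delete-on-success rule are my assumptions here, not settled parts of the design:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical per-shard worker, invoked as "worker.pl <shard_number>".
# Assumes the 10 million rows were split into tables urls_0 .. urls_4
# (e.g. by id % 5) and that a row is deleted once analysed -- the delete
# is what makes the has_extracted column unnecessary.
my ($shard) = @ARGV;
my $table   = "urls_$shard";

my $dbh = DBI->connect('dbi:mysql:dbname=urls', 'user', 'password',
                       { RaiseError => 1 });

sub analyse_url { my ($url) = @_; return 1; }    # stand-in for system A

while (1) {
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, url FROM $table LIMIT 1000", { Slice => {} });
    last unless @$rows;                          # this shard is finished
    for my $row (@$rows) {
        analyse_url($row->{url});
        $dbh->do("DELETE FROM $table WHERE id = ?", undef, $row->{id});
    }
}
$dbh->disconnect;
```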
Solution 3: Remove the has_extracted column and create another table which maintains/tracks the last URL processed by each process; see the sketch below.
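And a sketch of solution 3, assuming a hypothetical worker_progress table with one checkpoint row per process and assuming each process is handed its own contiguous range of ids (how the ranges get assigned is left open):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical checkpoint worker, invoked as
# "worker.pl <worker_id> <last_id_in_range>".
# Assumes a tracking table seeded with one row per worker:
#   CREATE TABLE worker_progress (
#       worker_id   INT PRIMARY KEY,
#       last_url_id BIGINT NOT NULL DEFAULT 0
#   );
# and assumes each worker owns a contiguous range of ids ending at
# <last_id_in_range>, so checkpoints of different workers never overlap.
my ($worker_id, $last_id_in_range) = @ARGV;

my $dbh = DBI->connect('dbi:mysql:dbname=urls', 'user', 'password',
                       { RaiseError => 1 });

sub analyse_url { my ($url) = @_; return 1; }    # stand-in for system A

while (1) {
    my ($checkpoint) = $dbh->selectrow_array(
        'SELECT last_url_id FROM worker_progress WHERE worker_id = ?',
        undef, $worker_id);
    my $rows = $dbh->selectall_arrayref(
        'SELECT id, url FROM urls WHERE id > ? AND id <= ?
         ORDER BY id LIMIT 1000',
        { Slice => {} }, $checkpoint, $last_id_in_range);
    last unless @$rows;                          # this range is finished
    analyse_url($_->{url}) for @$rows;
    # One checkpoint update per batch instead of one UPDATE per URL.
    $dbh->do('UPDATE worker_progress SET last_url_id = ? WHERE worker_id = ?',
             undef, $rows->[-1]{id}, $worker_id);
}
$dbh->disconnect;
```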
Critiques and proposed solutions are welcome. Thanks in advance.