Following is the scenario and some proposed solutions. Are there any better solutions?

There is a system A which has to "analyse" lots of URLs. Another system B generates these URLs - currently there are about 10 million of them in a database. Sample schema:

id   URL       has_extracted
1    abc.com   0
2    bit.ly    1

My solutions are as follows:

Naive solution: Have a Perl script/process which feeds each URL (from the database) to system A and updates the has_extracted column. The problem with this approach is that it does not scale well.

Solution 2: Split up the database into five (or n) tables. (I am planning to remove the has_extracted column because it seems to be a scalability bottleneck in this scenario.)

Solution 3: Remove the has_extracted column and create another table which maintains/tracks the last URL processed by each process (a rough sketch follows below).
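A minimal sketch of what Solution 3 might look like, assuming Python with the standard sqlite3 module, a urls(id, url) table, and a hypothetical worker_progress checkpoint table (all file, table, and column names are illustrative, not from the actual system):

    import sqlite3

    conn = sqlite3.connect("urls.db")          # hypothetical database file

    # Hypothetical checkpoint table replacing has_extracted: one row per
    # worker process, recording the last URL id that worker has finished.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS worker_progress (
            worker_id   INTEGER PRIMARY KEY,
            last_url_id INTEGER NOT NULL DEFAULT 0
        )
    """)
    conn.commit()

    def next_batch(worker_id, batch_size=1000):
        """Fetch the next batch of URLs after this worker's checkpoint."""
        row = conn.execute(
            "SELECT last_url_id FROM worker_progress WHERE worker_id = ?",
            (worker_id,)).fetchone()
        last_id = row[0] if row else 0
        return conn.execute(
            "SELECT id, url FROM urls WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()

    def checkpoint(worker_id, last_url_id):
        """Record the last URL id this worker has processed."""
        conn.execute(
            "INSERT OR REPLACE INTO worker_progress (worker_id, last_url_id) "
            "VALUES (?, ?)", (worker_id, last_url_id))
        conn.commit()

Each worker would still need its own disjoint slice of the id space (for example a fixed id range per worker) so that two workers never fetch the same rows.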

Critiques/Proposed solutions requested. Thanks in advance.

A: 

Why doesn't your naive solution scale well? If you use bulk updates and commit infrequently, you can update a million rows per second on just about any database, without any tuning.
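To illustrate the point about bulk updates and infrequent commits, a rough sketch (assuming Python with the standard sqlite3 module and the urls table from the question; the file name and batch size are made up):

    import sqlite3

    conn = sqlite3.connect("urls.db")          # hypothetical database file

    def mark_extracted(processed_ids, batch_size=10000):
        """Mark URLs as extracted in large batches, committing once per
        batch instead of once per row."""
        batch = []
        for url_id in processed_ids:
            batch.append((url_id,))
            if len(batch) >= batch_size:
                conn.executemany(
                    "UPDATE urls SET has_extracted = 1 WHERE id = ?", batch)
                conn.commit()      # one commit per 10,000 rows, not per row
                batch.clear()
        if batch:
            conn.executemany(
                "UPDATE urls SET has_extracted = 1 WHERE id = ?", batch)
            conn.commit()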

If you want to run multiple instances of system A, you can use a hash function to divide the input data into groups, where each instance of system A consumes exactly one group.

If you have a constant number of instances of system A, e.g. 17, you can use the function id%17 as the hash function.
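For instance, a sketch of how each of the 17 instances could select only its own slice of the table (Python with sqlite3; the file name and the has_extracted filter are assumptions based on the question's schema):

    import sqlite3

    N_INSTANCES = 17               # constant number of instances of system A

    def urls_for_instance(instance_no):
        """Yield only the URLs in this instance's partition, using
        id % N_INSTANCES as the hash function."""
        conn = sqlite3.connect("urls.db")      # hypothetical database file
        cur = conn.execute(
            "SELECT id, url FROM urls "
            "WHERE id % ? = ? AND has_extracted = 0",
            (N_INSTANCES, instance_no))
        for url_id, url in cur:
            yield url_id, url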

Shachar
Then we are running only one instance of system A, whereas the resources allow for much more!
Bart J
Thanks, Shachar. The trouble with using the modulo operator "%" to divide the tasks among processes is that some of the URLs will have to be run again on system A. For example: if I initially run with 10 processes and then add more hardware and decide to run with 20... then won't system A analyse some of the same URLs again?
Bart J
Not if you note that those URLs were already processed, using the has_extracted column (or another, dedicated column).
Shachar
My current implementation is close to what Shachar has suggested, but I still think there is room for improvement - perhaps using messaging (inspired by this: http://s-expressions.com/2009/04/12/using-messaging-for-scalability/).
Bart J
A: 

I think this can be organised as follows:

  1. URL generator (one or many)
  2. URL stack (one)
  3. URL processor (many)

The URL generator(s) generate URLs and push all of them onto the stack, which can live in a database, in memory, or wherever you want.

URL processors ask the URL stack for the next URL to process. The URL stack hands out a URL and marks it as given (or deletes it). When a URL processor has finished processing a URL, it consults the URL stack again and says that it has finished URL1 and wants URL2. The URL stack can then mark/delete URL1 from its list and hand out URL2.
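A minimal sketch of that hand-out protocol, assuming the stack is kept in a SQLite table with a made-up status column ('new', 'given', 'done'); a real implementation would need proper transactions/locking so two processors cannot grab the same URL:

    import sqlite3

    conn = sqlite3.connect("url_stack.db")     # hypothetical stack storage

    conn.execute("""
        CREATE TABLE IF NOT EXISTS url_stack (
            id     INTEGER PRIMARY KEY,
            url    TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'new'   -- 'new' | 'given' | 'done'
        )
    """)
    conn.commit()

    def take_next_url():
        """Hand out one URL and mark it as given."""
        row = conn.execute(
            "SELECT id, url FROM url_stack WHERE status = 'new' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE url_stack SET status = 'given' WHERE id = ?", (row[0],))
        conn.commit()
        return row

    def finish_and_take_next(finished_id):
        """Report one URL as done and get the next one in a single call."""
        conn.execute(
            "UPDATE url_stack SET status = 'done' WHERE id = ?", (finished_id,))
        conn.commit()
        return take_next_url()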

If the URL stack becomes a bottleneck, you can just cluster the database.

FractalizeR
Thanks, FractalizeR. Assume there are two URL processors, X and Y. If the URL stack gives URL100 to X, and X does not finish it (goes into a weird state, takes a long time to complete, etc.), then wouldn't the same URL100 be given to Y, because X has not marked that URL as completed?
Bart J
This is no different from using the database in the first place, with the has_extracted column (you could add another was_picked_up_for_analysis column, though).
Shachar
To Bart J: It depends on what you want. Mark URLs as given, and only give them out again after some time has passed and they are still not marked as completed (sketched below).
FractalizeR
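Sketching FractalizeR's suggestion (assuming the hypothetical url_stack table from the earlier sketch, plus an extra given_at timestamp column; the lease length is made up): a URL that was handed out is only handed out again once a lease timeout has expired and it is still not marked as done.

    import time

    LEASE_SECONDS = 600    # re-issue a 'given' URL after this long without completion

    def take_next_url_with_lease(conn):
        """Hand out a 'new' URL, or re-issue a 'given' URL whose lease has
        expired without the URL being marked as done."""
        now = time.time()
        row = conn.execute(
            "SELECT id, url FROM url_stack "
            "WHERE status = 'new' "
            "   OR (status = 'given' AND given_at < ?) "
            "LIMIT 1", (now - LEASE_SECONDS,)).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE url_stack SET status = 'given', given_at = ? WHERE id = ?",
            (now, row[0]))
        conn.commit()
        return row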
A: 

I somehow feel my problem is similar to the one posted at this link (an extract is provided below). The solution in the aforementioned link, and this link - "Databases suck for messaging" - have given me a better direction for implementing a better solution.

Extract: So you want to build a system that does jobs. You want the jobs to be able to be run in parallel for speed, but also for redundancy. This system needs to be coordinated so, for example, the same jobs aren't being done twice, the status of every job is easy to see, and multiple servers can run jobs by simply querying the central source.

Bart J