Following is the scenario and some proposed solutions. Are there any better approaches?
There is a system A which has to "analyse" lots of URLs. Another system B generates these URLs; there are currently about 10 million of them in a database. Sample schema:
id  URL      has_extracted
1   abc.com  0
2   bit.ly   1
My solutions are as follows:
Naive solution: have a Perl script/process which feeds each URL (from the database) to system A and updates the has_extracted column. The problem with this approach is that it does not scale well.
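For concreteness, here is a minimal sketch of what that naive feeder could look like (Perl with DBI; the DSN, credentials, column names, and the analyse_url() stand-in for the call into system A are all placeholders, not the real interfaces):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details -- substitute the real DSN/credentials.
my $dbh = DBI->connect('dbi:mysql:dbname=urls', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

# Stand-in for whatever call hands a URL to system A for analysis.
sub analyse_url { my ($url) = @_; return 1; }

my $select = $dbh->prepare(
    'SELECT id, url FROM urls WHERE has_extracted = 0 LIMIT 1000');
my $update = $dbh->prepare(
    'UPDATE urls SET has_extracted = 1 WHERE id = ?');

while (1) {
    $select->execute();
    my $rows = $select->fetchall_arrayref({});   # batch of unprocessed URLs
    last unless @$rows;                          # nothing left to do
    for my $row (@$rows) {
        analyse_url($row->{url});                # feed the URL to system A
        $update->execute($row->{id});            # mark it as extracted
    }
}
$dbh->disconnect;
```

Even with batched SELECTs, this is still a single process issuing one UPDATE per URL across 10 million rows, which is where it stops scaling.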
Solution 2: Split the database into five (or n) tables, so that n processes can each work on their own table in parallel; see the sketch below. (I am planning to remove the has_extracted column, because it seems to be such a scalability bottleneck in this scenario.)
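A rough sketch of a per-shard worker for solution 2, assuming the rows get split into tables urls_0 .. urls_4 (for example by id % 5) and that a row is deleted once it has been analysed; the sharding scheme and the delete-on-success rule are my assumptions here, not settled parts of the design:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical per-shard worker, invoked as "worker.pl <shard_number>".
# Assumes the 10 million rows were split into tables urls_0 .. urls_4
# (e.g. by id % 5) and that a row is deleted once analysed -- the delete
# is what makes the has_extracted column unnecessary.
my ($shard) = @ARGV;
my $table   = "urls_$shard";

my $dbh = DBI->connect('dbi:mysql:dbname=urls', 'user', 'password',
                       { RaiseError => 1 });

sub analyse_url { my ($url) = @_; return 1; }    # stand-in for system A

while (1) {
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, url FROM $table LIMIT 1000", { Slice => {} });
    last unless @$rows;                          # this shard is finished
    for my $row (@$rows) {
        analyse_url($row->{url});
        $dbh->do("DELETE FROM $table WHERE id = ?", undef, $row->{id});
    }
}
$dbh->disconnect;
```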
Solution 3: Remove the has_extracted column and create another table which maintains/tracks the last URL processed by each process; see the sketch below.
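And a sketch of solution 3, assuming a hypothetical worker_progress table with one checkpoint row per process and assuming each process is handed its own contiguous range of ids (how the ranges get assigned is left open):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical checkpoint worker, invoked as
# "worker.pl <worker_id> <last_id_in_range>".
# Assumes a tracking table seeded with one row per worker:
#   CREATE TABLE worker_progress (
#       worker_id   INT PRIMARY KEY,
#       last_url_id BIGINT NOT NULL DEFAULT 0
#   );
# and assumes each worker owns a contiguous range of ids ending at
# <last_id_in_range>, so checkpoints of different workers never overlap.
my ($worker_id, $last_id_in_range) = @ARGV;

my $dbh = DBI->connect('dbi:mysql:dbname=urls', 'user', 'password',
                       { RaiseError => 1 });

sub analyse_url { my ($url) = @_; return 1; }    # stand-in for system A

while (1) {
    my ($checkpoint) = $dbh->selectrow_array(
        'SELECT last_url_id FROM worker_progress WHERE worker_id = ?',
        undef, $worker_id);
    my $rows = $dbh->selectall_arrayref(
        'SELECT id, url FROM urls WHERE id > ? AND id <= ?
         ORDER BY id LIMIT 1000',
        { Slice => {} }, $checkpoint, $last_id_in_range);
    last unless @$rows;                          # this range is finished
    analyse_url($_->{url}) for @$rows;
    # One checkpoint update per batch instead of one UPDATE per URL.
    $dbh->do('UPDATE worker_progress SET last_url_id = ? WHERE worker_id = ?',
             undef, $rows->[-1]{id}, $worker_id);
}
$dbh->disconnect;
```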
Critiques and proposed solutions are welcome. Thanks in advance.