Which is better for building a URL queue in large scale web crawler. Linked-list or or B-tree?
+1
A:
If you don't need to search the queue (and queues don't generally need to be searched), then a linked list.
Mitch Wheat
2009-06-02 02:46:44
yes my queue should work more like a stack with push and pop. but since its going to handle thousands of url, i need a very fast implementation. and is it possible to FIFO on linked list?
kar
2009-06-02 04:02:44
Yes. You need to use a head and tail pointer. Insert at tail and remove from head.
Mitch Wheat
2009-06-02 04:31:06
+1
A:
If the order is important (and queues are), then a linked-list. If you need to search the queue, then B-tree.
Scott Ferguson
2009-06-02 03:00:11
search not needed as it works like a stack, only i need a non-duplicate aware list. is there a special linked-list for this?
kar
2009-06-02 04:04:32
+1
A:
If you're building a large-scale crawler, you almost certainly want to be using something like an AMQP message queue, most probably RabbitMQ. RabbitMQ (and many other similar MQs) will do upwards of 100,000 transactions per second with a pretty normal installation. I use it in my own spider/crawler setup, and it works like a charm. Certainly a whole lot easier than building something similar from scratch.
Bob Aman
2009-07-15 18:44:25
Incidentally, most advanced high speed message queues do use a linked list internally, with pointers to both the head and tail. Also sometimes reference pointers to other locations within the queue. Really depends on the MQ's feature set. AMQP 1.0, for example defines the concept of "links" and links have to maintain their own pointers into the queue.
Bob Aman
2009-07-20 21:35:30