tags:

views:

126

answers:

3

Which is better for building a URL queue in large scale web crawler. Linked-list or or B-tree?

+1  A: 

If you don't need to search the queue (and queues don't generally need to be searched), then a linked list.

Mitch Wheat
yes my queue should work more like a stack with push and pop. but since its going to handle thousands of url, i need a very fast implementation. and is it possible to FIFO on linked list?
kar
Yes. You need to use a head and tail pointer. Insert at tail and remove from head.
Mitch Wheat
+1  A: 

If the order is important (and queues are), then a linked-list. If you need to search the queue, then B-tree.

Scott Ferguson
search not needed as it works like a stack, only i need a non-duplicate aware list. is there a special linked-list for this?
kar
+1  A: 

If you're building a large-scale crawler, you almost certainly want to be using something like an AMQP message queue, most probably RabbitMQ. RabbitMQ (and many other similar MQs) will do upwards of 100,000 transactions per second with a pretty normal installation. I use it in my own spider/crawler setup, and it works like a charm. Certainly a whole lot easier than building something similar from scratch.

Bob Aman
Incidentally, most advanced high speed message queues do use a linked list internally, with pointers to both the head and tail. Also sometimes reference pointers to other locations within the queue. Really depends on the MQ's feature set. AMQP 1.0, for example defines the concept of "links" and links have to maintain their own pointers into the queue.
Bob Aman