tags:

views: 46

answers: 2

I'm writing a custom-built crawler and need to know whether a specific URL has already been crawled, so I won't add the same URL twice. Right now I'm using MySQL to store a hash value for each URL, but I'm wondering whether this may become very slow once I have a large set of URLs, say hundreds of millions.

Are there other ways to store URLs? Do people use Lucene for this? Or is there a specific data structure for the job?
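For intuition on the scheme the question describes, here is a minimal in-memory sketch in Java of "store a hash per URL and skip repeats" (the class and method names are my own, purely illustrative; a 64-bit digest is assumed, and with HashSet overhead each entry costs tens of bytes, which is why a disk-backed store becomes relevant at hundreds of millions of URLs):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// In-memory version of the "store a hash per URL" scheme from the question:
// keep a 64-bit digest of each URL in a set and skip URLs already seen.
public class SeenUrls {
    private final Set<Long> seen = new HashSet<>();

    // Returns true the first time a URL is offered, false on repeats.
    public boolean markSeen(String url) {
        return seen.add(hash64(url));
    }

    // First 8 bytes of the SHA-1 digest as a long. Collision odds are
    // negligible even at hundreds of millions of URLs; memory, not
    // hashing, is the limiting factor for this approach.
    private static long hash64(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available
        }
    }
}
```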

+1  A: 

You have not specified your development platform, but there is a really good data structure called a trie (http://en.wikipedia.org/wiki/Trie); there are lots of implementations in Java, C++, C#, ...

Dewfy
I use Java for the crawler.
http://stackoverflow.com/questions/623892/where-do-i-find-a-standard-trie-based-map-implementation-in-java describes where you can get an implementation.
Dewfy
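A character-level trie for URL deduplication might look roughly like this in Java (a hand-rolled sketch, not one of the library implementations from the links above; it shares common prefixes such as "http://example.com/" across stored URLs):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal character-level trie for checking whether a URL was already seen.
public class UrlTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored URL ends at this node
    }

    private final Node root = new Node();

    // Returns true if the URL was newly added, false if already present.
    public boolean add(String url) {
        Node cur = root;
        for (int i = 0; i < url.length(); i++) {
            cur = cur.children.computeIfAbsent(url.charAt(i), c -> new Node());
        }
        if (cur.terminal) {
            return false;
        }
        cur.terminal = true;
        return true;
    }

    public boolean contains(String url) {
        Node cur = root;
        for (int i = 0; i < url.length(); i++) {
            cur = cur.children.get(url.charAt(i));
            if (cur == null) {
                return false;
            }
        }
        return cur.terminal;
    }
}
```

Since URLs share long prefixes (scheme, host, path segments), the trie stores each shared prefix once, which can be much more compact than a flat set of full URL strings.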
A: 

You may want to try Berkeley DB, an embedded key/value store, so lookups avoid the round-trip and SQL overhead of MySQL.

ced