tags:

views: 46

answers: 2

I'm writing a custom-built crawler and need to know whether a specific URL has already been crawled, so I won't add the same URL twice. Right now I'm using MySQL to store a hash value for each URL, but I'm wondering whether this may become very slow once I have a large set of URLs, say hundreds of millions.

Are there other ways to store URLs? Do people use Lucene for this? Or is there a specific data structure for the job?
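For intuition on the scheme the question describes, here is a minimal in-memory sketch in Java of "store a hash per URL and skip repeats" (the class and method names are my own, purely illustrative; a 64-bit digest is assumed, and with HashSet overhead each entry costs tens of bytes, which is why a disk-backed store becomes relevant at hundreds of millions of URLs):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// In-memory version of the "store a hash per URL" scheme from the question:
// keep a 64-bit digest of each URL in a set and skip URLs already seen.
public class SeenUrls {
    private final Set<Long> seen = new HashSet<>();

    // Returns true the first time a URL is offered, false on repeats.
    public boolean markSeen(String url) {
        return seen.add(hash64(url));
    }

    // First 8 bytes of the SHA-1 digest as a long. Collision odds are
    // negligible even at hundreds of millions of URLs; memory, not
    // hashing, is the limiting factor for this approach.
    private static long hash64(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available
        }
    }
}
```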

+1  A: 

You have not specified your development platform, but there is a really good data structure called a trie (http://en.wikipedia.org/wiki/Trie); there are lots of implementations in Java, C++, C#, ...

Dewfy
I use Java for the crawler.
http://stackoverflow.com/questions/623892/where-do-i-find-a-standard-trie-based-map-implementation-in-java describes where you can get an implementation.
Dewfy
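A character-level trie for URL deduplication might look roughly like this in Java (a hand-rolled sketch, not one of the library implementations from the links above; it shares common prefixes such as "http://example.com/" across stored URLs):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal character-level trie for checking whether a URL was already seen.
public class UrlTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored URL ends at this node
    }

    private final Node root = new Node();

    // Returns true if the URL was newly added, false if already present.
    public boolean add(String url) {
        Node cur = root;
        for (int i = 0; i < url.length(); i++) {
            cur = cur.children.computeIfAbsent(url.charAt(i), c -> new Node());
        }
        if (cur.terminal) {
            return false;
        }
        cur.terminal = true;
        return true;
    }

    public boolean contains(String url) {
        Node cur = root;
        for (int i = 0; i < url.length(); i++) {
            cur = cur.children.get(url.charAt(i));
            if (cur == null) {
                return false;
            }
        }
        return cur.terminal;
    }
}
```

Since URLs share long prefixes (scheme, host, path segments), the trie stores each shared prefix once, which can be much more compact than a flat set of full URL strings.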
A: 

You may want to try Berkeley DB, an embedded key/value store, so lookups avoid the round-trip and SQL overhead of MySQL.

ced