I just had this thought and was wondering: is it possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps)?
I've come across a paper where this was done, but I can't recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.
Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl... is that possible?
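To make the question concrete, here's the kind of naive frontier crawl I'm picturing (a rough Python sketch; "seeds.txt" is just a placeholder for my seed list, and I realize the in-memory `seen` set is exactly the thing that won't scale to billions of URLs on 8 GB of RAM):

```python
# Naive breadth-first web crawl: pop a URL from the frontier,
# fetch it, extract links, push unseen links back onto the frontier.
import collections
import re
import urllib.parse

import requests

LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, max_pages=100_000):
    frontier = collections.deque(seed_urls)
    seen = set(seed_urls)  # this grows without bound -- the core problem
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # dead or slow link, skip it
        pages += 1
        # Extract outgoing links, resolve them against the current URL,
        # drop fragments, and enqueue anything we haven't seen yet.
        for href in LINK_RE.findall(resp.text):
            link = urllib.parse.urljoin(url, href)
            link, _ = urllib.parse.urldefrag(link)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

if __name__ == "__main__":
    with open("seeds.txt") as f:  # placeholder: ~10,000 seed URLs, one per line
        seeds = [line.strip() for line in f if line.strip()]
    print("fetched", crawl(seeds), "pages")
```

Even this toy version makes the bottlenecks obvious: the frontier and the seen-set blow past RAM, a single fetch loop can't saturate 100 Mbps, and there's no politeness/robots.txt handling. What I want to know is whether clever engineering (disk-backed frontier, Bloom filters, async fetching, or that statistical model from the paper) gets around this on one box.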
I need to crawl the web but am limited to a single dedicated server. How can I do this? Is there an open-source solution out there already?
For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing it?