I just had this thought and was wondering: is it possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps)?
I've come across a paper where this was done, but I can't recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.
Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl... is that possible?
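To make the question concrete, here's the kind of naive frontier crawl I'm picturing (a rough Python sketch; "seeds.txt" is just a placeholder for my seed list, and I realize the in-memory `seen` set is exactly the thing that won't scale to billions of URLs on 8 GB of RAM):

```python
# Naive breadth-first web crawl: pop a URL from the frontier,
# fetch it, extract links, push unseen links back onto the frontier.
import collections
import re
import urllib.parse

import requests

LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, max_pages=100_000):
    frontier = collections.deque(seed_urls)
    seen = set(seed_urls)  # this grows without bound -- the core problem
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # dead or slow link, skip it
        pages += 1
        # Extract outgoing links, resolve them against the current URL,
        # drop fragments, and enqueue anything we haven't seen yet.
        for href in LINK_RE.findall(resp.text):
            link = urllib.parse.urljoin(url, href)
            link, _ = urllib.parse.urldefrag(link)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

if __name__ == "__main__":
    with open("seeds.txt") as f:  # placeholder: ~10,000 seed URLs, one per line
        seeds = [line.strip() for line in f if line.strip()]
    print("fetched", crawl(seeds), "pages")
```

Even this toy version makes the bottlenecks obvious: the frontier and the seen-set blow past RAM, a single fetch loop can't saturate 100 Mbps, and there's no politeness/robots.txt handling. What I want to know is whether clever engineering (disk-backed frontier, Bloom filters, async fetching, or that statistical model from the paper) gets around this on one box.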
I need to crawl the web but am limited to a single dedicated server. How can I do this? Is there an open-source solution out there already?
For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing it?