views: 542
answers: 8

Is it possible to crawl billions of pages on a single server?

+1  A: 

Patiently, sure. If you can crawl hundreds of pages, crawling billions is merely a matter of time and resources.

Jonathan Sampson
+1  A: 

Sure, given enough time and storage space; more hardware only speeds up the process. If you want to do it in hours or days, though, it's probably not going to happen with just one server.

GrayWizardx
+1  A: 

Hmm... if you can "crawl" 1 page per second, that totals 86,400 pages per day, so your first billion would take about 11,574 days (use the same arithmetic to estimate the time for your own pages-per-second rate). Patience is required... and of course the storage space.
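
A quick back-of-the-envelope sketch of that arithmetic (the pages-per-second rates below are just assumed parameters to plug your own numbers into):

    # Rough crawl-time estimate; the rates are assumptions, not measurements.
    SECONDS_PER_DAY = 86_400

    def days_to_crawl(total_pages: int, pages_per_second: float) -> float:
        """Days needed to fetch total_pages at a sustained pages_per_second rate."""
        return total_pages / (pages_per_second * SECONDS_PER_DAY)

    if __name__ == "__main__":
        for rate in (1, 10, 100):  # assumed pages-per-second rates
            print(f"{rate:>4} page(s)/s -> {days_to_crawl(1_000_000_000, rate):,.0f} days")

At 1 page per second this prints roughly 11,574 days; at 100 pages per second it drops to about 116 days.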

Madi D.
Some of us object to the asinine use of "u" and "ur". If you don't have the time to spare to insert a couple of extra characters for conventional and correct spelling, then don't bother answering.
Carl Smotricz
Oops, sorry about that :) I normally notice such stuff, but after 26 hours working straight I can hardly focus. -Edited and fixed- :)
Madi D.
Thank you. +1 for a refreshingly simple way to put the size of the problem into perspective. Clever multitasking could bring the time down by maybe a factor of 100 but it's still a long time.
Carl Smotricz
Of course, 11,574 days is just over 31 years. You'd want to go a bit faster than one per second.
Barry Brown
@Carl: as Barry calculated, 11,574 days is around 31 years; divided by 100, that would make it about 4 months. With the insane rate at which new web pages appear, even multitasking wouldn't be sufficient!
Madi D.
@Barry: of course he'd have to go a bit faster than one per second! Throw in "multitasking" as suggested by Carl and you'd have a fine system; I'd also recommend a couple of computer farms (Google-style) to cope with the huge data storage needed... :)
Madi D.
A: 

Yes, it is possible. However, the internet has no real end, as new data is being added every moment.

Most probably, a forked program with each process spidering different pages will help you get around network latency.
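
A minimal sketch of that idea, assuming a pool of worker processes each fetching different URLs (the seed URLs and pool size are placeholders):

    # Multi-process fetch sketch; seed URLs and pool size are illustrative only.
    from multiprocessing import Pool
    from urllib.request import urlopen

    def fetch(url: str) -> tuple:
        """Fetch one page and return (url, bytes received); -1 marks a failure."""
        try:
            with urlopen(url, timeout=10) as resp:
                return url, len(resp.read())
        except Exception:
            return url, -1

    if __name__ == "__main__":
        seed_urls = ["http://example.com/", "http://example.org/"]  # placeholder seeds
        with Pool(processes=8) as pool:  # several workers hide per-request latency
            for url, size in pool.imap_unordered(fetch, seed_urls):
                print(url, size)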

Alan Haggai Alavi
A: 

You can, but it would take quite some time, and it's probably unrealistic with just one server for a number of reasons. For example, the web is continuously updated, so if you're only using one server, your index will be severely outdated by the time you're done crawling. Also, if you assume 50 web pages fit in 1 MB (about 20 KB each), it would take somewhere around 20 terabytes to store 1 billion web pages. You're probably not going to get enough disk space in one machine to accomplish that.
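
Spelling out that storage arithmetic (the roughly 20 KB average page size is the assumption implied by 50 pages per MB):

    # Storage estimate for one billion pages at an assumed ~20 KB per page.
    pages = 1_000_000_000
    avg_page_bytes = 1_000_000 // 50       # 50 pages per MB -> 20,000 bytes per page
    total_bytes = pages * avg_page_bytes
    print(f"{total_bytes / 1e12:.0f} TB uncompressed")   # prints "20 TB uncompressed"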

Ben McCann
+13  A: 

Large-scale spidering (a billion pages) is a difficult problem. Here are some of the issues:

  • Network bandwidth. Assuming that each page is 10 KB, you are talking about a total of 10 terabytes to be fetched.

  • Network latency / slow servers / congestion mean that you are not going to achieve anything like the theoretical bandwidth of your network connection. Multi-threading your crawler only helps so much.

  • I assume that you need to store the information you have extracted from the billions of pages.

  • Your HTML parser needs to deal with web pages that are broken in all sorts of strange ways.

  • To avoid getting stuck in loops, you need to detect that you've "done this page already".

  • Pages change so you need to revisit them.

  • You need to deal with 'robots.txt' and other conventions that govern the behavior of (well-behaved) crawlers; a minimal sketch of this and the "done already" check appears after this list.
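
A minimal sketch of those last two points, assuming Python's standard-library urllib.robotparser and a simple in-memory seen set (the user-agent string is a placeholder):

    # Polite-crawler sketch: duplicate detection plus robots.txt checks.
    # The user-agent string is a placeholder, not a recommendation.
    from urllib.parse import urlparse
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "example-crawler"
    seen = set()            # "done this page already" check
    robots_cache = {}       # one parsed robots.txt per host

    def allowed(url: str) -> bool:
        """Check robots.txt for the URL's host, caching one parser per host."""
        parts = urlparse(url)
        host = f"{parts.scheme}://{parts.netloc}"
        if host not in robots_cache:
            rp = RobotFileParser(host + "/robots.txt")
            try:
                rp.read()   # a missing robots.txt is treated as "allow" by the parser
            except OSError:
                pass        # unreachable host: can_fetch() stays conservative (False)
            robots_cache[host] = rp
        return robots_cache[host].can_fetch(USER_AGENT, url)

    def crawl_once(url: str):
        """Fetch a page at most once, and only if robots.txt permits it."""
        if url in seen or not allowed(url):
            return None
        seen.add(url)
        with urlopen(url, timeout=10) as resp:
            return resp.read()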

Stephen C
+15  A: 
Dave Quick
This is by far the best answer, unlike all the sarcastic and otherwise childish answers.
gpow
+1  A: 

The original paper by Brin and Page (Google, 1998) described crawling 25 million pages on 4 machines in 10 days, with 300 connections open at a time per machine. I think this is still pretty good. In my own experiments with off-the-shelf machines running Linux, I could reliably open 100-200 simultaneous connections.

There are three main things you need to do while crawling: (1) choose what to crawl next, (2) get those pages, and (3) store those pages.

For (1), you need to implement some kind of priority queue (e.g., to do breadth-first search or OPIC), and you also need to keep track of where you have been. This can be done using a Bloom filter. Bloom filters (look them up on Wikipedia) can also be used to record whether a page had a robots.txt file and whether a prefix of a given URL is excluded.

For (2), getting the pages is a fixed cost and you can't do much about it; however, since on one machine you are limited by the number of open connections, if you have cable you probably won't come close to eating all the available bandwidth. You might have to worry about bandwidth caps, though.

For (3), storing the pages is typically done in a web archive file, like what the Internet Archive does. With compression, you can probably store a billion pages in 7 terabytes, so storage-wise it would be affordable to have a billion pages.

As an estimate of what one machine can do, suppose you get a cheap $200 machine with 1 GB of RAM and a 160 GB hard drive. At 20 KB a page (use Range requests to avoid swallowing big pages whole), 10 million pages would take 200 GB, but compressed that's about 70 GB. If you keep an archive that your search engine runs off (on which you have already calculated, say, PageRank and BM25) plus an active crawl archive, you've consumed 140 GB. That leaves you about 20 GB for the other random stuff you need to handle. If you work out the memory needed to keep as much of your priority queue and the Bloom filters in RAM as possible, you are also right at the edge of what's possible. If you crawl 300,000 pages per day, it'll take you slightly over a month per 10-million-page crawl.
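
As an illustration of the Bloom filter idea mentioned above, here is a tiny sketch; the bit-array size and hash count are made-up parameters and would need to be sized for the number of URLs and the false-positive rate you can tolerate:

    # Tiny Bloom filter sketch for "have I crawled this URL?" checks.
    # m_bits and k_hashes are illustrative, not tuned values.
    import hashlib

    class BloomFilter:
        def __init__(self, m_bits: int = 8 * 1024 * 1024, k_hashes: int = 7):
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item: str):
            # Derive k bit positions from one SHA-256 digest of the item.
            digest = hashlib.sha256(item.encode("utf-8")).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

        def add(self, item: str) -> None:
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: str) -> bool:
            # False means "definitely not seen"; True means "probably seen".
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    seen = BloomFilter()
    seen.add("http://example.com/")
    print("http://example.com/" in seen)   # True
    print("http://example.org/" in seen)   # almost certainly False

An 8-megabit filter like this fits in 1 MB of RAM; a real crawler would size it from the expected URL count so that the whole structure stays memory-resident, as the answer suggests.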

Chris Pollett