I'm interested to know, in a very general situation (a home-brew amateur web crawler), what kind of performance to expect. More specifically, how many pages can such a crawler process?

When I say home-brew, take that in every sense: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, etc., etc.

Any resources you can share on this would be greatly appreciated.

Thanks a lot,

Carlos

A: 

Something very useful for a web crawler is PHP's fork function:

pcntl_fork

You can then hand off some of the tasks your crawler has to do to child processes.

pcntl_fork will increase the speed of your crawler if you apply it the right way!

pcntl_fork lets you split your work across multiple processes; right now a single process is running and has to do everything written in your crawling script.

I saw you're using Java; maybe there is a way to get something like the fork function there.
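
In Java the usual equivalent is a thread pool rather than forked processes. A minimal sketch, assuming you just want several pages downloading at once (the class name, seed URLs and pool size are placeholders I made up, not anything from the question):

    import java.net.URL;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Minimal sketch: a fixed pool of worker threads, each fetching one URL at a time.
    public class CrawlerPool {
        public static void main(String[] args) throws InterruptedException {
            List<String> seeds = List.of("http://example.com/", "http://example.org/");
            ExecutorService pool = Executors.newFixedThreadPool(8); // tune to your bandwidth

            for (String seed : seeds) {
                pool.submit(() -> {
                    try (var in = new URL(seed).openStream()) {
                        byte[] page = in.readAllBytes();  // download the page
                        System.out.println(seed + ": " + page.length + " bytes");
                        // parse out links here and submit them back to the pool
                    } catch (Exception e) {
                        System.err.println("Failed " + seed + ": " + e.getMessage());
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

Threads share memory, so a shared queue of discovered URLs is simpler to manage than the inter-process communication you would need with pcntl_fork.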

Second, the first problem you will run into is your ISP: your connection speed!
You also have the problem that if you send too many requests to a site within a short period of time, it will ban your IP address, and it's game over on that site.

Jordy
The original poster has mentioned he is using Java, so a PHP function is irrelevant.
GaryF
@GaryF: I wasn't finished with my answer.
Jordy
By the way ;) take a look at the web crawler I'm currently working on: http://crawler.tmp.remote.nl/example.php
Jordy
@Jordy: Actually, Java has a whole set of classes to "do several things at once" (a.k.a. concurrency). See e.g. this: http://download.oracle.com/javase/tutorial/essential/concurrency/ - it's PHP's approach that is the workaround.
Piskvor
The OP uses Java, so they will probably use threads or some other kind of concurrency. They could equally run several processes at once.
MarkR
+2  A: 

First of all, the speed of your computer won't be the limiting factor; as for the connection, you should artificially limit the speed of your crawler - most sites will ban your IP address if you start hammering them. In other words, don't crawl a site too quickly (10+ seconds per request should be OK with 99.99% of the sites, but go below that at your own peril).

So, while you could crawl a single site in multiple threads, I'd suggest that each thread crawls a different site (check that it's not a shared IP address, either); that way, you could saturate your connection with a lower chance of getting banned from the site being spidered.
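
A minimal sketch of that kind of per-host politeness, assuming one shared throttle object that every download thread consults before fetching (the class name, method name and the 10-second delay are my own placeholders):

    import java.util.HashMap;
    import java.util.Map;

    // Per-host politeness throttle: never hit the same host more often than once
    // every DELAY_MS milliseconds, while other hosts can be fetched in parallel.
    public class PolitenessThrottle {
        private static final long DELAY_MS = 10_000;      // ~10 s between hits on one host
        private final Map<String, Long> nextAllowed = new HashMap<>();

        public void waitForTurn(String host) throws InterruptedException {
            long startAt;
            synchronized (nextAllowed) {
                long now = System.currentTimeMillis();
                startAt = Math.max(now, nextAllowed.getOrDefault(host, 0L));
                nextAllowed.put(host, startAt + DELAY_MS); // reserve the next slot for this host
            }
            long sleep = startAt - System.currentTimeMillis();
            if (sleep > 0) {
                Thread.sleep(sleep);                       // wait without holding the lock
            }
        }
    }

Each download thread would call waitForTurn(url.getHost()) before fetching; hosts it hasn't seen recently go through immediately, while a busy host makes only that one thread wait.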

Some sites don't want you to crawl parts of the site, and there's a commonly used mechanism that you should follow: the robots.txt file. Read the linked site and implement this.
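
As a rough illustration of what that involves, here is a sketch that only honours the wildcard "User-agent: *" section and ignores Allow rules, wildcards and Crawl-delay; a real crawler should cache the result per host and handle the full format:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    // Very rough robots.txt check: fetch http://<host>/robots.txt and collect the
    // Disallow prefixes from the "User-agent: *" section. Not spec-complete.
    public class RobotsCheck {
        public static List<String> disallowedPrefixes(String host) {
            List<String> prefixes = new ArrayList<>();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new URL("http://" + host + "/robots.txt").openStream()))) {
                boolean inStarSection = false;
                String line;
                while ((line = r.readLine()) != null) {
                    line = line.trim();
                    String lower = line.toLowerCase();
                    if (lower.startsWith("user-agent:")) {
                        inStarSection = line.substring(11).trim().equals("*");
                    } else if (inStarSection && lower.startsWith("disallow:")) {
                        String path = line.substring(9).trim();
                        if (!path.isEmpty()) prefixes.add(path);
                    }
                }
            } catch (Exception e) {
                // no robots.txt or unreachable: this sketch treats that as "nothing disallowed"
            }
            return prefixes;
        }

        public static boolean allowed(String host, String path) {
            return disallowedPrefixes(host).stream().noneMatch(path::startsWith);
        }
    }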

Note also that some sites prohibit any automated crawling at all; depending on the site's jurisdiction (yours may also apply), breaking this may be illegal (you are responsible for what your script does; "the robot did it" is not even an excuse, much less a defense).

Piskvor
"you should artificially limit the speed of your crawler" - You can bounce between different sites to maximize the speed of the crawl, but not hammering any one site. So limiting the speed of the crawler isn't necessary. So at that level, you can/will/should max out the connection (which will always be slower than the machine)
SnOrfus
@SnOrfus: Good point, added.
Piskvor
If you are crawling a sufficiently large number of sites concurrently, then the speed of your computer / connection WILL be the limiting factor. You should only limit the rate at which you crawl a single site (or a single IP address, if you really want).
MarkR
@MarkR: I don't know what kind of processing is happening with the sites being crawled, but I can't think of a reasonable situation where processing (storing/indexing comes to mind) a page would take longer than downloading it. Consider that if you're downloading it concurrently, then you're most assuredly processing it concurrently as well and you will always max your connection before your machine.
SnOrfus
+1  A: 

In my experience, which is mostly from building site scrapers, the network download is always the limiting factor. You can usually hand the parsing of a page (or its storage for parsing later) off to a different thread in less time than it takes to download the next page.

So figure out how long it takes, on average, to download a web page. Work out how many concurrent downloads it takes to fill your connection's throughput, factor in the average speed of any given web server, and the math is fairly obvious.
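
A back-of-the-envelope version of that math, with made-up but plausible numbers; plug in measurements from your own crawler and the 50 Mbit line from the question:

    // Rough throughput estimate; every number here is an assumption, not a measurement.
    public class ThroughputEstimate {
        public static void main(String[] args) {
            double linkMBps    = 50 / 8.0;  // 50 Mbit/s link ~= 6.25 MB/s
            double avgPageMB   = 0.1;       // assume ~100 KB per page
            double avgFetchSec = 0.5;       // assume ~0.5 s per request (mostly latency)
            int threads        = 40;        // concurrent downloads

            double threadCap   = threads / avgFetchSec; // pages/s the thread pool can fetch
            double linkCap     = linkMBps / avgPageMB;  // pages/s the connection can carry
            double pagesPerSec = Math.min(threadCap, linkCap);

            System.out.printf("thread-limited: %.0f pages/s, link-limited: %.0f pages/s -> ~%.0f pages/s%n",
                    threadCap, linkCap, pagesPerSec);
        }
    }

With numbers like these the connection caps out first (around 60 pages per second), which matches the point above: add download threads until the link is full, after which more threads buy you nothing.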

SnOrfus
A: 

If your program is sufficiently efficient, your internet connection WILL be the limiting factor (as Robert Harvey said in his answer).

However, by doing this with a home internet connection, you are probably abusing your provider's terms of service. They will monitor it and will eventually notice if you frequently exceed their reasonable usage policy.

Moreover, if they use a transparent proxy, you may hammer their proxy to death long before you reach their download limit, so be careful - make sure that you are NOT going through your ISP's proxy, transparent or otherwise.

ISPs are set up for most users to do moderate levels of browsing with a few large streaming operations (video or other downloads). A massive number of tiny requests, with hundreds outstanding at once, will probably not make their proxy servers happy even if it doesn't use much bandwidth.

MarkR