building a web crawler

+1 Q:

I'm currently developing a custom search engine with a built-in web crawler. For some reason I'm not into multi-threading, so my indexer has been single-threaded so far. Now I have a small dilemma with the crawler I'm building. Can anybody suggest which is better: crawl one page and index it right away, or crawl 1000+ pages, cache them, and then index the batch?

+1 A:

Better? In terms of what? In terms of speed, I can't foresee a noticeable difference. In terms of robustness (recovering from a catastrophic failure), it's probably better to index each page as you crawl it.

Boo 2009-05-14 00:26:12

+1 A:

I would strongly suggest getting "in" to multi-threading if you are serious about your crawler. Basically, you want at least one indexer and at least one crawler (potentially many of each) running at all times. Among other things, this minimizes start-up and shutdown overhead (e.g. initializing and freeing data structures).
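To make the "one crawler, one indexer, both always running" idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: `fetch` is a hypothetical stand-in that returns fake page text instead of making a real HTTP request, and the "index" is just a dictionary mapping words to URLs.

```python
import queue
import threading

def fetch(url):
    # Hypothetical stand-in for a real HTTP request; returns fake page text.
    return f"content of {url}"

def crawler(urls, pages):
    # Producer: fetch each URL and hand the page over to the indexer.
    for url in urls:
        pages.put((url, fetch(url)))
    pages.put(None)  # sentinel: tells the indexer there are no more pages

def indexer(pages, index):
    # Consumer: index pages as they arrive, until the sentinel shows up.
    while True:
        item = pages.get()
        if item is None:
            break
        url, text = item
        for word in text.split():
            index.setdefault(word, set()).add(url)

def run(urls):
    # Bounded queue, so the crawler cannot run arbitrarily far ahead.
    pages = queue.Queue(maxsize=100)
    index = {}
    workers = [
        threading.Thread(target=crawler, args=(urls, pages)),
        threading.Thread(target=indexer, args=(pages, index)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return index
```

Calling `run(["http://a.example/", "http://b.example/"])` returns the word-to-URL dictionary once both threads have finished. With several crawlers you would need one sentinel per indexer, or some other shutdown signal.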

Matthew Flaschen 2009-05-14 00:28:34

+3 A:

Networks are slow (relative to the CPU). You will see a significant speed increase by parallelizing your crawler. Otherwise, your app will spend the majority of its time waiting on network IO to complete. You can either use multiple threads and blocking IO or a single thread with asynchronous IO.
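As a rough illustration of the "multiple threads and blocking IO" option, here is a small Python sketch. The `fetch` function is a hypothetical stand-in: it sleeps to simulate network latency rather than making a real request, so the numbers only show the shape of the effect, not real crawl timings.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url, delay=0.05):
    # Hypothetical stand-in for a blocking HTTP request:
    # the sleep simulates time spent waiting on the network.
    time.sleep(delay)
    return url, f"content of {url}"

def crawl_sequential(urls):
    # One request at a time: total time is roughly len(urls) * delay.
    return [fetch(u) for u in urls]

def crawl_threaded(urls, workers=8):
    # Several blocking requests in flight at once; results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Both functions return the same list of `(url, text)` pairs. The threaded version just overlaps the waiting, which is where a crawler spends most of its time.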

Also, most indexing algorithms perform better on batches of documents than on one document at a time.
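One way to get batching without giving up on indexing as you go is to buffer crawled pages and update the index once per batch. The sketch below is hypothetical: `BatchIndexer` and its `batch_size` parameter are made up for illustration, and whether batching actually helps depends on how expensive a commit or merge is in your index.

```python
from collections import defaultdict

class BatchIndexer:
    """Buffers crawled pages and indexes them one batch at a time."""

    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.pending = []              # (url, text) pairs not yet indexed
        self.index = defaultdict(set)  # word -> set of URLs containing it
        self.flushes = 0               # how many times the index was updated

    def add(self, url, text):
        # Queue the page; index only once a full batch has accumulated.
        self.pending.append((url, text))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Index everything that is pending in a single pass.
        if not self.pending:
            return
        for url, text in self.pending:
            for word in text.lower().split():
                self.index[word].add(url)
        self.pending.clear()
        self.flushes += 1  # a real index would pay its commit/merge cost here
```

Remember to call `flush()` when the crawl ends so the last partial batch is not lost. Pages still sitting in `pending` are also what you lose on a crash, so `batch_size` is the knob between throughput and robustness.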

Benji York 2009-05-14 01:10:38

+1 A:

Not using threads is OK. However, if you still want performance, you need to use asynchronous IO. I would recommend checking out Boost.Asio. Using asynchronous IO makes your dilemma irrelevant: pages get indexed as they arrive while other downloads are still in flight, so it no longer matters which order you pick. As a bonus, if you do decide to use threads in the future, it's trivial to tell Boost.Asio to apply multiple threads to the problem.
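For a feel of how the single-threaded asynchronous approach looks, here is a small sketch. It uses Python's `asyncio` rather than Boost.Asio, purely to illustrate the same idea in fewer lines. The `fetch` coroutine is a hypothetical stand-in that sleeps instead of doing real network IO, and `max_in_flight` is an assumed knob for limiting concurrent requests.

```python
import asyncio

async def fetch(url, delay=0.05):
    # Hypothetical stand-in for a non-blocking HTTP request.
    await asyncio.sleep(delay)
    return url, f"content of {url}"

def index_page(index, url, text):
    # Trivial word -> URLs index; runs on the one and only thread.
    for word in text.split():
        index.setdefault(word, set()).add(url)

async def crawl(urls, max_in_flight=10):
    index = {}
    limit = asyncio.Semaphore(max_in_flight)  # cap on concurrent downloads

    async def worker(url):
        async with limit:
            page_url, text = await fetch(url)
        # Each page is indexed as soon as it arrives,
        # while other downloads are still waiting on the network.
        index_page(index, page_url, text)

    await asyncio.gather(*(worker(u) for u in urls))
    return index
```

Run it with `asyncio.run(crawl(urls))`. No locks are needed around `index`, because only one coroutine executes at a time and `index_page` never awaits.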

Yogi 2009-05-14 02:45:07

Thanks, I might give it a try.

2009-05-15 02:21:23
