I'm currently developing a custom search engine with a built-in web crawler. I'm not keen on multi-threading, so my indexer has been single-threaded so far. Now I have a small dilemma with the crawler I'm building. Can anybody suggest which is better: crawl one page and then index it, or crawl 1000+ pages, cache them, and then index?

+1  A: 

Better? In terms of what? In terms of speed, I can't foresee a noticeable difference. In terms of robustness (recovering from a catastrophic failure), it's probably better to index each page as you crawl it.

Boo
+1  A: 

I would strongly suggest getting "into" multi-threading if you are serious about your crawler. Basically, you would want at least one indexer and at least one crawler (potentially many of each) running at all times. Among other things, this minimizes start-up and shutdown overhead (e.g. initializing and freeing data structures).
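To make the "one crawler, one indexer, both running at all times" idea concrete, here is a minimal sketch in Python. It assumes a bounded queue between the two threads; `fetch`, `crawler`, `indexer` and `run` are hypothetical names, and `fetch` is only a stand-in for a real download.

```python
import queue
import threading

SENTINEL = None  # tells the indexer that no more pages are coming


def fetch(url):
    # Hypothetical stand-in for a real HTTP download.
    return f"content of {url}"


def crawler(urls, pages):
    # Producer: download each page and hand it to the indexer.
    for url in urls:
        pages.put((url, fetch(url)))
    pages.put(SENTINEL)


def indexer(pages, index):
    # Consumer: build a tiny inverted index (word -> set of URLs).
    while True:
        item = pages.get()
        if item is SENTINEL:
            break
        url, text = item
        for word in text.split():
            index.setdefault(word, set()).add(url)


def run(urls):
    # Bounded queue so the crawler cannot run far ahead of the indexer.
    pages = queue.Queue(maxsize=100)
    index = {}
    workers = [
        threading.Thread(target=crawler, args=(urls, pages)),
        threading.Thread(target=indexer, args=(pages, index)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return index
```

Calling `run(["a", "b"])` returns the index once both threads have finished. Only the indexer thread touches `index`, so this sketch needs no extra locking; with several indexers that would no longer hold.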

Matthew Flaschen
+3  A: 

Networks are slow (relative to the CPU). You will see a significant speed increase by parallelizing your crawler. Otherwise, your app will spend the majority of its time waiting on network IO to complete. You can either use multiple threads and blocking IO or a single thread with asynchronous IO.
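As a rough illustration of the single-thread, asynchronous-IO option, the sketch below uses Python's `asyncio`. It makes no real requests: `fetch` is a hypothetical placeholder that simulates network latency with `asyncio.sleep`, and `max_concurrent` is an assumed tuning knob.

```python
import asyncio


async def fetch(url, delay=0.1):
    # Hypothetical stand-in for a non-blocking HTTP request;
    # asyncio.sleep simulates the time spent waiting on the network.
    await asyncio.sleep(delay)
    return url, f"content of {url}"


async def crawl(urls, max_concurrent=10):
    # Cap the number of requests in flight at once.
    limit = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        async with limit:
            return await fetch(url)

    # All fetches wait concurrently on one thread; results keep input order.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


def crawl_all(urls):
    return asyncio.run(crawl(urls))
```

With ten URLs and a simulated 0.1 s delay each, the waits overlap, so the whole crawl takes roughly the time of one request rather than ten in a row.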

Also, most indexing algorithms will perform better on batches of documents than when indexing one document at a time.
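One way to get batching without committing to "1000 pages, then index" is to buffer pages and flush when the buffer fills. This is only a sketch: `BatchIndexer` and its `batch_size` parameter are hypothetical, and the in-memory inverted index stands in for whatever index structure you actually use.

```python
from collections import defaultdict


class BatchIndexer:
    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.pending = []               # crawled pages not yet indexed
        self.index = defaultdict(set)   # word -> set of URLs
        self.flushes = 0                # number of batches indexed so far

    def add(self, url, text):
        # Buffer the page; index the whole buffer once it is full.
        self.pending.append((url, text))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Index everything buffered so far in one pass.
        if not self.pending:
            return
        for url, text in self.pending:
            for word in set(text.lower().split()):
                self.index[word].add(url)
        self.pending.clear()
        self.flushes += 1
```

Setting `batch_size=1` gives the index-each-page-as-you-crawl behaviour, while a larger value trades some robustness (unflushed pages are lost on a crash) for fewer, larger index updates. Call `flush()` once at the end of the crawl to index any leftover pages.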

Benji York
+1  A: 

Not using threads is OK. However, if you still want performance, you need to use asynchronous IO. I would recommend checking out Boost.Asio. Using asynchronous IO will make your dilemma "irrelevant", as it would not matter. As a bonus, if you decide to use threads in the future, it's trivial to tell Boost.Asio to apply multiple threads to the problem.

Yogi
Thanks, I might give it a try.