Suggestions needed for threading and process architecture for search engine software.

The software is a classic search engine. There is one portion of the app that is tasked with crawling/collecting data, and there is another that takes that data and builds an index or database. The final portion handles queries from clients, and performs a search on the data, before retrieving the results.

The specific engine that I'm discussing is one where the data is frequently updated (at least once per minute) so the queries must always be operating on the latest data.

My question is simple. Should these three tasks be handled by three separate processes, or a single process with multiple threads dedicated to each?

The main reason for my question is regarding the best way to partition memory. If the crawler has to update the available data for the indexer, and the indexer has to update the datasets for the query handler, would it make sense for them all to live under the same process and share the same address space? Or would it be acceptable to have separate processes that share data through memory-mapped files?

I am leaning towards separate processes so that each can live on a different machine, enabling clustering, distribution, etc. But in terms of raw speed for smaller datasets, would a consolidated approach be preferred?
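To make the consolidated option concrete, here is a minimal sketch of how a single process might hand fresh data from the indexer thread to the query threads. The names `Index`, `IndexHolder`, `publish`, `snapshot` and `query` are all hypothetical, and the index is assumed to be small enough to rebuild and swap as a whole on each update.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical index layout: term -> ids of the documents containing it.
using Index = std::map<std::string, std::vector<int>>;

class IndexHolder {
public:
    // Called by the indexer thread: swap in a freshly built index.
    void publish(Index fresh) {
        std::shared_ptr<const Index> next =
            std::make_shared<Index>(std::move(fresh));
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(next);
    }

    // Called by query threads: take a reference to the latest index.
    // The lock only covers the pointer copy, never the search itself.
    std::shared_ptr<const Index> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const Index> current_{std::make_shared<Index>()};
};

// Look up a term in whichever index is current when the query starts.
std::vector<int> query(const IndexHolder& holder, const std::string& term) {
    std::shared_ptr<const Index> snap = holder.snapshot();
    auto it = snap->find(term);
    if (it == snap->end()) {
        return {};
    }
    return it->second;
}
```

A query that is already running keeps its old snapshot alive until it finishes, so the indexer never has to wait for readers. With separate processes the same idea would need a memory-mapped file or some other IPC channel in place of the `shared_ptr`.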

The OS is Windows, the language is C++.

I'm no expert, but I would lean towards the separate-processes approach, as it gives the best flexibility, ability to scale, ease of management (restarting one service wouldn't affect the others), and performance.

I'd also be tempted to consider different databases for the different tasks as well. If you take the approach of having one component doing one job - and doing it well, then it makes sense to apply this principle to the DB as well.

How you do that depends, I think, on where you expect the performance bottlenecks to be. I'm thinking along the lines of an initial collection area, perhaps a staging area (sorting, etc.) and a final area dedicated to fast access and searching.
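As a rough illustration of those three areas, here is a small in-memory sketch. The names `RawRecord`, `Posting`, `stage` and `build_search_area` are made up for the example, and in a real system each area would more likely be its own database or table than a C++ container.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Collection area: raw records exactly as the crawler delivered them.
struct RawRecord {
    int doc_id;
    std::string text;
};

using Posting = std::pair<std::string, int>;  // (term, doc_id)

// Staging area: split each text on whitespace, then sort the
// (term, doc_id) pairs and drop duplicates.
std::vector<Posting> stage(const std::vector<RawRecord>& collected) {
    std::vector<Posting> staged;
    for (const RawRecord& rec : collected) {
        std::istringstream words(rec.text);
        std::string term;
        while (words >> term) {
            staged.emplace_back(term, rec.doc_id);
        }
    }
    std::sort(staged.begin(), staged.end());
    staged.erase(std::unique(staged.begin(), staged.end()), staged.end());
    return staged;
}

// Final area: term -> ascending document ids, laid out for fast lookup.
std::map<std::string, std::vector<int>> build_search_area(
        const std::vector<Posting>& staged) {
    std::map<std::string, std::vector<int>> area;
    for (const Posting& p : staged) {
        area[p.first].push_back(p.second);
    }
    return area;
}
```

Because each stage only reads the output of the previous one, the stages could run as separate applications on different schedules, which is the point of splitting them up.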

SQL-to-SQL batch processes / ETL would give the best performance, I guess.

Thinking it through - I'd build 3 separate applications that together form the solution. That would also allow you to use different technologies for different tasks if you really wanted to, and it allows a more flexible maintenance path.
