views:

76

answers:

5

Hello all,

The title might be a bit misleading, but I couldn't figure out a better one. I'm writing a simple search engine that searches several sites within a specific domain. To be concrete: I'm writing a search engine for hardstyle livesets/aftermovies/tracks. To do so, I will search the sites that provide livesets, tracks, and such. The problem here is speed: I need to pass the search query to 5-7 sites, get the results, and then use my own algorithm to display them in sorted order. I could just "multithread" it, but that's easier said than done, so I have a few questions.

  1. What would be the best solution to this problem? Should I just multithread/multiprocess this application so that I get some speed-up?

  2. Are there any other solutions, or am I doing something really wrong?
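For reference, the kind of straightforward approach I have in mind looks roughly like this (`fetchResults` is just a placeholder for the real HTTP query and parsing):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SearchAggregator {
    // Placeholder: in reality this would do an HTTP request and parse the page.
    static List<String> fetchResults(String site, String query) {
        return Arrays.asList(site + ": result for " + query);
    }

    public static List<String> search(String query, List<String> sites)
            throws InterruptedException, ExecutionException {
        // One thread per site, since 5-7 sites is a small, fixed number
        ExecutorService pool = Executors.newFixedThreadPool(sites.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String site : sites) {
                futures.add(pool.submit(() -> fetchResults(site, query)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // block until each site's results arrive
            }
            // my own ranking algorithm would sort `all` here
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```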

Thanks,

William van Doorn

+2  A: 

Unless you're trying to learn multithreading, avoid writing the infrastructure for this yourself. Synchronizing lots of tasks that can take different amounts of time, handling failures, and so on quickly becomes a mess.

For largely parallelizable tasks (such as querying multiple sites, combining results, etc.), you may want to look at existing infrastructures.

Map/reduce frameworks (such as Hadoop for Java) can handle some of this for you, letting you focus on the logic of your application.

Uri
Would this infrastructure be that complicated? Since we're talking about Java, I could just create an ExecutorService with a simple Runnable. But I will look at Hadoop, thanks.
wvd
@wvd: It really depends on the scale and on how much you want to parallelize. For small inputs and results, you can brew your own. If you have tons of sites to query and tons of results to combine, and especially if you need multiple processes to handle a large number of results, the infrastructure and robustness of a map/reduce tool, plus the improved performance, could help.
Uri
Thanks, I got my answer.
wvd
A: 

Use Google? ;)

The bottleneck will be downloading the information; multithreading will help with that.

Otherwise, download only the HTML.

rdkleine
A: 

I'll try to use some pseudo code here:

// main thread

barrier = Barrier(numberOfQueries) // initialize the barrier 
                                   // with number of working threads

for (i = 0; i < numberOfQueries; i++) {
  workers(i) = Worker(i, barrier) // create a worker passing the barrier
  workers(i).start() // start a worker
}

barrier.await() // wait until the barrier counts down to zero

for (i = 0; i < numberOfQueries; i++) {
  results(i) = workers(i).result // collect the results
}

display(results) // display the results


// worker thread

function start() {
  doTheJob() // do the long job of querying a site
  this.barrier.decrement() // once the job is finished, decrement the barrier
}
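In real Java, this barrier idea maps directly onto java.util.concurrent.CountDownLatch. A minimal sketch (the actual site query is stubbed out):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BarrierSearch {
    // One worker per query; stores its result when done.
    static class Worker extends Thread {
        private final int id;
        private final CountDownLatch latch;
        volatile String result;

        Worker(int id, CountDownLatch latch) {
            this.id = id;
            this.latch = latch;
        }

        @Override
        public void run() {
            try {
                result = "results from site " + id; // placeholder for the real query
            } finally {
                latch.countDown(); // decrement the barrier even if the query fails
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int numberOfQueries = 5;
        CountDownLatch latch = new CountDownLatch(numberOfQueries);

        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < numberOfQueries; i++) {
            Worker w = new Worker(i, latch);
            workers.add(w);
            w.start();
        }

        latch.await(); // wait until all workers have counted down

        for (Worker w : workers) {
            System.out.println(w.result); // collect and display the results
        }
    }
}
```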
Boris Pavlović
Yeah, I know how to do it -- I was just interested in whether this is actually a good way.
wvd
Somebody may find it useful, don't you think?
Boris Pavlović
+1  A: 

In the specific case of a search engine, I recommend you check out Solr or Lucene. For 5-7 sites, Hadoop will probably be overkill. Incremental indexing is possible, as is adding specific metadata to each of the searchable items.

I imagine these sites also publish a lot of their content in RSS feeds, which you can use to keep your indexes up to date faster than by continuously crawling them.

The search engine itself allows all kinds of interesting ways to get blindingly fast to your results, for post-processing or for displaying them immediately to your users.

For parallelization there is excellent support in the JSR-166y packages (java.util.concurrent), which allow parallelization without headaches if you stick to one of the proposed patterns. They work really well.
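For example, `ExecutorService.invokeAll` runs a batch of Callables and blocks until they all finish, which matches the "query N sites, then combine" pattern exactly. A sketch with the queries stubbed out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueries {
    public static List<String> queryAll(List<String> sites) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(sites.size());
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String site : sites) {
                tasks.add(() -> "results from " + site); // placeholder for the HTTP query
            }
            // invokeAll blocks until every task has completed (or failed)
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // get() rethrows any exception from the task
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```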

Just some thoughts.

Peter Tillemans
A: 

You can use map/reduce for this kind of task. Hadoop is an implementation in Java.
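To illustrate the idea only (this is plain Java, not Hadoop): a "map" step that splits each raw result into a (site, title) pair, and a "reduce" step that groups titles per site:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static Map<String, List<String>> groupBySite(List<String> rawResults) {
        return rawResults.stream()
                // "map": split each line into (site, title)
                .map(line -> line.split(" ", 2))
                // "reduce": group titles by site (TreeMap for deterministic order)
                .collect(Collectors.groupingBy(
                        parts -> parts[0],
                        TreeMap::new,
                        Collectors.mapping(parts -> parts[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> rawResults = Arrays.asList(
                "siteA track1", "siteB track2", "siteA track3");
        System.out.println(groupBySite(rawResults));
        // prints {siteA=[track1, track3], siteB=[track2]}
    }
}
```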

Dimitri