views:

76

answers:

5

Hello all,

The title might be a bit misleading, but I couldn't figure out a better one. I'm writing a simple search engine that searches several sites within a specific domain. To be concrete: I'm writing a search engine for hardstyle livesets/aftermovies/tracks. To do so, I will search the sites that provide livesets, tracks, and such. The problem here is speed: I need to pass the search query to 5-7 sites, get the results, and then use my own algorithm to display them in sorted order. I could just "multithread" it, but that's easier said than done, so I have a few questions.

  1. What would be the best solution to this problem? Should I just multithread/multiprocess this application so that I get some speed-up?

  2. Are there any other solutions, or am I doing something really wrong?
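For reference, the kind of straightforward approach I have in mind looks roughly like this (`fetchResults` is just a placeholder for the real HTTP query and parsing):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SearchAggregator {
    // Placeholder: in reality this would do an HTTP request and parse the page.
    static List<String> fetchResults(String site, String query) {
        return Arrays.asList(site + ": result for " + query);
    }

    public static List<String> search(String query, List<String> sites)
            throws InterruptedException, ExecutionException {
        // One thread per site, since 5-7 sites is a small, fixed number
        ExecutorService pool = Executors.newFixedThreadPool(sites.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String site : sites) {
                futures.add(pool.submit(() -> fetchResults(site, query)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // block until each site's results arrive
            }
            // my own ranking algorithm would sort `all` here
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```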

Thanks,

William van Doorn

+2  A: 

Unless you're trying to learn multithreading, avoid writing the infrastructure for this yourself. Synchronizing lots of tasks that can take different amounts of time, handling failures, and so on quickly becomes a mess.

For largely parallelizable tasks (such as querying multiple sites, combining results, etc.), you may want to look at existing infrastructures.

Map/reduce frameworks (such as Hadoop for Java) can handle some of this for you, letting you focus on the logic of your application.

Uri
Would this infrastructure be that complicated? Since we're talking about Java, I could just create an ExecutorService with a simple Runnable. But I will look at Hadoop, thanks.
wvd
@wvd: It really depends on the scale and on how much you want to parallelize. For small inputs and results, you can brew your own. If you have tons of sites to query and tons of results to combine, and especially if you need multiple processes to handle a large number of results, the infrastructure and robustness of a map/reduce tool, plus the improved performance, could help.
Uri
Thanks, I got my answer.
wvd
A: 

Use Google? ;)

The bottleneck will be downloading the information; multithreading will help with that.

Otherwise, download only the HTML.

rdkleine
A: 

I'll try to use some pseudo code here:

// main thread

barrier = Barrier(numberOfQueries) // initialize the barrier 
                                   // with number of working threads

for (i = 0; i < numberOfQueries; i++) {
  workers(i) = Worker(i, barrier) // create a worker passing the barrier
  workers(i).start() // start a worker
}

barrier.await() // wait until the barrier counts down to zero

for (i = 0; i < numberOfQueries; i++) {
  results(i) = workers(i).result // collect the results
}

display(results) // display the results


// worker thread

function start() {
  doTheJob() // do the long job of querying a site
  this.barrier.decrement() // once the job is finished, decrement the barrier
}
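In real Java, this barrier idea maps directly onto java.util.concurrent.CountDownLatch. A minimal sketch (the actual site query is stubbed out):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BarrierSearch {
    // One worker per query; stores its result when done.
    static class Worker extends Thread {
        private final int id;
        private final CountDownLatch latch;
        volatile String result;

        Worker(int id, CountDownLatch latch) {
            this.id = id;
            this.latch = latch;
        }

        @Override
        public void run() {
            try {
                result = "results from site " + id; // placeholder for the real query
            } finally {
                latch.countDown(); // decrement the barrier even if the query fails
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int numberOfQueries = 5;
        CountDownLatch latch = new CountDownLatch(numberOfQueries);

        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < numberOfQueries; i++) {
            Worker w = new Worker(i, latch);
            workers.add(w);
            w.start();
        }

        latch.await(); // wait until all workers have counted down

        for (Worker w : workers) {
            System.out.println(w.result); // collect and display the results
        }
    }
}
```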
Boris Pavlović
Yeah, I know how to do it -- I was just interested in whether this is actually a good way.
wvd
Somebody may find it useful, don't you think?
Boris Pavlović
+1  A: 

In the specific case of a search engine, I recommend you check out Solr or Lucene. For 5-7 sites, Hadoop will probably be overkill. Incremental indexing is possible, as is adding specific metadata to each of the searchable items.

I imagine these sites also publish a lot of their content in RSS feeds, which you can use to keep your indexes up to date faster than by continuously crawling them.

The search engine itself allows all kinds of interesting ways to get blindingly fast to your results, for post-processing or for displaying them immediately to your users.

For parallelization there is excellent support in the JSR-166y packages (java.util.concurrent), which allow parallelization without headaches if you stick to one of the proposed patterns. They work really well.
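For example, `ExecutorService.invokeAll` runs a batch of Callables and blocks until they all finish, which matches the "query N sites, then combine" pattern exactly. A sketch with the queries stubbed out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueries {
    public static List<String> queryAll(List<String> sites) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(sites.size());
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String site : sites) {
                tasks.add(() -> "results from " + site); // placeholder for the HTTP query
            }
            // invokeAll blocks until every task has completed (or failed)
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // get() rethrows any exception from the task
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```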

Just some thoughts.

Peter Tillemans
A: 

You can use map/reduce for this kind of task. Hadoop is an implementation in Java.
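To illustrate the idea only (this is plain Java, not Hadoop): a "map" step that splits each raw result into a (site, title) pair, and a "reduce" step that groups titles per site:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static Map<String, List<String>> groupBySite(List<String> rawResults) {
        return rawResults.stream()
                // "map": split each line into (site, title)
                .map(line -> line.split(" ", 2))
                // "reduce": group titles by site (TreeMap for deterministic order)
                .collect(Collectors.groupingBy(
                        parts -> parts[0],
                        TreeMap::new,
                        Collectors.mapping(parts -> parts[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> rawResults = Arrays.asList(
                "siteA track1", "siteB track2", "siteA track3");
        System.out.println(groupBySite(rawResults));
        // prints {siteA=[track1, track3], siteB=[track2]}
    }
}
```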

Dimitri