views:

65

answers:

1

Hello all,

I have a question about parallelization:

I have two datasets. Dataset1 has m rows and k columns, Dataset2 has n rows and k columns.(m > n) My program reads those datasets from files and store them in the memory. The task is to take each instance of Dataset1(let's call this query instance) and compare with all instances of Dataset2.

Now my question is that:

  • (Option1)Should I partition the Dataset2 into x number of partitions and assign those partitions to the x number of worker threads in each query(this implies that, comparison with query instance in Dataset2) of Dataset1 or
  • (Option2)Should I take x number of instances from Dataset 1 , assign x number of worker threads to query Dataset2 simultaneously.

Which one would be more efficient? //BTW, I'm using PThreads library at the moment.

+1  A: 

I would go with Option 1, i.e. partition Dataset 2.

Rationale:

Dataset 1 is probably too large to fit in cache anyway, which is why you're sweeping over it once and comparing each entry to all entries in Dataset 2.

Now, let's assume that Dataset 2 is also too large to fit completely in a single processor's cache, but it is large enough to fit in cache if it is partitioned into x partitions, with each processor's cache containing one of the partitions. In this case, if you go with Option 1, each processor's cache will be large enough to contain one of the partitions. If you go with Option 2, each processor needs to process the whole of Dataset 2, which is too large to fit in its cache, so thrashing occurs.

If Dataset 2 is small enough to fit in cache completely, then it's probably largely irrelevant which option you choose.

Martin B