views:

78

answers:

2

My program often deals with large quantities of data, and one particular component of it creates a subset of that data, based on a condition. You could view this as, having the string,

10457038005502

the problem is to return the first five (say) non-0 elements, that is to return:

14573

In actual fact, each element of this string is a large data structure containing a sizeable amount of data, the entire dataset often being several gigabytes in size and comprising tens of thousands of elements, and to figure out if the element should be included (is not a '0') each element needs to be processed. I've phrased it as I have above both to try to explain it clearly, and also to try to keep this focused on algorithms or techniques, not our specific implementation and data.

  • Edit: Thanks to those who have replied so far. All suggestions revolve around multithreading, which I agree is a good approach (and we do have a threaded task framework this would fit into.) I was hoping that the problem itself can be treated as an algorithmic problem - I suspect, though I don't know, that it's a general sort of problem applicable to searching through all sorts of data. In this light, an awesome reply would be "Schwarzenegger et al proposed algorithm X for this in 1995, google this term."

Our current approach is a single-threaded linear search from the first point we already know in the input data set, along the array, calculating if an element needs to be kept or not and building the results as we go. Often the requested data subset is not at the beginning - using the string example, you might need to know, say, elements 8-15 (if 15 exist, which you might not know until you've hit the end of the input data.) Of course, we don't know what element 8 of the output dataset is in the input dataset until we've processed that far from the beginning.

How else should we solve this problem?

I am seeking input on any completely different approaches or algorithms to solve this kind of problem.

  • What other approaches are there to this problem, ie quickly getting an arbitrary subset?

  • Knowing the current solution is computation-bound because of the amount of processing per element, or rather, because it requires each element to be generated in order to examine it to see if it's a '0', what algorithms or approaches might you suggest to tackle the problem? Is there any way to minimise the work the program does?

If it affects specific libraries or tools, we are using C++ (not managed; we use Embarcadero C++ Builder 2010.) For example, we can't use LINQ which, without having used it, I think could be a useful tool / language feature for this sort of problem. However, we can of course implement any algorithmic solutions that you might normally implement with less work in another environment.

+1  A: 

Assuming each computation can be carried independently of others (i.e., that the results for an item do not depend on the results from a previous item) the obvious first step would be to use multithreading to carry out the computations in parallel.

Jerry Coffin
For the input data, each element's value is generally standalone. How it's generated can vary wildly, but assuming an element exists, it's standalone. For the output data, it does seem that the inclusion of an item (given you request a specific indexed subset) depends on processing each of the previous input data items. Re multithreading, do you mean something like partitioning the input data and processing each partition in threads or as task object to get a list of 0/non-0 elements, and then re-assembling?
David M
Yes -- the idea would be that if (for example) you're asked for the first 5 non-zero items, that you read the first five inputs, and then spin up five threads to process those five items in parallel. I can see where inclusion of a non-zero item might depend on how many were found previously, but if you're asked for N, then you can process at least the first N in parallel, and be sure any positive results will be used. Yes, when you have your results, you then re-assemble them back into original order.
Jerry Coffin
+1  A: 

From a quick read of your question I suppose that you might want to implement some sort of master-worker parallelisation. Have one thread (or process, you choose) read the first N entries, start up N threads(or processes) and pass one task to each. Each thread then works independently and reports back to the master upon completion (success or failure). Master then determines whether more tasks need to be created and passed to workers.

The main potential problem with this is probably ensuring good load-balancing across workers, and ensuring that the master thread is not too much of a bottleneck.

OpenMP 3.0 supports this type of task parallelism and has a fairly gentle learning curve.

High Performance Mark
OpenMP unfortunately doesn't work with C++ Builder (see http://openmp.org/wp/openmp-compilers/ .) Thanks for the reply / algorithm/approach suggestion though!
David M
So ditch C++ builder :-)
High Performance Mark