My program often deals with large quantities of data, and one particular component of it creates a subset of that data, based on a condition. You could view this as, having the string,
10457038005502
the problem is to return the first five (say) non-0 elements, that is to return:
14573
In actual fact, each element of this string is a large data structure containing a sizeable amount of data, the entire dataset often being several gigabytes in size and comprising tens of thousands of elements, and to figure out if the element should be included (is not a '0') each element needs to be processed. I've phrased it as I have above both to try to explain it clearly, and also to try to keep this focused on algorithms or techniques, not our specific implementation and data.
- Edit: Thanks to those who have replied so far. All suggestions revolve around multithreading, which I agree is a good approach (and we do have a threaded task framework this would fit into.) I was hoping that the problem itself can be treated as an algorithmic problem - I suspect, though I don't know, that it's a general sort of problem applicable to searching through all sorts of data. In this light, an awesome reply would be "Schwarzenegger et al proposed algorithm X for this in 1995, google this term."
Our current approach is a single-threaded linear search from the first point we already know in the input data set, along the array, calculating if an element needs to be kept or not and building the results as we go. Often the requested data subset is not at the beginning - using the string example, you might need to know, say, elements 8-15 (if 15 exist, which you might not know until you've hit the end of the input data.) Of course, we don't know what element 8 of the output dataset is in the input dataset until we've processed that far from the beginning.
How else should we solve this problem?
I am seeking input on any completely different approaches or algorithms to solve this kind of problem.
What other approaches are there to this problem, ie quickly getting an arbitrary subset?
Knowing the current solution is computation-bound because of the amount of processing per element, or rather, because it requires each element to be generated in order to examine it to see if it's a '0', what algorithms or approaches might you suggest to tackle the problem? Is there any way to minimise the work the program does?
If it affects specific libraries or tools, we are using C++ (not managed; we use Embarcadero C++ Builder 2010.) For example, we can't use LINQ which, without having used it, I think could be a useful tool / language feature for this sort of problem. However, we can of course implement any algorithmic solutions that you might normally implement with less work in another environment.