I need a very fast algorithm for the following task. I have already implemented several algorithms that complete it, but they're all too slow for the performance I need. It should be fast enough that the algorithm can be run at least 100,000 times a second on a modern CPU. It will be implemented in C++.
I am working with spans/ranges, a structure that has a start and an end coordinate on a line.
I have two vectors (dynamic arrays) of spans and I need to merge them. One vector is src and the other dst. The vectors are sorted by span start coordinates, and the spans do not overlap within one vector.
The spans in the src vector must be merged with the spans in the dst vector, such that the resulting vector is still sorted and has no overlaps. Ie. if overlaps are detected during the merging, the two spans are merged into one. (Merging two spans is just a matter of changing the coordinates in the structure.)
Now, there is one more catch, the spans in the src vector must be "widened" during the merge. This means that a constant will be added to the start and another (larger) constant to the end coordinate of every span in src. This means that after the src spans are widened they might overlap.
What I have arrived at so far is that it cannot be done fully in-place, some kind of temporary storage is needed. I think it should be doable in linear time over the number of elements of src and dst summed.
Any temporary storage can probably be shared between multiple runs of the algorithm.
The two primary approaches I have tried, which are too slow, are:
Append all elements of src to dst, widening each element before appending it. Then run an in-place sort. Finally iterate over the resulting vector using a "read" and "write" pointer, with the read pointer running ahead of the write pointer, merging spans as they go. When all elements have been merged (the read pointer reaches end) dst is truncated.
Create a temporary work-vector. Do a naive merge as described above by repeatedly picking the next element from either src or dst and merging into the work-vector. When done, copy the work-vector to dst, replacing it.
The first method has the problem that sorting is O((m+n)*log(m+n)) instead of O(m+n) and has somewhat overhead. It also means the dst vector has to grow much larger than it really needs.
The second has the primary problem of a lot of copying around and again allocation/deallocation of memory.
The data structures used for storing/managing the spans/vectors can be altered if you think that's needed.
Update: Forgot to say how large the datasets are. The most common cases are between 4 and 30 elements in either vector, and either dst is empty or there is a large amount of overlap between the spans in src and dst.