I have a D2 program that, in its current form, is single-threaded and calls the same pure function about 10 to 100 times in an inner loop for each iteration of its outer loop. There is no data dependency among the calls, i.e. no call uses the result of any other call. Overall, the function is called millions of times and is the main bottleneck in my program. The parameters are unique almost every time, so caching wouldn't help.
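To make the structure concrete, here is a simplified sketch of the shape of the code. It is not the real program: `work`, `run`, the loop counts, and the summation of results are all placeholders chosen for illustration. The only things it is meant to show are that the function is pure and that the calls in the inner loop are independent of one another.

```d
import std.stdio;

// Hypothetical stand-in for the real function: pure, and dependent only on
// its arguments. The real one takes roughly 3 microseconds per call.
double work(double x, double y) pure nothrow @safe
{
    return x * x + y;
}

// Placeholder driver showing the current single-threaded loop structure.
double run(int outerIterations, int innerCalls)
{
    auto results = new double[innerCalls];
    double total = 0;

    foreach (i; 0 .. outerIterations)
    {
        // Inner loop: 10 to 100 calls in the real program,
        // none of which uses the result of another.
        foreach (j; 0 .. innerCalls)
            results[j] = work(i, j);

        // Placeholder for whatever consumes the results
        // before the next outer iteration starts.
        foreach (r; results)
            total += r;
    }
    return total;
}

void main()
{
    // Placeholder sizes; the real program makes millions of calls overall.
    writeln(run(1_000, 50));
}
```

The inner `foreach` over `j` is the part I would like to spread across cores.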
At first glance, this seems like the perfect candidate for parallelization. The only problem is that the function takes only about 3 microseconds per call, so a whole inner loop amounts to roughly 30 to 300 microseconds of work. That per-call cost is well below the latency of creating a new thread, and not far above the overhead of adding a job to a task pool (acquiring a mutex, allocating memory to hold information about the task, dealing with possible contention for the task pool's queue, etc.). Is there any good way to take advantage of parallelism that is this fine-grained?