Does OpenMP natively support reduction of a variable that represents an array?
This would work something like the following...
float* a = (float*) calloc(4*sizeof(float));
omp_set_num_threads(13);
#pragma omp parallel reduction(+:a)
for(i=0;i<4;i++){
a[i] += 1; // Thread-local copy of a incremented by something interesting
}
// a now contains [13 13 13 13]
Ideally, there would be something similar for an omp parallel for, and if you have a large enough number of threads for it to make sense, the accumulation would happen via binary tree.