I am allocating some float arrays (pretty large, i.e. 9,000,000 elements) on the GPU using cudaMalloc((void**)&(storage->data), size * sizeof(float)). At the end of my program, I free this memory using cudaFree(storage->data).
The problem is that the first deallocation is really slow, around 10 seconds, whereas the others are nearly instantaneous.
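For reference, here is a minimal sketch of the pattern and of how each cudaFree call can be timed. It is not my actual code: the Storage struct, the number of arrays (4) and the time_free_ms helper are hypothetical names made up for illustration, and it assumes the CUDA runtime API compiled with nvcc.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real storage struct (assumption).
struct Storage {
    float* data;
};

// Wall-clock time of a single cudaFree call, in milliseconds.
double time_free_ms(float* ptr) {
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(ptr);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const size_t size = 9000000;  // elements per array, as described above
    const int count = 4;          // number of arrays (assumption)
    Storage storage[count];

    for (int i = 0; i < count; ++i) {
        cudaError_t err =
            cudaMalloc((void**)&(storage[i].data), size * sizeof(float));
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaMalloc failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
    }

    // Wait for any pending GPU work first, so that the first timing
    // does not include time spent waiting for earlier kernels (if any).
    cudaDeviceSynchronize();

    // Time every deallocation separately to compare the first with the rest.
    for (int i = 0; i < count; ++i) {
        double ms = time_free_ms(storage[i].data);
        std::printf("cudaFree #%d took %.3f ms\n", i, ms);
    }
    return 0;
}
```

If the first call is still much slower than the others after the cudaDeviceSynchronize(), the delay is in the deallocation itself and not in earlier work on the device.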
My question is the following: what could cause this difference? Is deallocating memory on a GPU usually that slow?