Hello,
I am new to python, apologies if this has been asked already.
Using Python and NumPy, I am trying to gather data across many netCDF files into a single array by iteratively calling append().
Naively, I am trying to do something like this:
from numpy import *
from pupynere import netcdf_file

x = array([])
y = [...some list of files...]
for fname in y:
    ncfile = netcdf_file(fname, 'r')
    xFragment = ncfile.variables["varname"][:]
    ncfile.close()
    x = append(x, xFragment)
I know that under normal circumstances this is a bad idea, since it reallocates new memory on each append() call. But two things discourage preallocation of x:
1) The files are not necessarily the same size along axis 0 (but should be the same size along subsequent axes), so I would need to read the array sizes from each file beforehand to precalculate the final size of x.
However...
2) From what I can tell, pupynere (and other netcdf modules) load the entire file into memory upon opening, rather than just holding a reference to the data on disk (as many netCDF modules in other environments do). So to preallocate, I'd have to open each file twice.
There are many (>100) large (>1GB) files, so overallocating and reshaping is not practical, from what I can tell.
My first question is whether I am missing some intelligent way to preallocate.
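For reference, here is a rough sketch of the two-pass approach I'm trying to avoid. The in-memory `fragments` list stands in for the per-file arrays (in reality each would come from a second `ncfile.variables["varname"][:]` read, with sizes gathered in a first pass over the files):

```python
import numpy as np

# Stand-ins for the per-file arrays; shapes may differ along axis 0 only.
fragments = [np.ones((3, 2)), np.ones((5, 2))]

# First pass: total length along axis 0.
n_rows = sum(f.shape[0] for f in fragments)

# Preallocate once, then copy each fragment into its slice.
x = np.empty((n_rows,) + fragments[0].shape[1:])
offset = 0
for f in fragments:
    x[offset:offset + f.shape[0]] = f
    offset += f.shape[0]
```

This avoids repeated reallocation, but only at the cost of knowing every fragment's size up front.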
My second question is more serious. The above snippet works for a single-dimension array. But if I try to load in a matrix, then initialisation becomes a problem. I can append a one-dimensional array to an empty array:
append( array([]), array([1, 2, 3]) )
but I cannot append a two-dimensional array to an empty array:
append( array([]), array([ [1, 2], [3, 4] ]), axis=0)  # raises ValueError: dimensions don't match
Something like x.extend(xFragment) would work, I believe, but I don't think numpy arrays have this functionality. I could also avoid the initialisation problem by treating the first file as a special case, but I'd prefer to avoid that if there's a better way to do it.
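To make concrete what I mean by treating the first file as a special case, here is a sketch (again with an in-memory `fragments` list standing in for the arrays read from each file):

```python
import numpy as np

# Stand-ins for the per-file 2-D arrays read from each netCDF file.
fragments = [np.array([[1, 2], [3, 4]]), np.array([[5, 6]])]

x = None
for frag in fragments:
    if x is None:
        # Special-case the first file: no empty 2-D array to append to.
        x = frag
    else:
        x = np.append(x, frag, axis=0)
```

It works, but the `if x is None` branch inside the loop is exactly the wart I'd like to avoid.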
If anyone can offer help or a suggestion, or can identify a problem with my approach, then I'd be grateful. Thanks