I have been refactoring throwaway code which I wrote some years ago in a FORTRAN-like style. Most of the code is now much more organized and readable. However, the heart of the algorithm (which is performance-critical) uses 1- and 2-dimensional Java arrays and is typified by:
    for (int j = 1; j < len[1]+1; j++) {
        int jj = (cont == BY_TYPE) ? seq[1][j-1] : j-1;
        for (int i = 1; i < len[0]+1; i++) {
            matrix[i][j] = matrix[i-1][j] + gap;
            double m = matrix[i][j-1] + gap;
            if (m > matrix[i][j]) {
                matrix[i][j] = m;
                pointers[i][j] = UP;
            }
            //...
        }
    }
For clarity, maintainability and interfacing with the rest of the code I would like to refactor it. However, on reading "Java Generics Syntax for arrays" and "Java Generics and numbers" I have the following concerns:
Performance. The code is expected to use about 10^8 to 10^9 seconds per year, which is just about manageable. My reading suggests that changing double to Double can sometimes cost a factor of 3 in performance. I'd like to hear other people's experience of this. I would also expect moving from foo[] to List to be a hit as well. I have no first-hand knowledge, so again experience would be useful.
Array-bound checking. Is this treated differently in double[] and List and does it matter? I expect some problems to violate bounds as the algorithm is fairly simple and has only been applied to a few data sets.
If I don't refactor then the code has an ugly and possibly fragile intermixture of the two approaches. I am already trying to write things such as:
List<double[]> and List<Double>[]
and understand that type erasure does not make this pretty and at best gives rise to compiler warnings. It seems difficult to do this without very convoluted constructs.
Obsolescence. One poster suggested that Java arrays should be obsoleted. I assume this isn't going to happen RSN but I would like to move away from outdated approaches.
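To make the boxing and bounds-checking concerns concrete, here is a minimal sketch. The class and method names are made up for illustration and are not part of my real code. It only shows that both `double[]` and `List<Double>` are bounds-checked at run time (with different exception types) and where the boxing happens. It does not measure anything, so it says nothing by itself about the factor-of-3 figure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these names come from the original code.
class BoundsAndBoxingDemo {

    // Sums a primitive array: no boxing is involved.
    static double sumArray(double[] values) {
        double total = 0.0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Sums a List<Double>: each get() returns a Double that is auto-unboxed.
    static double sumList(List<Double> values) {
        double total = 0.0;
        for (int i = 0; i < values.size(); i++) {
            total += values.get(i);
        }
        return total;
    }

    // Returns true if reading the given index from the array throws.
    static boolean arrayThrows(double[] values, int index) {
        try {
            double unused = values[index];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    // Returns true if reading the given index from the list throws.
    static boolean listThrows(List<Double> values, int index) {
        try {
            values.get(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        double[] array = {1.0, 2.0, 3.0};
        List<Double> list = new ArrayList<>();
        for (double v : array) {
            list.add(v); // auto-boxing: double -> Double
        }
        System.out.println(sumArray(array) + " " + sumList(list));
        System.out.println(arrayThrows(array, 3) + " " + listThrows(list, 3));
    }
}
```

Both containers reject index 3 here, so as far as I can tell the difference is in the cost and the exception type rather than in whether a violation is caught.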
SUMMARY The consensus so far:
Collections carry a significant performance hit compared with primitive arrays, especially for constructs such as matrices. The cost is incurred in auto(un)boxing numerics and in accessing list items.
For tight numerical (scientific) algorithms the array notation [][] is actually easier to read, but the variables should be named as helpfully as possible.
Generics and arrays do not mix well. It may be useful to wrap the arrays in classes to transport them in/out of the tight algorithm.
There is little objective reason to make the change.
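The wrapping idea could look roughly like the sketch below. `ScoreMatrix`, `raw()` and `fillFirstColumn` are hypothetical names I have invented for illustration, and the loop is only a stand-in for the real algorithm. The point is that the tight loop keeps working on the bare `double[][]`, while the rest of the code passes the wrapper around and can store it in a generic collection without erasure warnings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper; the name ScoreMatrix and its methods are illustrative only.
final class ScoreMatrix {
    private final double[][] cells;

    ScoreMatrix(int rows, int cols) {
        this.cells = new double[rows][cols];
    }

    int rows() { return cells.length; }
    int cols() { return cells.length == 0 ? 0 : cells[0].length; }

    double get(int i, int j) { return cells[i][j]; }
    void set(int i, int j, double v) { cells[i][j] = v; }

    // Exposes the backing array so the tight loop can keep using [][] directly.
    double[][] raw() { return cells; }
}

class ScoreMatrixDemo {
    // Stand-in for the tight algorithm: fills column 0 with cumulative gap penalties.
    static void fillFirstColumn(ScoreMatrix m, double gap) {
        double[][] matrix = m.raw();
        for (int i = 1; i < matrix.length; i++) {
            matrix[i][0] = matrix[i - 1][0] + gap;
        }
    }

    public static void main(String[] args) {
        // A list of wrappers needs no generic-array constructs.
        List<ScoreMatrix> results = new ArrayList<>();
        ScoreMatrix m = new ScoreMatrix(4, 3);
        fillFirstColumn(m, -1.0);
        results.add(m);
        System.out.println(results.get(0).get(3, 0));
    }
}
```

Returning the backing array from `raw()` trades encapsulation for speed. A stricter design would keep the array private and put the algorithm inside the class.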
QUESTION @SeanOwen has suggested that it would be useful to take constant values out of the loops. Assuming I haven't goofed this would look like:
    int len0 = len[0];
    int len1 = len[1];
    int[] seq1 = seq[1];
    for (int j = 1; j < len1+1; j++) {
        int jj = (cont == BY_TYPE) ? seq1[j-1] : j-1;
        for (int i = 1; i < len0+1; i++) {
            // matrix[i] changes with the inner index, so the row can only be cached per iteration
            double[] matrixi = matrix[i];
            int[] pointersi = pointers[i];
            matrixi[j] = matrix[i-1][j] + gap;
            double m = matrixi[j-1] + gap;
            if (m > matrixi[j]) {
                matrixi[j] = m;
                pointersi[j] = UP;
            }
            //...
        }
    }
I thought compilers were meant to be smart about this sort of thing. Do we still need to do this by hand?