Hi All,

I have some massive (4.6 million lines) data files that I'm trying to edit with Fortran. Each file is a series of headers, each followed by a table of numbers. Something like this:
p he4 blah 99 ggg
1.0e+01 2.0e+01 2.0e+01
2.0e+01 5.0e+01 2.0e+01
.
.
3.2e+-1 2.0e+01 1.0e+00
p he3 blafoo 99 ggg
1.1e+00 2.3e+01 2.0e+01

My task is to replace certain entries in one file with those from the other. The list is supplied separately.

I have already written code that works. My strategy is to read and echo the first file until I find a header that matches the replacement list, then find the same header in the second file and echo its entries, and finally switch back to echoing the first file. The only problem with this approach is that it's SOOOOOO slow! I looked into direct access of the files, but they don't have fixed record lengths. Does anyone have a better idea?

Cheers for the help, Rich

A: 

Are the headers in the files sorted in any way? If not, creating an index file of the headers in the second file should speed up the first lookup. My Fortran is very rusty, but if you can sort the headers in the second file into an index file, each with a reference to the position of the full entry, you should be able to speed things up dramatically.

Jaydee
Hmm, interesting. The headers cannot be expected to be sorted, but building an index could work. However, is there any way to skip to a given line in a file? I would still need to read line by line and rewind unless there is some kind of "seekg", right?
Richard Longland
Sorry, I haven't done any Fortran for 25 years. I'd have thought Fortran would have random file access functionality somewhere, though.
Jaydee
In any case, I'd have thought that a simple sorted list of the headers in the second file would help speed things up: you only scan the list up to the point where the header from the first file would appear (whether or not it is there), and then start with the next header. This means that you don't need to scan the entire second file each time.
Jaydee
I found this, which may help you: http://objectmix.com/fortran/313648-fseek-unformatted-sequential-files.html (this is for Fortran 2003).
Jaydee
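
On the "seekg" question: Fortran 2003 stream access looks like the closest equivalent. A file opened with access='stream' and form='formatted' can still be read line by line, but INQUIRE with POS= reports the current position and READ with POS= jumps back to it, so no fixed record length is needed. Below is a minimal sketch, assuming a Fortran 2003 compiler. The file name 'seek_demo.tmp' and the sample lines are placeholders, written by the program itself so that it runs on its own.

```fortran
program seek_demo
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! wide enough for large file offsets
  integer(i8) :: p
  character(len=80) :: line

  ! Write a small sample file so the sketch runs on its own.
  ! 'seek_demo.tmp' is a placeholder name.
  open(10, file='seek_demo.tmp', status='replace', action='write')
  write(10, '(A)') 'p he4 blah 99 ggg'
  write(10, '(A)') '1.0e+01 2.0e+01 2.0e+01'
  write(10, '(A)') 'p he3 blafoo 99 ggg'
  write(10, '(A)') '1.1e+00 2.3e+01 2.0e+01'
  close(10)

  ! Reopen it as a formatted stream: still line oriented, but positionable.
  open(10, file='seek_demo.tmp', access='stream', form='formatted', &
       status='old', action='read')

  read(10, '(A)') line            ! first header
  read(10, '(A)') line            ! its table row
  inquire(unit=10, pos=p)         ! p now marks the start of the second header
  read(10, '(A)') line            ! second header
  read(10, '(A)') line            ! its table row

  ! Jump straight back to the remembered position, with no rewind and re-scan.
  read(10, '(A)', pos=p) line
  print '(A)', trim(line)         ! the second header again

  close(10)
end program seek_demo
```

For formatted stream files, the value given to POS= in a READ should be one previously returned by INQUIRE (or 1 for the start of the file), which is why an index of header positions has to be built by scanning the file once.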
A: 

I am assuming that you are reading file 1, and writing the results to file 3. File 2 contains the replacements.

Preprocess file 2 by loading each header and using a hash algorithm to create
an array that holds an integer hash of each header value, along with a
pointer/subscript to the values that replace it.

while there are lines left in file 1

    read an original line from file 1

    if the line is a header
         hash it to get the hash value
         if the hash value is in the hash array
              write the replacement header and table to file 3
              skip the original table in file 1
              continue with the next header

    write the original line to file 3

That ought to do the trick.
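
Here is a rough Fortran 2003 sketch of this idea, not a tested solution for the real files. It makes several assumptions that are not in the question: the file names 'file1.dat', 'file2.dat' and 'file3.dat' are placeholders, a header is taken to be any line whose first non-blank character is a letter, lines fit in 128 characters, and the table size nslots is larger than the number of headers in file 2. The hash table stores the header text itself, so two headers that happen to hash to the same slot are still told apart. Only header lines are hashed, and the original table in file 1 is skipped after a replacement is written. Every header that also appears in file 2 gets replaced, so restricting this to the separately supplied list would need one more check.

```fortran
program merge_tables
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)  ! wide enough for file offsets
  integer, parameter :: nslots = 200003  ! assumed table size; must exceed the header count
  integer, parameter :: llen = 128       ! assumed maximum line length
  character(len=llen), allocatable :: keys(:)   ! header text, blank = empty slot
  integer(i8), allocatable :: fpos(:)           ! position of that header in file 2
  character(len=llen) :: line
  integer(i8) :: p
  integer :: ios, slot
  logical :: skipping

  allocate(keys(nslots), fpos(nslots))
  keys = ' '

  ! Pass 1: index every header in file 2 by its stream position.
  open(12, file='file2.dat', access='stream', form='formatted', &
       status='old', action='read')
  do
     inquire(unit=12, pos=p)               ! position of the line about to be read
     read(12, '(A)', iostat=ios) line
     if (ios /= 0) exit
     if (is_header(line)) then
        slot = find_slot(line)
        keys(slot) = line
        fpos(slot) = p
     end if
  end do
  rewind(12)                               ! leave the end-of-file state behind

  ! Pass 2: echo file 1, swapping in the table from file 2 when a header matches.
  open(11, file='file1.dat', status='old', action='read')
  open(13, file='file3.dat', status='replace', action='write')
  skipping = .false.
  do
     read(11, '(A)', iostat=ios) line
     if (ios /= 0) exit
     if (is_header(line)) then
        slot = find_slot(line)
        skipping = (keys(slot) /= ' ')     ! non-blank slot: header exists in file 2
        if (skipping) call copy_table(fpos(slot))
     end if
     if (.not. skipping) write(13, '(A)') trim(line)
  end do
  close(11); close(12); close(13)

contains

  ! Assumption: a header starts with a letter, a table row does not.
  logical function is_header(s)
    character(len=*), intent(in) :: s
    character(len=1) :: c
    c = adjustl(s)                         ! first non-blank character
    is_header = verify(c, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') == 0
  end function is_header

  ! Slot holding key, or the empty slot where it would go (linear probing).
  integer function find_slot(key) result(s)
    character(len=*), intent(in) :: key
    integer :: i
    s = 5381
    do i = 1, len_trim(key)
       s = mod(s*33 + ichar(key(i:i)), nslots)
    end do
    s = s + 1
    do while (keys(s) /= ' ' .and. keys(s) /= key)
       s = mod(s, nslots) + 1
    end do
  end function find_slot

  ! Copy one header and its table from file 2 to file 3, starting at position start.
  subroutine copy_table(start)
    integer(i8), intent(in) :: start
    character(len=llen) :: buf
    integer :: stat
    read(12, '(A)', pos=start, iostat=stat) buf
    do while (stat == 0)
       write(13, '(A)') trim(buf)
       read(12, '(A)', iostat=stat) buf
       if (stat /= 0) exit
       if (is_header(buf)) exit
    end do
    if (stat /= 0) rewind(12)              ! reset after hitting end of file
  end subroutine copy_table

end program merge_tables
```

Each file is read only once from start to finish, plus one positioned read per replaced table, so the cost no longer depends on how far into file 2 each matching header sits. Headers are compared as text with trailing blanks ignored, so headers that differ only in internal spacing would not match.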

EvilTeach