I'm currently working with extremely large fixed-width files, sometimes well over a million lines. I have written a method that can write over the files based on a set of parameters, but I think there has to be a more efficient way to accomplish this. The current code I'm using is:

def self.writefiles(file_name, positions, update_value)
  @file_name = file_name
  @positions = positions.to_i
  @update_value = update_value

  line_number = 0
  @file_contents = File.open(@file_name, 'r').readlines

  while line_number < @file_contents.length
    @read_file_contents = @file_contents[line_number]
    @read_file_contents[@positions] = @update_value
    @file_contents[line_number] = @read_file_contents
    line_number += 1
  end

  write_over_file = File.new(@file_name, 'w')
  line_number = 0

  while line_number < @file_contents.length
    write_over_file.write @file_contents[line_number]
    line_number += 1
  end

  write_over_file.close
end

For example, if position 25 in the file indicated that it is an original file, the value would be set to "O", and if I wanted to replace that value I would use ClassName.writefiles(filename, 140, "X") to change this position on each line. Any help on making this method more efficient would be greatly appreciated!

Thanks

A: 
#!/usr/bin/ruby
# replace_at_pos.rb
pos, char, infile, outfile = $*
pos = pos.to_i
File.open(outfile, 'w') do |f|
  File.foreach(infile) do |line|
    line[pos] = char
    f.puts line
  end
end

and you use it as:

replace_at_pos.rb 140 X inputfile.txt outputfile.txt

For replacing a set of values, you can use a hash:

replace = {
  100 => 'a',
  155 => 'c',
  151 => 't'
}
. . .
replace.each do |k, v|
  line[k] = v
end
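Folded into the same foreach loop as above, that might look roughly like this. This is a sketch only, not tested code; the positions and characters are the example hash from above, and the invocation mirrors the first script:

#!/usr/bin/ruby
# replace_many.rb -- sketch only, not tested code
replace = {
  100 => 'a',
  155 => 'c',
  151 => 't'
}

infile, outfile = $*

File.open(outfile, 'w') do |f|
  File.foreach(infile) do |line|
    # overwrite each configured position with its replacement character
    replace.each { |pos, char| line[pos] = char }
    f.puts line
  end
end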
Mladen Jablanović
Great, I'll have to try this out and see what kind of performance boost I receive. Just one more quick question for you: how would I modify this if more than one position needed to be changed? E.g. I need to update the date in positions 100..107. Thank you once again for the help!
Ruby Novice
Hmm, I just used the code you provided and all it appears to do is delete every line in the file.
Ruby Novice
First part or the second? I tried the first, works ok. The second is just an idea, not working code.
Mladen Jablanović
The first part (I'm working from Windows)
Ruby Novice
I changed the code (and invoking syntax), try it now. Not sure why it wouldn't work in Windows.
Mladen Jablanović
+1  A: 

If it's a fixed-width file, you can open the file for read/write, use seek to move to the start of the data you want to change, and write only the data you're changing rather than the whole line. This would probably be more efficient than rewriting the entire file to replace one field.

Here's a crude example. It reads the last field (10, 20, 30), increments it by 1, and writes it back:

tha_file (10 characters per line, including newline)

12 3 x 10
23 4 x 20
78 9 x 30

seeker.rb

#!/usr/bin/env ruby
fh=open("tha_file", "r+")

$RECORD_WIDTH=10
$POS=8
$FIELD_WIDTH=2

# seek to first field
fh.seek($POS - 1, IO::SEEK_CUR)

while !fh.eof?

  cur_val=fh.read($FIELD_WIDTH).to_i
  puts "read #{cur_val}"
  fh.seek(-1 * $FIELD_WIDTH, IO::SEEK_CUR)
  cur_val = cur_val + 1

  fh.write(cur_val)
  puts "wrote #{cur_val}"

  # Move to start of next field in the middle of next record
  fh.seek($RECORD_WIDTH - $FIELD_WIDTH, IO::SEEK_CUR)
end
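For the original question (stamping a constant value at a known column on every line), roughly the same pattern applies. The following is a hedged sketch, not tested code; the file name, record width, and position are placeholders that would need to match the real layout:

#!/usr/bin/env ruby
# stamp_field.rb -- sketch only. The record width must include the line
# terminator, and on Windows the file should be opened in binary mode
# ("r+b") so the offsets aren't thrown off by newline translation.

RECORD_WIDTH = 141   # assumed bytes per line, newline included
POS          = 140   # 0-based index, as in the question's writefiles call
NEW_VALUE    = "X"   # must be exactly as wide as the field it replaces

fh = open("fixed_width.txt", "r+b")

# move to the field in the first record
fh.seek(POS, IO::SEEK_CUR)

while !fh.eof?
  fh.write(NEW_VALUE)   # overwrite just this field, in place
  # skip the rest of this record plus the lead-in of the next one
  fh.seek(RECORD_WIDTH - NEW_VALUE.length, IO::SEEK_CUR)
end

fh.close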
Shin
I attempted this before going with the method used above and it unfortunately caused all kinds of problems. I suppose I had only been using Ruby for a week or so at that point though, so maybe I'll give it another shot.
Ruby Novice
Could you possibly give me an example of what the code would look like? I can't seem to stop seek from changing the formatting of the file any time I insert new values. I've tried looking around for some more in depth guides on how to use it, but every site seems to give the same example. Thanks
Ruby Novice
The problem is you have to always remember /exactly/ where you are in the file and must make sure to write the fields in the same width. My code above doesn't check width and will break when going from 99 to 100.
Shin
Awesome, this gives me a much better idea about how to go about implementing this approach. Thanks for taking the time to write up a more comprehensive example, I'll tinker with it a bit and see if I can't get it working =D
Ruby Novice
So I managed to get this method working with the files that I'm using (it was actually much simpler than expected), but I've found an odd problem. The original method is still completing faster than the IO method (I tried flushing the buffer, etc.) and I cannot figure out why. I haven't had time to do intensive benchmarking to find what's causing it yet, so I'm just curious if you have any idea what could be slowing it down? Thanks!
Ruby Novice
It really depends on your data and what you're trying to do. In your example, you read the whole file at once, and write a whole new file. Roughly N IO operations. My example does a read, seek, write, seek, so 4*N IO operations. One or the other might be faster depending on the size of your data. Another alternative would be to write the file processing logic in a faster language (C, Java, etc).
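If it helps to measure rather than guess, Ruby's standard Benchmark module can time both approaches. A rough sketch, where "data.txt" is a placeholder file name, ClassName.writefiles is the method from the question, and seek_and_write is a hypothetical wrapper around the seek-based version:

require 'benchmark'

Benchmark.bm(12) do |b|
  # both method names below are stand-ins for your own implementations
  b.report("rewrite:")    { ClassName.writefiles("data.txt", 140, "X") }
  b.report("seek/write:") { ClassName.seek_and_write("data.txt", 140, "X") }
end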
Shin
A: 

You will certainly save some time, and quite a lot of memory, by reworking the program to read the file one line at a time (you are currently reading the whole file into memory). You then write to a backup copy of the file within the loop and rename the file at the end. Something like this:

  def self.writefiles2(file_name, positions, update_value)
    @file_name = file_name
    @new_file = file_name + ".bak"
    @positions = positions.to_i
    @update_value = update_value

    reader = File.open(@file_name, 'r')
    writer = File.open(@new_file, 'w')

    while line = reader.gets
      line[@positions] = @update_value
      writer.puts(line)
    end
    reader.close
    writer.close
    # Rename the file
  end

This would of course need some error handling around the rename step, since a failure there could result in the loss of your input data.
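One possible way to guard that rename, as a sketch (the .orig suffix and the rescue choice are just illustrative):

# Sketch only: keep the original file until the swap has succeeded, so a
# failure leaves either the original or the rewritten copy on disk.
begin
  backup = @file_name + ".orig"
  File.rename(@file_name, backup)       # set the original aside first
  File.rename(@new_file, @file_name)    # promote the rewritten copy
  File.delete(backup)
rescue SystemCallError => e
  warn "Rename failed, nothing was deleted: #{e.message}"
end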

Steve Weet
Just benchmarked both methods, and unfortunately the original method I have been using is quite a bit faster. Thanks for the input though!
Ruby Novice
Well, that's odd, as mine showed exactly the opposite, i.e. mine ran in about 2/3 of the time (100 iterations over a 256k-line file gave 102s vs 161s). Did you run them within the same process? I tried that, but there was very little memory left after the first run, so I ran them in separate processes.
Steve Weet
Hmmm I'll have to try it once again then, sorry I missed the update to your post yesterday. Thanks!
Ruby Novice
You will almost certainly find that Shin's solution is the quickest. The version above should still end up faster than yours, as it does a single scan of the input file and a sequential write of the output, whereas your original reads the whole input file into an array and then iterates over the array to write the output.
Steve Weet