Hello, I'm writing a small parser for Google and I'm not sure what's the best way to design it. The main problem is the way it will remember the position it stopped at.

During parsing it will append new searches to the end of a file and work through the file starting with the first line. If the execution is interrupted for some reason, I want the script to know the last search it completed successfully.

One way is to delete a line from the file after fetching it, but then I have to manage the order in which threads access the file, and AFAIK deleting the first line of a file can't be done efficiently.

Another way is to write the numbers of the processed lines to a text file and skip the lines whose numbers are in that file. Or maybe I should use a database instead? TIA

A: 

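There's nothing wrong with using a state file. The only catch is that you need to ensure you have fully committed your changes to the state file before your program enters a section where it may be interrupted. Typically this is done with an IO#flush call, which pushes Ruby's internal buffer out to the operating system. If you also need to survive a system crash rather than just a killed process, IO#fsync additionally asks the OS to write the data to disk.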

For example, here's a simple state-tracking class that works on a line-by-line basis:

class ProgressTracker
  def initialize(filename)
    @filename = filename
    # File.open always treats the argument as a path, unlike Kernel#open
    @file = File.open(@filename)

    @state_filename = File.expand_path(".#{File.basename(@filename)}.position", File.dirname(@filename))

    if (File.exist?(@state_filename))
      # Existing state: reopen it and jump to the saved offset
      @state_file = File.open(@state_filename, File::RDWR)
      resume!
    else
      @state_file = File.open(@state_filename, File::RDWR | File::CREAT)
    end
  end

  def each_line
    @file.each_line do |line|
      # Save the offset of the next line before handing this one to the block
      mark_position!
      yield(line) if (block_given?)
    end
  end

protected
  def mark_position!
    @state_file.rewind
    @state_file.puts(@file.pos)
    @state_file.flush
  end

  def resume!
    # gets returns nil for an empty state file; readline would raise EOFError
    if (position = @state_file.gets)
      @file.seek(position.to_i)
    end
  end
end

You use it with an IO-like block call:

test = ProgressTracker.new(__FILE__)

n = 0

test.each_line do |line|
  n += 1

  puts "%3d %s" % [ n, line ]

  if (n == 10)
    raise 'terminate'
  end
end

In this case, the program reads itself and will stop after ten lines due to a simulated error. On the second run it should display the next ten lines, if there are that many, or simply exit if there's no additional data to retrieve.
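Note that the class above calls mark_position! before yielding, so a line counts as done even if the block fails partway through it. If you would rather retry the line that was in progress when the interruption happened, one option is to save the offset only after the block returns. Here is a minimal sketch of that variant. The method name each_line_after_success and the separate state_path argument are made up for illustration, and it rewrites the state file on every line for simplicity:

```ruby
# Hypothetical helper, not part of ProgressTracker: yields each unread line
# and records the byte offset only after the block has returned normally.
def each_line_after_success(path, state_path)
  # Resume from the saved byte offset, or from the start if none is saved.
  offset = File.exist?(state_path) ? File.read(state_path).to_i : 0

  File.open(path) do |file|
    file.seek(offset)
    file.each_line do |line|
      yield line
      # Reached only if the block did not raise, so an interrupted line
      # is handed out again on the next run.
      File.write(state_path, file.pos.to_s)
    end
  end
end
```

The trade-off is that a line may be processed twice if the interruption lands after the work finished but before the offset was saved, so the per-line work should be safe to repeat.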

One caveat is that you need to remove the .position file associated with the input data if you want the file to be reprocessed, or if the file has been reset. It's also not possible to edit the file and remove earlier lines or it will throw off the offset tracking. So long as you're simply appending data to the file, or restarting it, everything will be fine.
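The "file has been reset" case can be caught automatically in some situations by comparing the saved offset against the current file size. This is only a sketch under assumptions: safe_resume_offset is a hypothetical helper, and it assumes the state file holds a single integer as in the class above. It cannot detect edits that leave the file at least as long as the saved offset.

```ruby
# Hypothetical helper: returns the offset to resume from, or 0 when the
# saved position can no longer be valid for the input file.
def safe_resume_offset(path, state_path)
  return 0 unless File.exist?(state_path)

  saved = File.read(state_path).to_i
  if saved > File.size(path)
    # The input is shorter than the saved offset, so it was truncated or
    # replaced. Drop the stale state and start over.
    File.delete(state_path)
    return 0
  end
  saved
end
```

You would call this before seeking, for example file.seek(safe_resume_offset(path, state_path)).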

tadman
A: 

Have you looked at Treetop?

luccastera