I'm pretty new to programming, so be gentle. I'm trying to extract ISBN numbers from a library database .dat file. I have written code that works, but it only searches through about half of the 180MB file. How can I adjust it to search the whole file? Or how can I write a program that will split the .dat file into manageable chunks?

edit: Here's my code:

export = File.new("resultsfinal.txt","w+")

File.open("bibrec2.dat").each do |line|
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
A: 

If you are programming on a modern operating system and the computer has enough memory (say 512MB), Ruby should have no problem reading the entire file into memory.

Things typically get iffy when you reach a working set of about 2GB on a typical 32-bit OS.
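
A minimal sketch of that approach, assuming the file fits in memory. The helper name `find_isbns` is hypothetical, and the two patterns from the question are combined into one slightly simplified expression:

```ruby
# Hypothetical helper: reads the whole file at once and returns every match.
def find_isbns(path)
  # binread loads the raw bytes of the file in a single call.
  data = File.binread(path)
  # Either pattern from the question: "a" + 10 ISBN characters + one non-word
  # character, or "a" + 13 ISBN characters.
  data.scan(/a[0-9xX]{10}\W|a[0-9xX]{13}/)
end
```

Calling `find_isbns("bibrec2.dat")` would return an array of matched strings, which can then be written out with `File.write` or `puts`.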

drudru
Well, mine's getting iffy with 4GB on Vista, if that helps. Also, it doesn't bring up an error, just an incomplete set of results.
I believe he means the data is 4GB, not the size of your memory. A 32-bit operating system cannot use more than roughly 3.5GB of RAM, so you don't have 4GB of working RAM at your disposal regardless (unless you are running 64-bit Vista). If your dataset is only 180MB, the problem must be in your code. Would you post the script?
Hooked
No problem, I'll post it tomorrow. Thanks very much.
+2  A: 

As to the performance issue, I can't see anything particularly worrying about the file size: 180MB shouldn't pose any problems. What happens to memory use when you're running your script?

I'm not sure, however, that your regular expressions are doing what you want. This, for example:

/[a]{1}[1234567890xX]{10}\W/

does (I think) this:

  • one "a". Do you really want to match an "a"? If so, "a" would suffice, rather than "[a]{1}".
  • exactly 10 of (digit or "x" or "X")
  • a single "non-word" character i.e. not a-z, A-Z, 0-9 or underscore
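
As a rough illustration of that breakdown, here is the same pattern in simplified form run against a made-up line. The sample string is an assumption; the real record layout of the .dat file is not shown in the question.

```ruby
# Simplified form of the first pattern from the question:
# a literal "a", exactly ten characters that are digits or x/X,
# then one non-word character.
ISBN10_PATTERN = /a[0-9xX]{10}\W/

# Hypothetical sample line, invented for illustration only.
sample = "xx a0306406152 yy a123456789X|zz a12345 "

matches = sample.scan(ISBN10_PATTERN)
puts matches.inspect
# prints ["a0306406152 ", "a123456789X|"]
```

Note that the trailing non-word character is part of each match, so it ends up in the output as well.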

There are a couple of sample ISBN matchers here and here, although they seem to be matching something more like the format that we see on the back cover of a book and I'm guessing your input file has stripped out some of that formatting.

Mike Woodhouse
Yeah, the original data file has reformatted the ISBNs so they are in that format. I have no idea why it's done that! Good call on just writing 'a', that seems a lot simpler.
A: 

You should try catching exceptions to check whether the problem is really in the read block.

Just so you know, I have already written a script with much the same syntax to search a really big file (~8GB) without any problem.

export = File.new("resultsfinal.txt","w+")

File.open("bibrec2.dat").each do |line|
  begin
    line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
      export.puts x
    end
    line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
      export.puts x
    end
  rescue
    puts "Problem while adding the result"
  end
end
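
A variant of the same idea, as a sketch only: it reports which line raised and counts the bytes actually read, so the total can be compared with the file size. The helper name `scan_isbns` and the simplified patterns are assumptions, not part of the original script.

```ruby
# Hypothetical helper: scans each line for the two patterns from the question
# and returns how many bytes were read and how many matches were written.
def scan_isbns(in_path, out_path)
  patterns = [/a[0-9xX]{10}\W/, /a[0-9xX]{13}/]
  bytes_read = 0
  match_count = 0

  # The block forms close both files, which also flushes buffered output.
  File.open(out_path, "w") do |export|
    # "rb" reads raw bytes so that bytes_read is comparable to File.size.
    File.open(in_path, "rb") do |input|
      input.each_line.with_index(1) do |line, line_no|
        bytes_read += line.bytesize
        begin
          patterns.each do |pattern|
            line.scan(pattern) do |match|
              export.puts match
              match_count += 1
            end
          end
        rescue StandardError => e
          # Report where the failure happened instead of a generic message.
          warn "Line #{line_no}: #{e.class}: #{e.message}"
        end
      end
    end
  end

  [bytes_read, match_count]
end
```

Calling `scan_isbns("bibrec2.dat", "resultsfinal.txt")` and comparing the first returned value with `File.size("bibrec2.dat")` shows whether the whole file was actually read.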
Yoann Le Touche