I'm writing this little HelloWorld as a follow-up to this, and the numbers do not add up:

filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each do |line|
  total_bytes += line.unpack("U*").length
end
puts "original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"

The result is not the same as the file size. I think I just need to know what format I need to plug in... or maybe I've missed the point entirely. How can I measure the file size line by line?

Note: I'm on Windows, and the file is encoded as type ANSI.

Edit: This produces the same results!

filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each_byte do |whatever|
  total_bytes += 1
end
puts "Original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"

so anybody who can help now...

A: 
f = File.new("log.txt")
begin
    while (line = f.readline)
        line.chomp
        puts line.length
    end
rescue EOFError
    f.close
end
Eduardo Cobuci
Welcome Eduardo! Can be simplified: File.new('log.txt').each_line { |line| puts line.length } (your call to line.chomp doesn't do anything?)
Martin Carpenter
Thanks Eduardo. line.length is for the characters, so it doesn't work out either. That's the prior question.
Yar
+1  A: 

You potentially have several overlapping issues here:

  1. Linefeed characters \r\n vs. \n (as per your previous post). Also EOF file character (^Z)?

  2. Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?

  3. Interaction of the $KCODE global variable (deprecated in Ruby 1.9; see String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?

  4. Your format string for #unpack. I think you want C* here if you really want to count bytes.

Note also the existence of IO#each_line (just so you can throw away the while and be a little more ruby-idiomatic ;-)).
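
To make points 2 and 4 concrete, here is a minimal sketch of how the two format strings differ. It assumes Ruby 1.9 or later and uses a made-up UTF-8 string (not the asker's ANSI file) purely for illustration:

```ruby
# Illustrative string only: "café" followed by a Windows line break.
# In UTF-8 the "é" occupies two bytes.
line = "caf\u00e9\r\n"

chars_via_unpack = line.unpack("U*").length # counts Unicode code points => 6
bytes_via_unpack = line.unpack("C*").length # counts raw bytes => 7
bytes_builtin    = line.bytesize            # built-in byte count => 7

puts "code points: #{chars_via_unpack}"
puts "bytes (C*):  #{bytes_via_unpack}"
puts "bytesize:    #{bytes_builtin}"
```

Only the byte counts can be expected to sum to File.size, and even then only if the line terminators reach the string untranslated.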

Martin Carpenter
Martin, perhaps it's impossible, but I just want the two puts at the end to give the same length. C* doesn't work either. How could I figure out how long a linebreak is in Windows ANSI?
Yar
Can you provide a hex dump of a minimal problematic file (or line)? #length on "68 65 6c 6c 6f 0d 0a" returns length seven (correct!) for me here, so I'm not sure where your problem lies.
Martin Carpenter
Hi Martin, 1) I've fixed the code to be more ruby-idiomatic. 2) file is now at http://confusionstudio.com/random/testThis.txt
Yar
I don't see anything strange with this file (UNIX newlines). I get 20061 bytes using File#size or adding line #lengths on Linux or Windows.
Martin Carpenter
+2  A: 

You might try IO#each_byte, e.g.

total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
  file.each_byte {|b| total_bytes += 1}
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"

That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.
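
One way to get byte counts a line at a time, sketched under the assumption that opening the file in binary mode ("rb") is acceptable for your use case. The helper name line_byte_sizes and the sample data are hypothetical; binary mode turns off newline translation, so each line keeps its original terminator:

```ruby
require "tempfile"

# Hypothetical helper: byte length of every line, terminators included.
def line_byte_sizes(path)
  sizes = []
  # "rb" disables newline translation, so "\r\n" stays two bytes on any OS.
  File.open(path, "rb") do |file|
    file.each_line { |line| sizes << line.bytesize }
  end
  sizes
end

# Demo on a throwaway file mixing Windows and UNIX line endings.
Tempfile.create("sizes") do |tmp|
  tmp.binmode
  tmp.write("hello\r\nworld\nlast")
  tmp.flush
  sizes = line_byte_sizes(tmp.path)
  puts sizes.inspect                                # => [7, 6, 4]
  puts sizes.inject(0, :+) == File.size(tmp.path)   # => true
end
```

Read this way, the per-line sizes should add up to File.size, because nothing is converted on the way in.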

Sarah Mei
@Sarah, is the \r\n always there regardless of what file format/encoding I'm reading from?
Yar
If it's in the file to begin with, it should be there via each_byte. That method is intended for raw data access, vs. readline, which does some munging on the line endings. Line endings will either be \n or \r\n, depending on how the file was saved.
Sarah Mei
Sorry to bother, but you mean, it will be either \n or \r\n and those are basically ALL the possibilities? Or are there a bunch more that file.each handles?
Yar
On second thought, that's just for curiosity. I think if I have Linux, OSX and Windows, I'm all good :)
Yar
That will be most of them, but there are others out there. Wikipedia has a pretty comprehensive list. http://en.wikipedia.org/wiki/Newline
Sarah Mei
That's if you care about text files that come from IBM mainframes. :D
Sarah Mei
I do not (about the IBM mainframe). So I changed the Q to reflect your answer, but it doesn't help. I can think of a workaround, but I'd like to figure out what's going on... thx!
Yar
I'm starting to think this is a Ruby version or environment issue. I downloaded the file (from the link in your comment on Martin's answer) and I get 20061 no matter how I do it - File#size, reading lines, reading bytes, all 20061. I tried it on both linux and Windows.
Sarah Mei
Sorry, forgot to add, can you post your environment where you're running the code that gets different results?
Sarah Mei
+1  A: 

IO#gets works the same as if you were capturing input from the command line: the "Enter" isn't sent as part of the input; neither is it passed when #gets is called on a File or other subclass of IO, so the numbers are definitely not going to match up.

See the relevant Pickaxe section

May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...

Aha. I think I get it now.

Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:

class Chunkifier
  # Splits the file at path into strings of up to 4,000 characters,
  # breaking only at line boundaries.
  def self.to_chunks(path)
    chunks = [""]
    File.readlines(path).each do |line|
      line.chomp! # strips a trailing \n, \r or \r\n
      if chunks.last.size + line.size >= 4_000 # 4096?
        chunks.last.chomp! # remove last line terminator
        chunks << ""
      end
      chunks.last << line + "\n" # or whatever terminator you need
    end
    chunks
  end
end

if __FILE__ == $0
  require 'test/unit'
  class TestFile < Test::Unit::TestCase
    def test_chunking
      chs = Chunkifier.to_chunks(PATH) # PATH: path to the input file, to be defined by you
      chs.each do |chunk|
        assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
      end
    end
  end
end

Note the use of IO#readlines to get all the text in one slurp; #each or #each_line would do as well. I used String#chomp! to ensure that, whatever the OS is doing, the bytes at the end are removed, so that \n (or whatever terminator you need) can be forced into the output.

I would suggest using File#write, rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
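
If the exact bytes on disk matter, one option (an assumption on my part, not something tested on an iPod) is to open the output in binary mode as well, since "wb" turns off newline translation altogether. The helper write_chunks, the "note" prefix and the file naming scheme below are hypothetical:

```ruby
require "tmpdir"

# Hypothetical helper: writes each chunk to its own numbered file and
# returns the paths. "wb" means "\n" is written as a single byte everywhere.
def write_chunks(chunks, dir, prefix = "note")
  chunks.each_with_index.map do |chunk, index|
    path = File.join(dir, format("%s_%03d.txt", prefix, index + 1))
    File.open(path, "wb") { |out| out.write(chunk) }
    path
  end
end

# Demo in a throwaway directory.
Dir.mktmpdir do |dir|
  paths = write_chunks(["first\nchunk", "second\nchunk"], dir)
  paths.each { |path| puts "#{File.basename(path)}: #{File.size(path)} bytes" }
end
```

With this, the size of each output file matches the byte size of the chunk that produced it, whichever OS runs the script.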

If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:

class String
  def size_in_bytes
    self.unpack("C*").size
  end
end

The unpack version is about 8 times faster than the each_byte one on my machine, btw.

Mike Woodhouse
Probably am overcomplicating. I'm just breaking a file up, at the line breaks, into 4K pieces for the iPod notes. The program works, but it's adjusted for Windows. file.each is not the right way to go?
Yar
Ah, I see - nice to see Apple delivering a high-quality experience. I think there may be a neater way to handle the line-termination thing - I'll go and play a little.
Mike Woodhouse
Thanks so much Mike. Lot of text in there, I'll have to work through it in short order.
Yar
Mike, I'm PRETTY sure that chomp! does nothing to the lines read in a File.each because they are already missing their line end characters.
Yar
I added the chomp! because of uncertainty about line endings and treatment thereof when the OS where the file was created and the OS where it's processed are different. I had something odd happen when processing a *nix-created file in Windows. If you don't need it, ignore it!
Mike Woodhouse
+1  A: 

The issue is that when you save a text file on Windows, each line break is two characters (characters 13 and 10) and therefore 2 bytes; when you save it on Linux there is only one (character 10). However, when reading in text mode on Windows, Ruby reports the pair as a single character, '\n' (character 10). What's worse, if you're on Linux with a Windows file, Ruby will give you both characters.

So, if you know that your files are always Windows text files and the code always runs on Windows, you can add 1 to your count every time you get a newline character. Otherwise it's a couple of conditionals and a little state machine.
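
The "couple of conditionals" could look roughly like the sketch below. It assumes the data was read in binary mode, so terminators arrive untouched, and it only distinguishes \r\n from \n (a lone \r, as used by classic Mac OS, is not handled). The helper names are hypothetical:

```ruby
# Hypothetical helper: how many bytes the terminator of this line occupies.
def terminator_size(line)
  if line.end_with?("\r\n")
    2 # Windows-style CR LF
  elsif line.end_with?("\n")
    1 # UNIX-style LF
  else
    0 # last line without a terminator
  end
end

# Hypothetical helper: tallies lines by terminator size.
def terminator_counts(data)
  counts = Hash.new(0)
  data.each_line { |line| counts[terminator_size(line)] += 1 }
  counts
end

sample = "one\r\ntwo\nthree"
puts terminator_counts(sample).inspect # => {2=>1, 1=>1, 0=>1}
```

A tally like this also shows quickly whether a file mixes line-ending styles, which is when a fixed "add 1 per line" correction stops working.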

BTW there's no EOF 'character'.

Yeah, so basically my choices are: 1) write a small test file every time and check its size or 2) read a file and compare the line sizes to the filesize, and correct from there (which I've tried and works).
Yar