ansaurus

Question

Ruby Reads Different File Sizes for Line Reads

Answer 1

+3 A:

My guess would be that you are on Windows, and your "testThis.txt" file has \r\n line endings. When the file is opened in text mode, each line ending will be converted to a single \n character. Therefore you'll lose 1 character per line.

Does your test file have 60 lines in it? That would be consistent with this explanation.
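A rough way to check this explanation is to total the line lengths of the same file in binary mode and in text mode. The sketch below is only an illustration: the file contents and the method name compare_read_modes are made up, and it assumes a reasonably recent Ruby (2.4 or later for sum), where the "rt" mode string requests newline conversion on read.

```ruby
require "tempfile"

# Writes a small CRLF file, then totals the line lengths in two read modes.
# The file contents and the method name are hypothetical, for illustration only.
def compare_read_modes
  Tempfile.create("crlf_demo") do |f|
    f.binmode                              # write the bytes exactly as given
    f.write("one\r\ntwo\r\nthree\r\n")     # 3 lines, 17 bytes on disk
    f.close

    # "rb": bytes come back untouched, so each "\r\n" stays two bytes.
    binary = File.open(f.path, "rb") { |io| io.each_line.sum(&:bytesize) }

    # "rt": text mode with newline conversion, so each "\r\n" becomes "\n".
    text = File.open(f.path, "rt") { |io| io.each_line.sum(&:bytesize) }

    { on_disk: File.size(f.path), binary: binary, text: text }
  end
end

p compare_read_modes
```

If the explanation holds, the text-mode total should fall short of the on-disk size by exactly the number of lines, while the binary-mode total should match it.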

Greg Ball 2009-03-09 10:43:57

Answer 2

+5 A:

There are special characters stored in the file that delineate the lines:

CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
0x0A (\n) on UNIX systems.

Ruby's gets splits lines on \n. When a Windows file is read in text mode, each \r\n pair is converted to a single \n, so you lose 1 byte for every line you read.

Also, String#length is not a good measure of the size of a string in bytes. If the string is not pure ASCII, one character may be represented by more than one byte (as in UTF-8). In that case length returns the number of characters in the string, not the number of bytes.

To get the size of a file, use File.size(file_name).
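To make the character/byte distinction concrete, here is a small sketch assuming a current Ruby, where String#length counts characters and String#bytesize counts bytes. The helper name char_and_byte_counts and the sample word are made up for the example.

```ruby
# Returns [character count, byte count] for a string.
# The helper name is hypothetical, for illustration only.
def char_and_byte_counts(str)
  [str.length, str.bytesize]
end

word = "na\u00EFve"   # "naïve" in UTF-8; the ï takes two bytes

chars, bytes = char_and_byte_counts(word)
puts "#{chars} characters, #{bytes} bytes"   # prints "5 characters, 6 bytes"
```

For a whole file, File.size(file_name) reports the byte count without reading the contents at all.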

Andy 2009-03-09 10:47:00

Actually, depending on the version of Ruby you're using, str.length might return the number of bytes or the number of characters. (In 1.8 it gives you the number of bytes. From 1.9 on, the number of characters.) One more thing to keep in mind if you plan on this being portable.

Sarah Mei 2009-03-09 16:31:21

This is great. Would you mind having a look at the followup? http://stackoverflow.com/questions/628096

Yar 2009-03-09 22:36:40

Mind accepting an answer?

Andy 2009-04-26 18:32:18

Answer 3

+3 A:

The line-ending issue is the most likely culprit here.

It's also worth noting that if the character encoding of the text file is something other than ASCII, you will have a discrepancy between the two as well. If the file is UTF-8, the counts still match for English and for other text that uses only ASCII characters. Beyond that, the byte count can be far larger than the character count: up to 4 times larger in UTF-8 as currently defined (the original design allowed sequences of up to 6 bytes).

Relying on '1 character = 1 byte' is just asking for trouble as it is almost certainly going to fail at some point.
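As a quick illustration of how far the two counts can drift apart in UTF-8, the sketch below prints the byte width of a few single characters. The sample characters and the method name utf8_widths are chosen just for the example.

```ruby
# One sample character for each UTF-8 width (chosen for illustration):
# "a", "é", "€" and an emoji, written as escapes so the source stays ASCII.
SAMPLES = ["a", "\u00E9", "\u20AC", "\u{1F600}"]

# Returns the number of bytes each single-character string occupies.
# The method name is hypothetical.
def utf8_widths(chars)
  chars.map { |c| c.bytesize }
end

p utf8_widths(SAMPLES)   # prints [1, 2, 3, 4]
```

Each of these strings has a length of 1, so a file made up of such characters can be anywhere from 1 to 4 times larger in bytes than its character count suggests.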

workmad3 2009-03-09 10:54:56

Now the real question: what is better than 1 character = 1 byte?

Yar 2009-03-09 10:56:03

1 character = 1 character, 1 byte = 1 byte and never the twain should meet :)

workmad3 2009-03-09 10:58:20

Terse, but I get the idea. I'll comment back if I can't make sense of it. Thanks!

Yar 2009-03-09 11:50:33

Now I've gotten to part 2. Thanks! http://stackoverflow.com/questions/628096/ruby-length-of-a-line-of-a-file-in-bytes

Yar 2009-03-09 21:24:52
