I know that I should open a binary file using "rb" instead of "r" because Windows behaves differently for binary and non-binary files.

But I don't understand what exactly happens if I open a file the wrong way and why this distinction is even necessary. Other operating systems seem to do fine by treating both kinds of files the same.

+5  A: 

This mode is about conversion of line endings.

When reading in text mode, the platform's native line endings (\r\n on Windows) are converted to Python's Unix-style \n line endings. When writing in text mode, the reverse happens.

In binary mode, no such conversion is done.

Other platforms usually do fine without the conversion, because they store line endings natively as \n. (An exception is Mac OS, which used to use \r in the old days.) Code relying on this, however, is not portable.
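
A minimal sketch of that conversion, assuming Python 3: its `open()` accepts a `newline` argument, so passing `newline="\r\n"` reproduces on any platform the translation that text mode performs by default on Windows. The file name and variable names are only illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # illustrative file name

    # Text-mode write: newline="\r\n" mimics the Windows default,
    # so every "\n" written is stored as "\r\n" on disk.
    with open(path, "w", encoding="ascii", newline="\r\n") as f:
        f.write("one\ntwo\n")

    # Binary-mode read: the bytes come back exactly as stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Text-mode read: line endings are converted back to "\n".
    with open(path, "r", encoding="ascii") as f:
        text = f.read()

print(repr(raw))   # b'one\r\ntwo\r\n'
print(repr(text))  # 'one\ntwo\n'
```

The binary read shows what is really on disk; the text read shows what the program sees after conversion.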

Thomas
So if I read a Windows text file with Python on Linux, no conversion happens and I would end up with an additional \r on every line?
Fabian
Yes.
Thomas
@Fabian: yes. Your application has to know which kind of file it is dealing with. In most cases you can simply check the file contents you read for "\r\n" sequences and replace them with "\n" using the string methods.
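
A rough sketch of that replacement, assuming the file was read in binary mode so that nothing has been converted yet; `normalize_newlines` is a hypothetical helper name, not a library function.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Replace every Windows line ending (\\r\\n) with a Unix one (\\n)."""
    return data.replace(b"\r\n", b"\n")


# Made-up sample content with Windows line endings.
windows_bytes = b"first line\r\nsecond line\r\n"
print(normalize_newlines(windows_bytes))
```

Data that already uses \n passes through unchanged, so the helper is safe to call on files of either kind.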
jsbueno
@Fabian: you should check that in each case. In theory, since the C library does not do any conversion when reading or writing files on Unix (whether or not 'b' is given), a text file from Windows transferred as binary to Unix should be read "wrong" (that is, with the \r remaining). However, maybe the Python developers decided to "eat" the extra \r just in case.
Nas Banov
+1  A: 

On Windows, writing in text mode converts each newline (\n) to a carriage return followed by a newline (\r\n).

If you read text in binary mode, there are no problems. If you read binary data in text mode, it will likely be corrupted.
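
To illustrate the corruption, here is a small sketch assuming Python 3, whose text mode translates line endings on read on every platform (not only on Windows). The payload bytes are made up, and `latin-1` is chosen only because it maps each byte to exactly one character, so any difference comes from newline handling alone.

```python
import os
import tempfile

# Made-up "binary" payload that happens to contain CR and LF bytes.
payload = b"\x00\r\n\x01\r\x02\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blob.bin")  # illustrative file name
    with open(path, "wb") as f:
        f.write(payload)

    # Binary mode: bytes are returned untouched.
    with open(path, "rb") as f:
        as_binary = f.read()

    # Text mode: "\r\n" and a lone "\r" are both turned into "\n".
    with open(path, "r", encoding="latin-1") as f:
        as_text = f.read().encode("latin-1")

print(as_binary == payload)          # True
print(as_text == payload)            # False
print(len(payload), len(as_text))    # 7 6
```

One byte is lost and another is changed, which is enough to break most binary formats.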

Ben Hoffstein
A: 

For reading files there should be no difference. When writing to text-files Windows will automatically mess up your line-breaks (it will add \r's before the \n's). That's why you should use "wb".

mdm
-1. In many cases, you *want* the line breaks on Windows. Ever tried reading a Unix text file in Notepad?
Thomas
A: 

Well, this is for historical (or, as I like to say, hysterical) reasons. The file open modes are inherited from the C stdio library, and hence we follow it.

For Windows, there is no difference between text and binary files, just like in any of the Unix clones. No, I mean it! There are (were) file systems and OSes in which a text file is a completely different beast from an object file, and so on. In some you had to specify the maximum line length in advance, and fixed-size records were used... fossils from the times of 80-column punch cards and such. Luckily, not so in Unices, Windows and Mac.

However - all other things equal - Unix, Windows and Mac historically differ in which characters they use in the output stream to mark the end of a line (or, same thing, as the separator between lines). In Unix, \x0A (\n) is used. In Windows, the sequence of two characters \x0D\x0A (\r\n) is used; on classic Mac OS, just \x0D (\r).

Here are some clues on the origin of those two symbols: ASCII code 10 is called Line Feed (LF) and, when sent to a teletype, would cause it to move down one line (Y++) without changing its horizontal (X) position. Carriage Return (CR), ASCII 13, on the other hand, would cause the printing carriage to return to the beginning of the line (X=0) without scrolling one line down. So when sending output to the printer, both \r and \n had to be sent so that the carriage would move to the beginning of a new line. Now, when typing on a terminal keyboard, operators are naturally expected to press one key, not two, for end of line. On the Apple ][ that was the 'Return' key (\r).

At any rate, this is how things settled. C's creators were concerned about portability: much of Unix was written in C, unlike before, when OSes were written in assembler. They did not want to deal with each platform's quirks about text representation, so they added this evil hack to their I/O library: depending on the platform, the input and output to a file is "patched" on the fly so that the program sees new lines the righteous, Unix way, as '\n', no matter whether it was '\r\n' from Windows or '\r' from Mac. So the developer need not worry about which OS the program runs on; it can still read and write text files in the native format.

There was a problem, however: not all files are text. There are other formats, and they are very sensitive to replacing one character with another. So they thought, we will call those "binary files" and indicate that to fopen() by including 'b' in the mode, and this will flag the library not to do any behind-the-scenes conversion. And that's how it came to be the way it is :)

So to recap: if a file is opened with 'b' (binary mode), no conversions take place. If it is opened in text mode, depending on the platform, some conversion of the newline character(s) may occur, towards the Unix point of view. Naturally, on a Unix platform there is no difference between reading/writing a "text" or a "binary" file.
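
The recap can be tried out with a short sketch, assuming Python 3, where text-mode reads translate all three conventions to '\n' by default on every platform, and `newline=""` switches the translation off without going to binary mode. The sample content and file name are invented for illustration.

```python
import os
import tempfile

# One line in each convention: Unix (\n), Windows (\r\n), classic Mac (\r).
raw_bytes = b"unix\nwindows\r\nold mac\r"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed.txt")  # illustrative file name
    with open(path, "wb") as f:
        f.write(raw_bytes)

    # 'b' in the mode: no conversion at all.
    with open(path, "rb") as f:
        binary_view = f.read()

    # Text mode with the default newline handling: everything becomes "\n".
    with open(path, "r", encoding="ascii") as f:
        text_view = f.read()

    # Text mode with newline="": decoded to str, but line endings left alone.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated_view = f.read()

print(repr(binary_view))
print(repr(text_view))
print(repr(untranslated_view))
```

Only the default text-mode read presents the "Unix point of view"; the other two show the file as it is stored.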

Nas Banov