ansaurus

Question

Dealing with Windows line-endings in Python

Answer 1

A:

What are you trying to do with this file? Whitespace between tags is usually ignored in XML, so the only place where line endings matter is inside the tags' content.

Alex Lebedev 2010-04-26 21:46:39

This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>. I would much prefer they fix it on their end, but I'm kind of in a rush and just want to strip this stuff.

Adam Nelson 2010-04-26 21:49:34

If that's the case it's broken in any case -- a \n instead of \r\n wouldn't make a difference.

Thomas Wouters 2010-04-26 21:51:30

I was thinking of stripping all newlines of any kind but now I realize that it won't work because some of the blocks have valid newlines that are part of the actual data.

Adam Nelson 2010-04-26 22:24:06

Answer 2

+2 A:

Why are the DOS line-endings a problem? Most things can deal with them just fine, including XML parsers. If you really want to get rid of them, open the file in universal line-endings mode:

open(filename, 'rU')

Python will convert all line-endings to UNIX line-endings for you. If you really can't use that (which I find a little surprising), there's no way to get Python to do the work for you. You will have to open the file regardless, though, so your objection to #2 seems a little odd.
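As a rough sketch of the same idea on Python 3 (an assumption, since the thread itself is about Python 2): there the 'U' flag is no longer needed, and as far as I know it was removed in 3.11, because text mode translates line endings by default via the newline parameter of open(). The helper names and the demo file below are made up for illustration.

```python
import os
import tempfile


def read_normalized(path):
    # newline=None (the default) translates \r\n and \r to \n on read.
    with open(path, "r", newline=None, encoding="utf-8") as f:
        return f.read()


def read_raw(path):
    # newline="" leaves the original line endings untouched.
    with open(path, "r", newline="", encoding="utf-8") as f:
        return f.read()


# Demo file with mixed DOS, old Mac and UNIX line endings.
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<a>\r\n<b>\r</b>\n</a>")

normalized = read_normalized(demo_path)
raw = read_raw(demo_path)
os.remove(demo_path)

print(repr(normalized))
print(repr(raw))
```

Printing with repr() makes the difference between the two reads visible.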

Thomas Wouters 2010-04-26 21:47:52

huh, just (re-)read the docs- never knew that U was "required" to turn on universal newline support... Most of my work is on Windows and unix newlines are (thankfully) handled gracefully...

dash-tom-bang 2010-04-26 22:12:54

The text-mode reading on Windows, where the MS C runtime will convert line-endings for you, is not the same as Python's universal line-endings support. Universal line-endings are the same on all operating systems. The Windows text-mode thing is specific to Windows (and also affects other things, like the EOF character causing a premature EOF.)

Thomas Wouters 2010-04-26 22:21:13

Universal new lines aren't available for my system.

Adam Nelson 2010-04-26 22:24:56

Answer 3

+1 A:

Are you opening the file in text mode or binary mode? I'm pretty sure I've counted on universal newlines on my Leopard install, but maybe I got an updated Python from somewhere too...

Anyway- I've seen this sort of thing biting many programmers in the bum, because they just reach for the 'b' key. Use a 't' if you're opening text files known to be created on your platform, 'U' instead of 't' if you need universal newlines.

with open(filename, 'rt') as f:
    content = f.read()

Edit: The comments note that 'rt' is the default. Fair point, but Python style tends to prefer explicit over implicit, so I'm going with that.
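For comparison, here is a minimal sketch of what binary mode does and how the conversion could be done by hand if universal newlines are not an option. It assumes Python 3 (bytes literals); normalize_crlf and the demo file are hypothetical names, not anything from the question.

```python
import os
import tempfile


def normalize_crlf(data):
    # Bytes in, bytes out: turn every DOS line ending into a UNIX one.
    return data.replace(b"\r\n", b"\n")


# Write a small file with DOS line endings, then read it back in binary mode.
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<a>\r\nline one\r\nline two\n</a>")

with open(demo_path, "rb") as f:
    untouched = f.read()  # 'b' performs no newline translation at all
os.remove(demo_path)

converted = normalize_crlf(untouched)
print(repr(untouched))
print(repr(converted))
```

This only converts \r\n to \n. It does not remove line breaks, so newlines that are part of the data survive.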

dash-tom-bang 2010-04-26 21:55:40

Closest to an ok answer. I need a better file I now realize.

Adam Nelson 2010-04-26 22:25:22

This is the first time I've heard of `'t'` not being the default mode everywhere. Can you elaborate on this? Is `'rt'` really different from `'r'`? Is the default really `'b'`, or is there a third mode?

Thomas Wouters 2010-04-26 22:59:01

I looked at the docs after posting. I suspect that 't' is default based on what I saw, but I feel that explicit is better than implicit. :) Anyway- I've also seen a lot of people just throw a 'b' in there by default, even when dealing with text files. The mind boggles, but it's something that happens, so I asked. ;)

dash-tom-bang 2010-04-26 23:48:07

Thomas is being too gentle. 'rt' is the same as 'r'.

John Machin 2010-04-26 23:49:27

@dash-tom-bang: s/probably a bit wrong/definitely wrong/ ... do yourself a favour and delete your answer.

John Machin 2010-04-27 00:14:12

Would've if it weren't accepted and undeletable. Edited instead. I similarly put parentheses around all but the most trivial of expressions to make order of operation unambiguous to the casual reader. We could get into a philosophical discussion, were this the appropriate forum. ;)

dash-tom-bang 2010-04-27 00:24:21

Answer 4

+1 A:

Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>""".

I see no \r\n here. Perhaps you mean repr(xml) contains things like

"<ParentRedirec\r\ntSequenceID>"

If not, try to say precisely what you mean, with repr-fashion examples.

The following should work:

>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'
>>>

If there is more than one line break in a tag, e.g. <foo\r\nbar\r\nzot>, this will fix only one of them per pass (the last one, because [^>]* is greedy). Alternatives: (1) loop until the guff stops shrinking, or (2) write a smarter regexp yourself :-)
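One possible way to handle tags with several breaks in a single pass is to match each whole tag and clean it inside a replacement function. This is only a sketch: strip_breaks_in_tags is a made-up name, and it assumes that no ">" appears inside attribute values, comments or CDATA sections.

```python
import re

# Anything from "<" up to the next ">" is treated as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def strip_breaks_in_tags(xml):
    # Remove every \r\n inside a tag; text between tags is left alone.
    return TAG_RE.sub(lambda m: m.group(0).replace("\r\n", ""), xml)


guff = '<atag>\r\n<foo\r\nbar\r\nzot c="2">text\r\nmore</atag>'
fixed = strip_breaks_in_tags(guff)
print(repr(fixed))
```

Line breaks in the data between tags are preserved. Note that a break sitting between a tag name and an attribute would glue the two together, so check the output with repr() before trusting it.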

John Machin 2010-04-27 00:10:53

You are correct; the comment system stripped out the newlines. The tags look like: "<ParentRedirec\r\ntSequenceID>"

Adam Nelson 2010-04-27 14:11:36
