I've got a 700MB XML file coming from a Windows provider.

As one might expect, the line endings are '\r\n' (or ^M in vi). What is the most efficient way to deal with this situation, aside from getting the supplier to send over '\n'? :-)

  1. Use os.linesep
  2. Use rstrip() (requiring opening the file ... which seems crazy)
  3. Use universal newline support (it isn't standard on my Mac running Snow Leopard, so it isn't an option)

I'm open to anything that requires Python 2.6+ but it needs to work on Snow Leopard and Ubuntu 9.10 with minimal external requirements. I don't mind a small performance penalty but I am looking for the standard best way to deal with this.

----edit----

The line endings are in the middle of the tag descriptors, otherwise they wouldn't be such a problem. I know this is bad form and that they shouldn't be sending this to me, but this is how I have the file and the vendor is mostly incompetent.

A: 

What are you trying to do with this file? Whitespace between tags is usually ignored in XML, so the only place where line endings matter is in the tags' content.

Alex Lebedev
This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>. I would much prefer they fix it on their end, but I'm kind of in a rush and just want to strip this stuff.
Adam Nelson
If that's the case it's broken in any case -- a \n instead of \r\n wouldn't make a difference.
Thomas Wouters
I was thinking of stripping all newlines of any kind but now I realize that it won't work because some of the blocks have valid newlines that are part of the actual data.
Adam Nelson
+2  A: 

Why are the DOS line-endings a problem? Most things can deal with them just fine, including XML parsers. If you really want to get rid of them, open the file in universal line-endings mode:

open(filename, 'rU')

Python will convert all line-endings to UNIX line-endings for you. If you really can't use that (which I find a little surprising), there's no way to get Python to do the work for you. You will have to open the file regardless, though, so your objection to #2 seems a little odd.
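For what it's worth, here is a minimal sketch of the universal-newline approach. It uses io.open with newline=None instead of the 'rU' flag shown above, on the assumption that this behaves the same on Python 2.6+ and Python 3. The helper name read_normalized and the sample file are made up for illustration.

```python
import io
import os
import tempfile


def read_normalized(path, encoding="utf-8"):
    """Read a text file with CRLF and CR line endings translated to LF."""
    # newline=None turns on universal-newline translation when reading.
    with io.open(path, "r", encoding=encoding, newline=None) as f:
        return f.read()


# Hypothetical sample data standing in for the vendor's file.
fd, sample_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as out:
    out.write(b"<root>\r\n<item>1</item>\r\n</root>\r\n")

text = read_normalized(sample_path)
os.remove(sample_path)
print(repr(text))
```

Note that this only normalizes the line endings. It does not repair a tag name that has a line break in the middle of it.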

Thomas Wouters
Huh, just (re-)read the docs. I never knew that U was "required" to turn on universal newline support... Most of my work is on Windows and unix newlines are (thankfully) handled gracefully...
dash-tom-bang
The text-mode reading on Windows, where the MS C runtime will convert line-endings for you, is not the same as Python's universal line-endings support. Universal line-endings are the same on all operating systems. The Windows text-mode thing is specific to Windows (and also affects other things, like the EOF character causing a premature EOF.)
Thomas Wouters
Universal new lines aren't available for my system.
Adam Nelson
+1  A: 

Are you opening the file in text mode or binary mode? I'm pretty sure I've counted on universal newlines on my Leopard install, but maybe I got an updated Python from somewhere too...

Anyway- I've seen this sort of thing biting many programmers in the bum, because they just reach for the 'b' key. Use a 't' if you're opening text files known to be created on your platform, 'U' instead of 't' if you need universal newlines.

with open(filename, 'rt') as f:
    content = f.read()

Edit: The comments note that 'rt' is the default. Fair point, but Python style tends to prefer explicit over implicit, so I'm going with that.
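For anyone who wants to check this on their own machine, a quick sketch that reads the same bytes with 'r', 'rt' and 'rb' and compares the results. The helper read_with_mode and the sample data are hypothetical. Whether the text-mode reads translate '\r\n' depends on the Python version and platform, so the sketch only compares the two text modes with each other.

```python
import os
import tempfile


def read_with_mode(path, mode):
    """Return the whole file as read with the given open() mode."""
    with open(path, mode) as f:
        return f.read()


# Hypothetical sample file with DOS line endings.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"one\r\ntwo\r\n")

default_text = read_with_mode(sample_path, "r")    # mode left implicit
explicit_text = read_with_mode(sample_path, "rt")  # text mode spelled out
raw_bytes = read_with_mode(sample_path, "rb")      # binary: no translation
os.remove(sample_path)

print(default_text == explicit_text)
print(repr(raw_bytes))
```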

dash-tom-bang
Closest to an ok answer. I need a better file I now realize.
Adam Nelson
This is the first time I've heard of `'t'` not being the default mode everywhere. Can you elaborate on this? Is `'rt'` really different from `'r'`? Is the default really `'b'`, or is there a third mode?
Thomas Wouters
I looked at the docs after posting. I suspect that 't' is default based on what I saw, but I feel that explicit is better than implicit. :) Anyway- I've also seen a lot of people just throw a 'b' in there by default, even when dealing with text files. The mind boggles, but it's something that happens, so I asked. ;)
dash-tom-bang
Thomas is being too gentle. 'rt' is the same as 'r'.
John Machin
@dash-tom-bang: s/probably a bit wrong/definitely wrong/ ... do yourself a favour and delete your answer.
John Machin
Would've if it weren't accepted and undeletable. Edited instead. I similarly put parentheses around all but the most trivial of expressions to make order of operation unambiguous to the casual reader. We could get into a philosophical discussion, were this the appropriate forum. ;)
dash-tom-bang
+1  A: 

Allegedly: """This guy has \r\n right in the middle of tag descriptors like so: <ParentRedirec tSequenceID>""".

I see no \r\n here. Perhaps you mean repr(xml) contains things like

"<ParentRedirec\r\ntSequenceID>"

If not, try to say precisely what you mean, with repr-fashion examples.

The following should work:

>>> import re
>>> guff = """<atag>\r\n<bt\r\nag c="2">"""
>>> re.sub(r"(<[^>]*)\r\n([^>]*>)", r"\1\2", guff)
'<atag>\r\n<btag c="2">'
>>>

If there is more than one line break in a tag, e.g. <foo\r\nbar\r\nzot>, a single pass will fix only one of them (the last, because the first group is greedy). Alternatives: (1) loop until the guff stops shrinking; (2) write a smarter regexp yourself :-)
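One possible take on alternative (2), as a sketch only: match each whole tag and strip every '\r\n' inside it with a replacement function, so that any number of breaks in one tag is handled in a single pass. The name strip_breaks_in_tags is made up here. It assumes no '>' appears inside attribute values, comments or CDATA, and, like the regexp above, it will also glue together a tag name and an attribute that were separated only by a line break.

```python
import re

# One tag, from "<" up to the next ">".
# Assumption: no ">" inside attribute values, comments or CDATA sections.
TAG_RE = re.compile(r"<[^>]*>")


def strip_breaks_in_tags(xml):
    """Remove every CRLF found inside a tag; leave other line breaks alone."""
    return TAG_RE.sub(lambda m: m.group(0).replace("\r\n", ""), xml)


guff = '<atag>\r\n<bt\r\nag c="2">'
print(repr(strip_breaks_in_tags(guff)))
print(repr(strip_breaks_in_tags("<foo\r\nbar\r\nzot>")))
```

Line breaks that sit between tags, including those that are part of the data, are left untouched because they fall outside any match.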

John Machin
You are correct, the comment system stripped out the newlines, the tags are like: "<ParentRedirec\r\ntSequenceID>"
Adam Nelson