First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to mistake the BOM for content and automatically generates an <html> element to wrap it, which makes the real <html> tag look like an extra one.
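Here's a minimal illustration of the problem (the byte string below is just a stand-in for a real template):

>>> template = '\xef\xbb\xbf<html>'   # UTF-8 bytes with BOM, as stored in the DB
>>> template.decode('utf-8')          # plain 'utf-8' leaves the BOM in place
u'\ufeff<html>'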

So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!

A: 

You can use something like this to remove the BOM:

import os, codecs
def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open file
        f = open(filename,'rb')

        # read first 4 bytes
        header = f.read(4)

        # check if we have a BOM...
        bom_len = 0
        encodings = [ ( codecs.BOM_UTF32, 4 ),
            ( codecs.BOM_UTF16, 2 ),
            ( codecs.BOM_UTF8, 3 ) ]

        # ... and skip past the appropriate number of bytes
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        f.seek(0)
        f.read(bom_len)

        # copy the rest of the file
        contents = f.read()
        f.close()

        nf = open(newfilename, 'wb')  # 'wb': binary write mode, not the default read mode
        nf.write(contents)
        nf.close()
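
Usage would be something like this (filenames are hypothetical):

remove_bom_from_file('template.html', 'template.nobom.html')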
pajton
Hmm, don't you have to rewind the file after reading the first 4 bytes and before testing for BOMs? `f.seek(0)`.
Konrad Rudolph
@Konrad I missed that, thanks for pointing out. This is not production code anyway:].
pajton
Looks good to me (with the `seek(0)` fix), but I've already got the entire file in memory when I'm trying to chop the BOM -- how efficient is contents[2:] (for example) in Python? Does it create a copy of the entire string?
Cameron
I'd use this method if I was stripping the BOM while reading the file, but I'll be stripping the BOM with the file already in memory. Thanks for your reply though!
Cameron
**This answer also has problems.** When reading a file, you need to check for at least FIVE possible BOMs, not three. See my answer.
John Machin
+2  A: 

Check the first character after decoding to see if it's the BOM:

if u.startswith(u'\ufeff'):
  u = u[1:]
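
For example, starting from raw bytes (a quick sketch; the byte string stands in for your stored template data):

>>> raw = '\xef\xbb\xbfHello'   # UTF-8 bytes, BOM included
>>> u = raw.decode('utf-8')     # the 3-byte BOM decodes to the single codepoint U+FEFF
>>> if u.startswith(u'\ufeff'):
...     u = u[1:]
...
>>> u
u'Hello'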
Ignacio Vazquez-Abrams
Will `u'\ufffe'` ever occur at the beginning of a non-UTF-8 file? Wouldn't the BOM take two "characters" (read: bytes) in my case (UTF-8)?
Cameron
`u'\ufffe'` may be found at the beginning of any UTF- or UCS-encoded file. The BOM is three bytes in UTF-8, but it's still a single Unicode codepoint.
Ignacio Vazquez-Abrams
OK, so just to get this straight, I'd need to first decode the byte-content of the file using `u = contents.decode('utf-8')` and then I'd be able to use your method because the BOM is a single codepoint. Correct?
Cameron
That is correct.
Ignacio Vazquez-Abrams
**UTTERLY WRONG!!!** See my answer.
John Machin
@John: Calling getting the numbers mixed around "utterly wrong" is just slightly melodramatic, don't you think?
Ignacio Vazquez-Abrams
@Ignacio: I still think this answer is the best for my circumstances, however I suggest you edit your answer to use u'\ufeff' instead. It seems to be the correct order (when using the Unicode codepoint -- the order of the actual encoded bytes depends on the encoding, which is the whole point of the BOM).
Cameron
Alright, edited.
Ignacio Vazquez-Abrams
@Ignacio: The effect of "getting the numbers mixed around" was to produce not the BOM but the AntiBOM -- utterly wrong, just like confusing Christ and the Antichrist. Before mucking about with ordnance, it's a good idea to read the instructions carefully, cf. the Holy Hand Grenade of Antioch.
John Machin
A: 

The previously-accepted answer is WRONG.

u'\ufffe' is not a character. If you get it in a unicode string, somebody has stuffed up mightily.

The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff':

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

Here's a correct and typo/braino-resistant answer:

Decode your input into unicode_str. Then do this:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]
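
A quick interpreter check (the input string is hypothetical):

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> unicode_str = UNICODE_BOM + u'Hello'
>>> if unicode_str and unicode_str[0] == UNICODE_BOM:
...     unicode_str = unicode_str[1:]
...
>>> unicode_str
u'Hello'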

Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.

Update: Unfortunately there seems to be no suitable named constant in the standard Python library.

Alas, the codecs module provides only "a snare and a delusion":

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>

Update 2: If you have not yet decoded your input and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there were only one way each, you wouldn't need a BOM, would you?

Here, verbatim and unprettified from my own code, is my solution to this:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).
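
For example, to strip the BOM and decode in one go (a sketch; `raw` is assumed to hold your undecoded byte string):

enc, bom_len = check_for_bom(raw)
if enc is None:
    enc = 'UTF-8'  # no BOM found; fall back to whatever default suits your data
unicode_str = raw[bom_len:].decode(enc)

Note that I believe the UTF-32 codecs were only added in Python 2.6, so on 2.5 the UTF-32 cases would need handling by other means.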

If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.

John Machin
I fail to see how "ZERO WIDTH NO-BREAK SPACE", used here because it is also the BOM (pun intended), is any more legible than u"\uFEFF". They both require prior knowledge about the BOM to be understood.
Cameron
@Cameron: The legibility comes from giving whatever constant you use a name e.g. UNICODE_BOM.
John Machin
@Cameron: I know nothing about the BOM, but I have a sense of what a "zero width no-break space" is, and no idea what a u"\uFEFF" is. The latter is also harder to be sure I've typed correctly, since its 8 characters contain only 3 distinct alphanumeric characters, two of which closely resemble each other.
Vicki Laidler
@Vicki: In this context, the "zero width no-break space" is not being used to represent a zero width no-break space at all (its purpose is completely different -- look up BOM if you're curious), which is why I find it equally unhelpful to use it by name instead of by codepoint. @John: You're right, it's a good idea to use a symbolic name (like a constant) instead of the codepoint directly.
Cameron
@Cameron: The point of using the \N constant for ZWNBSP is that if you accidentally "mess up the order" you will get a SyntaxError immediately. The original purpose of ZWNBSP is now deprecated; "The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM". Unfortunately the unicodedata file doesn't include a mapping that would allow anything like `u"\N{BYTE ORDER MARK}"`.
John Machin
+2  A: 

Since you state:

All of my (text) files are currently stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
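
For example, bytes without a BOM decode unchanged:

>>> u'Hello, world!'.encode('utf-8').decode('utf-8-sig')
u'Hello, world!'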

Mark Tolonen
Ooh! Very nice! I'll try this as soon as I can.
Cameron
Works beautifully (although Chrome mysteriously stopped giving the error no matter what, even with my old (wrong) code -- that's what I get for doing a whole bunch of changes at once).
Cameron