First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to mistake the BOM for content and automatically generates an <html> element to wrap it, which makes the real <html> tag look like an extra one.
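Here's a minimal illustration of the problem (the byte string below is just a stand-in for a real template):

>>> template = '\xef\xbb\xbf<html>'   # UTF-8 bytes with BOM, as stored in the DB
>>> template.decode('utf-8')          # plain 'utf-8' leaves the BOM in place
u'\ufeff<html>'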

So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!

A: 

You can use something like this to remove the BOM:

import os, codecs
def remove_bom_from_file(filename, newfilename):
    if os.path.isfile(filename):
        # open file
        f = open(filename,'rb')

        # read first 4 bytes
        header = f.read(4)

        # check if we have a BOM...
        bom_len = 0
        encodings = [ ( codecs.BOM_UTF32, 4 ),
            ( codecs.BOM_UTF16, 2 ),
            ( codecs.BOM_UTF8, 3 ) ]

        # ... and skip past the appropriate number of bytes
        for h, l in encodings:
            if header.startswith(h):
                bom_len = l
                break
        f.seek(0)
        f.read(bom_len)

        # copy the rest of the file
        contents = f.read()
        f.close()

        nf = open(newfilename, 'wb')  # 'wb': binary write mode, not the default read mode
        nf.write(contents)
        nf.close()
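
Usage would be something like this (filenames are hypothetical):

remove_bom_from_file('template.html', 'template.nobom.html')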
pajton
Hmm, don't you have to rewind the file after reading the first 4 bytes and before testing for BOMs? `f.seek(0)`.
Konrad Rudolph
@Konrad I missed that, thanks for pointing out. This is not production code anyway:].
pajton
Looks good to me (with the `seek(0)` fix), but I've already got the entire file in memory when I'm trying to chop the BOM -- how efficient is contents[2:] (for example) in Python? Does it create a copy of the entire string?
Cameron
I'd use this method if I was stripping the BOM while reading the file, but I'll be stripping the BOM with the file already in memory. Thanks for your reply though!
Cameron
**This answer also has problems.** When reading a file, you need to check for at least FIVE possible BOMs, not three. See my answer.
John Machin
+2  A: 

Check the first character after decoding to see if it's the BOM:

if u.startswith(u'\ufeff'):
  u = u[1:]
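
For example, starting from raw bytes (a quick sketch; the byte string stands in for your stored template data):

>>> raw = '\xef\xbb\xbfHello'   # UTF-8 bytes, BOM included
>>> u = raw.decode('utf-8')     # the 3-byte BOM decodes to the single codepoint U+FEFF
>>> if u.startswith(u'\ufeff'):
...     u = u[1:]
...
>>> u
u'Hello'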
Ignacio Vazquez-Abrams
Will `u'\ufffe'` ever occur at the beginning of a non-UTF-8 file? Wouldn't the BOM take two "characters" (read: bytes) in my case (UTF-8)?
Cameron
`u'\ufffe'` may be found at the beginning of any UTF- or UCS-encoded file. The BOM is three bytes in UTF-8, but it's still a single Unicode codepoint.
Ignacio Vazquez-Abrams
OK, so just to get this straight, I'd need to first decode the byte-content of the file using `u = contents.decode('utf-8')` and then I'd be able to use your method because the BOM is a single codepoint. Correct?
Cameron
That is correct.
Ignacio Vazquez-Abrams
**UTTERLY WRONG!!!** See my answer.
John Machin
@John: Calling getting the numbers mixed around "utterly wrong" is just slightly melodramatic, don't you think?
Ignacio Vazquez-Abrams
@Ignacio: I still think this answer is the best for my circumstances, however I suggest you edit your answer to use u'\ufeff' instead. It seems to be the correct order (when using the Unicode codepoint -- the order of the actual encoded bytes depends on the encoding, which is the whole point of the BOM).
Cameron
Alright, edited.
Ignacio Vazquez-Abrams
@Ignacio: The effect of "getting the numbers mixed around" was to produce not the BOM but the AntiBOM -- utterly wrong, just like confusing Christ and the Antichrist. Before mucking about with ordnance, it's a good idea to read the instructions carefully, cf. the Holy Hand Grenade of Antioch.
John Machin
A: 

The previously-accepted answer is WRONG.

u'\ufffe' is not a character. If you get it in a unicode string, somebody has stuffed up mightily.

The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff':

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>

Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).

Here's a correct and typo/braino-resistant answer:

Decode your input into unicode_str. Then do this:

# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]
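
A quick interpreter check (the input string is hypothetical):

>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> unicode_str = UNICODE_BOM + u'Hello'
>>> if unicode_str and unicode_str[0] == UNICODE_BOM:
...     unicode_str = unicode_str[1:]
...
>>> unicode_str
u'Hello'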

Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.

Update: Unfortunately there seems to be no suitable named constant in the standard Python library.

Alas, the codecs module provides only "a snare and a delusion":

>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'),   #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>

Update 2: If you have not yet decoded your input and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there were only one way each, you wouldn't need a BOM, would you?

Here, verbatim and unprettified from my own code, is my solution to this:

def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF',     3, 'UTF-8'),
        ('\xFF\xFE',         2, 'UTF-16LE'),
        ('\xFE\xFF',         2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0

The input s should be at least the first 4 bytes of your input. It returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).
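
For example, to strip the BOM and decode in one go (a sketch; `raw` is assumed to hold your undecoded byte string):

enc, bom_len = check_for_bom(raw)
if enc is None:
    enc = 'UTF-8'  # no BOM found; fall back to whatever default suits your data
unicode_str = raw[bom_len:].decode(enc)

Note that I believe the UTF-32 codecs were only added in Python 2.6, so on 2.5 the UTF-32 cases would need handling by other means.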

If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.

John Machin
I fail to see how "ZERO WIDTH NO-BREAK SPACE", used here because it is also the BOM (pun intended), is any more legible than u"\uFEFF". They both require prior knowledge about the BOM to be understood.
Cameron
@Cameron: The legibility comes from giving whatever constant you use a name e.g. UNICODE_BOM.
John Machin
@Cameron: I know nothing about the BOM, but I have a sense of what a "zero width no-break space" is, and no idea what a u"\uFEFF" is. The latter is also harder to be sure I've typed correctly, since its 8 characters contain only 3 distinct alphanumeric characters, two of which closely resemble each other.
Vicki Laidler
@Vicki: In this context, the "zero width no-break space" is not being used to represent a zero width no-break space at all (its purpose is completely different -- look up BOM if you're curious), which is why I find it equally unhelpful to use it by name instead of by codepoint. @John: You're right, it's a good idea to use a symbolic name (like a constant) instead of the codepoint directly.
Cameron
@Cameron: The point of using the \N constant for ZWNBSP is that if you accidentally "mess up the order" you will get a SyntaxError immediately. The original purpose of ZWNBSP is now deprecated; "The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM". Unfortunately the unicodedata file doesn't include a mapping that would allow anything like `u"\N{BYTE ORDER MARK}"`.
John Machin
+2  A: 

Since you state:

All of my (text) files are currently stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
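
For example, bytes without a BOM decode unchanged:

>>> u'Hello, world!'.encode('utf-8').decode('utf-8-sig')
u'Hello, world!'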

Mark Tolonen
Ooh! Very nice! I'll try this as soon as I can.
Cameron
Works beautifully (although Chrome mysteriously stopped giving the error no matter what, even with my old (wrong) code -- that's what I get for doing a whole bunch of changes at once).
Cameron