The previously-accepted answer is WRONG.
u'\ufffe' is not a character. If you get it in a unicode string, somebody has stuffed up mightily.
The BOM (aka ZERO WIDTH NO-BREAK SPACE) is u'\ufeff'
>>> UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
>>> UNICODE_BOM
u'\ufeff'
>>>
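If you want to check the two code points for yourself, here is a minimal sketch (Python 3 syntax, standard library only; the variable names are mine, purely for illustration). It asks `unicodedata` what each code point is, and shows one plausible way U+FFFE can end up in a string: decoding UTF-16 data with the wrong byte order.

```python
import unicodedata

BOM = '\ufeff'          # the real BOM, ZERO WIDTH NO-BREAK SPACE
NOT_A_CHAR = '\ufffe'   # the byte-swapped value, a Unicode noncharacter

bom_name = unicodedata.name(BOM)
# A noncharacter has no name, so supply a default instead of raising ValueError.
swapped_name = unicodedata.name(NOT_A_CHAR, None)
# General category of the noncharacter ('Cn', i.e. not assigned).
swapped_category = unicodedata.category(NOT_A_CHAR)

# Encode the BOM little-endian, then (wrongly) decode it big-endian.
misdecoded = BOM.encode('utf-16-le').decode('utf-16-be')

print(bom_name, swapped_name, swapped_category, misdecoded == NOT_A_CHAR)
```

So if u'\ufffe' turns up at the start of your text, a wrong-endian decode somewhere upstream is a likely suspect.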
Read this (Ctrl-F search for BOM) and this and this (Ctrl-F search for BOM).
Here's a correct and typo/braino-resistant answer:
Decode your input into unicode_str. Then do this:
# If I mistype the following, it's very likely to cause a SyntaxError.
UNICODE_BOM = u'\N{ZERO WIDTH NO-BREAK SPACE}'
if unicode_str and unicode_str[0] == UNICODE_BOM:
    unicode_str = unicode_str[1:]
Bonus: using a named constant gives your readers a bit more of a clue to what is going on than does a collection of seemingly-arbitrary hexoglyphics.
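If you need this in more than one place, the same check can be wrapped in a small helper. This is only a sketch in Python 3 syntax; `strip_bom` is a hypothetical name of mine, not something from the standard library.

```python
UNICODE_BOM = '\N{ZERO WIDTH NO-BREAK SPACE}'

def strip_bom(text):
    """Return text without one leading U+FEFF, if there is one.

    Hypothetical helper: expects an already-decoded string, not bytes.
    """
    if text.startswith(UNICODE_BOM):
        return text[1:]
    return text

print(repr(strip_bom('\ufeffhello')))
```

It removes at most one leading BOM and leaves any U+FEFF elsewhere in the string alone.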
Update: Unfortunately, there seems to be no suitable named constant in the standard Python library.
Alas, the codecs module provides only "a snare and a delusion":
>>> import pprint, codecs
>>> pprint.pprint([(k, getattr(codecs, k)) for k in dir(codecs) if k.startswith('BOM')])
[('BOM', '\xff\xfe'), #### aarrgghh!! ####
 ('BOM32_BE', '\xfe\xff'),
 ('BOM32_LE', '\xff\xfe'),
 ('BOM64_BE', '\x00\x00\xfe\xff'),
 ('BOM64_LE', '\xff\xfe\x00\x00'),
 ('BOM_BE', '\xfe\xff'),
 ('BOM_LE', '\xff\xfe'),
 ('BOM_UTF16', '\xff\xfe'),
 ('BOM_UTF16_BE', '\xfe\xff'),
 ('BOM_UTF16_LE', '\xff\xfe'),
 ('BOM_UTF32', '\xff\xfe\x00\x00'),
 ('BOM_UTF32_BE', '\x00\x00\xfe\xff'),
 ('BOM_UTF32_LE', '\xff\xfe\x00\x00'),
 ('BOM_UTF8', '\xef\xbb\xbf')]
>>>
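One way to read that listing: every entry is an already-encoded byte sequence, not a unicode character, and plain `BOM` is simply whichever UTF-16 byte order the machine happens to use. The sketch below (Python 3 syntax, where these constants are `bytes`; the variable names are mine) checks both points.

```python
import codecs
import sys

# codecs.BOM is an alias for the UTF-16 BOM in the machine's native byte order.
native = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE
bom_is_native = (codecs.BOM == native)

# Each constant is the encoded form of the one character U+FEFF;
# decoding it with the matching explicit-endian codec gives that character back.
decoded = {
    'utf-8': codecs.BOM_UTF8.decode('utf-8'),
    'utf-16-le': codecs.BOM_UTF16_LE.decode('utf-16-le'),
    'utf-16-be': codecs.BOM_UTF16_BE.decode('utf-16-be'),
    'utf-32-le': codecs.BOM_UTF32_LE.decode('utf-32-le'),
    'utf-32-be': codecs.BOM_UTF32_BE.decode('utf-32-be'),
}

print(bom_is_native, all(ch == '\ufeff' for ch in decoded.values()))
```

None of these constants is suitable for comparing against the first character of an already-decoded string.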
Update 2: If you have not yet decoded your input, and wish to check it for a BOM, you need to check for TWO different BOMs for UTF-16 and at least TWO different BOMs for UTF-32. If there were only one way each, you wouldn't need a BOM, would you?
Here, verbatim and unprettified from my own code, is my solution to this:
def check_for_bom(s):
    bom_info = (
        ('\xFF\xFE\x00\x00', 4, 'UTF-32LE'),
        ('\x00\x00\xFE\xFF', 4, 'UTF-32BE'),
        ('\xEF\xBB\xBF', 3, 'UTF-8'),
        ('\xFF\xFE', 2, 'UTF-16LE'),
        ('\xFE\xFF', 2, 'UTF-16BE'),
        )
    for sig, siglen, enc in bom_info:
        if s.startswith(sig):
            return enc, siglen
    return None, 0
The input s should be at least the first 4 bytes of your input. The function returns the encoding that can be used to decode the post-BOM part of your input, plus the length of the BOM (if any).
If you are paranoid, you could allow for another 2 (non-standard) UTF-32 orderings, but Python doesn't supply an encoding for them and I've never heard of an actual occurrence, so I don't bother.
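For anyone on Python 3, where the input is `bytes` rather than `str`, the same idea might look roughly like the sketch below. It is an adaptation under my own assumptions, not the code above: `decode_with_bom` and its `default` parameter are hypothetical names, and the signatures are taken from the `codecs` constants, which are at least safe to use as raw bytes.

```python
import codecs

# Longer signatures must come first: the UTF-32LE BOM starts with the
# same two bytes as the UTF-16LE BOM.
BOM_INFO = (
    (codecs.BOM_UTF32_LE, 'UTF-32LE'),
    (codecs.BOM_UTF32_BE, 'UTF-32BE'),
    (codecs.BOM_UTF8, 'UTF-8'),
    (codecs.BOM_UTF16_LE, 'UTF-16LE'),
    (codecs.BOM_UTF16_BE, 'UTF-16BE'),
)

def decode_with_bom(data, default='UTF-8'):
    """Decode bytes using the BOM if one is present, else `default`.

    Hypothetical helper; the fallback encoding is an assumption.
    """
    for sig, enc in BOM_INFO:
        if data.startswith(sig):
            # Skip the BOM and decode the rest with the matching codec.
            return data[len(sig):].decode(enc)
    return data.decode(default)

print(decode_with_bom(b'\xff\xfeh\x00i\x00'))
```

Note that the ordering makes UTF-16LE text whose first character is U+0000 look like UTF-32LE; that case is rare enough that I would accept it, but it is a judgment call.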