I have a file which contains (I believe) latin-1-encoded text.

However, I cannot match regexes against this file.

If I cat the file, it looks fine. However, I cannot find the string:

In [12]: txt = open("b").read()

In [13]: print txt
  <Vw_IncidentPipeline_Report>


In [14]: txt
Out[14]: '\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'

In [22]: txt.find("Vw_IncidentPipeline_Report")
Out[22]: -1

In [23]: txt.decode("latin-1")
Out[23]: u'\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'

In [25]: txt.decode("utf-16le")
Out[25]: u'\u2000\u2000\u3c00\u5600\u7700\u5f00\u4900\u6e00\u6300\u6900\u6400\u6500\u6e00\u7400\u5000\u6900\u7000\u6500\u6c00\u6900\u6e00\u6500\u5f00\u5200\u6500\u7000\u6f00\u7200\u7400\u3e00\u0d00\u0a00'

How do I successfully decode the string, so I can find strings within it?

A: 

You can try the chardet module to see whether your guess concerning the encoding is right.

Mermoz
YOU obviously haven't tried it. `chardet` is documented to work with UTF-16xE with a BOM, not otherwise. Here's the result of trying it:

>>> chardet.detect(txt)
{'confidence': 1.0, 'encoding': 'ascii'}
John Machin
What is a "BOM"?
Joseph Turian
Byte Order Mark: Unicode can be encoded as 16-bit or 32-bit integers, so you have to tell which encoding is used.
Mermoz
@Mermoz: It's a BYTE ORDER Mark, not a CODE SIZE Mark. The primary intention is to mark whether the integers are represented in big-endian or little-endian order. See `http://en.wikipedia.org/wiki/Byte_order_mark`
John Machin
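To make the BOM discussion above concrete, here is a minimal stdlib-only sketch (Python 3 syntax; `guess_utf16_by_bom` is a hypothetical helper name, not part of any library) of what BOM-based detection looks like. The sample bytes in the question carry no BOM, which is exactly why BOM-based UTF-16 detection cannot help here:

```python
import codecs

# A minimal sketch of BOM-based byte-order detection using only the stdlib.
# The question's data has no BOM, so this approach returns None for it and
# the byte order must be guessed some other way.
def guess_utf16_by_bom(data):
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None  # no BOM present

print(guess_utf16_by_bom(codecs.BOM_UTF16_BE + b"\x00A"))  # utf-16-be
print(guess_utf16_by_bom(b"\x00A\x00B"))                   # None
```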
A: 

Could be UTF-8. What's your regex?

Jerome
Nah, not with all the `\x00`s, couldn't possibly be utf-8. Never saw a clearer UTF-16 big endian encoding, as per my answer.
Alex Martelli
Technically, the data *is* valid UTF-8. But who writes files with alternating U+0000 and ASCII characters?
dan04
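dan04's aside can be checked directly (Python 3 syntax, a quick sketch rather than anything definitive): because every byte in the sample is in the ASCII range, the data decodes as UTF-8 without error, it just yields NUL-interleaved text rather than anything useful.

```python
# Every byte here is <= 0x7F, so this is (technically) valid UTF-8.
raw = b'\x00 \x00 \x00<\x00V\x00w'
as_utf8 = raw.decode("utf-8")  # succeeds, but the NUL bytes survive as U+0000
print(repr(as_utf8))           # '\x00 \x00 \x00<\x00V\x00w'
```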
+3  A: 

It's not Latin-1, it's UTF-16 big-endian:

>>> txt = '\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'
>>> txt.decode("utf-16be")
u'  <Vw_IncidentPipeline_Report>\r\n'

so, just decode that way and live happily ever after;-).

Alex Martelli
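In Python 3 syntax, the same fix also resolves the original problem from the question: once the bytes are decoded as UTF-16BE, the `find` that returned -1 succeeds (a sketch; the byte string is the sample from the question).

```python
# Decode the question's raw bytes as UTF-16 big-endian, then search.
raw = (b'\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e'
       b'\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R'
       b'\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n')
text = raw.decode("utf-16-be")
print(repr(text))                                  # '  <Vw_IncidentPipeline_Report>\r\n'
print(text.find("Vw_IncidentPipeline_Report"))     # 3, not -1
```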
Don't you mean "It's not Latin-1"?
Bob
Actually, I think it's utf-16le. iconv with utf-16be gave Japanese.
Joseph Turian
@Joseph, did you perhaps use iconv with the python escape codes in the file? If you are using iconv, you need to replace the `\x00` with `NUL` bytes
gnibbler
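gnibbler's point about `NUL` bytes versus escape codes is easy to trip over, so here is a small sketch (Python 3 syntax) of the difference: a file containing the literal four characters `\x00` is not the same as a file containing an actual NUL byte, and feeding the former to iconv will not decode.

```python
# The repr() shown in the question prints "\x00", but that is display
# notation: the file must contain a real NUL byte for iconv to work.
literal = rb'\x00V'   # 5 bytes: backslash, 'x', '0', '0', 'V'
real    = b'\x00V'    # 2 bytes: NUL, then 'V'
print(len(literal), len(real))          # 5 2
print(real.decode("utf-16-be"))         # 'V'
```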
@Bob, yep, I did mean Latin-1, tx, +1. @Joseph, you're wrong: it decodes fine with BE, doesn't with LE (as you already showed), so why would you think otherwise? If you want to use iconv instead of Python, why are you tagging your question "python", BTW?
Alex Martelli
+1  A: 

You have the wrong encoding. Try `txt.decode("UTF-16BE")`

Let's check with iconv...

>>> txt='\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'
>>> open("txt","w").write(txt)
>>> exit()
$ iconv -f utf-16be txt
  <Vw_IncidentPipeline_Report>

Nope, no Japanese there.

gnibbler
Actually, I think it's utf-16le. iconv with utf-16be gave Japanese.
Joseph Turian
@Joseph, iconv works fine for me. Can you show us what you did to get Japanese?
gnibbler
A: 

Actually, it was UTF-16LE, so I used:

iconv -f 'UTF-16LE//' -t utf-8 -c
Joseph Turian
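For completeness, a rough Python equivalent of that iconv invocation (a sketch under the assumption that iconv's `-c` flag, which discards unconvertible sequences, corresponds approximately to Python's `errors="ignore"`):

```python
# Decode as UTF-16LE and silently drop anything that will not decode,
# roughly what `iconv -f UTF-16LE -t utf-8 -c` does.
raw = b'<\x00o\x00k\x00>\x00\xff'   # valid UTF-16LE text plus one stray trailing byte
text = raw.decode("utf-16-le", errors="ignore")
print(text)  # '<ok>' -- the truncated final byte is discarded
```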