I have a file which contains (I believe) latin-1-encoded text.

However, I cannot match regexes against this file.

If I cat the file, it looks fine. However, I cannot find the string:

In [12]: txt = open("b").read()

In [13]: print txt
  <Vw_IncidentPipeline_Report>


In [14]: txt
Out[14]: '\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'

In [22]: txt.find("Vw_IncidentPipeline_Report")
Out[22]: -1

In [23]: txt.decode("latin-1")
Out[23]: u'\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'

In [25]: txt.decode("utf-16le")
Out[25]: u'\u2000\u2000\u3c00\u5600\u7700\u5f00\u4900\u6e00\u6300\u6900\u6400\u6500\u6e00\u7400\u5000\u6900\u7000\u6500\u6c00\u6900\u6e00\u6500\u5f00\u5200\u6500\u7000\u6f00\u7200\u7400\u3e00\u0d00\u0a00'

How do I successfully decode the string, so I can find strings within it?

A: 

You can try the chardet module to see whether your guess concerning the encoding is right.

Mermoz
YOU obviously haven't tried it. `chardet` is documented to work with UTF-16xE with a BOM, not otherwise. Here's the result of trying it:

>>> chardet.detect(txt)
{'confidence': 1.0, 'encoding': 'ascii'}
John Machin
What is a "BOM"?
Joseph Turian
Byte Order Mark: Unicode can be encoded as 16-bit or 32-bit integers, so you have to tell which encoding is used.
Mermoz
@Mermoz: It's a BYTE ORDER Mark, not a CODE SIZE Mark. The primary intention is to mark whether the integers are represented in big-endian or little-endian order. See `http://en.wikipedia.org/wiki/Byte_order_mark`
John Machin
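To make the BOM discussion above concrete, here is a minimal stdlib-only sketch (Python 3 syntax; `guess_utf16_by_bom` is a hypothetical helper name, not part of any library) of what BOM-based detection looks like. The sample bytes in the question carry no BOM, which is exactly why BOM-based UTF-16 detection cannot help here:

```python
import codecs

# A minimal sketch of BOM-based byte-order detection using only the stdlib.
# The question's data has no BOM, so this approach returns None for it and
# the byte order must be guessed some other way.
def guess_utf16_by_bom(data):
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None  # no BOM present

print(guess_utf16_by_bom(codecs.BOM_UTF16_BE + b"\x00A"))  # utf-16-be
print(guess_utf16_by_bom(b"\x00A\x00B"))                   # None
```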
A: 

Could be UTF-8. What's your regex?

Jerome
Nah, not with all the `\x00`s, couldn't possibly be utf-8. Never saw a clearer UTF-16 big endian encoding, as per my answer.
Alex Martelli
Technically, the data *is* valid UTF-8. But who writes files with alternating U+0000 and ASCII characters?
dan04
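dan04's aside can be checked directly (Python 3 syntax, a quick sketch rather than anything definitive): because every byte in the sample is in the ASCII range, the data decodes as UTF-8 without error, it just yields NUL-interleaved text rather than anything useful.

```python
# Every byte here is <= 0x7F, so this is (technically) valid UTF-8.
raw = b'\x00 \x00 \x00<\x00V\x00w'
as_utf8 = raw.decode("utf-8")  # succeeds, but the NUL bytes survive as U+0000
print(repr(as_utf8))           # '\x00 \x00 \x00<\x00V\x00w'
```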
+3  A: 

It's not Latin-1, it's UTF-16 big-endian:

>>> txt = '\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'
>>> txt.decode("utf-16be")
u'  <Vw_IncidentPipeline_Report>\r\n'

so, just decode that way and live happily ever after;-).

Alex Martelli
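In Python 3 syntax, the same fix also resolves the original problem from the question: once the bytes are decoded as UTF-16BE, the `find` that returned -1 succeeds (a sketch; the byte string is the sample from the question).

```python
# Decode the question's raw bytes as UTF-16 big-endian, then search.
raw = (b'\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e'
       b'\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R'
       b'\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n')
text = raw.decode("utf-16-be")
print(repr(text))                                  # '  <Vw_IncidentPipeline_Report>\r\n'
print(text.find("Vw_IncidentPipeline_Report"))     # 3, not -1
```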
Don't you mean "It's not Latin-1"?
Bob
Actually, I think it's utf-16le. iconv with utf-16be gave Japanese.
Joseph Turian
@Joseph, did you perhaps use iconv with the python escape codes in the file? If you are using iconv, you need to replace the `\x00` with `NUL` bytes
gnibbler
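gnibbler's point about `NUL` bytes versus escape codes is easy to trip over, so here is a small sketch (Python 3 syntax) of the difference: a file containing the literal four characters `\x00` is not the same as a file containing an actual NUL byte, and feeding the former to iconv will not decode.

```python
# The repr() shown in the question prints "\x00", but that is display
# notation: the file must contain a real NUL byte for iconv to work.
literal = rb'\x00V'   # 5 bytes: backslash, 'x', '0', '0', 'V'
real    = b'\x00V'    # 2 bytes: NUL, then 'V'
print(len(literal), len(real))          # 5 2
print(real.decode("utf-16-be"))         # 'V'
```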
@Bob, yep, I did mean Latin-1, tx, +1. @Joseph, you're wrong: it decodes fine with BE, doesn't with LE (as you already showed), so why would you think otherwise? If you want to use iconv instead of Python, why are you tagging your question "python", BTW?
Alex Martelli
+1  A: 

You have the wrong encoding. Try `txt.decode("UTF-16BE")`

Let's check with iconv...

>>> txt='\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x00c\x00i\x00d\x00e\x00n\x00t\x00P\x00i\x00p\x00e\x00l\x00i\x00n\x00e\x00_\x00R\x00e\x00p\x00o\x00r\x00t\x00>\x00\r\x00\n'
>>> open("txt","w").write(txt)
>>> exit()
$ iconv -f utf-16be txt
  <Vw_IncidentPipeline_Report>

Nope, no Japanese there.

gnibbler
Actually, I think it's utf-16le. iconv with utf-16be gave Japanese.
Joseph Turian
@Joseph, iconv works fine for me. Can you show us what you did to get Japanese?
gnibbler
A: 

Actually, it was UTF-16LE, so I used:

iconv -f 'UTF-16LE//' -t utf-8 -c
Joseph Turian
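For completeness, a rough Python equivalent of that iconv invocation (a sketch under the assumption that iconv's `-c` flag, which discards unconvertible sequences, corresponds approximately to Python's `errors="ignore"`):

```python
# Decode as UTF-16LE and silently drop anything that will not decode,
# roughly what `iconv -f UTF-16LE -t utf-8 -c` does.
raw = b'<\x00o\x00k\x00>\x00\xff'   # valid UTF-16LE text plus one stray trailing byte
text = raw.decode("utf-16-le", errors="ignore")
print(text)  # '<ok>' -- the truncated final byte is discarded
```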