tags:
views: 164
answers: 3

Hey guys, I came across a problem I can't solve on my own, concerning the downloadable CSV-formatted trends data files from Google Insights for Search.

I'm too lazy to reformat the files I4S gives me manually, which means: extracting the section with the actual trends data and reformatting the columns so that I can use them with a modelling program I'm doing for school.

So I wrote a tiny script that should do the work for me: take a file, do some magic, and give me a new file in the proper format.

What it's supposed to do is read the file contents, extract the trends section, split it by newlines, split each line, and then reorder the columns and maybe reformat them.
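Roughly, the reformatting step described above might look like this (a sketch only; the tab-separated columns and the reordered output format are assumptions, since the exact target format of the modelling program isn't given):

```python
def reformat(contents):
    # Keep only the block after the "Interest over time" header.
    section = contents.split("Interest over time", 1)[1]
    rows = []
    for line in section.splitlines():
        parts = line.split("\t")
        # Keep only rows that look like "<week range>\t<value>".
        if len(parts) == 2 and parts[1].strip().isdigit():
            week, value = parts
            rows.append(value.strip() + "\t" + week)  # reorder the columns
    return "\n".join(rows)

print(reformat("Interest over time\nWeek\tfoobar\n2004-01-04 - 2004-01-10\t44\n"))
```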

When looking at an untouched I4S CSV file, it looks normal, containing CR LF characters at line breaks (maybe that's only because I'm using Windows).

But when I just read the contents and then write them to a new file using the script, weird Asian characters appear between CR and LF. I tried the script with a manually written, similar-looking file, and even with a CSV file from Google Trends, and it works fine.

I use Python, and the script (snippet) I used for the following example looks like this:

    # Read the contents of the input file
    file = open(file, "r")    # note: this shadows the built-in "file"
    contents = file.read()
    file.close()

    # Write them unchanged to a new file
    cfile = open("m.log", "w+")
    cfile.write(contents)
    cfile.close()

Does anybody have an idea why those characters appear? Thank you for your help!

I'll give you an example:

First few lines of I4S csv file:

Web Search Interest: foobar
Worldwide; 2004 - present

Interest over time
Week foobar
2004-01-04 - 2004-01-10 44
2004-01-11 - 2004-01-17 44
2004-01-18 - 2004-01-24 37
2004-01-25 - 2004-01-31 40
2004-02-01 - 2004-02-07 49
2004-02-08 - 2004-02-14 51
2004-02-15 - 2004-02-21 45
2004-02-22 - 2004-02-28 61
2004-02-29 - 2004-03-06 51
2004-03-07 - 2004-03-13 48
2004-03-14 - 2004-03-20 50
2004-03-21 - 2004-03-27 56
2004-03-28 - 2004-04-03 59

Output file when reading and writing contents:

Web Search Interest: foobar
਍圀漀爀氀搀眀椀搀攀㬀 ㈀  㐀 ⴀ 瀀爀攀猀攀渀琀ഀഀ

਍䤀渀琀攀爀攀猀琀 漀瘀攀爀 琀椀洀攀ഀഀ
Week foobar
਍㈀  㐀ⴀ ㄀ⴀ 㐀 ⴀ ㈀  㐀ⴀ ㄀ⴀ㄀ ऀ㐀㐀ഀഀ
2004-01-11 - 2004-01-17 44
਍㈀  㐀ⴀ ㄀ⴀ㄀㠀 ⴀ ㈀  㐀ⴀ ㄀ⴀ㈀㐀ऀ㌀㜀ഀഀ
2004-01-25 - 2004-01-31 40
਍㈀  㐀ⴀ ㈀ⴀ ㄀ ⴀ ㈀  㐀ⴀ ㈀ⴀ 㜀ऀ㐀㤀ഀഀ
2004-02-08 - 2004-02-14 51
਍㈀  㐀ⴀ ㈀ⴀ㄀㔀 ⴀ ㈀  㐀ⴀ ㈀ⴀ㈀㄀ऀ㐀㔀ഀഀ
2004-02-22 - 2004-02-28 61
਍㈀  㐀ⴀ ㈀ⴀ㈀㤀 ⴀ ㈀  㐀ⴀ ㌀ⴀ 㘀ऀ㔀㄀ഀഀ
2004-03-07 - 2004-03-13 48
਍㈀  㐀ⴀ ㌀ⴀ㄀㐀 ⴀ ㈀  㐀ⴀ ㌀ⴀ㈀ ऀ㔀 ഀഀ
2004-03-21 - 2004-03-27 56
਍㈀  㐀ⴀ ㌀ⴀ㈀㠀 ⴀ ㈀  㐀ⴀ 㐀ⴀ ㌀ऀ㔀㤀ഀഀ
2004-04-04 - 2004-04-10 69
਍㈀  㐀ⴀ 㐀ⴀ㄀㄀ ⴀ ㈀  㐀ⴀ 㐀ⴀ㄀㜀ऀ㘀㔀ഀഀ
2004-04-18 - 2004-04-24 51
਍㈀  㐀ⴀ 㐀ⴀ㈀㔀 ⴀ ㈀  㐀ⴀ 㔀ⴀ ㄀ऀ㔀㄀ഀഀ
2004-05-02 - 2004-05-08 56
਍㈀  㐀ⴀ 㔀ⴀ 㤀 ⴀ ㈀  㐀ⴀ 㔀ⴀ㄀㔀ऀ㔀㈀ഀഀ
2004-05-16 - 2004-05-22 54
਍㈀  㐀ⴀ 㔀ⴀ㈀㌀ ⴀ ㈀  㐀ⴀ 㔀ⴀ㈀㤀ऀ㔀㔀ഀഀ
2004-05-30 - 2004-06-05 74
਍㈀  㐀ⴀ 㘀ⴀ 㘀 ⴀ ㈀  㐀ⴀ 㘀ⴀ㄀㈀ऀ㔀㜀ഀഀ
2004-06-13 - 2004-06-19 50
਍㈀  㐀ⴀ 㘀ⴀ㈀  ⴀ ㈀  㐀ⴀ 㘀ⴀ㈀㘀ऀ㔀㐀ഀഀ
2004-06-27 - 2004-07-03 58
਍㈀  㐀ⴀ 㜀ⴀ 㐀 ⴀ ㈀  㐀ⴀ 㜀ⴀ㄀ ऀ㔀㤀ഀഀ
2004-07-11 - 2004-07-17 59
਍㈀  㐀ⴀ 㜀ⴀ㄀㠀 ⴀ ㈀  㐀ⴀ 㜀ⴀ㈀㐀ऀ㘀㈀ഀഀ

+2  A: 

repr() is your friend (except in Python 3.x, where you should use ascii() instead).

prompt>\python26\python -c "print repr(open('report.csv','rb').read()[:300])"
'\xff\xfeW\x00e\x00b\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00I\x00n\x00t\x00e
\x00r\x00e\x00s\x00t\x00:\x00 \x00f\x00o\x00o\x00b\x00a\x00r\x00\r\x00\n\x00W\x0
[snip]
x001\x007\x00\t\x004\x004\x00\r\x00\n\x002\x000\x00'

Sure looks like a UTF-16LE BOM (U+FEFF) in the first two bytes to me.

Notepad.* are NOT your friends. UTF-16 should not be referred to as "UCS-2" or "Unicode".
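A BOM at the start of the raw bytes can be sniffed directly. A minimal sketch in Python 3 syntax (the function name is illustrative; the codec names are the standard Python ones):

```python
def sniff_bom(first_bytes):
    # Note: UTF-32LE's BOM also starts with \xff\xfe, so a fuller
    # sniffer would check for b"\xff\xfe\x00\x00" first.
    if first_bytes.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if first_bytes.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    return None

# The first bytes from the repr() dump above:
print(sniff_bom(b"\xff\xfeW\x00e\x00b\x00"))
```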

The following should help with what to do next:

>>> import codecs
>>> lines = list(codecs.open('report.csv', 'r', encoding='UTF-16'))
>>> import pprint
>>> pprint.pprint(lines[:8])
[u'Web Search Interest: foobar\r\n',
 u'Worldwide; 2004 - present\r\n',
 u'\r\n',
 u'Interest over time\r\n',
 u'Week\tfoobar\r\n',
 u'2004-01-04 - 2004-01-10\t44\r\n',
 u'2004-01-11 - 2004-01-17\t44\r\n',
 u'2004-01-18 - 2004-01-24\t37\r\n']
>>>

Update: Why your output file looks like gobbledegook.

Firstly, you are looking at the files with something (Notepad.*, maybe) that knows the files are allegedly encoded in UTF-16LE and displays them accordingly. So your input file looks fine.

However, your script reads the input file as raw bytes and then writes those bytes to the output file in text mode ('w') rather than binary mode ('wb'). Because you are on Windows, every \n byte is replaced by \r\n on the way out. This adds one byte (HALF of a UTF-16 code unit) to every line, so every SECOND line ends up bassackwards, aka UTF-16BE: the letter A, which is \x41\x00 in UTF-16LE, loses its trailing \x00 and picks up a leading byte (probably \x00) from the character to its left. \x00\x41 read as UTF-16LE is a CJK ("Asian") character.
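The corruption is easy to reproduce (shown here in Python 3 syntax): encode two lines as UTF-16LE, then simulate what Windows text mode does on write, namely prepending a \r byte to every \n byte.

```python
lines = u"Web Search Interest: foobar\r\nWorldwide; 2004 - present"
raw = lines.encode("utf-16-le")          # each char is 2 bytes, e.g. b'W\x00'
# Windows text mode prepends b'\r' to every b'\n' byte on write:
mangled = raw.replace(b"\n", b"\r\n")
# The extra byte shifts the alignment of everything after the line break,
# so the second line decodes as CJK characters (errors="replace" absorbs
# the lone trailing byte left over by the odd total length):
print(mangled.decode("utf-16-le", errors="replace"))
```

The first character after the break decodes as U+0A0D ('਍') and 'W' becomes U+5700 ('圀'), exactly the gibberish in the output file above.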

Suggested reading: the Python Unicode HOWTO and this piece by Joel.

John Machin
Thanks for the answer... yeah, you're right... dealing with charsets was always something mysterious to me, so thanks for the articles you suggested! I'm reading them.
Ahue
Thanks for the thanks ... but do try the up-vote button :-)
John Machin
I voted it up once and then by accident voted it down again. Now I can't do anything unless you edit something in the answer. I'm new to Stack Overflow... figuring out how things work ;)
Ahue
+2  A: 

Found the solution:

It was a character encoding problem. Depending on the editor you use, different character-set encodings are reported:

Notepad++: UCS-2 Little Endian
PSPad: UTF-16LE

Decoding the contents as UCS-2 didn't work, so I tried utf-16le and it went well. extraneon's answer was wrong, but it led me to the site where I learned that passing 'U' in the file-open mode makes "\r\n" be recognized as a line break, too. So now the relevant snippet of my script looks like this:

file = open(file,'rU')
contents = file.read()
file.close()

contents = contents.decode("utf-16le").encode("utf-8")
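One caveat with decoding as "utf-16le" explicitly: the BOM at the start of the file is not consumed and leaks into the text as U+FEFF, whereas the plain "utf-16" codec reads the BOM, picks the right endianness, and strips it. A small demonstration (Python 3 syntax, constructed bytes):

```python
data = b"\xff\xfe" + u"Week\tfoobar".encode("utf-16-le")  # BOM + LE payload
print(repr(data.decode("utf-16-le")[0]))  # the BOM survives as U+FEFF
print(repr(data.decode("utf-16")[0]))     # the codec consumes the BOM
```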

After encoding the contents as UTF-8, I remove all empty lines with:

lines = contents.split("\n")
contents = ""
for line in lines:
  if not line.strip():
    continue
  else:
    contents += line+"\n"
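The same empty-line filter can be written more compactly with a generator expression (a sketch using a small inline sample, since the real contents come from the decoded file):

```python
contents = "Interest over time\n\nWeek\tfoobar\n\n2004-01-04 - 2004-01-10\t44\n"
# Equivalent to the loop above: drop blank lines, keep one trailing newline.
contents = "\n".join(line for line in contents.split("\n") if line.strip()) + "\n"
print(contents)
```

This also avoids repeated string concatenation, which gets slow on large files.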

Now I can proceed splitting and reformatting the file. Thanks to Nick Bastin, you gave me the hint I needed!

Ahue
+2  A: 

The problem is character encoding, possibly in combination with universal line ending support of Python. As you mentioned, the source file is in UCS-2 LE, with a Byte Order Mark (BOM). You need to do something like:

import codecs

input_file = codecs.open("Downloads/report.csv", "r", encoding="utf_16")
contents = input_file.read() 
input_file.close() 

cfile = codecs.open("m.log", "w+", encoding="utf_8")
cfile.write(contents) 
cfile.close()

This will read the input file, decode it properly, and write it to the new file as UTF-8. You'll need to delete your existing m.log.

moreati
As mentioned by John: UCS-2 is an old term; UTF-16 is the correct name.
moreati