I'm trying to format a file so that it can be inserted into a database. The file is originally compressed and is around 1.3 MB. Each line looks something like this:

398,%7EAnoniem+001%7E,543,480,7525010,1775,0

This is the code that parses the file:

Village = gzip.open(Root+'\\data'+'\\' +str(Newest_Date[0])+'\\' +str(Newest_Date[1])+'\\' +str(Newest_Date[2])\
                    +'\\'+str(Newest_Date[3])+' village.gz');
Village_Parsed = str
for line in Village:
    Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);
print(Village.readline());

When I run the program I get this error:

    Village_Parsed = Village_Parsed + urllib.parse.unquote_plus(line);
  File "C:\Python31\lib\urllib\parse.py", line 404, in unquote_plus
    string = string.replace('+', ' ')
TypeError: expected an object with the buffer interface

Any idea what is wrong here? Thanks in advance for any help :)

A:

PROBLEM 1 is that urllib.parse.unquote_plus doesn't like the line that you have fed it. The message should really be "Please supply a str object" :-) I suggest that you fix problem 2 below, and insert:

print('line', type(line), repr(line))

immediately after your for statement so that you can see what you are getting in line.

You will find that iterating over the gzip file yields bytes objects:

>>> [line for line in gzip.open('test.gz')]
[b'nudge nudge\n', b'wink wink\n']

Using a mode of 'r' makes no difference:

>>> [line for line in gzip.open('test.gz', 'r')]
[b'nudge nudge\n', b'wink wink\n']

I suggest that instead of passing line to the parsing routine you pass line.decode('UTF-8') ... or whatever encoding was used when the gz file was written.
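As a minimal sketch of that suggestion: the snippet below builds a tiny gzip stream in memory (so it needs no file on disk), decodes each bytes line, and only then unquotes it. The sample line is taken from the question; the names `buffer`, `parsed_lines` and `village_parsed` are illustrative, and 'ascii' is an assumption about how the data was written.

```python
import gzip
import io
import urllib.parse

# Build a small gzip stream in memory so the example needs no file on disk.
raw = b"398,%7EAnoniem+001%7E,543,480,7525010,1775,0\n"
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as archive:
    archive.write(raw)
buffer.seek(0)

parsed_lines = []
with gzip.GzipFile(fileobj=buffer, mode="rb") as archive:
    for line in archive:
        # Each line is a bytes object; decode it to str before unquoting.
        text = line.decode("ascii")  # assumption: the data is plain ASCII
        parsed_lines.append(urllib.parse.unquote_plus(text))

village_parsed = "".join(parsed_lines)
print(repr(village_parsed))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which matters a little for a file with many lines.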

PROBLEM 2 is in this line:

Village_Parsed = str

str is a type. You need an empty str object. To get that, you could call the type i.e. str() which is formally correct but impractical/unusual/scoffable/weird when compared to using a string constant '' ... so do this:

Village_Parsed = ''

You also have PROBLEM 3: your last statement tries to read from the gz file after the for loop has already consumed it up to EOF, so readline() has nothing left to return.
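To see what reading after EOF looks like, here is a small sketch using an in-memory gzip stream with the same two test lines as above. The names `buffer`, `lines` and `leftover` are made up for the illustration.

```python
import gzip
import io

# Write two lines into an in-memory gzip stream.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as archive:
    archive.write(b"nudge nudge\nwink wink\n")
buffer.seek(0)

with gzip.GzipFile(fileobj=buffer, mode="rb") as archive:
    lines = [line for line in archive]  # iterating consumes the whole stream
    leftover = archive.readline()       # at EOF there is nothing left to read

print(lines)
print(repr(leftover))
```

The final readline() does not raise; it just returns an empty bytes object, so a print placed after the loop shows nothing useful.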

John Machin
A: 
import gzip, os, urllib.parse

archive_relpath = os.sep.join(map(str, Newest_Date[:4])) + ' village.gz'  
archive_path = os.path.join(Root, 'data', archive_relpath)

with gzip.open(archive_path) as Village:
    Village_Parsed = ''.join(urllib.parse.unquote_plus(line.decode('ascii'))
                             for line in Village)
    print(Village_Parsed)

Output:

398,~Anoniem 001~,543,480,7525010,1775,0

NOTE: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax says:

This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.

Therefore 'ascii' in the line.decode('ascii') fragment should be replaced by whatever character encoding you've used to encode your text.
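A related point, sketched here rather than taken from the original code: the percent-escapes themselves stand for bytes, and urllib.parse.unquote_plus turns those bytes into characters using its encoding parameter (UTF-8 by default). The sample string below is made up for the illustration; it shows how the same escapes come out differently under two encodings.

```python
import urllib.parse

# Hypothetical sample: "é" percent-encoded as its two UTF-8 bytes (C3 A9).
quoted = "caf%C3%A9+au+lait"

# The encoding argument controls how the escaped bytes become characters.
as_utf8 = urllib.parse.unquote_plus(quoted, encoding="utf-8")
as_latin1 = urllib.parse.unquote_plus(quoted, encoding="latin-1")

print(as_utf8)
print(as_latin1)
```

If the decoded names look garbled in the database, trying a different encoding here is one thing to check.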

J.F. Sebastian
@JFSebastian: Have you actually tried that? I get exactly the same error as the OP ... apart from his initialisation problem, your code appears to be functionally equivalent to his, returning bytes objects.
John Machin
@John Machin: I've tried it (now). I can't find `unquote_plus_from_bytes` so we have to resort to explicit `bytes.decode` method.
J.F. Sebastian
Thanks, your solution works great, and thanks for pointing out my other mistakes (Machin and Sebastian). I'm not sure if ascii was the character encoding that was used, but it works without any problems as far as I can see.