I have a Python WSGI script that attempts to extract an RSS item that is posted to it and store the RSS in a sqlite3 db. I am using flup as the WSGIServer.
To obtain the posted content: postData = environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))
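For reference, a minimal sketch of reading the request body defensively. It is written against Python 3's standard library rather than the Python 2.5 setup described here; `read_post_body` is a hypothetical helper name, and treating a missing or empty CONTENT_LENGTH as zero is an assumption, not something flup requires:

```python
import io


def read_post_body(environ):
    """Return the raw request body as bytes (hypothetical helper)."""
    # CONTENT_LENGTH may be absent or an empty string for bodiless requests.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    return environ["wsgi.input"].read(length) if length > 0 else b""


# Simulated WSGI environ, for illustration only.
body = b"\xef\xbb\xbf<rss/>"
fake_environ = {"CONTENT_LENGTH": str(len(body)), "wsgi.input": io.BytesIO(body)}
post_data = read_post_body(fake_environ)
```

The result is still raw bytes at this point; nothing has been decoded yet.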

To attempt to store in the db:

from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect("/var/vhost/mysite.com/db/rssharvested.db")
lcursor = ldb.cursor()
lcursor.execute("INSERT into rss(data) VALUES(?)", (postData,))

This results in only the first few characters of the RSS being stored in the record: ÿþ<. I believe the initial characters are the BOM of the RSS.

I have tried every permutation I could think of, including first encoding the RSS as UTF-8 and then attempting to store it, but the results were the same. I could not decode it because some characters could not be represented as unicode.

Running Python 2.5.2, SQLite 3.5.7

Thanks in advance for any insight into this problem.

+1  A: 

Regarding the insertion encoding - in any decent database API, you should insert unicode strings and unicode strings only.
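A rough sketch of what "insert unicode strings only" could look like, using the standard-library `sqlite3` module in Python 3 in place of `pysqlite2`. The `store_feed` helper, the `TEXT` column type, and the `utf-8-sig` default are assumptions for illustration:

```python
import sqlite3


def store_feed(conn, raw_bytes, encoding="utf-8-sig"):
    """Decode raw bytes to text first, then insert via a bound parameter."""
    # "utf-8-sig" drops a leading UTF-8 BOM if one is present.
    text = raw_bytes.decode(encoding)  # raises UnicodeDecodeError on bad input
    conn.execute("INSERT INTO rss(data) VALUES (?)", (text,))
    conn.commit()
    return text


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE rss (data TEXT)")
store_feed(conn, b"\xef\xbb\xbf<rss><channel/></rss>")
stored = conn.execute("SELECT data FROM rss").fetchone()[0]
```

The point is that decoding happens once, before the database call, so the driver only ever sees text.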

For the reading and parsing bit, I'd recommend Mark Pilgrim's Feed Parser. It properly handles BOM, and the license allows commercial use. This may be a bit too heavy handed if you are not doing any actual parsing on the RSS data.

Deestan
A: 

Before the SQL insertion, you should convert the string to a unicode string. If that raises a UnicodeError exception, then encode it with string.encode("utf-8").

Or you can auto-detect the encoding and handle the data according to its own encoding scheme. See: Auto detect encoding
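One simple form of auto-detection is to look at the byte order mark. The sketch below uses only the standard `codecs` constants; `sniff_encoding` and `to_text` are hypothetical helper names, and falling back to UTF-8 when no BOM is found is an assumption:

```python
import codecs

# Checked in order. The UTF-32 BOMs come before the UTF-16 ones because
# the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data, default="utf-8"):
    """Guess a codec name from a leading BOM, else return the default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def to_text(data):
    """Decode bytes using the sniffed codec; the BOM is consumed by the codec."""
    return data.decode(sniff_encoding(data))
```

This only covers data that actually carries a BOM; anything else would need the XML declaration or a statistical detector.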

Jorge Niedbalski R.
+1  A: 

Are you sure your incoming data are encoded as UTF-16 (otherwise known as UCS-2)?

UTF-16 encoded unicode strings typically include lots of NUL characters (certainly for every character that also exists in ASCII), so UTF-16 data can hardly be stored in environment variables (env vars in POSIX are NUL terminated).
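To illustrate the NUL point, here is a small Python 3 sketch. It shows what remains of little-endian UTF-16 data if something along the way treats it as a NUL-terminated string; that this is what happens in the asker's setup is only an assumption:

```python
import codecs

# Little-endian UTF-16 with an explicit BOM: every ASCII character
# is followed by a NUL byte.
utf16 = codecs.BOM_UTF16_LE + "<?xml".encode("utf-16-le")

# What a NUL-terminated C string would keep: everything before the first NUL.
c_string_view = utf16.split(b"\x00", 1)[0]

# The same bytes rendered as Latin-1 characters.
as_latin1 = c_string_view.decode("latin-1")
```

Here `as_latin1` comes out as `ÿþ<`, the same three characters reported in the question.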

Please provide samples of the postData variable contents. Output them using repr().

Until then, the solid advice is: in all DB interactions, your strings on the Python side should be unicode strings; the DB interface should take care of all translations/encodings/decodings necessary.

ΤΖΩΤΖΙΟΥ
Note: UCS-2 is critically different from UTF-16 on several points. Specifically: a) UCS-2 cannot represent every possible Unicode character like UTF-16 can; b) characters in a UCS-2 string are all 2 bytes in length, while characters in a UTF-16 string may be longer (surrogate pairs).
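A quick Python 3 check of the surrogate-pair point (the two sample characters are arbitrary choices for illustration):

```python
bmp_char = "\u00e9"         # inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # outside the BMP, not representable in UCS-2

# One 16-bit code unit for the BMP character.
bmp_len = len(bmp_char.encode("utf-16-le"))

# Two 16-bit code units (a surrogate pair) for the non-BMP character.
astral_len = len(astral_char.encode("utf-16-le"))
surrogates = astral_char.encode("utf-16-be")
```

`bmp_len` is 2 bytes, while `astral_len` is 4 bytes.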
Deestan
A: 

Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:

'\xef\xbb\xbf

Thanks for all the replies! Very helpful.

jon
A: 

The sample I submitted didn't make it through the Stack Overflow HTML filters, so I will try again, converting the less-than and greater-than signs to entities (the preview indicates this works).

\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><channel><item d3p1:size="0" xsi:type="tFileItem" xmlns:d3p1="http://htinc.com/opensearch-ex/1.0/">
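Note that `\xef\xbb\xbf` is the UTF-8 BOM, while the XML declaration claims utf-16. A hedged Python 3 sketch of checking for that mismatch follows; the `sample` bytes are a shortened stand-in for the data above, not the real payload, and the regular expression is a simplification rather than a full XML parser:

```python
import codecs
import re

# Shortened, made-up stand-in for the posted data.
sample = b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss></rss>'

# The bytes start with the UTF-8 BOM, so decode as UTF-8 and drop the BOM.
has_utf8_bom = sample.startswith(codecs.BOM_UTF8)
text = sample.decode("utf-8-sig")

# Pull out what the XML declaration claims (simplified pattern).
match = re.match(r'<\?xml[^>]*encoding="([^"]+)"', text)
declared = match.group(1) if match else None
```

If the real data behaves like this stand-in, the bytes are UTF-8 and the declaration is simply stale, which could confuse any parser that trusts the declaration.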

jon