I am connecting to an MS SQL server through SQLAlchemy, using the pyodbc module. Everything appeared to be working fine until I started having problems with encodings: some of the non-ASCII characters are being replaced with '?'.

The DB has the collation 'Latin1_General_CI_AS' (I've also checked the specific fields, and they keep the same collation). I started specifying the encoding 'latin1' in the call to create_engine, and that appears to work for Western European characters (like é in French or Spanish) but not for Eastern European ones. Specifically, I have a problem with the character ć.

I have been trying other encodings listed in the Python documentation, specifically the Microsoft ones like cp1250 and cp1252, but I keep facing the same problem.
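For illustration (a quick interpreter check of my own, not from the actual setup): the character ć simply cannot be represented in latin1 or cp1252 at all, though it can in cp1250:

# -*- coding: utf-8 -*-
# Quick check (Python 2, matching the era of this question): which of
# these code pages can represent the character ć (U+0107) at all?
for enc in ('cp1250', 'cp1252', 'latin1'):
    try:
        print enc, repr(u'\u0107'.encode(enc))
    except UnicodeEncodeError:
        print enc, 'cannot represent this character'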

Does anyone know how to resolve these differences? Does the collation 'Latin1_General_CI_AS' have an equivalent among Python's encodings?

The code for my current connection is the following:

import pyodbc
from sqlalchemy import *

def connect():
    return pyodbc.connect('DSN=database;UID=uid;PWD=password')

engine = create_engine('mssql://', creator=connect, encoding='latin1')
connection = engine.connect()

Clarifications and comments:

  • This problem happens when retrieving information from the DB. I don't need to store anything.
  • At the beginning I didn't specify an encoding, and the result was that whenever a non-ASCII character was encountered in the DB, pyodbc raised a UnicodeDecodeError. I worked around that by using 'latin1' as the encoding, but that doesn't solve the problem for all characters (see the short demo after this list).
  • I admit that the server is not in latin1; that comment was incorrect. I have checked both the database collation and the collations of the specific fields, and everything appears to be 'Latin1_General_CI_AS'. How, then, can ć be stored at all? Maybe I'm not understanding collations correctly.
  • I have corrected the question a little; specifically, I have tried more encodings than latin1, including cp1250 and cp1252 (the latter apparently being the one used by 'Latin1_General_CI_AS', according to MSDN).
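Short demo of the decode-side symptom mentioned in the second point above (the byte value is a made-up example, not taken from my database):

# -*- coding: utf-8 -*-
# If the DB hands back the single byte '\xe6' for ć, the codec chosen
# at decode time decides which unicode character you end up with:
print repr('\xe6'.decode('cp1250'))   # u'\u0107' -> ć (correct)
print repr('\xe6'.decode('cp1252'))   # u'\xe6'   -> æ (silently wrong)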

UPDATE:

OK, following the steps at http://bytes.com/topic/sql-server/answers/142972-characters-encoding, I get that the encoding used by the DB appears to be cp1252. Anyway, that turns out to be a bad assumption, as reflected in the answers.
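For what it's worth, SQL Server can also be asked directly which code page a collation maps to; a sketch using the connection from above (COLLATIONPROPERTY is a documented SQL Server function):

# Ask SQL Server which code page the collation maps to:
result = connection.execute(
    "SELECT COLLATIONPROPERTY('Latin1_General_CI_AS', 'CodePage')")
print result.fetchone()[0]   # prints 1252 for this collation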

UPDATE 2: After properly configuring the ODBC driver, I don't need to specify the encoding in the Python code at all.

+1  A: 

You should stop using code pages and switch to Unicode. That is the only way to get rid of this kind of problem.

Sorin Sbarnea
Unfortunately, we are not in control of that database, so we cannot decide how the data is stored :-(
Khelben
Does this relate to your problem: http://code.google.com/p/pyodbc/issues/detail?id=52 ? Check how your ODBC provider is configured; maybe you can make the encoding change there. Don't forget to test this outside pyodbc as well.
Sorin Sbarnea
Yes, that was it! The problem was that I hadn't configured the driver in UTF-8 mode. I had to add `client charset = UTF-8` to the /etc/freetds.conf configuration file.
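For reference, a sketch of the relevant freetds.conf section (host, port and TDS version are placeholders; yours will differ):

# /etc/freetds.conf -- sketch only; host, port and tds version
# below are placeholders and depend on your server
[database]
    host = sqlserver.example.com
    port = 1433
    tds version = 8.0
    # the line that fixed the problem:
    client charset = UTF-8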
Khelben
A: 

OK, per http://msdn.microsoft.com/en-us/library/ms174596(v=SQL.90).aspx, the encoding of Latin1_General_CI_AS is most probably cp1252, so you'd have to use encoding='cp1252'. But this solves only half of the problem, because you also have to output the values somewhere to see whether the characters survived.

So if you have some_db_value, which you extracted from the database, you have to call some_db_value.encode('proper-output-encoding') to get it right. The proper output encoding depends on where the output goes: on the console it is the console encoding, which on Windows can be anything like 'cp1252', 'cp437' or 'cp850'; on the web it is the encoding of the page, hopefully 'utf-8'.
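As a sketch of the whole round trip, assuming cp1252 really is the database encoding (the byte string is a made-up example):

# -*- coding: utf-8 -*-
raw = '\xe9t\xe9'              # made-up bytes as they might come from the DB
text = raw.decode('cp1252')    # -> u'\xe9t\xe9', i.e. u'été'
print text.encode('cp850')     # for e.g. a cp850 Windows console
print text.encode('utf-8')     # for a UTF-8 web page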

Edit: please read John Machin's answer, as it is not clear whether 'cp1252' really is the correct database encoding.

knitti
+1  A: 

Try connecting to the DB with the pyodbc.connect() parameter unicode_results=True, e.g. from SQLAlchemy:

engine = create_engine('mssql://yourdb', connect_args={'unicode_results': True})

This should make sure that all the results you get (and not only those from nvarchar columns etc.) are unicode, correctly converted from whatever encoding is used in the DB.
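Note that the question connects through a creator function, which, as far as I know, bypasses connect_args; a sketch passing the flag to pyodbc.connect() directly instead (same DSN as in the question):

import pyodbc
from sqlalchemy import create_engine

def connect():
    # pass unicode_results directly to pyodbc, since a creator
    # bypasses connect_args
    return pyodbc.connect('DSN=database;UID=uid;PWD=password',
                          unicode_results=True)

engine = create_engine('mssql://', creator=connect)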

As for writing to the DB, just always use unicode. If I'm not mistaken (I'll check later), pyodbc will make sure it gets written to the DB correctly as well.

(Of course, if the DB uses an encoding that does not support the characters you want to write, you will still get errors: if you want the columns to support any kind of character, you will have to use unicode columns in the DB too.)

Steven
+2  A: 

Original comment turned into an answer:

cp1250 and cp1252 are NOT "latin1 encodings". A collation is not an encoding. Re your comment: who says that "the server is encoded in latin1"? If the server expects all input/output to be encoded in latin1 (which I doubt), then you quite simply can't get some Eastern European characters into your database (nor Russian, Chinese, Greek, etc.).

Update:

You need to look further afield than the collation. """msdn.microsoft.com/en-us/library/ms174596(v=SQL.90).aspx suggests that for Latin1_General_CI_AS the encoding used is cp1252""" is codswallop. The table provides an LCID (locale ID), default collation, and codepage for each locale. Yes, the collation "Latin1_General_CI_AS" is listed in association with the cp1252 codepage for several locales. But for two locales (Armenian and Georgian), it is listed in association with the "Unicode" codepage (!!!).

Quite simply, you need to find out what codepage the database is using.

Try to extract data from the database without specifying an encoding at all. Don't bother encoding that in whatever encoding you guess your console may be using -- that only adds another source of confusion. Instead, use print repr(data). Report back here what you get from the repr() where you expect non-Latin1 characters.
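A sketch of that diagnostic (table and column names are placeholders):

row = connection.execute(
    "SELECT problem_column FROM problem_table").fetchone()
print repr(row[0])
# e.g. u'\u0107' means the driver already decoded correctly,
# '\xe6' is a raw byte still needing a decode, and u'\xe6'
# suggests it was decoded with the wrong codec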

John Machin
+1 you are correct, and I didn't read it fully.
knitti