Even having read some docs, you seem to be confused about how unicode works.
- Unicode is not an encoding. Unicode is the absence of encodings.
- utf-8 is not unicode. utf-8 is an encoding.
- You decode utf-8 bytestrings to get unicode. You encode unicode using an encoding, say, utf-8, to get an encoded bytestring.
- Only bytestrings can be saved to disk or to a database, sent over a network, or output to a printer or screen. Unicode only exists inside your code.
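To make the points above concrete, here is a minimal sketch. The variable names are just for illustration. It is written so it runs on Python 2 and Python 3; on Python 3 the "bytestring" type is called bytes and the "unicode" type is called str.

```python
# The same text can be turned into different bytes by different encodings.
raw = b"\xe2\x82\xac"             # three bytes: the euro sign encoded as utf-8

text = raw.decode("utf-8")        # decode: bytestring -> unicode
# `text` is now a single codepoint, U+20AC, with no encoding attached.

as_utf8 = text.encode("utf-8")        # encode: unicode -> bytestring
as_utf16 = text.encode("utf-16-le")   # same text, different bytes

print(len(raw), len(text))        # 3 bytes, but 1 character
print(repr(as_utf8), repr(as_utf16))
```

The unicode object is the same in both cases; only the encoded bytestrings differ.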
The good practice is to decode everything you get as early as possible, work with it decoded, as unicode, in all your code, and then encode it as late as possible, when the text is ready to leave your program, to screen, database or network.
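A rough sketch of that practice, assuming a hypothetical function `shout` that stands in for "all your code" and assuming the input is utf-8:

```python
def shout(raw_bytes, encoding="utf-8"):
    """Hypothetical example: decode at the edge, work on unicode, encode on exit."""
    text = raw_bytes.decode(encoding)   # decode as early as possible
    text = text.upper()                 # all real work happens on decoded text
    return text.encode(encoding)        # encode as late as possible

incoming = b"caf\xc3\xa9"               # utf-8 bytes for "café"
outgoing = shout(incoming)
print(repr(outgoing))
```

Note that `upper()` is applied to the decoded text, so the accented letter is uppercased as a character rather than being treated as two unrelated bytes.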
Now for your problem:
If you have a text that came from the browser, say, from a form, then it is encoded. It is a bytestring. It is not unicode.
You must then decode it to get unicode. Decode it using the encoding the browser used to encode it. The correct encoding comes from the browser itself, in the HTTP request headers (the charset parameter of Content-Type, when present).
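If you have to do this by hand, one possible sketch is below. It assumes the request carries a Content-Type header with a charset parameter; the header value and the helper name `charset_from_content_type` are made up for the example, and the standard library's email.message.Message is used only as a convenient header parser.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Hypothetical helper: pull the charset parameter out of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or `default` if absent.
    return msg.get_content_charset(default)

# Assumed header value, for illustration only.
charset = charset_from_content_type(
    "application/x-www-form-urlencoded; charset=UTF-8")

form_value = b"na\xc3\xafve"            # bytes as received from the form
text = form_value.decode(charset)       # now it is unicode
print(charset, repr(text))
```

If the header has no charset parameter, the helper returns the default and you have to decide on a fallback yourself.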
Don't use 'ignore' when decoding. Since the browser said which encoding it is using, you shouldn't get any errors. Using 'ignore' means you will hide a bug if there is one.
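A small sketch of what that looks like in practice. The sample bytes are latin-1 data wrongly assumed to be utf-8, which is exactly the kind of bug 'ignore' hides:

```python
data = b"caf\xe9"    # "café" encoded as latin-1, which is not valid utf-8

# Strict decoding (the default) fails loudly, so you notice the bug.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the offending byte, and the text is now wrong.
silently_damaged = data.decode("utf-8", "ignore")

print(strict_failed, repr(silently_damaged))
```

With 'ignore' the last letter simply disappears and nothing tells you that the assumed encoding was wrong.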
Perhaps your web framework of choice already does that. I know that Django, Pylons, Werkzeug and CherryPy all do that. In that case you already get unicode.
Now that you have a decoded unicode string, you can encode it using whatever encoding you like to store on the database. utf-8 is a good choice, since it can encode all unicode codepoints.
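To illustrate why, here is a short sketch comparing utf-8 with a narrower encoding. latin-1 was picked only as an example of an encoding that cannot represent every codepoint:

```python
text = u"\u65e5\u672c\u8a9e"        # three Japanese characters

stored = text.encode("utf-8")       # works for any unicode text

# An encoding with a smaller repertoire refuses text it cannot represent.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(text), len(stored), latin1_ok)
```

Decoding `stored` with utf-8 again gives back exactly the original text.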
When you retrieve the data from the database, decode it using the same encoding you used to store it. And then encode it using the encoding you want to use on the page - the one declared in the HTML meta header <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. If that encoding is the same one used in the previous step, you can skip the decode/reencode since the data is already encoded in utf-8.
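As a sketch of that last step, assuming the database holds utf-8 bytes: the names `DB_ENCODING`, `PAGE_ENCODING` and `render_bytes` are invented for the example, and the page encoding is set to iso-8859-1 only to show the re-encoding path.

```python
import codecs

DB_ENCODING = "utf-8"            # assumption: what was used when storing
PAGE_ENCODING = "iso-8859-1"     # assumption: what the page declares

def render_bytes(stored, db_encoding=DB_ENCODING, page_encoding=PAGE_ENCODING):
    """Hypothetical helper: turn stored bytes into bytes for the page."""
    # codecs.lookup() normalizes aliases such as "UTF8" and "utf-8".
    if codecs.lookup(db_encoding).name == codecs.lookup(page_encoding).name:
        return stored                                   # same encoding: nothing to do
    return stored.decode(db_encoding).encode(page_encoding)

stored = b"caf\xc3\xa9"                                 # utf-8 bytes from the database
page_bytes = render_bytes(stored)                       # re-encoded for the page
unchanged = render_bytes(stored, page_encoding="UTF8")  # shortcut taken
print(repr(page_bytes), repr(unchanged))
```

The encode step raises UnicodeEncodeError if the page encoding cannot represent the text, which is one more reason to declare utf-8 on the page as well.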
If you see ??? then the data is being lost at one of the steps above. To know exactly which one, more information is needed.
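One common way question marks appear, shown only as an illustration and not as a diagnosis of your case, is some layer encoding the text into a charset that cannot represent it while replacing the characters it cannot encode:

```python
text = u"caf\u00e9 \u65e5\u672c"          # "café" plus two Japanese characters

# With 'replace', every character the target encoding lacks becomes "?".
lossy = text.encode("ascii", "replace")

# Decoding afterwards cannot bring the original characters back.
recovered = lossy.decode("ascii")

print(repr(lossy), recovered == text)
```

Once that has happened the original characters are gone from the stored bytes, so the fix has to be made at the step that did the lossy encoding.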