Okay, here's yet another character encoding question, demonstrating my ignorance of all things Unicode.
I am reading data out of Microsoft Excel `.xls` files and storing it in ESRI shapefiles (`.shp`). For versions of Excel later than 5.0, text in Excel files is stored as Unicode. However, Unicode (and specifically UTF-8) support for shapefiles is inconsistent, and thus I think I should not use it at all. Shapefiles do support old-school codepages, however.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
As I understand it, a Unicode string can include characters from multiple "codepages". I would assume, therefore, that I must somehow estimate the "best" codepage to use, and then convert all unsupported characters into their closest approximation in that codepage (or the dreaded `?`). Is this the usual approach?
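My current thinking, sketched in Python, is something like the following. The candidate codepage names and the per-character loss count are my own guesses at an approach, not anything I have seen recommended:

```python
# Guess the "best" codepage by counting, for each candidate, how many
# characters cannot be represented, then pick the one with the fewest losses.
CANDIDATES = ["cp1252", "cp1250", "cp1251", "cp1253", "cp1254"]  # Windows ANSI, EE, Russian, Greek, Turkish

def count_losses(text: str, codepage: str) -> int:
    """Number of characters in `text` that the codepage cannot represent."""
    losses = 0
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            losses += 1
    return losses

def best_codepage(text: str, candidates=CANDIDATES) -> str:
    """Candidate codepage that loses the fewest characters."""
    return min(candidates, key=lambda cp: count_losses(text, cp))

def to_codepage(text: str, codepage: str) -> bytes:
    # Unsupported characters degrade to '?' (the dreaded fallback).
    return text.encode(codepage, errors="replace")
```

For example, `best_codepage("привет")` would settle on `cp1251`, since the Cyrillic text encodes there without loss, while `cp1252` would lose every character.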
I can definitely use more than just the system codepage. Because `.shp` files use `.dbf` files to store their attribute data, at least all the codepages specified by the `.dbf` format should be supported (see the xBase format description). The supported codepages are: DOS USA, DOS Multilingual, Windows ANSI, Standard Macintosh, EE MS-DOS, Nordic MS-DOS, Russian MS-DOS, Icelandic MS-DOS, Kamenicky (Czech) MS-DOS, Mazovia (Polish) MS-DOS, Greek MS-DOS (437G), Turkish MS-DOS, Russian Macintosh, Eastern European Macintosh, Greek Macintosh, Windows EE, Russian Windows, Turkish Windows, Greek Windows.

In addition, some applications support the use of a `*.cpg` file which specifies additional codepages to use (although I understand support for UTF-8, and I suspect many other codepages, is limited).
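For what it's worth, here is my tentative mapping from those xBase codepage names to Python codec names. These are my own guesses from the standard-encodings table; Kamenicky (Czech) and Mazovia (Polish) MS-DOS don't appear to have stdlib codecs at all, so I have omitted them:

```python
import codecs

# Tentative mapping from the xBase codepage names to Python codec names.
# Kamenicky and Mazovia MS-DOS are omitted: no codec in the Python stdlib.
DBF_CODEPAGES = {
    "DOS USA": "cp437",
    "DOS Multilingual": "cp850",
    "Windows ANSI": "cp1252",
    "Standard Macintosh": "mac_roman",
    "EE MS-DOS": "cp852",
    "Nordic MS-DOS": "cp865",
    "Russian MS-DOS": "cp866",
    "Icelandic MS-DOS": "cp861",
    "Greek MS-DOS (437G)": "cp737",
    "Turkish MS-DOS": "cp857",
    "Russian Macintosh": "mac_cyrillic",
    "Eastern European Macintosh": "mac_latin2",
    "Greek Macintosh": "mac_greek",
    "Windows EE": "cp1250",
    "Russian Windows": "cp1251",
    "Turkish Windows": "cp1254",
    "Greek Windows": "cp1253",
}

# Sanity check: every codec name should resolve in this Python install.
for name in DBF_CODEPAGES.values():
    codecs.lookup(name)
```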
Because I am trying to develop a general-purpose tool, I can't assume anything about the content of the Unicode in the `.xls` files.