So, I’m working on a plain-C (ANSI 9899:1999) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.
Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.
I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.
Any links, manpages, Wikipedia articles, example code, is all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.
- A must read before considering anything else, if you’re unfamiliar with Unicode, and what an encoding actually is: http://www.joelonsoftware.com/articles/Unicode.html
- The UTF-8 home-page: http://www.utf-8.com/
man 3 iconv
(as well asiconv_open
andiconvctl
)- International Components for Unicode (via Geoff Reedy)
libbasekit
, which seems to include light Unicode-handling tools- Glib has some Unicode functions
- A basic UTF-8 detector function, by Christoph