So, I’m working on a plain-C (C99, ISO/IEC 9899:1999) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.

Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.

I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.

Any links, manpages, Wikipedia articles, or example code are all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.


+9  A: 

International Components for Unicode (ICU) provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:

The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).

Geoff Reedy
I’ve heard about that (I think Joel mentioned it in the link I added to the first post)… I’m afraid to touch anything IBM, though; they seem to tend towards monolithic software. I’m looking more for stdlib-C stuff, tips, and such, than libraries… I’m trying to keep my dependencies really light for this project. That said, I’ll add them to the original post; they may be useful to others. How heavy *is* ICU? Maybe if it’s really light/simple, it’s worth my time…
elliottcable
ICU is the non-Microsoft industry standard in Unicode processing -- no need to phear, although the learning curve is steep-ish. BTW -- if you're only interested in transporting and representing Unicode correctly, then you don't need ICU. ICU is about working with Unicode.
Hassan Syed
Specifically, I think at this particular moment, the minimum that I need to do is read in (at least) UTF-8/ASCII files, and convert them to an internal, tokenized, UTF-32 ‘string’ representation. Can I easily(-ish) do this *without* ICU, or with something lighter?
elliottcable
@elliottcable: if that's all you want to do, you only need a UTF-8 decoder, which can easily be written from scratch; I already have a validator ( http://stackoverflow.com/questions/1031645/how-to-detect-utf8-in-plain-c/1031773#1031773 ) and an encoder ( http://stackoverflow.com/questions/1082162/how-to-unescape-html-in-c/1082191#1082191 , function `putc_utf8()`) on stackoverflow, and writing them was straightforward
Christoph
Yes, I surmised as much. The question stands, though; I’m still looking for more useful Unicode resources, not just for myself, but for others. (-:
elliottcable
ICU is used and developed by a number of organizations, including IBM. You can repackage it to just include the functionality you want. A lot of the 'weight' has to do with 150+ languages, 260+ sublocales, hundreds of codepages, etc.
Steven R. Loomis
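
For the minimal case discussed in the comments above (read UTF-8, produce UTF-32 code points, no ICU), a from-scratch decoder might look like the sketch below. The names `utf8_decode` and `utf8_to_utf32` are hypothetical helpers made up for this illustration, not part of any library, and this is not the validator or encoder linked in the comments. It assumes strict decoding: overlong forms, surrogates, and values above U+10FFFF are rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-8 bytes
 * s[0..n-1] into *out. Returns the number of bytes consumed (1..4),
 * or 0 if the input is empty, truncated, or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                       /* plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {      /* 110xxxxx: 2-byte sequence */
        cp = s[0] & 0x1F; len = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {      /* 1110xxxx: 3-byte sequence */
        cp = s[0] & 0x0F; len = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {      /* 11110xxx: 4-byte sequence */
        cp = s[0] & 0x07; len = 4; min = 0x10000;
    } else {
        return 0;                            /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                            /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* not a 10xxxxxx continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min)
        return 0;                            /* overlong encoding */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                            /* out of range or surrogate */
    *out = cp;
    return len;
}

/* Hypothetical helper: decode a whole buffer into dst (capacity cap).
 * Returns the number of code points written, or (size_t)-1 on
 * malformed input or if dst is too small. */
size_t utf8_to_utf32(const unsigned char *src, size_t n,
                     uint32_t *dst, size_t cap)
{
    size_t i = 0, count = 0;

    while (i < n) {
        uint32_t cp;
        size_t used = utf8_decode(src + i, n - i, &cp);
        if (used == 0 || count == cap)
            return (size_t)-1;
        dst[count++] = cp;
        i += used;
    }
    return count;
}
```

Since a UTF-8 sequence never decodes to more code points than it has bytes, a destination array of `n` elements is always large enough for an `n`-byte input.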
A: 

I think one of the interesting questions is - what should your canonical internal format for strings be? The 2 obvious choices (to me at least) are

a) UTF-8 in vanilla C strings
b) UTF-16 in unsigned short arrays

In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything you are interfacing with (stdio, string.h, etc.) will work fine, as long as you remember that those functions count bytes, not characters.
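
As a small illustration of what "works fine" means for UTF-8 in ordinary C strings: `strlen()` keeps working but reports bytes, so counting code points needs a separate pass. The helper `utf8_count` below is a hypothetical name for this sketch, and it assumes the input is already valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string
 * that is assumed to be valid UTF-8. Every code point has exactly one
 * byte that is NOT a continuation byte (10xxxxxx), so counting those
 * bytes counts code points. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string `"h\xC3\xA9llo"` ("héllo" with é encoded as two bytes), `strlen()` gives 6 while `utf8_count()` gives 5. Byte-wise searches such as `strstr()` still find complete UTF-8 substrings, because a valid sequence cannot start in the middle of another one.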

Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess the encoding by peeking at the first few bytes (byte order marks help).
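
Peeking for a byte order mark can be sketched roughly as follows. The enum and the function name `detect_bom` are hypothetical, invented for this example. The byte patterns are the standard BOM encodings of U+FEFF. A file without a BOM returns `BOM_NONE`, in which case assuming UTF-8 (and validating it) is one reasonable fallback, though that choice is an assumption of this sketch.

```c
#include <stddef.h>

enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
};

/* Hypothetical helper: inspect the first n bytes of a file and report
 * which byte order mark, if any, they start with. *skip receives the
 * number of BOM bytes to skip before the actual text. */
enum bom_kind detect_bom(const unsigned char *b, size_t n, size_t *skip)
{
    *skip = 0;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        *skip = 3;
        return BOM_UTF8;
    }
    /* UTF-32 must be tested before UTF-16: the UTF-32LE mark
     * FF FE 00 00 begins with the UTF-16LE mark FF FE. */
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) {
        *skip = 4;
        return BOM_UTF32LE;
    }
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) {
        *skip = 4;
        return BOM_UTF32BE;
    }
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        *skip = 2;
        return BOM_UTF16LE;
    }
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        *skip = 2;
        return BOM_UTF16BE;
    }
    return BOM_NONE;
}
```

One known ambiguity: `FF FE 00 00` could also be a UTF-16LE file whose first character is U+0000. For source files that is unlikely, so this sketch treats it as UTF-32LE.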

pm100
+1  A: 

GLib has some Unicode functions and is a pretty lightweight library. It's nowhere near the level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.

GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. This library has features such as (this list is not a comprehensive list):

  • Object and type system
  • Main loop
  • Dynamic loading of modules (i.e. plug-ins)
  • Thread support
  • Timer support
  • Memory allocator
  • Threaded Queues (synchronous and asynchronous)
  • Lists (singly linked, doubly linked, double ended)
  • Hash tables
  • Arrays
  • Trees (N-ary and binary balanced)
  • String utilities and charset handling
  • Lexical scanner and XML parser
  • Base64 (encoding & decoding)
Geoff Reedy