Where can I get started with Unicode-friendly programming in C?

+6 Q:


So, I’m working on a plain-C (ISO/IEC 9899:1999, i.e. C99) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.

Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.

I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.

Any links, manpages, Wikipedia articles, and example code are all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.

A must read before considering anything else, if you’re unfamiliar with Unicode, and what an encoding actually is: http://www.joelonsoftware.com/articles/Unicode.html
The UTF-8 home-page: http://www.utf-8.com/
man 3 iconv (as well as iconv_open and iconvctl)
International Components for Unicode (via Geoff Reedy)
libbasekit, which seems to include light Unicode-handling tools
GLib has some Unicode functions
A basic UTF-8 detector function, by Christoph
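
For the iconv entry in the list above, here is a minimal sketch of a UTF-8 to UTF-32 conversion. It assumes a POSIX `iconv` that accepts the encoding names "UTF-8" and "UTF-32LE" (glibc and GNU libiconv do; other platforms may differ, and some need `-liconv` at link time). The function name `utf8_to_utf32le` is hypothetical, not part of any library.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of UTF-8 into little-endian
 * UTF-32 in out. Returns the number of bytes written, or (size_t)-1 if the
 * converter is unavailable, the input is invalid, or out is too small. */
size_t utf8_to_utf32le(const char *in, size_t in_len, char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8"); /* (to, from) order */
    char *src = (char *)in;  /* iconv() takes char **, but does not modify the input */
    char *dst = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - out_left;
}
```

Asking for an explicit byte order ("UTF-32LE" rather than "UTF-32") is a deliberate choice here: with the plain name, implementations may prepend a byte order mark to the output.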

+9 A:

International Components for Unicode provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:

The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).

Geoff Reedy 2010-02-09 22:07:26

I’ve heard about that (I think Joel mentioned it in the link I added to the first post)… I’m afraid to touch anything IBM, though, they seem to tend towards monolithic software. I’m more looking for stdlib-C stuff, tips, and such, than libraries… I’m trying to keep my dependencies really light for this project. That said, I’ll add them to the original post, they may be useful to others. How heavy *is* ICU? Maybe if it’s really light/simple, it’s worth my time…

elliottcable 2010-02-09 22:08:52

ICU is the non-Microsoft industry standard in Unicode processing -- no need to phear. Although the learning curve is steep-ish. BTW -- If you're only interested in transporting and representing Unicode correctly, then you don't need ICU. ICU is about working with Unicode.

Hassan Syed 2010-02-09 22:14:15

Specifically, I think at this particular moment, the minimum that I need to do is read in (at least) UTF-8/ASCII files, and convert them to an internal, tokenized, UTF-32 ‘string’ representation. Can I easily(-ish) do this *without* ICU, or with something lighter?

elliottcable 2010-02-09 22:22:14

@elliottcable: if that's all you want to do, you only need a UTF-8 decoder, which can be easily written from scratch; I already have a validator ( http://stackoverflow.com/questions/1031645/how-to-detect-utf8-in-plain-c/1031773#1031773 ) and an encoder ( http://stackoverflow.com/questions/1082162/how-to-unescape-html-in-c/1082191#1082191 , function `putc_utf8()`) on stackoverflow, and writing them was straight-forward

Christoph 2010-02-09 22:47:15
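
To illustrate the "UTF-8 decoder written from scratch" idea in the comment above, here is a minimal sketch in C99. It is not Christoph's code; the function names `utf8_decode_one` and `utf8_to_utf32` are made up for this example. It decodes a byte buffer into an array of UTF-32 code points and rejects truncated sequences, overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n bytes).
 * Stores the code point in *out and returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, or malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0)
        return 0;

    /* The lead byte determines the sequence length and the payload bits. */
    if (s[0] < 0x80)                { cp = s[0];        len = 1; min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; min = 0x10000; }
    else return 0;                  /* stray continuation byte or invalid lead */

    if (len > n)
        return 0;                   /* sequence cut off by end of buffer */

    /* Each continuation byte must look like 10xxxxxx and adds six bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min)                     return 0; /* overlong encoding */
    if (cp > 0x10FFFF)                return 0; /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */

    *out = cp;
    return len;
}

/* Hypothetical helper: decode a whole buffer into dst (capacity cap).
 * Returns the number of code points written, or (size_t)-1 on malformed
 * input or insufficient capacity. */
size_t utf8_to_utf32(const unsigned char *src, size_t n,
                     uint32_t *dst, size_t cap)
{
    size_t used = 0, count = 0;

    while (used < n) {
        uint32_t cp;
        size_t len = utf8_decode_one(src + used, n - used, &cp);
        if (len == 0 || count == cap)
            return (size_t)-1;
        dst[count++] = cp;
        used += len;
    }
    return count;
}
```

Since a UTF-8 sequence never decodes to more code points than it has bytes, a destination array with as many `uint32_t` slots as input bytes is always large enough. Whether to fail hard on malformed input (as here) or substitute U+FFFD is a design choice the interpreter would have to make.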

Yes, I surmised as much. The question stands, though; I’m still looking for more useful Unicode resources, not just for myself, but for others. (-:

elliottcable 2010-02-09 23:29:21

ICU is used and developed by a number of organizations, including IBM. You can repackage it to just include the functionality you want. A lot of the 'weight' has to do with 150+ languages, 260+ sublocales, hundreds of codepages, etc.

Steven R. Loomis 2010-05-14 18:05:52

I think one of the interesting questions is - what should your canonical internal format for strings be? The 2 obvious choices (to me at least) are

a) UTF-8 in vanilla C strings

b) UTF-16 in unsigned short arrays

In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything you are interfacing with (stdio, string.h, etc.) will work fine, as long as you remember that those functions count bytes, not characters.
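
One caveat with UTF-8 in plain C strings is that `strlen` reports bytes rather than code points. A rough sketch of counting code points, assuming the string is already valid UTF-8 (the name `utf8_count_code_points` is hypothetical):

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string that is
 * assumed to be valid UTF-8. Every code point has exactly one byte that is
 * not a continuation byte (10xxxxxx), so counting those bytes is enough. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, the six-byte string "h\xC3\xA9llo" ("héllo") has a `strlen` of 6 but contains 5 code points. Note that code points are still not the same thing as user-visible characters, since combining marks are counted separately.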

Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess the encoding by peeking at the first few bytes (byte order marks help).

pm100 2010-02-09 22:24:13
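
The "peeking" at byte order marks mentioned above could look roughly like this. It is only a sketch: the enum and the function name `detect_bom` are invented for this example, and a file without a BOM still needs some other guess (for instance, trying to validate it as UTF-8).

```c
#include <stddef.h>

/* Hypothetical result type for the sketch below. */
enum bom_kind {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
};

/* Hypothetical helper: inspect the first n bytes of a file and report which
 * byte order mark, if any, they start with. *bom_len receives the number of
 * bytes to skip before the actual text. */
enum bom_kind detect_bom(const unsigned char *p, size_t n, size_t *bom_len)
{
    /* UTF-32 LE is checked before UTF-16 LE because FF FE 00 00 also starts
     * with the UTF-16 LE mark FF FE. This is a heuristic: the same bytes
     * could be UTF-16 LE text whose first character is U+0000. */
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) {
        *bom_len = 4;
        return BOM_UTF32_LE;
    }
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) {
        *bom_len = 4;
        return BOM_UTF32_BE;
    }
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }
    *bom_len = 0;
    return BOM_NONE;
}
```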

+1 A:

GLib has some Unicode functions and is a pretty lightweight library. It's nowhere near the level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.

GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. This library has features such as (this list is not a comprehensive list):

Object and type system

Main loop

Dynamic loading of modules (i.e. plug-ins)

Thread support

Timer support

Memory allocator

Threaded Queues (synchronous and asynchronous)

Lists (singly linked, doubly linked, double ended)

Hash tables

Arrays

Trees (N-ary and binary balanced)

String utilities and charset handling

Lexical scanner and XML parser

Base64 (encoding & decoding)

Geoff Reedy 2010-02-09 22:45:55
