I get a character string together with its encoding (charset), such as iso_8859-1 or utf-8. I need to scan the string and tokenize it into words, as I would with isspace() and ispunct().

Are there character test functions that take charset into account? Also, are there iterators that advance the correct number of bytes?

Note:
I know I can convert the string to UTF-8 and then use Glib::ustring and its facilities. I would like to know whether I can avoid the conversion.

A: 

To do this, convert your text to one known encoding (such as UTF-8), then use functions that work on that encoding. If you don't want to use Glib::ustring, you can call the glib functions directly: g_utf8_find_next_char to iterate and g_unichar_ispunct to classify characters.
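To illustrate what "advancing the correct number of bytes" means for UTF-8, here is a minimal plain-C sketch. It is a stand-in for what a library iterator such as glib's does for you, not glib's implementation; the names utf8_seq_len, utf8_next and utf8_count are made up for this example, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the UTF-8 sequence that starts with `lead`.
 * Assumption: the input is valid UTF-8. A stray continuation byte or an
 * invalid lead byte is simply treated as a one-byte sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Hypothetical helper: step to the start of the next character,
 * never moving past the terminating NUL. */
static const char *utf8_next(const char *p)
{
    assert(p != NULL);
    size_t n = utf8_seq_len((unsigned char)*p);
    while (n-- > 0 && *p != '\0')
        p++;
    return p;
}

/* Count characters (code points) rather than bytes. */
static size_t utf8_count(const char *s)
{
    size_t count = 0;
    for (const char *p = s; *p != '\0'; p = utf8_next(p))
        count++;
    return count;
}
```

Classifying the decoded characters (space, punctuation) for all of Unicode still needs a table-driven library, which is where functions like g_unichar_ispunct come in.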

Scharron
The issue is not avoiding ustring but avoiding the conversion...
davka
A: 

If you want to avoid the conversion at any cost, you would have to write a bunch of different routines:

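#include <stdio.h>
#include <string.h>

/* One handler per supported encoding; each needs its own tokenizing logic. */
static void handle_iso_8859_1(const char *);
static void handle_iso_8859_15(const char *);
static void handle_utf_8(const char *);

/* Dispatch on the encoding name. strcmp is case-sensitive, so normalize
   the name first if it can arrive as "utf-8" or "iso_8859-1". */
static void handle_string(const char *s, const char *encoding) {
  if (strcmp(encoding, "ISO-8859-1") == 0) {
    handle_iso_8859_1(s);
  } else if (strcmp(encoding, "ISO-8859-15") == 0) {
    handle_iso_8859_15(s);
  } else if (strcmp(encoding, "UTF-8") == 0) {
    handle_utf_8(s);
  } else {
    fprintf(stderr, "unknown encoding: %s\n", encoding);
  }
}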

Why do you want to avoid the conversion in the first place? Is it too costly? Is it really too costly? Converting from ISO-8859-1 to UTF-8 is quite cheap and easy to do. Well, maybe you need one extra memory allocation and some copying of bytes. But is that really worth writing mostly the same code three (or more) times?
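To show how cheap the ISO-8859-1 case is, here is a minimal sketch of the conversion. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point, so each byte becomes either one or two UTF-8 bytes. The function name latin1_to_utf8 is made up for this example; other encodings such as ISO-8859-15 would additionally need a small mapping table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 string to a newly allocated UTF-8
 * string. The caller frees the result. Returns NULL if allocation fails. */
static char *latin1_to_utf8(const char *in)
{
    assert(in != NULL);

    /* Worst case: every input byte expands to two output bytes. */
    char *out = malloc(2 * strlen(in) + 1);
    if (out == NULL)
        return NULL;

    char *w = out;
    for (const unsigned char *r = (const unsigned char *)in; *r != '\0'; r++) {
        if (*r < 0x80) {
            *w++ = (char)*r;                    /* ASCII: copied unchanged */
        } else {
            *w++ = (char)(0xC0 | (*r >> 6));    /* lead byte: top 2 bits */
            *w++ = (char)(0x80 | (*r & 0x3F));  /* continuation: low 6 bits */
        }
    }
    *w = '\0';
    return out;
}
```

That is the one extra allocation and one pass over the bytes mentioned above; after it, a single UTF-8 code path can handle the text.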

Roland Illig
A: 

This doesn't answer the iteration part of the question, but isspace, ispunct etc. are locale aware.

If you're working with Unicode then you'll need the wide-character versions: iswspace, iswpunct, etc.

If you don't want to use the global locale, there is the ctype facet of the C++ std::locale class.
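A minimal sketch of the wide-character route, assuming the text has already been decoded into a wchar_t string (for example with mbstowcs, which is itself a conversion). The function name count_words is made up for this example. The classification follows the current LC_CTYPE locale, so a real program would normally call setlocale(LC_CTYPE, "") first.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

/* Count the words in a wide string, treating whitespace and punctuation
 * (as defined by the current locale) as delimiters. */
static size_t count_words(const wchar_t *s)
{
    assert(s != NULL);

    size_t words = 0;
    int in_word = 0;
    for (; *s != L'\0'; s++) {
        int delim = iswspace((wint_t)*s) || iswpunct((wint_t)*s);
        if (delim) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;   /* first character of a new word */
            words++;
        }
    }
    return words;
}
```

Because each wchar_t element is one unit, plain pointer increment is enough to iterate here; the variable-length byte handling has moved into the decoding step.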

Richard Wolf