ansaurus

Question

How do I reverse a UTF-8 string in place?

Answer 1

+7 A:

I'd make one pass reversing the bytes, then a second pass that reverses the bytes in any multibyte characters (which are easily detected in UTF8) back to their correct order.

You can definitely handle this in line in a single pass, but I wouldn't bother unless the routine became a bottleneck.

Michael Burr 2008-10-13 22:34:52

Yes, that's what I thought. Thanks.

Juan Pablo Califano 2008-10-13 22:36:34

Answer 2

+5 A:

My initial approach could by summarized this way:

1) Reverse bytes naively

2) Run the string backwards and fix the utf8 sequences as you go.

Illegal sequences are dealt with in the second step and in the first step, we check if the string is in "sync" (that is, if it starts with a legal leading byte).

EDIT: improved validation for leading byte in Reverse()

class UTF8Utils {


    public static void Reverse(byte[] str) {
     int len = str.Length;
     int i = 0;
     int j = len - 1;

     // first, check if the string is "synced", i.e., it starts
     // with a valid leading character. Will check for illegal 
     // sequences thru the whole string later.
     byte leadChar = str[0];

     // if it starts with 10xx xxx, it's a trailing char...
     // if it starts with 1111 10xx or 1111 110x 
     // it's out of the 4 bytes range.
 // EDIT: added validation for 7 bytes seq and 0xff
     if( (leadChar & 0xc0) == 0x80 ||
      (leadChar & 0xfc) == 0xf8 ||
      (leadChar & 0xfe) == 0xfc ||
  (leadChar & 0xff) == 0xfe ||
  leadChar == 0xff) {

      throw new Exception("Illegal UTF-8 sequence");

     }

     // reverse bytes in-place naïvely
     while(i < j) {
      byte tmp = str[i];
      str[i] = str[j];
      str[j] = tmp;
      i++;
      j--;
     }
     // now, run the string again to fix the multibyte sequences
     UTF8Utils.ReverseMbSequences(str);

    }

    private static void ReverseMbSequences(byte[] str) {
     int i = str.Length - 1;
     byte leadChar = 0;
     int nBytes = 0;

     // loop backwards thru the reversed buffer
     while(i >= 0) {
      // since the first byte in the unreversed buffer is assumed to be
      // the leading char of that byte, it seems safe to assume that the  
      // last byte is now the leading char. (Given that the string is
      // not out of sync -- we checked that out already)
      leadChar = str[i];

      // check how many bytes this sequence takes and validate against
      // illegal sequences
      if(leadChar < 0x80) {
       nBytes = 1;
      } else if((leadChar & 0xe0) == 0xc0) {
       if((str[i-1] & 0xc0) != 0x80) {
        throw new Exception("Illegal UTF-8 sequence");
       }
       nBytes = 2;
      } else if ((leadChar & 0xf0) == 0xe0) {
       if((str[i-1] & 0xc0) != 0x80 ||
        (str[i-2] & 0xc0) != 0x80 ) {
        throw new Exception("Illegal UTF-8 sequence");
       }
       nBytes = 3;
      } else if ((leadChar & 0xf8) == 0xf0) {
       if((str[i-1] & 0xc0) != 0x80 ||
        (str[i-2] & 0xc0) != 0x80 ||
        (str[i-3] & 0xc0) != 0x80  ) {
        throw new Exception("Illegal UTF-8 sequence");
       }
       nBytes = 4;
      } else {
       throw new Exception("Illegal UTF-8 sequence");
      }

      // now, reverse the current sequence and then continue
      // whith the next one
      int back = i;
      int front = back - nBytes + 1;

      while(front < back) {
       byte tmp = str[front];
       str[front] = str[back];
       str[back] = tmp;
       front++;
       back--;
      }
      i -= nBytes;
     }
    }
}

Juan Pablo Califano 2008-10-13 22:35:08

Answer 3

A:

The best solution:

Convert to a wide char string
Reverse the new string

Never, never, never, never treat single bytes as characters.

gnud 2008-10-13 22:36:16

I agree that it's probably the best solution in "real" code (that or using a decent library). But I'm interested in how would you do it, if you had to do it in place.

Juan Pablo Califano 2008-10-13 22:40:16

That doesn't work for many reasons. Even for the sake of this contrived problem, UTF-8 can represent characters that end up being more than two bytes in UTF-16.

Jim Puls 2008-10-13 22:49:51

Jim: look up <a href="http://kerneltrap.org/man/linux/man0p/stddef.h.0p">man stddef.h</a> -- not room for the definition of wchar_t in this comment, but I read it to mean that if the compiling enviroment supports a charset with e.g. 6 byte encoding, wchar_t must be >= 6 bytes.

gnud 2008-10-13 23:01:43

gnud, probably Jim is right, but anyway it's not that I need to write code for an actual task, I just would like to see how could one best solve this problems, given those constrains. It's just self-educational.

Juan Pablo Califano 2008-10-13 23:12:27

Yes, i understood your angele - eventually =). But equating UTF-16 with wchar_t is josta as wrong as equating a character with 8 bitt. On my system, wchar_t is 32 bits. And as I read the standard, each wchar_t is guaranteed to hold one character.

gnud 2008-10-14 09:28:02

Answer 4

+4 A:

Agree that your approach is the only sane way to do it in-place.

Personally I don't like revalidating UTF8 inside every function that deals with it, and generally only do what's needed to avoid crashes; it adds up to a lot less code. Dunno much C# so here it is in C:

(edited to eliminate strlen)

void reverse( char *start, char *end )
{
    while( start < end )
    {
        char c = *start;
        *start++ = *end;
        *end-- = c;
    }
}

char *reverse_char( char *start )
{
    char *end = start;
    while( (end[1] & 0xC0) == 0x80 ) end++;
    reverse( start, end );
    return( end+1 );
}

void reverse_string( char *string )
{
    char *end = string;
    while( *end ) end = reverse_char( end );
    reverse( string, end-1 );
}

2008-10-13 23:14:00

Juan Pablo Califano 2008-10-13 23:24:55

Nicely done, MikeF. BTW: you probably forgot a `char *start= string;` in the beginning of `reverse_string`.

ΤΖΩΤΖΙΟΥ 2008-10-13 23:50:01

Ουπς...ευχαριστώ.

2008-10-13 23:59:41

Compatriot or polyglot?-)

ΤΖΩΤΖΙΟΥ 2008-10-14 00:09:05

Compatriot, but raised in the UK. BTW, I just noticed that properly reversing a Unicode string actually involves preserving the order of composite characters too :p

2008-10-14 00:15:05

Definitely! Especially for UTF-16, since I've never seen a UTF-32 encoded string with surrogate characters. (I maintain a database of bytestring/encoding/language used to identify encoding and language of input strings; think subtitle files).

ΤΖΩΤΖΙΟΥ 2008-10-14 00:30:59

Answer 5

+1 A:

This code assumes that the input UTF-8 string is valid and well formed (i.e. at most 4 bytes per multibyte character):

#include "string.h"

void utf8rev(char *str)
{
    /* this assumes that str is valid UTF-8 */
    char    *scanl, *scanr, *scanr2, c;

    /* first reverse the string */
    for (scanl= str, scanr= str + strlen(str); scanl < scanr;)
        c= *scanl, *scanl++= *--scanr, *scanr= c;

    /* then scan all bytes and reverse each multibyte character */
    for (scanl= scanr= str; c= *scanr++;) {
        if ( (c & 0x80) == 0) // ASCII char
            scanl= scanr;
        else if ( (c & 0xc0) == 0xc0 ) { // start of multibyte
            scanr2= scanr;
            switch (scanr - scanl) {
                case 4: c= *scanl, *scanl++= *--scanr, *scanr= c; // fallthrough
                case 3: // fallthrough
                case 2: c= *scanl, *scanl++= *--scanr, *scanr= c;
            }
            scanr= scanl= scanr2;
        }
    }
}

// quick and dirty main for testing purposes
#include "stdio.h"

int main(int argc, char* argv[])
{
    char buffer[256];
    buffer[sizeof(buffer)-1]= '\0';

    while (--argc > 0) {
        strncpy(buffer, argv[argc], sizeof(buffer)-1); // don't overwrite final null
        printf("%s → ", buffer);
        utf8rev(buffer);
        printf("%s\n", buffer);
    }
    return 0;
}

If you compile this program (example name: so199260.c) and run it on a UTF-8 environment (a Linux installation in this case):

$ so199260 γεια και χαρά français АДЖИ a♠♡♢♣b
a♠♡♢♣b → b♣♢♡♠a
АДЖИ → ИЖДА
français → siaçnarf
χαρά → άραχ
και → ιακ
γεια → αιεγ

If the code is too cryptic, I will happily clarify.

ΤΖΩΤΖΙΟΥ 2008-10-13 23:44:36

Neat! But how does the 3-byte character case work? Also, I think it turns out simpler if you reverse the individual characters first.

2008-10-13 23:57:04

The three byte character works with a single swap (bytes [0] and [2]), [1] does not need swapping. I'm sorry for the cryptic code, for years I code in Python and all C-code I write is for memory-constrained environments with not-so-clever compilers, so I tend to optimize code size a lot.

ΤΖΩΤΖΙΟΥ 2008-10-13 23:58:49

Yes, your method is much simpler; in my code, if I reverse the string at the end (skipping the strlen call) then my character reversing process needs refactoring.

ΤΖΩΤΖΙΟΥ 2008-10-14 00:06:44

Well, yeah, I scratched my head for a moment trying to figure out the switch argument (scanr - scanl) and what you were doing in each case, but now I get it. Thanks.

Juan Pablo Califano 2008-10-14 00:15:55

ansaurus

tags:

views:

answers:

How do I reverse a UTF-8 string in place?

related questions