I am reading strings from an Oracle database into a C++ program. The strings may or may not contain Unicode (UTF-8) characters. Is there any way to check whether a string extracted from the database contains Unicode characters? If any are present, they should be converted to hexadecimal format and displayed.

+1  A: 

There are two aspects to this question.

  1. Distinguish UTF-8-encoded characters from ordinary ASCII characters.

    UTF-8 encodes any code point higher than 127 as a series of two or more bytes. Values at 127 and lower remain untouched. Every byte in such a multi-byte sequence is itself higher than 127, so it is sufficient to check a byte's high bit to see whether it belongs to a non-ASCII character.

  2. Display the encoded characters in hexadecimal.

    C++ has std::hex to tell streams to format numeric values in hexadecimal. You can use std::showbase to make the output look pretty. A char isn't treated as numeric, though; streams will just print the character. You'll have to force the value to another numeric type, such as int. Beware of sign-extension, though.
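As a minimal sketch of the high-bit test from item 1, the helpers below classify a single byte and scan a string for anything outside ASCII. The names classify_byte and contains_non_ascii are hypothetical, chosen only for illustration, and the classification assumes the input is well-formed UTF-8 (it does not reject invalid bytes such as 0xff).

```cpp
#include <string>

// Hypothetical helper names, for illustration only.
enum ByteKind { Ascii, Continuation, Lead };

// Classifies one byte, assuming it comes from well-formed UTF-8.
ByteKind classify_byte(unsigned char b)
{
  if ((b & 0x80) == 0x00) return Ascii;        // 0xxxxxxx: plain ASCII
  if ((b & 0xC0) == 0x80) return Continuation; // 10xxxxxx: inside a sequence
  return Lead;                                 // 11xxxxxx: starts a sequence
}

// Returns true if any byte in s has its high bit set.
bool contains_non_ascii(std::string const& s)
{
  for (std::string::size_type i = 0; i < s.size(); ++i)
    if (static_cast<unsigned char>(s[i]) & 0x80)
      return true;
  return false;
}
```

Checking only the high bit is enough to answer "is there anything non-ASCII here"; the lead/continuation distinction matters only if you also want to know where each encoded character starts.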

Here's some code to demonstrate:

#include <iostream>

void print_characters(char const* s)
{
  std::cout << std::showbase << std::hex;
  for (char const* pc = s; *pc; ++pc) {
    if (*pc & 0x80)
      std::cout << (*pc & 0xff);
    else
      std::cout << *pc;
    std::cout << ' ';
  }
  std::cout << std::endl;
}

You could call it like this:

int main()
{
  char const* test = "ab\xef\xbb\xbfhu";
  print_characters(test);
  return 0;
}

Output on Solaris 10 with Sun C++ 5.8:

$ ./a.out
a b 0xef 0xbb 0xbf h u

The code detects UTF-8-encoded characters, but it makes no effort to decode them; you didn't mention needing to do that.

I used *pc & 0xff to convert the expression to an integral type and to mask out the sign-extended bits. Without that, the output on my computer was 0xffffffbb, for instance.
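The sign-extension issue can also be sidestepped with a cast. The sketch below, using a hypothetical helper named byte_to_hex, converts the char to unsigned char before widening it, which has the same effect as masking with 0xff when a char is 8 bits wide.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats one byte as hexadecimal text.
// Converting to unsigned char first prevents sign extension on
// platforms where plain char is signed.
std::string byte_to_hex(char c)
{
  std::ostringstream out;
  out << std::showbase << std::hex
      << static_cast<unsigned int>(static_cast<unsigned char>(c));
  return out.str();
}
```

With this, byte_to_hex('\xbb') yields 0xbb whether or not plain char is signed.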

Rob Kennedy
Hi Rob, I can see that the string you used contains the hex form of Unicode characters, but suppose my database contains Unicode characters, say Arabic. I want to convert the Arabic characters to hex. For example, given char *test = "مرحبا "; I want to print the hexadecimal form of "مرحبا ".
Bhargava dns
You've missed the point. Get your characters into a string however you want, however you can. If that's from a database, then so be it. Once you have your characters in a string, you can use code like I showed to detect non-ASCII characters and print their UTF-8 bytes in hexadecimal format. I've edited the code in the hopes that it highlights the difference between detecting the contents of the string and putting characters into it. My string literal was merely an easy way for me to put something testable into a string.
Rob Kennedy
A: 

I would convert the string to UTF-32 (you can use something like UTF CPP for that - it is very easy), and then loop through the resulting string, detect code points (characters) that are above 0x7F and print them as hex.
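Here is a rough sketch of that approach. It does not use UTF CPP; decode_utf8 is a hand-written, hypothetical stand-in for the library's conversion, and it assumes the input is well-formed UTF-8 (no validation is done). The describe function is likewise just an illustrative name, and std::u32string requires C++11 or later.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a library UTF-8 to UTF-32 conversion.
// Assumes well-formed UTF-8; performs no validation.
std::u32string decode_utf8(std::string const& s)
{
  std::u32string out;
  std::string::size_type i = 0;
  while (i < s.size()) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    char32_t cp;
    int extra; // number of continuation bytes that follow the lead byte
    if (b < 0x80)      { cp = b;        extra = 0; } // 0xxxxxxx
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; } // 110xxxxx
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; } // 1110xxxx
    else               { cp = b & 0x07; extra = 3; } // 11110xxx
    ++i;
    // Each continuation byte contributes its low six bits.
    for (; extra > 0 && i < s.size(); --extra, ++i)
      cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    out.push_back(cp);
  }
  return out;
}

// Returns ASCII characters unchanged and code points above 0x7F
// as hexadecimal, separated by spaces.
std::string describe(std::string const& utf8)
{
  std::u32string cps = decode_utf8(utf8);
  std::ostringstream out;
  out << std::showbase << std::hex;
  for (std::u32string::size_type i = 0; i < cps.size(); ++i) {
    if (i) out << ' ';
    if (cps[i] > 0x7F)
      out << static_cast<unsigned long>(cps[i]);
    else
      out << static_cast<char>(cps[i]);
  }
  return out.str();
}
```

Unlike the byte-wise approach, this prints one value per character rather than one per byte. For the UTF-8 bytes of "مرحبا", describe should give 0x645 0x631 0x62d 0x628 0x627. In real code, a tested library is preferable to this sketch because it also handles malformed input.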

Nemanja Trifunovic