I am reading string data from an Oracle database into a C++ program, and the data may or may not contain Unicode characters. Is there any way to check whether a string extracted from the database contains Unicode characters (UTF-8)? If any Unicode characters are present, they should be converted to hexadecimal format and displayed.
There are two aspects to this question.
Distinguish UTF-8-encoded characters from ordinary ASCII characters.
UTF-8 encodes any code point higher than 127 as a series of two or more bytes. Values at 127 and lower remain untouched. Every byte of such a multi-byte sequence is itself higher than 127, so it is sufficient to check a byte's high bit to see whether it belongs to an encoded character.
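The high-bit check described above can be sketched as a small helper. The function name has_non_ascii is made up for illustration, and the sketch only reports whether such bytes are present; it does not verify that they form valid UTF-8.

```cpp
#include <string>

// Hypothetical helper: returns true if any byte in s has its high bit set,
// meaning the string contains at least one byte outside the ASCII range.
// This does not check that those bytes form well-formed UTF-8.
bool has_non_ascii(const std::string& s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Go through unsigned char so the test does not depend on
        // whether plain char is signed on this platform.
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if (byte & 0x80)
            return true;
    }
    return false;
}
```

A string fetched from the database could be passed to this first, and only handed to the hexadecimal printing code if it returns true.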
Display the encoded characters in hexadecimal.
C++ has std::hex to tell streams to format numeric values in hexadecimal. You can use std::showbase to make the output look pretty. A char isn't treated as numeric, though; streams will just print the character. You'll have to force the value to another numeric type, such as int. Beware of sign-extension, though.
Here's some code to demonstrate:
#include <iostream>

void print_characters(char const* s)
{
    std::cout << std::showbase << std::hex;
    for (char const* pc = s; *pc; ++pc) {
        if (*pc & 0x80)
            std::cout << (*pc & 0xff);
        else
            std::cout << *pc;
        std::cout << ' ';
    }
    std::cout << std::endl;
}
You could call it like this:
int main()
{
    char const* test = "ab\xef\xbb\xbfhu";
    print_characters(test);
    return 0;
}
Output on Solaris 10 with Sun C++ 5.8:
$ ./a.out
a b 0xef 0xbb 0xbf h u
The code detects UTF-8-encoded characters, but it makes no effort to decode them; you didn't mention needing to do that.
I used *pc & 0xff to convert the expression to an integral type and to mask out the sign-extended bits. Without that, the output on my computer was 0xffffffbb, for instance.
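An alternative to masking, sketched here under the assumption that plain char is signed on the platform, is to cast through unsigned char before widening to int. The helper name byte_to_hex is hypothetical, and it writes into a string stream rather than std::cout so the result is easy to inspect.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats one byte as hexadecimal with a 0x prefix.
// Casting to unsigned char first keeps the value in 0..255, so widening
// to int cannot sign-extend it into something like 0xffffffbb.
std::string byte_to_hex(char c)
{
    std::ostringstream out;
    out << std::showbase << std::hex
        << static_cast<int>(static_cast<unsigned char>(c));
    return out.str();
}
```

With this, byte_to_hex('\xbb') gives 0xbb, the same result as the *pc & 0xff expression.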
I would convert the string to UTF-32 (you can use something like UTF CPP for that - it is very easy), and then loop through the resulting string, detect code points (characters) that are above 0x7F and print them as hex.
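As a rough sketch of that approach, the code below decodes UTF-8 into code points by hand and prints the ones above 0x7F in hexadecimal. It is a stand-in for a library such as UTF CPP, not that library's API; the names decode_utf8 and describe are made up for illustration. It assumes mostly well-formed input: it stops at the first malformed or truncated sequence and does not reject overlong encodings or surrogates, which a real library would.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical hand-rolled decoder: turns UTF-8 bytes into code points.
// Stops at the first malformed or truncated sequence.
std::vector<unsigned long> decode_utf8(const std::string& s)
{
    std::vector<unsigned long> out;
    std::string::size_type i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        unsigned long cp;
        std::string::size_type extra; // number of continuation bytes
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return out;                // stray continuation or invalid lead
        if (i + extra >= s.size())
            return out;                 // sequence runs past the end
        for (std::string::size_type k = 1; k <= extra; ++k) {
            unsigned char next = static_cast<unsigned char>(s[i + k]);
            if ((next & 0xC0) != 0x80)
                return out;             // not a continuation byte
            cp = (cp << 6) | (next & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: ASCII characters are kept as-is, anything above
// 0x7F is shown as a hexadecimal code point, separated by spaces.
std::string describe(const std::string& s)
{
    std::vector<unsigned long> cps = decode_utf8(s);
    std::ostringstream out;
    out << std::showbase << std::hex;
    for (std::size_t n = 0; n < cps.size(); ++n) {
        if (n != 0)
            out << ' ';
        if (cps[n] > 0x7F)
            out << cps[n];
        else
            out << static_cast<char>(cps[n]);
    }
    return out.str();
}
```

Unlike the byte-by-byte version, this reports one value per character: the three bytes 0xef 0xbb 0xbf from the earlier test string come out as the single code point 0xfeff.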