Hi, I need a regular expression engine that supports raw UTF-8, meaning the UTF-8 string is stored in a char * with each character taking one or more chars. For example, Ab is the array {0x41, 0x62}. Does anyone know of a regex engine that can accept that format? I can convert to wchar_t first if needed.
A:
The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.
majkinetor
2009-04-24 11:05:00
+2
A:
This page says that it is possible with Boost.Regex, on the condition that you configure and use the ICU library.
Benoît
2009-04-24 11:05:38
A:
Dealing with the variable-length nature of UTF-8 characters makes it very hard to write algorithms (such as regex matching) directly on the encoded bytes.
It's better to convert the UTF-8 string to a Unicode std::wstring with ICU and then use the wide-character variant of boost::regex (boost::wregex).
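To make the convert-then-match idea concrete, here is a minimal sketch that uses only the C++ standard library. It is not the ICU/Boost route itself: the decoder `utf8_to_wstring` and the wrapper `utf8_regex_match` are hypothetical helpers written for illustration, and `std::wregex` stands in for `boost::wregex`. It assumes `wchar_t` is 32 bits wide (typical on Linux and macOS, not on Windows) and does only basic validation, so ICU remains the safer choice for real input.

```cpp
#include <cstddef>
#include <cstdint>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ICU or Boost): decode UTF-8 bytes into
// one wchar_t per code point. Assumes a 32-bit wchar_t; on Windows, code
// points above U+FFFF would need surrogate pairs instead. Overlong forms
// and surrogate code points are not rejected in this sketch.
std::wstring utf8_to_wstring(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes this character occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical wrapper: match a whole UTF-8 string against a wide pattern.
bool utf8_regex_match(const std::string& utf8, const std::wstring& pattern) {
    const std::wstring wide = utf8_to_wstring(utf8);
    const std::wregex re(pattern);
    return std::regex_match(wide, re);
}
```

After decoding, `.` in the pattern consumes one whole character rather than one byte, so a call such as `utf8_regex_match("A\xC3\xA9" "b", L"A.b")` treats the two-byte é as a single character.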
lothar
2009-04-24 17:28:27