Hi, I need a regular expression engine that supports raw UTF-8, meaning the UTF-8 string is stored in a char * with each character taking one or more chars. For example, Ab is the array {0x41,0x62}. Does anyone know of a regex engine that can take that format? I can convert to wchar_t first if needed.

A: 

The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.

majkinetor
+2  A: 

This page says that it is possible with Boost.Regex, provided that you configure and use the ICU library.

Benoît
A: 

Because UTF-8 is a variable-length encoding (a character can take anywhere from one to four bytes), it is very hard to write algorithms such as regex matching that work directly on the raw bytes.
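To illustrate the variable-length point, here is a minimal sketch that counts characters (code points) in a UTF-8 byte string. The function name utf8_code_points is made up for this example, and the code assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string.
// Every code point has exactly one byte that is NOT a continuation
// byte (continuation bytes look like 10xxxxxx), so counting those
// bytes gives the number of code points.
// Assumes well-formed UTF-8; malformed input is not detected.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the "Ab" example from the question, the byte count and the code point count are both 2. For a string such as "h\xC3\xA9llo" (an e with an acute accent encoded as two bytes), size() reports 6 bytes while the helper reports 5 code points, which is the mismatch a byte-oriented regex engine trips over.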

It's better to convert the UTF-8 string to a Unicode wstring with ICU and then use the wstring variant of boost::regex (boost::wregex).
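A rough sketch of the convert-then-match idea follows. To keep it self-contained it does not use ICU or Boost: a small hand-written decoder stands in for the ICU conversion, and std::wregex from the standard library stands in for boost::wregex. The names utf8_to_wstring and matches_wide are made up for this example. It also assumes wchar_t is 32 bits wide (typical on Linux and macOS); with a 16-bit wchar_t, code points above U+FFFF would need surrogate pairs, which this sketch does not produce.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an ICU conversion: decodes UTF-8 into a
// wide string, one code point per wchar_t (assumes 32-bit wchar_t).
// Rejects bad lead/continuation bytes and truncated sequences, but
// does not reject overlong forms or encoded surrogates.
std::wstring utf8_to_wstring(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical helper: decodes the subject, then matches the whole
// string against a wide-character pattern.
bool matches_wide(const std::string& utf8, const std::wstring& pattern) {
    return std::regex_match(utf8_to_wstring(utf8), std::wregex(pattern));
}
```

After decoding, each character is a single wchar_t, so a pattern such as L"h.llo" can match the whole of "h\xC3\xA9llo" with the dot covering the accented letter. Real code would more likely let ICU do the conversion, as suggested above, since it handles validation and the 16-bit wchar_t case.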

lothar