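The first step is writing a functor that tells whether a given wchar_t is Hindi, or more precisely whether it falls in the Unicode Devanagari block (U+0900 to U+097F). This will be (derived from) a std::unary_function<wchar_t, bool>; note that this base class is deprecated since C++11 and removed in C++17, so with a newer compiler a plain struct with operator() does the same job. The implementation is a one-liner: return c >= 0x0900 && c < 0x0980; The second step is using it: std::find_if(begin, end, is_hindi()).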
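A minimal sketch of those two steps, assuming a plain struct rather than one derived from std::unary_function (so that it also compiles as C++17). The helper first_hindi and the function demo_first_hindi are hypothetical names introduced only for illustration; is_hindi is the functor named above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Predicate: true if c lies in the Devanagari block, U+0900..U+097F.
struct is_hindi {
    bool operator()(wchar_t c) const {
        return c >= 0x0900 && c < 0x0980;
    }
};

// Hypothetical helper: index of the first Devanagari character in s,
// or std::wstring::npos if there is none.
std::wstring::size_type first_hindi(const std::wstring& s) {
    std::wstring::const_iterator it =
        std::find_if(s.begin(), s.end(), is_hindi());
    return it == s.end()
        ? std::wstring::npos
        : static_cast<std::wstring::size_type>(it - s.begin());
}

// Usage sketch: call this from main() to exercise the helper.
void demo_first_hindi() {
    assert(first_hindi(L"abc") == std::wstring::npos);
    assert(first_hindi(L"ab\u0915") == 2);  // U+0915 DEVANAGARI LETTER KA
}
```

The range test only identifies the script block, not the language: other languages written in Devanagari (Marathi, Nepali, Sanskrit) match as well.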
Since you'll need Unicode, you should probably use wchar_t and therefore std::wstring. Neither std::string nor GLib::ustring supports Unicode proper. On some systems (Windows in particular) wchar_t is restricted to 16 bits, so a single wchar_t can only hold characters from the Basic Multilingual Plane, but that should still be enough for 99.9% of the world's population.
You'll need to convert from/to UTF-8 on I/O, but the advantage of "one character = one wchar_t" is big. For instance, std::wstring::substr() will work reasonably. You might still have issues with "characters" like U+094B (DEVANAGARI VOWEL SIGN O), though. When iterating over a std::wstring, that will appear to be a character by itself, instead of a modifier attached to the preceding consonant. That's still better than std::string with UTF-8, where you'd end up iterating over the three individual bytes of U+094B (0xE0 0xA5 0x8B). And to take just your original examples, none of the bytes in UTF8(U+094B) are reserved for Hindi: the same byte values also occur in the encodings of unrelated characters.
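To make the byte-level point concrete, here is a rough sketch that encodes a code point as UTF-8 by hand. The function utf8_encode_bmp and the demo function are hypothetical helpers written for this illustration, not part of any library; the encoder assumes the input is at most U+FFFF and does not reject surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point in U+0000..U+FFFF as UTF-8.
// Sketch only: surrogate code points are not rejected.
std::string utf8_encode_bmp(unsigned long cp) {
    assert(cp <= 0xFFFF);
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage sketch: KA followed by VOWEL SIGN O is one visible syllable,
// two elements in a std::wstring, and six bytes in UTF-8.
void demo_ko() {
    const std::wstring ko = L"\u0915\u094B";
    assert(ko.size() == 2);
    const std::string bytes = utf8_encode_bmp(0x0915) + utf8_encode_bmp(0x094B);
    assert(bytes.size() == 6);
}
```

Every code point from U+0800 to U+0FFF starts with the lead byte 0xE0, and the continuation bytes 0x80 to 0xBF appear in every multi-byte sequence, which is why a per-byte test on a std::string cannot identify Devanagari.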