ansaurus

Question

Locale-aware Perl regular expressions (matching word boundaries)

Answer 1

+5 A:

I would expect the regex to result in "Täs*******", and that is what I get when I "use utf8" in a UTF-8 encoded file with your code above.

(If everything is Latin-1, the regex engine behaves differently, which is why utf8::upgrade exists. See Unicode::Semantics.)
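To make the Latin-1 point concrete, here is a minimal sketch. It is not the code from the question; the sample word and variable names are made up for illustration, and it assumes a perl with no "use locale", no unicode_strings feature, and no /u flag in effect.

```perl
use strict;
use warnings;

# Hypothetical sample: "Täst" written with a Latin-1 escape, so the
# encoding of this source file does not matter.
my $word = "T\xE4st";

# Stored as a native byte string, so \w does not match \xE4 (ä) here.
my $before_upgrade = ($word =~ /^\w+$/) ? 1 : 0;

# Switch the internal representation to UTF-8; the characters are unchanged.
utf8::upgrade($word);

# The same pattern is now matched with Unicode rules.
my $after_upgrade = ($word =~ /^\w+$/) ? 1 : 0;

print "before upgrade: $before_upgrade, after upgrade: $after_upgrade\n";
```

Under those assumptions, the first match should fail and the second should succeed, even though the string holds the same characters both times.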

Edit: I see you fixed your post and that we agree on the expected result. Basically, use Unicode::Semantics when you want Unicode semantics on your regexps.

jrockway 2009-10-12 07:44:29

That's weird... When run standalone, the code does work. It turns out that "use locale" broke things. Once I removed it, everything went back to normal.

tg 2009-10-12 08:06:38

Yeah, "use locale" should be avoided. "use utf8" is for when you have UTF-8 literals in UTF-8 encoded source code. Otherwise, handle encoding with Encode, and use Unicode::Semantics when warranted.

jrockway 2009-10-12 08:08:00
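As a rough sketch of the "handle encoding with Encode" advice in the comment above: decode incoming bytes into a character string once, then match against that. The byte string below is a made-up stand-in for data read from a file or socket, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 bytes for "Täst", as they might arrive
# from a file or socket.
my $bytes = "T\xC3\xA4st";

# Turn the bytes into a Perl character string.
my $text = decode('UTF-8', $bytes);

my $byte_len = length $bytes;   # counts bytes
my $char_len = length $text;    # counts characters

# The decoded string is matched with Unicode rules, so ä counts as \w.
my $is_word = ($text =~ /^\w+$/) ? 1 : 0;

print "bytes: $byte_len, chars: $char_len, all word chars: $is_word\n";
```

None of this depends on the environment's locale, which is the point of preferring explicit decoding over "use locale".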

Is "use locale" a bad idea in all circumstances? Is it/should it be deprecated?

Ether 2009-10-12 16:41:38

Depends on whether or not you want the behavior of your program to depend on the environment and random data in /usr/share/i18n/locales. If you need something to be locale dependent, why not just call the appropriate function directly?

jrockway 2009-10-13 06:35:52
