ansaurus

Question

Will [a-z] ever match accented characters in PREG/PCRE?

Answer 1

+7 A:

Long story short: Maybe. It depends on the system the app is deployed to and on how PHP was compiled. Welcome to the CF of localization and internationalization.

The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by a-z. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.

However, because PHP blindly handles strings as collections of bytes rather than collections of Unicode code points, you end up in a situation where a-z MIGHT match an accented character. Given the variety of systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
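To make the bytes versus code points distinction concrete, here is a minimal sketch in Python rather than PHP. Python's `re` module is not PCRE and this does not reproduce any locale-dependent behaviour, so treat it only as an illustration of two things: how the same word looks as code points and as bytes, and how an explicit character list removes the guesswork. The names `ASCII_ONLY`, `EXPLICIT`, `BYTES_ASCII`, `matches_whole` and the particular accented letters in the whitelist are assumptions made up for this example, not anything taken from Drupal.

```python
import re

N_TILDE = "\u00f1"            # the letter n-tilde as a single code point
WORD = "se" + N_TILDE + "or"  # the Spanish word "senor" with n-tilde

# Plain range: in Python's re, a-z covers code points 97..122 only.
ASCII_ONLY = re.compile("[a-z]+")

# Hypothetical explicit whitelist: a-z plus n-tilde, a/o/u-umlaut, sharp s.
EXPLICIT = re.compile("[a-z\u00f1\u00e4\u00f6\u00fc\u00df]+")

# The same plain range, applied to raw bytes instead of text.
BYTES_ASCII = re.compile(rb"[a-z]+")


def matches_whole(pattern, text):
    """Return True if every character of text is allowed by pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Code point view: n-tilde is U+00F1, outside the a-z range.
    print(ord(N_TILDE))
    print(matches_whole(ASCII_ONLY, WORD))
    print(matches_whole(EXPLICIT, WORD))

    # Byte view: UTF-8 turns n-tilde into two bytes, Latin-1 into one.
    utf8 = WORD.encode("utf-8")
    latin1 = WORD.encode("latin-1")
    print(len(WORD), len(utf8), len(latin1))
    print(BYTES_ASCII.fullmatch(utf8) is not None)
    print(BYTES_ASCII.fullmatch(latin1) is not None)
```

The point of the explicit pattern is that its result does not depend on how the range is interpreted: whatever a-z ends up meaning on a given system, the extra letters are allowed because they are named outright.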

I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.

Alan Storm 2009-12-18 21:02:15

That would be "CF" as in "Complete Frustration," right? :)

Garrett Albright 2009-12-18 22:32:20

I think the F is for a completely different word, followed by the word "up", but that's just my opinion.

Lasse V. Karlsen 2009-12-18 22:52:53

I think the C stands for "Cluster" :)

Ross Duggan 2009-12-19 11:30:20
