views:

338

answers:

1

I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?

I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):

// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

Is this true, or did someone simply get [a-z] confused with \w?

+7  A: 

Long story short: Maybe, depends on the system the app is deployed to, depends how PHP was compiled, welcome to the CF of localization and internationalization.

The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish based locale, ñ would be caught by a-z). The semantic meaning of a-z is "all the letters between a and z, and ñ is a separate letter in Spanish.

However, the way PHP blindly handles strings as collections of bytes rather than a collection of UTF code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.

I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.

Alan Storm
That would be "CF" as in "Complete Frustration," right? :)
Garrett Albright
I think the F is for a completely different word, followed by the word "up", but that's just my opinion.
Lasse V. Karlsen
I think the C stands for "Cluster" :)
Ross Duggan