Hi, after some confusion in the comments to another question, I thought I'd make this into a question. According to the PHP manual, a valid class name should match [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*. But apparently this is not enforced, nor does it apply to anything else:

define('π', pi());
var_dump(π);

class ␀ {
    private $␀ = TRUE;
    public function ␀()
    {
        return $this->␀;
    }
}

$␀ = new ␀;
var_dump($␀);
var_dump($␀->␀());

works fine (even though my IDE cannot show ␀). Can some erudite person clear this up for me? Can we use any Unicode? And if so, since when? Not that I would actually want to use anything but A-Za-z_ but I'm curious.

Clarification: I am not after a regex to validate class names, nor do I know whether PHP internally uses the regex the manual suggests. The thing that confused me (and apparently the other guys in the linked question) is why things like $☂ = 1 can be used in PHP at all. PHP6 was supposed to be the Unicode release, but PHP6 is on hiatus. But if there is no Unicode support, why can I do this then?

A: 

A valid class name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*.

(From php.net)

Jarsäter
I think the OP's point is that, while this is the "standard", it doesn't seem to be enforced. The question being "will this always be okay".
Ryan Kinal
@Ryan almost correct, just that I'm not so much interested in "will this always be okay" but rather in "why does this work at all".
Gordon
A: 

See Minutes PHP Developers Meeting.

stillstanding
Interesting. Any particular one? From the Unicode section? or somewhere else?
Gordon
Unicode. It will be implemented as a toggle-able setting in PHP 6. So using it is not advised, especially if you'll be deploying on a server you don't control.
stillstanding
Hasn't PHP 6 been postponed (some even call it cancelled...)?
nikic
@stillstanding @nikic Yes. Given that the document is from 2005, I am wondering whether this ever made it into the current trunk.
Gordon
+3  A: 

Your character is encoded as 0x80 0x90 0xe2 or something like that, so it matches your regexp when the input is not interpreted as Unicode (that is, when working on single bytes).

Scharron
+2  A: 

I think I got it:

<?php
$☂ = true;

$vars = get_defined_vars();
unset($vars['GLOBALS'], $vars['_GET'], $vars['_POST'], $vars['_COOKIE'], $vars['_FILES'], $vars['_ENV'], $vars['_REQUEST'], $vars['_SERVER']);
$vars = array_keys($vars);
var_dump($vars);

for ($i = 0; $i < strlen($vars[0]); ++$i) {
    echo dechex(ord($vars[0][$i])), ' ';
}

Output of the script:

array
   0 => string '☂' (length=3)

e2 98 82 

Each of these bytes, taken on its own, lies between 0x7f and 0xff. So what we actually have here isn't Unicode support but the exact opposite. PHP doesn't understand that this is one character and interprets it as several, and all of them fall within the allowed character range.
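As a rough check of this byte-wise reading, the manual's pattern can be run against the three bytes directly. This is only a sketch: it assumes the name is stored as UTF-8, and the variable names ($pattern, $name, and so on) are made up for illustration. Without the /u modifier, preg_match() compares single bytes rather than Unicode characters.

```php
<?php
// The pattern from the manual, anchored so the whole name must match.
$pattern = '/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/';

// "☂" (U+2602) written out as its three UTF-8 bytes: e2 98 82.
$name = "\xE2\x98\x82";

// No /u modifier, so PCRE looks at one byte at a time.
$matchesAsBytes = preg_match($pattern, $name);

// For comparison: a name that starts with a digit is still rejected.
$digitFirst = preg_match($pattern, '1abc');

var_dump($matchesAsBytes, $digitFirst);
```

The first result should be 1 (match) and the second 0 (no match), which is consistent with the umbrella being accepted as three "allowed" bytes rather than as one character.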

nikic
+6  A: 

This question starts out by mentioning class names in the title, but then goes on to an example that includes exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case-insensitive ones.

Case-insensitive identifiers (class and function/method names)

The general guideline here would be to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version, and this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:

<?php
function func_á() { echo "worked"; }
func_Á();

Will this script work? Maybe. It depends on what tolower(193) will return, which is locale-dependent:

$ LANG=en_US.iso88591 php a.php
worked
$ LANG=en_US.utf8 php a.php

Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3

Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may give trouble in some locales. See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.

In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.

Case-sensitive identifiers (variables, constants, fields)

The problem is less serious here, since these identifiers are case sensitive. However, they are just interpreted as bytestreams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.

In fact, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc2 to 0xf4 (up to 0xfd in the older six-byte definition) and trail bytes in the range 0x80 to 0xbf, all of which are in the allowed range per the manual. Now let's say we use the character "Ġ" in a UTF-16BE encoded file. It will be encoded as 0x01 0x20, so the first byte is not a valid identifier character and the second byte will be interpreted as a space.
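To make the difference concrete, here is a small sketch that checks each byte of "Ġ" against the ranges from the manual. The helper all_bytes_allowed() is hypothetical (it is not a PHP built-in), and the bytes are written out as escapes so the result does not depend on how this file itself is saved.

```php
<?php
// Hypothetical helper: true if every byte of $name is in the ranges the
// manual allows inside an identifier (a-z, A-Z, 0-9, _, 0x7f-0xff).
function all_bytes_allowed($name)
{
    return preg_match('/^[a-zA-Z0-9_\x7f-\xff]+$/', $name) === 1;
}

$utf8    = "\xC4\xA0"; // "Ġ" (U+0120) encoded as UTF-8
$utf16be = "\x01\x20"; // "Ġ" (U+0120) encoded as UTF-16BE; 0x20 is a space

var_dump(all_bytes_allowed($utf8));    // both bytes are above 0x7f
var_dump(all_bytes_allowed($utf16be)); // 0x01 and 0x20 fall outside the ranges
```

The UTF-8 form passes the byte-wise check and the UTF-16BE form does not, which is why only UTF-8 (or a single-byte encoding) is usable for such names.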

Having multi-byte characters read as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte". This allows you to declare the encoding of the script:

<?php
declare(encoding='ISO-8859-1');
// code here
?>

It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:

  • Performance hit, both memory and CPU. It stores a representation of the script in an internal multi-byte encoding, which takes more space (it also seems to keep the original version in memory), and it spends some CPU converting the encoding.
  • Multi-byte support is usually not compiled in, so it's less tested (more bugs).
  • Portability issues between installations that have the support compiled in and those that don't.
  • Refers only to the parsing stage; does not solve the problem outlined for case-insensitive identifiers.

Finally, there is the problem of lack of normalization: the same character may be represented by different sequences of Unicode code points (independently of the encoding), for example a precomposed "é" versus "e" followed by a combining acute accent. This may lead to some very difficult to track bugs.
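A minimal sketch of the normalization problem, assuming UTF-8 and using variable variables so that the exact bytes of each name are visible. The names $precomposed and $decomposed are only illustrative.

```php
<?php
// Two byte sequences that both render as "é":
$precomposed = "\xC3\xA9";  // U+00E9, a single code point
$decomposed  = "e\xCC\x81"; // U+0065 followed by U+0301 (combining acute)

// Variable variables let us use these byte strings as variable names.
${$precomposed} = 'first';
${$decomposed}  = 'second';

// Names are compared byte by byte, so these are two separate variables.
var_dump($precomposed === $decomposed);
var_dump(${$precomposed}, ${$decomposed});
```

The two names look identical in most editors, yet the second assignment does not overwrite the first, since PHP does not normalize identifiers before comparing them.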

Artefacto
Thanks a lot for taking the time to write this thorough explanation. It all makes sense now.
Gordon
"Locale-independent lowercasing": how about İ=>i and I=>ı in Turkish?
nerkn
@nerkn That's the part where I say "However, even ASCII characters may give trouble in some locales" and link to a thread where precisely that is discussed.
Artefacto