I run a forum designed to support an international mathematics group. I recently switched it to Unicode for better support of international characters. While debugging this conversion, I discovered that not all Unicode characters are valid in XHTML (the relevant document appears to be http://www.w3.org/TR/unicode-xml/). One of the steps that the forum software goes through before presenting posts to the browser is an XHTML validation/sanitisation step. It seems reasonable that at that stage it should remove any Unicode characters that XHTML doesn't allow.

So my question is:

Is there a standard (or best) way of doing this in PHP?

(The forum is written in PHP, by the way.)

I guess that the failsafe would be a simple str_replace. (If that's also the best way, do I need to do anything extra to make sure it works properly with Unicode?) But that would mean going carefully through the XHTML DTD (or the above-referenced W3 page) to figure out which characters to list in the search part of str_replace. So if this is the best way, has someone already done that so that I can steal, err, copy, it?

(Incidentally, the character that caused the problem was U+000C, the 'formfeed', which (according to the W3 page) is valid HTML but invalid XHTML!)

+1  A: 

Assuming your input is UTF-8, you can remove Unicode ranges with something like

 preg_replace('~[\x{17A3}-\x{17D3}]~u', '', $input);

Another, and better, approach is to remove everything by default and only whitelist the characters you want to see. Unicode properties (\p) are quite practical for this. For example, this removes everything except (Unicode) letters and numbers:

  preg_replace('~[^\p{L}\p{N}]~u', '', $input);
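
The same whitelist idea can be pointed at XML itself rather than at letters and numbers. The following is a minimal sketch, assuming UTF-8 input and taking the character ranges from the XML 1.0 `Char` production; the function name `stripNonXmlChars` is made up for illustration and is not part of PHP or the forum software.

```php
<?php
// Hypothetical helper: remove every code point that falls outside the
// XML 1.0 "Char" ranges (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD,
// U+10000-U+10FFFF).
function stripNonXmlChars(string $input): string
{
    // Negated character class; the "u" modifier makes PCRE treat the
    // subject and the pattern as UTF-8, so ranges are in code points.
    $pattern = '~[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]~u';

    $result = preg_replace($pattern, '', $input);

    // preg_replace() returns null on failure, e.g. when $input is not
    // valid UTF-8; this sketch falls back to an empty string in that case.
    return $result ?? '';
}
```

Whether dropping the character or substituting a space is preferable depends on the forum; this sketch drops it.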
stereofrog
My problem with either of these approaches is that I have to go through the DTD to extract the whitelist or blacklist to match against. I was kinda hoping that someone had already done that for me! I don't suppose that there's a '\p{XHTML}' for all those characters that are valid XHTML, is there? (I'm a mathematician and we're fundamentally a lazy bunch - if someone else has already solved the problem then we don't want to bother doing it again!)
Andrew Stacey
I'm not aware of such a solution either, but if you're looking for a quick and easy way, you can simply convert everything except letters, numbers, and punctuation to numeric entities.
stereofrog
Converting "everything-except" to entities doesn't work. If I send a character outside the valid set, even when encoded as an entity, the browser will complain. (I should perhaps make clear that I'm serving XHTML+MathML so it *has* to be 100% valid - I can't rely on the browser to ignore an invalid entity.)
Andrew Stacey
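The point about entities can be checked with PHP's DOM extension. This is a minimal sketch, assuming the DOM extension (built on libxml2) is installed; both markup fragments are made up for illustration.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical fragments: a form feed written as a numeric character
// reference, and a tab written the same way for comparison.
$withFormFeed = '<p>a&#12;b</p>';
$withTab      = '<p>a&#9;b</p>';

$doc = new DOMDocument();
$formFeedParses = $doc->loadXML($withFormFeed);
$tabParses      = $doc->loadXML($withTab);

var_dump($formFeedParses, $tabParses);
```

XML 1.0 requires a character reference to resolve to a legal character, so a conforming parser should reject the first fragment and accept the second. Encoding a forbidden character as an entity therefore does not make it acceptable.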
+2  A: 

I found a function that might do what you want on phpedit.net.

I'll post the function for the archive, credits to ltp on PHPEdit.net:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    // Test for the empty string explicitly: empty() would also discard the string "0".
    if ($value === "")
    {
        return $ret;
    }

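    // strlen() and ord() work on bytes, not code points, so $current is always 0-255:
    // only the control-character checks below can fail, and the bytes of multibyte
    // UTF-8 sequences (0x80-0xFF) pass through unchanged.
    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        // Square brackets: the $value{$i} offset syntax was removed in PHP 8.
        $current = ord($value[$i]);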
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
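
The loop above reads the string one byte at a time, so for UTF-8 input only the checks below 0x20 ever reject anything. A variant that walks code points, so that the higher ranges are actually tested, might look like the following sketch. It assumes the mbstring extension, PHP 7.4 or later, and UTF-8 input; the name `stripInvalidXmlChars` is made up for illustration.

```php
<?php
// Hypothetical code-point-aware variant of the function above.
function stripInvalidXmlChars(string $value): string
{
    $ret = '';

    // mb_str_split() yields one element per character (code point)
    // rather than one per byte.
    foreach (mb_str_split($value, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');

        // Same ranges as the byte-wise version: the XML 1.0 Char production.
        $valid = $cp === 0x9 || $cp === 0xA || $cp === 0xD
            || ($cp >= 0x20 && $cp <= 0xD7FF)
            || ($cp >= 0xE000 && $cp <= 0xFFFD)
            || ($cp >= 0x10000 && $cp <= 0x10FFFF);

        // Keep valid characters as they are; replace the rest with a space,
        // as the original function does.
        $ret .= $valid ? $char : ' ';
    }

    return $ret;
}
```

Like the original, this substitutes a space for each rejected character rather than deleting it, which keeps words from running together.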
Bas
I would guess that this is faster than the preg_replace method (especially given the comment about speed at http://php.net/manual/en/regexp.reference.unicode.php), but it suffers from the same drawback: I have to figure out my own whitelist! (See comment above about being lazy!)
Andrew Stacey
You don't have to figure out your own white-list. Characters are allowed based on their character code, and they are replaced with a space when they fall outside the ranges specified by the function. I'm pretty sure this is all you will need; the white-list is already in the function.
Bas
Certainly there is *one* whitelist in that function, but how do I know that it is the correct whitelist? For example, 0xC is allowed in HTML but not in XHTML. If I'm working from a whitelist, it ought to be generated somehow from the DTD.
Andrew Stacey
0xC is filtered in this function as are all other characters that are not allowed in XML documents. Why would you need to generate a white-list from the DTD? Just retrieve the posts from the DB, put them through this function and output them as XHTML.
Bas
I'd need to generate a white-list from the DTD because different DTDs allow different lists of entities. 0xC _is_ allowed in HTML.
Andrew Stacey
Fine, but the question was how to output only valid XHTML characters, which is what this function does. Just use the valid XML 1.0 characters, which are also detailed here: http://en.wikipedia.org/wiki/XML#Details_on_valid_characters. When are you going to need the different DTDs anyway?
Bas
The whole XML/XHTML and charset is new to me so I apologise if my question was not sufficiently specific. The phpedit page does not mention XHTML, only XML. The wikipedia page you link to similarly is about XML rather than XHTML. There are differences, as this page points out: http://www.w3.org/International/questions/qa-controls . Having been caught out once, I'm wary of being caught out with the wrong list again. I want to allow different DTDs because this particular bit of code goes in an extension to the software and others might prefer to serve a different DTD.
Andrew Stacey
The document you linked to details the difference in control codes. You are not going to need those in a forum post, and they are all stripped by the function (except tab, newline, and carriage return). I know the pages talk about XML, but an XHTML document is an XML document. The only thing that could be an issue is if HTML allows fewer characters than XML (I don't know), in which case you could be left with characters that are not valid in an HTML document. I would, however, be very surprised if that were the case. I'd run a test with this function and see what happens. Good luck.
Bas
I've now run the test and found that you were completely correct: the ranges given in this function were correct for what I was trying to do. I hope that my confusion hasn't wasted too much of your time and thank you for your patience and perseverance!
Andrew Stacey
Thanks for the feedback. Glad everything worked out.
Bas