ansaurus

Question

How do I best remove the Unicode characters that XHTML regards as invalid, using PHP?

Answer 1

+1 A:

Assuming your input is UTF-8, you can remove Unicode ranges with something like

 preg_replace('~[\x{17A3}-\x{17D3}]~u', '', $input);

Another, and better, approach is to remove everything by default and whitelist only the characters you want to keep. Unicode properties (\p) are quite practical for this. For example, this removes everything except (Unicode) letters and numbers:

  preg_replace('~[^\p{L}\p{N}]~u', '', $input)
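
As a slightly fuller sketch of the same whitelist idea: the helper name keepLettersNumbersPunctuation and the decision to also keep punctuation, spaces, tabs and newlines are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: keep only letters, numbers, punctuation and basic whitespace.
function keepLettersNumbersPunctuation(string $input): string
{
    // \p{L} letters, \p{N} numbers, \p{P} punctuation, \p{Zs} space separators.
    // The u modifier makes PCRE treat both pattern and subject as UTF-8.
    $result = preg_replace('~[^\p{L}\p{N}\p{P}\p{Zs}\n\t]~u', '', $input);

    // preg_replace() returns null on failure, e.g. when $input is not valid UTF-8.
    return $result ?? '';
}

// The form feed (U+000C) and the euro sign (a currency symbol, \p{Sc}) are removed.
echo keepLettersNumbersPunctuation("Caf\u{00E9} 42!\u{000C}\u{20AC}"), "\n";
```

Note that a whitelist like this is much stricter than what XML actually requires: it also drops symbols such as currency signs and mathematical operators.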

stereofrog 2010-04-13 09:06:48

My problem with either of these approaches is that I have to go through the DTD to extract the whitelist or blacklist to match against. I was kinda hoping that someone had already done that for me! I don't suppose that there's a '\p{XHTML}' for all those characters that are valid XHTML, is there? (I'm a mathematician and we're fundamentally a lazy bunch - if someone else has already solved the problem then we don't want to bother doing it again!)

Andrew Stacey 2010-04-13 09:34:12

I'm not aware of such a solution either, but if you're looking for a quick and easy way, you can simply convert everything except letters, numbers and punctuation to numeric entities.

stereofrog 2010-04-13 10:23:42

Converting "everything-except" to entities doesn't work. If I send a character outside the valid set, even when encoded as an entity, the browser will complain. (I should perhaps make clear that I'm serving XHTML+MathML so it *has* to be 100% valid - I can't rely on the browser to ignore an invalid entity.)
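
This can be checked directly: in XML 1.0 a numeric character reference must itself refer to an allowed character, so &#xC; is rejected just like a raw form feed. A minimal sketch, assuming the DOM extension (libxml) is installed; isWellFormedXml is a hypothetical helper name used only for this illustration.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: true if $xml parses as a well-formed XML document.
function isWellFormedXml(string $xml): bool
{
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml) !== false;
    libxml_clear_errors();
    return $ok;
}

// A reference to TAB (U+0009) is fine; a reference to form feed (U+000C) is not.
var_dump(isWellFormedXml('<p>a&#x9;b</p>'));
var_dump(isWellFormedXml('<p>a&#xC;b</p>'));
```

So the disallowed characters have to be removed or replaced; escaping them is not enough.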

Andrew Stacey 2010-04-13 10:57:03

Answer 2

+2 A:

I found a function that might do what you want on phpedit.net.

I'll post the function for the archive, credits to ltp on PHPEdit.net:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
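    $ret = "";
    // Strict comparison: empty("0") is true and would wrongly drop a literal "0".
    if ($value === null || $value === "")
    {
        return $ret;
    }

    // Note: strlen() and ord() work on bytes, so $current is always 0-255.
    // Bytes of multi-byte UTF-8 sequences (0x80-0xFF) pass through unchanged;
    // only ASCII control characters other than tab, LF and CR are replaced.
    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        // Square brackets: the $value{$i} offset syntax was removed in PHP 8.
        $current = ord($value[$i]);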
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}

Bas 2010-04-13 10:30:24

I would guess that this is faster than the preg_replace method (especially given the comment about speed at http://php.net/manual/en/regexp.reference.unicode.php), but it suffers from the same drawback: I have to figure out my own whitelist! (See comment above about being lazy!)

Andrew Stacey 2010-04-13 10:50:54

You don't have to figure out your own whitelist. Characters are allowed based on their character code, and those that fall outside the ranges specified by the function are replaced with a space. I'm pretty sure this is all you will need; the whitelist is already in the function.

Bas 2010-04-13 10:58:15

Certainly there is *one* whitelist in that function, but how do I know that it is the correct whitelist? For example, 0xC is allowed in HTML but not in XHTML. If I'm working from a whitelist, it ought to be generated somehow from the DTD.

Andrew Stacey 2010-04-13 12:18:05

0xC is filtered in this function as are all other characters that are not allowed in XML documents. Why would you need to generate a white-list from the DTD? Just retrieve the posts from the DB, put them through this function and output them as XHTML.

Bas 2010-04-13 12:55:00

I'd need to generate a white-list from the DTD because different DTDs allow different lists of entities. 0xC _is_ allowed in HTML.

Andrew Stacey 2010-04-13 13:51:44

Fine, but the question was how to output valid XHTML characters, which is what this function does. Just use the valid XML 1.0 characters, which are also detailed here: http://en.wikipedia.org/wiki/XML#Details_on_valid_characters. When are you going to need the different DTDs anyway?
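
For reference, the same XML 1.0 character ranges can also be applied per code point rather than per byte. This is only a sketch under the assumption that the input is UTF-8; stripInvalidXmlChars and its $replacement parameter are hypothetical names, not the phpedit.net function.

```php
<?php
// Hypothetical code-point-aware variant: replaces every character outside the
// XML 1.0 "Char" ranges (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD,
// #x10000-#x10FFFF) with $replacement.
function stripInvalidXmlChars(string $value, string $replacement = ' '): string
{
    // The u modifier makes PCRE match whole UTF-8 code points, not bytes.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $result = preg_replace($pattern, $replacement, $value);

    // preg_replace() returns null on failure, e.g. for malformed UTF-8 input.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $result;
}

echo stripInvalidXmlChars("a\x0Cb"), "\n"; // prints "a b"
```

Throwing on malformed UTF-8 is a design choice for this sketch; returning an empty string or converting the encoding first would be equally reasonable.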

Bas 2010-04-13 14:08:54

The whole area of XML/XHTML and character sets is new to me, so I apologise if my question was not sufficiently specific. The phpedit page does not mention XHTML, only XML. The Wikipedia page you link to is similarly about XML rather than XHTML. There are differences, as this page points out: http://www.w3.org/International/questions/qa-controls . Having been caught out once, I'm wary of being caught out with the wrong list again. I want to allow different DTDs because this particular bit of code goes in an extension to the software and others might prefer to serve a different DTD.

Andrew Stacey 2010-04-13 14:35:30

The document you linked to details the difference in control codes. You are not going to need those in a forum post, and they are all stripped by the function (except newline and tab). I know the pages talk about XML, but an XHTML document is an XML document. The only thing that could be an issue is if HTML allows fewer characters than XML (I don't know), in which case you could be left with characters that are not valid in an HTML document. I would, however, be very surprised if that were the case. I'd run a test with this function and see what happens. Good luck.

Bas 2010-04-13 16:09:46

I've now run the test and found that you were completely correct: the ranges given in this function were correct for what I was trying to do. I hope that my confusion hasn't wasted too much of your time and thank you for your patience and perseverance!

Andrew Stacey 2010-04-14 18:46:58

Thanks for the feedback. Glad everything worked out.

Bas 2010-04-15 10:39:08
