I am developing an MVC application in PHP that uses XML and XSLT to render the views. It needs to fully support UTF-8. MySQL is also correctly configured for UTF-8. My problem is the following.

I have an <input type="text"/> with a value like àáèéìíòóùú"><'@#~!¡¿?. This is processed before being added to the database: I use mysql_real_escape_string($_POST["name"]) and then run a MySQL INSERT. This adds a backslash \ before " and '.

The MySQL database has DEFAULT CHARACTER SET utf8 and COLLATE utf8_spanish_ci. The table field is a normal VARCHAR.

Then I have to print this into an XML document that will be transformed with XSLT. I can use PHP in the XML, so I echo it with <?php echo TexUtils::obtainSqlText($value_obtained_from_sql); ?>. For now the obtainSqlText() function returns the $value it is given unchanged; it is a placeholder until the final structure is decided.

One of the first things I need to do with this input is convert > and < to &gt; and &lt;, because otherwise they will cause problems with start/end tags. This will be done with <?php htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?>. This also converts & to &amp;, " to &quot; and ' to &#039;. This is a big problem: XSLT starts to fail because it doesn't recognize all HTML special characters.

There is another problem. So far I've talked about the àáèéìíòóùú"><'@#~!¡¿? input, but I will also have text from a CKEditor <textarea /> whose value will look like:

<p>
    <a href="http://stackoverflow.com/"&gt;àáèéìíòóùú"&gt;&lt;'@#~!¡¿?&lt;/a&gt;
</p>

How should I handle this? To begin with, if I want to print this second value correctly I will need to use <xsl:value-of select="value" disable-output-escaping="yes" />. Will "><' print right?

So what I am really looking for is how I should handle these values and how I should print them. Do I need to use one approach if the value comes from a VARCHAR that doesn't allow HTML and another if it comes from a TEXT (for example) that allows HTML? Will I need to use disable-output-escaping="yes" every time?

I also want to know whether doing this really protects the application against XSS attacks.

Thank you in advance!

+2  A: 

This will be done with <?php htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?>.

Fine.

This is a big problem: XSLT starts to fail because it doesn't recognize all HTML special characters.

It shouldn't fail on htmlspecialchars() output, ever. &amp; is a predefined entity in XML and &#039; is a character reference, which is always allowed. htmlspecialchars() should produce XML-compatible output, unlike the usually-a-mistake htmlentities(). What is the error you are seeing?
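A quick way to check this is to feed both kinds of output to an XML parser. The following is only a sketch, assuming the PHP DOM extension is available; the sample value is the one from the question and the `<value>` wrapper element is just a placeholder name.

```php
<?php
// Sample value taken from the question's test input.
$value = 'àáèéìíòóùú"><\'@#~!¡¿?';

// htmlspecialchars() only emits &amp; &lt; &gt; &quot; and a numeric reference for '.
$safe = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

// htmlentities() also emits named HTML entities such as &agrave;,
// which plain XML does not define.
$unsafe = htmlentities($value, ENT_QUOTES, 'UTF-8');

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$prolog = '<?xml version="1.0" encoding="UTF-8"?>';

$docSafe = new DOMDocument();
$okSafe = $docSafe->loadXML($prolog . '<value>' . $safe . '</value>');

$docUnsafe = new DOMDocument();
$okUnsafe = $docUnsafe->loadXML($prolog . '<value>' . $unsafe . '</value>');
libxml_clear_errors();

// Parsing consumes the escaping added by htmlspecialchars(),
// so the text content is the original string again.
$roundTrip = $okSafe ? $docSafe->documentElement->textContent : null;

var_dump($okSafe, $okUnsafe, $roundTrip === $value);
```

If the htmlspecialchars() version parses and the htmlentities() version does not, the failure is coming from named entities (or from something else in the document), not from htmlspecialchars() itself.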

<a href="http://stackoverflow.com/"&gt;àáèéìíòóùú"&gt;&lt;'@#~!¡¿?&lt;/a&gt;

Urgh, an HTML rich text editor produced that invalid markup? What a dodgy editor.

If you have to allow users to input arbitrary HTML, it's going to need some processing. Unless you really trust those users, you'll need a purifier (to stop them using dangerous scripting elements and XSS-ing each other), and a tidier (to remove malformed markup either due to crap rich-text-editor output or deliberate sabotage). If you intend to put the content directly into XML you will also need it to convert to XHTML output and replace HTML entity references.

A simple way to do this in PHP would be DOMDocument->loadHTML followed by a walk of the DOM tree removing all but known-good elements/attributes/URL-schemes, followed by DOMDocument->saveXML.
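A minimal sketch of that approach might look like the following. The whitelist, the allowed URL schemes, the function names `cleanNode()` and `sanitizeHtmlToXml()`, and the `<div>` wrapper are all assumptions made for illustration; a real application would probably want a maintained purifier library rather than this.

```php
<?php
// Hypothetical whitelist: element name => allowed attribute names.
const ALLOWED = ['p' => [], 'a' => ['href'], 'em' => [], 'strong' => [], 'br' => []];
// Hypothetical list of URL schemes accepted in href.
const ALLOWED_SCHEMES = ['http', 'https', 'mailto'];

function cleanNode(DOMNode $node): void
{
    // Copy the child list first: the live list changes while nodes are removed.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement) {
            $name = strtolower($child->nodeName);
            if (!isset(ALLOWED[$name])) {
                // Unknown element: drop it together with its content.
                $node->removeChild($child);
                continue;
            }
            foreach (iterator_to_array($child->attributes) as $attr) {
                $keep = in_array($attr->nodeName, ALLOWED[$name], true);
                if ($keep && $attr->nodeName === 'href') {
                    // Keep only absolute URLs with a known-good scheme.
                    $scheme = parse_url($attr->nodeValue, PHP_URL_SCHEME);
                    $keep = is_string($scheme)
                        && in_array(strtolower($scheme), ALLOWED_SCHEMES, true);
                }
                if (!$keep) {
                    $child->removeAttribute($attr->nodeName);
                }
            }
            cleanNode($child);
        } elseif (!($child instanceof DOMText)) {
            // Comments, processing instructions and the like.
            $node->removeChild($child);
        }
    }
}

function sanitizeHtmlToXml(string $html): string
{
    libxml_use_internal_errors(true); // tolerate malformed input quietly
    $doc = new DOMDocument();
    // The XML encoding hint makes libxml read the fragment as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    cleanNode($wrapper);

    // Serialize the cleaned children as XML, without the wrapper element.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

$dirty = '<p>Hi&nbsp;<a href="javascript:alert(1)" onclick="x()">link</a>'
       . '<script>alert(1)</script></p>';
echo sanitizeHtmlToXml($dirty), "\n";
```

Note that this sketch drops relative links as well, because they have no scheme; whether that is acceptable depends on the content you expect. Dropping disallowed elements together with their children is the cautious choice; unwrapping them and keeping their text would be an alternative.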

Will "><' print right?

Well, it'll print as in your example, yes. But that's equally invalid as both HTML and XML.

bobince
@bobince thank you for your answer. As you say, "><' is invalid HTML. How can I write &quot; when it is not inside an `<a />`, for example, and print " when it is inside an `<a />`? How can I manage this? Thank you!
Isern Palaus
You don't need to escape `"` to `&quot;` unless it's inside an attribute value delimited by `"`. If you have attribute values with unescaped quotes in them, it's not in the general case possible to recover what was meant: eg is `<a b="c" d="e">` an element with two attributes, or one attribute with value `c" d="e`? If this is the sort of markup you have, the component responsible for creating it needs to be fixed. For simpler errors like `<'`, a tidier/purifier should fix them.
bobince
I will talk about this with the other developer, who is ahead of me on this. About the XSLT part: if I use `htmlspecialchars()` for every string, will I ALWAYS need to use **disable-output-escaping="yes"**? I have usually used it for printing fields that may contain HTML, but assuming that **àáèéìíòóùú"><'@#~!¡¿?** can be a valid title for a notice, for example, and will need to pass through `htmlspecialchars()`, I will need to disable output escaping there too. Most inputs will need it, won't they? Thank you
Isern Palaus
Yes, normally when you're outputting text to HTML that needs one level of HTML-encoding, which can be applied by `htmlspecialchars()` or by XSLT output escaping but not both. You would use both if you wanted to put escaped HTML inside XML, eg for the `<description>` element in an RSS feed. Note that there are problems with using standard XSLT to create HTML: the `html` output method doesn't produce valid HTML output, and the `xhtml` output method doesn't produce HTML-compatible XHTML. It's a bit of a sad mess.
bobince