This is a follow up to my last question here. The answer posted there actually does not work. So here is the challenge. You are given this code (assume jQuery included):

<input type=text>
<script>
    $("input").val(**YOUR PHP / JS CODE HERE**);
</script>

Using jQuery - and not by injecting PHP output directly into the input tag - faithfully reproduce ANY text from the database in the input tag. If the database field says </script>, the field should say that too. If it has Chinese in it, double quotes, whatever, reproduce that too. Assume your PHP variable is called $text.

Here are some of my failed attempts.

1)

$("input").val("<?= htmlentities($text); ?>");

FAILURE: The HTML entities show up literally in the text field instead of being decoded.
INPUT: $text = "Déjà vu"
OUTPUT: Field contains literal D&eacute;j&agrave; vu

2)

$("input").val(<?= json_encode($text); ?>);

This was suggested as the answer in my last question, and I naively accepted it. However...
FAILURE: json_encode only works with UTF-8 characters.
INPUT: $text = "Va e de här fö frågor egentlien"
OUTPUT: Field is blank, because json_encode returns null.

3)

var temp = $("<div></div>").html("<?= htmlentities($text); ?>");
$("input").val(temp.html());

This was my most promising solution for the weird characters, except...
FAILURE: Does not encode some characters (not sure exactly which, don't care)
INPUT: $text = "</script> Déjà"
OUTPUT: Field contains &lt;/script&gt; Déjà

4) Suggested in answers

$("input").val(unescape("<?= urlencode($text); ?>"));

FAILURE: Spaces remain encoded as +'s.

$("input").val(unescape("<?= rawurlencode($text); ?>"));

Almost works. All previous input succeeds, but multibyte text, like kanji, remains encoded. decodeURIComponent also doesn't like multibyte characters.
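As a point of comparison, here is a minimal sketch of how the two decoders treat percent-encoded UTF-8. It assumes $text was UTF-8 when rawurlencode ran, which may not hold for the data above; the sample bytes E6 BC A2 are the UTF-8 encoding of 漢.

```javascript
// Percent-encoded UTF-8 bytes for a single kanji (assumed sample input).
const encoded = "%E6%BC%A2";

// unescape() maps each %XX to one Latin-1 character, so three bytes
// become three unrelated characters instead of one kanji.
const viaUnescape = unescape(encoded);

// decodeURIComponent() interprets the bytes as UTF-8.
const viaDecode = decodeURIComponent(encoded);

console.log(viaUnescape.length); // 3
console.log(viaDecode.length);   // 1
```

If the bytes are not valid UTF-8, decodeURIComponent throws instead of guessing.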

Note that for me, things like strip_tags are not an option. Everything must be allowed. People are authoring quizzes with this, and if someone wants to make a quiz that tests your knowledge of HTML, so be it. Also, unfortunately I cannot just inject the htmlentities escaped text into the value field of the input tags. These tags are generated dynamically, and I would have to totally tear down my current javascript code structure to do it that way.

I feel like I'm SOL here. Please show me how wrong I am.

EDIT

Assume the user initially entered </script> Déjà här fö frågor 漢字 into the db. This would be stored (you would see it in phpMyAdmin) as </script> Déjà här fö frågor &#28450;&#23383;

A: 

You may want to use urlencode() and urldecode().

A: 

You can use:

Artefacto
+1  A: 

You need to encode in PHP, and decode in JavaScript...

PHP's rawurlencode():

echo rawurlencode("</script> Déjà");
//result: %3C%2Fscript%3E%20D%C3%A9j%C3%A0

JavaScript's decodeURIComponent():

var encoded = "%3C%2Fscript%3E%20D%C3%A9j%C3%A0";
alert(decodeURIComponent(encoded));
//result: </script> Déjà
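A small sketch of one way this can go wrong, assuming the input bytes are not UTF-8: decodeURIComponent expects percent-encoded UTF-8, so a lone ISO-8859-1 byte such as %E9 is rejected with a URIError. The helper name tryDecode is hypothetical.

```javascript
// Hypothetical helper: decode a percent-encoded string and report
// failure as a value instead of letting the exception escape.
function tryDecode(encoded) {
  try {
    return { ok: true, value: decodeURIComponent(encoded) };
  } catch (err) {
    return { ok: false, value: err.name };
  }
}

// "é" in UTF-8 is the two bytes C3 A9, which decodes cleanly.
const fromUtf8 = tryDecode("D%C3%A9j%C3%A0");

// "é" in ISO-8859-1 is the single byte E9, which is not valid UTF-8 here.
const fromLatin1 = tryDecode("D%E9j%E0");

console.log(fromUtf8.ok, fromUtf8.value);
console.log(fromLatin1.ok, fromLatin1.value);
```

If the second case matches what the browser reports, converting the text to UTF-8 in PHP before encoding is the likely fix.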
Dolph
You could just use rawurlencode instead of urlencode and then you wouldn't have to replace the plus signs manually.
Artefacto
Nice, I tested that and it seemed to work. Simplified my answer!
Dolph
Sorry, fails! With some input (including multibyte chars) javascript complains of malformed uri component.
Tesserex
@Tess What about converting the text to UTF-8 before encoding? You said "accurately". This gives an accurate representation of the bytestream.
Artefacto
If you want more/better answers to this question, you're going to have to provide unit tests.
Dolph
+1  A: 

What encoding is your text in, if not UTF-8? If you don't know, you don't have text, you have a byte sequence, which is much harder to faithfully represent. If you do know, you can do something like this using the PHP multibyte string extension:

$("input").val(<?= json_encode(mb_convert_encoding($text, "UTF-8", "ISO-8859-1")); ?>);

Here I've presumed your input is in ISO-8859-1 aka Latin-1 encoding, which is a pretty common case for database strings.

EDIT: This is in response to the comments about a closing script tag. I made this test file and it displays properly for me, at least in Firefox 3.6:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
    <title>Test</title>
    <script src='http://code.jquery.com/jquery-1.4.2.js'></script>
</head>
<form name='foo'>
    <input name='bar' id='bar'/>
</form>
<script language="JavaScript">
    $('input').val("<\/script>");
</script>
</html>
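To illustrate why "<\/script>" is safe inside an inline script: the HTML parser only ends the script element at a literal </script, while JavaScript reads \/ as a plain slash. The sketch below builds such a literal and checks that it evaluates back to the original text. The helper name toInlineScriptLiteral is hypothetical; json_encode does the equivalent slash escaping on the PHP side by default.

```javascript
// Hypothetical helper: produce a JS string literal that never contains "</",
// so it cannot close the surrounding <script> element early.
function toInlineScriptLiteral(value) {
  return JSON.stringify(value).replace(/<\//g, "<\\/");
}

const original = '</script> "quoted" D\u00e9j\u00e0 \u6f22\u5b57';
const literal = toInlineScriptLiteral(original);

// "\/" is also a valid JSON escape, so parsing the literal
// recovers exactly the original text.
const roundTripped = JSON.parse(literal);

console.log(literal);
console.log(roundTripped === original);
```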
Walter Mundt
this is the right answer, you need to normalise to UTF8 first and it'll work fine
nathan
I've been trying this and it still fails for closing script tags. They get turned into `<\/script>`, but that still breaks the js. Also the multibyte characters still don't get converted back. Can you explain "normalize to UTF-8"?
Tesserex
If I combine this answer with my #3, it doesn't break, but the script tags don't show up at all.
Tesserex
Does it work if you just put that in the code directly? This works for me in Firefox in a test HTML file with jQuery 1.4.2: `$('input').val("<\/script>");` -- does that code not fill the box for you?
Walter Mundt
If I use this as is, multibyte characters are not shown correctly. Try it with the full input string I gave at the end of my edit.
Tesserex
A: 

I have found a "good enough" solution that you all might find interesting.

1. utf8_encode the string on the way into the database. This makes sure that it can be safely handled on the way out by the following steps.

2. On the way out, pass the string through esc():

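function repl($match)
{
    // JavaScript's \uXXXX escape needs exactly four hex digits, so pad with zeros.
    // Code points above 0xFFFF would need a surrogate pair and are not handled here.
    return sprintf('\\u%04x', $match[1]);
}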

function esc($string)
{
    $s = json_encode($string);
    $s = preg_replace_callback("/&#([0-9]+);/", "repl", $s);
    return $s;
}

This isn't absolutely perfect, because there doesn't seem to be any way for PHP to know the difference between the user typing 漢 or literally typing &#28450;. So if you type the latter it will become the former. But I doubt anyone will ever want to do that anyway.
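An alternative sketch, assuming the decoding is done in the browser rather than in PHP: replace decimal numeric entities with their characters in JavaScript. The name decodeNumericEntities is hypothetical. String.fromCodePoint also covers code points above U+FFFF, which a single four-digit \u escape cannot express. It has the same ambiguity as above: a literally typed &#28450; gets converted too.

```javascript
// Hypothetical helper: turn decimal entities such as &#28450; into characters.
function decodeNumericEntities(text) {
  return text.replace(/&#([0-9]+);/g, (match, digits) => {
    const codePoint = Number(digits);
    // Leave out-of-range values untouched rather than throwing a RangeError.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

// Sample value as it might come out of the database (assumed).
const stored = "</script> D\u00e9j\u00e0 &#28450;&#23383;";
const decoded = decodeNumericEntities(stored);

console.log(decoded);
```

The decoded string can then be passed to $("input").val() as usual.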

Tesserex