I have a form which accepts text input. I would like it to be able to accept characters such as & and ; and > and <, which are useful characters for the data being supplied by the user. I want the user to, for example, be able to say

The ampersand (&) is encoded as &amp;

(I see from the preview that I can't even type that here directly: to make it display correctly I had to type amp;amp; after the ampersand. By the way, the preview is cool, but I can't count on users having scripts enabled.)

I parse the data, and if there is a problem with it, I present the user's entry back to the user, in the same form, prefilled in the same field, for editing and resubmission.

If I present the raw data, I run the risk of having hostile input (such as scripts or HTML) executed by the browser. However, if I filter it (such as via htmlspecialchars()), then the user would see (a representation of) the character he had typed (say, the ampersand), but when he re-submits, he will =actually= be submitting the replacement (in this case what looks like &amp;), which as it turns out itself contains an ampersand. If there is still a problem with the input, it will be presented again for editing, and we'll be another level deep in replacements.

User data is accepted only when what the user actually submits is identical to the sanitized version of the data. It is destined for a text file on the server, and an Email sent to the organization behind the website.

I suppose the "question that can be answered" is "is this even possible?"

Jose

edit:

<?php
$var=$_GET["test2"];
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">

<html>
<head>
<meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">

<title>Input Escape Test</title>
</head><body>
The php parser would store the following input:<br>
<?php echo $var ?>
<br>

<form method="get" action="test.php"><p>
  <label for="test2">Test - question five: <br>type in a character on the first line<br>and its HTML entity on the second line.</label>
  <textarea id="test2" name="test2" cols="50" rows="3"><?php echo  $var; ?></textarea><br/>
  <input type="submit"/>
</p></form>
</body></html>

results in a form where the user attempts to answer the question with ampersand ampersand a m p semicolon. If that gets rejected (say, because of other illegal characters), the user is presented with his input back, minus the stripped characters. However, the a m p semicolon is also missing from view (though it's in the source). The user will then attempt to add another a m p semicolon to the displayed result.

The only way the user gets to see ampersand a m p semicolon displayed (upon rejected input), is to type in ampersand a m p semicolon a m p semicolon

Finally satisfied, the user clicks submit again, and the a m p semicolon seemingly disappears again. The user doesn't know what his (submitted) answer will be stored as.

I want the user to be able to type in: ampersand a m p semicolon and, upon rejection, see ampersand a m p semicolon and upon acceptance, store ampersand a m p semicolon

Jose

A: 

Yes, this is possible in JavaScript as well as in server-side code. Since you said you can't count on users having JavaScript enabled, I assume you want to do this kind of conversion on the server side? You just let the user send the form data via a POST request to your server-side code, and there you transform every occurrence of <, >, &, " and ' into its respective entity form when you write the data back to the HTML response page. This will then show up in the browser exactly as it was entered by the user.

Edit: Sorry, I didn't read your question carefully enough. You should be able to use just one level of escaping, i.e. to write &amp; for a '&' and not &amp;amp;. This one level is stripped when the browser parses your page, so it is gone from the data by the time it gets sent back as form data. Have a look at the generated HTML code and try to find out what makes you need that second level of escapes.
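The round trip described above can be sketched in PHP alone. This is only an illustration, not the asker's code: escape_for_html() and simulate_browser_parse() are hypothetical helper names, and html_entity_decode() merely stands in for what a browser does when it parses the page. The sketch escapes once per rejected submission and checks that the text does not pick up extra levels of escaping.

```php
<?php
// Hypothetical helper: escape a value for use as pre-filled form content.
function escape_for_html(string $raw): string
{
    // ENT_QUOTES also converts ' and ", so the result is safe inside
    // attribute values delimited by either kind of quote.
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// Stand-in for the browser: parsing the page decodes one level of entities,
// so the form control holds (and later submits) the decoded text.
function simulate_browser_parse(string $html): string
{
    return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
}

$typed = 'The ampersand (&) is encoded as &amp;';

$value = $typed;
for ($round = 1; $round <= 3; $round++) {
    $markup = escape_for_html($value);          // what the server writes into the page
    $value  = simulate_browser_parse($markup);  // what the browser would submit again
}

// After three simulated rejections the text is still what the user typed.
var_dump($value === $typed);
```

The important detail is that the server escapes exactly once, at the moment it writes the value into the page, and always starts from the raw submitted text rather than from a previously escaped copy.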

Edit2 in response to the comments: Here is a simple test page that works as expected in IE 8.0 and Firefox. When you press the send button you will see what is sent to the server in the address bar of your browser (the %26 is just the URL-encoding for the &). As you can see, the &amp; is decoded to a plain & both in the displayed value and in the data that is sent to the server.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
<meta http-equiv="Content-type" content="text/html;charset=ISO-8859-1" />
<title>Input Escape Test</title>
</head><body>
<form method="get" action=""><p>
  <input name="test1" type="text" size="30" value="hello &amp; test"/><br/>
  <textarea name="test2" cols="50" rows="3">hello &amp; test</textarea><br/>
  <input type="submit"/>
</p></form>
</body></html>
x4u
> transform... when you write the data back to the response page...
Yes, but while the user will see a representation of the original character, when the user presses the submit button, they will actually =send= the already-encoded entity form. The first character will be the ampersand, which will be encoded of course, and the subsequent characters will be ordinary. So the original (ampersand) becomes ampersand a m p semicolon upon being pre-filled. The user hits submit... and my PHP script turns it into ampersand a m p semicolon a m p semicolon. No?
Jose
Jose
The browser should actually strip and forget one level of escapes from the data when it parses the html.
x4u
Remember, this is not getting posted as part of the page, it's getting presented =in= the user-entry part of the form as pre-filled data, for the user to submit again. With conversion, the user sees one thing (the ampersand) but submits another (ampersand a m p semicolon). Since a =real= ampersand is being submitted as =part= of the conversion, =it= will get converted... followed by the original a m p semicolon, and this process will iterate with each user resubmission.
Jose
It doesn't matter where in the HTML or XML page the entity escapes are used. They get stripped wherever they occur when the browser parses the page, and thus should never appear anywhere afterwards. I edited my answer to post some sample code. Make sure that only one level of escapes appears for the critical characters in your generated pages, as in my sample, not two as you described in your question (not &amp;amp;).
x4u
Ok, I see. Then the answer to my question would seem to be "it's not possible". Typed-in user input (in a form) does not get interpreted, but pre-filled input does, no matter what.
The user may type ampersand a m p semicolon and see it thus, but for the computer to generate that output, the computer must use ampersand a m p semicolon a m p semicolon, and therefore I cannot use a second submission (which is a combination of pre-filled and user-typed input) in a simple comparison without unraveling it somewhere.
Jose
Jose
A: 

When pushing data out of PHP, to the browser, to a database, anywhere, you MUST change its representation to one acceptable to the receiving end.

In the case of sending stuff to the browser, you need the htmlentities converter:

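// ENT_QUOTES makes sure single quotes are converted too, since the attribute is single-quoted.
print "<input type='text' name='inp' value='" . htmlentities($_POST['inp'], ENT_QUOTES) . "'>\n";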
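The line above wraps the value in single quotes, so whether the single quote itself gets converted matters. That depends on the flags passed to htmlentities(); the default changed in PHP 8.1, so passing ENT_QUOTES explicitly is the safer habit. A minimal sketch, assuming UTF-8 data and a hypothetical helper named prefilled_input():

```php
<?php
// Hypothetical helper that builds a pre-filled text input.
function prefilled_input(string $name, string $value): string
{
    // ENT_QUOTES converts both ' and "; ENT_SUBSTITUTE replaces invalid
    // byte sequences instead of returning an empty string.
    $flags = ENT_QUOTES | ENT_SUBSTITUTE;
    return "<input type='text' name='" . htmlentities($name, $flags, 'UTF-8')
         . "' value='" . htmlentities($value, $flags, 'UTF-8') . "'>";
}

// A value that would break out of a single-quoted attribute if ' were left alone.
$hostile = "x' onfocus='alert(1)";
echo prefilled_input('inp', $hostile), "\n";
```

With the quotes converted, the browser treats the whole string as the field's value and shows it to the user exactly as typed.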

C.

symcbean
I'm aware of htmlentities(), but had originally tried to avoid it in favor of disallowing (in a regex) the ampersand, semicolon, and a few other characters. This was partly because I can't find any documentation as to exactly what will be affected by htmlentities(). However, it's probably better than I am at figuring out what =should= be stripped.
I did find a workaround, which was to use htmlentities first and store the result. Then strip what needed stripping, and compare with the stored result. If equal, data was ok.
Am I missing anything here?
Jose
Jose
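On the question of exactly which characters are affected: PHP can report this itself through get_html_translation_table(). The sketch below prints the small set handled by htmlspecialchars() and compares its size with the much larger set handled by htmlentities(). The flags and the UTF-8 encoding are assumptions made for the example; other settings give different tables.

```php
<?php
// List every character htmlspecialchars() replaces when ENT_QUOTES is set.
$table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES, 'UTF-8');

foreach ($table as $char => $entity) {
    printf("%s  ->  %s\n", $char, $entity);
}

// The much larger table used by htmlentities(), for comparison.
$all = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
printf("htmlspecialchars: %d characters, htmlentities: %d characters\n",
       count($table), count($all));
```

For protecting a pre-filled form field, only the handful of characters in the first table matter; the rest of the htmlentities() table covers characters such as accented letters, which are harmless as markup.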
You only use htmlentities() on stuff output to the browser - when the user submits the form again, the inverse of an htmlentities operation is carried out implicitly on the variables.
Certainly your approach will identify all sorts of things which can be used for script injection and such. But it'll also catch single quotes (as in O'Reilly).
C.
symcbean
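The distinction drawn in the last comment can be sketched as follows: validate the raw submitted text against whatever rule the application really needs, and escape only when writing it into a page. Both function names and the validation rule (rejecting control characters) are hypothetical, chosen only to illustrate the split; the asker's real rule is not given in the thread.

```php
<?php
// Hypothetical validation rule: reject ASCII control characters other than
// tab, line feed and carriage return. Markup characters are allowed.
function is_acceptable(string $raw): bool
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $raw) === 0;
}

// Escape only at the moment the text is written into an HTML page.
function for_html(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$raw = "O'Reilly & Sons <books>";

// The "sanitized copy must equal the input" test rejects this harmless value...
var_dump(for_html($raw) === $raw);

// ...while validating the raw text accepts it, and escaping is deferred to output.
var_dump(is_acceptable($raw));
echo '<textarea name="test2">', for_html($raw), '</textarea>', "\n";
```

Under this split, the copy written to the text file or sent by email would be the raw, validated text, while the escaped form exists only inside the generated HTML.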