I have a PHP script that will generate <input>s dynamically, so I was wondering if I needed to filter any characters in the name attribute.

I know that the name has to start with a letter, but I don't know any other rules. I figure square brackets must be allowed, since PHP uses these to create arrays from form data. How about parentheses? Spaces?

A: 

"ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")."

http://www.w3.org/TR/html401/types.html#type-cdata
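
For reference, the quoted token rule can be written as a regular expression. This is only a sketch of the ID/NAME token grammar as quoted; whether that grammar applies to the `name` attribute of `<input>` at all is disputed in the comments below. The function name `is_name_token` is made up for illustration.

```python
import re

# Pattern transcribed from the quoted HTML 4.01 rule for ID and NAME tokens:
# a letter, then any number of letters, digits, "-", "_", ":" and ".".
NAME_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9_:.\-]*")


def is_name_token(value: str) -> bool:
    """Return True if value satisfies the quoted ID/NAME token rule."""
    return NAME_TOKEN.fullmatch(value) is not None
```

Under this rule `foo` and `a-b_c:d.e` pass, while `foo[]`, `1abc`, and names containing spaces or parentheses do not.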

Kirk Woll
Hmm interesting. So square brackets are not standards-compliant even though PHP uses them?
DLH
@DLH Which bit of PHP uses them?
Matt Gibson
It identifies an array. It's not part of the name. E.g. `foo[]` is valid, but `foo[]bar` is certainly invalid.
BalusC
@matt: they are used to create POST data arrays. often used with checkboxes
knittl
The name attribute for the <input> element type is listed as being CDATA, not ID or NAME, so the quoted rule is correct for ID and NAME tokens in SGML lingo, but it does not apply to this particular attribute (even though it's called name).
Allain Lalonde
I've seen many examples of $ included in the name attribute of HTML form inputs. For example, "ctl00$ContentMain$CustomUC297$txtPassword" on http://agents.travelsavers.com/content/publiccontent.aspx?PageID=273
@BalusC: What do you mean it's not part of the name? Does the browser not interpret it as part of the name?
DLH
@Kirk Woll: After reading Allain Lalonde's post and links, I realize that the SGML basic types specification you referenced separates the CDATA specification from the ID/NAME specification. The `name` attribute for `<input>` is clearly labeled as CDATA in the specs.
DLH
`name="foo[]bar"` is quite valid. `[]` has no special meaning in any web standard; having to put `[]` at the end of the name for a field that allows multiple submissions is merely an idiom required by PHP.
bobince
A: 

Do you mean the id and name attributes of the HTML input tag?

If so, I'd be very tempted to restrict (or convert) allowed "input" name characters into only a-z (A-Z), 0-9 and a limited range of punctuation (".", ",", etc.), if only to limit the potential for XSS exploits, etc.

Additionally, why let the user control any aspect of the input tag? (Might it not ultimately be easier from a validation perspective to keep the input tag names as 'custom_1', 'custom_2', etc. and then map these as required?)
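
A minimal sketch of that mapping idea, written in Python rather than the asker's PHP. The helper names `build_field_map` and `render_inputs` and the `<label>` markup are assumptions for illustration, not anything from the question.

```python
import html


def build_field_map(labels):
    """Map generated names (custom_1, custom_2, ...) to user-supplied labels."""
    return {f"custom_{i}": label for i, label in enumerate(labels, start=1)}


def render_inputs(field_map):
    """Render one <input> per field.

    The name attribute is always generated by us; only the visible label
    comes from the user, and it is HTML-escaped before output.
    """
    return [
        f'<label>{html.escape(label)} <input name="{name}"></label>'
        for name, label in field_map.items()
    ]


fields = build_field_map(["Full name", 'Nick "name" <optional>'])
markup = render_inputs(fields)
```

On submission the server only ever sees `custom_1`, `custom_2`, and so on, and looks the label up in `fields`, so nothing user-controlled ends up in a name attribute.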

middaparka
I may not end up having my names generated like this. I'm just in the process of trying to think through ways of allowing the less tech-savvy members in my office to specify form fields.
DLH
@DLH I'd be tempted (to remove the risk of name clashes, etc.) to just use an intermediate approach as above. :-)
middaparka
+3  A: 

The only real restriction on what characters can appear in form control names is when a form is submitted with GET:

"The "get" method restricts form data set values to ASCII characters." reference

There's a good thread on it here.

Allain Lalonde
So `name` has a different data type for `<input>` than it does for other elements? Interesting.
DLH
It's the same as `<a>` and most elements, but different to `<meta>`
Alohci
Yep. Just tried an `<input>` with all kinds of crap in the `name` attribute, and it validated in HTML 4.01 Strict. Accepted!
DLH
+1  A: 

Any character you can include in an [X]HTML file is fine to put in an <input name>. As Allain's comment says, <input name> is defined as containing CDATA, so the only things you can't put in there are the control codes and invalid codepoints that the underlying standard (SGML or XML) disallows.
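
As a rough illustration (in Python, not the asker's PHP), the only thing the generating script really has to do is escape the name for the attribute context. The sketch below also round-trips the name through form encoding using the standard library; `render_input` is a hypothetical helper.

```python
import html
from urllib.parse import parse_qs, urlencode


def render_input(name: str) -> str:
    """Emit an <input> whose name is escaped for a double-quoted attribute."""
    return f'<input name="{html.escape(name, quote=True)}">'


# Spaces, parentheses, brackets, quotes and angle brackets in one name.
name = 'a b (c) [d] "e" <f>'
markup = render_input(name)

# Encode the field as application/x-www-form-urlencoded and parse it back.
body = urlencode({name: "value"})
parsed = parse_qs(body)
```

Here `parsed` comes back as `{name: ["value"]}`: a generic form parser treats the brackets as ordinary characters in the name, and it is PHP's own parser that gives them array meaning.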

Allain quoted W3 from the HTML4 spec:

Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire ISO10646 character set.

However this isn't really true in practice.

The theory is that application/x-www-form-urlencoded data doesn't have a mechanism to specify an encoding for the form's names or values, so using non-ASCII characters in either is “not specified” as working and you should use POSTed multipart/form-data instead.

Unfortunately, in the real world, no browser specifies an encoding for fields even when it theoretically could, in the subpart headers of a multipart/form-data POST request body. (I believe Mozilla tried to implement it once, but backed out as it broke servers.)

And no browser implements the astonishingly complex and ugly RFC2231 standard that would be necessary to insert encoded non-ASCII field names into the multipart's subpart headers. In any case, the HTML spec that defines multipart/form-data doesn't directly say that RFC2231 should be used, and, again, it would break servers if you tried.

So the reality of the situation is there is no way to know what encoding is being used for the names and values in a form submission, no matter what type of form it is. What browsers will do with field names and values that contain non-ASCII characters is the same for GET and both types of POST form: they encode them using the encoding of the page containing the form. Non-ASCII GET form names are no more broken than everything else.
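
To make the ambiguity concrete, here is a small Python sketch that uses `urllib.parse.urlencode` to stand in for what a browser does on a UTF-8 page versus an ISO-8859-1 page. The field name and value are made up for illustration.

```python
from urllib.parse import urlencode

# A hypothetical field with non-ASCII characters in both name and value
# ("café" and "naïve", written with escapes to avoid source-encoding issues).
field = {"caf\u00e9": "na\u00efve"}

# Stand-in for a form on a UTF-8 page.
as_utf8 = urlencode(field, encoding="utf-8")

# Stand-in for the same form on an ISO-8859-1 page.
as_latin1 = urlencode(field, encoding="latin-1")
```

The first result is `caf%C3%A9=na%C3%AFve` and the second is `caf%E9=na%EFve`. Nothing in either string says which encoding produced it, so the server has to know the page encoding in advance.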

DLH:

So `name` has a different data type for `<input>` than it does for other elements?

Actually the only element whose name attribute is not CDATA is <meta>. See the HTML4 spec's attribute list for all the different uses of name; it's an overloaded attribute name, having many different meanings on the different elements. This is generally considered a bad thing.

However, typically these days you would avoid name except on form fields (where it's a control name) and param (where it's a plugin-specific parameter identifier). That's only two meanings to grapple with. The old-school use of name for identifying elements like <form> or <a> on the page should be avoided (use id instead).

bobince