Hello,

I'm trying to create a form validation unit that, in addition to the "regular" tests, checks encoding as well.

According to this article, http://www.w3.org/International/questions/qa-forms-utf-8, the only allowed characters in the range 0-31 are CR, LF and TAB, and DEL (127) is not allowed.

On the other hand, there are control characters in the range 0x80-0x9F (C1). Some sources say they are allowed, others say they are not. I have also seen that this differs between XHTML, HTML and XML.

Some articles say that FF (form feed) is allowed as well?

Can someone provide a good answer, with sources, on what is allowed and what isn't?

EDIT: Even http://www.w3.org/International/questions/qa-controls is somewhat ambiguous. It says:

The C1 range is supported

But the table there shows that they are illegal, while the UTF-8 validation shown earlier allows them?

A: 

If the document is known to be XHTML, then you should just load it and validate it against the schema.

John Saunders
I'm talking about character encoding, not about validating content against a schema. Think of posted form data.
Artyom
Ok, but form post data has nothing to do with HTML or XHTML. Please clarify your question and/or subject.
John Saunders
Actually it is strongly connected. Take a look at the first link I gave: you can see in the regular expression that, in the ASCII range, all C0 control characters are denied with the exception of CR, LF and TAB, and DEL (127) is denied as well.
Artyom
Ok, I see the confusion. "(X)HTML forms" is what that document says. That means "the <form> element of (X)HTML". If this is what you meant, then you should edit the subject line to say "(X)HTML Forms", as "XHTML" suggests you mean the markup.
John Saunders
+1  A: 

First of all, any octet is valid. The regular expression for UTF-8 sequences that you mention just omits some of them because users rarely enter them in practice. But that doesn’t mean that they are invalid. They are just not expected to occur.

Gumbo
A: 

What programming language do you use? At least for Java there exist libraries to check the encoding of a string (or byte-array). I guess similar libraries would exist for other languages too.

Ridcully
This is not a question of encoding. I can check that easily, but checking the encoding is not enough, because U+0000 is a valid Unicode character but is not valid in HTML forms.
Artyom
A: 

Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?

If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.

  1. You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.
  2. Then, if the encoding is correct, you can check whether it's valid form data.
  3. Then, if the form data is valid, you can check whether the data contains what you expect.
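
The three steps above can be sketched in Python roughly as follows. This is only an illustration: the function names, the length limit, and the exact control-character policy (reject C0 except TAB, LF, CR, and reject DEL) are assumptions made for the example, not something taken from a specification.

```python
from typing import Optional

# Assumed policy for this sketch: only TAB, LF and CR are accepted from C0.
ALLOWED_C0 = {"\t", "\n", "\r"}


def decode_utf8(raw: bytes) -> Optional[str]:
    """Step 1: strict UTF-8 decoding; None means the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def has_forbidden_controls(text: str) -> bool:
    """Step 2: look for control characters that the assumed policy rejects."""
    return any(
        (ord(ch) < 0x20 and ch not in ALLOWED_C0) or ord(ch) == 0x7F
        for ch in text
    )


def validate_field(raw: bytes, max_len: int = 100) -> str:
    """Run the checks in order and report the first one that fails."""
    text = decode_utf8(raw)
    if text is None:
        return "bad encoding"
    if has_forbidden_controls(text):
        return "forbidden control character"
    # Step 3: hypothetical application-level rule (non-blank, bounded length).
    if not text.strip() or len(text) > max_len:
        return "unexpected content"
    return "ok"
```

Keeping the checks separate means each failure can be reported on its own, and the control-character policy can be changed without touching the decoding step.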
Martijn
>>> Then, if the encoding is correct, you can check whether it's valid form data. <<< I can easily check whether the encoding is valid and convert it to Unicode. The point is: which VALID Unicode characters are INVALID characters in HTML (forms)?
Artyom
+1  A: 

The first link you mention does not have anything to do with validating the allowed characters in XHTML... the example on that link simply shows a common, generic pattern for detecting whether raw data is UTF-8 encoded.

This is a quote from the second link:

HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).

The way I read this is:

Any control character in the C1 range is supported if you encode it (using base64 or hex representations) or represent it as an NCR.

Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.
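
As a rough illustration of that reading, here is a small Python sketch that rejects unsupported C0 controls and writes C1 controls as NCRs. The function name `escape_controls` is made up for this example, and treating C1 as exactly U+0080..U+009F is an assumption based on the usual definition of that range.

```python
# C0 controls that the quoted text says are supported: HT, LF, CR.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def escape_controls(text: str) -> str:
    """Return text with C1 controls written as NCRs.

    Raises ValueError for any C0 control other than HT, LF and CR,
    since the quoted text says those cannot be represented.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20 and cp not in ALLOWED_C0:
            raise ValueError(f"C0 control U+{cp:04X} is not supported")
        if 0x80 <= cp <= 0x9F:
            # C1 control: emit a numeric character reference such as &#x85;
            out.append(f"&#x{cp:X};")
        else:
            out.append(ch)
    return "".join(out)
```

For example, `escape_controls("a\x85b")` gives `a&#x85;b`, while a string containing U+0000 raises `ValueError`.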

John Weldon
+5  A: 

Postel's Law: Be conservative in what you do; be liberal in what you accept from others.

If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.

Eli
I accept this answer as the one closest to what I expected, because I do not think the others are close at all. I would like it to get the bounty.
Artyom
+6  A: 

I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, that is, the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 specification for <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":

This is the default content type. Forms submitted with this content type must be encoded as follows:

  1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
  2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.

As for which character encoding the data is in (i.e. whether %A0 is a non-breaking space or an error), that's determined by the accept-charset attribute on your <FORM> element and the Content-Type header of the form submission (which is really a GET or POST request, not a response).
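
To make the %A0 example concrete, here is a minimal Python sketch using the standard library's `urllib.parse.parse_qs`. The wrapper name `decode_form` and the choice to return `None` on a decoding error are assumptions for illustration; how the charset is actually obtained from the request is left out.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qs


def decode_form(body: str, charset: str = "utf-8") -> Optional[Dict[str, List[str]]]:
    """Decode an application/x-www-form-urlencoded body.

    Returns None when the percent-decoded bytes are not valid in `charset`.
    """
    try:
        return parse_qs(
            body,
            keep_blank_values=True,
            encoding=charset,
            errors="strict",  # raise instead of silently replacing bad bytes
        )
    except UnicodeDecodeError:
        return None


# '+' becomes a space and %0D%0A becomes a CR LF pair.
print(decode_form("msg=line1%0D%0Aline2&name=J+D"))

# The same bytes are an error in UTF-8 but a no-break space in Latin-1.
print(decode_form("a=%A0", charset="utf-8"))
print(decode_form("a=%A0", charset="latin-1"))
```

Once the values are decoded to Unicode strings, any further restriction on control characters is a separate check on those strings.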

Ben Blank
A: 

The Unicode characters in these ranges are valid in HTML 4.01:

0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF    
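
The ranges above translate directly into a lookup. The following Python sketch is one way to do it; the names `HTML401_RANGES`, `is_allowed` and `first_disallowed` are invented for this example.

```python
from typing import Optional

# Code point ranges (inclusive) copied from the list above.
HTML401_RANGES = [
    (0x09, 0x0A),
    (0x0D, 0x0D),
    (0x20, 0x7E),
    (0xA0, 0xD7FF),
    (0xE000, 0x10FFFF),
]


def is_allowed(ch: str) -> bool:
    """True if the single character falls inside one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HTML401_RANGES)


def first_disallowed(text: str) -> Optional[int]:
    """Index of the first character outside the ranges, or None if all pass."""
    for i, ch in enumerate(text):
        if not is_allowed(ch):
            return i
    return None
```

With these ranges, form feed (0x0C), DEL (0x7F) and the C1 controls (0x80..0x9F) are all rejected, while the no-break space (0xA0) passes.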

In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258

Artefacto
Thanks, this is the reference I was looking for!
Artyom