ansaurus

Question

Allowed characters in submit forms (including UTF-8)

Answer 1

+5 A:

The W3C skips these characters in their example regular expression in Multilingual form encoding:

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
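
For readers who are not working in Perl, the same pattern can be sketched in Python. This is only an illustration, not part of the W3C text: the name `is_valid_utf8` is made up here, `re.VERBOSE` stands in for Perl's `/x` modifier, and the pattern has to be applied to the raw bytes before any decoding takes place.

```python
import re

# Byte-level pattern mirroring the W3C expression above. re.VERBOSE plays
# the role of Perl's /x modifier: whitespace and # comments are ignored.
VALID_UTF8 = re.compile(
    rb"""\A(?:
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if every byte of data fits the pattern."""
    return VALID_UTF8.match(data) is not None
```

Note that the byte sequences C2 80 to C2 9F (the C1 control codes) are well-formed UTF-8 and therefore pass this pattern, so a separate check is needed if those should be rejected as well.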

Gumbo 2009-08-11 22:37:24

The PHP equivalent would be:

preg_match('/\A( [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*\z/x', $string);

Am I correct?

bilygates 2009-08-12 11:10:47

@bilygates: You can leave the comments in as well. PHP’s `preg_match` uses Perl-Compatible Regular Expressions, and the `x` modifier lets you use whitespace and comments (starting with `#` and running to the end of the line) to make a regular expression easier to read.

Gumbo 2009-08-12 13:18:42

@Gumbo Ok, will do. Many thanks!

bilygates 2009-08-13 10:44:28

Answer 2

+1 A:

No.

It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
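
As a rough illustration of "store as-is, sanitize on output", the sketch below escapes the stored text only at the moment it is written into an HTML element. The function name `render_comment` and the surrounding `<p>` element are assumptions made for the example; `html.escape` is from the Python standard library.

```python
import html


def render_comment(stored_text: str) -> str:
    """Hypothetical output step: escape the untouched stored text for HTML.

    The database keeps the original string; escaping happens here, once,
    for the specific context (an HTML element body) it is written into.
    """
    return "<p>" + html.escape(stored_text, quote=True) + "</p>"
```

Because the original text is never altered, a different output context (plain-text email, JSON, and so on) can apply its own escaping to the same stored value.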

John Millikin 2009-08-11 22:39:17

I don't agree. User input should be sanitized prior to all processing including storing. What would be the advantage of not doing so?

0xA3 2009-08-11 22:45:15

If you've over- or under-sanitized input, then there's no way to recover the original data. If the unmolested data is stored, it can always be cleaned up in whichever way is needed.

John Millikin 2009-08-11 22:52:15

I agree, but on the other hand the routine storing the data might expose a vulnerability which could be exploited using malicious and unsanitized input.

0xA3 2009-08-11 23:16:01

Answer 3

+4 A:

Make sure it is valid UTF-8 and Unicode? Yes

Make sure it does not include certain characters, such as control codes? Probably not necessary

You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, including:

Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode

All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
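
One way to check for all three kinds of invalid input at once is to let a strict UTF-8 decoder do the work. The sketch below uses Python's built-in decoder, which rejects overlong forms, stray bytes and encoded surrogates; the function name `decode_form_field` is hypothetical, and returning `None` on failure is just one possible policy.

```python
from typing import Optional


def decode_form_field(raw: bytes) -> Optional[str]:
    """Hypothetical helper: decode raw form bytes, or None if not valid UTF-8."""
    try:
        # errors="strict" is the default; spelled out to make the intent clear.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

Whether to reject the submission, ask the user to resubmit, or guess at a legacy encoding when `None` comes back is an application decision that this sketch leaves open.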

If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will also need to filter out (either at input or output):

C0 control codes 0x00 to 0x1F (apart from tab, new line, carriage return)
0x7F
C1 control codes 0x80 to 0x9F
(probably) any code point above 0x10FFFF
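
A minimal sketch of such a filter, working on already-decoded text, might look like this. It assumes the standard definitions of the control ranges (C0 is U+0000 to U+001F, C1 is U+0080 to U+009F); the name `strip_control_chars` is made up for the example, and silently dropping the characters rather than rejecting the input is an assumption as well.

```python
import re

# C0 controls except tab (0x09), line feed (0x0A) and carriage return (0x0D),
# plus DEL (0x7F) and the C1 controls (0x80-0x9F).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\x80-\x9F]")


def strip_control_chars(text: str) -> str:
    """Hypothetical helper: remove control characters from decoded text."""
    return _CONTROL_CHARS.sub("", text)
```

Code points above 0x10FFFF need no entry in the pattern, since a decoded Python string cannot contain them.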

thomasrutter 2009-08-12 07:23:59

All true, and the regex posted by Gumbo will handle all of those issues.

Alan Moore 2009-08-12 07:29:50

Thank you for your reply. I guess I will use the regular expression that Gumbo suggested to validate the input. It seems to handle everything you said to filter out.

bilygates 2009-08-12 11:27:43

Yes, that regex is suitable for UTF-8 encoded text which is going to be used in XHTML or HTML, as it also filters out those control codes as above.

thomasrutter 2009-08-16 14:42:32

Answer 4

+1 A:

When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (i.e., 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCIIs" out there, and cp437 is far from the most common one.

But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
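
As an example of filtering by character properties rather than by numeric ranges, Python exposes the Unicode general category of each character through `unicodedata.category`. The helper name `describe` and the four labels it returns are invented for this sketch.

```python
import unicodedata


def describe(ch: str) -> str:
    """Hypothetical helper: classify one character by its Unicode category."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):   # Lu, Ll, Lt, Lm, Lo: letters in any script
        return "letter"
    if cat == "Nd":           # decimal digits
        return "digit"
    if cat == "Cc":           # C0 and C1 control characters, and DEL
        return "control"
    return "other"
```

A filter built on top of this would treat "é" or "日" the same as "a", which is what keeps it from discarding text that readers of another language need.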

It sounds like you need to learn more about character encodings and Unicode, though. Start here.

Alan Moore 2009-08-12 08:29:41

Yes, that is exactly the page I looked at. I didn't know characters 128 - 255 can be different. I will have a look at that article you recommended. Thanks!

bilygates 2009-08-12 11:14:09
