Hi,

Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc., so I use the UTF-8 charset in my database. The question is, should I really allow all possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and whitespace. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?

+5  A: 

The W3C skips these characters in their example regular expression in Multilingual form encoding:

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
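
For readers not using Perl, here is a rough Python port of the same pattern, as a sketch rather than an official W3C version. The names `VALID_UTF8` and `is_clean_utf8` are made up for this example, and it assumes the input is raw bytes, not an already-decoded string. One thing it makes easy to check: the two-byte branch also accepts `C2 80` to `C2 9F`, which are the C1 control characters, so those still pass.

```python
import re

# Byte-level port of the pattern above. re.VERBOSE permits the same
# whitespace layout and "#" comments that Perl's /x modifier does.
VALID_UTF8 = re.compile(
    rb"""\A(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_clean_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches the pattern."""
    return VALID_UTF8.match(data) is not None


print(is_clean_utf8("Привет, 世界".encode("utf-8")))  # ordinary multilingual text
print(is_clean_utf8(b"\xc0\xaf"))                     # overlong encoding of "/"
print(is_clean_utf8(b"\xc2\x85"))                     # C1 control (NEL) still passes
```

Python's `\Z` plays the role of Perl's `\z` here (match only at the very end of the input).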
Gumbo
The PHP equivalent would be preg_match('/\A( [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*\z/x', $string); Am I correct?
bilygates
@bilygates: You can leave the comments in as well. PHP’s `preg_match` uses Perl-Compatible Regular Expressions, and the `x` modifier lets you use whitespace and comments (starting with `#` and running to the end of the line) to make a regular expression more readable.
Gumbo
@Gumbo Ok, will do. Many thanks!
bilygates
+1  A: 

No.

It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
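
A minimal sketch of "store as-is, escape on output", written in Python for illustration. The function name `render_comment` is hypothetical, and it assumes the text ends up inside an HTML element body; other output contexts (attributes, JavaScript, SQL) need their own escaping.

```python
import html


def render_comment(stored_text: str) -> str:
    """Escape the stored text only at the moment it is written to the page."""
    return "<p>" + html.escape(stored_text) + "</p>"


# The database keeps exactly what the user typed, including non-ASCII
# characters and markup-like input; only the rendered copy is escaped.
stored = "<b>Привет</b> & 世界"
print(render_comment(stored))
```

Because the original text is never modified, a different escaping routine can be applied later if the output format changes.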

John Millikin
I don't agree. User input should be sanitized prior to all processing including storing. What would be the advantage of not doing so?
0xA3
If you've over- or under-sanitized input, then there's no way to recover the original data. If the unmolested data is stored, it can always be cleaned up in whichever way is needed.
John Millikin
I agree, but on the other hand the routine storing the data might expose a vulnerability that could be exploited using malicious, unsanitized input.
0xA3
+4  A: 

Make sure it is valid UTF-8 and Unicode? Yes.

Make sure it does not include certain characters, such as control codes? Probably not necessary.

You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, including:

  • Overlong encodings (which can lead to security issues)
  • Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
  • Code points that lie in reserved surrogate space in Unicode

All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
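
As an illustration of rejecting all three cases at input time, here is a small Python sketch. It relies on Python 3's built-in strict UTF-8 decoder, which refuses overlong forms, stray bytes from other encodings, and encoded surrogates. The helper name `decode_utf8_or_none` is made up for this example.

```python
from typing import Optional


def decode_utf8_or_none(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


print(decode_utf8_or_none("café".encode("utf-8")))  # valid input decodes normally
print(decode_utf8_or_none(b"\xc0\xaf"))             # overlong encoding
print(decode_utf8_or_none(b"caf\xe9"))              # ISO-8859-1 bytes sent as UTF-8
print(decode_utf8_or_none(b"\xed\xa0\x80"))         # encoded surrogate U+D800
```

Whether to reject the whole submission or ask the user to resubmit when this returns `None` is an application decision.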

If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will also need to filter out (either at input or output):

  • C0 control codes 0x00 to 0x1F (apart from tab, new line, carriage return)
  • 0x7F
  • C1 control codes 0x80 to 0x9F
  • (probably) any code point above 0x10FFFF
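
A possible way to do this filtering on already-decoded text, sketched in Python. It assumes the standard ranges: C0 is U+0000 to U+001F (keeping tab, line feed and carriage return), DEL is U+007F, and C1 is U+0080 to U+009F. The names `CONTROL_CHARS` and `strip_control_chars` are invented for this example. Code points above 0x10FFFF cannot occur in a Python string, so that case needs no handling here.

```python
import re

# C0 controls except tab (09), LF (0A) and CR (0D); then DEL; then C1 controls.
# The pattern works on code points, so run it after decoding the bytes.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\x80-\x9F]")


def strip_control_chars(text: str) -> str:
    """Remove control characters that are not allowed in HTML or XHTML text."""
    return CONTROL_CHARS.sub("", text)


print(repr(strip_control_chars("a\x00b\tc\nd\x7f\x85e")))
```

Removing the characters silently is only one option; rejecting the input is an equally reasonable choice.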
thomasrutter
All true, and the regex posted by Gumbo will handle all of those issues.
Alan Moore
Thank you for your reply. I guess I will use the regular expression that Gumbo suggested to validate the input. It seems to handle everything you said to filter out.
bilygates
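Yes, that regex is suitable for UTF-8 encoded text which is going to be used in XHTML or HTML, as it also filters out the C0 control codes and 0x7F listed above. The C1 range is a different matter: those code points are encoded as two bytes (C2 80 to C2 9F), which the two-byte branch accepts, so they need a separate check.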
thomasrutter
+1  A: 

When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (i.e., 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCIIs" out there, and cp437 is far from the most common one.
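
To see how much the upper half depends on the code page, here is a short Python illustration using byte 176 (0xB0), the first of the "decorative" values mentioned in the question. The variable names are only for this example.

```python
raw = bytes([0xB0])  # the single byte with value 176

as_cp437 = raw.decode("cp437")     # light-shade block character in the old DOS code page
as_latin1 = raw.decode("latin-1")  # degree sign in ISO-8859-1
as_cp1252 = raw.decode("cp1252")   # degree sign in Windows-1252

print(as_cp437, as_latin1, as_cp1252)

# In UTF-8 the byte 0xB0 is never a character on its own; the shade
# character is the code point U+2591 and takes three bytes.
print(as_cp437.encode("utf-8"))
```

So "characters 176 to 223" are not a range of byte values to block in UTF-8 input; the box-drawing characters live at their own Unicode code points.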

But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
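
As one example of filtering by character properties, Python's standard `unicodedata` module exposes the Unicode general category of each character. The sketch below drops only category `Cc` (control characters) while keeping tab and line breaks; the function name `drop_controls` and the choice of what to drop are assumptions for illustration, not a recommendation.

```python
import unicodedata


def drop_controls(text: str) -> str:
    """Remove characters in category Cc, except tab, LF and CR."""
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


# General categories: Lu = uppercase letter, Lo = other letter,
# Nd = decimal digit, Cc = control, So = other symbol.
for ch in ["Я", "中", "7", "\x07", "\u2591"]:
    print(repr(ch), unicodedata.category(ch))

print(repr(drop_controls("bell\x07 Привет\n")))
```

Other categories, such as format characters (`Cf`), are left alone here because some scripts need them to render correctly.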

It sounds like you need to learn more about character encodings and Unicode, though. Start here.

Alan Moore
Yes, that is exactly the page I looked at. I didn't know characters 128 to 255 can be different. I will have a look at that article you recommended. Thanks!
bilygates