Bare-minimum text sanitation

+3 Q:

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?

I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:

The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
The range 0x7F-0x9F (more control characters)

Ranges of characters that can safely be accepted would be even better to know.

There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.

I suppose it depends on your purpose. You could limit the user to the keyboard characters if that is your whim, which are 9, 10, 13, and [32-126]. If you are using UTF-8, any byte of 0x80 or above signifies that you are inside a multi-byte Unicode character. ASCII itself stops at 0x7F; in the 8-bit extensions of ASCII, 0x80 and above hold special display/format characters, and that range is localized to allow extensions depending on the language at the location.

Note that with UTF-8 the keyboard characters can differ depending on location, since users can input characters in their native language, which will fall outside the 0x00-0x7F range unless that language uses an unaccented Latin script (Arabic, Chinese, Japanese, Greek, Cyrillic, etc. all fall outside it).
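As a rough illustration of the whitelist described above, here is a minimal Python sketch. The names ALLOWED, keep_keyboard_ascii, and is_multibyte_in_utf8 are made up for this example, and the approach deliberately throws away every non-ASCII character, which is probably too strict for most applications.

```python
# Tab, LF, CR, plus printable ASCII 32..126, as listed above.
ALLOWED = {9, 10, 13} | set(range(32, 127))


def keep_keyboard_ascii(text):
    """Drop every character whose code point is not in ALLOWED."""
    return "".join(ch for ch in text if ord(ch) in ALLOWED)


def is_multibyte_in_utf8(ch):
    """True if the character needs more than one byte in UTF-8."""
    return len(ch.encode("utf-8")) > 1
```

For instance, keep_keyboard_ascii("caf\u00e9") drops the accented letter, and every byte of "\u00e9".encode("utf-8") is 0x80 or above, while DEL (0x7F) still encodes as a single byte.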

If you take a look here you can see what characters from UTF-8 will display.

Adam Shiemke 2010-07-07 18:38:23

Thank you, but I'm not trying to limit the text to keyboard characters, I just want to filter out characters that could have unexpected or dangerous results, like the null character.

Sidnicious 2010-07-08 16:10:11

+1 A:

See the W3C note Unicode in XML and other Markup Languages. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:

U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
language override control codes that could also have scope outside of an element;
the BOM (U+FEFF).
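A minimal sketch of stripping the characters listed above, in Python. The function name strip_discouraged is hypothetical, and treating the "language override control codes" as the tag block U+E0000–U+E007F is an assumption on my part; check the note itself for the full list before relying on this.

```python
import re

# Character class built from the list above. The tag-character range
# is an assumed reading of "language override control codes".
_DISCOURAGED = re.compile(
    "["
    "\u2028\u2029"            # line and paragraph separators
    "\u202a-\u202e"           # bidi embedding and override controls
    "\ufeff"                  # byte order mark
    "\U000e0000-\U000e007f"   # tag characters (assumption)
    "]"
)


def strip_discouraged(text):
    """Remove the discouraged characters; everything else is untouched."""
    return _DISCOURAGED.sub("", text)
```

Whether to delete these characters or replace them (say, turning U+2028 into an ordinary newline) is a design choice; deletion is simply the shortest thing to show.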

Additionally, you'd want to filter/replace the code points that are not valid characters in Unicode at all (the noncharacters, U+FFFF et al), and, if you are using a language that works in UTF-16 natively (e.g. Java, Python on Windows), any surrogates (U+D800–U+DFFF) that do not form valid surrogate pairs.
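One way this could look, assuming Python 3, where a str is a sequence of code points rather than UTF-16 units, so any surrogate found in a string is necessarily unpaired. The helper names are made up for this sketch. The noncharacter test covers U+FDD0–U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, and so on).

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp):
    """True for code points in the surrogate block U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def replace_invalid(text, replacement="\ufffd"):
    """Swap noncharacters and lone surrogates for the replacement character."""
    return "".join(
        replacement if is_noncharacter(ord(ch)) or is_surrogate(ord(ch)) else ch
        for ch in text
    )
```

In a language whose strings really are UTF-16 (Java, JavaScript), the surrogate check would instead have to look at neighbouring code units to tell a valid pair from a stray half.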

The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

And arguably (esp for a web application), lose CR as well, and turn tabs into spaces.

The range 0x7F-0x9F (more control characters)

Yep, away with those, except in cases where people might really mean them. (SO used to allow them, which allowed people to post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
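Putting the two control-code ranges together, a minimal Python sketch might look like the following. It covers the full C0 block through 0x1F along with DEL and the C1 block. The function name clean_controls and the tab_width parameter are hypothetical, and converting CR and CRLF to LF (rather than just deleting CR) is my assumption about the friendliest way to "lose CR".

```python
import re

# C0 controls except tab (0x09) and LF (0x0A), plus DEL and the C1 block.
_CONTROLS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")


def clean_controls(text, tab_width=4):
    """Normalise line endings, expand tabs, then drop remaining controls."""
    # Assumption: CRLF and bare CR both become LF, so line breaks survive.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Each tab becomes a fixed run of spaces (no column alignment).
    text = text.replace("\t", " " * tab_width)
    return _CONTROLS.sub("", text)
```

A site that wants to keep mis-decoded C1 characters around for diagnosis could narrow the pattern to leave out the 0x80-0x9F part.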

bobince 2010-07-07 19:07:41
