In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?
I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:
The range
0x00
-0x19
(mostly control characters), excluding0x09
(tab),0x0A
(LF), and0x0D
(CR)The range
0x7F
-0x9F
(more control characters)
Ranges of characters that can safely be accepted would be even better to know.
There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.