ansaurus

Question

Smart HTML encoding

Answer 1

A:

I would probably try to write a good regular expression for this. Are you doing this in code behind (C#) or on client-side with JavaScript?

http://www.regular-expressions.info/

Brandon Montgomery 2009-08-04 13:28:01

Trying to use regular expressions to parse non-regular data is not the best way to go about this. The best way would be to manipulate the DOM directly, which has already been stated.

Xetius 2009-08-04 13:36:00

Answer 2

+6 A:

Yes: don’t ever write HTML into your source code. Instead work with an API like DOM that takes care of all encoding issues for you.
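To illustrate the point about letting a tree API do the encoding, here is a minimal sketch in Python (not the C# or JavaScript discussed in this thread). The function name `build_paragraph` is made up for the example; the standard-library `xml.etree.ElementTree` serialiser escapes the text when it writes the element out.

```python
import xml.etree.ElementTree as ET


def build_paragraph(user_text: str) -> str:
    """Wrap user text in a <p> element built through the tree API."""
    p = ET.Element("p")
    # Assign the raw text; no manual escaping is done here.
    p.text = user_text
    # The serialiser escapes &, < and > in text content on output.
    return ET.tostring(p, encoding="unicode")


if __name__ == "__main__":
    print(build_paragraph("a >> b & c << d"))
    # <p>a &gt;&gt; b &amp; c &lt;&lt; d</p>
```

The caller never concatenates markup and text by hand, so there is no place where an encoding step can be forgotten.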

Bombe 2009-08-04 13:29:32

Of course, if this content already exists and you cannot change the generator, then you are left with trying to manipulate the content yourself. You might want to try some form of lexical parsing. Do not, under any circumstances, attempt this with regular expressions. At least, not if you want to maintain your sanity.

Xetius 2009-08-04 13:38:29

Answer 3

+2 A:

If you want a solid and totally reliable C# solution (but heavy-weight) then I'd use the HTML Agility Pack library. You could then iterate through nodes and HTML encode the contents. It's a bit more bullet-proof than regular expressions, but obviously more intense.
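The "iterate through nodes and encode the contents" idea can be sketched roughly as follows. This is Python's standard `html.parser`, used here only as a stand-in for the HTML Agility Pack; it is not that library's API, and the class name `TextEncoder` is invented for the example. It assumes the input is already reasonably well-formed markup.

```python
from html import escape
from html.parser import HTMLParser


class TextEncoder(HTMLParser):
    """Re-emit tags unchanged and HTML-encode the text between them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also kept verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives with entities already decoded; encode it again.
        self.out.append(escape(data, quote=False))


def encode_text_nodes(markup: str) -> str:
    parser = TextEncoder()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

How stray `<` characters in the text are treated depends on the parser, so input like the question's would need testing against whichever library is actually used.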

If you want to do it client-side, then use jQuery. See "Encode HTML entities with jQuery".

Dan Diplo 2009-08-04 13:31:51

Answer 4

A:

You are probably trying to solve the wrong problem. (I know this is not what you want to hear.)

If users are allowed to write unencoded >> and << into HTML then presumably they would also be able to write <> or <b>, and in that case there is no way you can reliably distinguish between text and markup. (Never mind that this makes you vulnerable to XSS attacks.)

You really have to intercept the text and encode it before it is interpolated into HTML. Probably you should explain the workflow leading to your problem. There must be a better way to solve it.

Edit in response to comment: There is simply no way to reliably encode input which can be either text or HTML at the same time. Anyway, if users are technical enough to enter raw HTML, presumably they are able to write entities; otherwise they shouldn't be entering raw HTML in the first place. If HTML input is only for advanced users, then you could have a check-box which indicates whether the input is text or HTML. But you should probably look into using a rich-text editor.
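A rough sketch of the check-box idea, in Python for illustration only. The function `prepare_input` and its `is_html` flag are hypothetical names, with the flag standing in for the check-box value submitted with the form.

```python
from html import escape


def prepare_input(value: str, is_html: bool) -> str:
    """Encode the value unless the user explicitly marked it as HTML."""
    if is_html:
        # Passed through as-is; it would still need sanitising
        # before display to avoid the XSS risk mentioned above.
        return value
    # Plain text: encode &, < and > so it cannot be read as markup.
    return escape(value, quote=False)


if __name__ == "__main__":
    print(prepare_input("a << b", is_html=False))        # a &lt;&lt; b
    print(prepare_input("<b>bold</b>", is_html=True))    # <b>bold</b>
```

The point is that the decision between text and markup is made by the user up front, rather than guessed from the content afterwards.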

JacquesB 2009-08-04 13:53:04

The thing is, I have a feature request to allow entry of HTML tags to format the text. BUT previously the text was always encoded, and users are used to writing non-HTML text into the fields. Now I know there will be some input like the example I have given. To prevent it from breaking the XML, I'm looking for a way to "fix" it. Intercepting the input is not an option, as I have no control over it.

Drejc 2009-08-04 13:58:55

@Drejc: You should probably add this info to the original question.

JacquesB 2009-08-04 15:13:36

Answer 5

A:

Have you thought about using tidy.net? You could throw your user input into that and see what it comes up with; it is very, very good at turning garbage into something that you actually want. It's a DLL and all managed code, I believe, so you can easily bolt it in.

As for the "no to regexp" bandwagon, I disagree. If the data is limited (you don't say if it is or not) then you could come up with some rules for at least trying to validate your input string, if not cleaning it up. I suspect though that your data could literally be anything, in which case you would be better off using something else, but it should not be ruled out completely.
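As a sketch of what such rules might look like when the data really is limited, assume the only markup allowed is a small fixed set of attribute-free tags (the set `b`, `i`, `u` below is an assumption for the example, as is the function name). Everything that is not one of those tags is encoded. This is Python for illustration, not the poster's C#.

```python
import re
from html import escape

# Hypothetical whitelist: only these tags, without attributes, count as markup.
ALLOWED_TAG = re.compile(r"(</?(?:b|i|u)>)", re.IGNORECASE)


def encode_except_allowed(text: str) -> str:
    """HTML-encode everything except the whitelisted formatting tags."""
    # With one capturing group, re.split keeps the matched tags
    # at the odd indices of the resulting list.
    parts = ALLOWED_TAG.split(text)
    for index in range(0, len(parts), 2):
        # Even indices hold the text between tags; encode those.
        parts[index] = escape(parts[index], quote=False)
    return "".join(parts)


if __name__ == "__main__":
    print(encode_except_allowed("a >> b <b>bold</b> << c"))
    # a &gt;&gt; b <b>bold</b> &lt;&lt; c
```

Note the limits: input that already contains entities such as `&amp;` gets encoded a second time, and a user who types `<b>` meaning literal text is still treated as markup, which is the ambiguity raised in Answer 4.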

Pete Duncanson 2009-08-04 14:08:11
