I'm looking for the best way to do some sort of "smart" HTML encoding. For instance:

From: <a>Next >></a> to: <a>Next &gt;&gt;</a>
From: <p><a><b><< Prev</b></a><br/><a>Next >></a></p> to: <p><a><b>&lt;&lt; Prev</b></a><br/><a>Next &gt;&gt;</a></p>

So only the non-XML/HTML parts of the text would be encoded, as if HtmlEncode had been called on them.

Any suggestions?

EDIT: This should be as lightweight as possible. The incoming text will come from users who have no knowledge of HTML encoding.

A: 

I would probably try to write a good regular expression for this. Are you doing this in code behind (C#) or on client-side with JavaScript?

http://www.regular-expressions.info/

Brandon Montgomery
Trying to use regular expressions to parse non-regular data is not the best way to go about this. The best way would be to manipulate the DOM directly, which has already been stated.
Xetius
+6  A: 

Yes: don’t ever write HTML into your source code. Instead work with an API like DOM that takes care of all encoding issues for you.
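As a minimal sketch of that idea (Python is used here purely for illustration; the function name `build_nav` is hypothetical and not from the question), the markup can be built with the standard-library `xml.etree.ElementTree`, and the serializer then encodes the text for you:

```python
import xml.etree.ElementTree as ET


def build_nav(prev_label, next_label):
    """Build the <p> block from the question through a tree API."""
    p = ET.Element("p")

    prev_link = ET.SubElement(p, "a")
    bold = ET.SubElement(prev_link, "b")
    bold.text = prev_label  # raw text; encoded only when serialized

    ET.SubElement(p, "br")

    next_link = ET.SubElement(p, "a")
    next_link.text = next_label

    # The serializer escapes &, < and > inside text nodes.
    return ET.tostring(p, encoding="unicode")


print(build_nav("<< Prev", "Next >>"))
```

The labels are passed in as plain text, so there is never a point where text and markup have to be told apart after the fact.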

Bombe
Of course, if this content already exists and you cannot change the generator, then you are left with trying to manipulate the content yourself. You might want to try some form of lexical parsing. Do not, under any circumstances, attempt this with regular expressions. At least, not if you want to maintain your sanity.
Xetius
+2  A: 

If you want a solid, reliable (but heavyweight) C# solution, then I'd use the HTML Agility Pack library. You could then iterate through the nodes and HTML-encode their text contents. It's more bullet-proof than regular expressions, but obviously more resource-intensive.
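The "walk the nodes and encode only the text" idea can be sketched without the HTML Agility Pack. The following is a rough Python illustration using the standard-library `html.parser` rather than the C# library named above; the names `TextOnlyEncoder` and `encode_text_nodes` are hypothetical. It assumes anything the parser recognizes as a tag is intended markup, and it silently drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser


class TextOnlyEncoder(HTMLParser):
    """Re-emit tags unchanged and HTML-encode only the text between them."""

    def __init__(self):
        # convert_charrefs=True decodes existing entities first, so text that
        # is already encoded (e.g. "&lt;") is not encoded a second time.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep the start tag exactly as it was written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser reports the tag name in lower case.
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text content is encoded.
        self.parts.append(escape(data, quote=False))


def encode_text_nodes(markup):
    parser = TextOnlyEncoder()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(encode_text_nodes("<a>Next >></a>"))
```

This still cannot tell a literal `<b>` typed as text from a real bold tag; it only handles stray `<` and `>` characters that do not form a tag.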

If you want to do it client-side, then use jQuery. See "Encode HTML entities with jQuery".

Dan Diplo
A: 

You are probably trying to solve the wrong problem. (I know this is not what you want to hear.)

If users are allowed to write unencoded >> and << into HTML, then presumably they would also be able to write <> or <b>, and in that case there is no way you can reliably distinguish between text and markup. (Never mind that this makes you vulnerable to XSS attacks.)

You really have to intercept the text and encode it before it is interpolated into HTML. You should probably explain the workflow leading to your problem. There must be a better way to solve it.
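In code, "encode before interpolating" amounts to something like the following Python sketch (the function name `render_comment` and the surrounding `<p>` template are made up for illustration):

```python
from html import escape


def render_comment(user_text):
    """Encode the user's text first, then place it inside trusted markup."""
    return "<p>" + escape(user_text) + "</p>"


print(render_comment("Next >> <b>"))
```

Here the markup comes only from the template and the user's input is always treated as text, so the ambiguity described above never arises.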

Edit in response to comment: There is simply no way to reliably encode input which can be either text or HTML at the same time. Anyway, if users are technical enough to enter raw HTML, presumably they are able to write entities; otherwise they shouldn't be entering raw HTML in the first place. If HTML input is only for advanced users, then you could have a check-box that indicates whether the input is text or HTML. But you should probably look into using a rich-text editor.

JacquesB
The thing is, I have a feature request to allow entry of HTML tags to format the text. BUT previously the text was always encoded, and users are used to writing non-HTML text into the fields. I know there will be some input like the examples I have given. To prevent it from breaking the XML, I'm looking for a way to "fix" it. Intercepting the input is not an option, as I have no control over it.
Drejc
@Drejc: You should probably add this info to the original question.
JacquesB
A: 

Have you thought about using tidy.net? You could throw your user input into that and see what it comes up with; it is very, very good at turning garbage into something that you actually want. It's a DLL and all managed code, I believe, so you can easily bolt it in.

As for the no-to-regexp bandwagon, I disagree. If the data is limited (you don't say whether it is or not), then you could come up with some rules for at least trying to validate your input string, if not cleaning it up. I suspect, though, that your data could literally be anything, in which case you would be better off using something else, but it should not be ruled out completely.
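If the data really is limited, one possible rule set is an allow-list of tags: anything that looks like one of the permitted tags is kept, and everything else is encoded. A rough Python sketch, assuming a hypothetical allow-list (`ALLOWED_TAGS`) that is not taken from the question:

```python
import re
from html import escape

# Hypothetical allow-list; adjust to the tags users may actually enter.
ALLOWED_TAGS = ("a", "b", "i", "p", "br")

# Matches an opening, closing, or self-closing tag from the allow-list.
TAG_PATTERN = re.compile(
    r"(</?(?:%s)(?:\s[^<>]*)?/?>)" % "|".join(ALLOWED_TAGS),
    re.IGNORECASE,
)


def encode_outside_allowed_tags(text):
    """HTML-encode everything except tags on the allow-list."""
    pieces = TAG_PATTERN.split(text)
    # With one capturing group, re.split alternates: text, tag, text, tag, ...
    return "".join(
        piece if index % 2 else escape(piece, quote=False)
        for index, piece in enumerate(pieces)
    )


print(encode_outside_allowed_tags("<a>Next >></a>"))
```

Two limitations are worth stating: attributes inside allowed tags pass through untouched, so this is not an XSS filter, and text that is already encoded (such as `&lt;`) would be encoded a second time.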

Pete Duncanson