Nearly all browsers use a certain amount of leeway in rendering invalid HTML. For example, they would render x < y as if it were written x &lt; y because it is "clear" that the < is intended as a literal character, not part of an HTML tag.

Where can I find that logic as a separate "cleanup" module? Such a module would convert x < y to x &lt; y

A: 

I'm not sure exactly what you mean, but the PHP function htmlentities might help.

aletzo
No... see my response to @Mike Caron's comment
JoelFan
+3  A: 

Try looking at the source code for Tidy.

HTML before running through Tidy:

<html>

 <head>
  <title>boo</title>
 </head>

 <body>
   x < y
 </body>

</html>

Same HTML after running through Tidy:

<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">

  <title>boo</title>
</head>

<body>
  x &lt; y
</body>
</html>

Notice that x < y was changed to x &lt; y.

UPDATE

Based on your comment, you should probably use Tidy to clean up your HTML. I believe there are Tidy libraries for most common languages that will do this for you. If you are using PHP, there is PHP Tidy.

UPDATE

I noticed that you said you're using C#. You can use Tidy with C# as well. Here's something I found. I don't develop in C# and I haven't tried this out so YMMV:

Fix Up Your HTML with HTML Tidy and .NET

Vivin Paliath
A: 

Rendering of invalid HTML in browsers is horrible guesswork, and you really shouldn't try to emulate it (it will break). However, replacing some occurrences could be done with a regexp:

preg_replace('/(\s)<(\s)/', '$1&lt;$2', $data);
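For readers not on PHP, here is a rough Python port of the same substitution, offered as a sketch rather than a recommendation. The function name `escape_spaced_lt` is made up for illustration. It only touches a `<` that has whitespace on both sides, so it inherits every limitation of the regex above.

```python
import re

# Same pattern as the PHP call above: a "<" with whitespace on both sides.
_SPACED_LT = re.compile(r"(\s)<(\s)")


def escape_spaced_lt(data: str) -> str:
    """Replace a whitespace-delimited "<" with "&lt;" (hypothetical helper)."""
    return _SPACED_LT.sub(r"\1&lt;\2", data)


print(escape_spaced_lt("x < y"))  # x &lt; y
print(escape_spaced_lt("x<y"))    # x<y (no surrounding whitespace, so untouched)
```

Because the pattern knows nothing about tags, a malformed tag such as ` < body>` is rewritten as well.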
You
This will change ` < body>` to ` &lt; body>`. Undesirable.
Vivin Paliath
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Chuck
@Vivin: It is. It relies to a certain extent on users formatting their input properly, but it's fairly good. @Chuck: We're not actually parsing HTML here, but yeah.
You
@You I tend to be more paranoid :)
Vivin Paliath
A: 

Edit: I am assuming you're using PHP, since you didn't specify

Use strip_tags:

$content = strip_tags($content, '<b><i>');

This will leave safe tags (as defined by you), and remove everything else.

Mike Caron
That's … a big assumption
David Dorward
I'm not using PHP, but I'm using something similar to strip_tags in C#. The problem is that my "strip_tags" thinks that "x < y" contains an unknown (and unterminated) tag called "y" and it "strips" it, leaving just "x"
JoelFan
@David It's the most common web development language. And, everyone else assumed that too. The onus is on the OP to specify, right?
Mike Caron
@Joel Ah, in that case, I'd go with someone else's answer. Vivin's is the only one with a C# answer, so... yeah.
Mike Caron
@David, PHP is the most common language. OP should specify or at least tag his question, otherwise you need to make these assumptions.
You
A: 

The HTML 5 (draft) specification includes a detailed parsing algorithm based on how browsers handle bad markup.
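In that algorithm's tokenizer, a `<` that is not followed by a letter, `/`, `!` or `?` is reported as a parse error and emitted as a literal character, which is exactly the `x < y` case. As a minimal sketch of the "parse leniently, then re-serialize" idea, the example below uses Python's standard `html.parser`. That module is a tolerant tokenizer, not an implementation of the HTML5 algorithm, and the names `LtFixer` and `fix_stray_lt` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class LtFixer(HTMLParser):
    """Re-emit markup, escaping whatever the parser treated as text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # A stray "<" arrives here as text, so escaping turns it into "&lt;".
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def fix_stray_lt(markup: str) -> str:
    fixer = LtFixer()
    fixer.feed(markup)
    fixer.close()
    return "".join(fixer.out)


print(fix_stray_lt("<b>x < y</b>"))  # <b>x &lt; y</b>
```

This sketch has known gaps: it also escapes the contents of `script` and `style` elements, and character references are decoded and then re-escaped, so it should be treated as an illustration only. A real HTML5 parsing library would follow the specification far more closely.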

David Dorward