ansaurus

Question

Checking a HTML string for unopened tags

Answer 1

A:

Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.

Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:

<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->

Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)

If you've got a start tag, you need to know if it needs closing, ie. if it's one of the EMPTY content-model tags like <img>. If a element is EMPTY, it doesn't need closing so you can ignore it. (If you have XHTML, this is all a bit easier.)

If you have a start-tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup. If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.

Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the the tags-to-close to the end, again in reverse order.

(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)

bobince 2010-07-02 10:07:17

html tag and regexp is not really a good idea

remi bourgarel 2010-07-02 13:56:21

Gosh, really, do you think?

bobince 2010-07-02 16:46:59

Answer 2

+1 A:

For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");

foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason); 
}

João Angelo 2010-07-02 10:35:18

ansaurus

tags:

views:

answers:

Checking a HTML string for unopened tags

related questions