views:

38

answers:

2

I have a string as a HTML source and I want to check whether the HTML source which is string contains a tag which is not opened.

For example the string below contains </u> after WAVEFORM which has no opening <u>.

WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,

I just want to check for these types of unopened tag and then I have to append the open tag to the start of the string?

A: 

Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.

Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:

<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->

Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)

If you've got a start tag, you need to know if it needs closing, ie. if it's one of the EMPTY content-model tags like <img>. If a element is EMPTY, it doesn't need closing so you can ignore it. (If you have XHTML, this is all a bit easier.)

If you have a start-tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup. If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.

Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the the tags-to-close to the end, again in reverse order.

(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)

bobince
html tag and regexp is not really a good idea
remi bourgarel
Gosh, really, do you think?
bobince
+1  A: 

For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");

foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason); 
}
João Angelo