Does anyone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and the second <i>, but not the first <i> or its closing </i> tag:

<i><b>test<i>ing</i>

Is this too complex for regex? Might it require some recursive, programmatic processing?

+5  A: 

I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either an HTML parser that's capable of identifying such problems, or parsing it yourself.

Pesto
Too bad SO doesn't have an API. I'm picturing something like: `for (Question q : questionsTagged("regex", "html")) { q.addAnswer(new Answer("HTML isn't regular, so regexes are hardly ever a good choice.")); }`
Michael Myers
Of course, then you'll look silly if the OP preempted that response in the question, as he pretty much did here.
Michael Myers
@mmyers: Honestly, I just couldn't resist the opportunity to link to something of Welbog's.
Pesto
+2  A: 

Yes, it requires recursive processing, potentially quite deep (or an equivalent loop with explicit state); a single regex will not do it. You could write a regex that handles a few levels of nesting, but not one that works on arbitrary HTML files. The parser has to remember which tags are open at any given point in the stream, and regexes aren't good at that.

Use a SAX parser with some counters, or keep your state on a stack, pushing each opening tag and popping on each closing tag. Think about how you would code this game to see what I mean about HTML tag depth: http://en.wikipedia.org/wiki/Tower_of_Hanoi
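
A minimal sketch of the stack idea, assuming Python and its standard-library `html.parser.HTMLParser` as a stand-in for a SAX-style event parser. The names `UnclosedTagFinder`, `find_unclosed` and `VOID_TAGS` are made up for illustration, and pairing each closing tag with the nearest open tag of the same name is just one possible policy:

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Tracks open tags on a stack and records the ones never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, (line, column))
        self.unclosed = []   # opening tags with no matching close
        self.stray = []      # closing tags with no matching open

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens and closes at once.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [name for name, _ in self.open_tags]:
            self.stray.append((tag, self.getpos()))
            return
        # Pop back to the matching opener; anything above it was never closed.
        while self.open_tags:
            name, pos = self.open_tags.pop()
            if name == tag:
                break
            self.unclosed.append((name, pos))

    def close(self):
        super().close()
        # Whatever is still on the stack at end of input was never closed.
        self.unclosed.extend(self.open_tags)
        self.open_tags = []


def find_unclosed(html):
    """Return the names of opening tags that were never closed."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    return [name for name, _ in finder.unclosed]


print(find_unclosed("<i><b>test<i>ing</i>"))  # ['i', 'b']
```

Note that with this policy the lone </i> is paired with the innermost open <i>, so the sketch reports the first <i> and the <b> rather than the <b> and the second <i>.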

Karl
+1  A: 

As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.

If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most platforms and languages (COM+, .NET, PHP, etc.).

If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.

Vdex
+1  A: 

No, that's too complex for a regular expression. Your problem is equivalent to checking an arithmetic expression for properly balanced brackets, which needs at least a pushdown automaton to succeed.

In your case you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression) and store the result in a list. Then iterate through the node list and push every opening tag onto a stack. When you encounter a closing tag, check that the topmost stack entry is an opening tag of the same type and pop it. Otherwise you have found the HTML syntax error you were looking for.
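
The steps above can be sketched roughly as follows, assuming Python. The tag pattern `TAG_RE` and the function names `tokenize` and `first_nesting_error` are illustrative only; the pattern is deliberately simplistic and does not handle comments, scripts, attribute values containing ">", or void elements such as <br> written without a slash:

```python
import re

# Captures: optional leading "/", the tag name, optional trailing "/".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")


def tokenize(html):
    """Split html into a list of ('open' | 'close' | 'text', value) pairs."""
    tokens, pos = [], 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            tokens.append(("text", html[pos:m.start()]))
        closing, name, self_closing = m.groups()
        if closing:
            tokens.append(("close", name.lower()))
        elif not self_closing:
            tokens.append(("open", name.lower()))
        # Self-closing tags like <br/> are dropped: nothing to balance.
        pos = m.end()
    if pos < len(html):
        tokens.append(("text", html[pos:]))
    return tokens


def first_nesting_error(html):
    """Return a description of the first nesting error, or None if balanced."""
    stack = []
    for kind, value in tokenize(html):
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            # The topmost open tag must be of the same type.
            if not stack or stack[-1] != value:
                return f"unexpected </{value}>"
            stack.pop()
    if stack:
        return f"unclosed <{stack[-1]}>"
    return None


print(first_nesting_error("<i><b>test<i>ing</i>"))  # unclosed <b>
print(first_nesting_error("<b><i>x</b></i>"))       # unexpected </b>
print(first_nesting_error("<p>ok</p>"))             # None
```

This version stops at the first problem it finds; collecting every error would need a recovery rule for what to do with the stack after a mismatch.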

sebasgo
A: 

I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" followed by one or more characters that are not "/", up to the end of the line.
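
A quick way to see what that pattern does and does not catch, sketched in Python (the name `UNCLOSED_RE` and the sample lines are only for illustration). Because the match may not contain any "/", it can only succeed after the last slash on the line:

```python
import re

# "<", then one or more non-slash characters, anchored at end of line.
UNCLOSED_RE = re.compile(r"<[^/]+$")


def trailing_unclosed(line):
    """Return the matched text, or None when the pattern finds nothing."""
    m = UNCLOSED_RE.search(line)
    return m.group(0) if m else None


print(trailing_unclosed("<b>test"))               # <b>test
print(trailing_unclosed("<i>one</i> <b>two"))     # <b>two
print(trailing_unclosed("<i><b>test<i>ing</i>"))  # None
```

The last line is the example from the question: the unclosed tags all sit before the final </i>, so nothing matches. The pattern therefore seems best suited to lines where the unclosed tag comes after every slash.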

ariddell