ansaurus

Question

Answer 1

+5 A:

I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either an HTML parser that's capable of identifying such problems, or parsing it yourself.

Pesto 2009-08-03 18:04:53

Too bad SO doesn't have an API. I'm picturing something like: `for (Question q : questionsTagged("regex", "html")) { q.addAnswer(new Answer("HTML isn't regular, so regexes are hardly ever a good choice.")); }`

Michael Myers 2009-08-03 18:08:45

Of course, then you'll look silly if the OP preempted that response in the question, as he pretty much did here.

Michael Myers 2009-08-03 18:09:35

@mmyers: Honestly, I just couldn't resist the opportunity to link to something of Welbog's.

Pesto 2009-08-03 18:20:10

Answer 2

+2 A:

Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course), so it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that will work on just any HTML file. This is because the parser has to remember which tags are open at any given point in the stream, and regexes aren't good at that.

Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
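A minimal sketch of the stack idea, assuming Python and its standard-library `html.parser.HTMLParser` as a stand-in for a SAX-style, event-driven parser. The class name `TagBalanceChecker`, the function `check_balance` and the `VOID` set are hypothetical names chosen for illustration; the list of void elements (tags that never take a closing tag) is an assumption you may need to adjust.

```python
from html.parser import HTMLParser

# Assumed list of void elements: tags that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Keeps a stack of open tags while the parser streams events."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.problems = []    # human-readable problem descriptions

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)      # push on

    def handle_startendtag(self, tag, attrs):
        pass                                # <br/> style tags open nothing

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()            # pop off
        else:
            self.problems.append(f"unexpected </{tag}>")


def check_balance(html):
    """Return a list of problems; an empty list means the tags balance."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"unclosed <{t}>" for t in checker.open_tags]
```

For example, `check_balance("<div><span>x</span>")` reports the `<div>` that was never closed. The sketch does no error recovery, so a single mismatch can produce several follow-on reports.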

Karl 2009-08-03 18:08:08

Answer 3

+1 A:

As @Pesto said, HTML isn't regular. You would have to build HTML grammar rules and apply them recursively.

If you are looking to fix HTML programmatically, I have used a component called HTML Tidy with considerable success. There are builds of it for most platforms (COM+, .NET, PHP etc.).

If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.

Vdex 2009-08-03 18:16:07

Answer 4

+1 A:

No, that's too complex for a regular expression. Your problem is equivalent to checking an arithmetic expression for properly balanced brackets, which needs at least a pushdown automaton to succeed.

In your case you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression). Store the result in a list. Then you can iterate through the node list and push every opening tag onto a stack. If you encounter a closing tag in your node list, you must check that the topmost stack entry is an opening tag of the same type, and pop it if so. Otherwise you have found the HTML syntax error you were looking for.
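The steps above can be sketched roughly as follows, assuming Python. The names `TAG_RE`, `VOID_TAGS` and `first_tag_error` are hypothetical, the tokenizing regex is deliberately simple (it ignores comments, scripts and `>` inside attribute values), and the short list of void tags is an assumption.

```python
import re

# Hypothetical tokenizer: group 1 is "/" for closing tags, group 2 is the
# tag name, group 3 is "/" for self-closing tags such as <br/>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>")

# Assumed set of tags that never need a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def first_tag_error(html):
    """Return a description of the first nesting error, or None if balanced."""
    stack = []
    for match in TAG_RE.finditer(html):
        closing, name, self_closing = match.groups()
        name = name.lower()
        if name in VOID_TAGS or self_closing:
            continue                      # nothing to open or close
        if not closing:
            stack.append(name)            # opening tag: push
        elif not stack:
            return f"unexpected </{name}>"
        elif stack[-1] == name:
            stack.pop()                   # matching closing tag: pop
        else:
            return f"expected </{stack[-1]}> but found </{name}>"
    if stack:
        return f"unclosed <{stack[-1]}>"  # innermost tag still open
    return None
```

Here the regular expression is only used to find individual tags; the stack does the part a regex cannot do, which is remembering the nesting.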

sebasgo 2009-08-03 18:16:09

Answer 5

A:

I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: `<[^/]+$`, which matches a "<" followed by one or more characters that are not "/", up to the end of the line.
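A quick way to see what this pattern does and does not catch, assuming Python's `re` module. The sample lines and the names `UNCLOSED` and `flagged` are made up for illustration.

```python
import re

# The pattern from the answer: a "<" followed only by non-"/" characters
# up to the end of the line.
UNCLOSED = re.compile(r"<[^/]+$")

lines = [
    "<b>bold</b>",           # closed: not flagged
    "<b>bold",               # unclosed: flagged
    "plain text",            # no tags: not flagged
    "<br>",                  # needs no closing tag, but is flagged anyway
    '<a href="/x">link',     # unclosed, but the "/" in the URL hides it
]

flagged = [line for line in lines if UNCLOSED.search(line)]
```

With these samples only `<b>bold` and `<br>` end up in `flagged`, so the pattern is best treated as a rough per-line heuristic rather than a general check.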

ariddell 2010-03-23 20:34:55


Regex for unclosed HTML tags
