Hi all. I have a simple problem: I want to construct a regex that matches a form in HTML, but only if the form has any input tags. Example:

The following should be matched (ignoring attributes):

..
<form>
..
<input/>
..
</form>
..

But the following should not (ignoring attributes):

..
<form>
..
</form>
..

I have tried everything from look-arounds to capture groups but it quickly gets complicated. I want to believe there is a simple regex to capture the problem. Please note that it is important that the regex pairs the opening and closing tags according to the HTML code which means the following does not work:

<form>.+<input/>.+</form>

because it matches wrongly like this:

..
<form> <--- This is wrongly matched as the opening tag 
..
</form> 
<form> <-- This is the correct opening tag of the correct form
..
<input/>
..
</form> <--- This is matched as the closing tag
..

EDIT:

I already made a regex that matches what I want; my question is not how to do it, but how to do it simply and elegantly. To me this is not simple or elegant at all:

<form>
(.(?<!</form>))+
<input/>
(.(?<!</form>))+
</form>
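
For comparison, the same idea can be sketched in Python with a negative lookahead in place of the lookbehind. This is only an illustration under the question's own simplifications (literal `<form>`, `<input/>` and `</form>` tags, no attributes, no nested forms); the names `FORM_WITH_INPUT` and `find_form_with_input` are made up for the example.

```python
import re

# Each "." is only consumed if "</form>" does not start at that position,
# so a match can never run past the closing tag of the form it started in.
# re.DOTALL lets "." cross line breaks.
FORM_WITH_INPUT = re.compile(
    r"<form>(?:(?!</form>).)*?<input/>(?:(?!</form>).)*?</form>",
    re.DOTALL,
)


def find_form_with_input(html):
    """Return the text of the first form containing <input/>, or None."""
    match = FORM_WITH_INPUT.search(html)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = "..<form>..</form><form>..<input/>..</form>.."
    print(find_form_with_input(sample))
```

On the sample string, the first (empty) form is skipped and only the second form is returned. The sketch shares every weakness of the original expression: attributes, comments, scripts or nested forms would break it.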
+1  A: 

You really don't want to parse HTML using regex. See this answer if you need more convincing.

Regular expressions are the wrong tool for trying to parse HTML, especially when it's HTML that is not guaranteed to be well formed.

You should really get an HTML/XHTML parsing library and use that to match HTML content. Take a look at the HTML Agility Pack, it's probably sufficient for what you need.
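
As a rough illustration of the parser approach, here is a minimal sketch that uses Python's standard-library `html.parser` module rather than the HTML Agility Pack (which is a .NET library). The class `FormInputFinder` and the helper `forms_with_inputs` are hypothetical names for this example, and the sketch assumes every form has an explicit closing tag.

```python
from html.parser import HTMLParser


class FormInputFinder(HTMLParser):
    """Records, for each closed <form>, whether it contained an <input>."""

    def __init__(self):
        super().__init__()
        self.open_forms = []  # one flag per currently open <form>
        self.results = []     # one flag per closed form, in closing order

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input/> also arrive here, because the
        # default handle_startendtag calls handle_starttag first.
        if tag == "form":
            self.open_forms.append(False)
        elif tag == "input" and self.open_forms:
            self.open_forms[-1] = True

    def handle_endtag(self, tag):
        if tag == "form" and self.open_forms:
            self.results.append(self.open_forms.pop())


def forms_with_inputs(html):
    """Return one boolean per form: True if that form contains an input."""
    finder = FormInputFinder()
    finder.feed(html)
    finder.close()
    return finder.results


if __name__ == "__main__":
    print(forms_with_inputs("<form></form><form><select/><input/></form>"))
```

Because the parser tokenizes tags itself, attributes on `<form>` or `<input>` and other tags between them do not need any special handling here.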

LBushkin
There really ought to be some feature in SO that, when you type "regular", "expression" or "regexp" and "html" in the title of a post it simply takes you to the referenced answer. +1 for referencing a reasonable alternative.
tvanfosson
Thanks for the advice. I am aware of the fact that regex -generally- is a bad idea for HTML, but I disagree with -always-. I'm using regex for this because it's the only thing I need to do with HTML, and therefore using anything else is overkill. It IS possible to do this using regex since I already did it, but the regex I got is too complicated/stupid and, ironically, also overkill I think. Maybe I was wrong to assume there is a simple solution?
johnrl
@johnrl: you are aware that you're using the wrong tool, and you're aware that doing so has made a solution which is overcomplicated and brittle, and yet you persist in trying to use the wrong tool? **The solution becomes simple when you use the correct tool.**
Eric Lippert
@johnrl: If you **insist** on using regular expressions, I would advise you to do two things. 1) Test that expression with a wide variety of input HTML to make sure it will really match the cases you care about. Remember, HTML can contain JavaScript, CDATA, XML islands, and lots of other content that can break regex matching. 2) Focus on correctness and not elegance. Regular expressions are rarely elegant to begin with. If you find something that actually works for your test cases, just use it; trying to simplify the expression may break it, and at best it's not a great use of time.
LBushkin
@Eric: downloading an external library, shipping it with the program, creating several objects and many lines of code, and learning a new library just to do what I want is overkill. You are entitled to your own opinion, but I do not want to go through all of that to solve a problem that can be solved by 1 line of code. All I wanted to know was if someone could come up with a better (simpler/shorter) solution using regex than the above. If not, then I must accept that, because I have chosen the wrong tool as you mention yourself.
johnrl
@LBushkin you're probably right. CDATA etc. would definitely be a problem, but I'm not so sure it will interfere with the webpages I am parsing. Your second piece of advice is as true as it gets, so I guess I will be using the above if it turns out not to fail, or else I'll use the HTML Agility Pack as you mention. Thanks.
johnrl
Parsing html with regular expressions is the "young man" approach ( http://paultyma.blogspot.com/2008/04/young-mans-business-model.html ) to handling html (see mini-story #1). The up-front cost for such solutions is often much lower, but the cost you end up paying for them in the long run (when you need to fix the issues you introduce) is generally higher. That said, regex may be good enough even though it sometimes fails catastrophically. Also try using a C compiler to compile code written targeting C++.
Brian
A: 

You should not parse HTML with regular expressions, but if you must, then what about something as simple as:

<form>[^</form>]+<input/>.+</form>
Tinus
Doesn't work. This excludes the possibility of any other tag in between. For example, .. <form> .. </form> <form> .. <select/> <input/> .. </form> .. is not matched at all.
johnrl
@johnrl -- don't give in to the madness, get the HtmlAgilityPack @LBushkin recommends.
tvanfosson
+5  A: 

I want to believe there is a simple regex to capture the problem

Wishing does not make it so. There is no evidence for the proposition that every problem can be solved with regular expressions, and plenty of evidence against. Your faith is not well placed.

The set of languages which are recognizable by regular expressions is called -- unsurprisingly -- the regular languages. A nice property of all regular languages is that they can be recognized by a device with finitely many states. Therefore, you can quickly figure out if a language is not regular by asking yourself the question "would I require an unbounded number of states to recognize this language?"

Consider the language of matching parens: (), ()(), (()), ()(()), and so on. To recognize this language you have to keep track of how many open parens there are waiting to be closed, and therefore you need an unbounded number of states. Therefore this language is not a regular language, and therefore it cannot be matched by a regular expression.
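
To make the "unbounded number of states" point concrete, here is a minimal Python sketch of a recognizer for the paren language. The function name `is_balanced` is just an illustrative choice. The `depth` counter is the piece a finite-state device cannot have: it can grow without limit as the input nests deeper.

```python
def is_balanced(text):
    """Return True if text consists only of correctly matched parens."""
    depth = 0  # number of open parens still waiting to be closed
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared with nothing open to match it.
                return False
        else:
            # Any other character is not part of the paren language.
            return False
    return depth == 0


if __name__ == "__main__":
    for sample in ["()", "()(())", "(()", ")("]:
        print(repr(sample), is_balanced(sample))
```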

HTML is clearly the paren language but even more complicated, because now there are an infinite number of different "kinds of parens". Each tag is like an open paren that must be matched by its corresponding closing tag. Since this is an even more complex and difficult version of a non-regular language, clearly it cannot be a regular language. And therefore it cannot be matched correctly with regular expressions.

The right tool to recognize patterns in HTML is an HTML parser.

Eric Lippert
**I never thought to look at it like that.** That really makes it crystal clear why HTML parsing with regex is a fool's errand. Thank you.
LBushkin
For the sake of pedantry, it might be worth noting that support for backreferences means that the "regular expressions" implemented by most programming languages/runtimes are not true regular expressions and can in fact recognize non-regular languages. .NET's regular expressions even include a "balancing group" construct for recognizing languages like your paren example. Having said that, of course they are still the wrong tool for this job.
kvb
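
The point about backreferences can be illustrated with a small Python sketch. The language "n copies of a, then b, then the same n copies of a" is not regular, yet Python's `re` module recognizes it with a backreference. The names `SAME_COUNT` and `has_same_count` are hypothetical, and this uses Python's engine for illustration, not the .NET balancing groups mentioned above.

```python
import re

# (a*) captures a run of "a" characters; \1 then requires exactly the same
# text again after the "b". A true regular expression cannot express this,
# because it would have to remember an unbounded count.
SAME_COUNT = re.compile(r"(a*)b\1")


def has_same_count(text):
    """Return True if text is a^n b a^n for some n >= 0."""
    return SAME_COUNT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["b", "aabaa", "aaabaa"]:
        print(repr(sample), has_same_count(sample))
```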