ansaurus

Question

C# regular expression for finding forms with input tags in HTML?

Answer 1

A:

Don't parse HTML with regular expressions.

Greg D 2010-05-05 14:01:39

Answer 2

+1 A:

You really don't want to parse HTML using RegEx. See this answer if you need more convincing.

Regular expressions are the wrong tool for trying to parse HTML - especially when it's HTML that is not guaranteed to be well formed.

You should really get an HTML/XHTML parsing library and use that to match HTML content. Take a look at the HTML Agility Pack, it's probably sufficient for what you need.
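A minimal sketch of what that could look like, assuming the `HtmlAgilityPack` NuGet package is referenced. The names `FormFinder` and `FormsWithInputs` are made up for illustration. Some older releases of the library treat `<form>` as an empty element, so the sketch removes that flag first; the call does nothing if the flag is not there.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // external NuGet package, not part of .NET itself

public static class FormFinder
{
    // Hypothetical helper: returns every <form> element that has at least
    // one <input> descendant.
    public static IReadOnlyList<HtmlNode> FormsWithInputs(string html)
    {
        // Some older versions of the library treat <form> as an empty
        // element, which leaves its children outside the form node.
        // Removing the entry is harmless if it is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var forms = doc.DocumentNode.SelectNodes("//form[.//input]");
        return forms == null ? new List<HtmlNode>() : forms.ToList();
    }
}
```

The XPath `//form[.//input]` selects forms with an `<input>` anywhere below them, so other tags in between do not matter.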

LBushkin 2010-05-05 14:03:25

There really ought to be some feature in SO that, when you type "regular", "expression" or "regexp" and "html" in the title of a post, simply takes you to the referenced answer. +1 for referencing a reasonable alternative.

tvanfosson 2010-05-05 14:08:33

Thanks for the advice. I am aware that regex -generally- is a bad idea for HTML, but I disagree with -always-. I'm using regex for this because it's the only thing I need to do with HTML, and therefore using anything else is overkill. It IS possible to do this using regex since I already did it, but the regex I got is too complicated/stupid and, ironically, also overkill, I think. Maybe I was wrong to assume there is a simple solution?

johnrl 2010-05-05 14:14:35

@johnrl: you are aware that you're using the wrong tool, and you're aware that doing so has made a solution which is overcomplicated and brittle, and yet you persist in trying to use the wrong tool? **The solution becomes simple when you use the correct tool.**

Eric Lippert 2010-05-05 14:19:15

@johnrl: If you **insist** on using regular expressions, I would advise you to do two things. 1) Test that expression with a wide variety of input HTML to make sure it will really match the cases you care about. Remember, HTML can contain JavaScript, CDATA, XML islands, and lots of other content that can break regex matching. 2) Focus on correctness and not elegance. Regular expressions are rarely elegant to begin with - if you find something that actually works for your test cases, just use it - trying to simplify the expression may break it, and at best it's not a great use of time.
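To make point 1 concrete, here is a small sketch of the kind of check that tends to expose problems. The pattern is a hypothetical one written for this illustration (it is not the expression from the question), and `NaiveFormRegex` and `FirstMatch` are made-up names.

```csharp
using System.Text.RegularExpressions;

public static class NaiveFormRegex
{
    // Hypothetical pattern for illustration only: a <form> start tag, then
    // (lazily) anything up to an <input>, then anything up to </form>.
    private static readonly Regex FormWithInput = new Regex(
        @"<form\b[^>]*>.*?<input\b.*?</form>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text of the first match, or null when there is none.
    public static string FirstMatch(string html)
    {
        Match m = FormWithInput.Match(html);
        return m.Success ? m.Value : null;
    }
}
```

Fed `<!-- <form><input name="old"></form> -->`, this pattern reports the commented-out form as a match. Fed `<form><select></select></form><form><input></form>`, it returns one match that starts in the first form and ends in the second, because the lazy `.*?` walks straight past the first `</form>`.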

LBushkin 2010-05-05 14:22:04

@Eric: downloading an external library, shipping it with the program, creating several objects and many lines of code and learning a new library just to do what I want is overkill. You are entitled to your own opinion, but for me, I do not want to go through the above to solve a problem that can be solved by 1 line of code. All I wanted to know was if someone could come up with a better (simpler/shorter) solution using regex than the above - if not, then I must accept that, because I have chosen the wrong tool, as you mention yourself.

johnrl 2010-05-05 14:25:33

@LBushkin you're probably right. CDATA etc. would definitely be a problem, but I'm not so sure it will interfere with the webpages I am parsing. Your second piece of advice is as true as it gets, so I guess I will use the above if it turns out not to fail, or else I'll use the HTML Agility Pack as you mention. Thanks.

johnrl 2010-05-05 14:30:26

Parsing HTML with regular expressions is the "young man" approach ( http://paultyma.blogspot.com/2008/04/young-mans-business-model.html ) to handling HTML (see mini-story #1). The up-front cost for such solutions is often much lower, but the cost you end up paying for them in the long run (when you need to fix the issues you introduce) is generally higher. That said, regex may be good enough even though it sometimes fails catastrophically. Also try using a C compiler to compile code written targeting C++.

Brian 2010-05-07 20:11:59

Answer 3

A:

You should not parse HTML with regular expressions, but if you must, then what about something as simple as:

<form>[^</form>]+<input/>.+</form>

Tinus 2010-05-05 14:04:07

Doesn't work. This excludes the possibility of any other tag in between. Ex: .. <form> .. </form> <form> .. <select/> <input/> .. </form> .. is not matched at all.
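Part of the reason is that `[^</form>]` is a negated character class: it means "one character that is none of `<`, `/`, `f`, `o`, `r`, `m`, `>`", not "anything other than the string `</form>`". A quick sketch to check this, with `CharacterClassDemo` as a made-up name:

```csharp
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // The pattern from the answer above, unchanged.
    public const string Pattern = "<form>[^</form>]+<input/>.+</form>";

    public static bool Matches(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }
}
```

`<form> <input/> </form>` matches. `<form> <select/> <input/> </form>` does not, because the `<` of `<select/>` is excluded by the class. `<form>name <input/> </form>` does not either, because the letter `m` in "name" is excluded too.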

johnrl 2010-05-05 14:08:01

@johnrl -- don't give in to the madness, get the HtmlAgilityPack @LBushkin recommends.

tvanfosson 2010-05-05 14:10:12

Answer 4

+5 A:

I want to believe there is a simple regex to capture the problem

Wishing does not make it so. There is no evidence for the proposition that every problem can be solved with regular expressions, and plenty of evidence against. Your faith is not well placed.

The set of languages which are recognizable by regular expressions is called -- unsurprisingly -- the regular languages. A nice property of all regular languages is that they can be recognized by a device with finitely many states. Therefore, you can quickly figure out if a language is not regular by asking yourself the question "would I require an unbounded number of states to recognize this language?"

Consider the language of matching parens: (), ()(), (()), ()(()), and so on. To recognize this language you have to keep track of how many open parens there are waiting to be closed, and therefore you need an unbounded number of states. Therefore this language is not a regular language, and therefore it cannot be matched by a regular expression.
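As a rough sketch of that bookkeeping, here is a recognizer for the paren language. The counter plays the role of the unbounded state (an `int` is of course bounded in practice, which is fine for illustration), and `ParenChecker` is a made-up name.

```csharp
public static class ParenChecker
{
    // Returns true when every '(' is closed by a later ')' and no ')'
    // appears without an open '(' before it. Any other character is rejected.
    public static bool IsBalanced(string text)
    {
        int open = 0; // how many open parens are still waiting to be closed
        foreach (char c in text)
        {
            if (c == '(')
            {
                open++;
            }
            else if (c == ')')
            {
                if (open == 0) return false; // nothing to close
                open--;
            }
            else
            {
                return false;
            }
        }
        return open == 0;
    }
}
```

No fixed number of states can replace `open`, because the input can always nest one level deeper than any fixed limit.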

HTML is clearly the paren language but even more complicated, because now there are an infinite number of different "kinds of parens". Each tag is like an open paren that must be matched by its corresponding closing tag. Since this is an even more complex and difficult version of a non-regular language, clearly it cannot be a regular language. And therefore it cannot be matched correctly with regular expressions.

The right tool to recognize patterns in HTML is an HTML parser.

Eric Lippert 2010-05-05 14:16:03

**I never thought to look at it like that.** That really makes it crystal clear why HTML parsing with regex is a fool's errand. Thank you.

LBushkin 2010-05-05 14:25:03

For the sake of pedantry, it might be worth noting that support for backreferences means that the "regular expressions" implemented by most programming languages/runtimes are not true regular expressions and can in fact recognize non-regular languages. .NET's regular expressions even include a "balancing group" construct for recognizing languages like your paren example. Having said that, of course they are still the wrong tool for this job.
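For the curious, a sketch of the balancing group construct applied to the paren language, using only `System.Text.RegularExpressions`. The class name `BalancedParens` is made up, and this illustrates the construct only; it does not make regex suitable for HTML.

```csharp
using System.Text.RegularExpressions;

public static class BalancedParens
{
    // (?<open>\()   pushes a capture onto the "open" stack for each '('.
    // (?<-open>\))  pops one capture for each ')'; it fails if the stack is empty.
    // (?(open)(?!)) forces failure at the end if any '(' is still unclosed.
    private static readonly Regex Balanced = new Regex(
        @"^(?:(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }
}
```

The capture stack for `open` does the counting that a true regular expression cannot do.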

kvb 2010-05-05 14:43:02
