ansaurus

Question

Regular Expression to find hidden fields in html

Answer 1

A:

See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454

Radomir Dopieralski 2010-09-16 17:27:54

+1 x 1000. Don't parse (X)HTML with regex. Full stop. It gets asked here almost every day, and the answer never changes.

spender 2010-09-16 17:30:40

He's not talking about parsing all HTML, just a specific case. (http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) "It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism."

Robert Greiner 2010-09-16 17:50:36

This is a specific case of... parsing HTML. And unless you restrict yourself to one particular kind of text input that merely happens to be HTML, and you don't care about it being HTML, you are going to get subtle and random failures no matter what regular expression you come up with. Some common breakers include <tags with="funny </tags> inside attributes">, <tags>with<tags/>inside of </tags>, etc.

Radomir Dopieralski 2010-09-16 17:57:16

If I don't use a regular expression, what should I use?

Niall Collins 2010-09-16 21:14:36

How about an HTML parser? Preferably a ready-made and tested library, so that you have even less work with it than with implementing the regexp solution.

Radomir Dopieralski 2010-09-16 21:42:55

Answer 2

A:

Regular expressions are generally the wrong tool for the job when trying to search or manipulate HTML or XML; a parsing library would likely be a much cleaner and easier solution.

That said, if you're just looking through a big file and accuracy isn't critical, you can probably do reasonably well with something like <input[^>]*type="?hidden"?.
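As a rough illustration of that ad-hoc search, here is a minimal sketch in Python (an assumption, since the question names no language). The helper name `find_hidden_inputs`, the sample markup, and the `re.IGNORECASE` flag are illustrative additions, not part of the original expression. It reports the line number of each probable hit and shows that a single-quoted `type='hidden'` slips through.

```python
import re

# The pattern from the answer above. re.IGNORECASE is an added assumption
# so that <INPUT TYPE="HIDDEN"> would also be caught.
HIDDEN_INPUT = re.compile(r'<input[^>]*type="?hidden"?', re.IGNORECASE)

def find_hidden_inputs(text):
    """Return (line_number, matched_text) for each probable hidden input."""
    hits = []
    for match in HIDDEN_INPUT.finditer(text):
        # Count the newlines before the match to get a 1-based line number.
        line_number = text.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(0)))
    return hits

sample = """<form>
<input type="hidden" name="token" value="abc">
<input type=hidden name=id value=7>
<input type="text" name="q">
<input type='hidden' name="missed">
</form>"""

for line_number, matched in find_hidden_inputs(sample):
    print(line_number, matched)
```

The hits on lines 2 and 3 are found, the text input on line 4 is correctly ignored, and the single-quoted input on line 5 is missed. That is acceptable for a quick search, but not for anything that must be exact.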

ngroot 2010-09-16 17:30:08

ngroot, that expression is only a partial match.

Brad 2010-09-16 17:44:47

That's correct. He asked for an expression that would find these tags, which this will usually do. What's it matter if it matches on the whole tag?

ngroot 2010-09-16 17:49:40

I don't think finding *half* of the tag will really help, but I see your point. It won't let me un-down-vote you though.

Brad 2010-09-16 20:55:24

Half the tag is just fine if you are, as the author requested, just looking to *find* the tags. That's what I usually do if I'm doing ad-hoc searches of documents; I use the shortest expression that will take me to what I'm looking for. If he wants to do something more complex, like replace them, a regex is really not a safe tool to be using anyway.

ngroot 2010-09-16 22:51:24

Answer 3

+2 A:

I agree with the link Radomir suggests that HTML should not be parsed with regular expressions. However, I do not agree that nothing meaningful can be gleaned from using the two together, and the ensuing rant is totally counter-productive.

To correct Robert's RegEx:

<([^<]*)type=('|")hidden('|")>[^<]*(/>|</.+?>)

Brad 2010-09-16 17:41:59

Not even close. For example, try `<input type =hidden name =surname value =smith>` or `<input type=text name=info value="type='hidden's how to carry data between pages." >`. And both of those examples are valid html. Never mind the problems when processing real world html. Use a parser.
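To make the "use a parser" advice concrete, here is a minimal sketch that runs the two examples above through Python's standard-library `html.parser`. The language choice, the class name `HiddenInputFinder`, and the helper `hidden_inputs_via_parser` are assumptions made for illustration. Any tested HTML parsing library would serve the same purpose.

```python
from html.parser import HTMLParser

class HiddenInputFinder(HTMLParser):
    """Collect the attributes of every <input> whose type is hidden."""

    def __init__(self):
        super().__init__()
        self.hidden_inputs = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        attributes = dict(attrs)
        # A valueless attribute arrives as None, hence the "or ''" guard.
        input_type = (attributes.get("type") or "").lower()
        if tag == "input" and input_type == "hidden":
            self.hidden_inputs.append(attributes)

def hidden_inputs_via_parser(html):
    finder = HiddenInputFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_inputs

# The two examples from the comment above, concatenated.
examples = (
    '<input type =hidden name =surname value =smith>'
    '''<input type=text name=info value="type='hidden's how to carry data between pages." >'''
)

print(hidden_inputs_via_parser(examples))
```

The parser reports only the first input as hidden. It tolerates the spaces around `=` and treats the second input's `value` as plain attribute text rather than markup.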

Alohci 2010-09-16 19:04:18

@Alohci, *no doubt* you should use a parser if you can for ANYTHING xml. @Niall, if you need the optional spaces in the expression to handle the cases Alohci brought up, it shouldn't be too hard. Ugly, yes, but not too hard.

Brad 2010-09-16 20:54:07

Answer 4

+1 A:

I know you asked for a regular expression, but download Html Agility Pack and do the following:

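// Assumes htmlDoc is a loaded HtmlAgilityPack.HtmlDocument and `using System;` is in scope.
var inputs = htmlDoc.DocumentNode.Descendants("input");
foreach (var input in inputs)
{
    // GetAttributeValue returns the default ("") when the attribute is missing,
    // which avoids a NullReferenceException on inputs without a type attribute.
    if (input.GetAttributeValue("type", "").Equals("hidden", StringComparison.OrdinalIgnoreCase))
    {
        // do something
    }
}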

You can also use XPath with Html Agility Pack.
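To show what an XPath query for hidden inputs looks like, here is a small sketch using Python's standard-library `xml.etree.ElementTree` rather than Html Agility Pack. This is a swapped-in tool, chosen only so the example is self-contained. It assumes well-formed XHTML and will raise `ParseError` on ordinary tag-soup HTML. The sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup is well-formed XML (XHTML). ElementTree supports
# only a subset of XPath, which is enough for an attribute-equality test.
xhtml = """<form>
  <input type="hidden" name="token" value="abc" />
  <input type="text" name="q" />
  <input type="hidden" name="page" value="2" />
</form>"""

root = ET.fromstring(xhtml)

# Select every <input> descendant whose type attribute equals "hidden".
hidden = root.findall(".//input[@type='hidden']")
names = [element.get("name") for element in hidden]
print(names)
```

Note that the comparison in the predicate is case-sensitive, so `type="HIDDEN"` would not be selected by this expression.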

Mikael Svenson 2010-09-16 17:44:01
