I'm parsing some HTML using regex, and I want to match lines that start with a word rather than an HTML tag, while also stripping the leading whitespace. Using C# regex, my first pattern was:

pattern = @"^\s*([^<])";

which attempts to consume all the leading whitespace and then capture a non-'<' character. Unfortunately, if the line contains only whitespace before the first '<', the match still succeeds: the capture is the last whitespace character before the '<'. I would like the match to fail in that case.
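To make the behaviour concrete, here is a minimal sketch of it. The sample lines and the names `FirstPatternDemo` and `Capture` are made up for illustration; only the pattern comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstPatternDemo
{
    // The pattern from the question: leading whitespace, then one non-'<' character.
    public const string Pattern = @"^\s*([^<])";

    // Returns the text of capture group 1, or null when the pattern does not match.
    public static string Capture(string line)
    {
        Match m = Regex.Match(line, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Line that starts with text: the group captures the first letter.
        Console.WriteLine("[" + Capture("   Hello <b>world</b>") + "]"); // prints "[H]"

        // Only whitespace before the tag: \s* backtracks by one character so
        // that [^<] can match, and the group captures that space.
        Console.WriteLine("[" + Capture("   <p>Hello</p>") + "]");       // prints "[ ]"
    }
}
```

The second call is the unwanted case: whitespace is itself a non-'<' character, so the regex engine gives one space back from `\s*` and the match succeeds.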

Any ideas?

+3  A: 

Don't use regular expressions to parse HTML. It's a really bad idea and, at best, your code will be flaky. Whatever your language or platform, you'll have a fully functional HTML parser available. Just use that.

There is no way a regular expression can correctly handle all the cases of escaping, entity use and so on.

cletus
+3  A: 

Parsing HTML with regular expressions has been discussed a lot. Refer to this post:

Using regular expressions to parse HTML: why not?

Jérôme
+1  A: 

Can I refer you to my answer to another, similar question?

Brian Agnew
+1  A: 

Asked the question too soon; I just worked out this:

pattern = @"^\s*((?!\s)[^<]+)";
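As a rough check of how this pattern behaves, here is a small sketch. The sample lines and the names `LookaheadPatternDemo` and `Capture` are hypothetical; only the pattern is from the post. The negative lookahead `(?!\s)` stops the capture from starting on a whitespace character, so backtracking out of `\s*` can no longer rescue the match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadPatternDemo
{
    // The revised pattern: the capture must not begin with whitespace.
    public const string Pattern = @"^\s*((?!\s)[^<]+)";

    // Returns the text of capture group 1, or null when the pattern does not match.
    public static string Capture(string line)
    {
        Match m = Regex.Match(line, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Text before the first tag is captured without the leading whitespace.
        Console.WriteLine("[" + Capture("   Hello <b>world</b>") + "]");   // prints "[Hello ]"

        // Only whitespace before the tag: the match now fails.
        Console.WriteLine(Capture("   <p>Hello</p>") == null);             // prints "True"

        // A whitespace-only line fails as well.
        Console.WriteLine(Capture("      ") == null);                      // prints "True"
    }
}
```

Note that the capture keeps any whitespace that sits between the text and the next '<' (the trailing space in `"Hello "` above), so calling `Trim()` on the result may still be needed.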

Thanks for the feedback about regex and HTML; I'll bear it in mind for the future. I'm writing a utility program to make a few pages multi-language (i.e., adding asp:literals for hardcoded text, etc.). I think regex is sufficient for this purpose, but if there are better tools please let me know (web stuff isn't my area...).

Patrick