Hey Guys,

I want to match any of these cases with a regex. I have the header text, but I need to match it with the (possible) corresponding HTML:

<h1>header title</h1>
<h2>site | header title</h2>
<h3 class="header">header title</h3>
<h2>header title 23 jan 2009</h2>
<h1>header title</h1>

I have this:

/(<(h1|h2|h3))(.+?)".$title."(.+?)(<\/\\2>)/i

But it doesn't always work, and I don't see why.

Thanks

+4  A: 

Don't use regexes to parse HTML! Use an HTML parser, instead.

Hank Gay
A: 

Is $title regex-escaped (so characters like {, [ etc. are escaped)?

Line endings may be a problem too: by default `.` does not match a newline, so a header that spans several lines will not match unless you turn on the dot-matches-newline option, if your regex implementation supports it.
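Both points can be sketched together. This is a minimal illustration in Python rather than the asker's PHP; in PHP, `preg_quote($title, '/')` and the `s` modifier play the roles of `re.escape` and `re.DOTALL`. The function name `find_header` is made up for the example. The sketch also changes one thing not raised above: it allows zero characters around the title, since the original `(.+?)` after the title requires at least one character before the closing tag, which `<h1>header title</h1>` does not have.

```python
import re

def find_header(html, title):
    """Return the first h1-h3 element whose content contains `title`, or None."""
    pattern = re.compile(
        r"<(h[1-3])\b[^>]*>"        # opening h1/h2/h3 tag with optional attributes
        r"(?:(?!</\1>).)*?"         # any text, but never past this header's closing tag
        + re.escape(title)          # the known title, with metacharacters escaped
        + r"(?:(?!</\1>).)*?"       # optional trailing text, same restriction
        r"</\1>",                   # closing tag that matches the opening one
        re.IGNORECASE | re.DOTALL,  # DOTALL lets "." match newlines too
    )
    match = pattern.search(html)
    return match.group(0) if match else None
```

For instance, `find_header('<h3 class="header">\nheader title\n</h3>', "header title")` matches the whole element even though the title sits on its own line, and a title such as `price [new] (2009)?` is matched literally instead of being read as regex syntax.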

It is better to process structured data with the appropriate tools: XML with an XML parser, HTML with an HTML parser. There are parsers like BeautifulSoup in Python, Hpricot in Ruby, libxml2...
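As a rough sketch of the parser route, here is one way to do it with `html.parser` from Python's standard library (used instead of BeautifulSoup only to keep the example free of third-party dependencies). The names `HeaderCollector` and `headers_containing` are invented for the illustration, and the case-insensitive substring test is an assumption about what "match" should mean here.

```python
from html.parser import HTMLParser

class HeaderCollector(HTMLParser):
    """Collect (tag, text) pairs for every h1, h2 and h3 element."""

    HEADER_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headers = []       # finished (tag, text) pairs
        self._open_tag = None   # header tag currently being read, if any
        self._chunks = []       # text pieces seen inside that header

    def handle_starttag(self, tag, attrs):
        # Start collecting when a header opens (tag names arrive lowercased).
        if tag in self.HEADER_TAGS and self._open_tag is None:
            self._open_tag = tag
            self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside a header, including nested inline tags.
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Close the header and store its accumulated text.
        if tag == self._open_tag:
            self.headers.append((tag, "".join(self._chunks).strip()))
            self._open_tag = None

def headers_containing(html, title):
    """Return the headers whose text contains `title`, ignoring case."""
    parser = HeaderCollector()
    parser.feed(html)
    parser.close()
    return [(tag, text) for tag, text in parser.headers
            if title.lower() in text.lower()]
```

Because the title is compared as plain text, nothing needs escaping, and attributes such as `class="header"` or inline markup inside the header do not get in the way.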

Messa
A: 

What you (logically) want for your example is something like:

<(group of anything not including ">")> (value to extract) <(group of anything not including ">")>

e.g.

<[^>]+>([^<]+)<[^>]+>

The specific regex syntax is a bit dependent on what environment you're working on.

You can get away with this if you're sure what you're parsing is no more complicated than your example. However, you really shouldn't be parsing HTML (or XML) with a regex (as someone has already noted here), because XML can be arbitrarily nested and a regex can't deal with that.
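To make the limitation concrete, here is a small Python sketch that uses a variant of the pattern with `+` quantifiers on the tag parts, so that tag names longer than one character match. The helper name `extract_text` is made up for the illustration.

```python
import re

# tag, then text containing no "<", then another tag
TAG_TEXT_TAG = re.compile(r"<[^>]+>([^<]+)<[^>]+>")

def extract_text(html):
    """Return the text between the first tag and the next tag, or None."""
    match = TAG_TEXT_TAG.search(html)
    return match.group(1) if match else None
```

On flat markup such as `<h1>header title</h1>` this returns the full header text. On `<h1>site <em>header</em> title</h1>` it returns only `"site "`, because the pattern stops at the first nested tag and has no notion of which closing tag belongs to which opening tag.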

Steve B.
It's in PHP, and actually I only want the header tags h1, h2, h3. So it would be: <h1, h2 or h3, then anything, then the text string I know, ending with /h1, h2 or h3>
Yvo