Hi,

I'm working on a regular expression to capture a large block of text that sits between

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

and a comment that marks the end of the menu, which looks like this:

<!--END MENU-->

This is the code I wrote, but it's not returning the matching text:

$value = preg_match('/^<!DOCTYPE html PUBLIC \"-\/\/W3C\/\/DTD XHTML 1.0 Transitional\/\/EN\" \"http:\/\/www.w3.org\/TR\/xhtml1\/DTD\/xhtml1-transitional.dtd\">(.*?)<!--END MENU-->/',$content, $matching_text);

echo $matching_text[0];

Can anyone help me with this please?

Thanks in advance :)

+1  A: 

You cannot reliably parse HTML with regular expressions. Use an HTML parser instead.
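
As a rough sketch of the parser route, the snippet below uses PHP's bundled DOM extension to find the `END MENU` comment and collect the sibling nodes that come before it. The sample markup and the variable names are made up for illustration, and it assumes the comment sits directly inside `<body>`. It returns only the markup before the marker within the same parent element, not the whole document from the doctype onward, which is narrower than what the question asks for.

```php
<?php
// Hypothetical sample markup standing in for the real $content.
$content = <<<'HTML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Demo</title></head>
<body>
<ul id="menu"><li>Home</li><li>About</li></ul>
<!--END MENU-->
<p>Page body</p>
</body></html>
HTML;

// Suppress warnings about imperfect markup while parsing.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

// Locate the comment node whose text is "END MENU".
$xpath  = new DOMXPath($doc);
$marker = $xpath->query('//comment()[normalize-space(.) = "END MENU"]')->item(0);

// Serialize every sibling node that precedes the marker.
$menuHtml = '';
if ($marker !== null) {
    foreach ($xpath->query('preceding-sibling::node()', $marker) as $node) {
        $menuHtml .= $doc->saveHTML($node);
    }
}

echo trim($menuHtml), "\n";
```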

Andy Lester
+2  A: 

Although I would normally agree with Andy, you should be able to parse this portion of an HTML string out given the specific beginning and end.

In PCRE, the . (dot) does not match newline characters unless the s (PCRE_DOTALL) modifier is set; the m modifier only changes how ^ and $ behave. Add s after the closing delimiter of your pattern and give it a shot.
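
A small sketch of the difference, using a made-up multi-line string and a placeholder `<start>` delimiter instead of the real page. In PHP's PCRE flavour the `s` modifier (PCRE_DOTALL) is what lets the dot cross line breaks:

```php
<?php
// Hypothetical sample; in the question, $content holds the real page.
$content = "<start>\nline one\nline two\n<!--END MENU-->";

// No s modifier: the dot cannot cross a newline, so nothing matches.
$withoutS = preg_match('/<start>(.*?)<!--END MENU-->/', $content, $m1);

// With s (PCRE_DOTALL): the dot also matches newlines, so the capture spans lines.
$withS = preg_match('/<start>(.*?)<!--END MENU-->/s', $content, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
echo $m2[1];         // the text between the two delimiters
```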


Jason McCreary
+1  A: 

First, certain characters in your regex need to be escaped, e.g. the dots.

Second, even if your current regex works, it won't match many HTML documents, because it is tied to one exact doctype string.

In my opinion, you should use this regex instead:

  /<!doctype\s*html\b[^><]+>(.*?)<!--\s*end\s+menu\s*-->/ism
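
For illustration, here is how that pattern might be applied with `preg_match`. The sample page and the `$between` variable are invented for this sketch. Note that `$matches[0]` holds the whole match including the doctype and the end comment, while `$matches[1]` holds only the text in between:

```php
<?php
// Hypothetical sample page; the real $content would come from wherever the HTML is loaded.
$content = <<<'HTML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<ul><li>Home</li></ul>
<!--END MENU-->
<p>Rest of page</p>
</body>
</html>
HTML;

$pattern = '/<!doctype\s*html\b[^><]+>(.*?)<!--\s*end\s+menu\s*-->/ism';

if (preg_match($pattern, $content, $matches) === 1) {
    // Group 1 is the text between the doctype and the END MENU comment.
    $between = $matches[1];
    echo $between;
} else {
    $between = null;
    echo "No match\n";
}
```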
Vantomex
+1 for noting the exactness of the OP's regex. I still think even yours will need the `s` modifier so the dot can match newlines.
Jason McCreary
Oops, thanks @Jason for catching that. :-)
Vantomex