ansaurus

Question

preg_match pattern to find the contents of a string between <html> and </html> tags

Answer 1

+2 A:

You can use an HTML parser such as http://php-html.sourceforge.net/

Or you can use strip_tags: php.net/strip_tags
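For reference, a minimal sketch of what strip_tags does. The sample string is made up for illustration. Note that it removes every tag and keeps only the text, so it suits the case where you want the readable content rather than the markup between <html> and </html>.

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = "<html><body><p>Hello <b>world</b></p></body></html>";

// strip_tags() removes all HTML and PHP tags and returns the remaining text.
$text = strip_tags($html);

echo $text, "\n";
```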

Zak 2010-09-03 19:14:24

Answer 2

+1 A:

$pattern = '/<html[^>]*>([^\00]*?)<\/html>/i';

This only breaks if the content contains a 0x00 (NUL) byte, which should not occur in HTML.

aularon 2010-09-03 19:16:43

Answer 3

+1 A:

You just need to add the s modifier so that . also matches newlines:

$pattern = '/<html[^>]*>(.*?)<\/html>/si';
preg_match($pattern, $body, $matches);
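
A small sketch of how the result might be read back, assuming $body holds the raw text. The sample value below is invented for illustration. It also shows that the same pattern without the s modifier fails on multi-line input.

```php
<?php
// Hypothetical sample input; in the question, $body would be the fetched message text.
$body = "Content-Type: text/html\r\n\r\n<HTML lang=\"en\">\n<body>\n<p>Hello</p>\n</body>\n</html>\ntrailing text";

$pattern = '/<html[^>]*>(.*?)<\/html>/si';

if (preg_match($pattern, $body, $matches) === 1) {
    // $matches[0] is the whole match, $matches[1] is the captured contents.
    $inner = $matches[1];
} else {
    $inner = null;
}

// Without the s modifier, . cannot cross a newline, so nothing matches here.
$withoutS = preg_match('/<html[^>]*>(.*?)<\/html>/i', $body);
```

preg_match returns 1 on a match, 0 on no match, and false on error, so comparing against 1 keeps the three cases apart.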

Ivan Nevostruev 2010-09-03 19:20:47

Answer 4

+2 A:

[.\s] means either a literal . or a whitespace character. What you need is either (.|\s), or [\s\S], or you simply set the s modifier to have . also match line breaks.
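
To make the difference concrete, here is a rough comparison of the four variants on a made-up multi-line string. The array keys are just labels chosen for this sketch, and the alternation is written as a non-capturing group so that group 1 stays the contents.

```php
<?php
// Hypothetical multi-line input.
$text = "<html>\nline one\nline two\n</html>";

$patterns = [
    // Inside a character class, . is a literal dot, so letters are not matched.
    'dot_in_class' => '/<html>([.\s]*?)<\/html>/',
    // Any character except newline, or any whitespace (which includes newline).
    'alternation'  => '/<html>((?:.|\s)*?)<\/html>/',
    // Whitespace or non-whitespace, i.e. any character at all.
    'class_sS'     => '/<html>([\s\S]*?)<\/html>/',
    // The s modifier lets . match newlines as well.
    's_modifier'   => '/<html>(.*?)<\/html>/s',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $results[$label] = preg_match($pattern, $text);
}

print_r($results);
```

Only the first pattern fails to match; the other three capture the same contents.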

But besides that, you should not use regular expressions to match HTML. Parts of HTML are not regular and thus you cannot use regular expressions to describe it.
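
As one possible alternative to a regular expression, the sketch below uses PHP's bundled DOM extension (DOMDocument) to collect the markup inside the <html> element. This assumes the DOM extension is available, and the sample markup is invented; it is an illustration, not necessarily what the asker's setup calls for.

```php
<?php
// Hypothetical sample markup.
$html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>";

// Collect libxml warnings internally instead of printing them,
// since real-world markup is often imperfect.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// documentElement is the root element, here <html>.
$root = $doc->documentElement;

// Serialize each child of <html> to get its inner markup.
$inner = '';
foreach ($root->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner, "\n";
```

The exact whitespace of the serialized output may differ from the input, because the parser normalizes the markup.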

But besides that, you should not guess at the extent of a part in multipart content when the message already defines distinct delimiters, and those delimiters are not <html>…</html>. What if the tags are missing? Then your attempt fails. Use the delimiter defined by the message itself: the boundary value. Split the message at the boundary to get the parts, then split each part at the first CRLF+CRLF sequence to separate the header from the body.
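
The boundary approach could look roughly like this. Everything here is an assumption made for illustration: the boundary value "frontier", the sample message, and the idea that the boundary has already been read from the Content-Type header. It is a simplified sketch that does not handle nested multiparts or transfer encodings.

```php
<?php
// Hypothetical boundary value, assumed to be already taken from the Content-Type header.
$boundary = 'frontier';

// Hypothetical raw multipart body.
$raw = "preamble\r\n"
     . "--frontier\r\n"
     . "Content-Type: text/plain\r\n"
     . "\r\n"
     . "Plain text part\r\n"
     . "--frontier\r\n"
     . "Content-Type: text/html\r\n"
     . "\r\n"
     . "<html><body>Hi</body></html>\r\n"
     . "--frontier--\r\n";

// Split at the delimiter; the first chunk is the preamble and is discarded.
$chunks = explode('--' . $boundary, $raw);
array_shift($chunks);

$parts = [];
foreach ($chunks as $chunk) {
    // The closing delimiter is "--boundary--", so its chunk starts with "--".
    if (strncmp($chunk, '--', 2) === 0) {
        break;
    }
    // The first CRLF+CRLF sequence separates the part's header from its body.
    $pieces = explode("\r\n\r\n", $chunk, 2);
    $header = trim($pieces[0]);
    $bodyPart = $pieces[1] ?? '';
    // The CRLF directly before the next delimiter belongs to the delimiter.
    if (substr($bodyPart, -2) === "\r\n") {
        $bodyPart = substr($bodyPart, 0, -2);
    }
    $parts[] = ['header' => $header, 'body' => $bodyPart];
}

print_r($parts);
```

With the parts separated this way, the HTML part can be picked by its Content-Type header instead of by searching for tags.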

But besides that, why don’t you use the IMAP functions to get the body? I’m not familiar with PHP’s IMAP API, but there is probably a function that does exactly what you’re looking for.

Gumbo 2010-09-03 19:34:36
