All,

Sorry - there are already enough regex questions on Stack Overflow, but I can't figure this one out.

I'm working on a PHP script that reads the content of emails, and pulls out certain information to store in a database.

Using imap_fetchbody($imap_stream, $msg_number, 1), I'm able to get the body of the email. In some cases (especially email sent as SMS from mobile phones), the body of the email looks like this:

===------=_Part_110734_170079945.1283532109852
Content-Type: text/html;charset=UTF-8;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<html> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
        <title>Multimedia Message</title> 
    </head> 
    <body leftmargin="0" topmargin="0"> 


                <tr height="15" style="border-top: 1px solid #0F7BBC;"> 
                    <td> 
                        SMS to email test
                    </td> 
                </tr> 


     </body> 
</html> 


------=_Part_110734_170079945.1283532109852--===

I want to pull out the "content" of the email. So, my plan is this:

Check to see if the body contains the "html" tags. If not, I can read it normally (it's not an HTML email).

If it does, extract the content between the "html" tags. Then, eliminate all the other HTML tags, and the "content" is what's left.

However, I'm pretty clueless when it comes to regex patterns.

I tried this:

$pattern = '/<html[^>]*>(.*?)<\/html>/i';
preg_match($pattern, $body, $matches);
// my 'content' should be in $matches[1]

But that didn't work (probably because $body contains newlines and other whitespace). So then I tried this:

$pattern = '/<html[^>]*>([.\s]*?)<\/html>/i';
preg_match($pattern, $body, $matches);

But that didn't work either.

So, what $pattern can I use to extract all the text between the "html" tags?

UPDATE: I've stumbled into a workaround - strip all the whitespace first:

$body = preg_replace('/\s\s+/', ' ', $body);
$pattern = '/<body[^>]*>(.*?)<\/body>/';

I suspect this isn't the fastest or most efficient method, but it works, and is the best I've got so far. I'd still appreciate a better solution if there is one, though.

UPDATE 2: Thanks to Gumbo's suggestions, I've tried a little harder to dig through the structure of the email to find the part I was looking for, instead of attempting to regex HTML. I finally found this: http://docstore.mik.ua/orelly/webprog/pcook/ch17_04.htm, which explains how to do exactly what I needed.

Many thanks in advance!

Cheers, Matt Stuehler

+2  A: 

You can use an HTML parser such as http://php-html.sourceforge.net/

Or you can use strip_tags: php.net/strip_tags
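
A minimal sketch of the strip_tags route, in case it helps. The helper name html_to_text and the sample markup are made up for illustration; strip_tags, html_entity_decode and preg_replace are standard PHP functions.

```php
<?php
// Hypothetical helper (not from the original post): reduce an HTML body to plain text.
function html_to_text(string $html): string
{
    // Remove every tag; the text between the tags is kept.
    $text = strip_tags($html);
    // Decode entities such as &amp;, which strip_tags leaves untouched.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace (including newlines) into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample resembling the email body in the question.
$html = "<html>\n<head><title>Multimedia Message</title></head>\n"
      . "<body><tr><td>\n  SMS to email test\n</td></tr></body>\n</html>";

echo html_to_text($html), "\n";
```

Note that strip_tags keeps the text inside the title element as well, so it is probably worth applying it only to the body portion if the title is not wanted.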

Zak
+1  A: 
$pattern = '/<html[^>]*>([^\00]*?)<\/html>/i';

That will only break if there's a 0x00 byte in the content, which should not happen.

aularon
+1  A: 

You just need to add the s modifier to allow . to match newlines:

$pattern = '/<html[^>]*>(.*?)<\/html>/si';
preg_match($pattern, $body, $matches);
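
A small sketch comparing the two patterns on a multi-line body. The sample $body and the variable names are made up for illustration.

```php
<?php
// Made-up multi-line body for illustration.
$body = "<html>\n  <body>\n    SMS to email test\n  </body>\n</html>";

// Without the s modifier, . stops at newlines, so a multi-line body never matches.
$without = preg_match('/<html[^>]*>(.*?)<\/html>/i', $body, $m1);

// With the s modifier (PCRE_DOTALL), . also matches newlines.
$with = preg_match('/<html[^>]*>(.*?)<\/html>/si', $body, $m2);

var_dump($without, $with);

// The captured group still contains markup, so strip it before use.
echo trim(strip_tags($m2[1])), "\n";
```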
Ivan Nevostruev
+2  A: 

[.\s] means either a literal . or a whitespace character. What you need is either (.|\s), or [\s\S], or you simply set the s modifier to have . also match line breaks.
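
To make the difference concrete, here is a small sketch that runs each variant against the same multi-line input. The sample body and the labels are made up; the (.|\s) variant is written with a non-capturing group so the capture index stays the same.

```php
<?php
// Made-up multi-line input for illustration.
$body = "<html>\nline one\nline two\n</html>";

$patterns = [
    '[.\s]'  => '/<html[^>]*>([.\s]*?)<\/html>/i',      // literal dot or whitespace only
    '(.|\s)' => '/<html[^>]*>((?:.|\s)*?)<\/html>/i',   // any character, or whitespace
    '[\s\S]' => '/<html[^>]*>([\s\S]*?)<\/html>/i',     // whitespace or non-whitespace
    's flag' => '/<html[^>]*>(.*?)<\/html>/si',         // . matches line breaks too
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match returns 1 on a match and 0 when nothing matches.
    $results[$label] = preg_match($pattern, $body);
    printf("%-7s => %d\n", $label, $results[$label]);
}
```

Only the [.\s] variant fails here, because the letters in the body are neither dots nor whitespace.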

But besides that, you should not use regular expressions to match HTML. Parts of HTML are not regular and thus you cannot use regular expressions to describe it.

But besides that, you should not try to guess the range of a multipart content when you have distinct delimiters, and those delimiters aren’t <html>…</html>: if the tags are missing, your attempt will fail. Use the delimiters defined by the message itself: the boundary value. Use the boundary to get the parts, then split each part at the first CRLF+CRLF sequence to separate the header from the body.
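
A rough sketch of that boundary-based split, under some assumptions: the boundary value has already been read from the message's Content-Type header, every part carries at least one header line, and the helper name split_multipart is made up for illustration. It is a simplification, not a full MIME parser.

```php
<?php
// Hypothetical helper: split a raw multipart body into parts using its boundary,
// then separate each part's headers from its content at the first blank line.
function split_multipart(string $raw, string $boundary): array
{
    $parts = [];
    // Each part is introduced by "--" followed by the boundary value.
    $chunks = explode('--' . $boundary, $raw);
    array_shift($chunks);                       // drop the preamble before the first delimiter
    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {   // closing delimiter: "--boundary--"
            break;
        }
        // Headers end at the first CRLF CRLF; bare LF line endings are tolerated too.
        $pieces = preg_split('/\r?\n\r?\n/', ltrim($chunk, "\r\n"), 2);
        $parts[] = [
            'headers' => $pieces[0],
            'body'    => isset($pieces[1]) ? trim($pieces[1]) : '',
        ];
    }
    return $parts;
}

// Sample data modelled on the email in the question (shortened).
$boundary = '----=_Part_110734_170079945.1283532109852';
$raw = '--' . $boundary . "\r\n"
     . "Content-Type: text/html;charset=UTF-8;\r\n"
     . "Content-Transfer-Encoding: 7bit\r\n"
     . "\r\n"
     . "<html><body>SMS to email test</body></html>\r\n"
     . '--' . $boundary . "--\r\n";

$parts = split_multipart($raw, $boundary);
echo $parts[0]['headers'], "\n";
echo strip_tags($parts[0]['body']), "\n";
```

With the parts separated this way, the Content-Type header of each part tells you whether it is HTML, so there is no need to look for <html> tags at all.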

But besides that, why don’t you use the IMAP functions to get the body? I’m not familiar with PHP’s IMAP API, but there is probably a function that does exactly what you’re looking for.
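
One possible shape for this, as a sketch only: imap_fetchstructure() returns an object describing the MIME parts, and the section number of the wanted part can then be passed to imap_fetchbody(). The helper find_text_section and the mock structure below are made up for illustration, and the numbering logic ignores special cases such as embedded message/rfc822 parts.

```php
<?php
// Hypothetical helper: walk the object returned by imap_fetchstructure() and
// return the section number of the first text part with the given subtype.
function find_text_section(object $structure, string $subtype, string $prefix = ''): ?string
{
    if (empty($structure->parts)) {
        // Leaf part: type 0 means "text". A single-part message's body is section 1.
        $isMatch = $structure->type === 0
            && strcasecmp($structure->subtype ?? '', $subtype) === 0;
        return $isMatch ? ($prefix === '' ? '1' : $prefix) : null;
    }
    foreach ($structure->parts as $index => $part) {
        // Sections are numbered from 1 and nested with dots, e.g. "1.2".
        $section = ($prefix === '' ? '' : $prefix . '.') . ($index + 1);
        $found = find_text_section($part, $subtype, $section);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// Mock of a structure object, so the sketch runs without a mail server.
$structure = (object) [
    'type' => 1, 'subtype' => 'MIXED',
    'parts' => [
        (object) ['type' => 1, 'subtype' => 'ALTERNATIVE', 'parts' => [
            (object) ['type' => 0, 'subtype' => 'PLAIN', 'encoding' => 0],
            (object) ['type' => 0, 'subtype' => 'HTML', 'encoding' => 4],
        ]],
        (object) ['type' => 5, 'subtype' => 'JPEG', 'encoding' => 3],
    ],
];

$plain = find_text_section($structure, 'PLAIN');
$html  = find_text_section($structure, 'HTML');
echo $plain, "\n", $html, "\n";

// With a live connection this would be used roughly as:
//   $structure = imap_fetchstructure($imap_stream, $msg_number);
//   $section   = find_text_section($structure, 'PLAIN');
//   $text      = imap_fetchbody($imap_stream, $msg_number, $section);
```

The fetched part may still need decoding according to its encoding field, for example with imap_base64() for encoding 3 or imap_qprint() for encoding 4.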

Gumbo