All,
Sorry - there are enough regex questions on Stack Overflow already, but I can't figure this one out.
I'm working on a PHP script that reads the content of emails, and pulls out certain information to store in a database.
Using imap_fetchbody($imap_stream, $msg_number, 1), I'm able to get the body of the email. In some cases (especially email sent as SMS from mobile phones), the body of the email looks like this:
------=_Part_110734_170079945.1283532109852
Content-Type: text/html;charset=UTF-8;
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Multimedia Message</title>
</head>
<body leftmargin="0" topmargin="0">
<tr height="15" style="border-top: 1px solid #0F7BBC;">
<td>
SMS to email test
</td>
</tr>
</body>
</html>
------=_Part_110734_170079945.1283532109852--
I want to pull out the "content" of the email. So, my plan is this:
Check to see if the body contains the "html" tags. If not, I can read it normally (it's not an HTML email).
If it does, extract the content between the "html" tags. Then, eliminate all the other HTML tags, and the "content" is what's left.
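A minimal sketch of that plan, assuming the body is already a decoded string. The function name extract_email_text is hypothetical, and dropping the "head" block first is an added assumption so that the title text is not mistaken for content:

```php
<?php
// Hypothetical helper: returns the readable text of a message body.
function extract_email_text(string $body): string
{
    // No <html> tag: treat the body as plain text.
    if (stripos($body, '<html') === false) {
        return trim($body);
    }
    // The "s" modifier lets "." match newlines; "i" ignores case.
    if (preg_match('/<html[^>]*>(.*?)<\/html>/is', $body, $matches)) {
        $body = $matches[1];
    }
    // Assumption: drop the <head> block so the <title> text is not kept.
    $body = preg_replace('/<head[^>]*>.*?<\/head>/is', '', $body);
    // Remove the remaining tags, decode entities, collapse whitespace.
    $text = html_entity_decode(strip_tags($body), ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

strip_tags() is a blunt tool that is probably good enough for short SMS-style messages; for arbitrary HTML email a real parser would likely be safer.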
However, I'm pretty clueless when it comes to regex patterns.
I tried this:
$pattern = '/<html[^>]*>(.*?)<\/html>/i';
preg_match($pattern, $body, $matches);
// my 'content' should be in $matches[1]
But that didn't work (probably because $body contains newlines and other whitespace). So then I tried this:
$pattern = '/<html[^>]*>([.\s]*?)<\/html>/i';
preg_match($pattern, $body, $matches);
But that didn't work either.
So, what $pattern can I use to extract all the text between the "html" tags?
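For comparison, here is a small sketch of how the two attempts behave on a shortened, made-up body, next to the same pattern with the "s" (PCRE_DOTALL) modifier added. The variable names are only for illustration:

```php
<?php
// Shortened stand-in for the real message body.
$body = "<html>\n<body>\nSMS to email test\n</body>\n</html>";

// Attempt 1: without the "s" modifier, "." does not match "\n".
$first = preg_match('/<html[^>]*>(.*?)<\/html>/i', $body, $m1);      // 0

// Attempt 2: inside [...], "." is a literal dot, so [.\s] only
// matches dots and whitespace, never letters or "<".
$second = preg_match('/<html[^>]*>([.\s]*?)<\/html>/i', $body, $m2); // 0

// With "s", "." also matches newlines, so the lazy group can span lines.
$third = preg_match('/<html[^>]*>(.*?)<\/html>/is', $body, $m3);     // 1
$content = $m3[1]; // everything between <html> and </html>
```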
UPDATE: I've stumbled into a workaround - collapse runs of whitespace (including the line breaks) into single spaces first:
$body = preg_replace('/\s\s+/', ' ', $body);
$pattern = '/<body[^>]*>(.*?)<\/body>/';
I suspect this isn't the fastest or most efficient method, but it works, and is the best I've got so far. I'd still appreciate a better solution if there is one, though.
UPDATE 2: Thanks to Gumbo's suggestions, I've tried a little harder to dig through the structure of the email to find the part I was looking for, instead of attempting to parse HTML with regex. I finally found this: http://docstore.mik.ua/orelly/webprog/pcook/ch17_04.htm, which explains how to do exactly what I needed.
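One way to dig through the structure is imap_fetchstructure(), which returns a tree of part objects. The sketch below is my own illustration rather than the code from the linked page: find_text_part is a hypothetical helper that walks that tree and returns the section number to pass to imap_fetchbody(). It assumes ordinary multipart nesting and does not handle embedded message/rfc822 parts:

```php
<?php
// Hypothetical helper: walk the object returned by imap_fetchstructure()
// and return the section number of the first text part with the given
// subtype (e.g. "PLAIN" or "HTML"), or null if there is none.
function find_text_part(object $structure, string $subtype = 'PLAIN', string $prefix = ''): ?string
{
    // In the imap extension's structure objects, type 0 means "text".
    if (($structure->type ?? null) === 0
        && strtoupper($structure->subtype ?? '') === $subtype) {
        // A non-multipart message keeps its body in section "1".
        return $prefix === '' ? '1' : $prefix;
    }
    // Multipart containers list their children in ->parts (numbered from 1).
    foreach ($structure->parts ?? [] as $index => $part) {
        $number = ($prefix === '' ? '' : $prefix . '.') . ($index + 1);
        $found = find_text_part($part, $subtype, $number);
        if ($found !== null) {
            return $found;
        }
    }
    return null;
}

// Usage against a live mailbox (not run here):
// $structure = imap_fetchstructure($imap_stream, $msg_number);
// $section = find_text_part($structure, 'PLAIN') ?? find_text_part($structure, 'HTML');
// $text = imap_fetchbody($imap_stream, $msg_number, $section);
```

The fetched part may still need decoding, depending on the part's encoding value (for example base64_decode() or quoted_printable_decode()).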
Many thanks in advance!
Cheers, Matt Stuehler