Hello, I am parsing out some emails. Mobile Mail on the iPhone (and, I assume, the iPod touch) appends the signature as a separate boundary, making it simple to remove. Not all mail clients do this; some just use '--' as a signature delimiter.

I need to chop off the '--' and everything after it from a string, but only from the last occurrence of it.

Sample copy

 hello, this is some email copy-- check this out
 --
 Tom Foolery

I thought about splitting on '--' and removing the last part, but neither explode() nor split() seems to return a useful value for telling me whether it did anything when there is no match.

I cannot get preg_replace to match across more than one line. I have standardized all line endings to \n.

What is the best way to end up with "hello, this is some email copy-- check this out", taking note that there will be cases where there is no signature, and there are of course going to be cases that I cannot cover?

+5  A: 

Actually, the correct signature delimiter is "-- \n" (note the space before the newline), so the delimiter regexp should be '^-- $' (with the m modifier, so that ^ and $ match at line boundaries). You might consider using '^--\s*$' instead, so that it also works with OE, which gets it wrong.
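As a rough sketch of how that delimiter regexp could be applied to the "last occurrence only" requirement from the question: the helper below collects every delimiter line and cuts the body at the last one. The function name stripSignature is made up for illustration, and the code assumes line endings have already been normalized to \n, as the question states.

```php
<?php
// Hypothetical helper (the name is ours, not from the thread):
// cut the body at the LAST line that consists only of "--" plus optional whitespace.
function stripSignature(string $body): string
{
    // The 'm' modifier makes ^ and $ match at line boundaries.
    // PREG_OFFSET_CAPTURE returns each match as [matched text, byte offset].
    if (!preg_match_all('/^--\s*$/m', $body, $matches, PREG_OFFSET_CAPTURE)) {
        return $body; // no delimiter line found: leave the body untouched
    }

    $last = end($matches[0]); // the last delimiter line in the body

    // Keep everything before the delimiter, minus the trailing line break(s).
    return rtrim(substr($body, 0, $last[1]), "\n");
}

echo stripSignature("hello, this is some email copy-- check this out\n--\nTom Foolery");
```

Because the pattern is anchored to a whole line, the "--" inside "copy-- check this out" is not treated as a delimiter, and a body without any delimiter line comes back unchanged.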

vartec
I was unaware there was a standard for signature format. Can you cite?
John Saunders
RFC3676 section 4.3
vartec
Which would be http://tools.ietf.org/html/rfc3676#section-4.3. As the RFC states, it's more a widely accepted convention than a real standard.
Tomalak
good information but I highly doubt that you could expect it to be consistent.
Kibbee
@Kibbee: most mailers follow this RFC. Some (like e.g. OE) strip *all* trailing whitespace, '^--\s*$' works in both cases.
vartec
Apple Mail, for example, lets you make a sig. I put in '--', but forget at times to put in the '-- '. It certainly allows you to omit the '-- ' entirely if you so desire. Email is about the most amazing mess I have ever dealt with.
@scott: true, but then there's nothing that can be done about signatures that don't comply.
vartec
+2  A: 

Try this:

preg_replace('/--[\r\n]+.*/s', '', $body)

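This will remove everything from the first occurrence of -- followed by one or more line break characters onward. If you only want to cut at the last occurrence, use /^(.*)--[\r\n]+.*$/s with '$1' as the replacement instead: the greedy (.*) captures everything up to the last such delimiter, and the replacement keeps only that captured part.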
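To make the first-versus-last distinction concrete, here is a small sketch. The two function names are hypothetical, and the "last occurrence" variant relies on a greedy capture group whose contents are put back via '$1'; this is one possible way to do it, not necessarily what the answer had in mind.

```php
<?php
// Hypothetical helpers illustrating first vs. last occurrence of "--" + line break.

// Drops the FIRST "--" followed by line break(s), and everything after it.
function removeFromFirstDelimiter(string $body): string
{
    return preg_replace('/--[\r\n]+.*/s', '', $body) ?? $body;
}

// Drops the LAST "--" followed by line break(s), and everything after it.
// The greedy (.*) extends as far as possible, so it stops at the final delimiter;
// '$1' puts the captured prefix back.
function removeFromLastDelimiter(string $body): string
{
    return preg_replace('/^(.*)--[\r\n]+.*$/s', '$1', $body) ?? $body;
}

$body = "part one\n--\npart two\n--\nTom Foolery";
var_dump(removeFromFirstDelimiter($body)); // "part one\n"
var_dump(removeFromLastDelimiter($body));  // "part one\n--\npart two\n"
```

When the pattern does not match at all, preg_replace returns the subject unchanged, so a body without a delimiter is left alone.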

Gumbo
Just to clarify: the final /s makes the regex treat the whole string as a [S]ingle line, i.e. the dot also matches line break characters.
Piskvor
Thanks, can you elaborate on how either of those would target the *last* occurrence? What if there is a plain text part, and someone puts a -- in the middle of it, as well as a signature? I have been considering reversing the string, finding the first occurrence, and then putting it back.
+1  A: 

Instead of just chopping off everything after --, could you not cache the last few emails sent by that user or service and compare them? The bit at the bottom that looks like the others can be safely removed, leaving the proper message intact.

Tom
I have considered things like this. With Mobile Mail on the iPhone and iPod touch, Gmail, Outlook, and all the ways in which people move around these days, I figure there is no way to get a clear idea of what client they will be using at any given time.
+1  A: 

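I think, in the interest of being more bulletproof, I will take the non-regex route. strrpos() returns false when there is no "\n--", so that case has to be checked, otherwise substr() would return an empty string:

        $pos = strrpos($body, "\n--");
        echo $pos === false ? $body : substr($body, 0, $pos);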
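A slightly fuller sketch of the same non-regex idea, wrapped in a function that also hands back the signature. The name splitAtLastDelimiter is hypothetical. Two assumptions to be aware of: "\n--" also matches lines that merely start with "--" (for example "--foo"), and a delimiter on the very first line of the body is not detected because there is no preceding "\n".

```php
<?php
// Hypothetical helper: split a body into [message, signature] at the last "\n--".
// The signature is null when no delimiter is found.
function splitAtLastDelimiter(string $body): array
{
    $pos = strrpos($body, "\n--");
    if ($pos === false) {
        return [$body, null]; // no delimiter: the whole body is the message
    }

    // Skip the 3 bytes of "\n--", then trim the rest of the delimiter line
    // (optional trailing space and the line break).
    $signature = ltrim(substr($body, $pos + 3));

    return [substr($body, 0, $pos), $signature];
}

[$message, $signature] = splitAtLastDelimiter(
    "hello, this is some email copy-- check this out\n--\nTom Foolery"
);
echo $message; // hello, this is some email copy-- check this out
```

The explicit `=== false` comparison matters because strrpos() can legitimately return 0, which is also falsy in a loose comparison.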