Hello,

I have a huge mbox file, with maybe 500 emails in it.

It looks like the following:

From [email protected] Fri Aug 12 09:34:09 2005
Message-ID: <[email protected]>
Date: Fri, 12 Aug 2005 09:34:09 +0900
From: me <[email protected]>
User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: someone <[email protected]>
Subject: Re: (no subject)
References: <[email protected]>
In-Reply-To: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: 
X-Keywords:                 
X-UID: 371
X-Evolution-Source: imap://[email protected]/
X-Evolution: 00000002-0010

Hey

the actual content of the email

someone wrote:

> lines of quotedtext

I would like to know how I can remove all of the quoted text and strip every header except the To, From and Date lines, while keeping the messages running continuously one after another.

My goal is to print these emails in a book-like format. At the moment every program I have tried wants to print one email per page, or includes all of the headers and quoted text. Any suggestions for where to start on whipping up a small program using shell tools?

+1  A: 

As a start, I would probably use "formail" to extract the mails with just the headers you want. Alternatively, use a small state machine in awk that tracks whether you're in the header or not: while you're in the header, strip everything but the wanted headers; while you're in the body, strip the quoted lines.
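
A rough sketch of the awk state-machine idea, assuming classic mbox "From " separator lines (each one at the start of the file or right after a blank line) and treating header names as case-sensitive. The function name strip_mbox is made up for illustration, and only lines quoted with a leading ">" are dropped:

```shell
# strip_mbox: hypothetical helper, reads an mbox on stdin and prints
# only the From:, To: and Date: headers plus the unquoted body text.
strip_mbox() {
    awk '
        # A "From " line at the start of the file or after a blank line
        # begins a new message, so switch to header mode.
        /^From / && (NR == 1 || prev == "") {
            in_header = 1; keep = 0; prev = $0; next
        }

        in_header {
            if ($0 == "") {                 # blank line ends the headers
                in_header = 0
                print ""
            } else if ($0 ~ /^[ \t]/) {     # folded continuation line
                if (keep) print
            } else {                        # start of a new header field
                keep = ($0 ~ /^(From|To|Date):/)
                if (keep) print
            }
            prev = $0
            next
        }

        # Body mode: drop lines quoted with a leading ">".
        { if ($0 !~ /^>/) print; prev = $0 }
    '
}
```

Something like `strip_mbox < mail.mbox > book.txt` would then produce one continuous text file. Attribution lines such as "someone wrote:" are left in place, since they are hard to tell apart from ordinary text.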

Paul Tomblin
Hudson's answer is better than mine. Which brings up a meta question: should we delete our answers when something better comes along, or only when the answer is "bad"?
Paul Tomblin
A: 

Using shell tools may not be the best approach here, as there are libraries in many languages for dealing with mbox, be it Ruby, Perl or whatever. You will also have to consider that the quoting prefix is not always "> ", which can throw off your de-quoting process. As for extracting the headers you want, that should not be difficult in any language. I know this is a general answer, maybe not specific enough...
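
To illustrate the point about quote prefixes, here is a minimal sketch of a somewhat broader filter. The prefixes it covers (">", "|", and initials-style "JS> ", optionally indented) are only a guess at what a given mailbox contains, and drop_quotes is a made-up name:

```shell
# drop_quotes: hypothetical helper, filters message text on stdin.
# Drops lines that start with optional whitespace, then an optional
# run of 1 to 5 letters (initials-style quoting), then ">" or "|".
drop_quotes() {
    # grep exits with status 1 when every line was dropped; treat
    # that as success so the filter is safe to use in pipelines.
    grep -v -E '^[[:space:]]*([[:alpha:]]{1,5})?[>|]' || [ "$?" -eq 1 ]
}
```

A pattern this loose will also eat legitimate lines that happen to start the same way (for example a line beginning with "x> "), so it is worth eyeballing the output on a sample before printing anything.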

Keltia
+6  A: 

Mail::Box::Mbox will let you easily parse the file into separate messages. Mark Overmeer's slides from YAPC::Europe 2002 go into quite a bit of detail as to why parsing is much more difficult than it seems. The library also handles MH, IMAP and many other formats besides mbox.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Mail::Box::Manager;

    my $file = shift || $ENV{MAIL};
    my $mgr = Mail::Box::Manager->new(
        access  => 'r',
    );

    my $folder = $mgr->open( folder => $file )
        or die "$file: Unable to open: $!\n";

    for my $msg ($folder->messages)
    {
        my $to  = join( ', ', map { $_->format } $msg->to );
        my $from = join( ', ', map { $_->format } $msg->from );
        my $date = localtime( $msg->timestamp );
        my $subject = $msg->subject;
        my $body = $msg->body;

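        # Strip all quoted text, one line at a time. No /s flag here:
        # with /s, ".*" would run past the end of the line and delete
        # everything from the first quoted line to the end of the message.
        $body =~ s/^>.*\n?//mg;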

        print <<"";
    From: $from
    To: $to
    Date: $date
    $body

    }

You may want to reconsider your request to strip the quoted text: what if you have email that is formatted with interleaved replies? Stripping the quoted text would make this sort of email very hard to understand:

  Foo wrote:
  > I like bar.

  Bar?  Who likes bar?

  > It is better than baz.

  Everyone knows that.

  -- 
  Quux

Additionally, what do you plan to do with attachments, non-text/plain MIME types, encoded text entities and other oddities?

Hudson
Thanks for your answer. There are no attachments to print, all text is in English, and most mail is not interleaved with quotes, so stripping them will save a lot of paper.
Joshxtothe4
Just a quick question: does that Perl script take a parameter, or does it use the $MAIL environment variable?
Joshxtothe4
The script does take a parameter (my $file = shift) or if none is given it defaults to the environment variable (|| $ENV{MAIL}).
Hudson
This is almost exactly what I want, except for two things. The emails do not appear to be in chronological order, although they are in the original file, and there are a lot of =20 at the end of lines that I would like to remove.
Joshxtothe4
Actually, scratch that, they are not in chronological order in the original file either... damn it, I don't know how to fix that.
Joshxtothe4
Is there any easy way to sort the messages chronologically?
Joshxtothe4
The =20 are there because those bodies are quoted-printable encoded rather than sent as plain text; =20 is an encoded space. You would need a MIME parser to decode and reformat them, or, if =20 is the only special case, '$body =~ s/=20$//mg;' will do. For sorting you can do something like 'for (sort { $a->timestamp <=> $b->timestamp } $folder->messages) {...}'
Hudson
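
For anyone doing the =20 cleanup with shell tools instead, a minimal sed sketch follows. It is not a real quoted-printable decoder: it only handles a trailing =20 and the soft line break (a lone "=" at the end of a line), and the name strip_qp_residue is made up for illustration:

```shell
# strip_qp_residue: hypothetical helper, filters message text on stdin.
strip_qp_residue() {
    # 1. A trailing "=" is a quoted-printable soft line break: pull in
    #    the next line, remove the "=" and the newline, and check again.
    # 2. A trailing "=20" is an encoded space at end of line: drop it.
    sed -e ':a' \
        -e '/=$/{N;s/=\n//;ba' -e '}' \
        -e 's/=20$//'
}
```

Other =XX escapes are left untouched, and a plain-text line that legitimately ends in "=" would get joined to the next one, so this is best applied only to output that actually shows the quoted-printable residue.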