I have a huge number of mail archives that I want to de-duplicate and sort out. Each archive is either in mbox format or contains a single mail message. To add a bit of complication, some of the files have Windows EOL sequences, and some have Unix EOL sequences.
Using C#, how do I read an archive and split it into its individual messages, or read a single message file? In Python, I would use the mailbox.mbox class, but I cannot see any matching functionality in the C# documentation.

+1  A: 

It is unlikely that you will find a library to read that file for C# - there aren't that many Unix users who also use C#.

What I would do would be either to:

  1. Read the Python code, and then port it to C#
  2. Find the description of the mbox format online. As it is a Unix format, chances are that it is just a plain text file, which should be easy enough to parse.
tomjen
I had a feeling this was going to happen. I'm not sure the mbox format is Unix-only (I think Thunderbird uses it on Windows), and it isn't that complicated: just dumps of RFC 2822 messages, each preceded by a "From [date]" line.
Simon Callan
A: 

Most standard Unix mail files delimit entries with a line starting with "From " (note the trailing space).

So if you read the mail file as a text file and start a new mail entry every time you see the string "From " at the start of a line, it should work. Any body lines that happen to start with "From " should already have been escaped by the email program, typically as ">From ".
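As a rough illustration of the approach above, here is a minimal C# sketch. The class and method names (`MboxSplitter`, `Split`, `SplitFile`) are made up for this example and are not part of any library. It assumes the writing program escaped body lines that start with "From ", that ISO-8859-1 is an acceptable way to read the raw bytes as text, and that a file with no "From " separator line is a single message. `TextReader.ReadLine` accepts both Windows and Unix line endings, so mixed files need no special handling.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MboxSplitter
{
    // Yields each message as one string with "\n" line endings.
    // The "From " separator lines themselves are dropped.
    public static IEnumerable<string> Split(TextReader reader)
    {
        var current = new StringBuilder();
        string line;
        // ReadLine treats "\n", "\r" and "\r\n" as line terminators.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith("From ", StringComparison.Ordinal))
            {
                // Separator: emit the message collected so far, if any.
                if (current.Length > 0)
                {
                    yield return current.ToString();
                }
                current.Clear();
                continue;
            }
            current.Append(line).Append('\n');
        }
        // Last message in the archive, or the only one in a single-message file.
        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }

    // Convenience wrapper for a file on disk.
    // ISO-8859-1 maps every byte to one character, so no bytes are rejected.
    public static IEnumerable<string> SplitFile(string path)
    {
        using (var reader = new StreamReader(path, Encoding.GetEncoding("iso-8859-1")))
        {
            foreach (var message in Split(reader))
            {
                yield return message;
            }
        }
    }
}
```

Usage would look like `foreach (var msg in MboxSplitter.SplitFile(path)) { ... }`. One caveat: a single-message file whose body contains an unescaped line starting with "From " would be split incorrectly, so a stricter version might only accept a separator on the first line or after a blank line.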

A: 

If it is a one-time activity, I think these are the easiest steps to sort out the messages:

  1. join all the mbox files into one
  2. load the combined file into Thunderbird as a local folder
  3. run one of the duplicate-message finder add-ons on the folder
  4. delete the duplicates it finds
  5. compact the folder
  6. take the duplicate-free message list :)

Duplicate Eliminators (Add-Ons for Thunderbird)

I've used this: Remove Duplicate Messages (Alternate)

Denis Barmenkov