I have a huge number of mail archives that I want to de-duplicate and sort out. The archives are either in mbox format or contain a single mail message. To complicate matters, some of the files use Windows EOL sequences and some use Unix EOL.
Using C#, how do I read an archive and split it into its individual messages, or read a single message file? In Python, I would use the mailbox.mbox class, but I cannot find matching functionality in the C# documentation.
A:
It is unlikely that you will find a library to read that file for C# - there aren't that many Unix users who also use C#.
I would do one of the following:
- Read the Python code, and then port it to C#.
- Find a description of the mbox format online. Since it comes from Unix, chances are that the format is just a plain text file, which should be easy enough to parse.
tomjen
2008-11-27 20:16:31
I had a feeling this was going to happen. I'm not sure the mbox format is Unix-only (I think Thunderbird uses it on Windows), and it isn't that complicated: just dumps of RFC 2822 messages, each preceded by a "From [date]" line.
Simon Callan
2008-11-27 20:32:51
A:
Most standard Unix mail files delimit entries with a line starting with "From ".
So if you read the mail file as a text file and start a new mail entry every time you see the string "From " at the start of a line, it should work. Any occurrences of that string elsewhere should already have been escaped by the email program.
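The approach described above can be sketched roughly as follows. This is only an illustration, written in Python because the question mentions it; the function name `split_mbox` is made up for the example, and the logic is meant to be simple enough to port to C# line by line. It assumes that every line beginning with "From " is a separator, and it does not undo any escaping the mail program applied inside message bodies. A file with no "From " line at all comes back as a single message, which covers the single-message files from the question.

```python
def split_mbox(data: bytes) -> list[bytes]:
    """Split raw mbox bytes into messages (hypothetical helper, not a library API).

    Assumes any line starting with b"From " is a separator line.
    Returned messages use "\n" line endings regardless of the input EOL style.
    """
    messages: list[bytes] = []
    current: list[bytes] = []

    def flush() -> None:
        # Drop the blank line(s) that precede the next separator.
        while current and current[-1] == b"":
            current.pop()
        if current:
            messages.append(b"\n".join(current))
        current.clear()

    # bytes.splitlines() breaks on \n, \r\n and \r, so Windows and Unix
    # files are handled the same way.
    for line in data.splitlines():
        if line.startswith(b"From "):
            flush()               # separator line: close the previous message
        else:
            current.append(line)  # header or body line of the current message
    flush()
    return messages
```

Working on bytes rather than decoded text sidesteps guessing the character encoding of each archive. In C#, reading the file line by line with `StreamReader.ReadLine` should give the same EOL-agnostic behaviour, since it accepts both line-ending styles.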
A:
If it is a one-time activity, I think these are the easiest steps to sort the messages:
- join all the mbox files into one
- load the combined file into Thunderbird as a local folder
- run one of the duplicate message finder add-ons on the folder
- delete the duplicates found
- compact the folder
- take the duplicate-free message list :)
Duplicate Eliminators (Add-ons for Thunderbird)
I've used this: Remove Duplicate Messages (Alternate)
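If the job turns out not to be one-time, the duplicate-removal step can also be scripted. The sketch below is an assumption about what counts as a duplicate, not a description of how the Thunderbird add-ons work: it treats two messages as the same when their Message-ID headers match, and falls back to a hash of the raw bytes when the header is missing. The function name `dedupe` is hypothetical, and the input is assumed to be a list of raw messages that have already been split out of the archives.

```python
import email
import hashlib


def dedupe(raw_messages: list[bytes]) -> list[bytes]:
    """Return the messages with duplicates removed, keeping first occurrences.

    Assumption: equal Message-ID headers mean equal messages. Messages with
    no Message-ID are compared by a SHA-256 hash of their raw bytes.
    """
    seen: set[str] = set()
    unique: list[bytes] = []
    for raw in raw_messages:
        msg = email.message_from_bytes(raw)
        message_id = msg["Message-ID"]
        if message_id is None:
            key = "sha256:" + hashlib.sha256(raw).hexdigest()
        else:
            # str() guards against header values that are not plain strings.
            key = "id:" + str(message_id).strip()
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Matching on Message-ID is cheap but trusts the sending software to generate unique IDs; comparing a hash of selected headers plus the body would be a stricter alternative.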
Denis Barmenkov
2009-12-05 00:18:35