views:

212

answers:

3

I am writing a C# application which allows users to store emails in a MS SQL Server database. Many times, multiple users will be copied on an email from a customer. If they all try to add the same email to the database, I want to make sure that the email is only added once.

MD5 springs to mind as a way to do this. I don't need to worry about malicious tampering, only to make sure that the same email will map to the same hash and that no two emails with different content will map to the same hash.

My question really boils down to how one would combine multiple fields into one MD5 (or other) hash value. Some of these fields will have a single value per email (e.g. subject, body, sender email address) while others will have multiple values (varying numbers of attachments, recipients). I want to develop a way of uniquely identifying an email that will be platform and language independent (not based on serialization). Any advice?

+1  A: 

Have you looked at some other headers like (in my mail, OS X Mail):

X-Universally-Unique-Identifier: 82d00eb8-2a63-42fd-9817-a3f7f57de6fa
Message-Id: <[email protected]>

At least the Message-Id is required. That field could well be the same for the same mailing (send to multiple recipients). That would be more effective than hashing.

Not the answer to the question, but maybe the answer to the problem :)

extraneon
The problem with this approach is that it is dependent on the mail client. I want to find a strategy which will work as long as you know the time sent, sender's email address, subject, body, recipients, and attachments
Skywalker
+1  A: 

What volume of emails do you plan on archiving? If you don't expect the archive require many terabytes, I think this is a premature optimization.

Since each field can be represented as a string or array of bytes, it doesn't matter how many values it contains, it all looks the same to a hash function. Just hash them all together and you will get a unique identifier.

EDIT Psuedocode example

# intialized the hash object
hash = md5()

# compute the hashes for each field
hash.update(from_str)
hash.update(to_str)
hash.update(cc_str)
hash.update(body_str)
hash.update(...) # the rest of the email fields

# compute the identifier string
id = hash.hexdigest()

You will get the same output if you replace all the update calls with

# concatenate all fields and hash
hash.update(from_str + to_str + cc_str + body_str + ...)

How you extract the strings and interface will vary based on your application, language, and api.

It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input, this will give you a hash unique to the original email.

mikerobi
I see one major flaw with this pseudocode. To simplify, lets assume that there are only two fields, string1 and string2. With your psuedocode, both {string1 = "foo", string2 = "bar"} and {string1 = "foob", string2 = "ar"} would result in the same hash. Maybe this can be solved by hashing the sums of the individual hashes?
Skywalker
A hash is not guaranteed unique. Different inputs can - and will - generate the same hash. If hashes match you should do a content compare anyway. You tend to use hash buckets, and each buckets holds all the messages with that hash. (in a table structure this would be 2 tables, one (hash=unique) and one(id=unique, foreign key to hash not unique, the message + headers.
extraneon
Using a cryptographic hash function with a large hash value (like MD5 with its 128 bit hash values), you can virtually guarantee that you will not have hash collisions. The problem with this scheme is that it produces easily predictable cases where the hashes collide
Skywalker
@Skywalker, I wouldn't worry about two fields combining in a way that makes an accidental collision. I think that is unlikely to happen unless the email contains invalid headers or someone is trying to construct a collision. If you are really worried you can added some secret string to the hash in between each of the to the hash between each of the update calls, just be sure to pick a string that is unlikely to exist in any of the fields.
mikerobi
@Skywalker, @mikerobi Do you really want to deliver software with known potential problems? In a year or so you'll get a call that it's broken, and you wouldn't know where to look anymore. If you do decide not to handle hash collisions at least print that in the log "Hash collision detected for ...". Then you, or the maintainer, at least know what happened.
extraneon
be very careful with this approach. Different libraries may give you different order of recipients. If the mail body is HTML encoded, then you may get drastically different results based on mail client+library (especially true for Outlook which actually RTF encodes the HTML text and there is a lot of transformation involved).
Marek
@extraneon, I answered his question, I didn't tell him it was a good idea, which is why I up-voted your comment.
mikerobi
@marek, since every email originated from 1 client, it doesn't matter how the email was structured or encoded, it will still have a unique hash.
mikerobi
Common techniques to prevent the collision problem pointed out by Skywalker are to prepend the length of each field before concatenating the strings or using a distinct delimiter not used otherwise in the fields.
abc
@abc I just thought of catting the length to the string last night :-)
Skywalker
+1  A: 

Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself, before hashing. It also contains all the attachments, the entire body of the message, etc, and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.

Nick Johnson
Unfortunately, i lied when I said I want it to be platform independent. It has to at minimum support Outlook which, by Microsoft's infinite wisdom, doesn't store the raw message
Skywalker