ansaurus

Question

Answer 1

+1 A:

Have you looked at some other headers like (in my mail, OS X Mail):

X-Universally-Unique-Identifier: 82d00eb8-2a63-42fd-9817-a3f7f57de6fa
Message-Id: <[email protected]>

At least the Message-Id is required. That field could well be the same for the same mailing (send to multiple recipients). That would be more effective than hashing.

Not the answer to the question, but maybe the answer to the problem :)

extraneon 2010-04-21 15:18:16

The problem with this approach is that it is dependent on the mail client. I want to find a strategy which will work as long as you know the time sent, sender's email address, subject, body, recipients, and attachments

Skywalker 2010-04-21 15:21:31

Answer 2

+1 A:

What volume of emails do you plan on archiving? If you don't expect the archive require many terabytes, I think this is a premature optimization.

Since each field can be represented as a string or array of bytes, it doesn't matter how many values it contains, it all looks the same to a hash function. Just hash them all together and you will get a unique identifier.

EDIT Psuedocode example

# intialized the hash object
hash = md5()

# compute the hashes for each field
hash.update(from_str)
hash.update(to_str)
hash.update(cc_str)
hash.update(body_str)
hash.update(...) # the rest of the email fields

# compute the identifier string
id = hash.hexdigest()

You will get the same output if you replace all the update calls with

# concatenate all fields and hash
hash.update(from_str + to_str + cc_str + body_str + ...)

How you extract the strings and interface will vary based on your application, language, and api.

It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input, this will give you a hash unique to the original email.

mikerobi 2010-04-21 15:38:02

I see one major flaw with this pseudocode. To simplify, lets assume that there are only two fields, string1 and string2. With your psuedocode, both {string1 = "foo", string2 = "bar"} and {string1 = "foob", string2 = "ar"} would result in the same hash. Maybe this can be solved by hashing the sums of the individual hashes?

Skywalker 2010-04-21 16:12:11

A hash is not guaranteed unique. Different inputs can - and will - generate the same hash. If hashes match you should do a content compare anyway. You tend to use hash buckets, and each buckets holds all the messages with that hash. (in a table structure this would be 2 tables, one (hash=unique) and one(id=unique, foreign key to hash not unique, the message + headers.

extraneon 2010-04-21 16:36:28

Using a cryptographic hash function with a large hash value (like MD5 with its 128 bit hash values), you can virtually guarantee that you will not have hash collisions. The problem with this scheme is that it produces easily predictable cases where the hashes collide

Skywalker 2010-04-21 17:45:20

@Skywalker, I wouldn't worry about two fields combining in a way that makes an accidental collision. I think that is unlikely to happen unless the email contains invalid headers or someone is trying to construct a collision. If you are really worried you can added some secret string to the hash in between each of the to the hash between each of the update calls, just be sure to pick a string that is unlikely to exist in any of the fields.

mikerobi 2010-04-21 20:17:39

@Skywalker, @mikerobi Do you really want to deliver software with known potential problems? In a year or so you'll get a call that it's broken, and you wouldn't know where to look anymore. If you do decide not to handle hash collisions at least print that in the log "Hash collision detected for ...". Then you, or the maintainer, at least know what happened.

extraneon 2010-04-22 07:14:45

be very careful with this approach. Different libraries may give you different order of recipients. If the mail body is HTML encoded, then you may get drastically different results based on mail client+library (especially true for Outlook which actually RTF encodes the HTML text and there is a lot of transformation involved).

Marek 2010-04-22 12:51:13

@extraneon, I answered his question, I didn't tell him it was a good idea, which is why I up-voted your comment.

mikerobi 2010-04-22 13:30:03

@marek, since every email originated from 1 client, it doesn't matter how the email was structured or encoded, it will still have a unique hash.

mikerobi 2010-04-22 13:31:01

Common techniques to prevent the collision problem pointed out by Skywalker are to prepend the length of each field before concatenating the strings or using a distinct delimiter not used otherwise in the fields.

abc 2010-04-23 11:14:30

@abc I just thought of catting the length to the string last night :-)

Skywalker 2010-04-23 14:08:06

Answer 3

+1 A:

Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself, before hashing. It also contains all the attachments, the entire body of the message, etc, and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.

Nick Johnson 2010-04-22 11:08:53

Unfortunately, i lied when I said I want it to be platform independent. It has to at minimum support Outlook which, by Microsoft's infinite wisdom, doesn't store the raw message

Skywalker 2010-04-23 14:13:33

ansaurus

tags:

views:

answers:

Unique identifier for an email

related questions