I want to store and index all of my historical e-mail and news as individual message files, using some computed hash code based on the message body+headers. Then I'll index on other things as well -- for searching.

For the primary index key, my thought is to use SHA-1 for the hash algorithm and assume that there will never be any collisions (although I know that there theoretically could be).

Besides the body, what headers should I index? Or more generally, what transformations should I apply to an in-memory copy of the message prior to hashing?

Should I ignore "ReSent-*:" headers? Should I join line-broken headers into single-line headers and remove extraneous whitespace?

(The reason I want to index the messages based on some hash instead of on the Message-ID header is that Message-ID headers aren't uniformly formatted.)

A: 

You should hash precisely that which constitutes the uniqueness of the message. If two messages may differ by the presence of "ReSent-*:" headers but must still be considered the "same" message, then those headers must not be part of what is hashed. Similarly, if equal messages may differ in header syntax, then you should normalize header syntax. Hash functions such as SHA-1 return the same output only if the input is exactly the same, every single bit of it.
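As a rough illustration of that idea, here is a minimal Python sketch, not a definitive implementation. It assumes that "Received" and "Resent-*" headers are the only ones to be ignored, that folded headers should be unfolded and whitespace runs collapsed, and that header order and line-ending style should not matter. The function name `canonical_digest` and the `IGNORED_PREFIXES` tuple are hypothetical; adjust them to whatever defines "the same message" for your archive.

```python
import hashlib
import re
from email import message_from_bytes

# Assumption: headers starting with these (lowercase) prefixes may differ
# between copies of the "same" message, so they are left out of the hash.
IGNORED_PREFIXES = ("resent-", "received")


def canonical_digest(raw: bytes) -> str:
    """Return a SHA-1 hex digest of a normalized view of the message."""
    msg = message_from_bytes(raw)

    lines = []
    for name, value in msg.items():
        lname = name.lower()
        if lname.startswith(IGNORED_PREFIXES):
            continue
        # Unfold continuation lines and collapse runs of whitespace.
        flat = re.sub(r"\s+", " ", str(value)).strip()
        lines.append(f"{lname}:{flat}")
    # Sort so that header order does not influence the digest.
    lines.sort()

    # Take the body as raw bytes after the first blank line, and
    # normalize CRLF to LF so line-ending style does not matter.
    parts = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)
    body = parts[1] if len(parts) > 1 else b""
    body = body.replace(b"\r\n", b"\n")

    h = hashlib.sha1()
    h.update("\n".join(lines).encode("utf-8", "surrogateescape"))
    h.update(b"\n\n")
    h.update(body)
    return h.hexdigest()
```

With this sketch, two copies that differ only in "Received" or "Resent-*" headers, or in how a header is folded, get the same digest, while any change to the body or to a retained header changes it.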

Now, if Message-IDs are good enough for you, save for the formatting issue, then there is a simple way: just hash the Message-IDs. A hashed Message-ID gives you a regular, fixed-size, randomized format on which you can index.
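A minimal sketch of hashing the Message-ID, again in Python. The helper name `message_id_key` is hypothetical, and stripping surrounding whitespace and angle brackets before hashing is an assumption about what normalization you want. Case is left untouched here, since changing it could merge IDs that are actually distinct.

```python
import hashlib
from email import message_from_bytes


def message_id_key(raw: bytes) -> str:
    """Return a fixed-size index key derived from the Message-ID header."""
    msg = message_from_bytes(raw)
    mid = msg.get("Message-ID")  # header lookup is case-insensitive
    if mid is None:
        raise ValueError("message has no Message-ID header")
    # Assumption: surrounding whitespace and angle brackets are not
    # significant, so drop them before hashing.
    normalized = str(mid).strip().strip("<>").strip()
    return hashlib.sha1(normalized.encode("utf-8", "surrogateescape")).hexdigest()
```

Messages without a Message-ID raise an error in this sketch; for an archive you might instead fall back to hashing normalized content for those.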

Thomas Pornin
Thanks, Thomas. Hashing the 'Message-ID' header field may be the best way to go. That certainly avoids having to hand-pick a set of other header fields to use/ignore. Another case besides 'ReSent-*' is when I receive two copies of a message that were sent to two different mailing lists that I'm on. In such cases, each copy of the message has different 'Received' headers, and I want to treat them as the same message. (Choosing which one to keep and which one to delete is a separate issue.)
Todd Lehman