I want to store and index all of my historical e-mail and news as individual message files, using some computed hash code based on the message body+headers. Then I'll index on other things as well -- for searching.
For the primary index key, my thought is to use SHA-1 for the hash algorithm and assume that there will never be any collisions (although I know that there theoretically could be).
Besides the body, what headers should I index? Or more generally, what transformations should I apply to an in-memory copy of the message prior to hashing?
Should I ignore "ReSent-*:" headers? Should I join line-broken headers into single-line headers and remove extraneous whitespace?
(The reason I want to index the messages based on some head instead of on the Message-ID header is because Message-ID headers aren't uniformly formatted.)