We have an internal .NET case management application that automatically creates a new case from an email. I want to be able to identify other emails that are related to the original email so we can prevent duplicate cases from being created.

I have observed that many, but not all, emails have a "Thread-Index" header that looks useful.

Does anybody know of a straightforward algorithm or package that we could use?

+2  A: 

As far as I know, there's not going to be a 100% foolproof solution, as not all email clients or gateways preserve or respect all headers.

However, you'll get a pretty high hit rate with the following:

  • Every email message should have a unique "Message-ID" field. Find it and record it as part of the case. (See RFC-822.)

  • If you receive two messages with the same Message-ID, discard the second one as it's a duplicate.

  • Check for the "In-Reply-To" field. If the ID it contains matches a known Message-ID, then you know the email is related.

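  • The "References" and "Original-Message-ID" headers have similar meanings: each can carry the Message-ID of an earlier message ("References" may list several), so check them against your known Message-IDs in the same way.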
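
The header checks above could be sketched roughly as follows. This is only a minimal illustration in Python (the application in the question is .NET, so treat it as something to port), using the standard-library `email` package. The `CaseStore` class, its `classify` method and the in-memory dictionary are hypothetical names invented for the example; a real system would persist the Message-IDs with each case.

```python
import re
from email import message_from_string

# Message-IDs look like <local@domain>; "References" may hold several of them.
MSG_ID_RE = re.compile(r"<[^<>\s]+>")


def extract_ids(value):
    """Return every <...> Message-ID token found in a header value."""
    return MSG_ID_RE.findall(str(value)) if value else []


class CaseStore:
    """Hypothetical in-memory map from Message-ID to case number."""

    def __init__(self):
        self.case_by_message_id = {}
        self.next_case = 1

    def classify(self, raw_email):
        """Return (status, case) where status is 'duplicate', 'related' or 'new'."""
        msg = message_from_string(raw_email)
        ids = extract_ids(msg.get("Message-ID"))
        message_id = ids[0] if ids else None

        # Same Message-ID seen before: this is a duplicate delivery.
        if message_id and message_id in self.case_by_message_id:
            return "duplicate", self.case_by_message_id[message_id]

        # Collect every earlier Message-ID this email points at.
        parents = []
        for header in ("In-Reply-To", "References", "Original-Message-ID"):
            parents.extend(extract_ids(msg.get(header)))

        for parent in parents:
            case = self.case_by_message_id.get(parent)
            if case is not None:
                status = "related"
                break
        else:
            # No known parent: open a new case.
            case = self.next_case
            self.next_case += 1
            status = "new"

        if message_id:
            self.case_by_message_id[message_id] = case
        return status, case
```

Calling `classify` on each incoming message, in arrival order, then tells you whether to open a case, attach the email to an existing one, or drop it.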

If your system ever generates emails, include a case ID in the subject line in a form you can search for when a reply comes back (e.g. [Case#20081114-01]); most people don't edit subject lines when replying.
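
A subject-line tag like that can be picked out with a regular expression. The sketch below is in Python and assumes the tag always has the shape shown in the example (eight digits, a hyphen, then a sequence number); the function names `find_case_id` and `tag_subject` are made up for illustration.

```python
import re

# Matches tags such as [Case#20081114-01]; the exact shape is an assumption.
CASE_TAG_RE = re.compile(r"\[Case#(\d{8}-\d+)\]", re.IGNORECASE)


def find_case_id(subject):
    """Return the case ID embedded in a subject line, or None if absent."""
    match = CASE_TAG_RE.search(subject or "")
    return match.group(1) if match else None


def tag_subject(subject, case_id):
    """Append a case tag to an outgoing subject unless one is already there."""
    if find_case_id(subject) is not None:
        return subject
    return f"{subject} [Case#{case_id}]"
```

Because the search is not anchored, it still finds the tag when a mail client prefixes the subject with "RE:" or "FW:".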

The internet standards RFC-822 (since superseded by RFC-2822 and then RFC-5322), RFC-2076 and RFC-4021 may be useful further reading.

Given that some messages will always be missed (for whatever reason), you'll probably also want related features in your case management system - say, "Close as Duplicate Case" or "Merge with Duplicate Case", along with tools to make it easier to find duplicates.

Bevan
+5  A: 

Use the JWZ threading algorithm.
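
The JWZ algorithm builds a tree of messages from the "References" and "In-Reply-To" headers and then prunes it and groups it by subject. The Python sketch below is not that algorithm. It only illustrates the core idea of linking messages through the IDs they reference, using a simple union-find, and it assumes each message is already available as a dictionary of header strings. The function name `group_into_threads` and that input shape are hypothetical.

```python
import re

MSG_ID_RE = re.compile(r"<[^<>\s]+>")


def group_into_threads(messages):
    """Group messages that share any referenced Message-ID.

    messages: list of dicts with optional 'Message-ID', 'In-Reply-To'
    and 'References' keys. Returns a list of sets of Message-IDs.
    """
    parent = {}

    def find(x):
        # Find the representative of x, creating it on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    own_ids = []
    for headers in messages:
        ids = MSG_ID_RE.findall(headers.get("Message-ID", ""))
        if not ids:
            continue  # cannot thread a message without its own ID
        own = ids[0]
        own_ids.append(own)
        find(own)
        linked = headers.get("References", "") + " " + headers.get("In-Reply-To", "")
        for other in MSG_ID_RE.findall(linked):
            # IDs of messages we never received still act as link points.
            union(own, other)

    threads = {}
    for own in own_ids:
        threads.setdefault(find(own), set()).add(own)
    return list(threads.values())
```

Each returned set would map to one case. Two replies to a message you never received still land in the same group, because they are joined through the ID they both reference.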

geocar