I have a Python script that receives text messages from users and processes each one as a query. However, some users have signatures automatically appended to their messages, and the script incorrectly treats those signatures as actual content. What's the best programmatic way to recognize and remove them?

(I'd prefer Python, but I'm fine with any other language, or even just pseudocode.)

A: 

If the signature always follows a specific pattern, you should be able to just use a regular expression to trim it off.
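
As a minimal sketch of the regex approach, assuming the signature is introduced by a line containing only "--" (optionally followed by a single space or tab), which is a common email convention but not something the question guarantees. The names SIG_DELIMITER and strip_signature are made up for illustration:

```python
import re

# Assumption: the signature starts at a line that is exactly "--",
# optionally followed by one space or tab.
SIG_DELIMITER = re.compile(r"^--[ \t]?$", re.MULTILINE)

def strip_signature(message):
    """Return the message with everything from the delimiter line onward removed."""
    match = SIG_DELIMITER.search(message)
    if match is None:
        # No delimiter found, so leave the message untouched.
        return message
    return message[:match.start()].rstrip()

print(strip_signature("weather in Paris\n-- \nSent from my phone"))
```

If your users' signatures follow a different pattern, only the regular expression needs to change.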

However, if users can set up their signatures any way they wish, and there are no leading characters (e.g., -- at the beginning), this is going to be very difficult. The only reliable way to do it would be to know the content of each user's signature in advance so you can strip it out. Imagine a worst-case scenario: somebody sends a blank message with a signature that happens to be a fully valid "query". There would be no way for the script to differentiate that from a "query" message with no signature.
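
If you do know each user's signature ahead of time, stripping it is straightforward. Here is a rough sketch; the KNOWN_SIGNATURES table, the sender key, and the function name are all hypothetical, and how you would actually collect the signatures is up to you:

```python
# Hypothetical lookup table: sender identifier -> that sender's known signature.
KNOWN_SIGNATURES = {
    "alice@example.com": "Sent from my phone",
}

def strip_known_signature(sender, message, signatures=KNOWN_SIGNATURES):
    """Remove the sender's known signature if the message ends with it."""
    body = message.rstrip()
    signature = signatures.get(sender, "").strip()
    if signature and body.endswith(signature):
        return body[: -len(signature)].rstrip()
    # Unknown sender, or the message does not end with their signature.
    return body

print(strip_known_signature("alice@example.com", "weather in Paris\nSent from my phone\n"))
```

Note that a message consisting only of the signature comes back as an empty string, which is exactly the ambiguous worst case described above.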

Reed Copsey
A: 

If the signatures are appended to the body of the message such that they're actually part of the body text, then there are only two ways to remove them:

  • Heuristics, such as "anything following three dashes must be a signature". These may be effective if you spend some time tuning them.
  • A classifier. This is a lot of work to set up, and requires that you "train" it by marking some message parts as signatures. These can also be very effective, but like heuristics will never work 100% of the time.
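
To illustrate the heuristic route, here is a small sketch. The marker patterns and the five-line window are guesses chosen for the example, not tuned values, and you would need to adjust them against real messages:

```python
import re

# Assumed heuristics; each pattern marks a line that probably starts a signature.
SIGNATURE_MARKERS = [
    re.compile(r"^-{2,3}\s*$"),                       # a line of two or three dashes
    re.compile(r"^sent from my \w+", re.IGNORECASE),  # mobile client footers
    re.compile(r"^(best|kind) regards\b", re.IGNORECASE),
]

# Only the last few lines are checked, since signatures sit at the end.
MAX_SIGNATURE_LINES = 5

def strip_signature_heuristic(message):
    """Cut the message at the first line in its tail that looks like a signature."""
    lines = message.rstrip().splitlines()
    start = max(0, len(lines) - MAX_SIGNATURE_LINES)
    for i in range(start, len(lines)):
        candidate = lines[i].strip()
        if any(pattern.search(candidate) for pattern in SIGNATURE_MARKERS):
            return "\n".join(lines[:i]).rstrip()
    return "\n".join(lines)

print(strip_signature_heuristic("weather in Paris\n---\nJohn\nAcme Corp"))
```

Like any heuristic, this will misfire on some inputs, for example a query that legitimately begins with "Sent from my".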
Borealid
Can you explain a little more on how a classifier would work?
Joseph
@Joseph: A classifier is an algorithm such as a neural network, SVM, or Bayesian filter which is "trained" on a known corpus and then applied to an unknown corpus (possibly with feedback when it makes a mistake). Implementing one is nontrivial.
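
To make the idea concrete, here is a toy Bayesian filter that labels individual lines as "body" or "sig". It is only a sketch: the class name, the labels, and the tiny training set are invented for illustration, and a real classifier would need far more training data and better features than bare words.

```python
import math
from collections import Counter

class NaiveBayesLineClassifier:
    """Toy multinomial naive Bayes over the words of a single line."""

    def __init__(self):
        self.word_counts = {"body": Counter(), "sig": Counter()}
        self.line_counts = Counter()

    def train(self, line, label):
        # "Training" just counts how often each word appears under each label.
        self.line_counts[label] += 1
        self.word_counts[label].update(line.lower().split())

    def classify(self, line):
        vocab = set(self.word_counts["body"]) | set(self.word_counts["sig"])
        total_lines = sum(self.line_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood of each word, with add-one smoothing
            # so that unseen words do not zero out the score.
            score = math.log((self.line_counts[label] + 1) / (total_lines + 2))
            total_words = sum(counts.values())
            for word in line.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

classifier = NaiveBayesLineClassifier()
for text in ["what is the weather in paris", "find flights to london", "weather in london tomorrow"]:
    classifier.train(text, "body")
for text in ["sent from my iphone", "best regards john", "sent from my android phone"]:
    classifier.train(text, "sig")

print(classifier.classify("sent from my tablet"))
print(classifier.classify("weather in paris"))
```

The feedback loop mentioned above would amount to calling train again with the correct label whenever the classifier gets a line wrong.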
Borealid