Hi,

I need to implement an automated email reply system.

For this system I need to check incoming emails and reply to each one in the same language in which it was received.

How can I do this? Please suggest some ideas.

Thanks in advance

Ashish


Appending one more query:

In the email headers there is also a header of this kind:

'Content-Type: text/plain; charset=ISO-8859-1'

How reliable is it for determining the language of the email body?

e.g. (all headers taken from Gmail):

  1. for Chinese subject and body 'Content-Type: text/plain; charset=GB2312'

  2. for Korean subject and body 'Content-Type: text/plain; charset=EUC-KR'

  3. for French/Italian subject and body 'Content-Type: text/html; charset=ISO-8859-1'

Also, can somebody direct me to a list that maps languages to charsets?

thanks in advance

Ashish Sharma

+3  A: 

Google Translate can guess the language of a sample text. Have a look at its API; it could be a solution for your problem, provided you are connected to the internet anyway and don't mind sending fragments of mails to Google's servers.

For offline evaluation I found the Java Text Categorizing Library.

Andreas_D
How good is this approach: I look for an email header like "Content-Language: en-us" and prepare my response based on it? How many email clients and webmail clients add this header?
Ashish
+3  A: 

This answer is primarily for those who don't trust online services and cannot use GPL/LGPL software for various reasons. If those aren't problems, Andreas_D's answer is probably better.

It's an interesting problem. Here's how I'd approach it.

For every language you want to support, pick the twenty most common words that are unique to that language (such as "and", "the", "because" and so forth for English). In other words, don't use "blancmange" or "soufflé" to identify French, since you may well get a message from a German chef.

Then just score your languages against the email to see which language has the highest occurrence of those words.
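The scoring step could be sketched roughly as follows. This is only a minimal illustration: the `MARKER_WORDS` lists are hypothetical, deliberately short, and hand-picked for the example rather than the twenty-word lists described above, and the function names are made up here.

```python
import re
from collections import Counter

# Hypothetical marker words, chosen only for illustration.
# Real lists would be longer and checked against the other languages.
MARKER_WORDS = {
    "en": {"the", "and", "because", "with", "this", "that"},
    "fr": {"le", "les", "et", "est", "une", "pour"},
    "de": {"der", "und", "ist", "nicht", "das", "mit"},
}

def score_languages(text):
    """Count how often each language's marker words occur in the text."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    return {
        lang: sum(words[w] for w in markers)
        for lang, markers in MARKER_WORDS.items()
    }

def rank_languages(text):
    """Return language codes ordered from highest to lowest score."""
    scores = score_languages(text)
    return sorted(scores, key=lambda lang: scores[lang], reverse=True)
```

Calling `rank_languages(body)` on the plain-text body gives the candidate order; the raw scores from `score_languages` show how close the runners-up were.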

But I wouldn't use that to exclusively decide the language. Rather, I'd use it to select the order in which the parts of the reply appear. If an email was predominantly German but stood even a little chance of being French, I'd put the message out like this:

  • German bit.
  • French bit.
  • English bit (see below).

Each "bit" would also contain a section at the start along the lines of "We have detected your most likely language as BLAH but, if this is not the case, scroll down for other likely languages".

And always have the fallback of English just in case you're dead wrong. I know it's linguocentric but I'm pretty certain the vast majority of Internet users are forced to deal with English (or its strange and slightly warped cousin, American) every day.
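A rough sketch of assembling such a reply, with the likely languages first and English always present as the fallback. The `REPLY_TEMPLATES` texts and the function names are placeholders invented for this example; real templates would also carry the "we have detected your most likely language" notice.

```python
# Placeholder reply texts, purely illustrative.
REPLY_TEMPLATES = {
    "en": "Thank you for your message.",
    "fr": "Merci pour votre message.",
    "de": "Vielen Dank für Ihre Nachricht.",
}

def section_order(likely_languages, fallback="en"):
    """Keep the detected order, drop unknown codes, ensure the fallback is included."""
    order = []
    for lang in list(likely_languages) + [fallback]:
        if lang in REPLY_TEMPLATES and lang not in order:
            order.append(lang)
    return order

def build_reply(likely_languages, fallback="en"):
    """Join one section per language, most likely language first."""
    order = section_order(likely_languages, fallback)
    return "\n\n".join(REPLY_TEMPLATES[lang] for lang in order)
```

Here `likely_languages` is assumed to be the list of languages that scored above zero, best first, so a mostly German mail with a few French hits would yield German, French, then English.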

paxdiablo
It's got to be quite flexible to support misspelling and slang; not many people speak the Queen's English anymore!
Tom Gullen
Well, the number of words to check may help out there and you could possibly expand it. However, I think you'd be hard pressed trying to find an English email of a decent size that didn't contain `and`, `but`, `or` or `is`.
paxdiablo
You need words that are common in a language, yet uncommon in others. "is" just isn't, e.g. it is the same word in Dutch. "but" is French for "goal", not uncommon either. "the" is tea in a number of languages. I probably wouldn't restrict myself to words. There are other clues that can be far more telltale. The letter combination "th" is far more common in English; the use of ß is far more common in German.
MSalters
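The character-level clues mentioned in this comment could be counted along these lines. It is a sketch only: the `CHAR_CLUES` table is a hypothetical, unvetted selection, and how to weigh these counts against word scores is left open.

```python
# Hypothetical clue characters per language; illustrative, not a vetted list.
CHAR_CLUES = {
    "de": "ßäöü",
    "fr": "éèêàçù",
    "es": "ñ¿¡",
}

def char_clue_counts(text):
    """Count occurrences of each language's clue characters."""
    lowered = text.lower()
    return {
        lang: sum(lowered.count(ch) for ch in chars)
        for lang, chars in CHAR_CLUES.items()
    }

def digraph_rate(text, digraph="th"):
    """Share of adjacent character pairs that equal the given digraph."""
    lowered = text.lower()
    pairs = max(len(lowered) - 1, 1)  # avoid dividing by zero on short input
    return lowered.count(digraph) / pairs
```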
+2  A: 

Where did the email senders get the email address? If it was on a web page, TV commercial, print advertisement, etc. in their own language, then you could give each supported language its own email address.
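With one address per language, the lookup becomes trivial. A minimal sketch, assuming made-up addresses on `example.com` and a hypothetical `language_for_recipient` helper:

```python
from email.utils import parseaddr

# Hypothetical per-language mailboxes; replace with the real addresses.
ADDRESS_LANGUAGE = {
    "support-en@example.com": "en",
    "support-fr@example.com": "fr",
    "support-de@example.com": "de",
}

def language_for_recipient(to_header, default="en"):
    """Map the address in a To header to a language code."""
    _name, address = parseaddr(to_header)
    return ADDRESS_LANGUAGE.get(address.lower(), default)
```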

emory
+1 for simplicity
Joeri Hendrickx
How good is this approach: I look for an email header like "Content-Language: en-us" and prepare my response based on it? How many email clients and webmail clients add this header?
Ashish
I don't know very much about the "Content-Language: en-us" feature. But I would suspect the real question is how many (of your users) add it correctly. If a large percentage of the headers are missing or incorrect, then you should probably ignore it. So you would have to determine whether your target population uses email systems that populate this header with valid data.
emory
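For anyone who wants to measure how often these headers are present and useful, here is a sketch that pulls weak hints from the `Content-Language` and `Content-Type` headers discussed in this thread, using Python's standard `email` package. The `CHARSET_HINTS` table is an assumption covering only the two language-specific charsets from the question; ISO-8859-1 is left out on purpose because it is shared by many Western European languages. The function name is made up, and for multipart mails the charset usually sits on the individual parts rather than the top-level header, which this sketch does not handle.

```python
from email import message_from_string

# Assumed charset-to-language hints; only charsets tied to one language.
CHARSET_HINTS = {
    "gb2312": "zh",
    "euc-kr": "ko",
}

def header_language_hints(raw_message):
    """Collect language codes hinted at by the message headers, if any."""
    msg = message_from_string(raw_message)
    hints = []

    # Content-Language may list several tags, e.g. "en-us, fr".
    content_language = msg.get("Content-Language")
    if content_language:
        for tag in content_language.split(","):
            primary = tag.strip().split("-")[0].lower()
            if primary and primary not in hints:
                hints.append(primary)

    # get_content_charset() returns the charset lowercased, or None.
    charset = msg.get_content_charset()
    hint = CHARSET_HINTS.get(charset) if charset else None
    if hint and hint not in hints:
        hints.append(hint)

    return hints
```

An empty list means the headers say nothing usable, in which case the body text has to decide.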