Maybe this is just impossible and I should give up all hope. Or maybe there's a really clever way to do it that I haven't thought of.

Here are two examples of what I've got:

يَبِسَ - يَيْبَسُ (yabisa, yaybasu)[y-b-s][ي-ب-س] (To become dry, stiff, rigid) 20:77 yabasan = dry. يَسَّرَ - يُيَسِّرُ (yassara, yuyassiru)[y-s-r][ي-س-ر] (To facilitate, make it easy) 92:7 nuyassiruhuu = We will ease him.

and

Zu Hülfe! zu Hülfe! Help! Help!
Sonst bin ich verloren! Otherwise I am lost!
Zu Hülfe! Zu Hülfe! Help! Help!
Sonst bin ich verloren! Otherwise I am lost!
Der listigen Schlange zum Opfer erkoren, Selected as offering to the cunning snake,
Barmherzigige Götter! Merciful Gods!
Schon nahet sie sich, Already it gets closer,
Schon nahet sie sich, Already it gets closer,

... it would be really annoying to go through and delete one language in order to further process these lines of text.

One way I was thinking this could be done in NLTK was to split the text into tokens, have some way of knowing the provenance of each token based on a small corpus, and then ask NLTK to 'reconstitute' only the tokens of my choosing. Is this just a wild fantasy?

+1  A: 

You can use nltk.NaiveBayesClassifier to do the job exactly as you said above.

The following link should help: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

It has an example of using nltk.NaiveBayesClassifier for gender identification. You can use the same approach for language identification.
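As a rough illustration of adapting that idea to language identification, here is a minimal sketch. The character-bigram features, the tiny hand-typed word lists, and the helper names `bigram_features` and `keep_language` are all assumptions made up for this example, not something taken from the NLTK book. A real run would need a much larger training corpus per language.

```python
import re

import nltk


def bigram_features(token):
    """Character-bigram presence features for one token (hypothetical feature design)."""
    padded = "^" + token.lower() + "$"  # mark word start and end
    return {"has(" + padded[i:i + 2] + ")": True for i in range(len(padded) - 1)}


# Tiny illustrative word lists; far too small for real use.
GERMAN = ["sonst", "bin", "ich", "verloren", "schon",
          "nahet", "sie", "sich", "der", "zum"]
ENGLISH = ["otherwise", "i", "am", "lost", "already",
           "it", "gets", "closer", "the", "help"]

train_set = ([(bigram_features(w), "de") for w in GERMAN]
             + [(bigram_features(w), "en") for w in ENGLISH])
classifier = nltk.NaiveBayesClassifier.train(train_set)


def keep_language(text, label):
    """Return only the tokens of `text` that the classifier assigns to `label`."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if classifier.classify(bigram_features(t)) == label]


line = "Sonst bin ich verloren! Otherwise I am lost!"
print(" ".join(keep_language(line, "de")))
print(" ".join(keep_language(line, "en")))
```

Classifying token by token loses the punctuation and treats every word in isolation; classifying whole phrases or smoothing the labels over neighbouring tokens would probably be more robust, but that goes beyond this sketch.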

The first example you quoted will work well with nltk.NaiveBayesClassifier, since the Arabic text and the Latin-script text use completely different sets of Unicode characters.
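When the scripts differ this much, you may not even need a trained classifier for a first pass. The sketch below uses only the standard library and sorts whitespace-separated tokens by the Unicode name of their first Arabic or Latin character. The function names `script_of` and `split_by_script` are made up for this example, and a token that mixes both scripts (such as `[y-b-s][ي-ب-س]`) is simply assigned to whichever script appears first in it.

```python
import unicodedata


def script_of(token):
    """Return 'arabic', 'latin', or 'other' from the first character whose name decides it."""
    for ch in token:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if name.startswith("ARABIC"):
            return "arabic"
        if name.startswith("LATIN"):
            return "latin"
    return "other"  # punctuation, digits, and so on


def split_by_script(text):
    """Group whitespace-separated tokens by script, preserving their order."""
    groups = {"arabic": [], "latin": [], "other": []}
    for token in text.split():
        groups[script_of(token)].append(token)
    return groups


parts = split_by_script("يَبِسَ - يَيْبَسُ (yabisa, yaybasu) (To become dry, stiff, rigid)")
print(" ".join(parts["latin"]))
print(" ".join(parts["arabic"]))
```

Splitting on whitespace rather than on `\w+` is deliberate here: the Arabic vowel marks are combining characters, and a `\w+` pattern would break the words apart at each mark.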

In the second example, some words, such as proper nouns, may be spelled the same way in both languages, which could cause some errors in identifying the language.

Neodawn