I have text files in multiple languages. How to selectively delete one language in NLTK? | ansaurus

Q:

I have text files in multiple languages. How to selectively delete one language in NLTK?

Maybe this is just impossible and I should give up all hope. Or maybe there's a really clever way to do it that I haven't thought of.

Here are two examples of what I've got:

يَبِسَ - يَيْبَسُ (yabisa, yaybasu)[y-b-s][ي-ب-س] (To become dry, stiff, rigid) 20:77 yabasan = dry. يَسَّرَ - يُيَسِّرُ (yassara, yuyassiru)[y-s-r][ي-س-ر] (To facilitate, make it easy) 92:7 nuyassiruhuu = We will ease him.

and

Zu Hülfe! zu Hülfe! Help! Help!
Sonst bin ich verloren! Otherwise I am lost! Zu Hülfe! Zu Hülfe! Help! Help! Sonst bin ich verloren! Otherwise I am lost! Der listigen Schlange zum Opfer erkoren, Selected as offering to the cunning snake, Barmherzigige Götter! Merciful Gods! Schon nahet sie sich, Already it gets closer, Schon nahet sie sich, Already it gets closer,

It would be really annoying to go through and delete one language by hand before processing these lines of text any further.

One way I was thinking this could be done in NLTK was to split the text into tokens, have some way of knowing the provenance of each token based on a small corpus, and then ask NLTK to 'reconstitute' only the tokens of my choosing. Is this just a wild fantasy?

+1 A:

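You can use nltk.NaiveBayesClassifier to do exactly what you describe: train it on a small labeled sample of each language, classify every token, and keep only the tokens of the language you want.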

The following link should help: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

It has an example of using nltk.NaiveBayesClassifier for gender identification; you can use the same approach for language identification.
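As a rough sketch of that idea, the snippet below trains nltk.NaiveBayesClassifier on character trigrams instead of the name-suffix features used in the gender example. The word lists, the `char_trigrams` and `keep_language` helpers, and the "de"/"en" labels are all assumptions made for illustration; a usable classifier would need far more training words per language.

```python
import nltk

# Hypothetical, tiny training vocabularies taken from the question's second
# example; real use would need a much larger sample of each language.
GERMAN = ["zu", "hülfe", "sonst", "bin", "ich", "verloren", "der", "listigen",
          "schlange", "zum", "opfer", "erkoren", "götter", "schon", "nahet",
          "sie", "sich"]
ENGLISH = ["help", "otherwise", "i", "am", "lost", "selected", "as", "offering",
           "to", "the", "cunning", "snake", "merciful", "gods", "already",
           "it", "gets", "closer"]


def char_trigrams(word):
    """Turn a word into a featureset of its character trigrams."""
    padded = "^" + word.lower() + "$"  # mark word start and end
    return {"has(%s)" % padded[i:i + 3]: True for i in range(len(padded) - 2)}


# NaiveBayesClassifier.train expects a list of (featureset, label) pairs.
train_set = ([(char_trigrams(w), "de") for w in GERMAN] +
             [(char_trigrams(w), "en") for w in ENGLISH])
classifier = nltk.NaiveBayesClassifier.train(train_set)


def keep_language(text, lang):
    """Keep only the whitespace-separated tokens classified as `lang`."""
    kept = []
    for token in text.split():
        word = token.strip(".,;:!?")  # classify the word without punctuation
        if word and classifier.classify(char_trigrams(word)) == lang:
            kept.append(token)
    return " ".join(kept)


print(keep_language("Sonst bin ich verloren! Otherwise I am lost!", "en"))
```

Classifying word by word keeps the sketch short, but very short or unseen words carry little evidence, so classifying whole phrases or lines is likely to be more reliable on real text.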

The first example you quoted should work well with nltk.NaiveBayesClassifier, since Arabic and Latin-script text use completely different sets of Unicode characters.
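Because the scripts do not overlap, the first example may not even need a trained classifier. The following is a minimal sketch using only Python's standard unicodedata module; the `is_arabic` and `drop_arabic` helpers are hypothetical names, and the approach assumes that every character to be removed has a Unicode name beginning with "ARABIC".

```python
import unicodedata


def is_arabic(ch):
    """Return True if the character's Unicode name starts with 'ARABIC'."""
    try:
        return unicodedata.name(ch).startswith("ARABIC")
    except ValueError:  # unnamed characters (e.g. some control codes)
        return False


def drop_arabic(text):
    """Drop every whitespace-separated token that contains an Arabic character."""
    kept = [tok for tok in text.split()
            if not any(is_arabic(ch) for ch in tok)]
    return " ".join(kept)


sample = "يَبِسَ - يَيْبَسُ (yabisa, yaybasu) (To become dry, stiff, rigid)"
print(drop_arabic(sample))
```

Note that a token mixing both scripts without spaces, such as "yaybasu)[y-b-s][ي-ب-س]" in the question's sample, is dropped as a whole, so a finer tokenizer would be needed to keep its Latin part.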

In the second example, some words, such as proper nouns, may be spelled the same in both languages, which can cause errors in language identification.

Neodawn 2010-09-08 16:28:10
