I need to do a project for a Computational Linguistics course. Is there an interesting "linguistic" problem that is data-intensive enough to work on using Hadoop MapReduce? The solution or algorithm should analyse the data and provide some insight in the "linguistic" domain. However, it should be applicable to large datasets so that I can use Hadoop for it. I know there is a Python natural language processing toolkit that can be used with Hadoop.
If you have large corpora in some "unusual" languages (in the sense of "ones for which limited amounts of computational linguistics have been performed"), repeating some existing computational linguistics work already performed for very popular languages (such as English, Chinese, Arabic, ...) is a perfectly appropriate project. This is especially true in an academic setting, but it might be quite suitable for industry, too. Back when I was in computational linguistics with IBM Research, I got interesting mileage from putting together a corpus for Italian and repeating (in the relatively new IBM Scientific Center in Rome) very similar work to what the IBM Research team in Yorktown Heights (of which I had been a part) had already done for English.
The hard work is usually finding / preparing such corpora (it was definitely the greatest part of my work back then, despite wholehearted help from IBM Italy to put me in touch with publishing firms who owned relevant data).
So, the question looms large, and only you can answer it: what corpora do you have access to, or can procure access to (and clean up, etc.), especially in "unusual" languages? If all you can do is, e.g., English, using already popular corpora, the chances of doing work that's novel and interesting are lower, though there may still be some.
BTW, I assume you're thinking strictly about processing "written" text, right? If you had a corpus of spoken material (ideally with good transcripts), the opportunities would be endless. There has been much less work on processing spoken language, e.g. to parameterize pronunciation variants of the same written text by different native speakers. Indeed, such issues are often not even mentioned in undergrad CL courses!
As you mention, there is a Python toolkit called NLTK, which can be used together with Dumbo to run jobs on Hadoop.
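To give a feel for what such a job looks like, here is a minimal sketch of a word-count mapper and reducer written as plain Python generators over (key, value) pairs. This is an illustration under assumptions, not the toolkit's own code: the regex tokenizer stands in for whatever NLTK tokenizer you would actually use, and `run_local` is a hypothetical helper that imitates the shuffle step on one machine so the logic can be tried without a cluster.

```python
import re
from collections import defaultdict

# Simple stand-in tokenizer (assumption): lowercase alphabetic tokens only.
TOKEN_RE = re.compile(r"[a-z']+")


def mapper(key, value):
    """Emit (token, 1) for every token in one line of text."""
    for token in TOKEN_RE.findall(value.lower()):
        yield token, 1


def reducer(key, values):
    """Sum the counts collected for one token."""
    yield key, sum(values)


def run_local(lines):
    """Hypothetical local driver: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in mapper(offset, line):
            grouped[k].append(v)
    result = {}
    for k in sorted(grouped):
        for out_key, out_value in reducer(k, grouped[k]):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    sample = ["The cat sat on the mat", "the dog sat"]
    print(run_local(sample))
```

On a real cluster the grouping done by `run_local` is what Hadoop does between the map and reduce phases; only `mapper` and `reducer` would carry over.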
PyCon 2010 had a good talk on just this subject. You can access the slides from the talk using the link below.
Download 300M words from 60K OA papers published by BioMed Central. Try to discover propositional attitudes and related sentiment constructions. The point is that the biomed literature is chock full of hedging and related constructions, because of the difficulty of making flat declarative statements about the living world and its creatures: their form and function and genetics and biochemistry.
My feeling about Hadoop is that it's a tool to consider, but to consider after you have done the important task of setting goals. Your goals, strategies, and data should dictate how you proceed computationally. Beware the hammer-in-search-of-a-nail approach to research.
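As a very rough first pass at the hedging idea, one could count surface hedge cues per sentence. The sketch below is only that: the cue list `HEDGE_CUES` is a small hypothetical sample chosen for illustration, not a validated lexicon, and real hedging detection would need context (for example, "may" as a month, or negation).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny cue list (assumption, not a real lexicon).
HEDGE_CUES = ["may", "might", "could", "suggest", "suggests",
              "appear", "appears", "possibly", "likely", "putative"]
CUE_RE = re.compile(r"\b(" + "|".join(HEDGE_CUES) + r")\b", re.IGNORECASE)


def hedge_counts(sentences):
    """Count how often each cue occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for match in CUE_RE.findall(sentence):
            counts[match.lower()] += 1
    return counts


def hedged_fraction(sentences):
    """Fraction of sentences that contain at least one cue."""
    if not sentences:
        return 0.0
    hedged = sum(1 for sentence in sentences if CUE_RE.search(sentence))
    return hedged / len(sentences)


if __name__ == "__main__":
    sample = [
        "These results suggest that the gene may regulate growth.",
        "The protein binds DNA.",
        "This could possibly explain the phenotype.",
    ]
    print(hedge_counts(sample))
    print(hedged_fraction(sample))
```

Per-sentence counting like this is independent across sentences, so it would split naturally across mappers if the corpus is large enough to need it.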
This is part of what my lab is hard at work on.
Bob Futrelle
BioNLP.org
Northeastern University
One computation-intensive problem in CL is inferring semantics from large corpora. The basic idea is to take a big collection of text and infer the semantic relationships between words (synonyms, antonyms, hyponyms, hypernyms, etc) from their distributions, i.e. what words they occur with or close to.
This involves a lot of data pre-processing and then can involve many nearest neighbor searches and N x N comparisons, which are well-suited for MapReduce-style parallelization.
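A toy version of that pipeline, assuming whitespace tokenization, raw co-occurrence counts in a fixed window, and cosine similarity (all choices made here for illustration; real systems typically add weighting and dimensionality reduction), might look like this. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, target, k=3):
    """Brute-force search: compare the target against every other word."""
    target_vec = vectors.get(target, Counter())
    scores = [(cosine(target_vec, vec), word)
              for word, vec in vectors.items() if word != target]
    return sorted(scores, reverse=True)[:k]


if __name__ == "__main__":
    corpus = [
        "the cat drinks milk",
        "the dog drinks water",
        "the cat chases mice",
        "the dog chases cats",
    ]
    vecs = cooccurrence_vectors(corpus)
    print(nearest_neighbors(vecs, "cat", k=2))
```

The counting step is a natural map/reduce job (emit word-context pairs, sum them), while the all-pairs comparison in `nearest_neighbors` is the N x N part that becomes expensive on a large vocabulary.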
Have a look at this tutorial:
http://wordspace.collocations.de/doku.php/course:acl2010:start