Hi guys,

I am a beginner in NLP and NLTK. I am very interested in NLP, so I joined a weekend course on AI at a local institution. The course requires a project for completion, and I decided to do mine in NLP. The problem is that the instructor is not good at all for this course (in my opinion she is just a charlatan), or perhaps she is simply not very interested in teaching, since this is her last batch here before the institute lets her go. So I am stuck: I have to finish this project within a month to a month and a half, but as a newcomer to the field I am finding it very difficult to understand enough to even decide on a project. (Also, since I work full time, I cannot find much time to dedicate to this.)

I am considering the NLTK toolkit in Python for the project, for the following reasons.

(1) Python is known for its ease of use, rapid prototyping, and very active community. (Given the very short time I have, and since I am a C programmer by profession, I need a language that I can learn fast and that is simple to use.)

(2) NLTK has good reviews, extensive documentation, and a very active community.

So the question is: what project should I take up so that I learn something and can still finish in time? (I know almost nothing about NLP; I don't even know exactly what a corpus is... :( )

So, please suggest some topics that I should consider for the project.

Regards,

MicroKernel :)

+5  A: 

Most "beginner" projects aim at reimplementing well-known algorithms, so the beginner can learn by verifying their results against known solutions. For this, I'd recommend something simple, like an email spam filter. You'd start by creating a training file, i.e. copy the text of several real emails into a delimited text file (pipe-separated in this example) and manually label each one as spam (1) or not spam (0), like:

text|is_spam
hi bob! how are you?|0
what time are you coming over|0
how to buy viagra now!|1

Next, you'd create a test file in the same format as the training file, but with different examples, so that you measure the classifier on emails it has not seen.

Then, you'd create your classifier/spam filter. There are many different ways to implement a spam filter, but the most basic is by simply counting the frequency with which a word appears with is_spam=0 and is_spam=1. For example, based on the training file above, the word "viagra" is associated with 1 spam classification, but 0 non-spam classifications, so it's likely future emails containing the word "viagra" will also be classified as spam. Similarly, the word "how" appears in 1 spam and 1 non-spam email, so it's less likely to indicate a definitive classification.
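The counting step described above can be sketched roughly as follows. This is only an illustration using the Python standard library: the `tokenize` and `count_words` helpers and the `TRAINING_DATA` string are names made up for this example, and the tokenizer is deliberately crude.

```python
import csv
import io
import re
from collections import Counter

# The three sample rows from the training file above, in pipe-delimited form.
TRAINING_DATA = """text|is_spam
hi bob! how are you?|0
what time are you coming over|0
how to buy viagra now!|1
"""

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a very rough tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def count_words(rows):
    """Map each label (0 or 1) to a Counter of how many emails contain each word."""
    counts = {0: Counter(), 1: Counter()}
    for row in rows:
        label = int(row["is_spam"])
        # A set is used so that a word is counted at most once per email.
        counts[label].update(set(tokenize(row["text"])))
    return counts

rows = list(csv.DictReader(io.StringIO(TRAINING_DATA), delimiter="|"))
counts = count_words(rows)

print(counts[1]["viagra"], counts[0]["viagra"])  # 1 0
print(counts[1]["how"], counts[0]["how"])        # 1 1
```

With a real training file you would pass an open file object to `csv.DictReader` instead of the in-memory string.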

You'd then train your classifier on the training file, and calculate its accuracy by running it on the test file.
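As a rough sketch of the train-then-test loop, here is one common way to turn word counts into a decision: a naive Bayes classifier with add-one smoothing. That specific technique is my choice for the illustration, not something the steps above prescribe, and the `train`, `classify`, and `accuracy` functions as well as the two test emails are made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a very rough tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count words per label from a list of (text, label) pairs, label being 0 or 1."""
    word_counts = {0: Counter(), 1: Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the higher smoothed log-probability for this text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        # Smoothed prior, so a label with no training examples does not give log(0).
        score = math.log((label_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, examples):
    """Fraction of (text, label) pairs that the model labels correctly."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)

train_set = [
    ("hi bob! how are you?", 0),
    ("what time are you coming over", 0),
    ("how to buy viagra now!", 1),
]
# Hypothetical test emails, made up for this sketch.
test_set = [("buy viagra", 1), ("are you coming over", 0)]

model = train(train_set)
print(accuracy(model, test_set))
```

Three training emails are far too few to say anything meaningful about accuracy; the point is only the shape of the loop. NLTK also ships its own classifiers, which may be the more natural choice once the idea is clear.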

If the above method is too simple, you can increase its complexity by counting n-grams (groups of consecutive words), or even by looking at grammatical structure after first tagging each word's part of speech (e.g. a lot of spam is random garbage populated with keywords, whereas non-spam usually makes some sense). You could also implement several different classifiers and compare their accuracy.
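To make the n-gram idea concrete, here is a small sketch that extracts word bigrams with plain Python. The `ngrams` helper is written by hand for this example; NLTK has comparable utilities (`nltk.util.ngrams` for n-grams, `nltk.pos_tag` for part-of-speech tagging) that you would probably use in the actual project.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a very rough tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("how to buy viagra now!")
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('how', 'to'), ('to', 'buy'), ('buy', 'viagra'), ('viagra', 'now')]

# Bigram counts can be fed to a classifier as features in the same way
# as single-word counts.
bigram_counts = Counter(bigrams)
```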

Granted, there's a bit more to it than that, but these methods are well documented on the internet, and it's your project so it's up to you to research it further. Good luck.

Chris S
Thank you so much :) This is a really interesting project for me to consider. To learn about the pattern of spam over time, I have been saving all the spam I received over the past few months in a separate folder, and I now have over 2000 spam emails collected. I guess that is a sufficiently large training set :). Now it has come to some use. Wow! Thanks for the help, mate :)
Microkernel
+2  A: 

Some ideas:

  • A program that guesses the language that an input file is written in. You'd need some samples of different languages; Wikipedia is an excellent source.

  • A program that, based on a text corpus, constructs words or sentences similar to those in the corpus.

  • Find something interesting to do with the Voynich Manuscript. You can find transcriptions here.
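For the language-guessing idea, one simple approach (an assumption on my part, not the only way to do it) is to compare character trigram frequencies of the input against a profile built from a sample of each language. The tiny `SAMPLES` below are made up for illustration; in practice you would build the profiles from much larger texts, such as Wikipedia articles.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical one-sentence samples; far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}

def guess_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    target = char_ngrams(text)
    return max(profiles, key=lambda lang: similarity(target, profiles[lang]))

profiles = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}
print(guess_language("the dog is in the house", profiles))
```

To guess the language of a file, read its contents into a string and pass that to `guess_language`.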

(By the way, "corpus" is just a fancy word for "bunch of text". From Wikipedia: "a large and structured set of texts (now usually electronically stored and processed)." The word usually refers to the texts that you use to train and test your algorithm, as opposed to the unknown texts that it will encounter in the field.)
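As an illustration of constructing sentences similar to those in a corpus, here is a minimal sketch of a word-level Markov chain, which is one simple technique for this (my choice for the example; there are many others). The one-line `corpus` and the `build_chain` and `generate` functions are made up for this sketch.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it somewhere in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng=random):
    """Walk the chain from `start`, picking a random follower at each step."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# A toy corpus; a real one would be a much larger body of text.
corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
print(generate(chain, "the", 8, random.Random(0)))
```

Every adjacent pair of words in the output also appears somewhere in the corpus, which is why the result tends to look locally plausible even when the whole sentence is nonsense.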

Thomas
Thank you so much :) The first one looks to be within my reach (I guess this is what Google Toolbar uses to find out the language of a webpage and offer translation help). The last one looks interesting too, but in my current situation I don't think I can do it... Thanks for the suggestions :)
Microkernel
A: 

You could use NLP to record and process some portions of a customer support call on a VoIP phone. The other options entered by the user could be taken from the keypad. With this system in place, you could eliminate the need for support personnel. For example, it could reset the password for an email ID in an organization using voice-based authorization.

fixxxer