ansaurus

Question

Natural language processing - Ideas for beginner's projects

Answer 1

+5 A:

Most "beginner" projects aim at reimplementing well-known algorithms, so the beginner can learn by verifying their results against known solutions. For this, I'd recommend something simple, like an email spam filter. You'd start by creating a training file, i.e. copy the text of several real emails into a CSV-style file (pipe-delimited in this example) and manually label each one spam or not spam, like:

text|is_spam
hi bob! how are you?|0
what time are you coming over|0
how to buy viagra now!|1

Next, you'd create a test file in the same format as the training file, but with different examples.

Then, you'd create your classifier/spam filter. There are many different ways to implement a spam filter, but the most basic is to count how often each word appears with is_spam=0 and with is_spam=1. For example, based on the training file above, the word "viagra" is associated with 1 spam classification but 0 non-spam classifications, so future emails containing the word "viagra" should probably also be classified as spam. Similarly, the word "how" appears in 1 spam and 1 non-spam email, so it's less likely to indicate a definitive classification.

You'd then train your classifier on the training file, and calculate its accuracy by running it on the test file.
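
The steps above could be sketched roughly as follows. This is only a minimal illustration, not a proper spam filter: the `TRAIN` and `TEST` strings are made-up stand-ins for the two files, the helper names (`tokenize`, `load`, `train`, `classify`, `accuracy`) are my own, and the "each word votes" scoring rule is just one simple way to turn the counts into a decision.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the pipe-delimited training file shown above.
TRAIN = """text|is_spam
hi bob! how are you?|0
what time are you coming over|0
how to buy viagra now!|1
"""

# Hypothetical test file: same format, different examples.
TEST = """text|is_spam
are you coming over now?|0
buy viagra|1
"""

def tokenize(text):
    # Deliberately naive: lowercase, split on whitespace, trim punctuation.
    words = (w.strip("!?.,") for w in text.lower().split())
    return [w for w in words if w]

def load(source):
    # Parse the pipe-delimited text into (text, label) pairs.
    reader = csv.DictReader(io.StringIO(source), delimiter="|")
    return [(row["text"], int(row["is_spam"])) for row in reader]

def train(examples):
    # For each label, count how many emails contain each word.
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(set(tokenize(text)))
    return counts

def classify(counts, text):
    # Each word votes for the class it was seen with more often.
    score = sum(counts[1][w] - counts[0][w] for w in tokenize(text))
    return 1 if score > 0 else 0

def accuracy(counts, examples):
    # Fraction of examples whose predicted label matches the true label.
    correct = sum(classify(counts, text) == label for text, label in examples)
    return correct / len(examples)

model = train(load(TRAIN))
print("accuracy on test file:", accuracy(model, load(TEST)))
```

With real data you would read the two files from disk instead of from strings, and you would probably want smoothed probabilities (as in naive Bayes) rather than raw vote counts.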

If the above method is too simple, you can increase its complexity by counting n-grams (groups of consecutive words), or even by looking at grammatical structure after first tagging the parts of speech (e.g. a lot of spam is random garbage populated with keywords, whereas non-spam usually makes some sense). You could also implement several different classifiers and compare their accuracy.
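
As a small illustration of the n-gram idea, here is one way to extract word n-grams so they can be counted just like single words. The `ngrams` helper is a name I made up for this sketch, and the sample sentence is taken from the training example above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "how to buy viagra now".split()

# Count bigrams (n=2); these could replace single words as classifier features.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts)
```

A bigram such as ("buy", "viagra") carries more context than either word alone, at the cost of needing more training data before the counts become meaningful.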

Granted, there's a bit more to it than that, but these methods are well documented on the internet, and it's your project so it's up to you to research it further. Good luck.

Chris S 2010-04-04 14:35:20

Thank you so much :) This is a really interesting project for me to consider. (To learn about the pattern of spam over time, I have been collecting all the spam I received over the past few months in a separate folder, and now I have over 2000 spam messages collected. I guess that's sufficiently large training data :).) Now it has come to some use. Wow! Thanks for the help, mate :)

Microkernel 2010-04-05 10:25:51

Answer 2

+2 A:

Some ideas:

A program that guesses the language that an input file is written in. You'd need some samples of different languages; Wikipedia is an excellent source.
A program that, based on a text corpus, constructs words or sentences similar to those in the corpus.
Find something interesting to do with the Voynich Manuscript. You can find transcriptions here.
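
A rough sketch of the first idea, language guessing, using character trigram overlap. This is one common approach, not the only one; the two sample sentences are tiny placeholders I wrote for illustration (in practice you'd use much larger samples, e.g. from Wikipedia), and the function names are hypothetical.

```python
from collections import Counter

def char_trigrams(text):
    # Normalize case and whitespace, then count every 3-character window.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Placeholder samples; real profiles need far more text per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}

PROFILES = {lang: char_trigrams(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    grams = char_trigrams(text)

    def overlap(profile):
        # Number of trigram occurrences shared between input and profile.
        return sum(min(count, profile[g]) for g, count in grams.items())

    # Pick the language whose profile shares the most trigrams with the input.
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(guess_language("the dog sleeps"))
print(guess_language("der hund schläft"))
```

Character trigrams work reasonably well here because short letter sequences such as "the" or "sch" are strongly tied to particular languages.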

(By the way, "corpus" is just a fancy word for "bunch of text". From Wikipedia: "a large and structured set of texts (now usually electronically stored and processed)." The word usually refers to the texts that you use to train and test your algorithm, as opposed to the unknown texts that it will encounter in the field.)
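
For the second idea (constructing text similar to a corpus), a first-order Markov chain over words is one simple way to start. The corpus below is a toy sentence of my own, and `build_chain` and `generate` are hypothetical helper names.

```python
import random
from collections import defaultdict

def build_chain(words):
    # Map each word to the list of words that follow it in the corpus.
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng=random):
    # Walk the chain, picking a random observed follower at each step.
    word, out = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        out.append(word)
    return out

corpus = "the cat sat on the mat and the cat ran".split()
chain = build_chain(corpus)
print(" ".join(generate(chain, "the", 8)))
```

Because followers are stored with repetition, more frequent continuations are chosen more often. The same structure works on characters instead of words if you want to generate word-like strings.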

Thomas 2010-04-04 14:48:07

Thank you so much :) The first one looks to be within my reach (I guess this is what Google Toolbar uses to find out the language of a webpage and offer translation help). The last one looks interesting too, but in my current situation I can't do it, I guess... Thanks for the suggestions :)

Microkernel 2010-04-05 10:30:14

Answer 3

A:

You could use NLP to record some portions of a customer support call on a VoIP phone. The other options input by the user could be taken from the keypad. With this system in place, you could eliminate the need for support personnel. For example: resetting the password for an email ID in an organization, with voice-based authorization.

fixxxer 2010-10-11 21:43:45
