I have to do a final project for my computational linguistics class. We've been using OCaml the entire time, but I also have familiarity with Java. We've studied morphology, FSMs, collecting parse trees, CYK parsing, tries, pushdown automata, regular expressions, formal language theory, some semantics, etc.

Here are some ideas I've come up with. Do you have anything you think would be cool?

  1. A script that scans Facebook threads for obnoxious* comments and silently hides them with JS (this would be run with the user's consent, obviously)

  2. An analysis of a piece of writing using semantics, syntax, punctuation usage, and other metrics, to try to "fingerprint" the author. It could be used to determine if two works are likely written by the same author. Or, someone could put in a bunch of writing he's done over time, and get a sense of how his style has changed.

  3. A chat bot (less interesting/original)

I may be permitted to use pre-existing libraries for this. Do any exist for OCaml? Without a library/toolkit, the above three ideas are probably infeasible unless I limit them to a very specific domain.

Lower level ideas:

  1. Operations on finite state machines - minimizing, composing transducers, proving that an FSM is minimal. I am very interested in graph theory, so any overlap with FSMs could be a good avenue to explore. (What else can I do with FSMs? A sketch of the kind of thing I mean follows this list.)

  2. Something cool with regex?

  3. Something cool with CYK?
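
To make the FSM idea concrete, here is a minimal sketch of the kind of thing I mean: the product construction, which recognizes the intersection of two DFAs' languages. The record type and the two example machines are purely illustrative, and composing transducers generalizes the same run-both-machines-in-lockstep idea.

    (* The simplest possible DFA representation: a polymorphic state
       type with a total transition function. *)
    type 'st dfa = {
      start  : 'st;
      accept : 'st -> bool;
      step   : 'st -> char -> 'st;
    }

    let accepts m word = m.accept (List.fold_left m.step m.start word)

    (* Product construction: the intersection machine runs both DFAs
       in lockstep on pairs of states. *)
    let intersect a b = {
      start  = (a.start, b.start);
      accept = (fun (p, q) -> a.accept p && b.accept q);
      step   = (fun (p, q) c -> (a.step p c, b.step q c));
    }

    (* Example: strings with an even number of 'a's AND ending in 'b'. *)
    let even_a = { start = 0;     accept = (fun s -> s = 0);
                   step = (fun s c -> if c = 'a' then 1 - s else s) }
    let ends_b = { start = false; accept = (fun s -> s);
                   step = (fun _ c -> c = 'b') }

    let _ = accepts (intersect even_a ends_b) ['a'; 'a'; 'b']  (* true *)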

Does anyone else have any cool ideas?

*obnoxious defined as following certain patterns typical of junior high schoolers. The vagueness of this term is not an issue; for the credit I could define whatever I want and target that.

+3  A: 

Just wanted to say that I love your second idea. It has everything to do with NLP and seems very interesting to me. IMO, I would go with that one, for sure. But you should also weigh which idea attracts you the most, not only the opinions of others.

As for the language, I think you should choose the one you are most comfortable with. I'm not familiar with the libraries, but I've seen many NLP things done in Python and Java.

bruno conde
I second bruno's sentiment. This really would be a neat thing to see with a sizable corpus and web frontend.
BipedalShark
A: 

1a. Script that scans Facebook: how do you plan on defining 'obnoxious'? That sounds pretty difficult.

2a. Similarity: syntax, semantics, and other 'linguisticky' stuff sounds difficult. People have done this sort of thing with more numerical methods to great success, though, for example using the singular value decomposition. I think this method has also been incorporated into software used to check for plagiarism; it is often referred to as latent semantic analysis or latent semantic mapping.

svdlibc: http://tedlab.mit.edu/~dr/svdlibc/
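
Before reaching for the SVD, the underlying vector-space idea is easy to sketch. Here is a minimal OCaml version, assuming naive whitespace tokenization; real LSA would additionally factor the term-document count matrix with an SVD (e.g. via svdlibc) before comparing documents.

    module SM = Map.Make (String)

    (* Bag-of-words count vector for one document. *)
    let counts text =
      String.split_on_char ' ' (String.lowercase_ascii text)
      |> List.fold_left
           (fun m w ->
              SM.update w (function Some n -> Some (n + 1) | None -> Some 1) m)
           SM.empty

    let dot a b =
      SM.fold
        (fun w n acc ->
           match SM.find_opt w b with
           | Some m -> acc +. float_of_int (n * m)
           | None -> acc)
        a 0.

    (* 1.0 for identical word distributions, 0.0 when no words are shared. *)
    let cosine a b = dot a b /. (sqrt (dot a a) *. sqrt (dot b b))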

1b. FSM stuff: I'm not sure what you mean by 'proving that an FSM is minimal'; minimization is a pretty standard operation and is included in pretty much any toolkit you might encounter. If you are interested in FSMs, take a look at the

AT&T toolkit: http://www2.research.att.com/~fsmtools/fsm/

or OpenFST toolkit: http://www.openfst.org/

FSMs are growing in popularity as a principled, unified method for doing speech recognition. My graduate work focuses on this subject, and it is indeed very interesting.

What about building an HMM-based parser or chunker, or a simple Viterbi decoder? If you put together a decent training set (you'd have to tag it yourself to begin with), you could approximate a simple version of your 'obnoxious comments' tagger and use that, maybe with some sort of classifier, to 'censor' or remove the obnoxious comments.
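
As a rough illustration, a toy Viterbi decoder might look like the following in OCaml. The two-tag HMM and all of its probabilities are made up; a real tagger would estimate them from the hand-tagged training set mentioned above.

    let states = ["NOUN"; "VERB"]

    (* Toy parameters -- estimate these from tagged data in practice. *)
    let init = function "NOUN" -> 0.6 | _ -> 0.4
    let trans = function
      | ("NOUN", "VERB") -> 0.7 | ("NOUN", "NOUN") -> 0.3
      | ("VERB", "NOUN") -> 0.8 | _ -> 0.2
    let emit s w = match s, w with
      | "NOUN", ("dog" | "cat") -> 0.5
      | "VERB", ("barks" | "runs") -> 0.5
      | _ -> 0.01

    (* Most likely tag sequence. Each column maps a state to its best
       score and (reversed) tag path; plain products are fine for toy
       inputs, but use log probabilities for real ones. *)
    let viterbi words =
      let step col word =
        List.map
          (fun s ->
             let best (sc, path) s' =
               let sc', path' = List.assoc s' col in
               let cand = sc' *. trans (s', s) in
               if cand > sc then (cand, path') else (sc, path)
             in
             let sc, path = List.fold_left best (0., []) states in
             (s, (sc *. emit s word, s :: path)))
          states
      in
      match words with
      | [] -> []
      | w :: ws ->
        let col0 = List.map (fun s -> (s, (init s *. emit s w, [s]))) states in
        let last = List.fold_left step col0 ws in
        let _, (_, path) =
          List.fold_left
            (fun acc c -> if fst (snd c) > fst (snd acc) then c else acc)
            (List.hd last) (List.tl last)
        in
        List.rev path

    (* viterbi ["dog"; "barks"] = ["NOUN"; "VERB"] *)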

blackkettle
+2  A: 

Use the Twitter APIs to pull all the tweets from a social conversation and combine/summarize/publish the content as an essay/article/blog/etc. that someone can read in one piece. Correlate concepts and ideas, expand references, improve grammar. Kind of like bettween.com, but souped up with NLP.

zac
I would buy lots of beer for someone who could write an app that turns most tweets into intelligible sentences.
jason
+7  A: 
  1. Obnoxious language filtering - I think this will reduce to a process very similar to spam email filtering, that is, counting the frequency of a set of more-or-less 'obnoxious' words (a minimal sketch appears at the end of this answer). It doesn't sound like you will get the scope to do anything particularly clever, unless you also use other sources of information (e.g. the structure of the social links shared between the sender and recipient, perhaps). On the other hand, online bullying is a very serious thing, and you can bet Facebook/Myspace and the other social networking sites care a lot about tackling it.

  2. Stylistic Analysis - There has been some work done on this in various forms, often under the name authorship analysis. Shlomo Argamon does a lot of work in this area, and you could probably discover a lot more from the references in his papers. One of the best ways to profile an author is to learn the distribution of their usage of a set of stopwords (a.k.a. function words), such as 'and', 'but', 'if', etc. I think there's a lot more scope to do something new and interesting in this area - authorship analysis on internet data is a hard problem - but also a lot more scope to fail.

  3. Chat bot - You're right, this is a pretty standard project. It's also quite hard to measure success/failure. I think the project would be more compelling if it was a chat-bot with some kind of purpose, like answering questions in a limited domain, but that's something that's very difficult to do well.

The rest are really too vague to make any comments on, sorry.

There aren't any NLP libraries that I know of in OCaml; it's just not a particularly popular programming language. However, I do know of a machine learning library in OCaml, called MEGAM, written by Hal Daume, who is a very good NLP researcher, and which has been used for NLP tasks. I get the feeling that figuring out MEGAM and using it for some NLP task might be too big a project to take on, however.

Some other ideas:

  • Sentiment Analysis - A very trendy area of research. You could make this task as easy or hard as you like, from scoring a document as positive/negative to extracting specific topics and generating a sentiment score for each one.
  • Coreference/Anaphora resolution - A difficult task but a very important one. Some approaches use a graph representation (each mention is a node with edges between them if they co-refer) to enforce things like transitivity.
  • Document Classification - You could try to learn a system on the StackOverflow data set to suggest tags for a given question. It's a fairly well-known problem with some established techniques, but it's an interesting data set and has an obvious, useful real-world application. You could also see if you can find specific features of a question (word choice, length, formatting, punctuation, etc.) that cause it to be voted highly.
  • Haiku Generation - Kind of a silly one, but I always thought it was an interesting idea. Syllable counting could be done with the CMU pronouncing dictionary. Should be a lot of fun, if not particularly useful.
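
Coming back to the first idea, a minimal sketch of the frequency-counting approach might look like this in OCaml; the lexicon, whitespace tokenization, and threshold are all arbitrary stand-ins:

    (* Stand-in lexicon: define 'obnoxious' however you like. *)
    let obnoxious = ["lol"; "omg"; "noob"; "ur"]

    (* Fraction of a comment's words that appear in the lexicon. *)
    let score comment =
      let words = String.split_on_char ' ' (String.lowercase_ascii comment) in
      let hits = List.filter (fun w -> List.mem w obnoxious) words in
      float_of_int (List.length hits) /. float_of_int (max 1 (List.length words))

    (* Flag a comment once its rate crosses an (arbitrary) threshold. *)
    let is_obnoxious ?(threshold = 0.2) c = score c > threshold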
StompChicken
+2  A: 

Hi...

1 - What would be great for me: when I review academic papers, I need to know which parts of the work are:

  • completely original
  • purely copy/paste
  • pure paraphrase
  • saying exactly the opposite of a previous reference

Ideally, the tool would check the references inside the paper plus all the previous work of the authors (and might check CiteSeer for references that have potentially been omitted, deliberately or not). That would be really useful.

2 - I would like a tool that checks all the questions on SO, looks for similar questions that already have a bunch of responses, and generates a response that may be adequate.

LB
A: 

May I suggest that you look into the programming language Prolog. It is a declarative logic programming language, and natural language processing was one of its early goals. About halfway down that Wiki page is an example of how you can use it as a language parser. You can make it easily generate thousands of grammatically correct sentences and phrases. It is a very powerful tool, and I think using it would be very interesting.

Poindexter
Prolog is outdated for this, since it gives purely boolean answers. So, for a given sentence, it can return thousands of 'true' parses without determining which one is most likely. There are some projects that try to incorporate probabilities into a Prolog-like language, e.g. www.dyna.org.
Most natural language processing these days is not done in Prolog. Prolog addresses a certain way of looking at natural language really well (parsing context-free grammars), but most of that can be done better with better algorithms (like Earley's algorithm for chart parsing), better formalisms (like probabilistic context-free grammars), and better grammars (things that are trained from real-world sentences, rather than being hand-constructed). Besides, his whole class has been about how to do the kinds of techniques that you're suggesting he use Prolog for. He knows how to do those things.
Ken Bloom
+1  A: 

I advise against the use of Java, unless there is a library that you desperately need. I did an NLP final project in Java once and found that it lacks a lot of the flexibility that is often needed (strict type system, no anonymous functions, etc.). Unfortunately, I don't know OCaml, but in case you know Python, there are a lot of NLP libraries available for it, e.g., the very comprehensive and actively developed NLTK.

Tom Bartel
But one could use Scala wherever Java would be useful, and that would take care of the flexibility issues. The Java NLP libraries tend to be among the best.
Ken Bloom
You're more of an expert in the field than I am. However, my feeling is that the reason Java is so prominent in NLP is not its superior aptitude but the fact that a whole generation of CS students has got used to it to the point where they unquestioningly use it for whatever task is at hand. Personally, I would prefer a language that does not have to be complemented by another language like Scala.
Tom Bartel
A: 

1) Take a body of text and convert all references of "her" to either "him" or "his". Basically, this question from SO, which seems like a neat project. (A toy sketch appears at the end of this answer.)

2) I'm assuming that you've studied the transformational grammar approach to trees (X-bar theory, etc.). There is this whole other syntactic theory called Head-Driven Phrase Structure Grammar (HPSG), which describes syntax not with tree "hierarchy" alone but with attribute-value matrices (feature structures) as well.

One of the advantages of using this system is that you can describe syntax as directed acyclic graphs, rather than trees. Your prof might be able to point you to more/better resources, but this page (vaguely) explains the idea. Unfortunately, my experience has been that HPSG resources on the net are pretty lacking.

But, once you have a feel for HPSG feature structures, you can search the literature for people (linguists and computational linguists) who've come up with ways to use the graph representations for interesting things.

Maybe you could create a transformational-tree-to-HPSG-graph converter? (You'd have to know how things like "raising", "control", "passivization", etc. would be transformed into HPSG's "SLASH", "REL", etc.)
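
As for 1), a toy sketch of the pronoun-rewriting idea is below. The hard case is "her", which is ambiguous between possessive ("her book" -> "his book") and objective ("saw her" -> "saw him") uses; the little noun list is a crude stand-in for the part-of-speech tagger you would actually want.

    (* Stand-in for a POS tagger: treat these words as nouns. *)
    let toy_nouns = ["book"; "car"; "idea"; "project"]

    let flip_token tok next =
      match String.lowercase_ascii tok with
      | "she"  -> "he"
      | "hers" -> "his"
      | "her"  ->
        if List.mem (String.lowercase_ascii next) toy_nouns
        then "his"   (* possessive reading: "her book" *)
        else "him"   (* objective reading: "saw her" *)
      | _ -> tok

    (* Rewrite a whitespace-tokenized text, peeking one token ahead. *)
    let rewrite text =
      let rec go = function
        | [] -> []
        | [t] -> [flip_token t ""]
        | t :: (n :: _ as rest) -> flip_token t n :: go rest
      in
      String.concat " " (go (String.split_on_char ' ' text))

    (* rewrite "she gave her book to her" = "he gave his book to him" *)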

rascher
A: 

I don't know that a whole lot of the curriculum for your class will be that useful for problems 1 or 2. Some of the better techniques for these kinds of problems do really simple linguistic stuff (like part-of-speech tagging, simply removing stop words, and looking at bigrams and trigrams) and have a machine-learning text classification component that's not too sophisticated on its own (standard techniques like Naive Bayesian classifiers, Maximum Entropy classifiers, and Support Vector Machines are pretty much black boxes algorithm-wise and perform well). Have a look at these survey papers about topical text classification and authorship detection to get an idea of where you can get started.
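
For instance, bigram feature extraction is only a few lines in OCaml; this is just the feature step, and a classifier such as naive Bayes would consume the resulting pairs:

    (* Adjacent-token pairs from a token list. *)
    let rec bigrams = function
      | a :: (b :: _ as rest) -> (a, b) :: bigrams rest
      | _ -> []

    (* bigrams ["u"; "r"; "a"; "noob"] = [("u","r"); ("r","a"); ("a","noob")] *)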

Something better suited to the curriculum you've described might be to construct a morphological analyzer for a foreign language that you're familiar with, or to construct a stemmer (a poor man's version of a morphological analyzer) that maps morphologically-related terms to the same entry in an index -- something that can be used by search engines.
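
A minimal suffix-stripping sketch, with a few hand-written rules tried in order; a real stemmer such as Porter's has many more rules and conditions, as the rough edges in the examples show.

    (* (suffix, replacement) rules, tried in order. *)
    let rules = [ "ations", "ate"; "ing", ""; "ies", "y"; "ed", ""; "s", "" ]

    let strip word (suffix, repl) =
      let lw = String.length word and ls = String.length suffix in
      if lw > ls && String.sub word (lw - ls) ls = suffix
      then Some (String.sub word 0 (lw - ls) ^ repl)
      else None

    let stem word =
      let rec first = function
        | [] -> word   (* no rule applies: leave the word alone *)
        | r :: rest ->
          (match strip word r with Some w -> w | None -> first rest)
      in
      first rules

    (* stem "ponies" = "pony"; stem "parses" = "parse"; but
       stem "parsing" = "pars", which fancier rule conditions fix *)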

If you don't need to come up with a new technique for your class (i.e. if you're an undergrad), then there are a large number of standard NLP tasks that you could implement in OCaml, for example a parser trained on the Penn Treebank, a parser for some other grammar formalism, a part-of-speech tagger, or literally dozens of other applications.

Ken Bloom