A lot of Natural Language Processing (NLP) algorithms and libraries have a hard time working with random texts from the web, usually because they presuppose clean, articulate writing. I can understand why that would be easier than parsing YouTube comments.

My question is: given a random piece of text, is there a process to determine whether that text is well written and is a good candidate for use in NLP? What is the general name for such algorithms?

I would appreciate links to articles, algorithms or code libraries, but I would settle for good search terms.

+3  A: 

I haven't used any tools, but I have an idea.

A simple strategy would be to take clean English text and compute a histogram of part-of-speech (POS) tags, such as nouns, adjectives, verbs, articles, etc.

Then compute the same histogram for the sample text.

If this histogram is "close" enough to the benchmark, the quality of the sample text is comparable to that of the original text. You will need to define a "closeness" parameter.
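For illustration, here is a minimal sketch of that idea in Python using NLTK's tagger; the corpus file names and the 0.2 threshold are assumptions you would tune for your own data:

```python
# Sketch of the POS-histogram comparison; assumes NLTK's 'punkt' and
# 'averaged_perceptron_tagger' models have been downloaded.
from collections import Counter
import nltk

def pos_histogram(text):
    """Relative frequency of each POS tag in the text."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def histogram_distance(h1, h2):
    """Total variation distance between two tag distributions (0 = identical)."""
    tags = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(t, 0.0) - h2.get(t, 0.0)) for t in tags)

# 'clean_corpus.txt' and 'sample.txt' are placeholder file names.
benchmark = pos_histogram(open('clean_corpus.txt').read())
sample = pos_histogram(open('sample.txt').read())

# The 0.2 cutoff is arbitrary; tune the "closeness" parameter on held-out data.
if histogram_distance(benchmark, sample) < 0.2:
    print("Sample is close to the benchmark text.")
```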

Language identification typically employs a similar technique: an n-gram profile is created for each language, a similar profile is built for the sample text, and the two profiles are compared to estimate the probability that the sample text is in that language.
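In the same spirit, a character n-gram profile can be built in a few lines; this follows the classic Cavnar & Trenkle "out-of-place" measure, and the profile size of 300 is just a common default:

```python
# Character n-gram profiling as used in language identification.
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """The 'top' most frequent character n-grams, ranked by frequency."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(reference, sample):
    """Sum of rank differences between two profiles; lower = more similar."""
    ranks = {g: i for i, g in enumerate(reference)}
    max_rank = len(reference)
    return sum(abs(ranks.get(g, max_rank) - i) for i, g in enumerate(sample))
```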

Shashikant Kore
+3  A: 

I'm not familiar with any software package that does this per se. It sounds like a classification problem you might try tackling by labeling a couple hundred documents as good or bad and then deriving features from the text (percentage of correctly spelled words, best parse probabilities of sentences, who knows). From that labeled data, you could build a good/bad classifier that might do something useful. Then again, it might not.
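A rough sketch of what that could look like with scikit-learn; the features here (dictionary-word ratio, average word length, uppercase ratio) are purely illustrative, and the word list path is an assumption:

```python
# Hand-crafted features plus a binary good/bad classifier.
import re
from sklearn.linear_model import LogisticRegression

# Any word list will do; this path is a common location on Unix systems.
VOCAB = set(open('/usr/share/dict/words').read().lower().split())

def features(text):
    words = re.findall(r"[a-z']+", text.lower())
    n = max(len(words), 1)
    in_vocab = sum(w in VOCAB for w in words) / n  # "correctly spelled" proxy
    avg_len = sum(map(len, words)) / n
    upper = sum(c.isupper() for c in text) / max(len(text), 1)
    return [in_vocab, avg_len, upper]

def train_classifier(docs, labels):
    """docs: raw strings; labels: 1 = good, 0 = bad."""
    clf = LogisticRegression()
    clf.fit([features(d) for d in docs], labels)
    return clf
```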

You could also try using readability measures. Typically they are used to say things like "this text is at a fourth grade reading level", but they might provide some signal for what you intend. Some examples include the Flesch-Kincaid grade level or the Gunning fog index.
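If you want to try those measures without implementing syllable counting yourself, the third-party textstat package (pip install textstat) covers both; 'sample.txt' is a placeholder:

```python
import textstat

text = open('sample.txt').read()
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(text))
print("Gunning fog index:", textstat.gunning_fog(text))
```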

ealdent
+5  A: 

'Well written' and 'good for NLP' may go together, but they don't have to. For a text to be 'good for NLP', it should probably contain whole sentences with a verb and a full stop at the end, and it should perhaps convey some meaning. For a text to be well written, it should also be well-structured, cohesive, and coherent, and correctly substitute pronouns for nouns, etc. What you need depends on your application.

The chances of a sentence being properly processed by an NLP tool can often be estimated by some simple heuristics: Is it too long (more than 20 or 30 words, depending on the language)? Too short? Does it contain many weird characters? Does it contain URLs or email addresses? Does it have a main verb? Is it just a list of something? To my knowledge, there is no general name for this kind of filtering, nor any particular algorithm for it; it usually just falls under 'preprocessing'.
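One possible encoding of those heuristics in plain Python; every threshold and character class here is an assumption to tune per language and corpus:

```python
import re

URL_OR_EMAIL = re.compile(r'https?://\S+|\S+@\S+')

def looks_processable(sentence, min_words=3, max_words=30):
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False                      # too short or too long
    if URL_OR_EMAIL.search(sentence):
        return False                      # contains a URL or email address
    # "Weird" characters: anything outside letters, digits, spaces, and
    # common punctuation, allowed only up to a small ratio.
    weird = sum(not (c.isalnum() or c.isspace() or c in ".,;:!?'\"()-")
                for c in sentence)
    # A main-verb check would require a POS tagger and is omitted here.
    return weird / len(sentence) <= 0.1
```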

As to a sentence being well written: some work has been done on automatically evaluating readability, cohesion, and coherence, e.g. the articles by Miltsakaki ("Evaluation of text coherence for electronic essay scoring systems" and "Real-time web text classification and analysis of reading difficulty") or Higgins ("Evaluating multiple aspects of coherence in student essays"). These approaches are all based on one theory of discourse structure or another, such as Centering Theory. The articles are rather theory-heavy and assume knowledge of both Centering Theory and machine learning. Nonetheless, some of these techniques have successfully been applied by ETS to automatically scoring students' essays, and I think this is quite similar to what you are trying to do, or at least you may be able to adapt a few ideas.

All this being said, I believe that within the next few years NLP will have to develop techniques to process language that is not well-formed with respect to current standards. There is a massive amount of extremely valuable data out there on the web, consisting of exactly the kinds of text you mentioned: YouTube comments, chat messages, Twitter and Facebook status messages, etc. All of them potentially contain very interesting information. So, who should adapt - the people writing that way, or NLP?

ferdystschenko
+5  A: 

One easy thing to try would be to classify the text as well written or not using an n-gram language model. To do this, you would first train a language model on a collection of well written text. Given a new piece of text, you could then run the model over it and only pass it on to other downstream NLP tools if the per-word perplexity is sufficiently low (i.e., if it looks sufficiently similar to the well written training text).
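As a sketch of how this could look with NLTK's nltk.lm module (NLTK 3.4+); the trigram order and perplexity cutoff are arbitrary choices:

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # trigram model

# train_sents: tokenized sentences from well written text (assumed available).
train_data, vocab = padded_everygram_pipeline(ORDER, train_sents)
lm = Laplace(ORDER)
lm.fit(train_data, vocab)

def is_well_written(tokens, max_perplexity=500.0):
    """Accept a tokenized sentence if its per-word perplexity is low enough."""
    test_ngrams = list(ngrams(pad_both_ends(tokens, n=ORDER), ORDER))
    return lm.perplexity(test_ngrams) < max_perplexity
```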

To get the best results, you should probably train your n-gram language model on text that is similar to whatever was used to train the other NLP tools you're using. That is, if you're using a phrase structure parser trained on newswire, then you should also train your n-gram language model on newswire.

In terms of software toolkits you could use for something like this, SRILM would be a good place to start.

However, an alternative solution would be to try to adapt whatever NLP tools you're using to the text you want to process. One approach for something like this would be self-training, whereby you run your NLP tools over the type of data you would like to process and then retrain them on their own output. For example, McClosky et al. (2006) used this technique to take a parser originally trained on the Wall Street Journal and adapt it to parsing biomedical text.
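In pseudocode terms, the self-training loop looks something like this; the parse/retrain methods stand in for whatever API your actual tool exposes:

```python
def self_train(parser, labeled_data, unlabeled_sentences, rounds=1):
    """Adapt 'parser' to a new domain via self-training (hypothetical API)."""
    for _ in range(rounds):
        # Run the current model over in-domain, unlabeled text.
        auto_labeled = [(s, parser.parse(s)) for s in unlabeled_sentences]
        # Retrain on the original treebank plus the model's own output.
        parser = parser.retrain(labeled_data + auto_labeled)
    return parser
```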

dmcer
+1  A: 

As other people noted, "well written" is quite a subjective point of view. The best thing you could do is to build a corpus of both "well written" and "not well written" (according to your standards) texts. You would get bonus points if you were able to create a method to classify them in numerical terms (0.1 for YouTube comments, 0.9 for Stack Overflow comments ;).

Once you have done that, there will be many alternatives, but I would recommend statistical ones in this case. N-grams could probably do the job with simple relative frequencies, but I would suggest you investigate Markov models and especially Bayesian text classification tools.

In fact, the best single answer, once you have a collection of "good" and "bad" texts, is to use one of the many free classification systems available (think of anti-spam tools). The best one will depend on your needs and the programming language you are most comfortable with.
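For example, a bag-of-words naive Bayes classifier, the kind of model behind many anti-spam tools, takes only a few lines with scikit-learn; docs and labels are your own labeled corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# docs: raw text strings; labels: "good" or "bad" per your own standards.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["ur vid is gr8 lol"]))  # most likely "bad"
```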

Giacomo