Markov chains are an (almost standard) way to generate random gibberish that looks intelligent to the untrained eye. How would you go about distinguishing Markov-generated text from human-written text?

It would be awesome if the resources you point to are Python-friendly.

+2  A: 

If you had several large Markov-generated texts, you might be able to detect them by comparing word frequencies across the samples. Since Markov chains depend on constant word probabilities, the proportion of any given word should be roughly equal from sample to sample.
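
A minimal sketch of that comparison in Python (the function names and the choice of words_to_check are just illustrative; you would pick high-frequency words from your own samples):

    from collections import Counter

    def word_frequencies(text):
        """Relative frequency of each word in a text."""
        words = text.lower().split()
        counts = Counter(words)
        total = float(len(words))
        return {word: count / total for word, count in counts.items()}

    def frequency_spread(samples, words_to_check):
        """Max-minus-min relative frequency of each word across samples.
        Output from a single fixed Markov model should show small spreads."""
        freqs = [word_frequencies(s) for s in samples]
        return {w: max(f.get(w, 0.0) for f in freqs) -
                   min(f.get(w, 0.0) for f in freqs)
                for w in words_to_check}

For example, frequency_spread([text1, text2, text3], ["the", "of", "and"]) returns a spread per word; consistently tiny spreads across many samples hint at a fixed generating model.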

Evan Meagher
It might also pay to look at the Python-based Natural Language Toolkit: http://nltk.sourceforge.net/ - that said, it might be a tad excessive if you're just interested in word frequencies.
Markus
If the word frequencies are engineered to look like real text, you may run into problems when you rely on the frequencies of common words like "the" and "a".
Janusz
The problem with this approach is that if the Markov chain is trained on text with word and word-transition frequencies similar to the human-written text, its output will look much like the human-written text.
James Thompson
+5  A: 

One simple approach would be to have a large group of humans read the input text for you and see if it makes sense. I'm only half-joking; this is a tricky problem.

I believe this to be a hard problem, because Markov-chain-generated text will have many of the same properties as real human text in terms of word frequency and simple relationships between the ordering of words.

The differences between real text and text generated by a Markov chain are in higher-level rules of grammar and in semantic meaning, which are hard to encode programmatically. The other problem is that Markov chains are good enough at generating text that they sometimes come up with grammatically and semantically correct statements.

As an example, here's an aphorism from the kantmachine:

Today, he would feel convinced that the human will is free; to-morrow, considering the indissoluble chain of nature, he would look on freedom as a mere illusion and declare nature to be all-in-all.

While this string was written by a computer program, it's hard to say that a human would never write it.

I think that unless you can give us more specific details about the computer- and human-generated texts that expose more obvious differences, it will be difficult to solve this with computer programming.

James Thompson
This is quite disturbing, in fact. I've read the Critique of Pure Reason (the only work by Kant I could actually get myself to read and comprehend) and I'd NEVER have said that the aphorism is machine-generated.
shylent
@shylent - that was the fourth hit on the page, and I agree, it's very much in the style of Kant. This would be a very good example for a course that involves Markov chains!
James Thompson
+2  A: 

Crowdsourcing. Use Mechanical Turk and get a number of humans to vote on this. There are even some libraries to help you pull this off. For example:

Here's a blog post from O'Reilly Radar on tips for using Mechanical Turk to get your work done:

ars
+2  A: 

I suggest a generalization of Evan's answer: make a Markov model of your own and train it on a big chunk of the (very large) sample you're given, reserving the rest of the sample as "test data". Now see how well the model you've trained does on the test data, e.g. with a chi-square test that will flag situations in which the "fit is TOO good" (suggesting the test data really was generated by this model) as well as ones in which the fit is very bad (suggesting an error in model structure; an over-trained model with the wrong structure does a notoriously bad job in such cases).
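
Here's a rough sketch of that train/test setup, assuming (my assumption, since the model structure is left open here) a simple bigram model with add-one smoothing; a real test would compare the statistic against the chi-square distribution's tails on both ends:

    from collections import Counter, defaultdict

    def train_bigram_model(words):
        """Transition counts: model[a][b] = times b followed a in training."""
        model = defaultdict(Counter)
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
        return model

    def chi_square_statistic(model, words, vocab_size):
        """Pearson-style statistic comparing observed transition counts in
        the test text to counts expected under the (smoothed) model."""
        observed = defaultdict(Counter)
        for a, b in zip(words, words[1:]):
            observed[a][b] += 1
        stat = 0.0
        for a, followers in observed.items():
            n = sum(followers.values())   # transitions out of a in the test text
            m = sum(model[a].values())    # transitions out of a in training
            for b, obs in followers.items():
                p = (model[a][b] + 1.0) / (m + vocab_size)  # add-one smoothing
                stat += (obs - n * p) ** 2 / (n * p)
        return stat

    # Reserve e.g. the last 20% of the sample as test data:
    # words = sample.split(); cut = int(len(words) * 0.8)
    # model = train_bigram_model(words[:cut])
    # stat = chi_square_statistic(model, words[cut:], len(set(words)))

An implausibly small statistic suggests the test data really was generated by (essentially) the model you trained; a very large one suggests the model structure is wrong for the text.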

Of course there are still many issues of calibration, such as the structure of the model: are you suspecting a simple model based on N-tuples of words and little more, or a more sophisticated one with grammar states and the like? Fortunately you can calibrate things pretty well by using large corpora of known-to-be-natural text, as well as corpora you generate yourself with models of various structures.

A different approach is to use nltk to parse the sentences you're given. A small number of mis-parses is to be expected even in natural text (humans are imperfect, and so is the parser; it may not know that word X can be used as a verb and only classify it as a noun, etc, etc), but most Markov models (unless they're modeling essentially the same grammar structure your parser happens to be using -- and you can use several parsers to try and counteract that!-) will cause VASTLY more mis-parses than even dyslexic humans. Again, calibrate that on natural vs. synthetic texts, and you'll see what I mean!-)
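
A minimal sketch of the parsing approach, using NLTK's chart parser; the grammar below is a toy placeholder (a real attempt needs a broad-coverage grammar or an off-the-shelf statistical parser), and the API shown is that of recent NLTK versions:

    import nltk

    # Toy grammar for illustration only; real text needs broad coverage.
    toy_grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> 'the' N
        VP -> V NP
        N -> 'dog' | 'cat'
        V -> 'saw' | 'chased'
    """)

    def misparse_rate(sentences, grammar):
        """Fraction of sentences the parser cannot derive from the grammar."""
        parser = nltk.ChartParser(grammar)
        failures = 0
        for sentence in sentences:
            tokens = sentence.lower().split()
            try:
                trees = list(parser.parse(tokens))
            except ValueError:   # a token the grammar doesn't cover at all
                trees = []
            if not trees:
                failures += 1
        return failures / float(len(sentences))

Calibrate the resulting rate on known-natural and known-synthetic corpora, as suggested above, and flag texts whose rate sits far above the natural baseline.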

Alex Martelli
+2  A: 

You could use a "brute force" approach, whereby you compare the generated language to data collected on n-grams of higher order than the Markov model that generated it.

That is, if the language was generated with a 2nd-order Markov model, n-grams up to 3-grams will have the correct frequencies, but 4-grams probably won't.

You can get frequencies for n-grams up to 5-grams from Google's public n-gram dataset. It's huge though (24 GB compressed), and you have to get it by post on DVDs from the LDC.

EDIT: Added some implementation details

The n-grams have already been counted, so you just need to store the counts (or frequencies) in a way that's quick to search. A properly indexed database, or perhaps a Lucene index, should work.

Given a piece of text, scan across it, look up the frequency of each 5-gram in your database, and see where it ranks compared to other 5-grams that start with the same four words.
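
A sketch of that scan, with ngram_counts standing in for whatever store you use (here a plain dict keyed by 5-word tuples, purely for illustration; the linear scan over rivals is exactly what the database or Lucene index would replace with a prefix lookup):

    def continuation_rank(ngram_counts, five_gram):
        """Rank of a 5-gram among all 5-grams sharing its first four
        words (1 = most frequent continuation; large = rare or unseen)."""
        prefix = five_gram[:4]
        count = ngram_counts.get(five_gram, 0)
        rival_counts = [c for gram, c in ngram_counts.items()
                        if gram[:4] == prefix]
        return 1 + sum(c > count for c in rival_counts)

    def scan_text(tokens, ngram_counts):
        """Continuation rank for every 5-gram in a token stream;
        consistently poor ranks suggest unnatural 5-gram statistics."""
        return [continuation_rank(ngram_counts, tuple(tokens[i:i + 5]))
                for i in range(len(tokens) - 4)]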

Practically, a bigger obstacle might be the licensing terms of the dataset. Using it for a commercial app might be prohibited.

pufferfish
I like this approach, but wouldn't it be computationally infeasible?
uswaretech
I don't see how; I've added some details to the answer.
pufferfish