views:

477

answers:

2

We were just assigned a new project in my data structures class -- Generating text with markov chains.

Overview

Given an input text file, we create an initial seed of length n characters. We add that to our output string and choose our next character based on frequency analysis..

This is the cat and there are two dogs.

Initial seed: "Th"
Possible next letters -- i, e, e
Therefore, probability of choosing i is 1/3, e is 2/3.

Now, say we choose i. We add "i" to the output string. Then our seed becomes
hi and the process continues.

My solution

I have 3 classes, Node, ConcreteTrie, and Driver

Of course, the ConcreteTrie class isn't a Trie of the traditional sense. Here is how it works:

Given the sentence with k=2:

This is the cat and there are two dogs.

I generate Nodes Th, hi, is, ... + ... , gs, s. Each of these nodes have children that are the letter that follows them. For example, Node Th would have children i and e. I maintain counts in each of those nodes so that I can later generate the probabilities for choosing the next letter.

My question:

First of all, what is the most efficient way to complete this project? My solution seems to be very fast, but I really want to knock my professor's socks off. (On my last project A variation of the Edit distance problem, I did an A*, a genetic algorithm, a BFS, and Simulated Annealing -- and I know that the problem is NP-Hard lol)

Second, what's the point of this assignment? It doesn't really seem to relate to much of what we've covered in class. What are we supposed to learn?

Thanks!

EDIT: I'm not looking for a solution, just an approach.

+4  A: 

On the relevance of this assignment with what you covered in class (Your second question). The idea of a 'data structures' class is to expose students to the very many structures frequently encountered in CS: lists, stacks, queues, hashes, trees of various types, graphs at large, matrices of various creed and greed, etc. and to provide some insight into their common implementations, their strengths and weaknesses and generally their various fields of application.
Since most any game / puzzle / problem can be mapped to some set of these structures, there is no lack of subjects upon which to base lectures and assignments. Your class seems interesting because while keeping some focus on these structures, you are also given a chance to discover real applications.
For example in a thinly disguised fashion the "cat and two dogs" thing is an introduction to statistical models applied to linguistics. Your curiosity and motivation prompted you to make the relation with markov models and it's a good thing, because chances are you'll meet "Markov" a few more times before graduation ;-) and certainly in a professional life in CS or related domain. So, yes! it may seem that you're butterflying around many applications etc. but so long as you get a feel for what structures and algorithms to select in particular situations, you're not wasting your time!

Now, a few hints on possible approaches to the assignment
The trie seems like a natural support for this type of problem. Maybe you can ask yourself however how this approach would scale, if you had to index say a whole book rather than this short sentence. It seems mostly linearly, although this depends on how each choice on the three hops in the trie (for this 2nd order Markov chain) : as the number of choices increase, picking a path may become less efficient.
A possible alternative storage for the building of the index is a stochatisc matrix (actually a 'plain' if only sparse matrix, during the statistics gathering process, turned stochastic at the end when you normalize each row -or column- depending on you set it up) to sum-up to one (100%). Such a matrix would be roughly 729 x 28, and would allow the indexing, in one single operation, of a two-letter tuple and its associated following letter. (I got 28 for including the "start" and "stop" signals, details...)
The cost of this more efficient indexing is the use of extra space. Space-wise the trie is very efficient, only storing the combinations of letter triplets effectively in existence, the matrix however wastes some space (you bet in the end it will be very sparsely populated, even after indexing much more text that the "dog/cat" sentence.)
This size vs. CPU compromise is very common, although some algorithms/structures are somtimes better than others on both counts... Furthermore the matrix approach wouldn't scale nicely, size-wize, if the problem was changed to base the choice of letters from the preceding say, three characters.
None the less, maybe look into the matrix as an alternate implementation. It is very much in spirit of this class to try various structures and see why/where they are better than others (in the context of a specific task).
A small side trip you can take is to create a tag cloud based on the probabilities of the letters pairs (or triplets): both the trie and the matrix contain all the data necessary for that; the matrix with all its interesting properties, may be more suited for this.
Have fun!

mjv
Now THAT is an answer. I really appreciate that. I still want to see if anyone else has input.
dacman
Also, rather than implementing a true Trie, I chose to create Nodes that are of the appropriate size. For example a 5th order Trie on the sentence "The dog ran fast" would result in top level nodes "The d", "he do", "e dog" etc, with their children being the letters following those 5 characters. This eliminates the aforementioned inefficiency.
dacman
Where did you get the 729?
dacman
@dacman Interesting trie. Two concerns A) beware of the cost of building the trie; in our attempt to make such structure for efficient for use, we sometimes significantly increase the complexity of its creation; the benefits of doing so depend on the problem B) Since we only focus on a single letter following a given count (in the problem, 2, in the 5th order example, 5) of letters, the trie gets used more than a hash than a graph
mjv
@dacman 729 is (26 + 1) ^ 2, for 26 letters plus ONE begin/end code. Although the beg and end messages may be distinct, we can bunch them together for the matrix since we can only have beg+x or x+end. Do use a 784 * 28 matrix if you do not want to worry about this fix-up. Small details...
mjv
I see now. I took the easy way out and implemented a right stochastic matrix. It's working wonderfully. I'll try to minimize it before my comp presentation tonight!
dacman
Good luck :-) Beware of over-zealous `minimization` of the matrix. BTW it is very worthy these days to have insight into sparse matrix implementations! In making the matrix sparser, you'll start reintroducing cost for the read and/or write of the matrix; this is certainly not warranted for this example: at roughly 44KBytes (for 16-bit int cells), the non-spase matrix implementation is fine, but as said, exploring sparse matrix constructs (for when it is _really_ needed) is also a worthy thing. Sparse matrices and associated algebra algorithms can be a semester class in of itself ;-)
mjv
A: 

You using bigram approach with characters, but usually it applied to words, because the output will be more meaningful if we use just simple generator as in your case).

1) From my point of view you doing all right. But may be you should try slightly randomize selection of the next node? E.g. select random node from 5 highest. I mean if you always select node with highest probability your output string will be too uniform.

2) I've done exactly the same homework at my university. I think the point is to show to the students that Markov chains are powerful but without extensive study of application domain output of generator will be ridiculous

Trickster
I don't always select the node with the highest probability. Given "Prefix Node" *Th* with children i, i, e, e, e, a. There is a 2/6 probability of my next node being *hi*, 3/6 chance of it being *he*, and a 1/6 chance of it being hi. When I reach the end of the string (ie, the node has no children) I select a Random "Prefix Node" from the Trie and begin again. This continues until I create a string of a specified length.
dacman