ansaurus

Question

Answer 1

+4 A:

On the relevance of this assignment to what you covered in class (your second question): the idea of a 'data structures' class is to expose students to the many structures frequently encountered in CS (lists, stacks, queues, hashes, trees of various types, graphs at large, matrices of various creed and greed, etc.) and to provide some insight into their common implementations, their strengths and weaknesses, and generally their various fields of application.
Since most any game / puzzle / problem can be mapped to some set of these structures, there is no lack of subjects upon which to base lectures and assignments. Your class seems interesting because while keeping some focus on these structures, you are also given a chance to discover real applications.
For example, in a thinly disguised fashion, the "cat and two dogs" thing is an introduction to statistical models applied to linguistics. Your curiosity and motivation prompted you to make the connection with Markov models, and that's a good thing, because chances are you'll meet "Markov" a few more times before graduation ;-) and certainly in a professional life in CS or a related domain. So, yes! It may seem that you're butterflying around many applications, but so long as you get a feel for which structures and algorithms to select in particular situations, you're not wasting your time!

Now, a few hints on possible approaches to the assignment
The trie seems like a natural support for this type of problem. Maybe you can ask yourself, however, how this approach would scale if you had to index, say, a whole book rather than this short sentence. It seems to scale mostly linearly, although this depends on how the choice is made at each of the three hops in the trie (for this 2nd order Markov chain): as the number of choices increases, picking a path may become less efficient.
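As a rough illustration of the three hops, here is a minimal sketch of such a trie built from nested dictionaries. It assumes each word is indexed separately and padded with hypothetical begin (`^`) and end (`$`) markers; the names `build_trie` and `followers` are made up for this example and do not come from the assignment.

```python
START, STOP = "^", "$"  # hypothetical begin/end markers (an assumption)

def build_trie(text, order=2):
    """Nested dicts: two hops select the prefix, the third holds a count."""
    trie = {}
    for word in text.lower().split():
        padded = START * order + word + STOP
        for i in range(len(padded) - order):
            node = trie
            # Hops 1..order: walk (or create) the path for the prefix.
            for ch in padded[i:i + order]:
                node = node.setdefault(ch, {})
            # Last hop: count the letter that follows this prefix.
            nxt = padded[i + order]
            node[nxt] = node.get(nxt, 0) + 1
    return trie

def followers(trie, prefix):
    """Return {next_letter: count} for a prefix of exactly `order` letters."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return {}
    return node

trie = build_trie("the cat and two dogs")
print(followers(trie, "^^"))  # counts of word-initial letters
```

With dictionaries each hop is a hash lookup; if the children were kept in a plain list instead, each hop would cost more as the number of choices grows.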
A possible alternative storage for building the index is a stochastic matrix (actually a 'plain', if sparse, matrix during the statistics-gathering process, turned stochastic at the end when you normalize each row, or column, depending on how you set it up, to sum to one (100%)). Such a matrix would be roughly 729 x 28, and would allow the indexing, in one single operation, of a two-letter tuple and its associated following letter. (I got 28 by including the "start" and "stop" signals; details...)
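A minimal sketch of that matrix, using plain Python lists. The encoding is an assumption made for illustration: letters map to 0..25, code 26 is the shared begin/end marker for rows, column 26 is the unused "start" column and column 27 is "stop". Words are assumed to contain only lowercase a-z.

```python
N = 27          # 26 letters + one shared begin/end code
EDGE = 26       # row code for the begin/end marker (assumed encoding)
COLS = 28       # 26 letters + "start" + "stop" columns
STOP_COL = 27   # column 26 ("start") is never a follower and stays empty

def code(ch):
    return ord(ch) - ord("a")

def build_counts(words):
    """Count, for every two-letter row, which symbol follows it."""
    counts = [[0] * COLS for _ in range(N * N)]
    for word in words:
        seq = [EDGE, EDGE] + [code(c) for c in word]
        for i in range(len(seq) - 2):
            # One indexing operation: row = first * 27 + second.
            counts[seq[i] * N + seq[i + 1]][seq[i + 2]] += 1
        counts[seq[-2] * N + seq[-1]][STOP_COL] += 1
    return counts

def normalize(counts):
    """Turn each non-empty row into probabilities that sum to one."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total for c in row] if total else [0.0] * len(row))
    return matrix

matrix = normalize(build_counts(["the", "cat", "and", "two", "dogs"]))
print(matrix[code("c") * N + code("a")][code("t")])  # P(t | "ca")
```

Rows that were never observed are left as all zeros here rather than normalized, which is one of several reasonable choices.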
The cost of this more efficient indexing is the use of extra space. Space-wise the trie is very efficient, storing only the letter triplets that actually occur; the matrix, however, wastes some space (you can bet that in the end it will be very sparsely populated, even after indexing much more text than the "dog/cat" sentence).
This size vs. CPU compromise is very common, although some algorithms/structures are sometimes better than others on both counts... Furthermore, the matrix approach wouldn't scale nicely, size-wise, if the problem were changed to base the choice of letters on, say, the preceding three characters.
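To put rough numbers on that trade-off, the sketch below compares the number of cells in a dense matrix with the number of transitions a trie would actually store. The 27 symbols and 28 columns come from the figures above; the per-word padding with `^` and `$` and the function names are assumptions for this example.

```python
def dense_cells(order, symbols=27, columns=28):
    """Cells in a dense matrix with one row per possible prefix."""
    return symbols ** order * columns

def distinct_transitions(words, order):
    """Prefix-plus-next-letter combinations that actually occur."""
    seen = set()
    for word in words:
        padded = "^" * order + word + "$"
        for i in range(len(padded) - order):
            seen.add(padded[i:i + order + 1])
    return len(seen)

words = ["the", "cat", "and", "two", "dogs"]
for order in (2, 3):
    print(order, dense_cells(order), distinct_transitions(words, order))
# dense_cells(2) is 729 * 28 = 20412; dense_cells(3) is 19683 * 28 = 551124
```

Each extra character of context multiplies the dense size by 27, while the trie grows only with the combinations present in the text.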
Nonetheless, maybe look into the matrix as an alternate implementation. It is very much in the spirit of this class to try various structures and see why/where they are better than others (in the context of a specific task).
A small side trip you can take is to create a tag cloud based on the probabilities of the letter pairs (or triplets): both the trie and the matrix contain all the data necessary for that; the matrix, with all its interesting properties, may be better suited for this.
Have fun!

mjv 2009-10-27 06:36:10

Now THAT is an answer. I really appreciate that. I still want to see if anyone else has input.

dacman 2009-10-27 06:44:16

Also, rather than implementing a true Trie, I chose to create Nodes that are of the appropriate size. For example a 5th order Trie on the sentence "The dog ran fast" would result in top level nodes "The d", "he do", "e dog" etc, with their children being the letters following those 5 characters. This eliminates the aforementioned inefficiency.
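A minimal sketch of the structure described in this comment (not the commenter's actual code): one flat node per `order`-character prefix, each holding the list of characters that followed it. The function name `build_prefix_nodes` is hypothetical.

```python
def build_prefix_nodes(text, order):
    """Map every `order`-character prefix to the characters that follow it."""
    nodes = {}
    for i in range(len(text) - order + 1):
        prefix = text[i:i + order]
        children = nodes.setdefault(prefix, [])
        # The final prefix of the text has no follower and keeps no children.
        if i + order < len(text):
            children.append(text[i + order])
    return nodes

nodes = build_prefix_nodes("The dog ran fast", 5)
print(nodes["The d"], nodes["he do"], nodes["e dog"])
```

Keeping repeated followers in the list means a uniform random pick from it is already weighted by frequency.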

dacman 2009-10-27 06:58:35

Where did you get the 729?

dacman 2009-10-27 08:24:03

@dacman Interesting trie. Two concerns: A) Beware of the cost of building the trie; in our attempt to make such a structure more efficient to use, we sometimes significantly increase the complexity of its creation; whether that pays off depends on the problem. B) Since we only focus on a single letter following a given count of letters (2 in the problem, 5 in the 5th order example), the trie gets used more as a hash than as a graph.

mjv 2009-10-27 12:42:38

@dacman 729 is (26 + 1) ^ 2, for 26 letters plus ONE begin/end code. Although the beg and end messages may be distinct, we can bunch them together for the matrix since we can only have beg+x or x+end. Do use a 784 * 28 matrix if you do not want to worry about this fix-up. Small details...

mjv 2009-10-27 12:48:20

I see now. I took the easy way out and implemented a right stochastic matrix. It's working wonderfully. I'll try to minimize it before my comp presentation tonight!

dacman 2009-10-27 13:22:22

Good luck :-) Beware of over-zealous `minimization` of the matrix. BTW, it is well worth having some insight into sparse matrix implementations these days! In making the matrix sparser, you'll start reintroducing cost for the read and/or write of the matrix; this is certainly not warranted for this example: at roughly 44KBytes (for 16-bit int cells), the non-sparse matrix implementation is fine, but as said, exploring sparse matrix constructs (for when it is _really_ needed) is also a worthy thing. Sparse matrices and associated algebra algorithms can be a semester class in and of itself ;-)
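For comparison, a minimal sketch of one possible sparse layout (a dictionary of dictionaries), which stores only the cells that were touched at the price of an extra lookup per access. The class name `SparseCounts` is hypothetical, and this is only one of many sparse representations.

```python
from collections import defaultdict

class SparseCounts:
    """Counts stored as {row: {col: count}}; absent cells read as zero."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def add(self, row, col, n=1):
        self.rows[row][col] = self.rows[row].get(col, 0) + n

    def get(self, row, col):
        # .get() avoids creating empty rows on reads.
        return self.rows.get(row, {}).get(col, 0)

    def stored_cells(self):
        return sum(len(r) for r in self.rows.values())

sparse = SparseCounts()
sparse.add(728, 19)
sparse.add(728, 19)
print(sparse.get(728, 19), sparse.stored_cells())
print(784 * 28 * 2)  # dense size in bytes with 16-bit cells: 43904
```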

mjv 2009-10-27 14:09:37

Answer 2

A:

You are using a bigram approach with characters, but it is usually applied to words, because the output is more meaningful when using a simple generator like yours.

1) From my point of view, you are doing it all right. But maybe you should try to slightly randomize the selection of the next node? E.g., select a random node from the 5 highest. I mean, if you always select the node with the highest probability, your output string will be too uniform.
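A minimal sketch of that suggestion, assuming the candidates are available as a dictionary mapping each next node to its count; the name `pick_from_top` and the `rng` parameter are made up for this example.

```python
import random

def pick_from_top(counts, k=5, rng=random):
    """Pick uniformly at random among the k most frequent candidates."""
    top = sorted(counts, key=counts.get, reverse=True)[:k]
    return rng.choice(top)

counts = {"a": 10, "b": 9, "c": 8, "d": 7, "e": 6, "f": 1}
print(pick_from_top(counts))  # one of a, b, c, d, e
```

Note that this ignores the relative frequencies within the top k, which differs from sampling in proportion to the counts.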

2) I've done exactly the same homework at my university. I think the point is to show students that Markov chains are powerful, but that without extensive study of the application domain the generator's output will be ridiculous.

Trickster 2009-10-27 06:40:44

I don't always select the node with the highest probability. Given "Prefix Node" *Th* with children i, i, e, e, e, a, there is a 2/6 probability of my next node being *hi*, a 3/6 chance of it being *he*, and a 1/6 chance of it being *ha*. When I reach the end of the string (i.e., the node has no children) I select a random "Prefix Node" from the Trie and begin again. This continues until I create a string of a specified length.
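A minimal sketch of the procedure described in this comment (not the commenter's actual code), assuming the prefix nodes are stored as a dictionary from prefix to a list of following characters, with repeats kept so that a uniform pick is frequency-weighted. The name `generate` is hypothetical.

```python
import random

def generate(nodes, length, rng=random):
    """Walk prefix nodes, restarting at a random prefix on dead ends."""
    prefixes = list(nodes)
    prefix = rng.choice(prefixes)
    out = prefix
    while len(out) < length:
        children = nodes.get(prefix)
        if not children:
            # No children: start again from a random prefix node.
            prefix = rng.choice(prefixes)
            out += prefix
            continue
        # Repeated entries make this pick proportional to frequency.
        out += rng.choice(children)
        prefix = out[-len(prefix):]
    return out[:length]

nodes = {"Th": list("iieeea"), "hi": ["s"], "he": [" "], "ha": ["t"]}
print(generate(nodes, 20))
```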

dacman 2009-10-27 06:51:04


Markov Chain Text Generation
