I remember articles in, I think, Personal Computer World that presented a version of ID3 for identifying coins, though it used a heuristic alternative to the log formula. I think it minimised sums of squares rather than maximising information gain (i.e. minimising the weighted entropy of the split) - but it was a long time ago. There was another article in (I think) Byte that used the log formula for information (not entropy) for similar things. Things like that gave me a handle that made the theory easier to cope with.
EDIT - by "not entropy" I mean I think it used weighted averages of information values, but didn't use the name "entropy".
I think constructing simple decision trees from decision tables is a very good way to understand the relationship between probability and information. It makes the link from probability to information concrete, and the weighted-average calculation used to choose each split gives ready-made examples of how balanced probabilities maximise entropy. A very good day-one kind of lesson.
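To make that concrete, here is a rough sketch of my own (not from any of the articles above - the toy table, attribute names and helper functions are all invented) of entropy as the weighted average of per-outcome information, and of an ID3-style information gain computed over a tiny decision table:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Weighted average of -log2(p) over the outcomes in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(rows, attr, target):
    """Entropy of `target` minus the weighted entropy left after splitting on `attr`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return base - remainder

# Toy decision table (invented for illustration).
table = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain",  "windy": "no",  "play": "yes"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
]

# Balanced classes give the maximum entropy of 1 bit for a binary target...
print(entropy([r["play"] for r in table]))          # 1.0
# ...and ID3 would pick the attribute with the largest gain as the root test.
print(information_gain(table, "windy", "play"))     # 1.0 (perfect split)
print(information_gain(table, "outlook", "play"))   # 0.0 (tells us nothing)
```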
And what's also nice is that you can then replace that decision tree with a Huffman decoding tree (which is, after all, a "which token am I decoding?" decision tree) and make the link to coding.
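Again just a sketch of my own (the symbol frequencies are made up), showing that walking a Huffman tree bit by bit really is a sequence of "which token am I decoding?" decisions:

```python
import heapq
from itertools import count

def build_huffman_tree(freqs):
    """Return the root of a Huffman tree as nested (left, right) tuples; leaves are symbols."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]

def codes(node, prefix=""):
    """Read the codebook back off the tree (left branch = '0', right branch = '1')."""
    if not isinstance(node, tuple):
        return {node: prefix or "0"}
    left, right = node
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table

def decode(bits, root):
    """Walk the decision tree: each bit answers one yes/no question, a leaf decides the token."""
    node, out = root, []
    for bit in bits:
        node = node[bit == "1"]          # the one-bit decision at this node
        if not isinstance(node, tuple):  # reached a leaf: a token has been decided
            out.append(node)
            node = root
    return "".join(out)

freqs = {"a": 5, "b": 2, "c": 1, "d": 1}   # invented frequencies
root = build_huffman_tree(freqs)
book = codes(root)
message = "abacad"
encoded = "".join(book[ch] for ch in message)
print(decode(encoded, root) == message)     # True
```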
BTW - take a look at this link...
MacKay has a free downloadable textbook, Information Theory, Inference, and Learning Algorithms (also available in print), and while I haven't read it all, the parts I have read seemed very good. The explanation of "explaining away" in Bayesian inference, starting on page 293, in particular sticks in my mind.
CiteSeerX is a very useful resource for information theory papers (among other things). Two interesting papers are...
Though CN2 probably isn't day-one material.