views:

449

answers:

4

I have a list of sub-lists of letters, where the number of letters in each sub-list can vary. The list and sub-lists are ordered. This structure can be used to produce words by choosing a number X, taking a letter from position X in every sub-list and concatenating them in order. If the number X is larger than the length of the sub-list, it would wrap around.

Given a set of words, I need to find a way to pack them into the smallest possible structure of this kind (i.e. with the shortest sub-lists). There would have to be as many sub-lists as the number of letters in the longest word, of course, and shorter words would be padded by blanks/spaces.

I am not a CS graduate so I apologize if the description of the problem is not entirely clear. To give a simple example: Suppose I have the words [ 'a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i '] I need to pack, I could use the following structure:

[  
    [ 'i', 'o', 'a' ],  
    [ 's', 'n', 'f', ' ' ]  
]

This would enable me to produce the following words:

0: is  
1: on  
2: af*  
3: i  
4: os*  
5: an  
6: if  
7: o *  
8: as*  
9: in  
10: of  
11: a

If you take position 10, for example, the word 'of' is generated by concatenating the letter at index 10 % 3 (= 1) from the first sub-list, with the letter at index 10 % 4 (= 2) from the second sub-list.
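The lookup described above can be sketched in a few lines of Python. This is only an illustration of the wrap-around rule; the function name `word_at` is made up for this example.

```python
def word_at(lists, x):
    """Return the word at position x, wrapping around each sub-list."""
    return ''.join(sub[x % len(sub)] for sub in lists)

lists = [
    ['i', 'o', 'a'],
    ['s', 'n', 'f', ' '],
]

# Print one full cycle of positions (12 = 3 * 4 for this structure).
for x in range(12):
    print(x, word_at(lists, x))
```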

My best attempt so far involves using a matrix of Hamming distances to place the most-"connected" words first, and then their closest neighbors, with the goal of minimizing the change with every insertion. This was an entirely intuitive attempt and I feel like there has to be a better/smarter way to solve this.
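One possible reading of that heuristic, sketched in Python. This is not my actual code: "most connected" is assumed here to mean the smallest total Hamming distance to all other words, and the names `hamming` and `insertion_order` are made up for the example. It only produces an insertion order; placing the letters is a separate step.

```python
def hamming(a, b):
    """Number of positions at which two equal-length (padded) words differ."""
    return sum(x != y for x, y in zip(a, b))

def insertion_order(words):
    """Greedy order: most-connected word first, then nearest neighbours."""
    width = max(len(w) for w in words)
    padded = [w.ljust(width) for w in words]
    n = len(padded)
    dist = [[hamming(a, b) for b in padded] for a in padded]
    # Assumption: "most connected" = smallest total distance to all others.
    start = min(range(n), key=lambda i: sum(dist[i]))
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        # Next word: the one closest to any word already placed.
        nxt = min(remaining, key=lambda j: (min(dist[i][j] for i in order), j))
        order.append(nxt)
        remaining.remove(nxt)
    return [padded[i] for i in order]

words = ['a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i ']
print(insertion_order(words))
```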

Clarification

This is a practical problem I am trying to solve and the constraints are (roughly) as follows:
1. The number of characters per sub-list should be in the area of 100 or less.
2. The keyspace should be as small as possible (i.e. the number of spurious words should be minimal). Roughly, a keyspace in the millions of options is borderline.

I don't know that a good solution is even possible for this. With the algorithm I have right now, for example, I can insert about 200 words (just random English words) in a keyspace of 1.5 million options. I'd like to do better than that.
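To make "keyspace" and "spurious words" concrete, here is a small sketch that measures both for a given structure. It assumes the keyspace is one full cycle of positions, i.e. the least common multiple of the sub-list lengths, after which the words repeat. The function names are made up for this example.

```python
from math import gcd
from functools import reduce

def keyspace(lists):
    """Number of distinct positions: the LCM of the sub-list lengths."""
    return reduce(lambda a, b: a * b // gcd(a, b), (len(s) for s in lists), 1)

def spurious(lists, wanted):
    """Count positions in one full cycle that do not spell a wanted word."""
    wanted = set(wanted)
    total = keyspace(lists)
    hits = sum(
        ''.join(s[x % len(s)] for s in lists) in wanted
        for x in range(total)
    )
    return total - hits

lists = [['i', 'o', 'a'], ['s', 'n', 'f', ' ']]
wanted = ['a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i ']
print(keyspace(lists), spurious(lists, wanted))  # 12 4
```

For the small example above this gives a keyspace of 12 with 4 spurious positions, matching the four starred entries.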

+3  A: 

Well, you said you're interested in sub-optimal solutions, so I'll give you one. It depends on the alphabet size. For example, for 26 letters the array size will be a little over 100 (regardless of the number of words to encode).

It's well-known that if you have two different prime numbers a and b and non-negative integers k and l (k < a, l < b), you can find a number n such that n % a == k and n % b == l.
For example, with (a = 7, b = 13, k = 6, l = 3) you can take n = (6 * 13 * 6 + 3 * 7 * 2) % (7 * 13) = 55, since 13 * 6 % 7 == 1 and 7 * 2 % 13 == 1. Then n % 7 == 6 and n % 13 == 3.

The same holds for any number of distinct primes.
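The two-prime case can be checked with a short sketch. The helper name `solve_pair` is made up for this example, and it relies on `pow(x, -1, m)` for the modular inverse, which needs Python 3.8 or later.

```python
def solve_pair(a, b, k, l):
    """Find n in [0, a*b) with n % a == k and n % b == l, for coprime a and b."""
    inv_b = pow(b, -1, a)   # b * inv_b == 1 (mod a)
    inv_a = pow(a, -1, b)   # a * inv_a == 1 (mod b)
    # The first term is k mod a and 0 mod b; the second is 0 mod a and l mod b.
    return (k * b * inv_b + l * a * inv_a) % (a * b)

n = solve_pair(7, 13, 6, 3)
print(n, n % 7, n % 13)  # 55 6 3
```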

You can initialize arrays like this.

['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 29
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 31
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 37
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 41
...

Now, suppose your word is 'geek'. For it you need a number X such that X % 29 == 6, X % 31 == 4, X % 37 == 4, X % 41 == 10. And you can always find such an X, as was shown above.
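As a rough sketch of how such an X could be computed for 'geek', assuming the four arrays above (the alphabet padded with 'z' up to sizes 29, 31, 37 and 41). The helper names are made up for this example, and `math.prod` and `pow(x, -1, m)` need Python 3.8 or later.

```python
import string
from math import prod

def make_array(size):
    """The alphabet, padded with 'z' up to the given size."""
    letters = string.ascii_lowercase
    return list(letters) + ['z'] * (size - len(letters))

def position_for(word, sizes):
    """Find X with X % sizes[i] == alphabet index of word[i] for every i."""
    N = prod(sizes)
    x = 0
    for c, m in zip(word, sizes):
        rest = N // m                      # product of all the other sizes
        index = string.ascii_lowercase.index(c)
        x += index * rest * pow(rest, -1, m)
    return x % N

sizes = [29, 31, 37, 41]
arrays = [make_array(s) for s in sizes]
X = position_for('geek', sizes)
decoded = ''.join(arr[X % len(arr)] for arr in arrays)
print(X, decoded)
```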

So, if you have an alphabet of 26 letters, you can create a matrix of width 149 (see the list of primes) and encode any word with it.

Nikita Rybak
Great answer, but it brings me to a constraint that I didn't specify, since I don't have a well-defined guideline for a "good enough" solution: given a set of a couple of hundred words, the average distance between any two words needs to be minimal. How minimal? Ideally, the number of different positions divided by the size of the first array would need to be on the scale of hundreds or thousands. With this solution, the number of possible positions escalates very quickly, becoming impractical for words of six letters or more.
szx
@szx _Given a set of a couple of hundred words, the average distance between any two words needs to be minimal._ Can you clarify? I thought we don't choose the set to encode: the set is given.
Nikita Rybak
See my post + clarification above: the set of words is given (and will contain approximately a couple of hundred words). The letters that produce these words can be arranged in different configurations (i.e. different indexes, number of letters per sub-list) of varying "efficiency". By "the distance between two words" I mean the number of spurious words that separate words from the given set, which should be minimal, making the signal-to-noise ratio maximal.
szx
+2  A: 

We can improve upon Nikita Rybak's answer by not actually making the lists a prime length but just associating a prime with each list. This allows us to not make the sub-lists any longer than necessary, hence keeping the primes smaller, which implies a smaller keyspace and more efficient packing. Using this method and the code below, I packed a list of 58,110 (lowercase) words into 464 characters. It's interesting to note that only the letters 'alex' appear as the 21st letter in a word. The keyspace was upwards of 33 digits, however. It is also not strictly necessary to use primes; the associated numbers just need to be pairwise coprime, so this could probably be reduced.

import itertools
import operator
from functools import reduce  # a builtin on Python 2; must be imported on Python 3

# lifted from Alex Martelli's post at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python
def erat2( ):
    D = {  }
    yield 2
    for q in itertools.islice(itertools.count(3), 0, None, 2):
        p = D.pop(q, None)
        if p is None:
            D[q*q] = q
            yield q
        else:
            x = p + q
            while x in D or not (x&1):
                x += p
            D[x] = p


# taken from http://en.literateprograms.org/Extended_Euclidean_algorithm_(Python)
def eea(u, v):
    u1 = 1; u2 = 0; u3 = u
    v1 = 0; v2 = 1; v3 = v
    while v3 != 0:
        q = u3 // v3  # floor division; a plain / would yield a float on Python 3
        t1 = u1 - q * v1
        t2 = u2 - q * v2
        t3 = u3 - q * v3
        u1 = v1; u2 = v2; u3 = v3
        v1 = t1; v2 = t2; v3 = t3
    return u1, u2, u3

def assign_moduli(lists):
    # Give each list its own prime modulus, no smaller than the list's length.
    # Distinct primes are pairwise coprime, which is what the CRT needs.
    used_primes = set()
    moduli = [0] * len(lists)
    for i, lst in enumerate(lists):
        for p in erat2():
            if len(lst) <= p and p not in used_primes:
                used_primes.add(p)
                moduli[i] = p
                break
    return moduli



class WordEncoder(object):
    def __init__(self):
        self.lists = [[]] # the list of primedlists
        self.words = {} # keys are words, values are number that retrieves word
        self.moduli = [] # coprime moduli that are used to assign unique keys to words

    def add(self, new_words):
        for word in new_words:
            word = word.rstrip() # trailing blanks are padding, not part of the word
            word_length, lists_length = len(word), len(self.lists)
            # make sure we have enough lists
            if word_length > lists_length:
                self.lists.extend([' '] for i in range(word_length - lists_length))
            # make sure that each letter is in the appropriate list
            for i, c in enumerate(word):
                if c in self.lists[i]: continue
                self.lists[i].append(c)
            self.words[word] = None
        # Recalculate all keys: a new letter can change the moduli, and a new
        # word needs a key even if it only reuses letters already stored.
        self._calculate_keys()
        return self.words

    def _calculate_keys(self):
        # we're going to be solving a lot of systems of congruences
        # these are all of the form x % self.moduli[i] == self.lists[i].index(word[i]) with word padded out to
        # len(self.lists). We will be using the Chinese Remainder Theorem to do this. We can do a lot of the calculations
        # once before we enter the loop because the numbers that we need are related to self.moduli[i] and not
        # the indexes of the necessary letters
        self.moduli = assign_moduli(self.lists)
        N  = reduce(operator.mul, self.moduli)
        e_lst = []
        for n in self.moduli:
            r, s, dummy = eea(n, N // n)  # r*n + s*(N//n) == 1
            e_lst.append(s * (N // n))    # == 1 mod n, == 0 mod every other modulus
        lists_len = len(self.lists)
        # now we begin the actual recalculation
        for word in self.words:
            word += ' ' * (lists_len - len(word))
            coords = [self.lists[i].index(c) for i, c in enumerate(word)]
            key = sum(a*e for a, e in zip(coords, e_lst)) % N  # this solves the system of congruences
            self.words[word.rstrip()] = key

class WordDecoder(object):
    def __init__(self, lists):
        self.lists = lists
        self.moduli = assign_moduli(lists)

    def decode(self, key):
        coords = [key % modulus for modulus in self.moduli]
        return ''.join(pl[i] for pl, i in zip(self.lists, coords))    


with open('/home/aaron/code/scratch/corncob_lowercase.txt') as f:
    wordlist = f.read().split()

encoder = WordEncoder()
encoder.add(wordlist)

decoder = WordDecoder(encoder.lists)

for word, key in encoder.words.items():
    decoded = decoder.decode(key).rstrip()
    if word != decoded:
        print("{0} {1} {2}".format(word, decoded, key))
        print("max key size: {0}. moduli: {1}".format(reduce(operator.mul, encoder.moduli), encoder.moduli))
        break
else:
    print("it works")
    print("max key size: {0}".format(reduce(operator.mul, encoder.moduli)))
    print("moduli: {0}".format(encoder.moduli))
    for i, l in enumerate(encoder.lists):
        print("list {0} length: {1}, {2} - \"{3}\"".format(i, len(l), encoder.moduli[i] - len(l), ''.join(sorted(l))))
    print("{0} words stored in {1} characters".format(len(encoder.words), sum(len(l) for l in encoder.lists)))
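The remark that the moduli only need to be coprime, not prime, could be tried with something like the sketch below. It is a simple greedy choice and not part of the code above; the name `coprime_moduli` is made up, and greedy selection is not guaranteed to give the smallest possible keyspace.

```python
from math import gcd

def coprime_moduli(lengths):
    """Greedily pick pairwise-coprime moduli, each >= the matching list length."""
    chosen = []
    for length in lengths:
        m = max(length, 1)
        # Bump the candidate until it shares no factor with any earlier modulus.
        while any(gcd(m, c) != 1 for c in chosen):
            m += 1
        chosen.append(m)
    return chosen

print(coprime_moduli([27, 27, 26, 25]))  # [27, 28, 29, 25]
```

For these lengths the greedy choice gives [27, 28, 29, 25], whereas taking the next unused prime for each list, as `assign_moduli` does, gives [29, 31, 37, 41].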
aaronasterling
_but just associated a prime with the list_ But in szx's algorithm, the number is divided by the list length, not by another number associated with the list. Do I get you right?
Nikita Rybak
I fixed my post. I should have said 'just associating'. @szx didn't really provide an algorithm. I'm not sure what number you're referring to.
aaronasterling
I quote. _"If the number X is larger than the length of the sub-list, it would wrap around"_
Nikita Rybak
Note that in practice we don't need to store the lists at all: we can easily determine the character from the number and the 'imaginary' index length without using additional memory. So that makes a list of width 0 :)
Nikita Rybak
@Nikita Rybak, I would like to see that. As I understand it, that could only work if we assumed that every list was of the same size with the same contents, which would just make the keyspace bigger.
aaronasterling
@aaronasterling: unfortunately, a keyspace this big won't work. It'll be easier to understand why if I just describe the project: it's a physical art installation that consists of a gear train. Each gear corresponds to a sub-list, with the letters imprinted on the teeth. The goal is that, when the gears are rotated to certain positions, specific teeth (say, the ones pointing up on each gear) make a word. By moving from position to position you can recreate a text (which hasn't been decided on yet, but will have several hundred unique words). There's a limit on the RPM, hence the keyspace issue.
szx
Also, I tried the coprimes suggestion, but it didn't help much.
szx
@szx. Unfortunately my books on algebra are all on a continent right now and I'm on an island. Talk to any mathematician that does even basic ring theory and they can at least point you in the right direction.
aaronasterling
A: 

I don't think I understand your problem completely, but I stumbled across prezip some time ago. Prezip is a way of compressing a sorted set of words by taking advantage of the fact that many words share a common prefix.

Since you're not referring to any sorting constraint, I would suggest creating a sorted set of the words that you want, then doing something similar to what prezip does. The result is a compressed and sorted set of words, which you can refer to by index.
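A minimal sketch of the shared-prefix idea (often called front coding), in Python. This only illustrates the general technique and is not necessarily prezip's actual format; the function names are made up for this example.

```python
def front_encode(sorted_words):
    """Store each word as (shared prefix length, remaining suffix)."""
    out, prev = [], ''
    for w in sorted_words:
        k = 0
        while k < min(len(prev), len(w)) and prev[k] == w[k]:
            k += 1
        out.append((k, w[k:]))
        prev = w
    return out

def front_decode(encoded):
    """Rebuild the sorted word list from (prefix length, suffix) pairs."""
    words, prev = [], ''
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        words.append(prev)
    return words

words = sorted(['a', 'an', 'i', 'if', 'in', 'is', 'of', 'on'])
packed = front_encode(words)
print(packed)
print(front_decode(packed)[3])  # word at index 3
```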

Jan
A: 

I think you're looking for this http://en.wikipedia.org/wiki/Trie or this http://en.wikipedia.org/wiki/Radix_tree

Hope it helps.

fortran
Jeez, is there any question tagged "algorithm" that doesn't get a "trie" response? Looks like "trie" is the new "jquery": it can solve anything :)
Nikita Rybak
@Nikita He's trying to efficiently store words, and that's one of the things a Trie is for: http://en.wikipedia.org/wiki/Trie#Dictionary_representation
fortran
@fortran MySQL is also used to efficiently store words, I wonder why nobody offered it :) And LZW too!
Nikita Rybak
Thanks, I actually did look at tries but I haven't quite figured out a way to apply them to my problem. It's not just about storing words efficiently, there are unique constraints imposed by this problem that I'm not quite sure how to solve.
szx