ansaurus

Question

Answer 1

+3 A:

Well, you said you're interested in sub-optimal solutions, so I'll give you one. It depends on the alphabet size. For example, for 26 letters the array size will be a little over 100 (regardless of the number of words to encode).

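It's well known (this is the Chinese remainder theorem) that if you have two different prime numbers a and b and non-negative integers k and l (k < a, l < b), you can find a number n such that n % a == k and n % b == l.
For example, with (a = 7, b = 13, k = 6, l = 3) you can take n = 55: 55 % 7 == 6 and 55 % 13 == 3. One way to construct it is n = k * b * (inverse of b mod a) + l * a * (inverse of a mod b), reduced mod a * b; here that is 6 * 13 * 6 + 3 * 7 * 2 = 510, and 510 % 91 == 55.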

The same holds for any number of distinct primes.
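A minimal sketch of that claim in Python 3.8+ (the thread's own code is Python 2). The function name `crt` is my own, and the code assumes the moduli are pairwise coprime, which distinct primes always are.

```python
from math import prod

def crt(residues, moduli):
    """Return the smallest non-negative n with n % m == r for every pair.

    Assumes the moduli are pairwise coprime (distinct primes qualify).
    """
    total = prod(moduli)
    n = 0
    for r, m in zip(residues, moduli):
        partial = total // m            # product of all the other moduli
        inverse = pow(partial, -1, m)   # modular inverse (Python 3.8+)
        n += r * partial * inverse
    return n % total

print(crt([6, 3], [7, 13]))  # 55
```

Each term contributes the wanted remainder for its own modulus and zero for every other one, which is why the sum satisfies all congruences at once.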

You can initialize arrays like this.

['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 29
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 31
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 37
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 41
...

Now, suppose your word is 'geek'. For it you need number X, such that X % 29 == 6, X % 31 == 4, X % 37 == 4, X % 41 == 10. And you can always find such X, as was shown above.
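To make the 'geek' example concrete, here is a rough sketch in Python 3.8+. The names `encode`, `decode`, `MODULI` and `ARRAYS` are mine, and the sketch only handles words of up to four letters because only four arrays are listed above.

```python
import string
from math import prod

MODULI = [29, 31, 37, 41]                      # array sizes from the example
ALPHABET = string.ascii_lowercase
# each array holds 'a'..'z' and is padded with 'z' up to its prime length
ARRAYS = [list(ALPHABET) + ['z'] * (m - 26) for m in MODULI]

def encode(word):
    """Return X with X % MODULI[i] == ALPHABET.index(word[i]) (len(word) <= 4)."""
    moduli = MODULI[:len(word)]
    total = prod(moduli)
    x = 0
    for ch, m in zip(word, moduli):
        partial = total // m                   # product of the other moduli
        x += ALPHABET.index(ch) * partial * pow(partial, -1, m)
    return x % total

def decode(x, length):
    """Read one letter from each of the first `length` arrays."""
    return ''.join(arr[x % m] for arr, m in zip(ARRAYS[:length], MODULI))

x = encode('geek')
print(x, [x % m for m in MODULI], decode(x, 4))
```

The printed remainders should be 6, 4, 4 and 10, matching the indexes of g, e, e and k.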

So, if you have an alphabet of 26 letters, you can create a matrix of width 149 (see the list of primes) and encode any word with it.

Nikita Rybak 2010-08-17 01:29:30

Great answer, but it brings me to a constraint that I didn't specify, since I don't have a well-defined guideline for a "good enough" solution: given a set of a couple of hundred words, the average distance between any two words needs to be minimal. How minimal? Ideally, the number of different positions divided by the size of the first array would need to be on the scale of hundreds or thousands. With this solution, the number of possible positions escalates very quickly, becoming impractical for words of six or more letters.

szx 2010-08-17 02:42:49

@szx _Given a set of a couple of hundreds of words, the average distance between any two words needs to be minimal._ Can you clarify? I thought we don't choose the set to encode: the set is given.

Nikita Rybak 2010-08-20 21:17:29

See my post + clarification above: the set of words is given (and will contain approximately a couple of hundred words). The letters that produce these words can be arranged in different configurations (i.e. different indexes, number of letters per sub-list) of varying "efficiency". By "the distance between two words" I mean the number of spurious words that separate words from the given set, which should be minimal, making the signal-to-noise ratio maximal.
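One possible reading of that metric, sketched in Python 3. This is an assumption about what is meant, not a definition from the question: every position wraps around its sub-list, all positions advance together, and the gap is the count of non-target positions between neighbouring target words. The name `spurious_gaps` is hypothetical.

```python
from functools import reduce
from math import gcd

def spurious_gaps(lists, targets):
    """Count non-target positions between consecutive target words.

    Position `key` shows lists[i][key % len(lists[i])] for every sub-list,
    so the whole arrangement repeats after lcm(len(l) for l in lists) steps.
    """
    period = reduce(lambda a, b: a * b // gcd(a, b), map(len, lists))
    hits = [key for key in range(period)
            if ''.join(l[key % len(l)] for l in lists) in targets]
    # pair each hit with the next one, wrapping around the end of the period
    return [(b - a - 1) % period for a, b in zip(hits, hits[1:] + hits[:1])]

lists = [['c', 'b'], ['a', 'i', 'o'], ['t', 'g']]
print(spurious_gaps(lists, {'cat', 'bag'}))  # [2, 2]
```

Here the six positions read cat, big, cot, bag, cit, bog, so two spurious words sit between each pair of targets. Brute force like this is only practical while the period stays small.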

szx 2010-08-21 16:57:41

Answer 2

+2 A:

We can improve upon Nikita Rybak's answer by not actually making the lists a prime length but just associating a prime with each list. This allows us to not make the sub-lists any longer than necessary, hence keeping the primes smaller, which implies a smaller keyspace and more efficient packing. Using this method and the code below, I packed a list of 58,110 (lowercase) words into 464 characters. It's interesting to note that only the letters 'alex' appear as the 21st letter in a word. The keyspace was upwards of 33 digits, however. This could probably be reduced: it is not strictly necessary to use primes, the associated numbers just need to be pairwise coprime.

import itertools
import operator
import math

# lifted from Alex Martelli's post at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python
def erat2( ):
    D = {  }
    yield 2
    for q in itertools.islice(itertools.count(3), 0, None, 2):
        p = D.pop(q, None)
        if p is None:
            D[q*q] = q
            yield q
        else:
            x = p + q
            while x in D or not (x&1):
                x += p
            D[x] = p


# taken from http://en.literateprograms.org/Extended_Euclidean_algorithm_(Python)
def eea(u, v):
    u1 = 1; u2 = 0; u3 = u
    v1 = 0; v2 = 1; v3 = v
    while v3 != 0:
        q = u3 // v3  # explicit integer division
        t1 = u1 - q * v1
        t2 = u2 - q * v2
        t3 = u3 - q * v3
        u1 = v1; u2 = v2; u3 = v3
        v1 = t1; v2 = t2; v3 = t3
    return u1, u2, u3

def assign_moduli(lists):
    used_primes = set([])
    unused_primes = set([])
    moduli = [0]*len(lists)
    list_lens = [len(lst) for lst in lists]
    for i, length in enumerate(list_lens):
        for p in erat2():
            if length <= p and p not in used_primes:
                used_primes.add(p)
                moduli[i] = p
                break
            elif p not in used_primes:
                unused_primes.add(p)
    return moduli



class WordEncoder(object):
    def __init__(self):
        self.lists = [[]] # the list of primedlists
        self.words = {} # keys are words, values are number that retrieves word
        self.moduli = [] # coprime moduli that are used to assign unique keys to words

    def add(self, new_words):
        for word in new_words:
            word = word.rstrip() # a trailing blank space could hide the need for a key rebuild
            word_length, lists_length = len(word), len(self.lists)
            # make sure we have enough lists
            if word_length > lists_length:
                self.lists.extend([' '] for i in xrange(word_length - lists_length))
            # make sure that each letter is in the appropriate list
            for i, c in enumerate(word):
                if c in self.lists[i]: continue
                self.lists[i].append(c)
            self.words[word] = None
        # always recalculate the keys: a word made only of letters that are
        # already in the lists still needs a key of its own
        self._calculate_keys()
        return self.words

    def _calculate_keys(self):
        # we're going to be solving a lot of systems of congruences
        # these are all of the form x % self.moduli[i] == self.lists[i].index(word[i]) with word padded out to
        # len(self.lists). We will be using the Chinese Remainder Theorem to do this. We can do a lot of the calculations
        # once before we enter the loop because the numbers that we need are related to self.moduli[i] and not
        # the indexes of the necessary letters
        self.moduli = assign_moduli(self.lists)
        N  = reduce(operator.mul, self.moduli)
        e_lst = []
        for n in self.moduli:
             r, s, dummy = eea(n, N // n)  # s is the inverse of N // n modulo n
             e_lst.append(s * N // n)
        lists_len = len(self.lists)
        #now we begin the actual recalculation 
        for word in self.words:
             word += ' ' * (lists_len - len(word))
             coords = [self.lists[i].index(c) for i,c in enumerate(word)]
             key = sum(a*e for a,e in zip(coords, e_lst)) % N  # this solves the system of congruences
             self.words[word.rstrip()] = key

class WordDecoder(object):
    def __init__(self, lists):
       self.lists = lists
       self.moduli = assign_moduli(lists)

    def decode(self, key):
        coords = [key % modulus for modulus in self.moduli]
        return ''.join(pl[i] for pl, i in zip(self.lists, coords))    


with open('/home/aaron/code/scratch/corncob_lowercase.txt') as f:
    wordlist = f.read().split()

encoder = WordEncoder()
encoder.add(wordlist)

decoder = WordDecoder(encoder.lists)

for word, key in encoder.words.iteritems():
    decoded = decoder.decode(key).rstrip()
    if word != decoded:
        print word, decoded, key
        print "max key size: {0}. moduli: {1}".format(reduce(operator.mul, encoder.moduli), encoder.moduli)
        break
else:
    print "it works"
    print "max key size: {0}".format(reduce(operator.mul, encoder.moduli))
    print "moduli: {0}".format(encoder.moduli)
    for i, l in enumerate(encoder.lists):
        print "list {0} length: {1}, {2} - \"{3}\"".format(i, len(l), encoder.moduli[i] - len(l), ''.join(sorted(l)))
    print "{0} words stored in {1} characters".format(len(encoder.words), sum(len(l) for l in encoder.lists))
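The remark that the moduli only need to be coprime, not prime, is not implemented in the code above. A rough Python 3 sketch of one way to do it follows; the function name `assign_coprime_moduli` and the greedy strategy are my own assumptions, and greedy selection is not guaranteed to give the smallest possible keyspace.

```python
from math import gcd

def assign_coprime_moduli(lengths):
    """Give each list the smallest modulus >= its length that is coprime
    to every modulus chosen so far (greedy, in list order)."""
    chosen = []
    for length in lengths:
        m = max(length, 1)              # a modulus of 0 would be unusable
        while any(gcd(m, c) != 1 for c in chosen):
            m += 1
        chosen.append(m)
    return chosen

print(assign_coprime_moduli([26, 26, 26, 26]))  # [26, 27, 29, 31]
```

For four full alphabets this gives a keyspace of 26 * 27 * 29 * 31 instead of 29 * 31 * 37 * 41, so the saving is real but modest.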

aaronasterling 2010-08-17 06:46:20

_but just associated a prime with the list_ But in szx's algorithm, the number is divided by the list length, not by another number associated with the list. Do I get you right?

Nikita Rybak 2010-08-20 21:22:04

I fixed my post. I should have said 'just associating'. @szx didn't really provide an algorithm. I'm not sure what number you're referring to.

aaronasterling 2010-08-20 22:01:05

I quote. _"If the number X is larger than the length of the sub-list, it would wrap around"_

Nikita Rybak 2010-08-21 14:24:44

Note that in practice we don't need to store the lists at all: we can easily determine the character from the number and the length of the 'imaginary' list without using additional memory. So, that makes the list of width 0 :)
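A minimal Python 3 sketch of that idea, assuming every imaginary list is the full alphabet 'a'..'z' padded with 'z' up to its modulus, as in Answer 1. The helper names `letter_at` and `decode` are hypothetical.

```python
def letter_at(key, modulus, alphabet_size=26):
    """Letter shown by a virtual array of length `modulus` that holds
    'a'..'z' followed by padding 'z' entries."""
    index = min(key % modulus, alphabet_size - 1)
    return chr(ord('a') + index)

def decode(key, moduli):
    """Decode a key without storing any letter lists."""
    return ''.join(letter_at(key, m) for m in moduli)

print(decode(1, [29, 31, 37, 41]))  # bbbb
```

This only works because every list has the same known contents, which is the trade-off raised in the reply that follows.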

Nikita Rybak 2010-08-21 14:26:34

@Nikita Rybak, I would like to see that. As I understand it, that could only work if we assumed that every list was of the same size with the same contents, which would just make the keyspace bigger.

aaronasterling 2010-08-21 20:23:45

@aaronasterling: unfortunately, a keyspace this big won't work. It'll be easier to understand why if I just describe the project: It's a physical art installation that consists of a gear train. Each gear corresponds to a sub-list, with the letters imprinted on the teeth. The goal is that, when the gears are rotated to certain positions, specific teeth (say, the ones pointing up on each gear) make a word. By moving from position to position you can recreate a text (which hasn't been decided on yet, but will have several hundred unique words). There's a limit on the RPM, hence the keyspace issue.
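To put a number on the keyspace issue, here is a rough Python 3 sketch of how many single-tooth steps a text would cost. It assumes each word already has a key, that the train repeats after `period` steps, and that the gears may optionally turn in both directions; none of these details are stated above, and `rotation_steps` is a hypothetical name.

```python
def rotation_steps(keys, period, bidirectional=True):
    """Total single-tooth steps needed to visit the keys in order,
    starting with the train already at keys[0]."""
    steps = 0
    for a, b in zip(keys, keys[1:]):
        forward = (b - a) % period
        if bidirectional:
            steps += min(forward, period - forward)
        else:
            steps += forward
    return steps

print(rotation_steps([0, 3, 1], 6))          # 5
print(rotation_steps([0, 3, 1], 6, False))   # 7
```

With a period of 33 digits even the shorter direction is far out of reach at any realistic RPM, which matches the objection above.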

szx 2010-08-25 14:23:44

Also, I tried the coprimes suggestion, but it didn't help much.

szx 2010-08-25 14:26:30

@szx. Unfortunately my books on algebra are all on a continent right now and I'm on an island. Talk to any mathematician that does even basic ring theory and they can at least point you in the right direction.

aaronasterling 2010-08-25 18:48:31

Answer 3

A:

I don't think I understand your problem completely, but I stumbled across prezip some time ago. Prezip is a way of compressing a sorted set of words by taking advantage of the fact that many words share a common prefix.

Since you're not referring to any sorting constraint, I would suggest creating a sorted set of the words that you want, then doing something similar to what prezip does. The result is a compressed and sorted set of words, which you can refer to by index.
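A small Python 3 sketch of the shared-prefix idea (often called front coding). This is not the actual prezip file format, just an illustration of the principle; `front_encode` and `front_decode` are names I made up.

```python
def front_encode(words):
    """Encode sorted words as (shared_prefix_length, remaining_suffix) pairs."""
    encoded, previous = [], ''
    for word in sorted(set(words)):
        shared = 0
        for a, b in zip(previous, word):
            if a != b:
                break
            shared += 1
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded

def front_decode(encoded):
    """Rebuild the sorted word list from the (length, suffix) pairs."""
    words, previous = [], ''
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        words.append(previous)
    return words

packed = front_encode(['geek', 'gear', 'geese'])
print(packed)                # [(0, 'gear'), (2, 'ek'), (3, 'se')]
print(front_decode(packed))  # ['gear', 'geek', 'geese']
```

Looking a word up by index means decoding from the start (or from a checkpoint), since each entry depends on the one before it.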

Jan 2010-08-23 16:42:22

Answer 4

A:

I think you're looking for this http://en.wikipedia.org/wiki/Trie or this http://en.wikipedia.org/wiki/Radix_tree

Hope it helps.
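For reference, a bare-bones trie in Python 3 might look like the following. It is only a sketch of the data structure from the first link, with a class and method names of my choosing, and it does not address the positional constraints of the question.

```python
class Trie:
    """Minimal prefix tree: one node per letter, shared prefixes stored once."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie()
for w in ['geek', 'gear']:
    trie.insert(w)
print('geek' in trie, 'gee' in trie)  # True False
```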

fortran 2010-08-23 16:52:41

Jeez, is there any question tagged "algorithm" that doesn't get a "trie" response? Looks like "trie" is the new "jquery": it can solve anything :)

Nikita Rybak 2010-08-25 20:46:23

@Nikita He's trying to efficiently store words, and that's one of the things a Trie is for: http://en.wikipedia.org/wiki/Trie#Dictionary_representation

fortran 2010-08-26 07:30:00

@fortran MySQL is also used to efficiently store words, I wonder why nobody offered it :) And LZW too!

Nikita Rybak 2010-08-26 16:10:13

Thanks, I actually did look at tries but I haven't quite figured out a way to apply them to my problem. It's not just about storing words efficiently, there are unique constraints imposed by this problem that I'm not quite sure how to solve.

szx 2010-08-29 12:14:52


Need help with a word-packing algorithm
