views: 113
answers: 5

Hi guys.

I have a problem, and I'm not too sure how to solve it without going down the route of inefficiency. Say I have a list of words:

  • Apple
  • Ape
  • Arc
  • Abraid
  • Bridge
  • Braide
  • Bray
  • Boolean

What I want to do is process this list and, for each prefix up to a certain depth, collect the words that start with it, e.g.

  • a - Apple, Ape, Arc, Abraid
  • ab - Abraid
  • ar - Arc
  • ap - Apple, Ape
  • b - Bridge, Braide, Bray, Boolean
  • br - Bridge, Braide, Bray
  • bo - Boolean

Any ideas?

+10  A: 

You can use a Trie structure.

       (root)
         / 
        a - b - r - a - i - d
       / \   \
      p   r   e
     / \   \
    p   e   c
   /
  l
 /
e

Just find the node that you want and get all its descendants, e.g., if I want ap-:

       (root)
         / 
        a - b - r - a - i - d
       / \   \
     [p]  r   e
     / \   \
    p   e   c
   /
  l
 /
e
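
Here's a rough Python sketch of that idea (the dict-based node layout and the names are mine, not from any particular library):

    class TrieNode:
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.word = None     # full word stored where a word ends

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, word):
            node = self.root
            for ch in word.lower():              # case-insensitive keys
                node = node.children.setdefault(ch, TrieNode())
            node.word = word

        def words_with_prefix(self, prefix):
            # walk down to the node for the prefix, if it exists
            node = self.root
            for ch in prefix.lower():
                node = node.children.get(ch)
                if node is None:
                    return []
            # depth-first walk collecting every word at or below that node
            results, stack = [], [node]
            while stack:
                n = stack.pop()
                if n.word is not None:
                    results.append(n.word)
                stack.extend(n.children.values())
            return results

    words = ['Apple', 'Ape', 'Arc', 'Abraid', 'Bridge', 'Braide', 'Bray', 'Boolean']
    trie = Trie()
    for w in words:
        trie.add(w)
    print(trie.words_with_prefix('ap'))   # ['Ape', 'Apple'], in no particular order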
quantumSoup
+1  A: 

Using this PHP trie implementation will get you about 50% there. It's got some stuff you don't need and it doesn't have a "search by prefix" method, but you can write one yourself easily enough.

$trie = new Trie();

$trie->add('Apple',   'Apple');
$trie->add('Ape',     'Ape');
$trie->add('Arc',     'Arc');
$trie->add('Abraid',  'Abraid');
$trie->add('Bridge',  'Bridge');
$trie->add('Braide',  'Braide');
$trie->add('Bray',    'Bray');
$trie->add('Boolean', 'Boolean');

It builds up a structure like this:

Trie Object
(
  [A] => Trie Object
  (
    [p] => Trie Object
    (
      [ple] => Trie Object
      [e] => Trie Object
    )

    [rc] => Trie Object
    [braid] => Trie Object
  )

  [B] => Trie Object
  (
    [r] => Trie Object
    (
      [idge] => Trie Object
      [a] => Trie Object
      (
        [ide] => Trie Object
        [y] => Trie Object
      )
    )

    [oolean] => Trie Object
  )
)
John Kugelman
A: 

If the words were in a database (Access, SQL Server, etc.) and you wanted to retrieve all words starting with 'br', you could use:

Table Name: mytable
Field Name: mywords

"Select * from mytable where mywords like 'br*'"  - For Access - or

"Select * from mytable where mywords like 'br%'"  - For SQL
Claudio
+2  A: 

I don't know what you have in mind when you say "route of inefficiency", but a pretty obvious solution (possibly the one you are thinking of) comes to mind. A trie looks like the structure for this kind of problem, but it's costly in terms of memory (there is a lot of duplication) and I'm not sure it makes things faster in your case. Maybe the memory usage would pay off if the information were to be retrieved many times, but your answer suggests you want to generate the output file once and store it. So in your case the trie would be generated just to be traversed once; I don't think that makes sense.

My suggestion is to just sort the list of words lexically and then traverse it once for each prefix length, up to the maximum length of the beginnings. In pseudocode:

create a dictionary with keys being strings and values being lists of strings

for(i = 1 to maxBeginningLength)
{
    for(every word in the sorted list)
    {
        if(the word's length is no less than i)
        {
            add the word to the list in the dictionary at the key
            being the first i characters of the word
        }
    }
}

store the contents of the dictionary to the file
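
A minimal Python rendering of that pseudocode (the naming and the use of defaultdict are my own choices, not part of the answer above):

    from collections import defaultdict

    def prefixes_by_depth(words, max_depth):
        '''map each prefix of length 1..max_depth to the words sharing it'''
        words = sorted(words)                # lexical order, as suggested above
        table = defaultdict(list)
        for i in range(1, max_depth + 1):
            for word in words:
                if len(word) >= i:
                    table[word[:i].lower()].append(word)
        return table

    words = ['Apple', 'Ape', 'Arc', 'Abraid', 'Bridge', 'Braide', 'Bray', 'Boolean']
    for prefix, matches in sorted(prefixes_by_depth(words, 2).items()):
        print(prefix, '-', ', '.join(matches))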
Maciej Hehl
+2  A: 

Perhaps you're looking for something like:

    #!/usr/bin/env python
    def match_prefix(pfx, seq):
        '''return the subset of seq whose items start with pfx'''
        results = list()
        for i in seq:
            if i.startswith(pfx):
                results.append(i)
        return results

    def extract_prefixes(lngth, seq):
        '''return all distinct prefixes in seq that are lngth + 1
           characters long (lngth is zero-based, per range() below)'''
        results = dict()
        lngth += 1
        for i in seq:
            if i[0:lngth] not in results:
                results[i[0:lngth]] = True
        return sorted(results.keys())

    def gen_prefix_indexed_list(depth, seq):
        '''return a dictionary of all words matching each prefix
           up to depth, keyed on these prefixes'''
        results = dict()
        for each in range(depth):
            for prefix in extract_prefixes(each, seq):
                results[prefix] = match_prefix(prefix, seq)
        return results


    if __name__ == '__main__':
        words = '''Apple Ape Arc Abraid Bridge Braide Bray Boolean'''.split()
        test = gen_prefix_indexed_list(2, words)
        for each in sorted(test.keys()):
            print("%s:\t\t%s" % (each, ' '.join(test[each])))

That is, you want to generate all the prefixes between length one and some number you specify (2 in this example) that are present in a list of words, and then produce an index of all words matching each of these prefixes.

I'm sure there are more elegant ways to do this, but for a quick and easily explained approach I've just built it from a simple bottom-up functional decomposition of the apparent spec. If the end result's values are lists of words each matching a given prefix, then we start with a function to filter such matches out of our input. If the end result's keys are all the prefixes between length 1 and some N that appear in our input, then we need a function to extract those. Our spec is then an extremely straightforward nested loop around the two.

Of course this nested loop might be a problem; such things usually equate to O(n^2) efficiency. As shown, this iterates over the original list roughly C * N * N times (C being the constant number of prefix lengths, 1, 2, etc., and N being the length of the list).

If this decomposition provides the desired semantics then we can look at improving the efficiency. The obvious approach would be to lazily generate the dictionary keys as we iterate once over the list ... for each word, for each prefix length, generate the key ... append the word to the list/value stored at that key ... and continue to the next word.
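
A sketch of that single-pass variant (my own code, using defaultdict for the append-to-key step):

    from collections import defaultdict

    def gen_prefix_indexed_list(depth, seq):
        '''one pass over the words; the inner loop is only over prefix lengths'''
        results = defaultdict(list)
        for word in seq:          # seq can be any iterable: list, file, cursor...
            for length in range(1, min(depth, len(word)) + 1):
                results[word[:length]].append(word)
        return results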

There's still a nested loop ... but it's the short loop over the key/prefix lengths. That alternative design has the advantage of letting us read words from any iterable, not just an in-memory list. So we could iterate over the lines of a file, the results of a database query, and so on, without incurring the memory overhead of keeping the entire original word list in memory.

Of course we're still storing the dictionary in memory, but we can change that as well and decouple the logic from the input and the storage. When we append each word to the various prefix/key values, we don't care whether those are lists in a dictionary, lines in a set of files, or values being pulled out of (and pushed back into) a DBM or some other key/value store (for example, CouchDB or another "NoSQL" clustered database).

The implementation of that is left as an exercise to the reader.

Jim Dennis