tags:

views:

174

answers:

2

As suggested in answers to a previous question, I tried using Erlang proplists to implement a prefix trie.

The code seems to work decently well... But, for some reason, it doesn't play well with the interactive shell. When I try to run it, the shell hangs:

> Trie = trie:from_dict(). % Creates a trie from a dictionary
% ... the trie is printed ...
% Then nothing happens

I see the new trie printed to the screen (ie, the call to trie:from_dict() has returned), then the shell just hangs. No new > prompt comes up and ^g doesn't do anything (but ^c will eventually kill it off).

With a very small dictionary (the first 50 lines of /usr/share/dict/words), the hang only lasts a second or two (and the trie is built almost instantly)... But it seems to grow exponentially with the size of the dictionary (100 words takes 5 or 10 seconds, I haven't had the patience to try larger wordlists). Also, as the shell is hanging, I notice that the beam.smp process starts eating up a lot of memory (somewhere between 1 and 2 gigs).

So, is there anything obvious that could be causing this shell hang and incredible memory usage?


Some various comments:

  • I have a hunch that the garbage collector is at fault, but I don't know how to profile or create an experiment to test that.

  • I've tried profiling with eprof and nothing obvious showed up.

  • Here is my "add string to trie" function:

add([], Trie) ->
    [ stop | Trie ];

add([Ch|Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    NewTrie = [ { Ch, NewSubTrie } | Trie ],
    % Arbitrarily decide to compress key/value list once it gets
    % more than 60 pairs.
    if length(NewTrie) > 60 ->
            proplists:compact(NewTrie);
        true ->
            NewTrie
    end.
+3  A: 

The problem is (amongst others ? -- see my comment) that you are always adding a new {Ch, NewSubTrie} tuple to your proplist Tries, no matter if Ch already existed, or not.

Instead of

NewTrie = [ { Ch, NewSubTrie } | Trie ]

you need something like:

NewTrie = lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie})
Zed
D'oh! That's what I get for not RTFMing carefully enough. All seems to be working well now - thanks.
David Wolever
+2  A: 

You're not really building a trie here. Your end result is effectively a randomly ordered proplist of proplists that requires full scans at each level when walking the list. Tries are typically implied ordering based on position in the array (or list).

Here's an implementation that uses tuples as the storage mechanism. Calling set only rebuilds the root and direct path tuples.
(note: would probably have to make the pair a triple (adding size) make delete work with any efficiency)

I believe erlang tuples are really just arrays (thought I read that somewhere), so lookup should be super fast, and modify is probably straight forward. Maybe this is faster with the array module, but I haven't really played with it much to know.

this version also stores an arbitrary value, so you can do things like:

1> c(trie).
{ok,trie}
2>  trie:get("ab",trie:set("aa",bar,trie:new("ab",foo))). 
foo
3> trie:get("abc",trie:set("aa",bar,trie:new("ab",foo))).
undefined
4>

code (entire module): note2: assumes lower case non empty string keys

-module(trie).
-compile(export_all).
-define(NEW,{ %% 26 pairs, to avoid cost of calculating a new level at runtime
 {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
 {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
 {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
 {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
 {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
 {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
 {undefined,nodepth},{undefined,nodepth}
 }
).
-define(POS(Ch), Ch - $a + 1).

new(Key,V) -> set(Key,V,?NEW).

set([H],V,Trie) -> 
 Pos = ?POS(H),
 {_,SubTrie} = element(Pos,Trie),
 setelement(Pos,Trie,{V,SubTrie});

set([H|T],V,Trie) ->
 Pos = ?POS(H),
 {SubKey,SubTrie} = element(Pos,Trie),
 case SubTrie of
  nodepth -> setelement(Pos,Trie,{SubKey,set(T,V,?NEW)});
  SubTrie -> setelement(Pos,Trie,{SubKey,set(T,V,SubTrie)})
 end.

get([H],Trie) -> 
 {Val,_} = element(?POS(H),Trie),
 Val;
get([H|T],Trie) ->
 case element(?POS(H),Trie) of
  {_,nodepth} -> undefined;
  {_,SubTrie} -> get(T,SubTrie)
 end.
You could replace that ugly macro with this: `new() -> erlang:make_tuple(26, {undefined, nodepth}).`. Also POS could be a function... I guess I just hate macros :o)
Zed
hah! so that's what it's called. I couldn't remember how to create the tuple like that and was being too lazy to look it upand yep, everything could be made a function, I was just attempting to optimize a tiny bit given what his goal is (build out a map of a lot of words). Probably a stupid early optimization on my part though