views:

335

answers:

3

Hi! I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.

A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).

I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html It looks extremely, extremely complex and I am not sure I am capable of writing it.

I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.

I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there documentation of an algorithm already that features these, or an alternative algorithm with good memory use & speed?

A: 

You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.

I'm suggesting this for a few reasons:

  1. I really don't have a full understanding of your result.
  2. Definitely incremental to build.
  3. Leaf nodes can contain any data you wish.
  4. Subjectively, a simple algorithm.
Marc
Tries are very simple, but they also take up a ton of space. A directed acyclic word graph is actually just a trie in which shared suffixes have been combined, but this makes them very complex. A radix tree will probably be my worst-case scenario.
D.
+1  A: 

Java

For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.

They have some good examples to get you going quickly and there's usually example code to get you started with most problems.

They have a DAG example with a link at the bottom to the full source code.

C++

If your using C++ a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.

Binary Nerd
That looks really cool, but I forgot to mention I'm using C++ >.<
D.
Ah :) I've added a suggestion for C++ which might help.
Binary Nerd
+1  A: 

Hey,

I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.

The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.

As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.

My advice is to use a standard Trie, with an array index stored in each node. Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to each END_OF_WORD node in a traditional DAWG representing many, many words.

Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.

I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.

All the very best,

JohnPaul Adamovsky - [email protected]

www.pathcom.com/~vadco/deep.html

JohnPaul Adamovsky
Thanks, JohnPaul. I am most likely going to use a radix tree to store the strings, although I would have liked to save a bit more on memory. I was hoping that a compromise between the incremental DAWG building algorithms and your string-tracking structure existed, but I guess not!Unfortunately, I cannot offer you work or a job, as this is just for a hobby project of mine. If you would like to create and document a flexible structure for fun, be my guest and good luck (I don't have the brains for it, at least)!
D.