views: 1879
answers: 9

I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I'm looking for.

I have a really, really long string (hundreds of megabytes), and about 1 GB of RAM.

This is why building a suffix trie with counting data is too inefficient, space-wise, to work for me. To quote Wikipedia's Suffix tree article:

storing a string's suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

And that was Wikipedia's comment on the suffix tree, not the trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some Wikipedia links to avoid people posting them as the 'answer': Algorithms on strings and especially Longest repeated substring problem ;-) )

+1  A: 

Is this text with word breaks? Then I'd suspect you want a variation of keyword-in-context: make n copies of each line, one for each of its n words, breaking each copy at a different word; sort the whole thing alphabetically; look for repeats.
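
A rough sketch of that keyword-in-context idea (the method names and the naive space-splitting are just illustrative choices): emit every word-level suffix of every line, sort them, and repeated phrases show up as equal neighbours.

// Generate every word-level suffix of every line.
IEnumerable<string> WordSuffixes(IEnumerable<string> lines)
{
    foreach (string line in lines)
    {
        string[] words = line.Split(' ');
        for (int i = 0; i < words.Length; i++)
            yield return string.Join(" ", words, i, words.Length - i);
    }
}

// Sort the suffixes and report any that occur more than once.
void FindRepeatedPhrases(IEnumerable<string> lines)
{
    List<string> suffixes = WordSuffixes(lines).ToList();
    suffixes.Sort(StringComparer.Ordinal);
    for (int i = 1; i < suffixes.Count; i++)
        if (suffixes[i] == suffixes[i - 1])
            Console.WriteLine("Repeat: {0}", suffixes[i]);
}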

If it's a single long honking string, like, say, bioinformatic DNA sequences, then you want to build something like your trie on disk: build a record for each character, with a disk offset for the next nodes. I'd have a look at Volume 3 of Knuth, section 5.4, "external sorting".

Charlie Martin
A: 

The easiest way might just be to plunk down the $100 for a bunch more RAM. Otherwise, you'll likely have to look at disk-backed structures for holding your suffix tree.

Eclipse
+2  A: 

You could look at disk-based suffix trees. I found this Suffix tree implementation library through Google, plus a bunch of articles that could help you implement one yourself.

orip
That Ukkonen suffix-tree algo (http://en.wikipedia.org/wiki/Suffix_tree) *is* quite nifty.
Charlie Martin
A: 

Can you solve your problem by building a suffix array instead? Otherwise you'll likely need to use one of the disk-based suffix trees mentioned in the other answers.

Steve Steiner
+1  A: 

You could solve this using divide and conquer. I think this should have the same algorithmic complexity as using a trie, but it may be less efficient implementation-wise:

// Requires System, System.Collections.Generic and System.Linq.
void LongSubstrings(string data, string prefix, IEnumerable<int> positions)
{
    // Bucket the positions by the next character, skipping positions that
    // have already run off the end of the string.
    Dictionary<char, DiskBackedBuffer> buffers = new Dictionary<char, DiskBackedBuffer>();
    foreach (int position in positions)
    {
        if (position >= data.Length)
            continue;
        char nextChar = data[position];
        DiskBackedBuffer buffer;
        if (!buffers.TryGetValue(nextChar, out buffer))
        {
            buffer = new DiskBackedBuffer();
            buffers[nextChar] = buffer;
        }
        buffer.Add(position + 1);
    }

    // Recurse on characters shared by several positions; a bucket of one
    // means the sequence has just become unique.
    foreach (char c in buffers.Keys)
    {
        if (buffers[c].Count > 1)
            LongSubstrings(data, prefix + c, buffers[c]);
        else if (buffers[c].Count == 1)
            Console.WriteLine("Unique sequence: {0}", prefix + c);
    }
}

void LongSubstrings(string data)
{
    LongSubstrings(data, "", Enumerable.Range(0, data.Length));
}

After this, you would need to implement DiskBackedBuffer as a list of numbers that, once it grows past a certain size, writes itself out to disk in a temporary file, and reads back from that file when enumerated.
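
For illustration, a minimal DiskBackedBuffer along those lines might look something like this (the spill threshold, temp-file handling and plain-text format are just placeholder choices):

// Minimal sketch: an int buffer that spills to a temp file once it grows,
// and streams everything back when enumerated. Requires System.Collections,
// System.Collections.Generic, System.IO and System.Linq.
class DiskBackedBuffer : IEnumerable<int>
{
    private const int SpillThreshold = 1 << 20;              // arbitrary cut-off
    private readonly List<int> inMemory = new List<int>();
    private readonly string tempFile = Path.GetTempFileName();
    private bool spilled;

    public int Count { get; private set; }

    public void Add(int value)
    {
        inMemory.Add(value);
        Count++;
        if (inMemory.Count >= SpillThreshold)
        {
            // Append the in-memory chunk to the temp file and start over.
            File.AppendAllLines(tempFile, inMemory.Select(i => i.ToString()));
            inMemory.Clear();
            spilled = true;
        }
    }

    public IEnumerator<int> GetEnumerator()
    {
        // Read back anything spilled to disk, then the in-memory tail.
        if (spilled)
            foreach (string line in File.ReadLines(tempFile))
                yield return int.Parse(line);
        foreach (int value in inMemory)
            yield return value;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}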

FryGuy
A: 

Answering my own question:

Given that a long match contains shorter matches, you can trade multiple passes for RAM by first finding shorter matches and then seeing whether you can 'grow' them.

The literal approach to this is to build a trie (with counts in each node) of all sequences of some fixed length in the data. You then cull all the nodes that don't match your criteria (e.g. the longest match). You then do a subsequent pass through the data, building the trie out deeper, but not broader. Repeat until you've found the longest repeated sequence(s).

A good friend suggested using hashing. By hashing the fixed-length character sequence starting at each position, the problem becomes one of finding duplicate hash values (and verifying the duplicates, as hashing is lossy). If you allocate an array the length of the data to hold the hash values, you can do interesting things; for example, to see whether a match is longer than your fixed-length pass over the data, you can just compare the sequences of hashes rather than regenerating them.
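
A rough sketch of that first hashing pass (the window length parameter and the dictionary-based grouping are just illustrative choices; a rolling hash would avoid the per-window substring allocations):

// Hash every fixed-length window, then keep only hashes seen more than once.
// Those positions are candidates that still need a character-by-character
// check, since hashing is lossy.
Dictionary<int, List<int>> FindCandidatePositions(string data, int windowLength)
{
    Dictionary<int, List<int>> byHash = new Dictionary<int, List<int>>();
    for (int i = 0; i + windowLength <= data.Length; i++)
    {
        int h = data.Substring(i, windowLength).GetHashCode();
        List<int> positions;
        if (!byHash.TryGetValue(h, out positions))
        {
            positions = new List<int>();
            byHash[h] = positions;
        }
        positions.Add(i);
    }

    return byHash.Where(kv => kv.Value.Count > 1)
                 .ToDictionary(kv => kv.Key, kv => kv.Value);
}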

Will
A: 

Just a belated thought that occurred to me...

Depending on your OS/environment (e.g. 64-bit pointers and mmap() available), you might be able to create a very large suffix tree on disk through mmap(), and then keep a cached, most-frequently-accessed subset of that tree in memory.

Mr.Ree
+2  A: 

An effective way to do this is to create an index of the substrings and sort them. This is an O(n lg n) operation.

BWT compression does this step, so it's a well-understood problem, and there are radix and suffix sort implementations (some claiming O(n)) to make it as efficient as possible. It still takes a long time, perhaps several seconds for large texts.

If you want to use utility code, C++'s std::stable_sort() performs much better than std::sort() for natural language (and is much faster than C's qsort(), but for different reasons).

Then visiting each item to see the length of its common substring with its neighbours is O(n).
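
As a rough illustration of the whole approach (a naive comparison sort over suffix start positions, so fine for seeing the idea but not something to run as-is on hundreds of megabytes):

// Sort suffix start positions, then the longest repeated substring is the
// longest common prefix of two neighbouring suffixes in sorted order.
string LongestRepeatedSubstring(string data)
{
    int[] suffixes = Enumerable.Range(0, data.Length).ToArray();
    Array.Sort(suffixes, (a, b) => CompareSuffixes(data, a, b));

    int bestLen = 0, bestPos = 0;
    for (int i = 1; i < suffixes.Length; i++)
    {
        int lcp = CommonPrefixLength(data, suffixes[i - 1], suffixes[i]);
        if (lcp > bestLen)
        {
            bestLen = lcp;
            bestPos = suffixes[i];
        }
    }
    return data.Substring(bestPos, bestLen);
}

int CompareSuffixes(string s, int a, int b)
{
    // Ordinal comparison of s[a..] against s[b..].
    while (a < s.Length && b < s.Length)
    {
        if (s[a] != s[b])
            return s[a].CompareTo(s[b]);
        a++;
        b++;
    }
    return (s.Length - a).CompareTo(s.Length - b);   // shorter suffix sorts first
}

int CommonPrefixLength(string s, int a, int b)
{
    int len = 0;
    while (a + len < s.Length && b + len < s.Length && s[a + len] == s[b + len])
        len++;
    return len;
}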

Will
A: 

FWIW, here's an implementation of a related problem that I wrote for SpamAssassin; it may be useful:

http://taint.org/2007/03/05/134447a.html

jmason