views:

61

answers:

1

I am trying to come up with a GNU extended regular expression that detects repeated substrings in a string of ascii-encoded bits. I have an expression that works -- sort of. The problem is that it executes really slowly when given a string that could have many solutions

The expression

([01]+)(\1)+

compiles quickly, but takes about a minute to execute against the string

1010101010101010101010101010101010101010101010101010101010

I am using the regex implementation from glibc 2.5-49 ( comes with CentOS 5.5.)

FWIW, the pcre library executes quickly, as in gregexp or perl directly. So the obvious, but wrong, answer is "use libpcre". I cannot easily introduce a new dependency in my project. I need to work within the std C library that comes with CentOS/RHEL.

+2  A: 

If the input string can be of any considerable length, or if performance is at all a concern, then one of the better ways to solve this problem is not with regex, but with a more sophisticated string data structure that facilitates these kinds of queries much more efficiently.

Such a data structure is a suffix tree. Given a string S, its suffix tree is essentially the Patricia trie of all of its suffixes. Despite its seeming complexity, it can be built in linear time.

Suffix tree for BANANA Suffix tree for "BANANA"(courtesy of Wikipedia)

You can do many kinds of queries really efficiently with a suffix tree, e.g. finding all occurences of a substring, the longest substring that occurs at least twice, etc. The kind of strings that you're after is called tandem repeats. To facilitate this query you'd have to preprocess the suffix tree in linear time so you can do lowest common ancestor queries in constant time.

This problem is very common in computational biology, where the DNA can be viewed as a VERY long string consisting of letters in ACGT. Thus, performance and efficiency is of utmost importance, and these very sophisticated algorithms and techniques were devised.

You should look into either implementing these techniques from scratch for your binary sequence, or perhaps it's easier to map your binary sequence to a "fake" DNA string and then using one of the many tools available for gene research.

See also

polygenelubricants