views:

198

answers:

4

I have a sequence of SQL calls, which I want to use to detect loops (and hence unnecessary duplicate sql calls), but it got me thinking of this more general problem.

Given a list, say [a,b,c,b,c,a,b,c,b,c,a,b,b]

Is there some way I can turn that into a,[[b,c]*2,a]*2,b*2

or, [a,[b,c]*2]*2,a,b*2

That is, detect repetitions (possibly nested ones).

A: 

If you can sort it first, then it's easy to go through one more time to find duplicate runs. Of course, sorting something as free-form as SQL queries sounds a bit scary.

unwind
A: 

I’m no expert in that field but you might want to check out some compression algorithms, it seems to me that this is quite exactly what they do.

Bombe
+4  A: 

Look into the Lempel-Ziv-Welsh compression algorithm. It is built on detecting repetitions in strings and utilizing them for compression. I believe you can use a Trie for it.

Yuval F
A: 

If the string is sufficiently large, an interesting approach is to run a compression tool (like gzip, bzip, or 7zip) on it. These tools work by locating repetitions (at various levels), and substituting them by pointers to the first instance of the text (or a dictionary). The compression you achieve is a measure of the repetition. Dumping the file (you'll have to write code to do that) will give you the repeated content.

Diomidis Spinellis
Doubt this will work, as the compression programs will happily use substrings, and will ignore the SQL command boundaries.
derobert