views:

93

answers:

2

Let's say I have several URLs and I return the basename from each URL, like so;

http://www.test.com/the.code.r00

would return

the.code.r00

and I have several basenames I extracted from several URLs to work on

the.code.r00
the.code.r01
..
...
the.code.r12

and together with those I have the following basenames too from other URLs

the.matrix.r00
the.matrix.r01
..
...
the.matrix.r14

I'd like to know if there's a known algorithm which has been tested and proven to return

the.code.r
the.matrix.r

after parsing the basenames I listed above.

Also, if instead, there is some *nix tool which does the same thing that would be super.

Note, the format isn't always like above, otherwise I could have done a simple substr. The numbers aren't always listed at some specific location in the string. Some other examples;

the.code.part01.rar
the.code.001
..
....

I could implement my own algorithm to do this but it would probably be a can of bugs without some heavy testing so I'd prefer to use a known algorithm if there is one already defined..

+3  A: 

You are probably looking for an anchored implementation of the longest common substring problem. Sort your list of strings, and perform the anchored LCS on adjacent elements. Insert the LCS into a multi-value hashmap with the LCS as the key and the two strings as the values.

Once you have that, do the same with the LCS strings, until you meet some threshold.

strager
Good answer, thanks for the link!
Cerebrus
hmm I think this could help. thanks
ninuhadida
+1  A: 

Look at each pair of strings in your list and compute the Levenshtein Distance (aka string edit distance) between them. This will give you the minimum number of changes necessary to change the one string into the other.

Now obtain from the Levenshtein implementation the actual set of changes between the strings (by following backpointers in the dynamic program). If that set of changes only consists of substituting numbers for other numbers then you've found a pattern. Copy one of the strings, remove those numbers, store it in your pattern set, and continue with the other string pairs.

this is also helpful. thanks
ninuhadida