views:

883

answers:

6

Given a set of strings, for example:

EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green

I want to be able to detect that these are three sets of files:

  • EntireS[1,2]
  • J27[Red,Green]P[1,2]
  • JournalP[1,2][Red,Green,Blue]

Are there any known ways of approaching this problem - any published papers I can read on this?

The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.

Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.

[Edited 15/09/09 to add more sample strings]

+3  A: 

I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem

There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.

Edited to add:

Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.

I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.

To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.

Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).

In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.

jbourque
This is helpful, but I think I'm closer to the Longest common subsequence problem as I may have several substrings (e.g. the Journal example)
danio
Yes, possibly. When I first read your question I thought the 'sets' would be restricted to a single sub-string root, with various appendages. But I see now that you're looking for something more involved.
jbourque
danio
+1  A: 

There are many many approaches to string similarity. I would suggest taking a look at this open-source library that implements alot of metrics like the levenshtein distance.

http://www.dcs.shef.ac.uk/~sam/simmetrics.html

Bas van den Broek
This is helpful - looks like Levenshtein Distance, Needleman-Wunch distance or similar might be useful as a distance function to use in the technique suggested by jbourque.
danio
A: 

Something like that might work.

  1. Build a trie that represents all your strings.

In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".

  1. A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]

  2. If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.

  3. If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.

I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.

redtuna
I like this approach but I think it will struggle with e.g. J-27-(RedP,GreenP)-(1,2). I want to be able to identify Red and Green in this case so seems there will be more processing required to find the common substring.
danio
I'm not sure why you think it would struggle: in this case the subtree under both J27RedP and J27GreenP is identical: (1,2). So step 3 will merge them. Displaying this merged tree will directly return J27-(RedP,GreenP)-(1,2) since -(a,b) is how you write (fork to a and b).
redtuna
A: 

For this particular example of strings to keep it extremely simple consider using simple word/digit -separation.

A non-digit sequence apparently can begin with capital letter (Entire). After breaking all strings into groups of sequences, something like

[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]

Then start grouping by groups, you can fairly soon see that prefix "Entire" is a common for some group and that all subgroups have S as headgroup, so only variable for those is 1,2. For J27 case you can see that J27 is only leaf, but that it then branches at Red and Green.

So somekind of List<Pair<list, string>> -structure (composite pattern if I recall correctly).

Pasi Savolainen
Nice idea but the actual strings will not have a simpel rule to find like this - I just used caps in the sample to make it easier to read.
danio
A: 

I recently wrote a script for this purpose. Still improving it but it may suit your purpose (especially if you know some Perl so you can use the -e option):

http://www.win.tue.nl/~rp/bin/substrings

reinierpost
If I read your code correctly it will only find entire strings that are substrings of another. I want to find common substrings in multiple strings.
danio
That's correct. The reason is that if you do that, it's difficult to know where to stop. Many filenames contain, say, 'e'.
reinierpost
A: 

I also have a similar requirement in that I am looking for a way to scan log files for errors that have the same body text but possibly differing parameters.

Another way of looking at it is to think of "recreating" the (notional) printf() format string and its actual arguments, based on two or more input strings

We want the most complete & most generic format string and the fewest arguments.

We also need to decide whether two strings actually derive from the same format string.

In my case, we can also assume that there is not an argument at the start of the strings.

MikeW