views:

212

answers:

4

You know the functionality in Excel when you type 3 rows with a certain pattern and drag the column all the way down Excel tries to continue the pattern for you.

For example

Type...

  • test-1
  • test-2
  • test-3

Excel will continue it with:

  • test-4
  • test-5
  • test-n...

Same works for some other patterns such as dates and so on.

I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:

  • test-blue-somethingelse
  • test-yellow-somethingelse
  • test-red-somethingelse

Now based on this entries I want say that the pattern is:

  • test-[DYNAMIC]-something

Continue the [DYNAMIC] with other colours is whole another deal, I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.

I need to detect this from a large of pool entries. Assume that you got 10.000 strings with this kind of patterns, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).

Document classification can be useful in this scenario but I'm not sure where to start.

UPDATE:

I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.

Such as:

  • test_[DYNAMIC]12[DYNAMIC2]

I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.

A: 

finding [dynamic] isnt that big of deal, you can do that with 2 strings - just start at the beginning and stop when they start not-being-equals, do the same from the end, and voila - you got your [dynamic]

something like (pseudocode - kinda):

String s1 = 'asdf-1-jkl';
String s2= 'asdf-2-jkl';
int s1I = 0, s2I = 0;
String dyn1, dyn2;
for (;s1I<s1.length()&&s2I<s2.length();s1I++,s2I++)
  if (s1.charAt(s1I) != s2.charAt(s2I))
    break;
int s1E = s1.length(), s2E = s2.length;
for (;s2E>0&&s1E>0;s1E--,s2E--)
  if (s1.charAt(s1E) != s2.charAt(s2E))
    break;
dyn1 = s1.substring(s1I, s1E);
dyn2 = s2.substring(s2I, s2E);

About your 10k data-sets. You would need to call this (or maybe a little more optimized version) with each combination to figure out your patten (10k x 10k calls). and then sort the result by pattern (ie. save the begin and the ending and sort by these fields)

Niko
for 10.000 different patterns? How can you say which looks like which one? also you don't where the dynamic is, maybe beginning, maybe end, maybe in the middle maybe doesn't exist at all.
dr. evil
excel isnt doing it for 10k different patterns either. it takes a very small sample (=what you selected) and figures out the dynmaic part from that (or not :P). once you have your dynamic part you can start comparing it against known patterns (ie. both are integer and increasing; both are integer and decreasing).
Niko
I know that excel uses a limited number of sample but as I stated in the question unfortunately that doesn't work for me. I need to do this lets say 1000 strings but potentially more. Thanks for the psuedocode can be quite handy in my tests.
dr. evil
if you have 1000 strings and save the results in a sortable structure (ie. auto-sorting tree, list, hashmap) you should find all your possible patterns real quick its only 1mill calls - that can easily be kept in memory and processed quite quickly nowadays
Niko
A: 

I think what you need is to compute something like the Levenshtein distance, to find the group of similar strings, and then in each group of similar strings, you indentify the dynamic part in a typical diff-like algorithm.

Florian
This sounds good but AFAIK Levenshtein distance considers the length of the string as a big difference in my case xxx-1323457980-yyy should be quite close to xxx-234-yyy but I'll look into it.
dr. evil
A: 

Google docs might be better than excel for this sort of thing, believe it or not.

Google has collected massive amounts of data on sets - for example the in the example you gave it would recognise the blue, red, yellow ... as part of the set 'colours'. It has far more complete pattern recognition than Excel so would stand a better chance of continuing the pattern.

That's quite interesting actually Google Sets - http://labs.google.com/sets can be used online to enhance this functionality, a bit slow though :)
dr. evil
+2  A: 

As soon as you start considering finding dynamic parts of patterns of the form : <const1><dynamic1><const2><dynamic2>.... without any other assumptions then you would need to find the longest common subsequence of the sample strings you have provided. For example if I have test-123-abc and test-48953-defg then the LCS would be test- and -. The dynamic parts would then be the gaps between the result of the LCS. You could then look up your dynamic part in an appropriate data structure.

The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.

Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.

Il-Bhima