ansaurus

Question

How to spot and analyse similar patterns like Excel does?

Answer 1

A:

Finding [dynamic] isn't that big a deal. You can do it with two strings: start at the beginning and stop where they first differ, do the same from the end, and voilà, you've got your [dynamic].

something like (pseudocode - kinda):

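String s1 = "asdf-1-jkl";
String s2 = "asdf-2-jkl";
// Scan forward to the first position where the two strings differ.
int s1I = 0, s2I = 0;
for (; s1I < s1.length() && s2I < s2.length(); s1I++, s2I++)
  if (s1.charAt(s1I) != s2.charAt(s2I))
    break;
// Scan backward from the end, without crossing the common prefix.
// The end indexes are exclusive, so compare the character just before them.
int s1E = s1.length(), s2E = s2.length();
for (; s1E > s1I && s2E > s2I; s1E--, s2E--)
  if (s1.charAt(s1E - 1) != s2.charAt(s2E - 1))
    break;
// Whatever lies between the common prefix and the common suffix is the dynamic part.
String dyn1 = s1.substring(s1I, s1E); // "1"
String dyn2 = s2.substring(s2I, s2E); // "2"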

About your 10k data sets: you would need to call this (or a slightly more optimized version) on each pair of strings to figure out your patterns (10k x 10k calls), and then sort the results by pattern (i.e. save the beginning and the ending and sort by those fields).
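A minimal sketch of that pairwise pass, assuming a pattern is identified only by the prefix and suffix a pair shares. The class name PatternGrouper, the "prefix|suffix" key format and the choice of a TreeMap are illustrative assumptions, not something taken from the answer:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class PatternGrouper {
    // Length of the common prefix of a and b.
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Length of the common suffix of a and b that does not overlap
    // the first `skip` characters (the common prefix).
    static int commonSuffix(String a, String b, int skip) {
        int n = Math.min(a.length(), b.length()) - skip;
        int i = 0;
        while (i < n && a.charAt(a.length() - 1 - i) == b.charAt(b.length() - 1 - i)) i++;
        return i;
    }

    // Compares every pair once and files both strings under a "prefix|suffix" key.
    // The TreeMap keeps the keys sorted. Assumes the strings contain no '|'.
    static Map<String, Set<String>> group(List<String> strings) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (int i = 0; i < strings.size(); i++) {
            for (int j = i + 1; j < strings.size(); j++) {
                String a = strings.get(i), b = strings.get(j);
                int p = commonPrefix(a, b);
                int s = commonSuffix(a, b, p);
                if (p == 0 && s == 0) continue; // nothing shared, so no pattern
                String key = a.substring(0, p) + "|" + a.substring(a.length() - s);
                Set<String> members = groups.computeIfAbsent(key, k -> new TreeSet<>());
                members.add(a);
                members.add(b);
            }
        }
        return groups;
    }
}
```

One thing to watch for: two strings whose dynamic parts happen to start with the same character (say 1 and 12) share a longer prefix and end up under a more specific key, so the keys may need merging afterwards.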

Niko 2009-09-07 12:23:03

For 10,000 different patterns? How can you say which one looks like which? Also, you don't know where the dynamic part is: maybe at the beginning, maybe at the end, maybe in the middle, or maybe it doesn't exist at all.

dr. evil 2009-09-07 12:24:57

Excel isn't doing it for 10k different patterns either. It takes a very small sample (what you selected) and figures out the dynamic part from that (or not :P). Once you have your dynamic part, you can start comparing it against known patterns (e.g. both are integers and increasing; both are integers and decreasing).
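A rough sketch of that last step, assuming the dynamic parts have already been extracted in order. The class name SeriesClassifier and the label strings are made up for illustration, and only the integer case is handled:

```java
import java.util.List;

final class SeriesClassifier {
    // Labels a sequence of dynamic parts. The label names are illustrative only.
    static String classify(List<String> parts) {
        long[] values = new long[parts.size()];
        for (int i = 0; i < values.length; i++) {
            try {
                values[i] = Long.parseLong(parts.get(i));
            } catch (NumberFormatException e) {
                return "NOT_INTEGER"; // at least one part is not a whole number
            }
        }
        if (values.length < 2) return "UNKNOWN"; // a trend needs two samples
        boolean increasing = true, decreasing = true;
        for (int i = 1; i < values.length; i++) {
            if (values[i] <= values[i - 1]) increasing = false;
            if (values[i] >= values[i - 1]) decreasing = false;
        }
        if (increasing) return "INTEGER_INCREASING";
        if (decreasing) return "INTEGER_DECREASING";
        return "INTEGER_UNORDERED";
    }
}
```

Other known patterns (dates, weekday names, fixed step sizes) would each need their own check along the same lines.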

Niko 2009-09-07 12:29:30

I know that Excel uses a limited number of samples, but as I stated in the question, unfortunately that doesn't work for me. I need to do this for, let's say, 1000 strings, but potentially more. Thanks for the pseudocode, it can be quite handy in my tests.

dr. evil 2009-09-07 12:35:41

If you have 1000 strings and save the results in a sortable structure (e.g. an auto-sorting tree, list, or hashmap), you should find all your possible patterns really quickly. It's only 1 million calls, which can easily be kept in memory and processed quite quickly nowadays.

Niko 2009-09-07 12:44:17

Answer 2

A:

I think what you need is to compute something like the Levenshtein distance to find groups of similar strings, and then, within each group, identify the dynamic part with a typical diff-like algorithm.

Florian 2009-09-07 12:39:12

This sounds good, but AFAIK Levenshtein distance treats a difference in string length as a big difference. In my case xxx-1323457980-yyy should be quite close to xxx-234-yyy, but I'll look into it.
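To put a number on that concern, here is a sketch of the standard dynamic-programming Levenshtein distance (insert, delete and substitute each cost 1). The class name Levenshtein is just a label for this example:

```java
final class Levenshtein {
    // Classic two-row dynamic programme: prev holds the distances for the
    // previous prefix of a, curr is filled in for the current one.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // build b from ""
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the rows
        }
        return prev[b.length()];
    }
}
```

For xxx-1323457980-yyy and xxx-234-yyy the distance comes out as 7, all of it from the seven extra digits, so the length difference does dominate. Comparing only the parts outside the dynamic section, or dividing by the longer length, are possible workarounds, though neither is part of the answer above.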

dr. evil 2009-09-07 13:43:22

Answer 3

A:

Google Docs might be better than Excel for this sort of thing, believe it or not.

Google has collected massive amounts of data on sets. For example, in the example you gave it would recognise blue, red, yellow ... as part of the set 'colours'. It has far more complete pattern recognition than Excel, so it would stand a better chance of continuing the pattern.

2009-09-07 13:36:48

That's quite interesting actually. Google Sets (http://labs.google.com/sets) can be used online to enhance this functionality, a bit slow though :)

dr. evil 2009-09-07 13:56:15

Answer 4

+2 A:

As soon as you start looking for the dynamic parts of patterns of the form <const1><dynamic1><const2><dynamic2>... without any other assumptions, you need to find the longest common subsequence (LCS) of the sample strings you have provided. For example, if I have test-123-abc and test-48953-defg, then the LCS would be test- and - (plus any characters the dynamic parts happen to share, here the 3). The dynamic parts would then be the gaps between the pieces of the LCS. You could then look up your dynamic part in an appropriate data structure.
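A sketch of the two-string case, using the textbook dynamic-programming LCS with O(n*m) time and memory. The class and method names are placeholders for this example:

```java
final class Lcs {
    // Returns one longest common subsequence of a and b.
    static String lcs(String a, String b) {
        int n = a.length(), m = b.length();
        // len[i][j] = LCS length of the suffixes a[i..] and b[j..]
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--)
            for (int j = m - 1; j >= 0; j--)
                len[i][j] = a.charAt(i) == b.charAt(j)
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
        // Walk the table from the start, emitting matched characters.
        StringBuilder out = new StringBuilder();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++; // skipping a[i] loses nothing
            } else {
                j++; // skipping b[j] loses nothing
            }
        }
        return out.toString();
    }
}
```

Lcs.lcs("test-123-abc", "test-48953-defg") gives test-3-, so in practice very short matched runs inside a gap probably need to be filtered out before the gaps are treated as dynamic parts.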

Finding the LCS of more than 2 strings is very expensive (the problem is NP-hard for an arbitrary number of strings), and this would be the bottleneck of your problem. At the cost of accuracy you can make the problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.

Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.

Il-Bhima 2009-09-07 14:27:58
